
Page 1: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-1 Combining Estimators

Combining Estimators to Improve Performance

A survey of “model bundling” techniques -- from boosting and bagging, to Bayesian model averaging -- creating a breakthrough in the practice of Data Mining.

John F. Elder IV, Ph.D., Elder Research, Charlottesville, Virginia
www.datamininglab.com

Greg Ridgeway, Ph.D., University of Washington, Dept. of Statistics
www.stat.washington.edu/greg

Page 2: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-2 Combining Estimators

Outline

• Why combine? A motivating example
• Hidden dangers of model selection
• Reducing modeling uncertainty through Bayesian Model Averaging
• Stabilizing predictors through bagging
• Improving performance through boosting
• Emerging theory illuminates empirical success
• Bundling, in general
• Latest algorithms
• Closing Examples & Summary

Page 3: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-3 Combining Estimators

Reasons to combine estimators

• Decreases variability in the predictions.

• Accounts for uncertainty in the model class.

=> Improved accuracy on new data.

Page 4: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-4 Combining Estimators

A Motivating Example: Classifying a bat’s species from its chirp

• Goal: Use time-frequency features of echolocation signals to classify bats by species in the field (avoiding capture and physical inspection).

• U. Illinois biologists gathered data: 98 signals from 19 bats representing 6 species: Southeastern, Grey, Little Brown, Indiana, Pipistrelle, Big-Eared.

• ~35 data features (dimensions) calculated from signals, such as low frequency at the 3dB level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.

• Turned out to have a nice level of difficulty for comparing methods: overlap in classes, but some separability.

Page 5: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-5 Combining Estimators

Sample Projection

[Scatterplot: sample projection of the bat-chirp data onto two of the derived signal features (labels bw, t10, Var4).]

Page 6: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-6 Combining Estimators

[Bar chart: percent classified correctly on the bat-chirp problem (y-axis, roughly 40% to 88%) for individual methods grouped as Trees (CART, TX2step), Polynomial Networks (ASPN), Neural Networks (8-, 17-, and 35-node nets), and Nearest Neighbor variants, followed by bundled estimators (model averages and Advisor Perceptrons) at the high end. Annotations give approximate training times, from seconds to about a day.]

Page 7: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-7 Combining Estimators

What is model uncertainty?

• Suppose we wish to predict y from predictors x.

• Given a dataset of observations, D, for a new observation with predictors x* we want to derive the predictive distribution of y* given x* and D:

P(y* | x*, D)

Page 8: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-8 Combining Estimators

In practice…

• Although we want to use all the information in D to make the best estimate of y* for an individual with covariates x*, i.e. P(y* | x*, D)…

• In practice, however, we always use P(y* | x*, M), where M is a model constructed from D.

Page 9: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-9 Combining Estimators

Selecting M

• The process of selecting a model usually involves:
  – Model class selection
    • linear regression, tree regression, neural network
  – Variable selection
    • variable exclusion, transformation, smoothing
  – Parameter estimation

• We tend to choose the one model that fits the data or performs best as the model.

Page 10: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-10 Combining Estimators

What’s wrong with that?

• Two models may fit a dataset equally well (with respect to some loss) yet make different predictions.

• Competing interpretable models with equivalent performance offer ambiguous conclusions.

• Model search dilutes the evidence. “Part of the evidence is spent specifying the model.”

Page 11: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-11 Combining Estimators

Bayesian Model Averaging

Goal: Account for model uncertainty

Method: Use Bayes’ Theorem and average the models by their posterior probabilities

Properties:

• Improves predictive performance

• Theoretically elegant

• Computationally costly

Page 12: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-12 Combining Estimators

Averaging the models

Consider a set containing the K candidate models — M1, …, MK.

With a few probability manipulations we can make predictions using all of them.

The probability mass for a particular prediction value of y is a weighted average of the probability mass that each model places on that value of y. The weight is based on the posterior probability of that model given the data:

P(y* | x*, D) = Σk P(y* | x*, Mk) P(Mk | D)

Page 13: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-13 Combining Estimators

Bayes’ Theorem

• Mk - model

• D - data

• P(D|Mk) - integrated likelihood of Mk

• P(Mk) - prior model probability

P(Mk | D) = P(D | Mk) P(Mk) / Σl=1..K P(D | Ml) P(Ml)
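As a concrete illustration of the weighted average above, here is a minimal Python sketch (not from the tutorial) that combines per-model predictive probabilities using posterior model weights. The BIC-style log marginal likelihoods and the uniform model prior are assumptions made for the example.

import numpy as np

def bma_predict(pred_probs, log_marginal_liks):
    # pred_probs[k] holds P(y* | x*, Mk) over the outcome values; log_marginal_liks[k]
    # approximates log P(D | Mk) (e.g., -BIC/2). A uniform prior P(Mk) is assumed.
    log_w = np.asarray(log_marginal_liks, dtype=float)
    log_w -= log_w.max()                          # stabilize before exponentiating
    weights = np.exp(log_w)
    weights /= weights.sum()                      # posterior P(Mk | D) under the uniform prior
    return weights @ np.asarray(pred_probs)       # P(y*|x*,D) = sum_k P(y*|x*,Mk) P(Mk|D)

# Hypothetical example: three candidate models, binary outcome
probs = [[0.70, 0.30], [0.55, 0.45], [0.90, 0.10]]
approx_log_evidence = [-120.4, -121.0, -135.2]
print(bma_predict(probs, approx_log_evidence))    # prints the model-averaged probabilities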

Page 14: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-14 Combining Estimators

Challenges

• The size of the model set may cause exhaustive summation to be impossible.

• The integrated likelihood of each model is usually complex.

• Specifying a prior distribution (even a non-informative one) across the space of models is non-trivial.

• Proposed solutions to these challenges often involve MCMC, BIC approximation, MLE approximation, Occam’s window, or Occam’s razor.

Page 15: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-15 Combining Estimators

Performance

• Survival model: Primary biliary cirrhosis
  – BMA vs. stepwise regression — 2% improvement
  – BMA vs. expert-selected model — 10% improvement

• Linear regression: Body fat prediction
  – BMA provides the best 90% predictive coverage.

• Graphical models
  – BMA yields an improvement

Page 16: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-16 Combining Estimators

BMA References

• Chris Volinsky’s BMA homepage: www.research.att.com/~volinsky/bma.html

• J. Hoeting, D. Madigan, A. Raftery, C. Volinsky (1999). “Bayesian Model Averaging: A Practical Tutorial” (to appear in Statistical Science), www.stat.colostate.edu/~jah/documents/bma2.ps

Page 17: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-17 Combining Estimators

Unstable predictors

We can always assume

y = f(x) + ε,  where E(ε | x) = 0

Assume that we have a way of constructing a predictor, f̂_D(x), from a dataset D.

We want to choose the estimator of f that minimizes J, squared loss for example:

J(f̂, D) = E_{y,x} ( y − f̂_D(x) )²

Page 18: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-18 Combining Estimators

Bias-variance decomposition

If we could average over all possible datasets, let the average prediction be

f̄(x) = E_D f̂_D(x)

The average prediction error over all datasets that we might see is decomposable:

E_D J(f̂, D) = E_{x,ε} ε²  +  E_x ( f̄(x) − f(x) )²  +  E_{x,D} ( f̂_D(x) − f̄(x) )²
            =   noise    +        bias²           +          variance
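The decomposition can be checked numerically. Below is a small Monte Carlo sketch (illustrative only, with an assumed sine target and polynomial fits) that repeatedly simulates datasets and estimates the noise, squared-bias, and variance terms at a single point.

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)      # assumed "true" regression function
sigma = 0.3                                   # noise standard deviation
x0 = 0.25                                     # point at which we examine the error

def fit_and_predict(degree):
    # Simulate one dataset D, fit a polynomial, and return its prediction f_hat_D(x0).
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + rng.normal(0, sigma, 30)
    return np.polyval(np.polyfit(x, y, degree), x0)

for degree in (1, 5):
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - true_f(x0)) ** 2
    print(f"degree {degree}: noise={sigma**2:.3f}  bias^2={bias2:.3f}  variance={preds.var():.3f}")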

Page 19: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-19 Combining Estimators

Bias-variance decomposition (cont.)

E_D J(f̂, D) = noise + bias² + variance

• The noise cannot be reduced.

• The squared-bias term might be reducible.

• The variance term is 0 if we use f̂_D(x) = f̄(x). But this requires having an infinite number of datasets.

Page 20: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-20 Combining Estimators

Bagging (Bootstrap Aggregating)

Goal: Variance reduction

Method: Create bootstrap replicates of the dataset and fit a model to each. Average the predictions of each model.

Properties:

• Stabilizes “unstable” methods

• Easy to implement, parallelizable

• Theory is not fully explained

Page 21: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-21 Combining Estimators

Bagging algorithm

1. Create K bootstrap replicates of the dataset.

2. Fit a model to each of the replicates.

3. Average (or vote) the predictions of the K models.

Bootstrapping simulates the stream of infinite datasets in the bias-variance decomposition.
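A minimal sketch of the three steps above, assuming scikit-learn decision trees as the base learner and 0/1 class labels; the function names are ours, not from the tutorial.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, K=100, rng=None):
    # Steps 1 & 2: create K bootstrap replicates and fit a tree to each.
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    models = []
    for _ in range(K):
        idx = rng.integers(0, n, n)                     # sample n rows with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bag_predict(models, X):
    # Step 3: majority vote of the K models (labels assumed 0/1).
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)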

Page 22: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-22 Combining Estimators

Bagging Example

[Scatterplot of the two-class training sample on predictors x1 and x2, each ranging from -1.0 to 1.0.]

Page 23: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-23 Combining Estimators

CART decision boundary

[Plot: decision boundary of a single CART tree on the same data.]

Page 24: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-24 Combining Estimators

100 bagged trees

[Plot: decision boundaries of 100 individual bagged trees overlaid on the data.]

Page 25: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-25 Combining Estimators

Bagged tree decision boundary

[Plot: decision boundary of the bagged-tree ensemble on the same data.]

Page 26: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-26 Combining Estimators

Regression results (squared error loss)

[Bar chart: squared error loss of CART vs. bagged CART on Boston Housing, Ozone, Friedman #1, Friedman #2, and Friedman #3.]

Page 27: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-27 Combining Estimators

Classification results (misclassification rates)

[Bar chart: misclassification rates of CART vs. bagged CART on Diabetes, Breast, Ionosphere, Heart, Soybean, Glass, and Waveform.]

Page 28: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-28 Combining Estimators

Bagging References

• Leo Breiman’s homepage: www.stat.berkeley.edu/users/breiman/

• Breiman, L. (1996). “Bagging Predictors,” Machine Learning, 24:2, 123-140.

• Friedman, J. and P. Hall (1999). “On Bagging and Nonlinear Estimation,” www.stat.stanford.edu/~jhf

Page 29: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-29 Combining Estimators

Boosting

Goal: Improve misclassification rates

Method: Sequentially fit models, each more heavily weighting those observations poorly predicted by the previous model

Properties:

• Bias and variance reduction

• Easy to implement

• Theory is not fully (but almost) explained

Page 30: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-30 Combining Estimators

Origin of Boosting

Classification problems:

{y, x}i , i = 1,…,n,  y ∈ {0, 1}

The task: construct a function F(x) : x → {0, 1} so that F minimizes misclassification error.

Page 31: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-31 Combining Estimators

Generic boosting algorithm

Equally weight the observations (y, x)i

For t in 1,…,T:
  Using the weights, fit a classifier ft(x) → y
  Upweight the poorly predicted observations
  Downweight the well-predicted observations

Merge f1,…,fT to form the boosted classifier

Page 32: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-32 Combining Estimators

Real AdaBoost (Schapire & Singer 1998)

yi ∈ {-1, 1},  wi = 1/N

For t in 1,…,T do:

1. Estimate Pw(y = 1 | x).

2. Set ft(x) = ½ log [ P̂w(y = 1 | x) / P̂w(y = -1 | x) ]

3. Set wi ← wi exp( -yi ft(xi) ) and renormalize.

Output the classifier F(x) = sign( Σt ft(x) )
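A rough Python rendering of the steps above, using depth-1 scikit-learn trees to supply the weighted class-probability estimates Pw(y = 1 | x); the labels are assumed coded as -1/+1 and the helper names are ours.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, T=50):
    # y is assumed coded as -1/+1.
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        p = stump.predict_proba(X)[:, 1].clip(1e-6, 1 - 1e-6)   # P_w(y = 1 | x); column 1 is class +1
        f = 0.5 * np.log(p / (1 - p))                            # step 2: half log-odds
        w *= np.exp(-y * f)                                      # step 3: upweight poorly predicted cases
        w /= w.sum()
        stumps.append(stump)
    return stumps

def adaboost_predict(stumps, X):
    F = np.zeros(len(X))
    for stump in stumps:
        p = stump.predict_proba(X)[:, 1].clip(1e-6, 1 - 1e-6)
        F += 0.5 * np.log(p / (1 - p))
    return np.sign(F)                                            # F(x) = sign(sum_t f_t(x))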

Page 33: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-33 Combining Estimators

AdaBoost’s Performance (Freund & Schapire [1996])

• Leo Breiman: AdaBoost with trees is the “best off-the-shelf classifier in the world.”

• Performs well with many base classifiers and in a variety of problem domains.

• AdaBoost is generally slow to overfit.

• Boosted naïve Bayes tied for first place in the 1997 KDD Cup (Elkan [1997]).

• Boosted naïve Bayes is a scalable, interpretable classifier (Ridgeway, et al. [1998]).

Page 34: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-34 Combining Estimators

Boosting Example

[Scatterplot of the two-class training sample on predictors x1 and x2.]

Page 35: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-35 Combining Estimators

After one iteration

[Plot: first CART splits; points with greater weight are drawn larger.]

Page 36: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-36 Combining Estimators

After 3 iterations

[Plot: the reweighted sample after 3 boosting iterations.]

Page 37: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-37 Combining Estimators

After 20 iterations

[Plot: the reweighted sample after 20 boosting iterations.]

Page 38: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-38 Combining Estimators

Decision boundary after 100 iterations

[Plot: decision boundary of the boosted classifier after 100 iterations.]

Page 39: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-39 Combining Estimators

Boosting as optimization

• Friedman, Hastie, Tibshirani [1998]: AdaBoost is an optimization method for finding a classifier.

• Let y ∈ {-1, 1}, F(x) ∈ (-∞, ∞). AdaBoost minimizes

J(F) = E( e^(-yF(x)) | x )

Page 40: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-40 Combining Estimators

Criterion

• E(e^(-yF(x))) bounds the misclassification rate:

I( yF(x) < 0 ) < e^(-yF(x))

• The minimizer of E(e^(-yF(x))) coincides with the maximizer of the expected Bernoulli log-likelihood:

E ℓ(p(x), y) = -E log( 1 + e^(-2yF(x)) )

Page 41: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-41 Combining Estimators

Optimization step

• Select f to minimize J:

J(F + f) = E( e^(-y(F(x) + f(x))) | x )

• The minimizer gives the boosting update

F(t+1)(x) ← F(t)(x) + ½ log [ Ew( I(y = 1) | x ) / ( 1 − Ew( I(y = 1) | x ) ) ],  where w(x, y) = e^(-yF(x))

Page 42: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-42 Combining Estimators

LogitBoost (Friedman, Hastie, Tibshirani [1998])

• Logistic regression:

y = 1 with probability p(x), 0 with probability 1 − p(x), where p(x) = 1 / (1 + e^(-F(x)))

• Expected log-likelihood of a regressor, F(x):

E ℓ(F) = E( yF(x) − log(1 + e^(F(x))) | x )

Page 43: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-43 Combining Estimators

Newton steps

• Iterate to optimize the expected log-likelihood:

J(F + f) = E( y(F(x) + f(x)) − log(1 + e^(F(x) + f(x))) | x )

F(t+1)(x) ← F(t)(x) − [ ∂J(F + f)/∂f ] / [ ∂²J(F + f)/∂f² ]  evaluated at f = 0

Page 44: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-44 Combining Estimators

LogitBoost, continued

• Newton steps for the Bernoulli likelihood:

F(x) ← F(x) + Ew( (y − p(x)) / (p(x)(1 − p(x))) | x ),  with weights w(x) = p(x)(1 − p(x))

• In practice the Ew(•|x) can be any regressor: trees, smoothers, etc.

• Trees are adaptive and work well for high-dimensional data.
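A sketch of the Newton-step loop in the slides’ parameterization (y in {0,1}, p = 1/(1 + e^(-F))), with a small scikit-learn regression tree standing in for the weighted regressor Ew(•|x); details such as the tree depth are our assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y, T=50, depth=2):
    # y is assumed coded as 0/1; p(x) = 1 / (1 + exp(-F(x))) as on the LogitBoost slide.
    y = np.asarray(y, dtype=float)
    F = np.zeros(len(y))
    regressors = []
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-F))
        w = np.clip(p * (1 - p), 1e-6, None)                 # Newton weights w(x) = p(x)(1 - p(x))
        z = (y - p) / w                                       # working response for the Newton step
        reg = DecisionTreeRegressor(max_depth=depth).fit(X, z, sample_weight=w)
        F += reg.predict(X)                                   # F(x) <- F(x) + E_w(z | x)
        regressors.append(reg)
    return regressors

def logitboost_proba(regressors, X):
    F = sum(reg.predict(X) for reg in regressors)
    return 1.0 / (1.0 + np.exp(-F))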

Page 45: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-45 Combining Estimators

Misclassification rates (Friedman, Hastie, Tibshirani [1998])

[Bar chart: misclassification rates of CART, AdaBoost with CART, and LogitBoost with CART on Breast, Ionosphere, Glass, Sonar, and Waveform.]

Page 46: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-46 Combining Estimators

Boosting References

• Rob Schapire’s homepage: www.research.att.com/~schapire

• Freund, Y. and R. Schapire (1996). “Experiments with a new boosting algorithm,” Machine Learning: Proceedings of the 13th International Conference, 148-156.

• Jerry Friedman’s homepage: www.stat.stanford.edu/~jhf

• Friedman, J., T. Hastie, R. Tibshirani (1998). “Additive Logistic Regression: a statistical view of boosting,” Technical report, Statistics Department, Stanford University.

Page 47: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-47 Combining Estimators

In general, combining (“bundling”) estimators consists of two steps:

1) Constructing varied models, and
2) Combining their estimates

Generate component models by varying:
• Case Weights
• Data Values
• Guiding Parameters
• Variable Subsets

Combine estimates using:
• Estimator Weights
• Voting
• Advisor Perceptrons
• Partitions of Design Space, X

Page 48: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-48 Combining Estimators

Other Bundling Techniques

We’ve Examined:
• Bayesian Model Averaging: sum estimates of possible models, weighted by posterior evidence
• Bagging (Breiman 96) (bootstrap aggregating) -- bootstrap data (to build trees mostly); take majority vote or average
• Boosting (Freund & Schapire 96) -- weight error cases by βt = (1-e(t))/e(t), iteratively re-model; average, weighting model t by ln(βt)

Additional Example Techniques:
• GMDH (Ivakhnenko 68) -- multiple layers of quadratic polynomials, using two inputs each, fit by linear regression
• Stacking (Wolpert 92) -- train a 2nd-level (LR) model using leave-1-out estimates of 1st-level (neural net) models
• ARCing (Breiman 96) (Adaptive Resampling and Combining) -- bagging with reweighting of error cases; superset of boosting
• Bumping (Tibshirani 97) -- bootstrap, select single best
• Crumpling (Anderson & Elder 98) -- average cross-validations
• Born-Again (Breiman 98) -- invent new X data...

Page 49: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-49 Combining Estimators

Group Method of Data Handling (GMDH)

• Try all pairs of variables (K choose 2) in quadratic polynomial nodes.

• Fit coefficients using regression.

• Keep best M nodes.

• Train model on one training data set, score on a test data set. (Need a third data set for independent confirmation of the model.)

(A minimal sketch of one GMDH layer appears after the network diagram below.)

[Network diagram: inputs a–f feed Layer 1 nodes z1–z3, each a quadratic polynomial of two inputs of the form A0 + A1·a + A2·b + A3·a² + A4·ab + A5·b²; surviving node outputs feed Layer 2 and Layer 3 nodes, ending in the estimate y.]
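For illustration, a minimal sketch of a single GMDH layer following the description above: every pair of inputs gets a quadratic node fit by least squares on training data and scored on a separate test set, and the best M nodes are kept. The function names and the train/test split handling are our assumptions.

import numpy as np
from itertools import combinations

def quad_features(a, b):
    # Design matrix for A0 + A1*a + A2*b + A3*a^2 + A4*a*b + A5*b^2
    return np.column_stack([np.ones_like(a), a, b, a**2, a*b, b**2])

def gmdh_layer(X_train, y_train, X_test, y_test, M=5):
    # One GMDH layer: fit a quadratic node for every pair of inputs by least squares on
    # the training set, score it on the test set, and keep the best M nodes.
    nodes = []
    for i, j in combinations(range(X_train.shape[1]), 2):
        Z = quad_features(X_train[:, i], X_train[:, j])
        coef, *_ = np.linalg.lstsq(Z, y_train, rcond=None)
        test_pred = quad_features(X_test[:, i], X_test[:, j]) @ coef
        nodes.append((np.mean((y_test - test_pred) ** 2), i, j, coef))
    nodes.sort(key=lambda node: node[0])
    return nodes[:M]   # the surviving nodes' outputs become candidate inputs to the next layer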

Page 50: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-50 Combining Estimators

Polynomial Networks (ASPN)

[Network diagram: an ASPN model with input normalizers (Layer 0), layers of Single, Double, Triple, and Multi-Linear polynomial nodes (Layers 1–3), and output unitizers. Example node: z17 = 3.1 + 0.4a - 0.15b² + 0.9bc - 0.62abc + 0.5c³]

Page 51: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-51 Combining Estimators

When does Bundling work?

• Breiman (1996): when the prediction method is unstable (significantly different models are constructed)

• Ali & Pazzani (1996): when there is low noise, lots of irrelevant variables, and good individual predictors which make different errors

Hypotheses:

• when models are slightly overfit

• when models are from different families

Page 52: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-52 Combining Estimators

Advanced techniques

• Stochastic gradient boosting

• Adaptive bagging

• Example regression and classification results

Page 53: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-53 Combining Estimators

Stochastic Gradient Boosting

Goal: Non-parametric function estimation

Method: Cast the problem as optimization and use gradient ascent to obtain the predictor

Properties:

• Bias and variance reduction

• Widely applicable

• Can make use of existing algorithms

• Many tuning parameters

Page 54: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-54 Combining Estimators

Improving boosting

• Boosting usually has the form

F(t+1)(x) ← F(t)(x) + λ Ew( z(y, x) | x )

Improve by...

• Sub-sampling a fraction of the data at each step when computing the expectation.

• “Robustifying” the expectation.

• Trimming observations with small weights.
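A small sketch of the sub-sampled update for squared-error loss, with a shallow scikit-learn tree playing the role of Ew( z(y, x) | x ) and λ as a shrinkage parameter; the specific settings (depth 3, 50% subsampling, λ = 0.05) are illustrative assumptions, not values from the tutorial.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sgb_fit(X, y, T=200, shrinkage=0.05, subsample=0.5, rng=None):
    # Stochastic gradient boosting for squared-error loss: each step fits a small tree to the
    # current residuals (the negative gradient) on a random subsample, then adds a shrunken copy.
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    F = np.zeros(n)
    trees = []
    for _ in range(T):
        residual = np.asarray(y, dtype=float) - F
        idx = rng.choice(n, size=max(1, int(subsample * n)), replace=False)
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], residual[idx])
        F += shrinkage * tree.predict(X)                  # F(t+1) <- F(t) + lambda * fitted step
        trees.append(tree)
    return trees

def sgb_predict(trees, X, shrinkage=0.05):
    return shrinkage * sum(t.predict(X) for t in trees)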

Page 55: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-55 Combining Estimators

Stochastic gradient boosting offers...

• Application to likelihood-based models (GLM, Cox models)

• Bias reduction - non-linear fitting

• Massive datasets - bagging, trimming

• Variance reduction - bagging

• Interpretability - additive models

• High-dimensional regression - trees

• Robust regression

Page 56: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-56 Combining Estimators

SGB References

• Friedman, J. (1999). “Greedy function approximation: a gradient boosting machine,” Technical report, Dept. of Statistics, Stanford University.

• Friedman, J. (1999). “Stochastic gradient boosting,”Technical report, Dept. of Statistics, Stanford University.

Page 57: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-57 Combining Estimators

Adaptive Bagging

Goal: Bias and variance reduction

Method: Sequentially fit bagged models, where each fits the current residuals

Properties:

• Bias and variance reduction

• No tuning parameters

Page 58: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-58 Combining Estimators

Adaptive bagging algorithm

1. Fit a bagged regressor to the dataset D.

2. Predict the “out-of-bag” observations.

3. Fit a new bagged regressor to the bias (error) and repeat.

For a new observation, sum the predictions from each stage.
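An illustrative sketch of these steps for regression, using bagged scikit-learn trees and out-of-bag predictions to form each stage’s residuals; the staging details and helper names are our assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_stage(X, r, K=25, rng=None):
    # Bag K trees on the current residuals r; also collect out-of-bag predictions.
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(r)
    models, oob_sum, oob_cnt = [], np.zeros(n), np.zeros(n)
    for _ in range(K):
        idx = rng.integers(0, n, n)                       # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)             # cases left out of this replicate
        m = DecisionTreeRegressor().fit(X[idx], r[idx])
        models.append(m)
        oob_sum[oob] += m.predict(X[oob])
        oob_cnt[oob] += 1
    oob_pred = oob_sum / np.maximum(oob_cnt, 1)           # 0 for the rare never-out-of-bag case
    return models, oob_pred

def adaptive_bagging(X, y, stages=3):
    # Each stage fits a bagged regressor to the residuals ("bias") left by the previous stages.
    rng = np.random.default_rng(0)
    r, fitted_stages = np.asarray(y, dtype=float), []
    for _ in range(stages):
        models, oob_pred = bagged_stage(X, r, rng=rng)
        fitted_stages.append(models)
        r = r - oob_pred                                  # out-of-bag residuals feed the next stage
    return fitted_stages

def adaptive_predict(fitted_stages, X):
    # For a new observation, sum the (bag-averaged) predictions from each stage.
    return sum(np.mean([m.predict(X) for m in models], axis=0) for models in fitted_stages)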

Page 59: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-59 Combining Estimators

Regression results (squared error loss)

[Bar chart: squared error loss of bagging vs. adaptive bagging on Abalone, Robot arm, Peak20, Boston Housing, Ozone, Servo, Friedman #1, Friedman #2, and Friedman #3.]

Page 60: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-60 Combining Estimators

Classification results (misclassification rates)

[Bar chart: misclassification rates of AdaBoost vs. adaptive bagging on Breast, Ionosphere, Sonar, Heart, German credit, Votes, Liver, and Diabetes.]

Page 61: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-61 Combining Estimators

Relative Performance Examples: 5 Algorithms on 6 Datasets
(John Elder, Elder Research & Stephen Lee, U. Idaho, 1997)

[Bar chart: relative error (scaled 0 to 1) of Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, and Decision Tree on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets.]

Page 62: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-62 Combining Estimators

Essentially every Bundling method improves performance

[Bar chart: relative error (scaled 0 to 1) on the same six datasets for four ways of bundling the five algorithms: Advisor Perceptron, AP weighted average, Vote, and Average.]

Page 63: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-63 Combining Estimators

Application Ex.: Direct Marketing (Elder Research 1996-1998)

• Model respondents to direct marketing as a binary variable: 0 (no response), 1 (response).

• Create models using several (here, 5) different algorithms, all employing the same candidate model inputs.

• Rank-order model responses:

  – Give the highest-probability response value a rank of 1, the second highest value 2, etc.

  – For bundling, combine model ranks (not estimates) into a new consensus estimate (which is again ranked). (A sketch of this rank-averaging step follows this list.)

• Report number of response cases missed (in the top portion).
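A small sketch of the rank-averaging step for bundling model scores, using scipy’s rankdata; the scores shown are hypothetical.

import numpy as np
from scipy.stats import rankdata

def combine_by_rank(score_columns):
    # Rank each model's scores (rank 1 = highest predicted response probability),
    # average the ranks across models, then re-rank the consensus.
    ranks = np.column_stack([rankdata(-np.asarray(s)) for s in score_columns])
    return rankdata(ranks.mean(axis=1))       # rank 1 = best consensus prospect

# Hypothetical scores for five prospects from three models
m1 = [0.90, 0.10, 0.40, 0.70, 0.20]
m2 = [0.80, 0.30, 0.35, 0.60, 0.15]
m3 = [0.85, 0.05, 0.50, 0.55, 0.25]
print(combine_by_rank([m1, m2, m3]))          # prints the consensus ranks, e.g. [1.  4.5 3.  2.  4.5]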

Page 64: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-64 Combining Estimators

Marketing Application Performance

[Chart: number of response cases missed (y-axis, 50 to 80) versus number of models combined (0 to 5, averaging output rank). Single models (Bundled Trees T, Stepwise Regression S, Polynomial Network P, Neural Network N, MARS M) appear at one model; their 2- to 5-way rank-averaged combinations (labels such as NT, SPT, SMPNT) tend to miss fewer cases as more models are combined.]

Page 65: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-65 Combining Estimators

Median (and Mean) Error Reduced with each Stage of Combination

[Chart: number of response cases missed (y-axis, 55 to 75) versus number of models in combination (1 to 5); median and mean error fall with each added model. NOTE: Fewer misses is better.]

Page 66: Combining Estimators to Improve Performance · Combining Estimators to Improve Performance A survey of “model bundling” techniques --from boosting and bagging, to Bayesian model

© 1999 Elder & Ridgeway KDD99 T5-66 Combining Estimators

Why Bundling works

• (semi-) Independent Estimators

• Bayes Rule - weighing evidence

• Shrinking (ex.: stepwise LR)

• Smoothing (ex.: decision trees)

• Additive modeling and maximum likelihood (Friedman, Hastie, & Tibshirani 8/20/98)

… Open research area. Meanwhile, we recommend bundling competing candidate models both within, and between, model families.

...and in a multitude of counselors there is safety.
Proverbs 24:6b