Tong Zhang - Model Combination
TRANSCRIPT
Standard model for supervised learning
Goal: predict Y based on X via a prediction function f(x).
Data: (X, Y) are randomly drawn from an underlying distribution D.
Training data: iid samples from D:

    Sn = {(X1, Y1), ..., (Xn, Yn)}

Learning: construct a prediction function f from the training data; want to minimize the future loss over D:

    E_{(X,Y)~D} L(f(X), Y)
T. Zhang (Rutgers) Model Combination 2 / 19
Learning Algorithm
Learning algorithm A: given training data Sn = {(Xi, Yi)}_{i=1,...,n}, learn a prediction rule f = A(Sn) from Sn.
Examples: linear logistic regression, SVM, ridge regression, etc.; e.g., learning a linear prediction f(x) = β^T x from training data.
Model selection: many algorithms, which one to choose? Goal: good interpretability and/or better performance.
Model averaging/combination: many algorithms, how to combine them? Goal: better performance.
Model Combination
Also called ensemble, blending, or averaging. Approaches include:
- equally weighted averaging (voting)
- exponentially weighted model averaging
- weight optimization using stacking
- bagging
- additive models and boosting
Equally weighted averaging (voting)
Models {Mj}_{j=1,...,L}, with prediction functions fj(x) learned from training data. The voting ensemble is

    f(x) = (1/L) sum_{j=1}^{L} fj(x).

For convex loss functions, averaging reduces model estimation variance. For least squares:

    E_{X,Y} ( (1/L) sum_{j=1}^{L} fj(X) - Y )^2          [ensemble performance]
      = (1/L) sum_{j=1}^{L} E_{X,Y} (fj(X) - Y)^2        [average single model performance]
      - (1/L) sum_{j=1}^{L} E_{X,Y} (fj(X) - f(X))^2     [model diversity]
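The least-squares decomposition above can be checked numerically. A minimal sketch (the data and the L noisy linear "models" below are synthetic, chosen only for illustration): the ensemble's mean squared error equals the average single-model error minus the diversity term, exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: L = 5 noisy linear predictors of a 1-d target.
n, L = 1000, 5
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(scale=0.5, size=n)

# Each "model" is the true slope perturbed by noise -> diverse predictions.
preds = np.stack([(2.0 + rng.normal(scale=0.3)) * X for _ in range(L)])

ensemble = preds.mean(axis=0)              # f(x) = (1/L) sum_j fj(x)
ensemble_mse = np.mean((ensemble - Y) ** 2)
avg_single_mse = np.mean((preds - Y) ** 2)  # average over models and samples
diversity = np.mean((preds - ensemble) ** 2)

# The slide's identity: ensemble = average single - diversity
assert np.isclose(ensemble_mse, avg_single_mse - diversity)
```

Because the diversity term is nonnegative, the ensemble never does worse than the average single model under squared loss.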
Model Selection versus Averaging
Model selection: works better if one model is significantly more accurate than the other models.
- no ambiguity about which single model is better
Equally weighted averaging: works better if all models have similar prediction accuracy but are different.
- some ambiguity about which single model is better
Key to the success of a model ensemble (averaging):
- all models have reasonably high performance
- models are diverse (they make different predictions)
Ideas to Generate Diverse Models
In competitions, to improve performance, one often creates a diverse set of models with similar performance.
How to do so?
- different learning algorithms
- tuning parameter variation
- different variations of features
- randomization (e.g., bagging)
Averaging diverse models nearly always helps a little.
- related reason: when models are ambiguous, model combination is more stable than model selection
Example
Winning strategy in a recent predictive modeling competition with more than 200 teams.
Performance is measured by a certain error metric (the smaller the better); test data size: 60,000.
Ensemble: generate 62 models using randomization and variations of ensemble trees.
Test error:
- single best: 0.688 (this performance alone already wins the competition)
- combination of five models: 0.685 (0.4% error reduction)
- combination of thirty models: 0.682 (0.9% error reduction)
- combination of sixty-two models: 0.680 (1.2% error reduction)
Larger gains on some other data: create a more diverse ensemble.
Exponentially Weighted Model Averaging
Observe data (Xi, Yi) and want to find a combination that minimizes a certain error ERROR(f).
Consider model averaging with model outputs f1(x), ..., fL(x), and

    f(x) = w1 f1(x) + ... + wL fL(x)

Exponentially weighted model averaging:

    wj = e^{-λ ERROR(fj)} / sum_{i=1}^{L} e^{-λ ERROR(fi)}

Give larger weight to smaller error (the error can be measured on either training or validation data).
- λ = ∞: model selection (pick the model with the smallest error)
- λ = 0: equally weighted model averaging
Good theoretical property; related to Bayesian model averaging
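The weighting formula can be sketched in a few lines (the error values and λ settings below are made up for illustration); it is a softmax over negative scaled errors, computed with max-subtraction for numerical stability:

```python
import numpy as np

def exp_weights(errors, lam):
    # wj proportional to exp(-lam * ERROR(fj)); subtract the max exponent
    # before exponentiating so large lam does not overflow.
    z = -lam * np.asarray(errors, dtype=float)
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

errors = [0.30, 0.32, 0.45]           # hypothetical validation errors, 3 models

print(exp_weights(errors, lam=0.0))   # equal weights: plain model averaging
print(exp_weights(errors, lam=10.0))  # in between: graded by error
print(exp_weights(errors, lam=1e6))   # nearly one-hot on the smallest error: model selection
```

Sweeping λ from 0 to ∞ thus interpolates between equally weighted averaging and model selection, as stated above.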
Stacking
Split the data into training + validation.
The L models fj(x) are obtained from the training data.
On the validation data, define VALIDATION ERROR(fj), and find a combination

    f(x) = sum_{j=1}^{L} wj fj(x),   wj >= 0,   sum_j wj = 1,

that minimizes the validation error:

    min_w VALIDATION ERROR( sum_{j=1}^{L} wj fj )

This is more complex than exponentially weighted model averaging.
More generally: treat {fj(x)} as features and use any learning algorithm to train a combination model with features {fj(x)} on the validation data.
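The "more generally" variant can be sketched as follows: treat the base-model predictions on validation data as features and fit them with ordinary least squares. Note this drops the simplex constraint (wj >= 0, sum wj = 1), so it is one possible stacking learner rather than the exact constrained objective above; all data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical validation set: n points, L = 3 base-model predictions.
n = 200
y_val = rng.normal(size=n)
F = np.stack([y_val + rng.normal(scale=s, size=n)  # column j: model j's predictions
              for s in (0.3, 0.5, 0.8)], axis=1)

# Treat {fj(x)} as features; fit an unconstrained linear combination.
w, *_ = np.linalg.lstsq(F, y_val, rcond=None)

def stacked(fx):
    # fx: length-L vector of base-model predictions at a new point x
    return fx @ w

stack_mse = np.mean((F @ w - y_val) ** 2)
best_single = min(np.mean((F[:, j] - y_val) ** 2) for j in range(F.shape[1]))
# In-sample, least squares can never do worse than any single model,
# since picking one model is itself a linear combination.
assert stack_mse <= best_single + 1e-12
```

On held-out data the gain is not guaranteed, which is why the weights are fit on a validation split rather than the training split.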
Example
Winning strategy of an older competition: CoNLL 2003 English named entity recognition.
Performance is measured by F-measure (similar to accuracy): the higher the better.
Train four classifiers using different algorithms.
- best single: 89.94
- equally weighted averaging: 91.56
- stacking (training a linear combination of the classifiers): 91.63
Generating Diverse Ensemble: Bagging
How do we generate ensemble candidates from a single learning algorithm A?
Answer: bagging.
- Build fi from bootstrapped training samples: draw n samples S_n^i from the training data Sn, with replacement, and set fi = A(S_n^i).
- Form the voted classifier: f = (1/L) sum_{i=1}^{L} fi.
When does it help?
- unstable classifiers: the fi are very different from each other
- the fi have similar performance
How does it help?
- variance reduction: the ensemble is more stable, even though each classifier effectively uses less data
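The bagging procedure can be sketched as below. The base learner here (a one-dimensional least-squares slope) is a made-up stand-in for an arbitrary algorithm A; any function mapping (X, y) to a predictor would do.

```python
import numpy as np

def bag(train, learn, L, rng):
    """Train L models, each on a bootstrap resample of `train`.

    train: (X, y) arrays; learn: algorithm A mapping (X, y) -> predict fn.
    """
    X, y = train
    n = len(y)
    models = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)      # draw n samples with replacement
        models.append(learn(X[idx], y[idx]))  # fi = A(S_n^i)
    # Equally weighted vote/average of the L models.
    return lambda x: np.mean([f(x) for f in models], axis=0)

# Hypothetical base learner: fit a 1-d least-squares slope (no intercept).
def fit_slope(X, y):
    b = (X @ y) / (X @ X)
    return lambda x: b * x

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(size=100)
f = bag((X, y), fit_slope, L=25, rng=rng)
```

Each bootstrap resample gives a slightly different fi; averaging them is what reduces the variance.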
Other Ideas to Generate Ensemble Candidates
- Use boosting: can be regarded either as a single model or as an ensemble method.
- Use different learning algorithms (different loss functions, regularizations).
- Use different variations of data processing:
  - different transformations of features
  - different subsets (selection) of features; feature selection is unstable, so choosing from multiple ensembles helps
- Use different tuning parameter configurations: e.g., varying λ for ridge regression or K for nearest neighbors.
- Use randomization:
  - randomly select training data (bagging)
  - randomly select features or randomly select basis functions
  - randomize the algorithm, especially if it contains combinatorial elements (e.g., random forest for decision tree building)
Decision Tree
[Figure: a decision tree. Internal nodes test feature thresholds (A < 2, B < 2, B < 4) with true/false branches; leaves predict class labels X or Y.]
A and B: features.
X and Y: predicted class labels.
Training: greedily test features to split at each node.
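The greedy split step can be sketched as follows. The slide does not name a split criterion, so Gini impurity is an assumption here, and the data are a toy example where feature A (column 0) separates the classes:

```python
import numpy as np

def best_split(X, y):
    """One greedy step of decision-tree training: scan every feature and
    threshold, return the split minimizing weighted Gini impurity."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)

    n, d = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, impurity)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data: feature 0 ("A") separates the classes at A < 3.
X = np.array([[1, 5], [1, 1], [3, 5], [3, 1]])
y = np.array([0, 0, 1, 1])
j, t, _ = best_split(X, y)
```

A full tree builder would recurse on the left and right subsets until the nodes are pure or some depth limit is reached.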
Remarks on Decision Tree
Advantages:
- interpretable
- handles non-homogeneous features easily
- finds non-linear interactions
Disadvantage:
- usually not the most accurate classifier by itself
Improving Decision Tree Performance
Improve accuracy through tree ensemble learning:
- bagging: generate bootstrap samples; train one tree per bootstrap sample; take an equally weighted average of the trees
- random forest: bagging with additional randomization
- boosting
Bagging and Random Forest
Bagging:
- build fi from bootstrapped training samples
- form the voted classifier: f = (1/m) sum_{i=1}^{m} fi
Random forest:
- generate bootstrap samples
- build one tree per bootstrap sample
- increase diversity via additional randomization: randomly pick a subset of features to consider for the split at each node
- take an equally weighted average of the trees
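The extra randomization step can be sketched in isolation. The sqrt(d) default subset size is a common convention, not something stated on the slide:

```python
import numpy as np

def rf_node_features(d, rng, k=None):
    # Random forest's additional randomization: at each tree node, search
    # for a split over only k of the d features (common default: sqrt(d)).
    k = k or max(1, int(np.sqrt(d)))
    return rng.choice(d, size=k, replace=False)

rng = np.random.default_rng(0)
cols = rf_node_features(d=16, rng=rng)  # e.g., 4 of 16 features at this node
```

Because a fresh subset is drawn at every node, even trees grown on the same bootstrap sample end up different, increasing diversity.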
Why Bagging and RF work
- Trees are unstable, so different random trees are very different.
- Deep trees may be generated, with each tree overfitting the data.
- Equally weighted averaging reduces the variance.
- Related to Bayesian model averaging over all possible trees.
Summary
Model selection: find the best single model.
- better interpretability, but may be unstable under ambiguity
Model averaging/combination: find the best combination of models.
- various methods
Model averaging is often more effective than model selection.
- an ensemble can improve stability: the ensemble is stable even if the individual models are not
Key issue in model averaging: generate diverse models with good performance.