boosting
DESCRIPTION
Tony Jebara, Advanced Machine Learning. Boosting methods.
TRANSCRIPT
Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara
Boosting
• Combining Multiple Classifiers
• Voting
• Boosting
• AdaBoost
• Based on material by Y. Freund, P. Long & R. Schapire (course co-taught with them)
Freund and Schapire had the big breakthrough algorithm: the idea of taking many weak learners and combining them to get a strong learner.
Daniel Zhu's class is only on this.
Combining Multiple Learners
• Have many simple learners, also called base learners or weak learners, which have a classification error of < 0.5
• Combine or vote them to get a higher accuracy
• No free lunch: there is no guaranteed best approach here
• Different approaches:
  Voting: combine learners with fixed weights
  Mixture of Experts: adjust learners and a variable weight/gating function
  Boosting: actively search for the next base learners and vote
  Cascading, Stacking, Bagging, etc.
A better way to do it is an online learning framework.
A decision stump is a weak learner that looks at one feature only; maybe a little better than random.
A newer method is "regret minimization", related to multi-armed bandits (Daniel Zhu's course) and online learning: not just voting on a fixed bunch of experts, you have billions or trillions of experts, all with weight zero, and you pick one and add it.
Q: will you be able to do gradient descent like with kernel learning with switches?
Face finding: the Viola-Jones boosted cascade catches a face without having to scan the entire image.
Voting
• Have T classifiers
• Average their predictions with weights
• Like mixture of experts, but the weight is constant with respect to the input
f(x) = \sum_{t=1}^{T} \alpha_t h_t(x), \quad \text{where } \alpha_t \ge 0 \text{ and } \sum_{t=1}^{T} \alpha_t = 1
[Figure: input x feeds T experts h_t(x); their outputs are combined with weights \alpha_t into the prediction y]
Instead of voting (a linear combination of decisions), you could do gating (trust doctor 1 on children; doctor 2 is best on elders). The big thing is that you have to make assumptions to do well.
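The fixed-weight vote is easy to sketch in code. Everything below (the toy experts, the weights) is made up for illustration, not from the lecture:

```python
import numpy as np

def vote(x, learners, alphas):
    """Fixed-weight vote: f(x) = sum_t alpha_t * h_t(x)."""
    alphas = np.asarray(alphas, dtype=float)
    # Weights are nonnegative and sum to 1, as on the slide.
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    return sum(a * h(x) for a, h in zip(alphas, learners))

# Three toy "experts" that each output +1 or -1:
h1 = lambda x: 1 if x[0] > 0 else -1
h2 = lambda x: 1 if x[1] > 0 else -1
h3 = lambda x: 1 if x[0] + x[1] > 0 else -1

f = vote((2.0, -1.0), [h1, h2, h3], [0.5, 0.25, 0.25])
label = np.sign(f)  # final prediction is the sign of the weighted vote
```

Here the vote is 0.5·(+1) + 0.25·(−1) + 0.25·(+1) = 0.5, so the committee predicts +1.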
Mixture of Experts
• Have T classifiers or experts and a gating function
• Average their predictions with variable weights
• But adapt the parameters of the gating function and the experts (fixed total number T of experts)
f(x) = \sum_{t=1}^{T} \alpha_t(x) h_t(x), \quad \text{where } \alpha_t(x) \ge 0 \text{ and } \sum_{t=1}^{T} \alpha_t(x) = 1
[Figure: the input feeds both the experts h_t(x) and a gating network that produces the weights \alpha_t(x); their weighted combination gives the output]
The slightly smarter thing to do: change alpha (the confidence in each classifier), changing the allocation of alpha depending on the input (child vs. elder). Pretty much the same as a neural network (which is why the figures are the same); the same formula as before, but now alpha depends on x.
Boosting
• Actively find complementary or synergistic weak learners
• Train the next learner based on the mistakes of previous ones
• Average predictions with fixed weights
• Find the next learner by training on weighted versions of the data
training data: \{(x_1, y_1), \ldots, (x_N, y_N)\}
weights: \{w_1, \ldots, w_N\}
weak learner: h_t(x)
weak rule: \frac{\sum_{i=1}^{N} w_i \,\mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma
ensemble of learners: \{(\alpha_1, h_1), \ldots, (\alpha_T, h_T)\}
prediction: \sum_{t=1}^{T} \alpha_t h_t(x)
We are searching over a massive pool of learners. There is no gating function here; find someone who's good at what you're not good at. Find the next learner by looking at which examples we messed up on, and pick and train an expert for the previously mistaken labels. The idea is that eventually this will converge. We greedily add the next best expert (there is some nice theory that you'll find the equivalent of the best possible large-margin solution), which lets you throw in anything as a weak learner.
We have a lot of data. All examples initially have equal weights. After one round, we weight down the correct answers and increase the weights on the incorrect answers.
The weak rule measures how well the weak learner performs, with a non-negative gamma. It is an average loss, an empirical risk: a correct answer is the step function of a negative number, which maps to zero; otherwise you pay 1 dollar. An error of 1/2 means just random guessing, so 1/2 is not even a learner; the rule demands a little better than random.
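The weak-rule condition (weighted error below 1/2 − γ) can be checked directly; the data, the stump, and γ below are invented for illustration:

```python
import numpy as np

def weighted_error(h, X, y, w):
    """Weighted empirical error: sum_i w_i * 1[h(x_i) != y_i] / sum_i w_i."""
    mistakes = np.array([h(x) != yi for x, yi in zip(X, y)], dtype=float)
    return float(np.dot(w, mistakes) / w.sum())

X = np.array([[-2.0], [-1.0], [0.5], [1.0], [2.0]])
y = np.array([-1, -1, -1, 1, 1])
w = np.ones(len(X)) / len(X)             # round one: all examples weighted equally

stump = lambda x: 1 if x[0] > 0 else -1  # weak learner: one threshold on one feature
eps = weighted_error(stump, X, y, w)     # 1/5: the stump only errs on x = 0.5
gamma = 0.1
qualifies = eps < 0.5 - gamma            # better than random by at least gamma
```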
Q: at the end, do we have similar results to SVMs, where a few support vectors have large Lagrange multipliers and the rest do not?
AdaBoost
• Most popular weighting scheme
• Define the margin for point i as below
• Find an h_t and a weight α_t to minimize the cost function
sum of exp-margins
(Same setup as the previous slide: training data \{(x_1,y_1),\ldots,(x_N,y_N)\}, weights \{w_1,\ldots,w_N\}, weak learner h_t(x), ensemble \{(\alpha_1,h_1),\ldots,(\alpha_T,h_T)\}, prediction \sum_{t=1}^{T}\alpha_t h_t(x), weak rule as before.)
margin for point i: \; y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)
cost function: \; \sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)
There are also LogitBoost and BrownBoost. AdaBoost focuses on the margin. The margin is a quantity of risk: how much the committee is voting for the answer. We want to minimize the error, using a classifier h that minimizes error. Instead of working on ugly step functions, let's work with an upper bound on the step function and minimize that. That is what AdaBoost adds (the smoothing). You can imagine a hinge loss like the SVM's; other boosting methods introduce other ways to smooth the step function. Another benefit here: each exponential term is convex, so the sum of exponentials is also convex.
This is the margin; we want h_t(x_i) to have the right sign and be large.
AdaBoost
• Choose the base learner & α_t to minimize the cost below
• Recall that the error of the base classifier h_t must be less than 1/2 − γ
• For binary h, AdaBoost puts the following weight on weak learners (instead of the more general rule)
• AdaBoost picks the following weights on the data for the next round (here Z_t is the normalizer so the weights sum to 1)
\min_{\alpha_t, h_t} \sum_{i=1}^{N} \exp\left(-y_i \alpha_t h_t(x_i)\right)
\varepsilon_t = \frac{\sum_{i=1}^{N} w_i \,\mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma
\alpha_t = \frac{1}{2} \ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)
w_i^{t+1} = \frac{w_i^t \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}
Q: are you still getting a linear classifier (like a 1-layer neural network), at least for each dimension that you work with?
We are minimizing the exponentiated margins over all classifiers.
Is there a missing w_i^t in the cost here?
Q: is each weak learner always just a binary classifier?
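Combining the three rules on this slide (the weighted error ε_t, the learner weight α_t = ½ ln((1−ε_t)/ε_t), and the normalized reweighting) gives the standard AdaBoost loop. A compact sketch with exhaustive threshold stumps; the data and helper names are illustrative, not the lecture's code:

```python
import numpy as np

def fit_stump(X, y, w):
    """Search all axis-aligned threshold stumps; return the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = float(np.dot(w, pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=10):
    N = len(y)
    w = np.ones(N) / N                             # uniform initial weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = fit_stump(X, y, w)
        eps = max(err, 1e-10)                      # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)      # weight on this weak learner
        pred = np.where(X[:, j] > thr, sign, -sign)
        w = w * np.exp(-alpha * y * pred)          # upweight mistakes, downweight correct
        w = w / w.sum()                            # Z_t: renormalize to sum to 1
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    f = sum(a * np.where(X[:, j] > thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(f)

# A 1D "interval" target no single stump can get right, but a few boosted stumps can:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, 1, -1])
model = adaboost(X, y, T=5)
train_err = float(np.mean(predict(model, X) != y))  # reaches 0 after a few rounds
```

Each round the weights of misclassified points grow by exp(α_t) and those of correct points shrink, so the next stump is forced to fix earlier mistakes.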
Decision Trees
[Figure: a tree with root split X>3; if no, predict -1; if yes, split Y>5: predict +1 if yes, -1 if no. Equivalently, axis-aligned cuts at X=3 and Y=5 partition the (X, Y) plane into +1 and -1 regions.]
This weak learner is called a stump: a classifier that cuts along one feature only. The simplest possible classifier: axis-aligned cuts.
Q: Does it matter what order the stumps are in? I think so... so you don't have a deterministic way to create a decision tree?
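Read off the figure above (assuming the splits X>3 and Y>5 shown there), the tree is just two nested stumps:

```python
def tree_predict(x, y):
    """Decision tree from the figure: root stump X>3, then a stump Y>5 on the 'yes' branch."""
    if x > 3:
        return 1 if y > 5 else -1  # region x>3: +1 above y=5, -1 below
    return -1                      # region x<=3: always -1
```

Each internal node is itself a decision stump; boosting will instead add such stumps one at a time and sum their votes rather than nesting them.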
Decision tree as a sum
[Figure: the same tree rewritten additively: a root constant -0.2, plus contributions +0.2/-0.3 from the Y>5 split and -0.1/+0.1 from the X>3 split; the final ±1 prediction is the sign of the sum.]
An alternating decision tree
[Figure: the additive tree extended with an extra split Y<1 contributing 0.0/+0.7 alongside the root constant -0.2, the Y>5 split (+0.2/-0.3) and the X>3 split (-0.1/+0.1); the prediction is still the sign of the summed contributions, so new splits can attach anywhere rather than only nesting.]
Find the best stump and then add it: that is boosting.
Example: Medical Diagnostics
• Cleve dataset from UC Irvine database.
• Heart disease diagnostics (+1=healthy,-1=sick)
• 13 features from tests (real valued and discrete).
• 303 instances.
Ad-Tree Example
Q: would it be crazy to build a GM off of this tree from boosting?
Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%
Don't worry about a hierarchy of applying stumps; just apply all stumps to all datapoints.
AdaBoost Convergence
• Rationale? Consider a bound on the training error (below)
• AdaBoost is essentially doing gradient descent on this bound. Convergence?
R_{emp} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{step}\left(-y_i f(x_i)\right) \quad \text{(0-1 loss)}
\le \frac{1}{N}\sum_{i=1}^{N} \exp\left(-y_i f(x_i)\right) \quad \text{(exp bound on step)}
= \frac{1}{N}\sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) \quad \text{(definition of } f(x)\text{)}
= \prod_{t=1}^{T} Z_t \quad \text{(recursive use of } Z\text{)}
[Figure: loss vs. margin, comparing the 0-1 loss against the exponential, BrownBoost and LogitBoost surrogates; mistakes on one side, correct margin on the other]
From the previous slide: boosting is helping, why? We are bounding the error rate. Step is always less than exp (but equal at the origin).
Z_t was the normalizer at a particular time; this is the product of all the Z_t's ever.
Through the recursive definition, it is actually equal to the sum of exponentiated margin losses above.
This is a greedy, one-at-a-time decrease.
AdaBoost Convergence • Convergence? Consider the binary h_t case.
R_{emp} \le \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(-\alpha_t y_i h_t(x_i)\right)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(\ln\left(\frac{\varepsilon_t}{1-\varepsilon_t}\right)^{\frac{1}{2} y_i h_t(x_i)}\right)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \left(\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}\right)^{y_i h_t(x_i)}
= \prod_{t=1}^{T} \left(\sum_{\text{correct}} w_i^t \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} + \sum_{\text{incorrect}} w_i^t \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}\right)
= \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1-\varepsilon_t)} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right)
So R_{emp} \le \exp\left(-2\sum_t \gamma_t^2\right) \le \exp\left(-2 T \gamma^2\right): the final learner converges exponentially fast in T if each weak learner is at least gamma better than random!
Plug in the alpha update from the earlier slide.
Gamma means stronger than a coin flip: all weighted error rates are less than 1/2.
T increases as I add learners, T increases with iterations, and we get exponentially less and less error, as long as we guarantee that each weak learner is better than random (has an error rate gamma better than random).
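The final bound is easy to tabulate. A quick numeric sanity check (γ = 0.1 is an arbitrary choice for illustration):

```python
import math

def adaboost_error_bound(T, gamma):
    """Training-error bound exp(-2*T*gamma^2) when every weak learner
    has weighted error at most 1/2 - gamma."""
    return math.exp(-2 * T * gamma**2)

# gamma = 0.1: each weak learner is only 10% better than a coin flip.
bounds = [adaboost_error_bound(T, 0.1) for T in (10, 100, 250)]
# the bound decreases exponentially in T: roughly 0.82, 0.14, 0.007
```

Even with very weak learners, a few hundred rounds drive the bound on training error below 1%.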
Curious phenomenon: boosting decision trees
Using fewer than 10,000 training examples, we fit more than 2,000,000 parameters: a larger VC dimension than we would think we could afford, so we would expect overfitting. But we never overfit, because of the margin.
Explanation using margins
[Figure: 0-1 loss vs. margin]
We reach 0 training error, but we have also maximized the margin: we do additional iterations after getting 0 training error.
Explanation using margins
[Figure: 0-1 loss vs. margin, with a margin threshold θ]
No examples with small margins!!
Experimental Evidence
This plot is a CDF (of the margins over the training examples).
AdaBoost Generalization
• Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier)
• But then more iterations mean overfitting!
• A margin analysis is possible; redefine the margin as below. Then we have the second bound.
R \le R_{emp} + O\left(\sqrt{\frac{T d}{N}}\right)
\mathrm{mar}_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}
R \le \frac{1}{N}\sum_{i=1}^{N} \mathrm{step}\left(\theta - \mathrm{mar}_f(x_i, y_i)\right) + O\left(\sqrt{\frac{d}{N \theta^2}}\right)
This is a generalization guarantee; N is the number of examples and Td is my total VC dimension. The VC dimension isn't giving us reassurance that the true risk will be bounded.
Use a margin analysis instead, with a slightly redefined margin definition: the numerator is the original margin, now normalized, and then we have another guarantee.
It looks like an empirical risk, but not quite: theta is a margin threshold (if I don't beat some margin threshold, then I'm counted as making errors).
Boost forever, and you turn anything you want into a max-margin classifier (so you don't have to stick in an SVM).
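The redefined margin is straightforward to compute from an ensemble; the weights and votes below are made up for illustration:

```python
import numpy as np

def normalized_margin(y, alphas, preds):
    """mar_f(x, y) = y * sum_t alpha_t * h_t(x) / sum_t alpha_t, always in [-1, +1]."""
    alphas = np.asarray(alphas, dtype=float)
    preds = np.asarray(preds, dtype=float)  # h_t(x) in {-1, +1}
    return float(y * np.dot(alphas, preds) / alphas.sum())

# A committee voting 0.7 + 0.2 - 0.1 = 0.8 of its total weight for the true label:
m = normalized_margin(y=1, alphas=[0.7, 0.2, 0.1], preds=[1, 1, -1])
```

Normalizing by the sum of the α_t keeps the margin in [−1, 1], which is what makes the threshold θ in the bound meaningful.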
AdaBoost Generalization
[Figure: loss vs. margin with a margin threshold Θ]
• Suggests this optimization problem:
AdaBoost Generalization • Proof Sketch
It is a stability-type proof: you can wiggle the margin within the support and still properly classify. The proof is not through VC arguments. VC tells you to stop boosting as soon as possible, but the margin analysis says keep boosting, which is the thing you see in practice.
UCI Results (% test error rates)

Database    Other             Boosting   Error reduction
Cleveland   27.2 (DT)         16.5       39%
Promoters   22.0 (DT)         11.8       46%
Letter      13.8 (DT)         3.5        74%
Reuters 4   5.8, 6.0, 9.8     2.95       ~60%
Reuters 8   11.3, 12.1, 13.4  7.4        ~40%
Boosted cascade: it turns out that if you have a million classifiers, who wants to use a million? It's a pain; instead, use a few. Classify an image patch as face/not-face. Learn a stump that puts a bunch of strips on faces and adds white-strip pixels against black-strip pixels. That's a very weak classifier, so let's use a bunch of these. Problem: a 24x24 pixel region means roughly 160k stumps to train (perceptrons, basically). We can do it, and we see the error rate dropping; eventually you have a bunch of boosted stumps. Next, scan an image to locate the face. The problem will be scanning very many regions in the image, and each classification of a single region means consulting a very large committee of stumps (the problem with boosting is that you have a large committee that you get answers from). The smart thing is to ask the first stump, and if it says not-a-face, then you stop asking. Continue asking until the first rejection. We know how to use the committee; only query the first guys. What's nice is that it's much faster (even in 2003).
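The early-rejection idea can be sketched as a cascade that stops consulting the committee at the first "not a face" verdict; the stages and thresholds below are invented for illustration:

```python
def cascade_classify(x, stages):
    """Run stages in order; reject at the first stage that votes 'not a face'.
    Only patches that survive every stage are accepted, so most background
    patches exit after one or two cheap tests."""
    for stage in stages:
        if not stage(x):
            return False  # early exit: stop querying the rest of the committee
    return True

# Toy stages on a scalar 'patch score', with increasingly strict thresholds:
stages = [lambda x: x > 0.1, lambda x: x > 0.5, lambda x: x > 0.9]
results = [cascade_classify(v, stages) for v in (0.05, 0.6, 0.95)]
```

The average cost per patch stays near the cost of the first stage, since the overwhelming majority of patches are rejected early.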
Boosted Bayesian networks, boosted whatever: there's an industry of consistently seeing improvements in performance from boosting anything.