boosting
DESCRIPTION
Tony Jebara, Advanced Machine Learning. Boosting methods.
TRANSCRIPT
Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara
Boosting
• Combining Multiple Classifiers
• Voting
• Boosting
• AdaBoost
• Based on material by Y. Freund, P. Long & R. Schapire (course co-taught with them)
Freund and Schapire had the big breakthrough algorithm: the idea of taking many weak learners and combining them to get a strong learner.
Daniel Zhu's class is only on this.
Combining Multiple Learners
• Have many simple learners, also called base learners or weak learners, which have a classification error of < 0.5
• Combine or vote them to get a higher accuracy
• No free lunch: there is no guaranteed best approach here
• Different approaches:
  Voting: combine learners with fixed weights
  Mixture of Experts: adjust learners and a variable weight/gating function
  Boosting: actively search for the next base learners and vote
  Cascading, Stacking, Bagging, etc.
A better way to do it is an online learning framework.
A decision stump is a weak learner that looks at one feature only; maybe a little better than random.
A newer method is "regret minimization", related to multi-armed bandits (Daniel Zhu's course) and online learning: not just voting on a fixed bunch of experts, you have billions or trillions of experts, all with weight zero, and you pick one and add it.
Q: will you be able to do gradient descent like with kernel learning with switches?
Face finding: the Viola-Jones boosted cascade catches a face without having to scan the entire image.
Voting
• Have T classifiers
• Average their predictions with weights
• Like mixture of experts, but the weight is constant with respect to the input
f(x) = \sum_{t=1}^{T} \alpha_t h_t(x), \quad \text{where } \alpha_t \ge 0 \text{ and } \sum_{t=1}^{T} \alpha_t = 1
[Figure: input x feeds T experts h_t(x); their outputs are combined with weights \alpha_t into the prediction y]
Instead of voting (a linear combination of decisions), you could do gating (trust doctor 1 on children; doctor 2 is best on elders). The big thing is that you have to make assumptions to do well.
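The fixed-weight vote is easy to sketch in code. Everything below (the toy experts, the weights) is made up for illustration, not from the lecture:

```python
import numpy as np

def vote(x, learners, alphas):
    """Fixed-weight vote: f(x) = sum_t alpha_t * h_t(x)."""
    alphas = np.asarray(alphas, dtype=float)
    # Weights are nonnegative and sum to 1, as on the slide.
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    return sum(a * h(x) for a, h in zip(alphas, learners))

# Three toy "experts" that each output +1 or -1:
h1 = lambda x: 1 if x[0] > 0 else -1
h2 = lambda x: 1 if x[1] > 0 else -1
h3 = lambda x: 1 if x[0] + x[1] > 0 else -1

f = vote((2.0, -1.0), [h1, h2, h3], [0.5, 0.25, 0.25])
label = np.sign(f)  # final prediction is the sign of the weighted vote
```

Here the vote is 0.5·(+1) + 0.25·(−1) + 0.25·(+1) = 0.5, so the committee predicts +1.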
Mixture of Experts
• Have T classifiers or experts and a gating function
• Average their predictions with variable weights
• But adapt the parameters of the gating function and the experts (fixed total number T of experts)
f(x) = \sum_{t=1}^{T} \alpha_t(x) h_t(x), \quad \text{where } \alpha_t(x) \ge 0 \text{ and } \sum_{t=1}^{T} \alpha_t(x) = 1
[Figure: the input feeds both the experts h_t(x) and a gating network that produces the weights \alpha_t(x); their weighted combination gives the output]
The slightly smarter thing to do: change alpha (the confidence in each classifier), changing the allocation of alpha depending on the input (child vs. elder). Pretty much the same as a neural network (which is why the figures are the same); the same formula as before, but now alpha depends on x.
Boosting
• Actively find complementary or synergistic weak learners
• Train the next learner based on the mistakes of previous ones
• Average predictions with fixed weights
• Find the next learner by training on weighted versions of the data
training data: \{(x_1, y_1), \ldots, (x_N, y_N)\}
weights: \{w_1, \ldots, w_N\}
weak learner: h_t(x)
weak rule: \frac{\sum_{i=1}^{N} w_i \,\mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma
ensemble of learners: \{(\alpha_1, h_1), \ldots, (\alpha_T, h_T)\}
prediction: \sum_{t=1}^{T} \alpha_t h_t(x)
We are searching over a massive pool of learners. There is no gating function here; find someone who's good at what you're not good at. Find the next learner by looking at which examples we messed up on, and pick and train an expert for the previously mistaken labels. The idea is that eventually this will converge. We greedily add the next best expert (there is some nice theory that you'll find the equivalent of the best possible large-margin solution), which lets you throw in anything as a weak learner.
We have a lot of data. All examples initially have equal weights. After one round, we weight down the correct answers and increase the weights on the incorrect answers.
The weak rule measures how well the weak learner performs, with a non-negative gamma. It is an average loss, an empirical risk: a correct answer is the step function of a negative number, which maps to zero; otherwise you pay 1 dollar. An error of 1/2 means just random guessing, so 1/2 is not even a learner; the rule demands a little better than random.
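The weak-rule condition (weighted error below 1/2 − γ) can be checked directly; the data, the stump, and γ below are invented for illustration:

```python
import numpy as np

def weighted_error(h, X, y, w):
    """Weighted empirical error: sum_i w_i * 1[h(x_i) != y_i] / sum_i w_i."""
    mistakes = np.array([h(x) != yi for x, yi in zip(X, y)], dtype=float)
    return float(np.dot(w, mistakes) / w.sum())

X = np.array([[-2.0], [-1.0], [0.5], [1.0], [2.0]])
y = np.array([-1, -1, -1, 1, 1])
w = np.ones(len(X)) / len(X)             # round one: all examples weighted equally

stump = lambda x: 1 if x[0] > 0 else -1  # weak learner: one threshold on one feature
eps = weighted_error(stump, X, y, w)     # 1/5: the stump only errs on x = 0.5
gamma = 0.1
qualifies = eps < 0.5 - gamma            # better than random by at least gamma
```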
Q: at the end, do we have similar results to SVMs, where a few support vectors have large Lagrange multipliers and the rest do not?
AdaBoost
• Most popular weighting scheme
• Define the margin for point i as below
• Find an h_t and a weight α_t to minimize the cost function
sum of exp-margins
(Same setup as the previous slide: training data \{(x_1,y_1),\ldots,(x_N,y_N)\}, weights \{w_1,\ldots,w_N\}, weak learner h_t(x), ensemble \{(\alpha_1,h_1),\ldots,(\alpha_T,h_T)\}, prediction \sum_{t=1}^{T}\alpha_t h_t(x), weak rule as before.)
margin for point i: \; y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)
cost function: \; \sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)
There are also LogitBoost and BrownBoost. AdaBoost focuses on the margin. The margin is a quantity of risk: how much the committee is voting for the answer. We want to minimize the error, using a classifier h that minimizes error. Instead of working on ugly step functions, let's work with an upper bound on the step function and minimize that. That is what AdaBoost adds (the smoothing). You can imagine a hinge loss like the SVM's; other boosting methods introduce other ways to smooth the step function. Another benefit here: each exponential term is convex, so the sum of exponentials is also convex.
This is the margin; we want h_t(x_i) to have the right sign and be large.
AdaBoost
• Choose the base learner & α_t to minimize the cost below
• Recall that the error of the base classifier h_t must be less than 1/2 − γ
• For binary h, AdaBoost puts the following weight on weak learners (instead of the more general rule)
• AdaBoost picks the following weights on the data for the next round (here Z_t is the normalizer so the weights sum to 1)
\min_{\alpha_t, h_t} \sum_{i=1}^{N} \exp\left(-y_i \alpha_t h_t(x_i)\right)
\varepsilon_t = \frac{\sum_{i=1}^{N} w_i \,\mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma
\alpha_t = \frac{1}{2} \ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)
w_i^{t+1} = \frac{w_i^t \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}
Q: are you still getting a linear classifier (like a 1-layer neural network), at least for each dimension that you work with?
We are minimizing the exponentiated margins over all classifiers.
Is there a missing w_i^t in the cost here?
Q: is each weak learner always just a binary classifier?
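Combining the three rules on this slide (the weighted error ε_t, the learner weight α_t = ½ ln((1−ε_t)/ε_t), and the normalized reweighting) gives the standard AdaBoost loop. A compact sketch with exhaustive threshold stumps; the data and helper names are illustrative, not the lecture's code:

```python
import numpy as np

def fit_stump(X, y, w):
    """Search all axis-aligned threshold stumps; return the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = float(np.dot(w, pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=10):
    N = len(y)
    w = np.ones(N) / N                             # uniform initial weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = fit_stump(X, y, w)
        eps = max(err, 1e-10)                      # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)      # weight on this weak learner
        pred = np.where(X[:, j] > thr, sign, -sign)
        w = w * np.exp(-alpha * y * pred)          # upweight mistakes, downweight correct
        w = w / w.sum()                            # Z_t: renormalize to sum to 1
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    f = sum(a * np.where(X[:, j] > thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(f)

# A 1D "interval" target no single stump can get right, but a few boosted stumps can:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, 1, -1])
model = adaboost(X, y, T=5)
train_err = float(np.mean(predict(model, X) != y))  # reaches 0 after a few rounds
```

Each round the weights of misclassified points grow by exp(α_t) and those of correct points shrink, so the next stump is forced to fix earlier mistakes.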
Decision Trees
[Figure: a tree with root split X>3; if no, predict -1; if yes, split Y>5: predict +1 if yes, -1 if no. Equivalently, axis-aligned cuts at X=3 and Y=5 partition the (X, Y) plane into +1 and -1 regions.]
This weak learner is called a stump: a classifier that cuts along one feature only. The simplest possible classifier: axis-aligned cuts.
Q: Does it matter what order the stumps are in? I think so... so you don't have a deterministic way to create a decision tree?
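Read off the figure above (assuming the splits X>3 and Y>5 shown there), the tree is just two nested stumps:

```python
def tree_predict(x, y):
    """Decision tree from the figure: root stump X>3, then a stump Y>5 on the 'yes' branch."""
    if x > 3:
        return 1 if y > 5 else -1  # region x>3: +1 above y=5, -1 below
    return -1                      # region x<=3: always -1
```

Each internal node is itself a decision stump; boosting will instead add such stumps one at a time and sum their votes rather than nesting them.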
Decision tree as a sum
[Figure: the same tree rewritten additively: a root constant -0.2, plus contributions +0.2/-0.3 from the Y>5 split and -0.1/+0.1 from the X>3 split; the final ±1 prediction is the sign of the sum.]
An alternating decision tree
[Figure: the additive tree extended with an extra split Y<1 contributing 0.0/+0.7 alongside the root constant -0.2, the Y>5 split (+0.2/-0.3) and the X>3 split (-0.1/+0.1); the prediction is still the sign of the summed contributions, so new splits can attach anywhere rather than only nesting.]
Find the best stump and then add it: that is boosting.
Example: Medical Diagnostics
• Cleve dataset from UC Irvine database.
• Heart disease diagnostics (+1=healthy,-1=sick)
• 13 features from tests (real valued and discrete).
• 303 instances.
Ad-Tree Example
Q: would it be crazy to build a GM off of this tree from boosting?
Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%
Don't worry about a hierarchy of applying stumps; just apply all stumps to all datapoints.
AdaBoost Convergence
• Rationale? Consider a bound on the training error (below)
• AdaBoost is essentially doing gradient descent on this bound. Convergence?
R_{emp} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{step}\left(-y_i f(x_i)\right) \quad \text{(0-1 loss)}
\le \frac{1}{N}\sum_{i=1}^{N} \exp\left(-y_i f(x_i)\right) \quad \text{(exp bound on step)}
= \frac{1}{N}\sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) \quad \text{(definition of } f(x)\text{)}
= \prod_{t=1}^{T} Z_t \quad \text{(recursive use of } Z\text{)}
[Figure: loss vs. margin, comparing the 0-1 loss against the exponential, BrownBoost and LogitBoost surrogates; mistakes on one side, correct margin on the other]
From the previous slide: boosting is helping, why? We are bounding the error rate. Step is always less than exp (but equal at the origin).
Z_t was the normalizer at a particular time; this is the product of all the Z_t's ever.
Through the recursive definition, it is actually equal to the sum of exponentiated margin losses above.
This is a greedy, one-at-a-time decrease.
AdaBoost Convergence • Convergence? Consider the binary h_t case.
R_{emp} \le \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(-\alpha_t y_i h_t(x_i)\right)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(\ln\left(\frac{\varepsilon_t}{1-\varepsilon_t}\right)^{\frac{1}{2} y_i h_t(x_i)}\right)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \left(\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}\right)^{y_i h_t(x_i)}
= \prod_{t=1}^{T} \left(\sum_{\text{correct}} w_i^t \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} + \sum_{\text{incorrect}} w_i^t \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}\right)
= \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1-\varepsilon_t)} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right)
So R_{emp} \le \exp\left(-2\sum_t \gamma_t^2\right) \le \exp\left(-2 T \gamma^2\right): the final learner converges exponentially fast in T if each weak learner is at least gamma better than random!
Plug in the alpha update from the earlier slide.
Gamma means stronger than a coin flip: all weighted error rates are less than 1/2.
T increases as I add learners, T increases with iterations, and we get exponentially less and less error, as long as we guarantee that each weak learner is better than random (has an error rate gamma better than random).
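The final bound is easy to tabulate. A quick numeric sanity check (γ = 0.1 is an arbitrary choice for illustration):

```python
import math

def adaboost_error_bound(T, gamma):
    """Training-error bound exp(-2*T*gamma^2) when every weak learner
    has weighted error at most 1/2 - gamma."""
    return math.exp(-2 * T * gamma**2)

# gamma = 0.1: each weak learner is only 10% better than a coin flip.
bounds = [adaboost_error_bound(T, 0.1) for T in (10, 100, 250)]
# the bound decreases exponentially in T: roughly 0.82, 0.14, 0.007
```

Even with very weak learners, a few hundred rounds drive the bound on training error below 1%.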
Curious phenomenon: boosting decision trees
Using fewer than 10,000 training examples, we fit more than 2,000,000 parameters: a larger VC dimension than we would think we could afford, so we would expect overfitting. But we never overfit, because of the margin.
Explanation using margins
[Figure: 0-1 loss vs. margin]
We reach 0 training error, but we have also maximized the margin: we do additional iterations after getting 0 training error.
Explanation using margins
[Figure: 0-1 loss vs. margin, with a margin threshold θ]
No examples with small margins!!
Experimental Evidence
This plot is a CDF (of the margins over the training examples).
AdaBoost Generalization
• Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier)
• But then more iterations mean overfitting!
• A margin analysis is possible; redefine the margin as below. Then we have the second bound.
R \le R_{emp} + O\left(\sqrt{\frac{T d}{N}}\right)
\mathrm{mar}_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}
R \le \frac{1}{N}\sum_{i=1}^{N} \mathrm{step}\left(\theta - \mathrm{mar}_f(x_i, y_i)\right) + O\left(\sqrt{\frac{d}{N \theta^2}}\right)
This is a generalization guarantee; N is the number of examples and Td is my total VC dimension. The VC dimension isn't giving us reassurance that the true risk will be bounded.
Use a margin analysis instead, with a slightly redefined margin definition: the numerator is the original margin, now normalized, and then we have another guarantee.
It looks like an empirical risk, but not quite: theta is a margin threshold (if I don't beat some margin threshold, then I'm counted as making errors).
Boost forever, and you turn anything you want into a max-margin classifier (so you don't have to stick in an SVM).
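The redefined margin is straightforward to compute from an ensemble; the weights and votes below are made up for illustration:

```python
import numpy as np

def normalized_margin(y, alphas, preds):
    """mar_f(x, y) = y * sum_t alpha_t * h_t(x) / sum_t alpha_t, always in [-1, +1]."""
    alphas = np.asarray(alphas, dtype=float)
    preds = np.asarray(preds, dtype=float)  # h_t(x) in {-1, +1}
    return float(y * np.dot(alphas, preds) / alphas.sum())

# A committee voting 0.7 + 0.2 - 0.1 = 0.8 of its total weight for the true label:
m = normalized_margin(y=1, alphas=[0.7, 0.2, 0.1], preds=[1, 1, -1])
```

Normalizing by the sum of the α_t keeps the margin in [−1, 1], which is what makes the threshold θ in the bound meaningful.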
AdaBoost Generalization
[Figure: loss vs. margin with a margin threshold Θ]
• Suggests this optimization problem:
AdaBoost Generalization • Proof Sketch
It is a stability-type proof: you can wiggle the margin within the support and still properly classify. The proof is not through VC arguments. VC tells you to stop boosting as soon as possible, but the margin analysis says keep boosting, which is the thing you see in practice.
UCI Results (% test error rates)

Database    Other             Boosting   Error reduction
Cleveland   27.2 (DT)         16.5       39%
Promoters   22.0 (DT)         11.8       46%
Letter      13.8 (DT)         3.5        74%
Reuters 4   5.8, 6.0, 9.8     2.95       ~60%
Reuters 8   11.3, 12.1, 13.4  7.4        ~40%
Boosted cascade: it turns out that if you have a million classifiers, who wants to use a million? It's a pain; instead, use a few. Classify an image patch as face/not-face. Learn a stump that puts a bunch of strips on faces and adds white-strip pixels against black-strip pixels. That's a very weak classifier, so let's use a bunch of these. Problem: a 24x24 pixel region means roughly 160k stumps to train (perceptrons, basically). We can do it, and we see the error rate dropping; eventually you have a bunch of boosted stumps. Next, scan an image to locate the face. The problem will be scanning very many regions in the image, and each classification of a single region means consulting a very large committee of stumps (the problem with boosting is that you have a large committee that you get answers from). The smart thing is to ask the first stump, and if it says not-a-face, then you stop asking. Continue asking until the first rejection. We know how to use the committee; only query the first guys. What's nice is that it's much faster (even in 2003).
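The early-rejection idea can be sketched as a cascade that stops consulting the committee at the first "not a face" verdict; the stages and thresholds below are invented for illustration:

```python
def cascade_classify(x, stages):
    """Run stages in order; reject at the first stage that votes 'not a face'.
    Only patches that survive every stage are accepted, so most background
    patches exit after one or two cheap tests."""
    for stage in stages:
        if not stage(x):
            return False  # early exit: stop querying the rest of the committee
    return True

# Toy stages on a scalar 'patch score', with increasingly strict thresholds:
stages = [lambda x: x > 0.1, lambda x: x > 0.5, lambda x: x > 0.9]
results = [cascade_classify(v, stages) for v in (0.05, 0.6, 0.95)]
```

The average cost per patch stays near the cost of the first stage, since the overwhelming majority of patches are rejected early.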
Boosted Bayesian networks, boosted whatever: there's an industry of consistently seeing improvements in performance from boosting anything.