Boosting

Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara

DESCRIPTION

Tony Jebara Advanced Machine Learning. Boosting methods

TRANSCRIPT

Page 1: Boosting

Tony Jebara, Columbia University

Advanced Machine Learning & Perception

Instructor: Tony Jebara

Page 2: Boosting

Tony Jebara, Columbia University

Boosting

• Combining Multiple Classifiers

• Voting

• Boosting

• Adaboost

• Based on material by Y. Freund, P. Long & R. Schapire (co-taught with Jebara)

Freund and Schapire had the big breakthrough algorithm: the idea of taking many weak learners and combining them to get a strong learner. Daniel Zhu's class is only on this topic.

Page 3: Boosting

Tony Jebara, Columbia University

Combining Multiple Learners
• Have many simple learners
• Also called base learners or weak learners, which have a classification error of < 0.5
• Combine or vote them to get a higher accuracy
• No free lunch: there is no guaranteed best approach here
• Different approaches:

Voting: combine learners with fixed weights

Mixture of Experts: adjust the learners and a variable weight/gate function

Boosting: actively search for the next base learners and vote

Cascading, Stacking, Bagging, etc.

A better way to do this is in an online learning framework.

A decision stump is a weak learner that looks at one feature only and is maybe a little better than random.

A newer method is "regret minimization," related to multi-armed bandits (Daniel Zhu's course) and online learning.

You are not just voting over a fixed bunch of experts: you have billions or trillions of experts, all with weight zero, and you pick one and add it. Q: will you be able to do gradient descent, like with kernel learning with switches?

Face finding: the boosted cascade of Viola-Jones catches a face without having to scan the entire image.

Page 4: Boosting

Tony Jebara, Columbia University

Voting
• Have T classifiers
• Average their predictions with weights
• Like mixture of experts, but the weight is constant with respect to the input

f(x) = \sum_{t=1}^{T} \alpha_t\, h_t(x) \quad \text{where } \alpha_t \ge 0 \text{ and } \sum_{t=1}^{T} \alpha_t = 1

[Figure: the input x feeds T experts h_t(x); their outputs are combined with weights α_t to produce the prediction y.]

Instead of voting (a linear combination of decisions), you could do gating (trust doctor 1 on children; doctor 2 is best on elders). The big point is that you have to make assumptions to do well.
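As a concrete illustration of the fixed-weight vote above, here is a minimal Python sketch; the three toy experts and their weights are hypothetical placeholders, not anything from the slides.

```python
import numpy as np

def vote(classifiers, alphas, x):
    """Fixed-weight vote: f(x) = sum_t alpha_t * h_t(x); predict sign(f(x))."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    f = sum(a * h(x) for a, h in zip(alphas, classifiers))
    return np.sign(f)

# Three toy "experts" returning +/-1 (hypothetical):
experts = [lambda x: 1.0 if x[0] > 0 else -1.0,
           lambda x: 1.0 if x[1] > 0 else -1.0,
           lambda x: -1.0]
print(vote(experts, [0.6, 0.2, 0.2], np.array([1.0, -2.0])))   # -> 1.0
```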

Page 5: Boosting

Tony Jebara, Columbia University

Mixture of Experts
• Have T classifiers or experts and a gating fn
• Average their predictions with variable weights
• But, adapt the parameters of the gating function and the experts (fixed total number T of experts)

f(x) = \sum_{t=1}^{T} \alpha_t(x)\, h_t(x) \quad \text{where } \alpha_t(x) \ge 0 \text{ and } \sum_{t=1}^{T} \alpha_t(x) = 1

[Figure: the input feeds both a gating network, which produces the weights α_t(x), and the experts h_t(x); the gated combination gives the output.]

A slightly smarter thing to do: change α (the confidence in each classifier), i.e., change the allocation of α depending on the input (child vs. elder). This is pretty much the same as a neural network (which is why the figures look the same); it is the same formula as before, but now α depends on x.
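A minimal sketch of the gating idea, assuming a softmax gating function over a hypothetical linear gate parameter matrix (the slide does not specify the gate's form):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(experts, gate_W, x):
    """f(x) = sum_t alpha_t(x) h_t(x), with alpha(x) = softmax(gate_W @ x):
    the weights are nonnegative, sum to 1, and now depend on the input x."""
    alphas = softmax(gate_W @ x)         # one gate value per expert
    return sum(a * h(x) for a, h in zip(alphas, experts))

# Two toy experts and a hypothetical 2x2 gating matrix:
experts = [lambda x: 1.0, lambda x: -1.0]
gate_W = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mixture_of_experts(experts, gate_W, np.array([2.0, -1.0])))  # ~0.90
```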

Page 6: Boosting

Tony Jebara, Columbia University

Boosting
• Actively find complementary or synergistic weak learners
• Train the next learner based on the mistakes of the previous ones
• Average predictions with fixed weights
• Find the next learner by training on weighted versions of the data

Training data: \{(x_1, y_1), \ldots, (x_N, y_N)\} with weights \{w_1, \ldots, w_N\}

Weak learner / weak rule: h_t(x), whose weighted error must satisfy

\frac{\sum_{i=1}^{N} w_i\, \mathrm{step}\!\left(-h_t(x_i)\, y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma

Ensemble of learners: \{(\alpha_1, h_1), \ldots, (\alpha_T, h_T)\}, with prediction \sum_{t=1}^{T} \alpha_t h_t(x)

We are searching over a massive pool of learners. There is no gating function here; you find someone who is good at what you're not good at. Find the next learner by looking at which examples we messed up on, and pick and train an expert for the previously mistaken labels. The idea is that eventually this converges. You are greedily adding the next best expert (there is some nice theory that you will find the equivalent of the best possible large-margin solution, which lets you throw in anything as a weak learner).

We have a lot of data. All examples initially have equal weights. After one round, we weight down the correct answers and increase the weights on the incorrect answers.

Gamma measures how well the weak learner performs: its weighted error must be at least γ (non-negative) below 1/2. The left-hand side is an average loss, an empirical risk: when the prediction is correct, the step function sees a negative number and maps it to zero; otherwise you pay 1. Error 1/2 - γ is a little bit better than random guessing; error of exactly 1/2 is not even a learner.

Q: At the end, do we get results similar to SVMs, where a few support vectors have large Lagrange multipliers and the rest do not?
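A minimal sketch of the generic boosting loop described on this slide, assuming a hypothetical get_weak_learner(X, y, w) routine that returns some h with weighted error below 1/2; the reweighting factors here are crude placeholders, since AdaBoost's principled update only appears on the next slides.

```python
import numpy as np

def boost(X, y, get_weak_learner, T):
    """Generic boosting skeleton: each round, train a weak learner on the
    reweighted data, then upweight the examples it got wrong and downweight
    the correct ones. The x2 / x0.5 reweighting is a placeholder; AdaBoost
    replaces it with a principled alpha_t and weight update."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # all examples start equal
    ensemble = []
    for t in range(T):
        h = get_weak_learner(X, y, w)            # hypothetical weak-learner routine
        preds = np.array([h(x) for x in X])
        err = np.sum(w * (preds != y)) / np.sum(w)
        assert err < 0.5, "weak learner must beat random guessing"
        ensemble.append((1.0, h))                # fixed vote weight for now
        w = np.where(preds != y, w * 2.0, w * 0.5)   # up on mistakes, down on correct
        w /= w.sum()
    return ensemble
```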

Page 7: Boosting

Tony Jebara, Columbia University

AdaBoost
• Most popular weighting scheme
• Define the margin for point i as

\sum_{t=1}^{T} y_i\, \alpha_t\, h_t(x_i)

• Find an h_t and a weight α_t to minimize the cost function (the sum of exp-margins):

\sum_{i=1}^{N} \exp\!\left(-\sum_{t=1}^{T} y_i\, \alpha_t\, h_t(x_i)\right)

Setup as before: training data \{(x_1, y_1), \ldots, (x_N, y_N)\} with weights \{w_1, \ldots, w_N\}; a weak learner / weak rule h_t(x) whose weighted error satisfies

\frac{\sum_{i=1}^{N} w_i\, \mathrm{step}\!\left(-h_t(x_i)\, y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma

and an ensemble of learners \{(\alpha_1, h_1), \ldots, (\alpha_T, h_T)\} with prediction \sum_{t=1}^{T} \alpha_t h_t(x).

There are other variants (LogitBoost, Bernstein boosting); AdaBoost focuses on the margin. The margin is a quantity of risk: how much the committee is voting for the right answer. We want to minimize the error, i.e., use a classifier h that minimizes the error. Instead of working with ugly step functions, we work with an upper bound on the step function and minimize that; this smoothing is what AdaBoost adds. You can imagine a hinge loss as in an SVM; other boosting methods introduce other ways to smooth the step function. Another benefit: each exponential term is convex, so the sum of exponentials is also convex.

This is the margin: we want h_t(x_i) to have the right sign and be large.

Page 8: Boosting

Tony Jebara, Columbia University

AdaBoost
• Choose the base learner and α_t to

\min_{\alpha_t, h_t} \sum_{i=1}^{N} \exp\!\left(-y_i\, \alpha_t\, h_t(x_i)\right)

• Recall that the error of the base classifier h_t must satisfy

\varepsilon_t = \frac{\sum_{i=1}^{N} w_i\, \mathrm{step}\!\left(-h_t(x_i)\, y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2} - \gamma

• For binary h, AdaBoost puts this weight on weak learners (instead of the more general rule):

\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)

• AdaBoost picks the following weights on the data for the next round (here Z_t is the normalizer so they sum to 1):

w_i^{t+1} = \frac{w_i^t\, \exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t}

Q: Are you still getting a linear classifier (like a one-layer neural network), at least for each dimension you work with?

We are minimizing the exponentiated margins over all classifiers.

Is there a w missing here (in the minimization)?

Q: Is each weak learner always just a binary classifier?
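Putting the three formulas on this slide together, here is a minimal sketch of one AdaBoost round in Python; the pool of candidate binary classifiers (candidate_learners) is a hypothetical stand-in for whatever weak learners you use.

```python
import numpy as np

def adaboost_round(X, y, w, candidate_learners):
    """One AdaBoost round: pick the candidate h_t with the lowest weighted
    error, set alpha_t = 0.5 * ln((1 - eps_t) / eps_t), and compute the data
    weights for the next round (Z_t is the normalizer). Assumes y in {-1, +1}."""
    best_h, best_eps, best_preds = None, None, None
    for h in candidate_learners:
        preds = np.array([h(x) for x in X])              # predictions in {-1, +1}
        eps = np.sum(w * (preds != y)) / np.sum(w)       # weighted error eps_t
        if best_eps is None or eps < best_eps:
            best_h, best_eps, best_preds = h, eps, preds
    alpha = 0.5 * np.log((1.0 - best_eps) / best_eps)    # weight on the weak learner
    w_new = w * np.exp(-alpha * y * best_preds)          # w_i * exp(-alpha_t y_i h_t(x_i))
    w_new /= w_new.sum()                                 # divide by Z_t so weights sum to 1
    return best_h, alpha, w_new
```

Calling this in a loop, carrying w forward and collecting the (α_t, h_t) pairs, gives the ensemble and prediction rule from the earlier slides.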

Page 9: Boosting

Tony Jebara, Columbia University

Decision Trees

[Figure: a small decision tree, root split X > 3 and then Y > 5, with leaves +1 and -1, shown next to the corresponding axis-aligned partition of the X-Y plane with cuts at X = 3 and Y = 5.]

A weak learner called a stump is a classifier that cuts along one feature only: the simplest possible classifier, an axis-aligned cut. Q: Does it matter what order the stumps are applied in? I think so... so you don't have a deterministic way to create a decision tree?
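A minimal decision-stump trainer of the kind described here: a brute-force search over axis-aligned cuts on weighted data. This is a sketch of the idea, not the lecture's implementation.

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustively search axis-aligned cuts: for each feature d, threshold c,
    and polarity s, predict s if x[d] > c else -s; keep the cut with the lowest
    weighted error. Assumes X is (N, D) and y is in {-1, +1}."""
    N, D = X.shape
    best = (0, 0.0, 1, np.inf)                    # (feature, threshold, sign, error)
    for d in range(D):
        for c in np.unique(X[:, d]):
            for s in (+1, -1):
                preds = np.where(X[:, d] > c, s, -s)
                err = np.sum(w * (preds != y))
                if err < best[3]:
                    best = (d, c, s, err)
    d, c, s, _ = best
    return lambda x: s if x[d] > c else -s        # the stump as a callable h(x)
```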

Page 10: Boosting

Tony Jebara, Columbia University

Decision tree as a sum

[Figure: the same X-Y partition written as a sum of contributions: a root constant of -0.2 for every point, a Y > 5 stump contributing +0.2 or -0.3, and an X > 3 stump contributing -0.1 or +0.1; the prediction is the sign of the total.]

Note: the root node does no cutting; it just gives -0.2 to every data point. Each later stump adds to the previous -0.2 weight rather than replacing it.
Page 11: Boosting

Tony Jebara, Columbia University

An alternating decision tree

[Figure: the previous sum extended with one more stump: root constant -0.2, Y > 5 contributing +0.2 or -0.3, X > 3 contributing -0.1 or +0.1, and Y < 1 contributing 0.0 or +0.7; the prediction is the sign of the sum, and the X-Y plane now has regions labeled +1, -1, -1, +1.]

Find the best stump and then add it; that is where boosting comes in.
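Reading the figure as a flat sum (ignoring the nesting drawn in the tree, and assuming the left number on each branch is the value when the condition holds), the prediction rule can be written directly; the constants below are copied from the figure.

```python
import numpy as np

def adtree_score(x, y):
    """Flat-sum reading of the figure: a root constant plus one contribution
    per stump; the final label is the sign of the total."""
    score = -0.2                              # root: the same bias for every point
    score += +0.2 if y > 5 else -0.3          # Y > 5 stump
    score += -0.1 if x > 3 else +0.1          # X > 3 stump
    score += 0.0 if y < 1 else +0.7           # Y < 1 stump
    return np.sign(score)

print(adtree_score(x=4.0, y=6.0))             # -0.2 + 0.2 - 0.1 + 0.7 = +0.6 -> +1
```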

Page 12: Boosting

Tony Jebara, Columbia University

Example: Medical Diagnostics

• Cleve dataset from UC Irvine database.

• Heart disease diagnostics (+1=healthy,-1=sick)

• 13 features from tests (real valued and discrete).

• 303 instances.

Page 13: Boosting

Tony Jebara, Columbia University

Ad-Tree Example

Q: would it be crazy to build a GM off of this tree from boosting?

Page 14: Boosting

Tony Jebara, Columbia University

Cross-validated accuracy

Learning algorithm | Number of splits | Average test error | Test error variance
ADtree             | 6                | 17.0%              | 0.6%
C5.0               | 27               | 27.2%              | 0.5%
C5.0 + boosting    | 446              | 20.2%              | 0.5%
Boost Stumps       | 16               | 16.5%              | 0.8%

Don't worry about the hierarchy of applying stumps; just apply all stumps to all data points.

Page 15: Boosting

Tony Jebara, Columbia University

AdaBoost Convergence
• Rationale? Consider a bound on the training error:
• AdaBoost is essentially doing gradient descent on this.
• Convergence?

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\!\left(-y_i f(x_i)\right)
\le \frac{1}{N} \sum_{i=1}^{N} \exp\!\left(-y_i f(x_i)\right)
= \frac{1}{N} \sum_{i=1}^{N} \exp\!\left(-\sum_{t=1}^{T} \alpha_t\, y_i\, h_t(x_i)\right)
= \prod_{t=1}^{T} Z_t

[Figure: loss as a function of the margin, comparing the 0-1 loss with its smooth exponential bound and with the losses used by BrownBoost and LogitBoost; mistakes lie at negative margin, correct classifications at positive margin.]

The steps above use, in order: the exponential bound on the step function, the definition of f(x), and the recursive use of Z.

From the previous slide: why is boosting helping? We are bounding the error rate; the step function is always below the exponential (they are equal at the origin). Z_t was the normalizer at a particular round, and this is the product of all the Z_t's ever. Through the recursive definition, it is actually equal to the exponentiated-margin loss in the sum above. Boosting greedily decreases this bound one term at a time. Note that the exponential is a penalty I am paying even when I classify correctly.
Page 16: Boosting

Tony Jebara, Columbia University

AdaBoost Convergence • Convergence? Consider the binary ht case.

R_{emp} \le \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)

= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\!\left(\ln\!\left(\frac{\varepsilon_t}{1-\varepsilon_t}\right)^{\frac{1}{2} y_i h_t(x_i)}\right)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \left(\frac{\varepsilon_t}{1-\varepsilon_t}\right)^{\frac{1}{2} y_i h_t(x_i)}

= \prod_{t=1}^{T} \left( \sum_{\text{correct}} w_i^t \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} + \sum_{\text{incorrect}} w_i^t \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}} \right)

= \prod_{t=1}^{T} 2\sqrt{\varepsilon_t \left(1-\varepsilon_t\right)} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le \exp\!\left(-2 \sum_t \gamma_t^2\right)

R_{emp} \le \exp\!\left(-2 \sum_t \gamma_t^2\right) \le \exp\!\left(-2 T \gamma^2\right)

So, the final learner converges exponentially fast in T if each weak learner is at least better than gamma!

Plug in the α update from slide 8. Here γ is how much stronger than a coin flip each learner is; all the weighted errors are less than 1/2. As T increases with iterations (as I add learners), we get exponentially less and less error, as long as we guarantee that each weak learner is better than random (has an error rate γ better than random).
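A quick numeric sanity check of the last chain of inequalities: with ε_t = 1/2 - γ_t, the per-round factor 2√(ε_t(1-ε_t)) equals √(1-4γ_t²), which is always at most exp(-2γ_t²). The γ values below are hypothetical.

```python
import numpy as np

gammas = np.array([0.05, 0.1, 0.2, 0.3])       # hypothetical per-round edges gamma_t
eps = 0.5 - gammas                             # corresponding weak-learner errors
per_round = 2 * np.sqrt(eps * (1 - eps))       # factor the bound picks up each round
print(per_round)                               # equals sqrt(1 - 4*gamma_t^2) ...
print(np.sqrt(1 - 4 * gammas**2))              # ... as this line confirms
print(np.exp(-2 * gammas**2))                  # elementwise upper bound on the factor
print(np.prod(per_round) <= np.exp(-2 * np.sum(gammas**2)))   # True
```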

Page 17: Boosting

Tony Jebara, Columbia University

Curious phenomenon: boosting decision trees.

Using fewer than 10,000 training examples we fit more than 2,000,000 parameters. With such a large VC dimension we would expect overfitting, but we never overfit, because of the margin.

Page 18: Boosting

Tony Jebara, Columbia University

Explanation using margins

[Figure: 0-1 loss as a function of the margin, with the distribution of training margins overlaid.]

Zero training error, but we keep maximizing the margin: we do additional iterations after reaching zero training error.

Page 19: Boosting

Tony Jebara, Columbia University

Explanation using margins

[Figure: 0-1 loss versus margin again; after further rounds there are no examples with small margins, i.e., every training margin exceeds a threshold θ.]

Page 20: Boosting

Tony Jebara, Columbia University

Experimental Evidence

this is a CDF

Page 21: Boosting

Tony Jebara, Columbia University

AdaBoost Generalization
• Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier):

R \le R_{emp} + O\!\left(\sqrt{\frac{T d}{N}}\right)

• But then more iterations would mean overfitting!
• A margin analysis is possible; redefine the margin as:

\mathrm{mar}_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}

Then we have

R \le \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\!\left(\theta - \mathrm{mar}_f(x_i, y_i)\right) + O\!\left(\sqrt{\frac{d}{N \theta^2}}\right)

A generalization guarantee: N is the number of examples, and Td acts as the total VC dimension. The VC analysis isn't giving us reassurance that the true risk will stay bounded as T grows.

Use a margin analysis instead, with a slightly redefined margin: the numerator is the original margin, now normalized, and then we get another guarantee.

The first term looks like an empirical risk, but not quite: θ is a margin threshold (if I don't beat the margin threshold, I'm charged an error).

Boost forever and you turn anything you want into a max-margin classifier (so don't bother sticking an SVM in as the weak learner).
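A minimal sketch of the redefined (normalized) margin and the θ-thresholded empirical term in the bound, reusing the (α_t, h_t) ensemble representation from the earlier sketches; whether the threshold comparison is strict is a detail not fixed by the slide.

```python
import numpy as np

def normalized_margin(ensemble, x, y):
    """mar_f(x, y) = y * sum_t alpha_t h_t(x) / sum_t alpha_t, which lies in
    [-1, 1] when each h_t outputs +/-1. `ensemble` is a list of (alpha_t, h_t)."""
    num = sum(a * h(x) for a, h in ensemble)
    den = sum(a for a, _ in ensemble)
    return y * num / den

def margin_risk(ensemble, X, Y, theta):
    """Empirical term of the margin bound: the fraction of training points whose
    normalized margin does not beat the threshold theta."""
    margins = np.array([normalized_margin(ensemble, x, y) for x, y in zip(X, Y)])
    return np.mean(margins <= theta)
```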

Page 22: Boosting

Tony Jebara, Columbia University

AdaBoost Generalization

[Figure: the terms of the margin bound plotted against the margin threshold Θ.]

• Suggests this optimization problem:

Page 23: Boosting

Tony Jebara, Columbia University

AdaBoost Generalization • Proof Sketch

A stability-type proof: you can wiggle the margin within its support and still classify properly. The proof is not through VC arguments. VC tells you to stop boosting as soon as possible, but the margin analysis says keep boosting, which is what you see working in practice.

Page 24: Boosting

Tony Jebara, Columbia University

UCI Results

Database   | Other            | Boosting | Error reduction
Cleveland  | 27.2 (DT)        | 16.5     | 39%
Promoters  | 22.0 (DT)        | 11.8     | 46%
Letter     | 13.8 (DT)        | 3.5      | 74%
Reuters 4  | 5.8, 6.0, 9.8    | 2.95     | ~60%
Reuters 8  | 11.3, 12.1, 13.4 | 7.4      | ~40%

% test error rates

Page 25: Boosting

Boosted cascade: it turns out that if you have a million classifiers, who wants to use a million? It's a pain; instead, use a few. Classify an image patch as face / not face. Learn a stump that puts a bunch of strips on faces and compares white-strip pixels against black-strip pixels. It's a very weak classifier, so let's use a bunch of them. The problem: a 24x24 pixel region means roughly 160k stumps to train (basically perceptrons). We can do it, and we see the error rate dropping; eventually you have a bunch of boosted stumps. Next, scan an image and locate the face. The problem is that you scan very many regions in the image, and each classification in a single region uses a very large committee of stumps (the problem with boosting is that you have a large committee to get answers from). The smart thing is to ask the first stump, and if it says not-a-face, stop asking; continue asking until the first rejection. We know how to use the committee by only querying the first few members. What's nice is that it's much faster (even in 2003).
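A minimal sketch of the early-rejection idea described here: group the boosted stumps into stages and stop at the first stage that rejects, so most non-face patches cost only a few evaluations. The stage structure and thresholds are hypothetical, not Viola-Jones's actual trained cascade.

```python
def cascade_classify(stages, patch):
    """Each stage is (ensemble, threshold), where ensemble is a list of
    (alpha, stump) pairs. A patch must pass every stage to be called a face;
    the first rejection stops all further work, so most non-face patches
    are discarded after only a few stump evaluations."""
    for ensemble, threshold in stages:
        score = sum(alpha * h(patch) for alpha, h in ensemble)
        if score < threshold:
            return -1          # rejected early: not a face
    return +1                  # survived every stage: face
```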

Boosted Bayesian networks, boosted whatever: there's a whole industry of consistently seeing performance improvements from boosting almost anything.