
Page 1: On-line learning and Boosting

On-line learning and Boosting

Overview of “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” by

Freund and Schapire (1997).

Tim Miller

University of Minnesota

Department of Computer Science and Engineering

Page 2: On-line learning and Boosting

Hedge - Motivation

Generalization of Weighted Majority Algorithm

Given a set of expert predictions, minimize mistakes over time

Slight emphasis in the motivation on the possibility of treating the weights w as a prior.

Page 3: On-line learning and Boosting

Hedge Algorithm

Parameters: β ∈ [0, 1], initial weight vector w^1, number of trials T

For t = 1..T:

1. Choose allocation p^t = w^t / Σ_i w^t_i (probability distribution formed from the weights)

2. Receive loss vector ℓ^t

3. Suffer loss p^t · ℓ^t

4. Set new weight vector: w^{t+1}_i = w^t_i · β^{ℓ^t_i}
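A minimal NumPy sketch of this loop, assuming all N strategies' losses are revealed at each trial (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def hedge(loss_vectors, beta=0.9):
    """Run Hedge(beta) over a sequence of loss vectors (one per trial).

    loss_vectors: array of shape (T, N) with entries in [0, 1],
                  where N is the number of strategies.
    Returns the total loss suffered by Hedge.
    """
    T, N = loss_vectors.shape
    w = np.full(N, 1.0 / N)                 # w^1: uniform initial weights
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                     # 1. allocation p^t
        loss = loss_vectors[t]              # 2. receive loss vector l^t
        total_loss += p @ loss              # 3. suffer loss p^t . l^t
        w = w * beta ** loss                # 4. w^{t+1}_i = w^t_i * beta^{l^t_i}
    return total_loss
```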

Page 4: On-line learning and Boosting

Hedge Analysis

Does not perform "too much worse" than the best strategy: for every strategy i, L_Hedge(β) ≤ (−ln w^1_i − L_i ln β) · Z, where Z = 1 / (1 − β)

Is it possible to do better?
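The paper's answer is that, by tuning β using an upper bound L̃ ≥ min_i L_i known in advance, the bound can be put in an additive form; a rough statement (the exact constants should be checked against Section 2 of the paper) is:

```latex
% With N strategies and \beta = 1 / \bigl(1 + \sqrt{2 \ln N / \tilde{L}}\bigr):
L_{\mathrm{Hedge}(\beta)} \;\le\; \min_i L_i \;+\; \sqrt{2\,\tilde{L}\,\ln N} \;+\; \ln N
```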

Page 5: On-line learning and Boosting

Boosting

If we have n classifiers, possibly looking at the problem from different perspectives, how can we optimally combine them

Example: We have a collection of “rules of thumb” for predicting horse races, how to weight them

Page 6: On-line learning and Boosting

Definitions

Given labeled data ⟨x, c(x)⟩, where c is the target concept, c: X → {0, 1}.

c ∈ C, the concept class

Strong PAC-learning algorithm: for parameters ε, δ, the hypothesis has error less than ε with probability at least (1 − δ)

Weak learning algorithm: error at most (0.5 − γ) for some γ > 0
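Stated side by side (notation mine, matching the slide): a strong learner must meet any accuracy target, while a weak learner only needs a fixed edge γ over random guessing:

```latex
\text{Strong: } \forall\, \varepsilon, \delta > 0:\quad
  \Pr\bigl[\operatorname{err}(h) < \varepsilon\bigr] \ge 1 - \delta
\qquad
\text{Weak: } \exists\, \gamma > 0:\quad
  \Pr\bigl[\operatorname{err}(h) \le \tfrac12 - \gamma\bigr] \ge 1 - \delta
```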

Page 7: On-line learning and Boosting

AdaBoost Algorithm

Input: Sequence of N labeled examples

Distribution D over the N examples

Weak learning algorithm (called WeakLearn)

Number of iterations T

Page 8: On-line learning and Boosting

AdaBoost contd.

Initialize: w^1 = D

For t = 1..T:

1. Form probability distribution p^t from w^t

2. Call WeakLearn with distribution p^t; receive hypothesis h_t

3. Calculate error ε_t = Σ_{i=1..N} p^t_i |h_t(x_i) − y_i|

4. Set β_t = ε_t / (1 − ε_t)

5. Multiplicatively adjust weights: w^{t+1}_i = w^t_i · β_t^{1 − |h_t(x_i) − y_i|}
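A compact sketch of this loop, assuming labels y_i ∈ {0, 1} and a WeakLearn callable that returns a hypothesis h(x) ∈ [0, 1] (the interface and names are illustrative):

```python
import numpy as np

def adaboost_train(X, y, weak_learn, D, T):
    """Sketch of the AdaBoost training loop.

    X, y       : N examples with labels y_i in {0, 1}
    weak_learn : callable taking (X, y, p) and returning a hypothesis
                 h(x) in [0, 1] (interface is illustrative, not the paper's)
    D          : initial distribution over the N examples
    T          : number of boosting rounds
    Returns the weak hypotheses h_t and their beta_t values.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(D, dtype=float)                 # w^1 = D
    hypotheses, betas = [], []
    for t in range(T):
        p = w / w.sum()                            # 1. distribution p^t
        h = weak_learn(X, y, p)                    # 2. call WeakLearn
        h_out = np.array([h(x) for x in X])
        eps = np.sum(p * np.abs(h_out - y))        # 3. error eps_t
        beta = eps / (1.0 - eps)                   # 4. beta_t
        w = w * beta ** (1.0 - np.abs(h_out - y))  # 5. multiplicative update
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas
```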

Page 9: On-line learning and Boosting

AdaBoost Output

Output 1 if: Σ_{t=1..T} (log 1/β_t) h_t(x) ≥ ½ Σ_{t=1..T} (log 1/β_t)

0 otherwise

Computes a weighted average
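Continuing the sketch above, the final weighted-majority output then looks roughly like:

```python
import numpy as np

def adaboost_predict(x, hypotheses, betas):
    """Weighted-majority output of the boosted classifier (sketch)."""
    alphas = np.log(1.0 / np.array(betas))        # vote weight log(1/beta_t)
    votes = np.array([h(x) for h in hypotheses])  # h_t(x) in [0, 1]
    return 1 if alphas @ votes >= 0.5 * alphas.sum() else 0
```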

Page 10: On-line learning and Boosting

AdaBoost Analysis

Note the "dual" relationship with Hedge:

Strategies ↔ Examples

Trials ↔ Weak hypotheses

Hedge increases weight for successful strategies; AdaBoost increases weight for difficult examples

AdaBoost has a dynamic β (a new β_t each round, where Hedge uses a single fixed β)

Page 11: On-line learning and Boosting

AdaBoost Bounds

ε ≤ 2^T ∏_{t=1..T} sqrt(ε_t(1 − ε_t))

Previous bounds depended on maximum error of weakest hypothesis (weak link syndrome)

AdaBoost takes advantage of gains from best hypotheses
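Writing γ_t = 1/2 − ε_t for the edge of the t-th weak hypothesis, the same bound shows that every round's edge helps:

```latex
\epsilon \;\le\; 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp\Bigl(-2 \sum_{t=1}^{T} \gamma_t^2\Bigr)
```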

Page 12: On-line learning and Boosting

Multi-class Setting

k > 2 output labels, i.e. Y = {1, 2, …, k}

Error: probability of an incorrect prediction

Two algorithms:

AdaBoost.M1 – more direct

AdaBoost.M2 – somewhat complex constraints on weak learners

Could also just divide into “one vs. one” or “one vs. all” categories

Page 13: On-line learning and Boosting

AdaBoost.M1

Requires each classifier to have error less than 50% (stronger requirement than binary case)

Similar to the regular AdaBoost algorithm except:

Error is 1 if h_t(x_i) ≠ y_i, 0 otherwise

Can't use weak learners with error > 0.5

Algorithm outputs a vector of length k with values between 0 and 1
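As a sketch of the two changes relative to the binary version (labels assumed here to be 1..k; names are mine):

```python
import numpy as np

def m1_round_error(p, preds, y):
    """AdaBoost.M1 round error: weighted count of misclassified examples."""
    return float(np.sum(p * (preds != y)))        # loss is 1 iff h_t(x_i) != y_i

def m1_predict(x, hypotheses, betas, k):
    """Final AdaBoost.M1 output: the label with the largest weighted vote."""
    alphas = np.log(1.0 / np.array(betas))
    votes = np.zeros(k)                           # one vote total per label 1..k
    for h, a in zip(hypotheses, alphas):
        votes[h(x) - 1] += a                      # h(x) assumed to return 1..k
    return int(np.argmax(votes)) + 1
```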

Page 14: On-line learning and Boosting

AdaBoost.M1 Analysis

ε ≤ 2^T ∏_{t=1..T} sqrt(ε_t(1 − ε_t))

Same as the bound for regular AdaBoost

Proof converts the multi-class problem to a binary setup

Can we improve this algorithm?

Page 15: On-line learning and Boosting

AdaBoost.M2

More expressive, more complex constraints on weak hypotheses

Defines the idea of "pseudo-loss"

Pseudo-loss of each weak hypothesis must be better than chance

Benefit: allows contributions from hypotheses with accuracy < 0.5

Page 16: On-line learning and Boosting

Pseudo-loss

Replaces the straightforward loss of AdaBoost.M1:

ploss_q(h, i) = ½ (1 − h(x_i, y_i) + Σ_{y≠y_i} q(i, y) h(x_i, y))

Intuition: For each incorrect label, pit it against known label in binary classification (second term), then take a weighted average.

Makes use of information in entire hypothesis vector, not just prediction
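A direct transcription of that formula as a function (names are mine; h is assumed to return a confidence in [0, 1] for a candidate label):

```python
def pseudo_loss(h, x_i, y_i, q_i, labels):
    """Pseudo-loss of hypothesis h on example (x_i, y_i).

    h      : callable h(x, y) -> confidence in [0, 1]
    q_i    : dict mapping each incorrect label y to its weight q(i, y)
    labels : the full label set Y = {1, ..., k}
    """
    wrong_term = sum(q_i[y] * h(x_i, y) for y in labels if y != y_i)
    return 0.5 * (1.0 - h(x_i, y_i) + wrong_term)
```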

Page 17: On-line learning and Boosting

AdaBoost.M2 Details

Extra init: w^1_{i,y} = D(i) / (k − 1) for each label y ≠ y_i

For each iteration t = 1..T:

W^t_i = Σ_{y≠y_i} w^t_{i,y}

q_t(i, y) = w^t_{i,y} / W^t_i

D_t(i) = W^t_i / Σ_{i=1..N} W^t_i

WeakLearn gets D_t as well as q_t

Calculate ε_t as shown above (the pseudo-loss, weighted by D_t)

β_t = ε_t / (1 − ε_t)

w^{t+1}_{i,y} = w^t_{i,y} · β_t^{(1/2)(1 + h_t(x_i, y_i) − h_t(x_i, y))}
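A rough sketch of one round of this bookkeeping, assuming labels are encoded 0..k−1 and h(x, y) returns a value in [0, 1] (all names here are illustrative):

```python
import numpy as np

def m2_round(w, h, X, y, k):
    """One round of AdaBoost.M2 bookkeeping (labels encoded 0..k-1 here).

    w : array of shape (N, k) with mislabel weights w^t_{i,y};
        the column of the correct label is ignored
    h : hypothesis h(x, y) -> [0, 1] returned by WeakLearn this round
    Returns the pseudo-loss eps_t, beta_t, and the updated weights.
    """
    y = np.asarray(y)
    N = len(y)
    labels = np.arange(k)
    wrong = labels[None, :] != y[:, None]           # mask of incorrect labels
    W = (w * wrong).sum(axis=1)                     # W^t_i
    q = np.where(wrong, w / W[:, None], 0.0)        # q_t(i, y)
    D = W / W.sum()                                 # D_t(i)

    H = np.array([[h(X[i], lab) for lab in labels] for i in range(N)])
    h_correct = H[np.arange(N), y]                  # h_t(x_i, y_i)
    ploss = 0.5 * (1 - h_correct + (q * H).sum(axis=1))  # per-example pseudo-loss
    eps = float(D @ ploss)                          # eps_t
    beta = eps / (1 - eps)                          # beta_t

    exponent = 0.5 * (1 + h_correct[:, None] - H)   # (1/2)(1 + h(x_i,y_i) - h(x_i,y))
    w_new = w * beta ** exponent                    # update mislabel weights
    return eps, beta, w_new
```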

Page 18: On-line learning and Boosting

Error Bounds

ε ≤ (k − 1) · 2^T ∏_{t=1..T} sqrt(ε_t(1 − ε_t))

where ε is the traditional error and the ε_t are pseudo-losses

Page 19: On-line learning and Boosting

Regression Setting

Instead of picking from a discrete set of output labels, choose a continuous value

More formally, Y = [0, 1]

Minimize the mean squared error: E[(h(x) − y)²]

Reduce to binary classification and use AdaBoost!

Page 20: On-line learning and Boosting

How it works (roughly)

For each example in the training set, create a continuum of associated instances x̃ = (x_i, y) where y ∈ [0, 1].

Label is 1 if y ≥ y_i, 0 otherwise

Mapping to an infinite training set – need to convert discrete distributions to density functions

Page 21: On-line learning and Boosting

AdaBoost.R Bounds

ε ≤ 2^T ∏_{t=1..T} sqrt(ε_t(1 − ε_t))

Page 22: On-line learning and Boosting

Conclusions

Starting from an on-line learning perspective, it is possible to generalize to boosting

Boosting can take weak learners and convert them to strong learners

This paper presented several algorithms to do boosting, with proofs of error bounds