
Page 1:

Online Importance Weight Aware Updates

Nikos Karampatziakis John Langford

July 17th 2011


Page 2:

What this talk is about

Importance weights encode the relative importance of examples. They appear in many settings:

- Boosting
- Covariate shift correction
- Learning reductions
- Active learning

Online gradient descent (OGD) is a popular, sound optimizer.

Interplay between importance weights and OGD


Page 5:

Online Gradient Descent – Linear Model

Loss: $\ell(w_t^\top x_t,\, y_t)$

Update:
$$w_{t+1} = w_t - \eta_t \left.\frac{\partial \ell(p, y_t)}{\partial p}\right|_{p = w_t^\top x_t} x_t$$

Naive approach: define the loss as $i_t\,\ell(w_t^\top x_t,\, y_t)$, giving
$$w_{t+1} = w_t - \eta_t\, i_t \left.\frac{\partial \ell(p, y_t)}{\partial p}\right|_{p = w_t^\top x_t} x_t$$

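To see why the naive update misbehaves, compare it with presenting the example multiple times. A minimal sketch (hypothetical numbers; squared loss scaled as $\ell(p,y) = \tfrac{1}{2}(p-y)^2$ so that its derivative is $p - y$):

```python
import numpy as np

x = np.array([1.0, 2.0])  # one example, hypothetical values
y = 1.0
eta = 0.3
i = 6                     # importance weight

w = np.zeros(2)

# Naive update: multiply the gradient by the importance weight.
p = w @ x
w_naive = w - eta * i * (p - y) * x
print("naive prediction:", w_naive @ x)    # 9.0, far past y = 1

# Presenting the example i times, recomputing the gradient each time:
w_rep = w.copy()
for _ in range(i):
    p = w_rep @ x
    w_rep = w_rep - eta * (p - y) * x
print("repeated prediction:", w_rep @ x)   # close to y = 1
```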

Page 7:

Failure of Naive Approach

[Figure: loss as a function of the margin $y w_t^\top x$. A single step $-\eta(\nabla\ell)^\top x$ moves the prediction toward the minimum, but scaling that same step by the importance weight, e.g. $-6\eta(\nabla\ell)^\top x$, overshoots it, leaving $w_{t+1}^\top x$ far on the other side.]


Page 10:

Our principle

Principle: Having an example with importance weight $i$ should be equivalent to having the example $i$ times in the dataset.

Present the $t$-th example $i_t$ times in a row.

Cumulative effect of this process?

Limit of this process?


Page 12:

Multiple updates

[Figure: performing the update repeatedly, recomputing the gradient at each step, moves $w_t^\top x$ toward the minimum in progressively smaller steps $-\eta(\nabla\ell)^\top x$ instead of overshooting.]

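Passing to the limit of this process (a sketch, not spelled out on the slides): let $s(p, i)$ denote the cumulative displacement along $x_t$ after presenting the example with total weight $i$. One further presentation with infinitesimal weight $di$ adds an OGD step evaluated at the current prediction $p - s(p,i)\,x^\top x$, which yields the differential equation on the next slide:

$$s(p,\, i + di) = s(p, i) + \eta \left.\frac{\partial \ell(\rho, y)}{\partial \rho}\right|_{\rho = p - s(p,i)\, x^\top x} di \;\Longrightarrow\; \frac{\partial s}{\partial i} = \eta \left.\frac{\partial \ell(\rho, y)}{\partial \rho}\right|_{\rho = p - s(p,i)\, x^\top x}, \qquad s(p, 0) = 0$$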

Page 14:

Importance Invariant Updates

All gradients point in the same direction: $\nabla\ell = \frac{\partial \ell(p, y_t)}{\partial p}\, x_t$

So $w_{t+1}$ lies in the span of $w_t$ and $x_t$. Letting $p = w_t^\top x_t$:
$$w_{t+1} = w_t - s(p, i_t)\, x_t$$

The scaling function must satisfy
$$\frac{\partial s}{\partial i} = \eta \left.\frac{\partial \ell(\rho, y)}{\partial \rho}\right|_{\rho = (w_t - s(p,i)\,x)^\top x}, \qquad s(p, 0) = 0$$

Invariance:
$$s(p, a + b) = s(p, a) + s\big(p - s(p, a)\,\|x\|^2,\; b\big)$$

"OGD: just an Euler integrator" (Paul Mineiro)

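A quick numerical check of this construction (a sketch with hypothetical numbers; the squared loss is scaled as $\ell(\rho,y) = \tfrac{1}{2}(\rho-y)^2$ so that $\partial\ell/\partial\rho = \rho - y$): integrate the differential equation with small Euler steps, i.e. present the example many times with tiny weights, and confirm the invariance identity.

```python
import numpy as np

def s_euler(p, i, y, eta, xx, steps=100_000):
    """Integrate ds/di = eta * dl/drho at rho = p - s*xx with Euler steps,
    for the scaled squared loss l(rho, y) = 0.5 * (rho - y)**2."""
    s, di = 0.0, i / steps
    for _ in range(steps):
        rho = p - s * xx
        s += eta * (rho - y) * di
    return s

p, y, eta, xx = 0.0, 1.0, 0.3, 5.0   # hypothetical values
a, b = 2.0, 4.0

s_ab = s_euler(p, a + b, y, eta, xx)
s_a = s_euler(p, a, y, eta, xx)
s_b = s_euler(p - s_a * xx, b, y, eta, xx)   # restart from the moved prediction
print("s(p, a+b)              :", s_ab)
print("s(p,a) + s(p - s*xx, b):", s_a + s_b)  # agree up to Euler error

# Closed form for this loss (next slide, with the 1/2 scaling):
print("closed form            :", (p - y) / xx * (1 - np.exp(-(a + b) * eta * xx)))
```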

Page 17:

Closed Form for Many Losses

Squared loss $\ell(p,y) = (y - p)^2$:
$$s(p, i) = \frac{p - y}{x^\top x}\left(1 - e^{-i\eta\, x^\top x}\right)$$

Logistic loss $\ell(p,y) = \log(1 + e^{-yp})$ ($W$ is the Lambert W function):
$$s(p, i) = \frac{W\!\left(e^{\,i\eta\, x^\top x + yp + e^{yp}}\right) - i\eta\, x^\top x - e^{yp}}{y\, x^\top x}$$

Hinge loss $\ell(p,y) = \max(0,\, 1 - yp)$:
$$s(p, i) = -y \min\left(i\eta,\; \frac{1 - yp}{x^\top x}\right)$$

Other losses with closed-form updates: exponential, logarithmic, Hellinger, $\tau$-quantile, and others.

Interesting even if $i = 1$ (satisfies a regret guarantee)

Safety: update never overshoots.

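These updates are straightforward to implement. A sketch (assuming the loss scalings shown above, with any constant factors folded into $\eta$; `scipy.special.lambertw` supplies $W$):

```python
import numpy as np
from scipy.special import lambertw

# Each function returns s(p, i); the weight update is then w -= s * x.

def s_squared(p, y, i, eta, xx):
    return (p - y) / xx * (1.0 - np.exp(-i * eta * xx))

def s_logistic(p, y, i, eta, xx):
    # Note: exp(k) can overflow for large k; a production version
    # would need a guarded evaluation of W(exp(k)).
    k = i * eta * xx + y * p + np.exp(y * p)
    return (np.real(lambertw(np.exp(k))) - i * eta * xx - np.exp(y * p)) / (y * xx)

def s_hinge(p, y, i, eta, xx):
    if y * p >= 1.0:
        return 0.0  # margin already satisfied: zero gradient, no update
    return -y * min(i * eta, (1.0 - y * p) / xx)

# Hypothetical usage: one logistic update with importance weight 6.
x = np.array([1.0, 2.0]); w = np.zeros(2); y = 1.0; eta = 0.3
w = w - s_logistic(w @ x, y, 6.0, eta, x @ x) * x
print(w @ x)  # about 2.07: a large weight gives a damped, non-divergent step
```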

Page 18:

Other ways of dealing with large importance weights

Sampling: slow, inefficient

Go through the data many times: very slow

Solve $w_{t+1} = \operatorname{argmin}_w \tfrac{1}{2}\|w - w_t\|^2 + i\eta\,\ell(w^\top x_t, y_t)$

- Known as implicit updates
- Qualitatively very similar
- Safe, with a regret guarantee
- Typically not closed form, not invariant

Replace $\ell$ above with a quadratic approximation

- Works for some losses

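For comparison with the invariant update: the implicit update does come out in closed form for squared loss (a standard proximal-step derivation, not from the slides; same $\tfrac{1}{2}$ scaling as in the earlier sketches). Setting the gradient of the objective to zero damps the step by $1 + i\eta\, x^\top x$:

```python
import numpy as np

def implicit_update_squared(w, x, y, i, eta):
    """Closed-form implicit (proximal) step for l(p, y) = 0.5 * (p - y)**2:
    argmin_w' 0.5*||w' - w||^2 + i*eta*0.5*(w' @ x - y)**2."""
    p, xx = w @ x, x @ x
    return w - (i * eta * (p - y) / (1.0 + i * eta * xx)) * x

# Hypothetical numbers: like the invariant update, the step is damped as
# the importance weight grows, approaching but never overshooting y.
w = np.zeros(2); x = np.array([1.0, 2.0]); y = 1.0
for i in (1, 6, 1000):
    print(i, implicit_update_squared(w, x, y, i, eta=0.3) @ x)
```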

Page 19:

Experiments with small weights

Do the importance invariant updates help when $i_t = 1$?

[Figure: scatter plot for the astro dataset with logistic loss; test accuracy of the standard update (x-axis) against the invariant update (y-axis), both axes from 0.90 to 0.97.]

Almost eliminates the search for a good learning rate.

Page 20:

Experiments with Active Learning

Importance Weighted Active Learning:

$w_1 = 0$
For $t = 1, \ldots, T$:

1. Receive unlabeled example $x_t$.
2. Predict $\operatorname{sign}(w_t^\top x_t)$.
3. Choose a probability of labeling $p_t$:
   - Let $w' = \operatorname{argmin}_{w:\, \operatorname{sign}(w^\top x_t) \neq \operatorname{sign}(w_t^\top x_t)} \|w - w_t\|^2$.
   - Let $\Delta_t$ be the error-rate difference between $w_t$ and $w'_t$.
   - $p_t = \min\left\{1,\; O\!\left(\left(\frac{1}{\Delta_t^2} + \frac{1}{\Delta_t}\right)\frac{\log t}{t}\right)\right\}$
4. With probability $p_t$, get the label $y_t$ and update $w_t$ with $(x_t, y_t, \tfrac{1}{p_t})$. Otherwise $w_{t+1} = w_t$.

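A sketch of this loop in code (the constant `c0` is a hypothetical stand-in for the constants hidden in the $O(\cdot)$; together with `s_logistic` from the earlier sketch and `estimate_delta` from the sketch after the next slide, this runs end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

def iwal(stream, eta, c0=1.0):
    """Importance Weighted Active Learning loop from the slide.
    `stream` yields (x, y) pairs; y is consulted only when queried."""
    w = None
    for t, (x, y) in enumerate(stream, start=1):
        if w is None:
            w = np.zeros_like(x)
        delta = max(estimate_delta(w, x, t, eta), 1e-12)  # guard delta ~ 0
        p_t = min(1.0, c0 * (1.0 / delta**2 + 1.0 / delta) * np.log(t) / t)
        # (At t = 1, log t = 0, so a real implementation would seed
        # the learner with a few queried examples first.)
        if rng.random() < p_t:
            # Query the label; invariant update with importance weight 1/p_t.
            w = w - s_logistic(w @ x, y, 1.0 / p_t, eta, x @ x) * x
        # Otherwise w is left unchanged.
    return w
```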

Page 21:

Estimating Difference in Error Rates

Suppose $x_t$ doesn't have the label preferred by $w_t$.

Let $y_a = -\operatorname{sign}(w_t^\top x_t)$.

Find $i_t$ such that $w_{t+1} = w'_t$ after an $(x_t, y_a, i_t)$ update.

For example, for logistic loss with invariant updates:
$$i_t = \frac{1 - e^{y_a w_t^\top x_t} - y_a w_t^\top x_t}{\eta_t}$$

Then $\Delta_t \approx i_t / t$.

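In code (a sketch; following the slide's formula, which appears to assume features scaled so that $x^\top x = 1$, otherwise the denominator would be $\eta_t\, x^\top x$):

```python
import numpy as np

def estimate_delta(w, x, t, eta):
    """Estimate the error-rate difference Delta_t for logistic loss:
    the importance weight i_t that would flip the prediction, divided by t."""
    p = w @ x
    y_a = -np.sign(p) if p != 0 else 1.0  # the label w currently disagrees with
    i_t = (1.0 - np.exp(y_a * p) - y_a * p) / eta
    return i_t / t
```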


Page 24:

Active Learning Results

[Figure: four panels of test error vs. fraction of labels queried, on the astrophysics, rcv1, spam, and webspam datasets, each comparing invariant, implicit, gradient multiplication, and passive updates.]


Page 25:

How fast is it?

Demo on RCV1 (≈780K documents, ~77 features/doc)...

As fast as (passive) online gradient descent:

- Active learning takes 2.6 sec.
- Passive online gradient descent takes 2.5 sec.
- 91K queries (11%)


Page 27:

Conclusions

New updates derived from first principles. You should use them because they:

- work well, even for $i = 1$
- are robust w.r.t. the learning rate
- are fast (closed form)
- have useful properties (invariance, safety)
- are simple to code

Check out the implementation in Vowpal Wabbit: http://github.com/JohnLangford/vowpal_wabbit
