Online Importance Weight Aware Updates (slides: lowrank.net/nikos/pubs/uai11_slides.pdf)
TRANSCRIPT
Online Importance Weight Aware Updates
Nikos Karampatziakis John Langford
July 17th 2011
Nikos & John Online Importance Weight Aware Updates
What this talk is about
Importance weights encode the relative importance of examples. They appear in many settings:
- Boosting
- Covariate shift correction
- Learning reductions
- Active learning

Online gradient descent: a popular, sound optimizer.

Interplay between importance weights and OGD.
Online Gradient Descent – Linear Model
Loss: \ell(w_t^\top x_t, y_t)

w_{t+1} = w_t - \eta_t \left.\frac{\partial \ell(p, y_t)}{\partial p}\right|_{p = w_t^\top x_t} x_t

Naive approach: define the loss as i_t \ell(w_t^\top x_t, y_t)

w_{t+1} = w_t - \eta_t i_t \left.\frac{\partial \ell(p, y_t)}{\partial p}\right|_{p = w_t^\top x_t} x_t
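To see concretely why the naive update misbehaves, here is a minimal numeric sketch (my own illustration, assuming squared loss \ell(p, y) = (y - p)^2 and arbitrary values for the learning rate and example):

```python
# Naive importance-weighted OGD step on the prediction p = w^T x for
# squared loss l(p, y) = (y - p)^2, whose gradient w.r.t. p is 2*(p - y).
def naive_prediction_update(p, y, i, eta, xx):
    """Return the new prediction after one step scaled by importance weight i."""
    return p - eta * i * 2.0 * (p - y) * xx

p, y, eta, xx = 0.0, 1.0, 0.1, 1.0
p_unit = naive_prediction_update(p, y, 1.0, eta, xx)    # 0.2: modest step toward y
p_heavy = naive_prediction_update(p, y, 10.0, eta, xx)  # 2.0: far past the label y = 1
```

With weight 1 the prediction moves from 0.0 to 0.2; with weight 10 the same example pushes it to 2.0, overshooting the label, which is exactly the failure mode illustrated next.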
Failure of Naive Approach
[Figure: number-line diagrams of the margin y w_t^\top x. A unit-weight update -\eta (\nabla\ell)^\top x moves the new prediction w_{t+1}^\top x a sensible amount; scaling the same step by an importance weight of 6, -6\eta (\nabla\ell)^\top x, throws w_{t+1}^\top x far past the target (??).]
Our principle
Principle: having an example with importance weight i should be equivalent to having the example i times in the dataset.

Present the t-th example i_t times in a row.
Cumulative effect of this process?
Limit of this process?
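As a quick numeric sketch of the principle (squared loss again, with hypothetical values of my choosing): presenting an example i times applies i ordinary unit-weight steps, which approach the label geometrically and never cross it, unlike one gradient step multiplied by i.

```python
def unit_update(p, y, eta, xx):
    # One ordinary OGD step on the prediction for squared loss (y - p)^2.
    return p - eta * 2.0 * (p - y) * xx

def present_i_times(p, y, i, eta, xx):
    # Present the same example i times in a row (integer i for simplicity).
    for _ in range(i):
        p = unit_update(p, y, eta, xx)
    return p

p_repeated = present_i_times(0.0, 1.0, 10, 0.1, 1.0)  # creeps toward y = 1
p_naive = 0.0 - 0.1 * 10 * 2.0 * (0.0 - 1.0) * 1.0    # 2.0: one scaled step overshoots
```

Replacing the integer repetitions with ever smaller fractional steps is the limit the next slides take.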
Multiple updates
[Figure: number-line diagrams showing several smaller updates -\eta (\nabla\ell)^\top x applied in sequence; the resulting prediction w_{t+1}^\top x approaches the target without overshooting.]
Importance Invariant Updates
Gradients point in the same direction: \nabla\ell = \frac{\partial \ell(p, y_t)}{\partial p} x_t

w_{t+1} is in the span of w_t and x_t. Letting p = w_t^\top x_t:

w_{t+1} = w_t - s(p, i_t) x_t

The scaling s needs to satisfy:

\frac{\partial s}{\partial i} = \eta \left.\frac{\partial \ell(\rho, y)}{\partial \rho}\right|_{\rho = (w_t - s(p, i) x)^\top x}, \qquad s(p, 0) = 0

Invariance: s(p, a + b) = s(p, a) + s(p - s(p, a) \|x\|^2, b)

"OGD: just an Euler integrator" (Paul Mineiro)
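The "Euler integrator" remark can be checked numerically. A sketch for squared loss, normalized here as \ell = \frac{1}{2}(y - \rho)^2 so that \partial\ell/\partial\rho = \rho - y and the closed form s(p, i) = \frac{p - y}{x^\top x}(1 - e^{-i\eta x^\top x}) solves the ODE without extra constants (this normalization and all values are my assumptions):

```python
import math

def s_closed(p, y, i, eta, xx):
    # Closed-form invariant scaling for the 0.5*(y - p)^2 loss.
    return (p - y) / xx * (1.0 - math.exp(-i * eta * xx))

def s_euler(p, y, i, eta, xx, steps=30000):
    # Euler-integrate ds/di = eta * (rho - y) with rho = p - s * x^T x
    # and initial condition s(p, 0) = 0.
    s = 0.0
    h = i / steps
    for _ in range(steps):
        rho = p - s * xx
        s += h * eta * (rho - y)
    return s

# A fine-step Euler solution matches the closed form closely.
```

Plain OGD corresponds to a single coarse Euler step of this ODE; the invariant update is its exact solution.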
Closed Form for Many Losses
Loss       \ell(p, y)           Update s(p, i)
Squared    (y - p)^2            \frac{p - y}{x^\top x}\,(1 - e^{-i\eta x^\top x})
Logistic   \log(1 + e^{-yp})    \frac{W(e^{i\eta x^\top x + yp + e^{yp}}) - i\eta x^\top x - e^{yp}}{y\, x^\top x}
Hinge      \max(0, 1 - yp)      -y \min(i\eta,\ \frac{1 - yp}{x^\top x})

(W denotes the Lambert W function.)

Other losses with closed-form updates: exponential, logarithmic, Hellinger, \tau-quantile, and others.

Interesting even if i = 1 (satisfies a regret guarantee).

Safety: the update never overshoots.
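Two of the closed forms can be sketched and checked directly. Caveat: for squared loss I again use the \frac{1}{2}(y - p)^2 normalization so the exponent is i\eta x^\top x; the hinge row follows the table as written, and all values below are arbitrary. Both the invariance identity and the safety property are then easy to verify numerically:

```python
import math

def s_squared(p, y, i, eta, xx):
    # Invariant scaling for squared loss (normalized as 0.5*(y - p)^2).
    return (p - y) / xx * (1.0 - math.exp(-i * eta * xx))

def s_hinge(p, y, i, eta, xx):
    # Invariant scaling for hinge loss max(0, 1 - y*p); no update once
    # the margin is already satisfied.
    if y * p >= 1.0:
        return 0.0
    return -y * min(i * eta, (1.0 - y * p) / xx)

p, y, eta, xx = 0.5, 1.0, 0.1, 2.0

# Invariance: one update with weight a+b equals chaining weights a then b.
a, b = 1.0, 2.0
s_a = s_squared(p, y, a, eta, xx)
chained = s_a + s_squared(p - s_a * xx, y, b, eta, xx)
assert abs(chained - s_squared(p, y, a + b, eta, xx)) < 1e-9

# Safety: even a huge weight moves the hinge prediction exactly to the margin.
p_new = p - s_hinge(p, y, 1e6, eta, xx) * xx
assert abs(y * p_new - 1.0) < 1e-9
```

Note how the squared-loss step saturates as i grows instead of scaling linearly, which is the whole point of the invariant update.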
Other ways of dealing with large importance weights

Sampling: slow, inefficient

Go through the data many times: very slow

Solve w_{t+1} = \arg\min_w \frac{1}{2}\|w - w_t\|^2 + i\eta\,\ell(w^\top x_t, y_t)
- Known as implicit updates
- Qualitatively very similar
- Safe, regret guarantee
- Typically not closed form, not invariant

Replace \ell above with a quadratic approximation
- works for some losses
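For squared loss the implicit update above does have a closed form, which makes the comparison concrete (a sketch under the \frac{1}{2}(y - p)^2 normalization; the function name and values are mine): the minimizer moves the prediction to a convex combination of the old prediction and the label, so it too is safe.

```python
def implicit_prediction(p, y, i, eta, xx):
    # argmin_w 0.5*||w - w_t||^2 + i*eta*0.5*(y - w^T x)^2 gives, for the
    # new prediction p' = w^T x:  p' = (p + i*eta*xx*y) / (1 + i*eta*xx).
    return (p + i * eta * xx * y) / (1.0 + i * eta * xx)

p_mid = implicit_prediction(0.0, 1.0, 5.0, 0.1, 1.0)     # strictly between 0 and 1
p_huge = implicit_prediction(0.0, 1.0, 1e6, 0.1, 1.0)    # approaches y, never passes it
```

Unlike the invariant update, this closed form is loss-specific; for most losses the implicit minimization has no closed form.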
Experiments with small weights
Do the importance-invariant updates help when i_t = 1?

[Figure: scatter plot comparing test accuracy of the standard update (x-axis, 0.90 to 0.97) against the invariant update (y-axis, 0.90 to 0.97), astro dataset, logistic loss.]

Almost eliminates the search for a good learning rate.
Experiments with Active Learning
Importance Weighted Active Learning:
w_1 = 0
For t = 1, \dots, T:
1. Receive unlabeled example x_t.
2. Predict \operatorname{sign}(w_t^\top x_t).
3. Choose a probability of labeling p_t:
   - Let w' = \arg\min_{w : \operatorname{sign}(w^\top x_t) \ne \operatorname{sign}(w_t^\top x_t)} \|w - w_t\|^2.
   - Let \Delta_t be the error-rate difference between w_t and w'.
   - p_t = \min\{1,\ O((\frac{1}{\Delta_t^2} + \frac{1}{\Delta_t}) \frac{\log t}{t})\}
4. With probability p_t get the label y_t and update w_t with (x_t, y_t, \frac{1}{p_t}). Otherwise w_{t+1} = w_t.
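Step 3 can be sketched as code; note the constant hidden in the O(\cdot) is not specified on the slide, so taking it to be 1 below is purely my assumption:

```python
import math

def query_probability(delta_t, t):
    # p_t = min{1, O((1/delta^2 + 1/delta) * log(t)/t)}; O-constant assumed 1.
    if delta_t <= 0.0:
        return 1.0  # no estimated difference: always query the label
    bound = (1.0 / delta_t ** 2 + 1.0 / delta_t) * math.log(t) / t
    return min(1.0, bound)

# Small estimated error difference -> query almost surely; large -> query rarely.
```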
Estimating Difference in Error Rates
Suppose x_t doesn't have the label preferred by w_t.

Let y_a = -\operatorname{sign}(w_t^\top x_t).

Find i_t such that w_{t+1} = w' after an (x_t, y_a, i_t) update.

For example, for logistic loss with invariant updates:

i_t = \frac{1 - e^{y_a w_t^\top x_t} - y_a w_t^\top x_t}{\eta_t}

Then \Delta_t \approx i_t / t.
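The i_t formula can be sketched directly (my own illustration; the margin m = y_a w_t^\top x_t is nonpositive by construction, since y_a is the label the current predictor disagrees with):

```python
import math

def active_importance(margin, eta):
    # i_t = (1 - exp(y_a w^T x) - y_a w^T x) / eta, with margin = y_a w^T x <= 0.
    return (1.0 - math.exp(margin) - margin) / eta

i_t = active_importance(-0.7, 0.1)  # positive weight needed to flip the prediction
delta_t = i_t / 1000                # estimated error-rate difference at t = 1000
```

A margin of exactly zero gives i_t = 0: an example already on the decision boundary needs no weight to flip.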
Active Learning Results
[Figure: test error vs. fraction of labels queried (0 to 1) on four datasets: astrophysics (error ~0.03 to 0.1), rcv1 (~0.05 to 0.1), spam (~0.02 to 0.1), and webspam (~0 to 0.1); each panel compares invariant, implicit, gradient multiplication, and passive.]
How fast is it?
Demo on RCV1 (~780K docs, 77 features/doc) ...

As fast as (passive) online gradient descent:
- Active learning takes 2.6 sec
- Passive online gradient descent takes 2.5 sec
- 91K queries (11%)
Conclusions
New updates from first principles. You should use them because they:
- work well, even for i = 1
- are robust w.r.t. the learning rate
- are fast (closed form)
- have useful properties (invariance, safety)
- are simple to code

Check out the implementation in Vowpal Wabbit:
http://github.com/JohnLangford/vowpal_wabbit