Regret to the Best vs.
Regret to the Average
Eyal Even-Dar, Michael Kearns, Yishay Mansour, Jennifer Wortman
UPenn + Tel Aviv Univ. Slides: Csaba
Motivation
Expert algorithms are designed to control the regret to the return of the best expert.
What about the regret to the average return? The generic analysis gives the same $O(\sqrt{T})$ bound. Isn't that weak?
EW: $w_{i,1}=1$, $w_{i,t}=w_{i,t-1}\,e^{\eta g_{i,t}}$, $p_{i,t}=w_{i,t}/W_t$, $W_t=\sum_i w_{i,t}$

E1: 1 0 1 0 1 0 1 0 1 0 …
E2: 0 1 0 1 0 1 0 1 0 1 …

$G_{A,T}=T/2-c\sqrt{T}$, while $G^+_T = G^-_T = G^0_T = T/2$

$R^+_T \le c\sqrt{T}$ and $R^0_T \le c\sqrt{T}$; on this sequence both are exactly $c\sqrt{T}$, so EW really does lose $\Theta(\sqrt{T})$ to the average.
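To see the effect concretely, here is a minimal simulation sketch of EW on the alternating sequence above; the horizon and the usual $\eta \sim 1/\sqrt{T}$ tuning are our choices, not from the slides. The gap to the average comes out on the order of $\sqrt{T}$:

```python
import numpy as np

# EW on the alternating example: expert 1 gains on odd steps, expert 2
# on even steps. Horizon and step size are our choices.
T = 10_000
eta = 1.0 / np.sqrt(T)

G = np.zeros(2)                      # cumulative gains G_{i,t}
gain_A = 0.0                         # cumulative gain G_{A,T} of EW
for t in range(T):
    w = np.exp(eta * G)              # w_{i,t} = e^{eta G_{i,t}}
    p = w / w.sum()                  # p_{i,t} = w_{i,t} / W_t
    g = np.array([1.0, 0.0]) if t % 2 == 0 else np.array([0.0, 1.0])
    gain_A += p @ g                  # g_{A,t} = sum_i p_{i,t} g_{i,t}
    G += g

print(f"G^0_T = {G.mean():.1f}, G_A,T = {gain_A:.1f}, "
      f"gap = {G.mean() - gain_A:.1f} (sqrt(T) = {np.sqrt(T):.0f})")
```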
Notation - gains
$g_{i,t} \in [0,1]$ – gains
$g=(g_{i,t})$ – sequence of gains
$G_{i,T}(g)=\sum_{t=1}^{T} g_{i,t}$ – cumulative gain
$G^0_T(g)=\frac{1}{N}\sum_i G_{i,T}(g)$ – average gain
$G^-_T(g)=\min_i G_{i,T}(g)$ – worst gain
$G^+_T(g)=\max_i G_{i,T}(g)$ – best gain
$G^D_T(g)=\sum_i D_i\,G_{i,T}(g)$ – weighted average gain ($D$ a distribution over experts)
Notation - algorithms
$w_{i,t}$ – unnormalized weights
$p_{i,t}=w_{i,t}/W_t$, $W_t=\sum_i w_{i,t}$ – normalized weights
$g_{A,t}=\sum_i p_{i,t}\,g_{i,t}$ – gain of $A$ at time $t$
$G_{A,T}(g)=\sum_t g_{A,t}$ – cumulative gain of $A$
Notation - regret
Regret to the…
$R^+_T(g) = \max\{G^+_T(g) - G_{A,T}(g),\,1\}$ – best
$R^-_T(g) = \max\{G^-_T(g) - G_{A,T}(g),\,1\}$ – worst
$R^0_T(g) = \max\{G^0_T(g) - G_{A,T}(g),\,1\}$ – average
$R^D_T(g) = \max\{G^D_T(g) - G_{A,T}(g),\,1\}$ – distribution $D$
(Regrets are floored at 1 so that the products of regrets appearing below are meaningful.)
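For concreteness, a small sketch (function and variable names are ours) that computes these gain and regret quantities from a $T\times N$ gain matrix and the algorithm's weight matrix:

```python
import numpy as np

def regrets(g, p, D=None):
    """g: (T, N) expert gains in [0, 1]; p: (T, N) the algorithm's
    weights p_{i,t}; D: optional distribution over the N experts."""
    G = g.sum(axis=0)                     # G_{i,T}
    G_A = (p * g).sum()                   # G_{A,T} = sum_t p_t . g_t
    floor = lambda x: max(x, 1.0)         # regrets are floored at 1
    out = {"R+": floor(G.max() - G_A),    # to the best
           "R-": floor(G.min() - G_A),    # to the worst
           "R0": floor(G.mean() - G_A)}   # to the average
    if D is not None:
        out["RD"] = floor(D @ G - G_A)    # to the distribution D
    return out
```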
Goal
Algorithm $A$ is “nice” if
$R^+_{A,T} \le O(T^{1/2})$ and $R^0_{A,T} \le 1$.

Program:
Examine existing algorithms (“difference algorithms”): a lower bound
Show “nice” algorithms
Show that no substantial further improvement is possible
“Difference” algorithms
Def: $A$ is a difference algorithm if, for $N=2$ and $g_{i,t}\in\{0,1\}$, $p_{1,t} = f(d_t)$, $p_{2,t} = 1-f(d_t)$, where $d_t = G_{1,t}-G_{2,t}$.

Examples:
EW: $w_{i,t} = e^{\eta G_{i,t}}$
FPL: choose $\arg\max_i \left(G_{i,t}+Z_{i,t}\right)$
Prod: $w_{i,t} = \prod_s (1+\eta g_{i,s}) = (1+\eta)^{G_{i,t}}$
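To see why EW and Prod fit the definition, note that for $N=2$ their normalized weights depend on the history only through $d_t$. A sketch of the corresponding $f$'s ($\eta$ is a free parameter of our choosing):

```python
import numpy as np

# For N = 2 and binary gains, EW's and Prod's normalized weights are
# functions of d_t = G_{1,t} - G_{2,t} alone:
def f_ew(d, eta=0.1):
    # p_1 = e^{eta G_1} / (e^{eta G_1} + e^{eta G_2})
    return 1.0 / (1.0 + np.exp(-eta * d))

def f_prod(d, eta=0.1):
    # p_1 = (1+eta)^{G_1} / ((1+eta)^{G_1} + (1+eta)^{G_2})
    return 1.0 / (1.0 + (1.0 + eta) ** (-d))

# FPL satisfies the definition in expectation over the perturbations Z.
```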
A lower bound for difference algorithms
Theorem: If $A$ is a difference algorithm, then there exist gain sequences $g, g'$ (tuned to $A$) such that
$R^+_{A,T}(g)\,R^0_{A,T}(g') \ge R^+_{A,T}(g)\,R^-_{A,T}(g') = \Omega(T)$.

Writing $R^+_{A,T} = \max_g R^+_{A,T}(g)$, $R^-_{A,T} = \max_g R^-_{A,T}(g)$, $R^0_{A,T} = \max_g R^0_{A,T}(g)$, this gives
$R^+_{A,T}\,R^0_{A,T} \ge R^+_{A,T}\,R^-_{A,T} = \Omega(T)$.
Proof
Assume $T$ is even and $p_{1,1} \le 1/2$.
$g$: expert 1 gains 1 every step, expert 2 gains 0:
E1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 …
E2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …
Let $\tau$ be the first time $t$ at which $p_{1,t} \ge 2/3$. Before $\tau$, $A$ keeps weight $\ge 1/3$ on expert 2, so $R^+_{A,T}(g) \ge \tau/3$.
Since $p_{1,t}$ climbs from $\le 1/2$ to $\ge 2/3$ within $\tau$ steps, by pigeonhole $\exists\,\ell \in \{2,3,\dots,\tau\}$ s.t. $p_{1,\ell}-p_{1,\ell-1} \ge 1/(6\tau)$.
Proof/2
Recall $p_{1,\ell}-p_{1,\ell-1} \ge 1/(6\tau)$.
$g'$: expert 1 gains 1 on the first $\ell$ steps, the experts then alternate for $T-2\ell$ steps, and the final $\ell$ steps mirror the first ones (expert 2 gains 1):
E1: 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0
E2: 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1
Then $G^+_T = G^-_T = G^0_T = T/2$.
During the alternation, $d_t$ oscillates between $d_\ell$ and $d_{\ell-1}$, so $p_{1,t}$ alternates between $p_{1,\ell}$ and $p_{1,\ell-1}$: each pair of steps yields $A$ at most $1-(p_{1,\ell}-p_{1,\ell-1}) \le 1-1/(6\tau)$, while every expert gains exactly 1.
By the difference property, $p_{1,t}=p_{1,T-t}$ on the mirrored tail, so each matched first/last pair of steps yields $A$ exactly $p_{1,t}+(1-p_{1,t})=1$.
Hence $G_{A,T}(g') \le \ell + \frac{T-2\ell}{2}\left(1-\frac{1}{6\tau}\right)$, so $R^-_{A,T}(g') \ge \frac{T-2\ell}{12\tau}$, and therefore $R^+_{A,T}(g)\,R^-_{A,T}(g') \ge \frac{T-2\ell}{36}$.
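The construction can be checked numerically. The sketch below instantiates $A$ as EW (our choice of $\eta$ and $T$; $f$ is EW's difference function from the earlier sketch) and evaluates the bounds from this slide:

```python
import numpy as np

# Numerical check of the lower-bound construction for A = EW.
eta, T = 0.05, 2000
f = lambda d: 1.0 / (1.0 + np.exp(-eta * d))

# On g (expert 1 always gains 1), d_t = t - 1, so p_{1,t} = f(t - 1).
p1 = f(np.arange(T, dtype=float))
tau = int(np.argmax(p1 >= 2 / 3)) + 1       # first t with p_{1,t} >= 2/3
R_best_g = (1 - p1[:tau]).sum()             # regret to best on g, ~ tau/3

# Pigeonhole: some step l <= tau with p_{1,l} - p_{1,l-1} >= 1/(6 tau).
ell = int(np.argmax(np.diff(p1[:tau]))) + 1
gap = p1[ell] - p1[ell - 1]                 # >= 1/(6 tau)

# On g', A gains at most 1 per matched first/last pair (ell pairs) and
# at most 1 - gap per alternating pair ((T - 2 ell)/2 pairs), so:
G_A_upper = ell + (T - 2 * ell) / 2 * (1 - gap)
R_worst_g2 = T / 2 - G_A_upper              # >= (T - 2 ell) / (12 tau)
print(f"tau={tau}, ell={ell}, product/T = {R_best_g * R_worst_g2 / T:.3f}")
```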
Tightness
We know that for difference algorithms
$R^+_{A,T}\,R^0_{A,T} \ge R^+_{A,T}\,R^-_{A,T} = \Omega(T)$.
Can a (difference) algorithm achieve this tradeoff?
Theorem: EW = EW($\eta$), with appropriately tuned $\eta = \eta(\alpha)$, $0 \le \alpha \le 1/2$, has
$R^+_{EW,T} \le T^{1/2+\alpha}(1+\ln N)$
$R^0_{EW,T} \le T^{1/2-\alpha}$
Breaking the frontier
What's wrong with the difference algorithms? They are designed to find the best expert quickly, with low regret, but they pay no attention to the average gain and how it compares with the best gain.
BestWorst(A)
$G^+_T-G^-_T$: the spread of the cumulative gains.
Idea: stay with the average until the spread becomes large; then switch to learning (using algorithm $A$).
Once the spread is large enough, $G^0_T = G_{BW(A),T} \gg G^-_T$, so there is “nothing” to lose by switching.
Spread threshold: $NR$, where $R=R_{T,N}$ is a bound on the regret of $A$ to the best expert.
Theorem: $R^+_{BW(A),T} = O(NR)$ and $G_{BW(A),T} \ge G^-_T$.
Proof: At the time of the switch, $G_{BW(A)} = G^0 \ge (G^+ + (N-1)G^-)/N$. Since $G^+ \ge G^- + NR$, this gives $G_{BW(A)} \ge G^- + R$; this slack of $R$ then absorbs $A$'s regret after the switch.
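A runnable sketch of BestWorst(A). The expert-algorithm interface (reset/weights/update) and the minimal EW stand-in are our assumptions, not from the paper:

```python
import numpy as np

class EW:
    """Minimal exponential-weights algorithm, our stand-in for A."""
    def __init__(self, N, eta):
        self.N, self.eta = N, eta
        self.reset()
    def reset(self):
        self.G = np.zeros(self.N)
    def weights(self):
        w = np.exp(self.eta * (self.G - self.G.max()))   # stabilized
        return w / w.sum()
    def update(self, g):
        self.G += g

def best_worst(A, gains, R):
    """Play the uniform average until the spread G+ - G- reaches N*R,
    then hand over to A (assumed to have regret at most R to the best)."""
    T, N = gains.shape
    G, total, switched = np.zeros(N), 0.0, False
    for t in range(T):
        if not switched and G.max() - G.min() >= N * R:
            switched = True                  # spread is large: start learning
        p = A.weights() if switched else np.full(N, 1.0 / N)
        total += p @ gains[t]
        G += gains[t]
        if switched:
            A.update(gains[t])
    return total
```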
PhasedAggression(A, R, D)

for k = 1 : log2(R) do
    eps := 2^(k-1)/R
    A.reset(); s := 0                  // local time, new phase
    while (G+_s - GD_s < 2R) do
        q_s := A.getNormedWeights(g_{s-1})
        p_s := eps q_s + (1-eps) D
        s := s + 1
    end
end
A.reset(); run A until time T
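In Python, assuming the same expert-algorithm interface (reset/weights/update) as in the BestWorst sketch above, this becomes:

```python
import numpy as np

def phased_aggression(A, gains, R, D):
    """PA(A, R, D): phase k mixes A's weights with weight eps = 2^{k-1}/R
    into the fixed distribution D; a phase ends when the best expert is
    2R ahead of the D-average of the gains seen in that phase."""
    T, N = gains.shape
    total, t = 0.0, 0
    for k in range(1, int(np.log2(R)) + 1):
        eps = 2 ** (k - 1) / R
        A.reset()
        G = np.zeros(N)                    # gains since the phase began
        while t < T and G.max() - D @ G < 2 * R:
            p = eps * A.weights() + (1 - eps) * D
            total += p @ gains[t]
            G += gains[t]
            A.update(gains[t])
            t += 1
    A.reset()                              # last phase: run A unmixed
    while t < T:
        total += A.weights() @ gains[t]
        A.update(gains[t])
        t += 1
    return total
```

With the uniform $D$, $G^D$ is the average gain, so the guarantee $R^D_{PA,T}\le 1$ becomes constant regret to the average.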
PhasedAggression(A, R, D) – Theorem

Theorem: Let $A$ be any algorithm with regret $R = R_{T,N}$ to the best expert, and let $D$ be any distribution. Then, for PA = PA($A,R,D$),
$R^+_{PA,T} \le 2R(\log R+1)$
$R^D_{PA,T} \le 1$
Proof: Consider local time $s$ during phase $k$; $D$ and $A$ share the gains and the regret.
During the phase,
$G^+_s - G_{PA,s} < \frac{2^{k-1}}{R}\cdot R + \left(1-\frac{2^{k-1}}{R}\right)\cdot 2R < 2R$
$G^D_s - G_{PA,s} \le \frac{2^{k-1}}{R}\cdot R = 2^{k-1}$
What happens at the end of the phase? There $G^+_s - G^D_s \ge 2R$, so
$G_{PA,s} - G^D_s \ge \frac{2^{k-1}}{R}\left(G^+_s - G^D_s - R\right) \ge \frac{2^{k-1}}{R}\cdot R = 2^{k-1}$.
What if PA ends in phase $k$ at time $T$? Each phase contributes less than $2R$ to the regret to the best, so
$G^+_T - G_{PA,T} \le 2Rk \le 2R(\log R+1)$,
while the surpluses of $2^{j-1}$ banked at the ends of the earlier phases offset the current phase's deficit:
$G^D_T - G_{PA,T} \le 2^{k-1} - \sum_{j=1}^{k-1} 2^{j-1} = 2^{k-1} - (2^{k-1}-1) = 1$.
General lower bounds
Theorem:
$R^+_{A,T}=O(T^{1/2}) \;\Rightarrow\; R^0_{A,T}=\Omega(T^{1/2})$
$R^+_{A,T} \le (T\log T)^{1/2}/10 \;\Rightarrow\; R^0_{A,T}=\Omega(T^{\alpha})$, where $\alpha \ge 0.02$

Compare this with
$R^+_{PA,T} \le 2R(\log R+1)$, $R^D_{PA,T} \le 1$, where $R=(T\log N)^{1/2}$.
Conclusions
Achieving constant regret to the average is a reasonable goal.
“Classical” (difference) algorithms do not have this property; they satisfy $R^+_{A,T}\,R^0_{A,T} = \Omega(T)$.
Modification: learn only when it makes sense, i.e. when the best is much better than the average.
PhasedAggression achieves the optimal tradeoff.
Open question: can we remove the dependence on $T$?