Improved Algorithms for Linear Stochastic Bandits
TRANSCRIPT

Outline: The Linear Bandit problem · Optimism in Face of Uncertainty · A Self-Normalized Tail Inequality for Vector-Valued Martingales · Regret of the OFUL algorithm

Improved Algorithms for Linear Stochastic Bandits
a NIPS article by Yasin Abbasi-Yadkori, David Pál, and Csaba Szepesvári

Emilie Kaufmann
SMILE NIPS'11 Debrief session, February 16th 2012
The linear bandit problem

Setting:
- a set of arms D ⊂ R^d
- an unknown parameter θ∗ ∈ R^d
- the (unknown) best arm x∗ = argmax_{x∈D} 〈x | θ∗〉

At time t, pick an arm X_t ∈ D and observe the reward:

    Y_t = 〈X_t | θ∗〉 + η_t

where the noise η_t is centered (E[η_t | F_{t−1}] = 0) and subgaussian:

    E[e^{λη_t} | F_{t−1}] ≤ exp(λ²/2)    (1)

with F_t = σ(X_1, ..., X_t, X_{t+1}, η_1, ..., η_t).
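As a concrete illustration of this setting (not from the paper), here is a minimal simulation with a hypothetical finite arm set and standard Gaussian noise, which satisfies condition (1):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
# Hypothetical finite arm set D ⊂ R^d (unit-norm rows) and unknown parameter θ*
D = rng.normal(size=(20, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)
theta_star = np.array([0.8, -0.3, 0.5])

def pull(x):
    """Observe Y = <x | θ*> + η with η ~ N(0, 1), which is 1-subgaussian."""
    return x @ theta_star + rng.normal()

best = D[np.argmax(D @ theta_star)]  # the (unknown) best arm x*
reward = pull(best)
```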
The linear bandit problem
Goal: Design an algorithm (ie a sequential choice of the arms)minimizing the regret :
Rn =n∑
t=1
〈x∗ | θ∗〉 −n∑
t=1
〈Xt | θ∗〉 =n∑
t=1
〈x∗ − Xt | θ∗〉
In the article: a regret bound for the regret Rn of the OFULalgorithm, holding with high probability
Some applications:
D ⊂ {0, 1}d : online shortest path problem
D is finite : marketing ($$)
A special case: the multiarmed bandit problem

The multiarmed bandit problem in the Gaussian case:
- K independent arms
- arm j yields an i.i.d. sequence of N(µ_j, 1) rewards (known variance)
- if arm I_t is chosen, the reward is drawn from N(µ_{I_t}, 1)

Why does it fit our setting?
- D = {e_1, ..., e_K}, the canonical basis of R^K
- θ∗ = (µ_1, ..., µ_K)^T
- if arm I_t is drawn at time t, we get the reward Y_t = 〈e_{I_t} | θ∗〉 + η_t, with (η_t) an i.i.d. sequence of N(0, 1)

⇒ Generalize the UCB optimistic heuristic to the linear case
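The embedding above is immediate to write down (an illustrative sketch with hypothetical means µ_j):

```python
import numpy as np

rng = np.random.default_rng(1)

K = 4
mu = np.array([0.1, 0.5, 0.3, 0.9])   # hypothetical arm means
D = np.eye(K)                          # arms = canonical basis e_1, ..., e_K
theta_star = mu                        # θ* stacks the means

def pull(j):
    """Drawing arm j in the MAB is <e_j | θ*> + η with η ~ N(0, 1)."""
    return D[j] @ theta_star + rng.normal()

# The linear-bandit mean reward of arm e_j is exactly µ_j:
assert np.allclose(D @ theta_star, mu)
```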
Some ideas to solve the problem

With X_{1:t} = (X_1^T; ...; X_t^T) ∈ R^{t×d}, Y_{1:t} = (Y_1, ..., Y_t)^T and η_{1:t} = (η_1, ..., η_t)^T ∈ R^t:

    Y_{1:t} = X_{1:t} θ∗ + η_{1:t}

First idea: use a regularized least-squares estimate of θ∗:

    θ̂_t = (V_t)^{−1} X_{1:t}^T Y_{1:t},   where V_t = λI + X_{1:t}^T X_{1:t}

and choose X_{t+1} = argmax_{x∈D} 〈x | θ̂_t〉

⇒ Exploitation only!
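The regularized least-squares step is a few lines of linear algebra (a minimal sketch, assuming NumPy; `lam` stands for λ and `ridge_estimate` is a hypothetical helper name):

```python
import numpy as np

def ridge_estimate(X, Y, lam=1.0):
    """Regularized least squares: θ̂ = (λI + XᵀX)⁻¹ XᵀY; also returns V = λI + XᵀX."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X
    return np.linalg.solve(V, X.T @ Y), V

# Toy check on noiseless data: the estimate approaches θ*.
rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
Y = X @ theta_star
theta_hat, V = ridge_estimate(X, Y, lam=0.1)
```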
Optimism in Face of Uncertainty

A generic optimistic algorithm for the linear bandit:
- Build a confidence ellipsoid around θ̂_t:

      C_t(δ) = { θ ∈ R^d : ||θ − θ̂_t||_{V_t} ≤ β_t(δ) }

  such that P(∀t ∈ N, θ∗ ∈ C_t(δ)) ≥ 1 − δ
- At time t, choose X_t such that:

      (X_t, θ̃_t) = argmax_{(x,θ) ∈ D×C_{t−1}(δ)} 〈x | θ〉
Optimism in Face of Uncertainty

The previous optimization problem rewrites:

    X_t = argmax_{x∈D} 〈x | θ̂_t〉 + β_t(δ) ||x||_{V_t^{−1}}

For the multiarmed bandit case:
- V_t = Diag(N_t(1), ..., N_t(K)), where N_t(j) is the number of draws of arm j up to time t
- denote by R_t(j) the sum of rewards gathered from arm j up to time t

The arm chosen at time t is the one with the highest index:

    R_t(j)/N_t(j) + β_t(δ)/√N_t(j)

⇒ β_t(δ) is an exploration rate... and should be logarithmic!
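For a finite arm set the optimistic choice is a one-liner of linear algebra; the sketch below (hypothetical code, not the authors' implementation) also checks the reduction to the UCB-style index in the MAB case:

```python
import numpy as np

def oful_choice(D, theta_hat, V, beta):
    """Optimistic arm: argmax over rows x of D of <x | θ̂> + β ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    exploit = D @ theta_hat
    explore = beta * np.sqrt(np.einsum('ij,jk,ik->i', D, V_inv, D))  # β ||x||_{V^{-1}}
    return int(np.argmax(exploit + explore))

# MAB sanity check: with D the canonical basis and V = Diag(N_t(j)),
# the index is R_t(j)/N_t(j) + β/√N_t(j).
N = np.array([4.0, 9.0])    # hypothetical draw counts N_t(j)
R = np.array([2.0, 3.0])    # hypothetical reward sums R_t(j)
j = oful_choice(np.eye(2), R / N, np.diag(N), beta=1.0)
```

Here arm 0 has index 2/4 + 1/√4 = 1.0 and arm 1 has index 3/9 + 1/√9 ≈ 0.67, so the optimistic choice is arm 0.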
Building the confidence region

Previously (in a non-regularized setting):
- Dani, Hayes, Kakade 2008:

      β_t(δ) = √max( 128 d ln(t) ln(t²/δ), (64/9) ln²(t²/δ) )

- Rusmevichientong, Tsitsiklis 2009:

      β_t(δ) = K √log(t) √( d log(t) + log(t²/δ) )

Empirically, a smaller confidence region seems to work!

In the article:
- a new (elegant) technique for building a smaller confidence region
- better empirical performance!
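To get a feel for the magnitudes, the two displayed rates can be evaluated numerically (an illustrative sketch; the constant K of Rusmevichientong and Tsitsiklis is set to 1 arbitrarily):

```python
import numpy as np

def beta_dani_hayes_kakade(t, d, delta):
    """β_t(δ) = sqrt(max(128 d ln(t) ln(t²/δ), (64/9) ln²(t²/δ)))."""
    return np.sqrt(max(128 * d * np.log(t) * np.log(t**2 / delta),
                       (64 / 9) * np.log(t**2 / delta) ** 2))

def beta_rusmevichientong_tsitsiklis(t, d, delta, K=1.0):
    """β_t(δ) = K sqrt(log t) sqrt(d log t + log(t²/δ))."""
    return K * np.sqrt(np.log(t)) * np.sqrt(d * np.log(t) + np.log(t**2 / delta))

t, d, delta = 10_000, 5, 0.05
b_dhk = beta_dani_hayes_kakade(t, d, delta)
b_rt = beta_rusmevichientong_tsitsiklis(t, d, delta)
# Both grow polylogarithmically in t; the paper replaces these explicit
# rates by a data-dependent (typically much smaller) quantity.
```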
A self-normalized inequality for a vector-valued martingale

Finding the martingale

Let S_t = X_{1:t}^T η_{1:t} = Σ_{k=1}^t η_k X_k. Then S_t is an F_t-martingale and:

    ∀x ∈ R^d, |x^T(θ̂_t − θ∗)| ≤ ||x||_{V_t^{−1}} ( ||S_t||_{V_t^{−1}} + √λ ||θ∗|| )

Therefore, finding β_t(δ) such that

    P( ∀t ∈ N, ||S_t||_{V_t^{−1}} ≤ β_t(δ) ) ≥ 1 − δ

leads to:

    P( ∀t ∈ N, ||θ̂_t − θ∗||_{V_t} ≤ β_t(δ) + √λ ||θ∗|| ) ≥ 1 − δ

For the bandit case:

    ||S_t||²_{V_t^{−1}} = Σ_{j=1}^K (R_t(j) − µ_j N_t(j))² / (λ + N_t(j))
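The deviation bound above follows from Cauchy-Schwarz in the V_t^{−1} inner product together with V_t ⪰ λI; it holds deterministically, as can be checked on simulated data (illustrative code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 3, 50, 1.0
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))          # rows X_1, ..., X_n
eta = rng.normal(size=n)             # noise
Y = X @ theta_star + eta

V = lam * np.eye(d) + X.T @ X        # V_n
S = X.T @ eta                        # S_n = sum_k eta_k X_k
theta_hat = np.linalg.solve(V, X.T @ Y)
V_inv = np.linalg.inv(V)

def v_norm(v, M):
    return np.sqrt(v @ M @ v)

# |x'(θ̂ - θ*)| ≤ ||x||_{V⁻¹} (||S||_{V⁻¹} + √λ ||θ*||) for every direction x:
ratios = []
for _ in range(100):
    x = rng.normal(size=d)
    lhs = abs(x @ (theta_hat - theta_star))
    rhs = v_norm(x, V_inv) * (v_norm(S, V_inv)
                              + np.sqrt(lam) * np.linalg.norm(theta_star))
    ratios.append(lhs / rhs)
max_ratio = max(ratios)
```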
A self-normalized inequality for a vector-valued martingale

Bounding its tail

Theorem
Let (X_t) be an R^d-valued and (η_t) a real-valued stochastic process such that X_t is F_{t−1}-measurable and η_t is F_t-measurable and satisfies (1). Let

    V_t = V + Σ_{s=1}^t X_s X_s^T   and   S_t = Σ_{s=1}^t η_s X_s

Then, for any δ > 0, with probability at least 1 − δ:

    ∀t ≥ 0, ||S_t||²_{V_t^{−1}} ≤ 2 log( det(V_t)^{1/2} det(V)^{−1/2} / δ )
Generalizes a technique by de la Peña (2004).
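The theorem can be probed by Monte Carlo (an illustrative sketch with V = λI, standard Gaussian noise, and arbitrarily chosen parameters): the empirical frequency of an any-time violation should stay below δ.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, lam, delta = 2, 200, 1.0, 0.1
runs, failures = 200, 0

for _ in range(runs):
    V = lam * np.eye(d)                 # V = λI, so det(V)^{-1/2} = λ^{-d/2}
    S = np.zeros(d)
    violated = False
    for t in range(T):
        x = rng.normal(size=d)          # F_{t-1}-measurable action
        eta = rng.normal()              # 1-subgaussian noise, F_t-measurable
        V += np.outer(x, x)
        S += eta * x
        lhs = S @ np.linalg.solve(V, S)                  # ||S_t||²_{V_t^{-1}}
        rhs = 2 * np.log(np.sqrt(np.linalg.det(V)) * lam ** (-d / 2) / delta)
        if lhs > rhs:
            violated = True
    failures += violated
# The theorem guarantees P(violation at some t) ≤ δ.
```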
Sketch of the proof (1/2)

For all λ ∈ R^d,

    M_t^λ = exp( 〈λ | S_t〉 − (1/2) ||λ||²_{V_t − V} )

is a supermartingale such that, for every stopping time τ, E[M_τ^λ] ≤ 1.

Average over λ by setting a 'prior' Λ ∼ N(0, V^{−1}):
- on the one hand, E[M_t(Λ)] ≤ 1
- on the other hand, E[M_t(Λ)] = E[ E[M_t(Λ) | F_∞] ] and

      E[M_t(Λ) | F_∞] = ∫_{R^d} exp( 〈λ | S_t〉 − (1/2) ||λ||²_{V_t − V} ) f(λ) dλ

  where f denotes the pdf of N(0, V^{−1}).
Sketch of the proof (2/2)

A direct calculation shows that

    E[M_t(Λ) | F_∞] = ( det(V) / det(V_t) )^{1/2} exp( (1/2) ||S_t||²_{V_t^{−1}} )

Then E[ ( det(V)/det(V_t) )^{1/2} exp( (1/2) ||S_t||²_{V_t^{−1}} ) ] ≤ 1.

Using Markov's inequality yields:

    P( ||S_t||²_{V_t^{−1}} > 2 log( det(V_t)^{1/2} / (δ det(V)^{1/2}) ) )
      = P( exp( (1/2) ||S_t||²_{V_t^{−1}} ) > det(V_t)^{1/2} / (δ det(V)^{1/2}) ) ≤ δ

For a bound uniform in t: use stopping times.
Final result for the regret

The OFUL algorithm

Finally we use a confidence region based on the exploration rate

    β_t(δ) = √( 2 log( det(V_t)^{1/2} λ^{−d/2} / δ ) ) + √λ S

Theorem
Assume that ||θ∗|| ≤ S and, for all x ∈ D, ||x|| ≤ L and 〈x | θ∗〉 ∈ [−1, 1]. Then with probability at least 1 − δ, the regret of the OFUL algorithm satisfies, for all n ≥ 0:

    R_n ≤ 4 √( n d log(λ + nL/d) ) ( √λ S + √( 2 log(1/δ) + d log(1 + nL/(λd)) ) )
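Putting the pieces together, here is a hypothetical end-to-end OFUL loop on a toy instance (an illustrative sketch under the assumptions above, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy instance: finite unit-norm arm set (so L = 1), Gaussian noise.
d, n_arms, T, lam, delta, S = 3, 10, 500, 1.0, 0.05, 2.0
theta_star = rng.normal(size=d)
theta_star *= min(1.0, S / np.linalg.norm(theta_star))   # enforce ||θ*|| ≤ S
D = rng.normal(size=(n_arms, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

V = lam * np.eye(d)          # V_0 = λI
XtY = np.zeros(d)            # running X_{1:t}^T Y_{1:t}
regret = 0.0
best_value = np.max(D @ theta_star)

for t in range(1, T + 1):
    theta_hat = np.linalg.solve(V, XtY)
    # Exploration rate β_t(δ) = sqrt(2 log(det(V_t)^{1/2} λ^{-d/2} / δ)) + √λ S
    beta = (np.sqrt(2 * np.log(np.sqrt(np.linalg.det(V)) * lam ** (-d / 2) / delta))
            + np.sqrt(lam) * S)
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum('ij,jk,ik->i', D, V_inv, D))
    x = D[np.argmax(D @ theta_hat + beta * widths)]   # optimistic arm X_t
    y = x @ theta_star + rng.normal()                 # observed reward Y_t
    V += np.outer(x, x)
    XtY += y * x
    regret += best_value - x @ theta_star             # pseudo-regret increment
```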
Discussion

- Optimality? Here R_n = O(d√n) w.h.p.
  - lower bound Ω(√(dn)) on the worst-case regret for the multiarmed bandit
  - lower bound Ω(d√n) for the linear bandit?
- regularized versus non-regularized approaches
- empirical performance