
Linear Bandits: From Theory to Applications

Claire Vernade

DeepMind – Foundations Team

Credits: Csaba Szepesvári and Tor Lattimore, for their blog.


Sequential Decision Making


Real World Sequential Decision Making


Table of contents

1. Linear Bandits

2. Real-World Setting: Delayed Feedback


Linear Bandits


Linear Bandits

1. In round $t$, observe the action set $\mathcal{A}_t \subset \mathbb{R}^d$.

2. The learner chooses $A_t \in \mathcal{A}_t$ and receives $X_t$, satisfying
$$\mathbb{E}[X_t \mid \mathcal{A}_1, A_1, \dots, \mathcal{A}_t, A_t] = \langle A_t, \theta_* \rangle =: f_{\theta_*}(A_t)$$
for some unknown $\theta_*$.

3. Light-tailed noise: $X_t - \langle A_t, \theta_* \rangle = \eta_t \sim \mathcal{N}(0, 1)$.

Goal: keep the regret
$$R_n = \mathbb{E}\left[\sum_{t=1}^{n} \left( \max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle - X_t \right)\right]$$
small.


Real-World setting

Typical setting: a user, represented by a feature vector $u_t$, shows up, and we have a finite set of (correlated) actions $(a_1, \dots, a_K)$.

Some function $\Phi$ joins these vectors pairwise to create a contextualized action set:
$$\forall i \in [K], \quad \Phi(u_t, a_i) = a_{t,i} \in \mathbb{R}^d, \qquad \mathcal{A}_t = \{a_{t,1}, \dots, a_{t,K}\}.$$

No assumption needs to be made on the joining function $\Phi$: the bandit takes over the decision step once the contextualized action set is formed. So the setting is equivalent to $\mathcal{A}_t \sim \Pi(\mathbb{R}^d)$ for some arbitrary distribution, or to $\mathcal{A}_1, \dots, \mathcal{A}_n$ being fixed arbitrarily by the environment.
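In code, the construction is one line per action; a minimal sketch (the concatenation choice for $\Phi$ and all names are illustrative assumptions, not part of the model):

```python
import numpy as np

def phi(u, a):
    """Hypothetical joining function: a simple concatenation of user and
    action features. Any fixed map (outer product, learned embedding, ...)
    would do, since the bandit only sees the resulting vectors."""
    return np.concatenate([u, a])

def contextualized_action_set(u_t, actions):
    """Build A_t = {phi(u_t, a_1), ..., phi(u_t, a_K)} for the current user."""
    return np.stack([phi(u_t, a) for a in actions])

# Toy usage: K = 3 actions and one user, all with 2 raw features (d = 4).
rng = np.random.default_rng(0)
u_t = rng.normal(size=2)
actions = [rng.normal(size=2) for _ in range(3)]
A_t = contextualized_action_set(u_t, actions)  # shape (K, d) = (3, 4)
```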


Toolbox of the optimist

Say the reward in round $t$ is $X_t$ and the action in round $t$ is $A_t \in \mathbb{R}^d$:
$$X_t = \langle A_t, \theta_* \rangle + \eta_t.$$

We want to estimate $\theta_*$ with the regularized least-squares estimator:
$$\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} A_s X_s, \qquad V_0 = \lambda I, \quad V_t = V_0 + \sum_{s=1}^{t} A_s A_s^\top.$$

Choice of confidence regions (ellipsoids) $C_t$:
$$C_t := \left\{ \theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_{t-1}\|^2_{V_{t-1}} \le \beta_t \right\},$$
where, for $A$ positive definite, $\|x\|^2_A = x^\top A x$.
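A minimal sketch of this toolbox (the class name and interface are my own; rank-one updates keep each round at $O(d^2)$):

```python
import numpy as np

class RLSEstimator:
    """Regularized least squares: V_0 = lam * I, V_t = V_{t-1} + A_t A_t^T,
    b_t = b_{t-1} + A_t X_t, and theta_hat_t = V_t^{-1} b_t."""

    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)   # V_0 = lambda * I
        self.b = np.zeros(d)

    def update(self, a, x):
        self.V += np.outer(a, a)   # rank-one design update
        self.b += x * a

    def theta_hat(self):
        # Solve V theta = b rather than forming the inverse explicitly.
        return np.linalg.solve(self.V, self.b)

    def ellipsoid_distance(self, theta):
        """||theta - theta_hat||^2_{V_t}: theta lies in the ellipsoid
        iff this value is <= beta."""
        diff = theta - self.theta_hat()
        return float(diff @ self.V @ diff)
```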


LinUCB

“Choose the best action in the best environment amongst the plausible ones.”

Choose $C_t$ with suitable $(\beta_t)_t$ and let
$$A_t = \arg\max_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle a, \theta \rangle.$$

Or, more concretely, for each action $a \in \mathcal{A}_t$, compute the “optimistic index”
$$U_t(a) = \max_{\theta \in C_t} \langle a, \theta \rangle.$$

Since we are maximising a linear function over a closed convex set, the solution is explicit:
$$A_t = \arg\max_a U_t(a) = \arg\max_a \langle a, \hat{\theta}_{t-1} \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}}.$$
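For a finite contextualized action set, the explicit solution makes the policy a few lines; a sketch reusing the estimator above (function name and shapes are illustrative):

```python
import numpy as np

def linucb_action(action_set, theta_hat, V, beta):
    """Optimistic index U_t(a) = <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}}
    evaluated on a finite action set of shape (K, d); returns the argmax."""
    V_inv = np.linalg.inv(V)
    # a^T V^{-1} a for every row a of action_set, in one einsum call.
    norms = np.sqrt(np.einsum("kd,de,ke->k", action_set, V_inv, action_set))
    indices = action_set @ theta_hat + np.sqrt(beta) * norms
    return int(np.argmax(indices))
```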


Optimism in the Face of Uncertainty Principle


Regret Bound

Assumptions:

1. Bounded scalar mean reward: $|\langle a, \theta_* \rangle| \le 1$ for any $a \in \cup_t \mathcal{A}_t$.

2. Bounded actions: for any $a \in \cup_t \mathcal{A}_t$, $\|a\|_2 \le L$.

3. Honest confidence intervals: there exists $\delta \in (0, 1)$ such that, with probability $1 - \delta$, for all $t \in [n]$, $\theta_* \in C_t$ for some choice of $(\beta_t)_{t \le n}$.

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability $1 - \delta$, the regret of LinUCB satisfies
$$R_n \le \sqrt{8 d n \beta_n \log\left(\frac{d\lambda + nL^2}{d\lambda}\right)}.$$


Proof

Jensen’s inequality shows that
$$R_n = \sum_{t=1}^{n} \langle A_t^* - A_t, \theta_* \rangle =: \sum_{t=1}^{n} r_t \le \sqrt{n \sum_{t=1}^{n} r_t^2},$$
where $A_t^* := \arg\max_{a \in \mathcal{A}_t} \langle a, \theta_* \rangle$.

Let $\tilde{\theta}_t$ be the vector that realizes the maximum over the ellipsoid:
$$\tilde{\theta}_t \in C_t \ \text{ s.t. } \ \langle A_t, \tilde{\theta}_t \rangle = U_t(A_t).$$

From the definition of LinUCB, on the event $\theta_* \in C_t$,
$$\langle A_t^*, \theta_* \rangle \le U_t(A_t^*) \le U_t(A_t) = \langle A_t, \tilde{\theta}_t \rangle.$$

Then,
$$r_t \le \langle A_t, \tilde{\theta}_t - \theta_* \rangle \le \|A_t\|_{V_{t-1}^{-1}} \|\tilde{\theta}_t - \theta_*\|_{V_{t-1}} \le 2 \sqrt{\beta_t}\, \|A_t\|_{V_{t-1}^{-1}}.$$


Elliptical Potential Lemma

So we now have a new upper bound,
$$R_n = \sum_{t=1}^{n} r_t \le \sqrt{n \sum_{t=1}^{n} r_t^2} \le 2 \sqrt{n \beta_n \sum_{t=1}^{n} \left(1 \wedge \|A_t\|^2_{V_{t-1}^{-1}}\right)}.$$

Lemma (Abbasi-Yadkori et al. (2011))
Let $x_1, \dots, x_n \in \mathbb{R}^d$, $V_t = V_0 + \sum_{s=1}^{t} x_s x_s^\top$, $t \in [n]$, and $L \ge \max_t \|x_t\|_2$. Then,
$$\sum_{t=1}^{n} \left(1 \wedge \|x_t\|^2_{V_{t-1}^{-1}}\right) \le 2 \log\left(\frac{\det V_n}{\det V_0}\right) \le 2d \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right).$$
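The lemma is easy to check numerically; a small sketch, assuming $V_0 = \lambda I$ and $L = 1$ (constants and seed are my own):

```python
import numpy as np

# Empirical check of the elliptical potential lemma with V_0 = lam * I, L = 1.
rng = np.random.default_rng(1)
d, n, lam, L = 5, 10_000, 1.0, 1.0
V, potential = lam * np.eye(d), 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))                  # enforce ||x||_2 <= L
    potential += min(1.0, x @ np.linalg.solve(V, x))  # 1 ^ ||x||^2_{V^{-1}}
    V += np.outer(x, x)
bound = 2 * d * np.log((d * lam + n * L**2) / (d * lam))
print(potential, "<=", bound)  # the potential grows only logarithmically in n
```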


Confidence Ellipsoids

Assumptions: $\|\theta_*\|_2 \le S$, and let $(A_s)_s$, $(\eta_s)_s$ be such that for any $1 \le s \le t$, $\eta_s \mid \mathcal{F}_{s-1} \sim \operatorname{subG}(1)$, where $\mathcal{F}_s = \sigma(A_1, \eta_1, \dots, A_{s-1}, \eta_{s-1}, A_s)$.

Fix $\delta \in (0, 1)$. Let
$$\beta_{t+1} = \sqrt{\lambda}\, S + \sqrt{2 \log\left(\frac{1}{\delta}\right) + \log\left(\frac{\det V_t(\lambda)}{\lambda^d}\right)} \le \sqrt{\lambda}\, S + \sqrt{2 \log\left(\frac{1}{\delta}\right) + d \log\left(\frac{d\lambda + nL^2}{d\lambda}\right)},$$
and $C_{t+1} = \left\{\theta \in \mathbb{R}^d : \|\hat{\theta}_t - \theta\|_{V_t(\lambda)} \le \beta_{t+1}\right\}$.

Theorem

$C_{t+1}$ is a confidence set for $\theta_*$ at level $1 - \delta$:
$$\mathbb{P}(\theta_* \in C_{t+1}) \ge 1 - \delta.$$

Proof: see Chapter 20 of Bandit Algorithms (www.banditalgs.com).
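In practice one computes the exact determinant-based radius rather than the looser closed-form bound; a sketch in the same notation (function name is illustrative):

```python
import numpy as np

def confidence_radius(V, lam, S, delta):
    """beta_{t+1} = sqrt(lam)*S + sqrt(2 log(1/delta) + log(det V_t(lam)/lam^d)),
    computed with slogdet for numerical stability."""
    d = V.shape[0]
    _, logdet = np.linalg.slogdet(V)
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta) + logdet - d * np.log(lam))
```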


History

• Abe and Long [4] introduced stochastic linear bandits into the machine learning literature.

• Auer [6] was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: $|\mathcal{A}_t| < +\infty$.

• Confidence ellipsoids: Dani et al. [8] (ConfidenceBall2), Rusmevichientong and Tsitsiklis [11] (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. [3] (OFUL).

• The name LinUCB comes from Chu et al. [7].

• Alternative routes:
  • Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori [1], Abbasi-Yadkori et al. [2], Rusmevichientong and Tsitsiklis [11].
  • Phased elimination.
  • Thompson sampling.


Summary

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then, with probability $1 - \delta$, the regret of LinUCB satisfies
$$R_n \le \sqrt{8 d n \beta_n \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right)} = \tilde{O}(d\sqrt{n}).$$

Linear bandits are an elegant model of the exploration-exploitation dilemma when actions are correlated.

The main ingredients of the regret analysis are:

• bounding the instantaneous regret using the definition of optimism;

• a maximal concentration inequality holding for a randomized, sequential design;

• the Elliptical Potential Lemma.

Real-World Setting: Delayed Feedback

In a real-world application, rewards are delayed ...


In a real-world application, rewards are delayed ... and censored.


Delayed Linear Bandits

Modified setting: at round $t \ge 1$,

• receive the contextualized action set $\mathcal{A}_t = \{a_1, \dots, a_K\}$ and choose an action $A_t \in \mathcal{A}_t$;

• two random variables are generated but not observed:
$$X_t \sim \mathcal{B}(\langle A_t, \theta_* \rangle) \quad \text{and} \quad D_t \sim \mathcal{D}(\tau);$$

• at time $t + D_t$, the reward $X_t$ of action $A_t$ is disclosed ...

• ... unless $D_t > m$: if the delay is too long, the reward is discarded.

New parameter: $0 < m < n$ is the cut-off time of the system. If the delay is longer, the reward is never received. The delay distribution $\mathcal{D}(\tau)$ characterizes the proportion of converting actions: $\tau_m = \mathbb{P}(D_t \le m)$.
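A minimal sketch of one round of this environment, assuming a geometric delay law as one concrete choice of $\mathcal{D}(\tau)$ (names and seed are my own):

```python
import numpy as np

rng = np.random.default_rng(2)

def delayed_round(theta, action_set, chosen, tau):
    """Generate the two hidden variables of one round: a Bernoulli reward
    X_t with mean <A_t, theta_*>, and a delay D_t ~ Geometric(tau)."""
    a = action_set[chosen]
    x = rng.binomial(1, float(np.clip(a @ theta, 0.0, 1.0)))
    d_t = int(rng.geometric(tau))
    return x, d_t  # the learner sees x at time t + d_t, and only if d_t <= m
```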


A new estimator

We now have:
$$V_t = \sum_{s=1}^{t-1} A_s A_s^\top, \qquad b_t = \sum_{s=1}^{t-1} A_s X_s \mathbb{1}\{D_s \le m\},$$
where, at time $t$, $b_t$ contains additional non-identically distributed samples:
$$b_t = \sum_{s=1}^{t-m} A_s X_s \mathbb{1}\{D_s \le m\} + \sum_{s=t-m+1}^{t-1} A_s X_s \mathbb{1}\{D_s \le t - s\}.$$

The “conditionally biased” least-squares estimator includes every received feedback:
$$\hat{\theta}^b_t = V_t^{-1} b_t.$$

Baseline: use the previous estimator but discard the last $m$ steps,
$$\hat{\theta}^{disc}_t = V_{t-m}^{-1} b_{t-m}, \quad \text{with} \quad \mathbb{E}[\hat{\theta}^{disc}_t \mid \mathcal{F}_t] \approx \tau_m \theta_*.$$
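A sketch of the biased estimator (0-indexed arrays; I add a $\lambda I$ regularizer, consistent with the earlier slides, so that $V_t$ is always invertible):

```python
import numpy as np

def biased_estimator(A, X, D, t, m, lam=1.0):
    """theta_hat_b_t = V_t^{-1} b_t, where the reward of round s enters b_t
    iff it has been disclosed by round t, i.e. D_s <= min(m, t - s).
    A: (n, d) actions, X: (n,) rewards, D: (n,) delays."""
    past = np.arange(t - 1)                  # rounds 1, ..., t-1 of the slides
    V = lam * np.eye(A.shape[1]) + A[past].T @ A[past]
    received = D[past] <= np.minimum(m, t - 1 - past)
    b = A[past].T @ (X[past] * received)
    return np.linalg.solve(V, b)
```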


Confidence interval and the D-LinUCB policy

We remark that
$$\hat{\theta}^b_t - \tau_m \theta_* = \underbrace{\hat{\theta}^b_t - \hat{\theta}^{disc}_{t+m}}_{\text{finite bias}} + \underbrace{\hat{\theta}^{disc}_{t+m} - \tau_m \theta_*}_{\text{same as before}}.$$

For the new $C_t$, we have new optimistic indices:
$$A_t = \arg\max_{a \in \mathcal{A}_t} \max_{\theta \in C_t} \langle a, \theta \rangle.$$

But now the solution has an extra (vanishing) bias term:
$$A_t = \arg\max_a \langle a, \hat{\theta}_t \rangle + \sqrt{\beta_t}\, \|a\|_{V_{t-1}^{-1}} + m\, \|a\|_{V_{t-1}^{-2}}.$$

D-LinUCB is a simple, harmless modification of LinUCB, with regret guarantees in the delayed-feedback setting; a sketch of the modified index follows.
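Here is that sketch (names are illustrative; `theta_hat` stands for the biased estimator $\hat{\theta}^b_t$ of the previous slide):

```python
import numpy as np

def dlinucb_action(action_set, theta_hat, V, beta, m):
    """LinUCB index plus the extra vanishing bias term:
    <a, theta_hat> + sqrt(beta)*||a||_{V^{-1}} + m*||a||_{V^{-2}}."""
    V_inv = np.linalg.inv(V)
    norm_1 = np.sqrt(np.einsum("kd,de,ke->k", action_set, V_inv, action_set))
    norm_2 = np.sqrt(np.einsum("kd,de,ke->k", action_set, V_inv @ V_inv, action_set))
    indices = action_set @ theta_hat + np.sqrt(beta) * norm_1 + m * norm_2
    return int(np.argmax(indices))
```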


Regret bound

Theorem (D-LinUCB Regret)
Under the same conditions as before, with $V_0 = \lambda I$, with probability $1 - \delta$ the regret of D-LinUCB satisfies
$$R_n \le \tau_m^{-1} \sqrt{8 d n \beta_n \log\left(\frac{\operatorname{trace}(V_0) + nL^2}{d \det^{1/d}(V_0)}\right)} + \frac{d\, m\, \tau_m^{-1}}{\lambda - 1} \log\left(1 + \frac{n}{d(\lambda - 1)}\right).$$


Simulations

We fix $n = 3000$ and generate geometric delays with $\mathbb{E}[D_t] = 100$. In a real setting, this would correspond to an experiment that lasts 3 h, with average delays of 6 minutes.

Then we let the cut-off vary, $m \in \{250, 500, 1000\}$, i.e. waiting times of 15 min, 30 min and 1 h, respectively.

Figure 1: Comparison of the simulated behaviors of D-LinUCB and (waiting) LinUCB: regret $R(T)$ vs. round $t$, one panel per cut-off $m$.
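The delay model of this setup is easy to reproduce; a small sketch that also reports the empirical $\tau_m$ for each cut-off (seed and names are mine):

```python
import numpy as np

# Geometric delays with mean 100 rounds (~ 6 minutes in the real-time analogy).
rng = np.random.default_rng(3)
n, mean_delay = 3000, 100
delays = rng.geometric(1 / mean_delay, size=n)   # E[D_t] = 100
for m in (250, 500, 1000):                       # ~ 15 min, 30 min, 1 h
    print(m, (delays <= m).mean())               # empirical tau_m = P(D_t <= m)
```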


Conclusions

• Linear bandits are a powerful and well-understood way of solving the exploration-exploitation trade-off in a metric space.

• The techniques have been extended to generalized linear models by Filippi et al. [9] ...

• ... and to kernel regression by Valko et al. [12, 13].

• Yet, including constraints and external sources of noise in real-world applications is challenging.

• Some use cases challenge the bandit model assumptions ...

• ... and then it's time to open the box of MDPs (e.g. UCRL and KL-UCRL, Auer et al. [5], Filippi et al. [10]).

Thanks!



References i

References

[1] Yasin Abbasi-Yadkori. Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta, 2009.

[2] Yasin Abbasi-Yadkori, András Antos, and Csaba Szepesvári. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.

[3] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011.

[4] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11, 1999.

[5] P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[7] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214, 2011.


References ii

[8] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366, 2008.

[9] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (NIPS), pages 586–594, 2010.

[10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122. IEEE, 2010.

[11] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

[12] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

[13] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46–54, 2014.