Improved Algorithms for Linear Stochastic Bandits
TRANSCRIPT

Outline: The Linear Bandit problem · Optimism in Face of Uncertainty · A Self-Normalized Tail Inequality for Vector-Valued Martingales · Regret of the OFUL algorithm

Improved Algorithms for Linear Stochastic Bandits
a NIPS article by Yasin Abbasi-Yadkori, David Pál, and Csaba Szepesvári

Emilie Kaufmann
SMILE NIPS'11 Debrief session, February 16th 2012
The linear bandit problem

Setting:
- a set of arms D ⊂ R^d
- an unknown parameter θ∗ ∈ R^d
- the (unknown) best arm x∗ = argmax_{x∈D} 〈x | θ∗〉

At time t, pick an arm X_t ∈ D and observe the reward:

    Y_t = 〈X_t | θ∗〉 + η_t

where the noise η_t is centered (E[η_t | F_{t−1}] = 0) and subgaussian:

    E[e^{λη_t} | F_{t−1}] ≤ exp(λ²/2)    (1)

with F_t = σ(X_1, ..., X_t, X_{t+1}, η_1, ..., η_t).
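As a concrete illustration of this setting (not from the paper), here is a minimal simulation with a hypothetical finite arm set and standard Gaussian noise, which satisfies condition (1):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
# Hypothetical finite arm set D ⊂ R^d (unit-norm rows) and unknown parameter θ*
D = rng.normal(size=(20, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)
theta_star = np.array([0.8, -0.3, 0.5])

def pull(x):
    """Observe Y = <x | θ*> + η with η ~ N(0, 1), which is 1-subgaussian."""
    return x @ theta_star + rng.normal()

best = D[np.argmax(D @ theta_star)]  # the (unknown) best arm x*
reward = pull(best)
```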
The linear bandit problem
Goal: Design an algorithm (ie a sequential choice of the arms)minimizing the regret :
Rn =n∑
t=1
〈x∗ | θ∗〉 −n∑
t=1
〈Xt | θ∗〉 =n∑
t=1
〈x∗ − Xt | θ∗〉
In the article: a regret bound for the regret Rn of the OFULalgorithm, holding with high probability
Some applications:
D ⊂ {0, 1}d : online shortest path problem
D is finite : marketing ($$)
A special case: the multiarmed bandit problem

The multiarmed bandit problem in the Gaussian case:
- K independent arms
- arm j yields an i.i.d. sequence of N(µ_j, 1) rewards (known variance)
- if arm I_t is chosen, the reward is drawn from N(µ_{I_t}, 1)

Why does it fit our setting?
- D = {e_1, ..., e_K}, the canonical basis of R^K
- θ∗ = (µ_1, ..., µ_K)^T
- if arm I_t is drawn at time t, we get the reward Y_t = 〈e_{I_t} | θ∗〉 + η_t, with (η_t) an i.i.d. sequence of N(0, 1)

⇒ Generalize the UCB optimistic heuristic to the linear case
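The embedding above is immediate to write down (an illustrative sketch with hypothetical means µ_j):

```python
import numpy as np

rng = np.random.default_rng(1)

K = 4
mu = np.array([0.1, 0.5, 0.3, 0.9])   # hypothetical arm means
D = np.eye(K)                          # arms = canonical basis e_1, ..., e_K
theta_star = mu                        # θ* stacks the means

def pull(j):
    """Drawing arm j in the MAB is <e_j | θ*> + η with η ~ N(0, 1)."""
    return D[j] @ theta_star + rng.normal()

# The linear-bandit mean reward of arm e_j is exactly µ_j:
assert np.allclose(D @ theta_star, mu)
```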
Some ideas to solve the problem

With X_{1:t} = (X_1^T; ...; X_t^T) ∈ R^{t×d}, Y_{1:t} = (Y_1, ..., Y_t)^T and η_{1:t} = (η_1, ..., η_t)^T ∈ R^t:

    Y_{1:t} = X_{1:t} θ∗ + η_{1:t}

First idea: use a regularized least-squares estimate of θ∗:

    θ̂_t = (V_t)^{−1} X_{1:t}^T Y_{1:t},   where V_t = λI + X_{1:t}^T X_{1:t}

and choose X_{t+1} = argmax_{x∈D} 〈x | θ̂_t〉

⇒ Exploitation only!
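The regularized least-squares step is a few lines of linear algebra (a minimal sketch, assuming NumPy; `lam` stands for λ and `ridge_estimate` is a hypothetical helper name):

```python
import numpy as np

def ridge_estimate(X, Y, lam=1.0):
    """Regularized least squares: θ̂ = (λI + XᵀX)⁻¹ XᵀY; also returns V = λI + XᵀX."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X
    return np.linalg.solve(V, X.T @ Y), V

# Toy check on noiseless data: the estimate approaches θ*.
rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
Y = X @ theta_star
theta_hat, V = ridge_estimate(X, Y, lam=0.1)
```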
Optimism in Face of Uncertainty

A generic optimistic algorithm for the linear bandit:
- Build a confidence ellipsoid around θ̂_t:

      C_t(δ) = { θ ∈ R^d : ||θ − θ̂_t||_{V_t} ≤ β_t(δ) }

  such that P(∀t ∈ N, θ∗ ∈ C_t(δ)) ≥ 1 − δ
- At time t, choose X_t such that:

      (X_t, θ̃_t) = argmax_{(x,θ) ∈ D×C_{t−1}(δ)} 〈x | θ〉
Optimism in Face of Uncertainty

The previous optimization problem rewrites:

    X_t = argmax_{x∈D} 〈x | θ̂_t〉 + β_t(δ) ||x||_{V_t^{−1}}

For the multiarmed bandit case:
- V_t = Diag(N_t(1), ..., N_t(K)), where N_t(j) is the number of draws of arm j up to time t
- denote by R_t(j) the sum of rewards gathered from arm j up to time t

The arm chosen at time t is the one with the highest index:

    R_t(j)/N_t(j) + β_t(δ)/√N_t(j)

⇒ β_t(δ) is an exploration rate... and should be logarithmic!
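For a finite arm set the optimistic choice is a one-liner of linear algebra; the sketch below (hypothetical code, not the authors' implementation) also checks the reduction to the UCB-style index in the MAB case:

```python
import numpy as np

def oful_choice(D, theta_hat, V, beta):
    """Optimistic arm: argmax over rows x of D of <x | θ̂> + β ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    exploit = D @ theta_hat
    explore = beta * np.sqrt(np.einsum('ij,jk,ik->i', D, V_inv, D))  # β ||x||_{V^{-1}}
    return int(np.argmax(exploit + explore))

# MAB sanity check: with D the canonical basis and V = Diag(N_t(j)),
# the index is R_t(j)/N_t(j) + β/√N_t(j).
N = np.array([4.0, 9.0])    # hypothetical draw counts N_t(j)
R = np.array([2.0, 3.0])    # hypothetical reward sums R_t(j)
j = oful_choice(np.eye(2), R / N, np.diag(N), beta=1.0)
```

Here arm 0 has index 2/4 + 1/√4 = 1.0 and arm 1 has index 3/9 + 1/√9 ≈ 0.67, so the optimistic choice is arm 0.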
Building the confidence region

Previously (in a non-regularized setting):
- Dani, Hayes, Kakade 2008:

      β_t(δ) = √max( 128 d ln(t) ln(t²/δ), (64/9) ln²(t²/δ) )

- Rusmevichientong, Tsitsiklis 2009:

      β_t(δ) = K √log(t) √( d log(t) + log(t²/δ) )

Empirically, a smaller confidence region seems to work!

In the article:
- a new (elegant) technique for building a smaller confidence region
- better empirical performance!
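To get a feel for the magnitudes, the two displayed rates can be evaluated numerically (an illustrative sketch; the constant K of Rusmevichientong and Tsitsiklis is set to 1 arbitrarily):

```python
import numpy as np

def beta_dani_hayes_kakade(t, d, delta):
    """β_t(δ) = sqrt(max(128 d ln(t) ln(t²/δ), (64/9) ln²(t²/δ)))."""
    return np.sqrt(max(128 * d * np.log(t) * np.log(t**2 / delta),
                       (64 / 9) * np.log(t**2 / delta) ** 2))

def beta_rusmevichientong_tsitsiklis(t, d, delta, K=1.0):
    """β_t(δ) = K sqrt(log t) sqrt(d log t + log(t²/δ))."""
    return K * np.sqrt(np.log(t)) * np.sqrt(d * np.log(t) + np.log(t**2 / delta))

t, d, delta = 10_000, 5, 0.05
b_dhk = beta_dani_hayes_kakade(t, d, delta)
b_rt = beta_rusmevichientong_tsitsiklis(t, d, delta)
# Both grow polylogarithmically in t; the paper replaces these explicit
# rates by a data-dependent (typically much smaller) quantity.
```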
A self-normalized inequality for a vector-valued martingale

Finding the martingale

Let S_t = X_{1:t}^T η_{1:t} = Σ_{k=1}^t η_k X_k. Then S_t is an F_t-martingale and:

    ∀x ∈ R^d, |x^T(θ̂_t − θ∗)| ≤ ||x||_{V_t^{−1}} ( ||S_t||_{V_t^{−1}} + √λ ||θ∗|| )

Therefore, finding β_t(δ) such that

    P( ∀t ∈ N, ||S_t||_{V_t^{−1}} ≤ β_t(δ) ) ≥ 1 − δ

leads to:

    P( ∀t ∈ N, ||θ̂_t − θ∗||_{V_t} ≤ β_t(δ) + √λ ||θ∗|| ) ≥ 1 − δ

For the bandit case:

    ||S_t||²_{V_t^{−1}} = Σ_{j=1}^K (R_t(j) − µ_j N_t(j))² / (λ + N_t(j))
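The deviation bound above follows from Cauchy-Schwarz in the V_t^{−1} inner product together with V_t ⪰ λI; it holds deterministically, as can be checked on simulated data (illustrative code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 3, 50, 1.0
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))          # rows X_1, ..., X_n
eta = rng.normal(size=n)             # noise
Y = X @ theta_star + eta

V = lam * np.eye(d) + X.T @ X        # V_n
S = X.T @ eta                        # S_n = sum_k eta_k X_k
theta_hat = np.linalg.solve(V, X.T @ Y)
V_inv = np.linalg.inv(V)

def v_norm(v, M):
    return np.sqrt(v @ M @ v)

# |x'(θ̂ - θ*)| ≤ ||x||_{V⁻¹} (||S||_{V⁻¹} + √λ ||θ*||) for every direction x:
ratios = []
for _ in range(100):
    x = rng.normal(size=d)
    lhs = abs(x @ (theta_hat - theta_star))
    rhs = v_norm(x, V_inv) * (v_norm(S, V_inv)
                              + np.sqrt(lam) * np.linalg.norm(theta_star))
    ratios.append(lhs / rhs)
max_ratio = max(ratios)
```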
A self-normalized inequality for a vector-valued martingale

Bounding its tail

Theorem
Let (X_t) be an R^d-valued and (η_t) a real-valued stochastic process such that X_t is F_{t−1}-measurable and η_t is F_t-measurable and satisfies (1). Let

    V_t = V + Σ_{s=1}^t X_s X_s^T   and   S_t = Σ_{s=1}^t η_s X_s

Then, for any δ > 0, with probability at least 1 − δ:

    ∀t ≥ 0, ||S_t||²_{V_t^{−1}} ≤ 2 log( det(V_t)^{1/2} det(V)^{−1/2} / δ )
Generalizes a technique by de la Peña (2004).
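The theorem can be probed by Monte Carlo (an illustrative sketch with V = λI, standard Gaussian noise, and arbitrarily chosen parameters): the empirical frequency of an any-time violation should stay below δ.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, lam, delta = 2, 200, 1.0, 0.1
runs, failures = 200, 0

for _ in range(runs):
    V = lam * np.eye(d)                 # V = λI, so det(V)^{-1/2} = λ^{-d/2}
    S = np.zeros(d)
    violated = False
    for t in range(T):
        x = rng.normal(size=d)          # F_{t-1}-measurable action
        eta = rng.normal()              # 1-subgaussian noise, F_t-measurable
        V += np.outer(x, x)
        S += eta * x
        lhs = S @ np.linalg.solve(V, S)                  # ||S_t||²_{V_t^{-1}}
        rhs = 2 * np.log(np.sqrt(np.linalg.det(V)) * lam ** (-d / 2) / delta)
        if lhs > rhs:
            violated = True
    failures += violated
# The theorem guarantees P(violation at some t) ≤ δ.
```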
Sketch of the proof (1/2)

For all λ ∈ R^d,

    M_t^λ = exp( 〈λ | S_t〉 − (1/2) ||λ||²_{V_t − V} )

is a supermartingale such that, for every stopping time τ, E[M_τ^λ] ≤ 1.

Average over λ by setting a 'prior' Λ ∼ N(0, V^{−1}):
- on the one hand, E[M_t(Λ)] ≤ 1
- on the other hand, E[M_t(Λ)] = E[ E[M_t(Λ) | F_∞] ] and

      E[M_t(Λ) | F_∞] = ∫_{R^d} exp( 〈λ | S_t〉 − (1/2) ||λ||²_{V_t − V} ) f(λ) dλ

  where f denotes the pdf of N(0, V^{−1}).
Sketch of the proof (2/2)

A direct calculation shows that

    E[M_t(Λ) | F_∞] = ( det(V) / det(V_t) )^{1/2} exp( (1/2) ||S_t||²_{V_t^{−1}} )

Then E[ ( det(V)/det(V_t) )^{1/2} exp( (1/2) ||S_t||²_{V_t^{−1}} ) ] ≤ 1.

Using Markov's inequality yields:

    P( ||S_t||²_{V_t^{−1}} > 2 log( det(V_t)^{1/2} / (δ det(V)^{1/2}) ) )
      = P( exp( (1/2) ||S_t||²_{V_t^{−1}} ) > det(V_t)^{1/2} / (δ det(V)^{1/2}) ) ≤ δ

For a bound uniform in t: use stopping times.
Final result for the regret

The OFUL algorithm

Finally we use a confidence region based on the exploration rate

    β_t(δ) = √( 2 log( det(V_t)^{1/2} λ^{−d/2} / δ ) ) + √λ S

Theorem
Assume that ||θ∗|| ≤ S and, for all x ∈ D, ||x|| ≤ L and 〈x | θ∗〉 ∈ [−1, 1]. Then with probability at least 1 − δ, the regret of the OFUL algorithm satisfies, for all n ≥ 0:

    R_n ≤ 4 √( n d log(λ + nL/d) ) ( √λ S + √( 2 log(1/δ) + d log(1 + nL/(λd)) ) )
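Putting the pieces together, here is a hypothetical end-to-end OFUL loop on a toy instance (an illustrative sketch under the assumptions above, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy instance: finite unit-norm arm set (so L = 1), Gaussian noise.
d, n_arms, T, lam, delta, S = 3, 10, 500, 1.0, 0.05, 2.0
theta_star = rng.normal(size=d)
theta_star *= min(1.0, S / np.linalg.norm(theta_star))   # enforce ||θ*|| ≤ S
D = rng.normal(size=(n_arms, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

V = lam * np.eye(d)          # V_0 = λI
XtY = np.zeros(d)            # running X_{1:t}^T Y_{1:t}
regret = 0.0
best_value = np.max(D @ theta_star)

for t in range(1, T + 1):
    theta_hat = np.linalg.solve(V, XtY)
    # Exploration rate β_t(δ) = sqrt(2 log(det(V_t)^{1/2} λ^{-d/2} / δ)) + √λ S
    beta = (np.sqrt(2 * np.log(np.sqrt(np.linalg.det(V)) * lam ** (-d / 2) / delta))
            + np.sqrt(lam) * S)
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum('ij,jk,ik->i', D, V_inv, D))
    x = D[np.argmax(D @ theta_hat + beta * widths)]   # optimistic arm X_t
    y = x @ theta_star + rng.normal()                 # observed reward Y_t
    V += np.outer(x, x)
    XtY += y * x
    regret += best_value - x @ theta_star             # pseudo-regret increment
```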
Discussion

- Optimality? Here R_n = O(d√n) w.h.p.
  - lower bound Ω(√(dn)) on the worst-case regret for the multiarmed bandit
  - lower bound Ω(d√n) for the linear bandit?
- regularized versus non-regularized approaches
- empirical performance