No-Regret Learning in Unknown Games with Correlated Payoffs


Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause
ETH Zürich
{sessap, ilijab, maryamk, krausea}@ethz.ch
33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.

Motivation

Consider learning to play an unknown repeated game.
- Bandit algorithms: slow convergence rates.
- Full-information algorithms: improved rates, but the required feedback is often unrealistic.

Set-Up

• At each time $t$:
  - player $i$ picks an action $a_i^t \in \mathcal{A}_i$ (where $|\mathcal{A}_i| = K_i$),
  - the other players pick actions $a_{-i}^t$,
  - player $i$ receives the reward $r_i(a_i^t, a_{-i}^t)$.
• The reward function $r_i : \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to [0,1]$ is unknown.
• After $T$ time steps, player $i$ incurs the regret (sketched in code after this section)
  $R_i(T) = \max_{a \in \mathcal{A}_i} \sum_{t=1}^{T} r_i(a, a_{-i}^t) - \sum_{t=1}^{T} r_i(a_i^t, a_{-i}^t).$
• At each time $t$, player $i$ observes:
  1) $\tilde{r}_i^t = r_i(a_i^t, a_{-i}^t) + \epsilon_i^t$, where $\epsilon_i^t$ is $\sigma_i$-sub-Gaussian (noisy bandit feedback);
  2) $a_{-i}^t$ (the actions of the other players).

Feedback models and regret guarantees:

| Model | Feedback | Regret |
|---|---|---|
| Bandit: Exp3 [Auer et al. '02] [1] | $r_i(a_i^t, a_{-i}^t)$ | $\mathcal{O}(\sqrt{T K_i \log K_i})$ (scales badly with $K_i$) |
| Full-information: Hedge [Freund and Schapire '97] [2] | $\{r_i(a, a_{-i}^t), \forall a \in \mathcal{A}_i\}$ | $\mathcal{O}(\sqrt{T \log K_i})$ (unrealistic feedback, since $r_i(\cdot,\cdot)$ is unknown) |
| Proposed model: GP-MW [this paper] | $r_i(a_i^t, a_{-i}^t) + \epsilon_i^t$ and $a_{-i}^t$ | $\mathcal{O}(\sqrt{T \log K_i} + \gamma_T \sqrt{T})$ |
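As a concrete illustration of the regret definition above, here is a minimal numpy sketch (not from the paper; the array shapes and names are assumptions for illustration) that computes $R_i(T)$ from a played history when the true reward table is available for evaluation:

```python
import numpy as np

def regret(reward_table, played, opponent_seq):
    """Regret R_i(T) of player i after T rounds.

    reward_table: array of shape (K_i, K_opp), reward_table[a, b] = r_i(a, b)
    played:       length-T integer array of player i's actions a_i^t
    opponent_seq: length-T integer array of opponent actions a_-i^t
    """
    # Reward actually collected over the T rounds.
    collected = reward_table[played, opponent_seq].sum()
    # Cumulative reward of every fixed action against the same opponent sequence.
    fixed_action_totals = reward_table[:, opponent_seq].sum(axis=1)
    # Best fixed action in hindsight minus what was actually collected.
    return fixed_action_totals.max() - collected

# Example: 3 actions for player i, 2 opponent actions, T = 4 rounds.
rng = np.random.default_rng(0)
table = rng.uniform(0, 1, size=(3, 2))            # r_i takes values in [0, 1]
print(regret(table, np.array([0, 2, 1, 0]), np.array([1, 0, 0, 1])))
```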

Key Idea

Use Gaussian Process (GP) confidence bounds to emulate the full-information feedback:
• Regularity (smoothness) assumption: $r_i(\cdot)$ has a bounded RKHS norm w.r.t. a kernel function $k_i(\cdot,\cdot)$.
• Player $i$ can use the observed data $\{a_i^\tau, a_{-i}^\tau, \tilde{r}_i^\tau\}_{\tau=0}^{t-1}$ to build a shrinking upper confidence bound on $r_i(\cdot)$:
  $\mathrm{UCB}_t(\cdot) = \mu_t(\cdot) + \beta_t^{1/2} \, \sigma_t(\cdot),$
• where $\mu_t(\cdot)$ and $\sigma_t(\cdot)$ are the posterior mean and standard deviation computed using standard GP regression.
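A minimal sketch of how such an upper confidence bound can be computed with standard GP regression (plain numpy, squared-exponential kernel; the function names, hyperparameter values, and toy data are illustrative assumptions, not the paper's code):

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel k(x, y) = exp(-||x - y||^2 / (2 l^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_ucb(X_obs, y_obs, X_query, beta=2.0, noise=0.1):
    """Posterior mean/std of a GP fit to (X_obs, y_obs), plus UCB = mu + sqrt(beta) * sigma."""
    K = se_kernel(X_obs, X_obs) + noise**2 * np.eye(len(X_obs))
    K_star = se_kernel(X_query, X_obs)                 # cross-covariances
    alpha = np.linalg.solve(K, y_obs)
    mu = K_star @ alpha                                # posterior mean
    v = np.linalg.solve(K, K_star.T)
    var = np.clip(1.0 - np.sum(K_star * v.T, axis=1), 0.0, None)
    sigma = np.sqrt(var)                               # posterior standard deviation
    return mu, sigma, mu + np.sqrt(beta) * sigma

# Inputs are joint action profiles (a_i, a_-i) encoded as feature vectors.
X = np.array([[0.1, 0.4], [0.5, 0.2], [0.9, 0.7]])
y = np.array([0.3, 0.8, 0.5])                          # noisy rewards r~_i
mu, sigma, ucb = gp_ucb(X, y, X_query=np.array([[0.5, 0.5]]))
```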

GP-MW algorithm for player $i$

Initialize the mixed strategy $w_1 = [1/K_i, \ldots, 1/K_i] \in \mathbb{R}^{K_i}$.
For $t = 1, \ldots, T$:
- Sample an action $a_i^t \sim w_t$.
- Observe the noisy reward $\tilde{r}_i^t$ and the opponents' actions $a_{-i}^t$.
- Compute the optimistic full-information feedback $r_t \in \mathbb{R}^{K_i}$:
  $r_t[k] = \min\{\mathrm{UCB}_t(a^k, a_{-i}^t), \, 1\}, \quad k = 1, \ldots, K_i.$
- Update the mixed strategy:
  $w_{t+1}[k] \propto w_t[k] \cdot \exp(\eta \, r_t[k]), \quad k = 1, \ldots, K_i.$
- Update the GP model based on the newly observed data.
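A compact sketch of this loop in Python (illustrative only: `play_round` is an assumed environment interface, and `gp_ucb_fn` is assumed to return only the UCB values, e.g. a thin wrapper around the `gp_ucb` sketch above):

```python
import numpy as np

def gp_mw(actions, T, eta, play_round, gp_ucb_fn, rng=np.random.default_rng(0)):
    """Multiplicative-weights update driven by optimistic GP feedback.

    actions:    array of shape (K_i, d), player i's actions as feature vectors
    play_round: callable a_i -> (noisy reward r~_i^t, opponent action vector a_-i^t)
    gp_ucb_fn:  callable (X_obs, y_obs, X_query) -> UCB values at X_query
    """
    K = len(actions)
    w = np.full(K, 1.0 / K)                     # uniform initial mixed strategy w_1
    X_obs, y_obs = [], []
    for t in range(T):
        a_idx = rng.choice(K, p=w)              # sample a_i^t ~ w_t
        reward, a_minus_i = play_round(actions[a_idx])
        X_obs.append(np.concatenate([actions[a_idx], a_minus_i]))
        y_obs.append(reward)
        # Optimistic full-information feedback: UCB of every own action vs. observed a_-i^t.
        X_query = np.concatenate([actions, np.tile(a_minus_i, (K, 1))], axis=1)
        r_t = np.minimum(gp_ucb_fn(np.array(X_obs), np.array(y_obs), X_query), 1.0)
        w = w * np.exp(eta * r_t)               # multiplicative-weights update
        w /= w.sum()
    return w
```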


Main Results

Def. (Maximum information gain) $\gamma_T = \max_{x_1, \ldots, x_T} I(\mathbf{r}_T; r_i)$, the mutual information between $r_i(\cdot)$ and $\mathbf{r}_T = [r_i(x_t) + \epsilon]_{t=1}^{T}$.
• $\gamma_T$ grows with the domain's dimension $d$; e.g., for SE kernels, $\gamma_T = \mathcal{O}((\log T)^{d+1})$.

Theorem. Assume $\|r_i\|_{k_i} \le B$. If player $i$ uses GP-MW with $\beta_t = B + \sqrt{2\gamma_{t-1} + \log(2/\delta)}$ and $\eta = \sqrt{(8 \log K_i)/T}$, then with probability at least $1 - \delta$,
  $R_i(T) = \mathcal{O}\!\left( \sqrt{T \log K_i} + B\sqrt{T \gamma_T} + \sqrt{\gamma_T T} \right).$

• For continuous actions $a_i \in \mathbb{R}^{d_i}$ and Lipschitz rewards:
  $R_i(T) = \mathcal{O}\!\left( \sqrt{d_i T \log(d_i T)} + \gamma_T \sqrt{T} \right).$
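For concreteness, the parameter choices prescribed by the theorem translate directly into code (a hedged sketch; the information-gain value `gamma_prev` would in practice come from a kernel-specific bound or estimate):

```python
import numpy as np

def gp_mw_parameters(B, gamma_prev, delta, K_i, T):
    """Confidence width beta_t and learning rate eta as prescribed by the theorem."""
    beta_t = B + np.sqrt(2.0 * gamma_prev + np.log(2.0 / delta))
    eta = np.sqrt(8.0 * np.log(K_i) / T)
    return beta_t, eta

beta_t, eta = gp_mw_parameters(B=1.0, gamma_prev=5.0, delta=0.05, K_i=10, T=1000)
```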

Experiments

• Random zero-sum games:
  - Player-1 uses Hedge, Exp3.P, or GP-MW; Player-2 plays random actions.
  - Player-1 uses GP-MW; Player-2 uses Exp3.P.
  [Figures: average regret of Player-1, and average regrets of both players; shades = ±1 std.]
  GP-MW leads to a smaller regret.

• Repeated traffic routing (Sioux Falls network [http://www.bgu.ac.il/~bargera/tntp/]):
  - 528 agents, $K_i = 5$ possible routes for each agent; congestion is simulated with the BPR model (sketched below).
  - Agents want to minimize their travel times: $r_i(a_i, a_{-i}) = -\mathrm{traveltime}_i(a_i, a_{-i})$.
  - At every round, each agent observes: 1) its incurred travel time, subject to noise; 2) the total occupancy on each edge (i.e., $a_i^t + a_{-i}^t$).
  - GP-MW is run with polynomial kernels of degree 2, 4, or 6; baselines: Q-Learning and Better Replies with Inertia [Chapman et al. 2013].
  GP-MW leads to smaller regret and reduces the congestion.
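The BPR congestion model mentioned above maps edge occupancy to travel time. A minimal sketch with the commonly used BPR coefficients (0.15 and 4), which are an assumption here since the poster does not list them:

```python
import numpy as np

def bpr_travel_time(free_flow_time, flow, capacity, alpha=0.15, beta=4.0):
    """Standard BPR (Bureau of Public Roads) latency: t = t0 * (1 + alpha * (flow / capacity)^beta)."""
    return free_flow_time * (1.0 + alpha * (flow / capacity) ** beta)

def route_reward(edge_t0, edge_flow, edge_capacity, route_edges):
    """Reward of a route = minus its total travel time, summed over the edges it uses."""
    times = bpr_travel_time(edge_t0, edge_flow, edge_capacity)
    return -times[route_edges].sum()

# Toy network with 3 edges; a route using edges 0 and 2.
t0 = np.array([1.0, 2.0, 1.5])
flow = np.array([120.0, 80.0, 200.0])
cap = np.array([100.0, 100.0, 150.0])
print(route_reward(t0, flow, cap, route_edges=np.array([0, 2])))
```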

• Sequential movie recommendation (MovieLens-100k [https://grouplens.org/datasets/movielens/100k/]):
  - At every round, we select a movie $a_t$, an adversary selects a user $u_t$, and we observe the rating $r(a_t, u_t) = a_t^\top r_{u_t} + \epsilon$.
  - We do not know a priori who will see our recommendations.
  - Bayesian optimization baselines:
    - StableOpt [Bogunovic et al. 2018]: $a_t = \arg\max_a \min_u \mathrm{UCB}_t(a, u)$
    - GP-UCB [Srinivas et al. 2010]: $a_t = \arg\max_a \max_u \mathrm{UCB}_t(a, u)$
  GP-MW is no-regret for this task.
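For illustration, the two baseline selection rules above amount to a max-min and a max-max over a table of UCB values (a sketch; `ucb` is an assumed precomputed matrix indexed by movie and user):

```python
import numpy as np

def stableopt_action(ucb):
    """StableOpt-style rule: movie whose worst-case (over users) UCB is largest."""
    return int(np.argmax(ucb.min(axis=1)))

def gp_ucb_action(ucb):
    """GP-UCB-style rule: movie whose best-case (over users) UCB is largest."""
    return int(np.argmax(ucb.max(axis=1)))

# ucb[a, u] = UCB_t(a, u) over 4 movies and 3 users (toy numbers).
ucb = np.array([[0.6, 0.2, 0.5],
                [0.4, 0.4, 0.4],
                [0.9, 0.1, 0.3],
                [0.5, 0.3, 0.6]])
print(stableopt_action(ucb), gp_ucb_action(ucb))
```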

Summary

Under some regularity assumptions and a new feedback model, we propose the GP-MW algorithm. GP-MW improves upon bandit regret guarantees while not relying on full-information feedback.

Acknowledgments

This work was gratefully supported by the Swiss National Science Foundation, under the grant SNSF 200021_172781, and by the ERC grant 815943.

References

[1] P. Auer, N. Cesa-Bianchi, Y. Freund, R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 2003.
[2] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.