
Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback

Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann

Nonstochastic sequential decision-making

• K actions and T time steps
• $\ell_t(a)$ – loss of action a at time t
• At time t
  – player picks action $X_t$
  – incurs loss $\ell_t(X_t)$
  – observes feedback on losses
    • Multi-armed bandit (MAB): only $\ell_t(X_t)$
    • Experts (full information): $\ell_t(j)$ for any j

Nonstochastic sequential decision-making

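• Goal:
  – minimize losses
  – benchmark: the best single action
    • the action j that minimizes the loss
  – no stochastic assumptions on losses

• Regret

  $R_T = \underbrace{E\Big[\sum_{t=1}^{T} \ell_t(X_t)\Big]}_{\text{player loss}} - \underbrace{\min_j \sum_{t=1}^{T} \ell_t(j)}_{\text{best action}}$

• Known regret bounds:
  – MAB: $\tilde{O}(\sqrt{KT})$
  – Experts: $O(\sqrt{T \ln K})$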
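The regret compares the player's cumulative loss with that of the best single action in hindsight. A minimal sketch of the realized regret of one play sequence (the slide's definition takes an expectation over the player's randomness, which this sketch does not); the names `regret`, `losses` and `plays` are hypothetical and chosen only for illustration:

```python
def regret(losses, plays):
    """Realized regret of a play sequence against the best fixed action.

    losses[t][j] is the loss of action j at step t;
    plays[t] is the action the player picked at step t.
    """
    num_actions = len(losses[0])
    # Cumulative loss actually incurred by the player.
    player_loss = sum(loss_t[x] for loss_t, x in zip(losses, plays))
    # Cumulative loss of the best single action in hindsight.
    best_action_loss = min(
        sum(loss_t[j] for loss_t in losses) for j in range(num_actions)
    )
    return player_loss - best_action_loss
```

For example, with losses `[[0, 1], [1, 0], [0, 1]]` and plays `[1, 1, 1]`, the player loses 2 while action 0 loses 1 in total, so the realized regret is 1.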

Motivation – observability

[Figure: two example observation graphs, labeled "undirected" and "directed"]

undirected observation graph

• MAB: no edges
• Experts: clique

Modeling

Directed vs Undirected

• Different types of dependencies

• Different measures
  – Independent set
  – Dominating set
  – Max Acyclic Subgraph

Informed vs Uninformed

• When does the learner observe the graph?
  – Before
  – After
    • only the neighbors

Our Results

Uninformed setting
• Undirected graph
  – Only the neighbors of the node are observed
  – Independent sets
• Directed graph
  – Max Acyclic Subgraph (not tight)
  – Random Erdős-Rényi graphs

Informed setting
• Directed graphs
• Regret characterization
  – dominating sets and ind. set
• Both expectation and high prob.

• Regret of order $\tilde{O}\big(\sqrt{\alpha(G)\, T \ln K}\big)$

EXP3-SET

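• Online Algorithm

  $\Pr[X_t = a] \propto \exp\Big(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(a)\Big)$

  $\hat{\ell}_t(a) = \begin{cases} \dfrac{\ell_t(a)}{\Pr[\ell_t(a) \text{ is observed}]} & \text{if } \ell_t(a) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$

  $E[\hat{\ell}_t(a)] = \ell_t(a)$

• Theorem

  $R_T \le \dfrac{\ln K}{\eta} + \dfrac{\eta}{2} \sum_{t=1}^{T} \sum_a \Pr[X_t = a \mid \ell_t(a) \text{ is observed}] = O\big(\sqrt{\alpha(G)\, T \ln K}\big)$ for a suitably tuned η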
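The update rule can be sketched in a few lines: play with probability proportional to the exponentiated negative cumulative loss estimates, then importance-weight every observed loss by its probability of being observed. This is a minimal sketch assuming a fixed undirected graph given as neighbor sets and losses in [0, 1]; the function name `exp3_set` and its parameters are hypothetical, not taken from the slides.

```python
import math
import random

def exp3_set(losses, neighbors, eta, rng):
    """Sketch of Exp3-SET on a fixed undirected observation graph.

    losses[t][a]: loss of action a at step t (assumed in [0, 1]).
    neighbors[a]: set of actions adjacent to a (a itself excluded);
                  playing a reveals the losses of a and its neighbors.
    Returns the cumulative loss incurred by the player.
    """
    K = len(neighbors)
    cum_est = [0.0] * K  # cumulative loss estimates per action
    total_loss = 0.0
    for loss_t in losses:
        # Pr[X_t = a] proportional to exp(-eta * cumulative estimate);
        # subtracting the minimum only improves numerical stability.
        low = min(cum_est)
        w = [math.exp(-eta * (c - low)) for c in cum_est]
        W = sum(w)
        p = [v / W for v in w]
        x_t = rng.choices(range(K), weights=p)[0]
        total_loss += loss_t[x_t]
        for a in neighbors[x_t] | {x_t}:
            # a is observed when a itself or one of its neighbors is played.
            q = p[a] + sum(p[b] for b in neighbors[a] if b != a)
            cum_est[a] += loss_t[a] / q  # unbiased importance-weighted estimate
    return total_loss
```

With no edges the sketch reduces to a bandit-style update (only the played action is estimated), and with a clique every estimate equals the true loss, as in the experts setting.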

EXP3-Set Regret – key lemma

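• Lemma

  $Q = \sum_a \dfrac{\Pr[X = a]}{\Pr[a \text{ observed}]} \le \alpha(G)$

  Note: MAB: Q = K; Full info.: Q = 1

• Proof: Build an i.s. S
  – consider action a with minimal Pr[a observed]
  – Add a to S
  – Delete a and its neighbors

• Note (N(a) = neighbors of a)

  $\sum_{b \in N(a) \cup \{a\}} \dfrac{\Pr[X = b]}{\Pr[b \text{ observed}]} \le \sum_{b \in N(a) \cup \{a\}} \dfrac{\Pr[X = b]}{\Pr[a \text{ observed}]} = 1$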

Dominating set – directed graph

EXP3-DOM

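• Simplified version
  – fixed graph G
  – D is a dominating set
    • log approx
• Main modification
  – add probabilities to D
    • induce observability

• probabilities:

  $p_{a,t} = (1-\gamma)\, \dfrac{w_{a,t}}{W_t} + \dfrac{\gamma}{|D|}\, I[a \in D]$, where $W_t = \sum_a w_{a,t}$

• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights

  $w_{a,t+1} = w_{a,t} \exp\big(-\gamma\, \hat{\ell}_t(a) / |D|\big)$

  $\hat{\ell}_t(a) = \dfrac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$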
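A minimal sketch of the simplified, fixed-graph version follows. It assumes the dominating set is found by the standard greedy set-cover heuristic (one way to obtain a logarithmic approximation), that losses lie in [0, 1], and that the graph is given as out-neighbor sets; `greedy_dominating_set`, `exp3_dom_fixed` and the per-round weight rescaling are illustrative choices, not details confirmed by the slides.

```python
import math
import random

def greedy_dominating_set(out_nbrs):
    """Greedy set cover; out_nbrs[a] = actions whose loss is seen when a is played."""
    K = len(out_nbrs)
    uncovered = set(range(K))
    D = []
    while uncovered:
        # Action that reveals the most still-uncovered actions (itself included).
        best = max(range(K), key=lambda a: len((out_nbrs[a] | {a}) & uncovered))
        D.append(best)
        uncovered -= out_nbrs[best] | {best}
    return D

def exp3_dom_fixed(losses, out_nbrs, gamma, rng):
    """Sketch of simplified Exp3-DOM on a fixed directed graph."""
    K = len(out_nbrs)
    D = greedy_dominating_set(out_nbrs)
    w = [1.0] * K
    total_loss = 0.0
    for loss_t in losses:
        W = sum(w)
        # Exponential weights mixed with uniform exploration over D.
        p = [(1 - gamma) * w[a] / W + (gamma / len(D) if a in D else 0.0)
             for a in range(K)]
        x_t = rng.choices(range(K), weights=p)[0]
        total_loss += loss_t[x_t]
        for a in out_nbrs[x_t] | {x_t}:
            # Pr[observe a] = p[a] + sum of p[b] over actions b that reveal a.
            q = p[a] + sum(p[b] for b in range(K) if b != a and a in out_nbrs[b])
            w[a] *= math.exp(-gamma * (loss_t[a] / q) / len(D))
        top = max(w)
        w = [v / top for v in w]  # rescaling leaves p unchanged, avoids underflow
    return total_loss, D
```

Because every action is revealed by some member of D, each observation probability is at least γ/|D|, which keeps the loss estimates bounded.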

EXP3-DOM

• Simple example
• Transitive observability
  – tournament
    • action 1 observes all actions
  – D={1}

• EXP3-DOM
  • Sample action 1 with prob γ
    – action 1 is the exploration
  • Otherwise run a MAB
    – specifically EXP3-SET
  • Intuition
    – action 1 replaces mixture with uniform

Conclusion

• Observability model
  – Between MAB and Experts
    • more work to be done

• Uninformed setting
  – Undirected graph

• Informed setting
  – Directed graph

• [Kocak, Neu, Valko and R. Munos] improved the results for the uninformed setting

Thank You

Outline

• Model and motivation
• symmetric observability
• non-symmetric observability

EXP3-DOM: key lemma

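• Lemma
  – G directed graph,
  – $d_i^-$ indegree of i,
  – α=α(G)

  $\sum_{i=1}^{K} \dfrac{1}{1 + d_i^-} \le 2\alpha \ln\Big(1 + \dfrac{K}{\alpha}\Big)$

• Turán's Theorem
  – undirected graph G(V,E): $\alpha(G) \ge \dfrac{|V|}{2|E|/|V| + 1}$

• Proof: high level
  – shrink graph
    • $G_K, G_{K-1}, \dots$
  – delete nodes
    • step s: delete max indegree node

• From Turán's theorem, for $G_s = (V_s, E_s)$ with independence number $\alpha_s$:

  $\max_i d^-_{i,s} \ge \dfrac{|E_s|}{|V_s|} \ge \dfrac{|V_s|}{2\alpha_s} - \dfrac{1}{2}$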

EXP3-DOM: key lemma (proof)

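• Completing the proof ($d^-_{i,s}$ = indegree of node i in $G_s$, nodes sorted by nonincreasing indegree)

  $\sum_{i=1}^{K} \dfrac{1}{1 + d^-_{i,K}} \le \sum_{s=1}^{K} \dfrac{2\alpha}{\alpha + s} \le 2\alpha \ln\Big(1 + \dfrac{K}{\alpha}\Big)$

• Note, due to edge elimination

  $\sum_{i=2}^{K} \dfrac{1}{1 + d^-_{i,K}} \le \sum_{i=1}^{K-1} \dfrac{1}{1 + d^-_{i,K-1}}$, since $d^-_{i+1,K} \ge d^-_{i,K-1}$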
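The inequality can be sanity-checked on small directed graphs. The sketch below assumes the lemma has the form "sum over i of 1/(1 + indegree of i) is at most 2α ln(1 + K/α)", where α is the independence number of the graph with arc directions ignored; it computes α by brute force, so it is only meant for tiny examples. All function names are hypothetical.

```python
import math
from itertools import combinations

def independence_number(K, arcs):
    """Brute-force independence number, ignoring arc orientation."""
    adjacent = {frozenset(arc) for arc in arcs}
    for size in range(K, 0, -1):
        for subset in combinations(range(K), size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return size
    return 0

def indegree_sum(K, arcs):
    """Left-hand side of the lemma: sum over nodes of 1 / (1 + indegree)."""
    indeg = [0] * K
    for _, head in arcs:
        indeg[head] += 1
    return sum(1.0 / (1 + d) for d in indeg)

def lemma_holds(K, arcs):
    """Check the assumed bound on one graph with nodes 0..K-1."""
    alpha = independence_number(K, arcs)
    return indegree_sum(K, arcs) <= 2 * alpha * math.log(1 + K / alpha)
```

For a transitive tournament on 4 nodes the left-hand side is 1 + 1/2 + 1/3 + 1/4 while α = 1, so the bound 2 ln 5 holds with room to spare.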

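EXP3-DOM: key lemma (modified)

• Lemma (what we really need!)
• G(V,E) directed graph
  – $IN_i$ in-neighbors of i
  – r size of dominating set; and α size of ind. set
  – p distribution over V
    • $p_i \ge \beta$ for i in the dominating set

  $\sum_{i} \dfrac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\Big(1 + \dfrac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\Big) + 2r$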

EXP3-DOM: changing graphs

• Simple
  – all dom. sets same size
  – approx. same size

• Problem
  – different size dom. sets
    • can be 1 or K

• Solution
  – keep log levels
    • depend on $\log_2 |D_t|$
  – algorithm per level

• Complications
  – parameters depend on level
  – setting the learning rate
    • need a delicate doubling

• Main tech. challenge
  – handle dynamic adversary.
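The "log levels" idea can be sketched as a routing step: each round is assigned to a level determined by the size of that round's dominating set, so that all rounds handled by one copy of the algorithm have dominating sets of comparable size. The sketch assumes the level is the floor of log2 of the set size (the slides write only "log2"); `level_of` and `route_rounds` are hypothetical names.

```python
def level_of(dom_set_size):
    """Level index: floor(log2(size)), computed exactly on integers."""
    if dom_set_size < 1:
        raise ValueError("a dominating set has at least one node")
    return dom_set_size.bit_length() - 1

def route_rounds(dom_set_sizes):
    """Group round indices by the level of that round's dominating set."""
    rounds_per_level = {}
    for t, size in enumerate(dom_set_sizes):
        rounds_per_level.setdefault(level_of(size), []).append(t)
    return rounds_per_level
```

With K actions there are at most floor(log2 K) + 1 distinct levels, so only logarithmically many copies need to be maintained.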

EXP3-DOM

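• receive obs. graph
  – find dominating set $D_t$
    • logarithmic approximation

• Run the right copy
  – Let $b_t = \lfloor \log_2 |D_t| \rfloor$
  – run copy $b_t$
    • log copies

• For copy $b_t$
  – param. depend on $b_t$

• probabilities (weights and γ are those of copy $b_t$):

  $p_{a,t} = (1 - \gamma_{b_t})\, \dfrac{w_{a,t}}{W_t} + \dfrac{\gamma_{b_t}}{|D_t|}\, I[a \in D_t]$

• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights

  $w_{a,t+1} = w_{a,t} \exp\big(-\gamma_{b_t}\, \hat{\ell}_t(a) / 2^{b_t}\big)$

  $\hat{\ell}_t(a) = \dfrac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$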

EXP3-DOM – main Theorem

• Theorem:

  $R_T = O\Big(\ln K \sqrt{\ln(KT) \sum_{t=1}^{T} E\big[4|D_t| + Q_t\big]} + \ln K \ln(KT)\Big)$

• tuning $\gamma_b$

Independent set

• Independent set α(G)
• [Mannor & Shamir 2012]
• Tight Regret: $O\big(\sqrt{T\, \alpha(G) \ln K}\big)$
  – α(G) "replaces" K
• Cons:
  – requires observing G
  – solves an LP each step
