Post on 18-Dec-2015
Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback
Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann
Nonstochastic sequential decision-making
• K actions and T time steps
• $\ell_t(a)$: loss of action a at time t
• At time t
– player picks action $X_t$
– incurs loss $\ell_t(X_t)$
– observes feedback on losses
• Multi-armed bandit: only $\ell_t(X_t)$
• Experts (full information): $\ell_t(j)$ for any j
Nonstochastic sequential decision-making
• Goal:
– minimize losses
– benchmark: the best single action
• the action j that minimizes the loss
– no stochastic assumptions on losses
• Regret
$$R_T = \underbrace{E\Big[\sum_{t=1}^{T} \ell_t(X_t)\Big]}_{\text{player loss}} - \underbrace{\min_j \sum_{t=1}^{T} \ell_t(j)}_{\text{best action}}$$
• Known regret bounds:
– MAB: $\sqrt{TK}$
– Experts: $\sqrt{T \ln K}$
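The regret definition above can be made concrete with a small sketch. It computes the realized regret of one run (a single sample of the quantity inside the expectation), assuming the full loss matrix is available after the fact. The function name `realized_regret` and the array layout are illustrative choices, not part of the talk.

```python
import numpy as np

def realized_regret(losses, played):
    """Regret of one run against the best fixed action in hindsight.

    losses: (T, K) array, losses[t, a] is the loss of action a at time t.
    played: length-T sequence with the index of the action picked at each step.
    """
    losses = np.asarray(losses, dtype=float)
    T = losses.shape[0]
    # Total loss actually incurred by the player.
    player_loss = losses[np.arange(T), np.asarray(played)].sum()
    # Total loss of the single best action over all T steps.
    best_action_loss = losses.sum(axis=0).min()
    return player_loss - best_action_loss

# Two actions, three steps; the player picks the worse action every time.
example_losses = [[0, 1], [1, 0], [0, 1]]
print(realized_regret(example_losses, [1, 0, 1]))
```

Averaging this value over many independent runs of a randomized player would estimate the expected regret $R_T$.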
Modeling
Directed vs Undirected
• Different types of dependencies
• Different measures
– Independent set
– Dominating set
– Max Acyclic Subgraph
Informed vs Uninformed
• When does the learner observe the graph?
– Before
– After
• only the neighbors
Our Results
Uninformed setting
• Undirected graph
• Uninformed setting
– only the neighbors of the node
– independent sets
• Directed graph
– Max Acyclic Subgraph (not tight)
– random Erdős-Rényi graphs
Informed setting
• Directed graphs
• Regret characterization
– dominating sets and independent sets
• Both expectation and high probability
$$\tilde{O}\big(\sqrt{\alpha(G)\, T \ln K}\big)$$
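EXP3-SET
• Online Algorithm
$$\Pr[X_t = a] \propto \exp\Big(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(a)\Big)$$
where
$$\hat{\ell}_t(a) = \begin{cases} \dfrac{\ell_t(a)}{\Pr[\ell_t(a) \text{ is observed}]} & \text{if } \ell_t(a) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
$$E[\hat{\ell}_t(a)] = \ell_t(a)$$
• Theorem
$$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{a} \Pr[X_t = a \mid \ell_t(a) \text{ is observed}]$$
With η tuned, this gives $O\big(\sqrt{\ln K \sum_{t=1}^{T} \alpha(G_t)}\big)$, which is $O\big(\sqrt{\alpha(G)\, T \ln K}\big)$ for a fixed graph G.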
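The update rule above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' reference implementation: the observation graph is fixed, every action observes itself, and `neighbors[a]` (a hypothetical name) lists the other actions whose losses are revealed when a is played. The observation probabilities are computed from the graph here for brevity.

```python
import numpy as np

def exp3_set(losses, neighbors, eta, rng=None):
    """Run an EXP3-SET-style learner and return its total loss.

    losses: (T, K) array of losses in [0, 1].
    neighbors: neighbors[a] is an iterable of actions observed when a is played
               (a itself is always observed; this is an assumption of the sketch).
    eta: learning rate.
    """
    rng = np.random.default_rng(rng)
    losses = np.asarray(losses, dtype=float)
    T, K = losses.shape
    # obs[a, j] = 1 if playing a reveals the loss of j.
    obs = np.eye(K)
    for a in range(K):
        for j in neighbors[a]:
            obs[a, j] = 1.0
    cum_est = np.zeros(K)  # cumulative estimated losses
    total = 0.0
    for t in range(T):
        # Exponential weights; subtracting the minimum only improves numerical stability.
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        x = rng.choice(K, p=p)
        total += losses[t, x]
        # q[j] = Pr[loss of j is observed] = sum of p[a] over actions a that reveal j.
        q = p @ obs
        seen = obs[x] > 0
        # Importance-weighted estimate: loss / Pr[observed] if observed, 0 otherwise.
        cum_est[seen] += losses[t, seen] / q[seen]
    return total
```

With an empty `neighbors` map the estimator reduces to the usual bandit importance weighting, and with a complete graph every `q[j]` equals 1, which recovers the full-information update.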
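EXP3-SET Regret – key lemma
• Lemma
$$Q = \sum_{a \in G} \frac{\Pr[X_t = a]}{\Pr[a \text{ observed}]} \le \alpha(G)$$
Note: MAB: Q = K; Full info: Q = 1
• Proof: build an independent set S
– consider action a with minimal Pr[a observed]
– add a to S
– delete a and its neighbors
• Note
$$\sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[j \text{ observed}]} \le \sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[a \text{ observed}]} = 1$$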
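The proof construction can be checked numerically on small undirected graphs. The sketch below assumes each action observes itself and its neighbors, so Pr[a observed] is the probability mass on a and its neighborhood. The helper names `observation_ratio` and `greedy_independent_set` are illustrative, not from the talk.

```python
def observation_ratio(p, adj):
    """Return (Q, q) where q[a] = Pr[a observed] and Q = sum_a p[a] / q[a]."""
    K = len(p)
    q = [p[a] + sum(p[j] for j in adj[a]) for a in range(K)]
    Q = sum(p[a] / q[a] for a in range(K))
    return Q, q

def greedy_independent_set(q, adj):
    """Build S as in the proof: pick the remaining action with minimal
    observation probability, add it to S, delete it and its neighbors."""
    remaining = set(range(len(q)))
    S = []
    while remaining:
        a = min(remaining, key=lambda v: q[v])
        S.append(a)
        remaining -= {a} | set(adj[a])
    return S

# 5-cycle with the uniform distribution.
cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
p = [0.2] * 5
Q, q = observation_ratio(p, cycle)
S = greedy_independent_set(q, cycle)
print(Q, len(S))  # Q should not exceed the size of the independent set S
```

Each deleted group contributes at most 1 to Q, so Q is bounded by |S|, which is at most α(G). An empty graph gives Q = K (the MAB case) and a complete graph gives Q = 1 (full information).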
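EXP3-DOM
• Simplified version
– fixed graph G
– D is a dominating set
• log approximation
• Main modification
– add probabilities to D
• induce observability
• probabilities:
$$p_{a,t} = (1-\gamma)\frac{w_{a,t}}{W_t} + \frac{\gamma}{|D|}\, I[a \in D]$$
• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
$$w_{a,t+1} = w_{a,t} \exp\big(-\gamma\, \hat{\ell}_t(a) / |D|\big)$$
$$\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$$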
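The probability mixture in the simplified version can be sketched as follows. The greedy set-cover routine is one standard way to get a logarithmic approximation of the smallest dominating set; the slides do not say which routine is used, so treat it as an assumption. `out_nbrs[a]` is a hypothetical name for the set of actions observed when a is played.

```python
import numpy as np

def greedy_dominating_set(out_nbrs):
    """Greedy set cover: repeatedly add the action that observes the most
    still-unobserved actions. Each action is assumed to observe itself."""
    K = len(out_nbrs)
    uncovered = set(range(K))
    D = []
    while uncovered:
        best = max(range(K), key=lambda a: len(({a} | set(out_nbrs[a])) & uncovered))
        D.append(best)
        uncovered -= {best} | set(out_nbrs[best])
    return D

def exp3_dom_probs(w, D, gamma):
    """p_a = (1 - gamma) * w_a / W + (gamma / |D|) * I[a in D]."""
    w = np.asarray(w, dtype=float)
    p = (1.0 - gamma) * w / w.sum()
    p[D] += gamma / len(D)
    return p

# Action 0 observes everything, so D = [0] and it receives the exploration mass.
graph = [{1, 2, 3}, {2}, {3}, set()]
D = greedy_dominating_set(graph)
print(D, exp3_dom_probs(np.ones(4), D, gamma=0.2))
```

Because every action is observed by some member of D, each observation probability is at least γ/|D|, which keeps the importance-weighted estimates bounded.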
EXP3-DOM
• Simple example
• Transitive observability
– tournament
• action 1 observes all actions
– D={1}
• EXP3-DOM
• Sample action 1 with prob γ
– action 1 is the exploration
• Otherwise run a MAB
– specifically EXP3-SET
• Intuition
– action 1 replaces the mixture with uniform
Conclusion
• Observability model
– between MAB and Experts
• more work to be done
• Uninformed setting
– undirected graph
• Informed setting
– directed graph
• [Kocák, Neu, Valko and R. Munos] improved the uninformed setting
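EXP3-DOM: key lemma
• Lemma
– G directed graph,
– $d_i^-$ indegree of i,
– α=α(G)
$$\sum_{i=1}^{K} \frac{1}{1 + d_i^-} \le 2\alpha \ln\Big(1 + \frac{K}{\alpha}\Big)$$
• Turán's Theorem
– undirected graph G(V,E)
$$\alpha(G) \ge \frac{|V|}{\frac{2|E|}{|V|} + 1}$$
• Proof: high level
– shrink graph
• $G_K, G_{K-1}, \dots$
– delete nodes
• step s:
– delete max indegree node
• From Turán's theorem
$$\max_i d_{i,s}^- \ge \frac{|E_s|}{|V_s|} \ge \frac{|V_s|}{2\alpha} - \frac{1}{2}$$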
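EXP3-DOM: key lemma (proof)
• Completing the proof (node K is the max indegree node of $G_K$, deleted first)
$$\begin{aligned}
\sum_{i=1}^{K} \frac{1}{1+d_{i,K}^-} &= \frac{1}{1+d_{K,K}^-} + \sum_{i=1}^{K-1} \frac{1}{1+d_{i,K}^-} \\
&\le \frac{2\alpha}{\alpha+K} + \sum_{i=1}^{K-1} \frac{1}{1+d_{i,K}^-} \\
&\le \frac{2\alpha}{\alpha+K} + \sum_{i=1}^{K-1} \frac{1}{1+d_{i,K-1}^-} \\
&\le \dots \le 2\alpha \sum_{i=1}^{K} \frac{1}{\alpha+i} \le 2\alpha \ln\Big(1+\frac{K}{\alpha}\Big)
\end{aligned}$$
• Note, due to edge elimination
$$d_{i,K-1}^- \le d_{i,K}^-$$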
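The bound of the key lemma can be sanity-checked by brute force on small directed graphs. In this sketch α is taken to be the independence number of the graph with arc orientations ignored, which is how it is read here, and it is computed by exhaustive search, so the code is only usable for a handful of nodes. The function names are illustrative.

```python
import math
from itertools import combinations

def independence_number(K, arcs):
    """Largest set of nodes with no arc between any two of them (orientation ignored)."""
    undirected = {frozenset(e) for e in arcs if e[0] != e[1]}
    for size in range(K, 0, -1):
        for S in combinations(range(K), size):
            if all(frozenset(pair) not in undirected for pair in combinations(S, 2)):
                return size
    return 0

def lemma_sides(K, arcs):
    """Return (sum_i 1/(1 + indegree_i), 2 * alpha * ln(1 + K / alpha))."""
    indeg = [0] * K
    for u, v in set(arcs):
        if u != v:
            indeg[v] += 1
    lhs = sum(1.0 / (1 + d) for d in indeg)
    alpha = independence_number(K, arcs)
    rhs = 2 * alpha * math.log(1 + K / alpha)
    return lhs, rhs

# Transitive tournament on 5 nodes: alpha = 1, yet the sum is a harmonic number.
tournament = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(lemma_sides(5, tournament))
```

The tournament example illustrates why a logarithmic factor appears: the independence number is 1, while the left-hand side grows like ln K.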
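EXP3-DOM – key lemma (modified)
• Lemma (what we really need!)
• G(V,E) directed graph
– $IN_i$ in-neighbors of i
– r size of a dominating set; α size of the largest independent set
– p distribution over V
• $p_i \ge \beta$
$$Q = \sum_{i=1}^{K} \frac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\Big(1 + \frac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\Big) + 2r$$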
EXP3-DOM: changing graphs
• Simple
– all dom. sets same size
– approx. same size
• Problem
– different size dom. sets
• can be 1 or K
• Solution
– keep log levels
• depend on $\log_2 |D_t|$
– algorithm per level
• Complications
– parameters depend on level
– setting the learning rate
• needs a delicate doubling
• Main tech. challenge
– handle dynamic adversary
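EXP3-DOM
• receive obs. graph
– find dominating set $D_t$
• logarithmic approximation
• Run the right copy
– let $b_t = \log_2 |D_t|$
– run copy $b_t$
• log copies
• For copy $b_t$
– parameters depend on $b_t$
• probabilities:
$$p_{a,t} = (1-\gamma)\frac{w_{a,t}}{W_t} + \frac{\gamma}{|D_t|}\, I[a \in D_t]$$
• Select $X_t$ using p
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
$$w_{a,t+1} = w_{a,t} \exp\big(-\gamma\, \hat{\ell}_t(a) / 2^{b_t}\big)$$
$$\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$$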
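The per-level bookkeeping can be sketched as below. Rounding $\log_2 |D_t|$ down to an integer level, the class name `LeveledWeights`, and passing γ as an argument are assumptions of this sketch; the slides only say that parameters depend on the level and that tuning needs a delicate doubling, which is not attempted here.

```python
import math
import numpy as np

def level_of(D_t):
    """Level index of a dominating set: floor(log2 |D_t|) (flooring is an assumption)."""
    return math.floor(math.log2(len(D_t)))

class LeveledWeights:
    """Keeps one weight vector per level b, so each copy only sees the rounds
    whose dominating set has size roughly 2**b."""

    def __init__(self, K):
        self.K = K
        self.weights = {}  # level b -> weight vector of copy b

    def probs(self, D_t, gamma):
        """Return (b_t, p_t) with p = (1 - gamma) * w / W + (gamma / |D_t|) * I[a in D_t]."""
        b = level_of(D_t)
        w = self.weights.setdefault(b, np.ones(self.K))
        p = (1.0 - gamma) * w / w.sum()
        p[list(D_t)] += gamma / len(D_t)
        return b, p

    def update(self, b, est_losses, gamma):
        """w <- w * exp(-gamma * estimated_loss / 2**b), applied to copy b only."""
        est = np.asarray(est_losses, dtype=float)
        self.weights[b] = self.weights[b] * np.exp(-gamma * est / 2 ** b)

copies = LeveledWeights(K=4)
b, p = copies.probs([0], gamma=0.1)        # |D_t| = 1, so copy 0 is used
copies.update(b, [1.0, 0.0, 0.0, 0.0], gamma=0.1)
b2, p2 = copies.probs([0, 1, 2], gamma=0.1)  # |D_t| = 3, so copy 1 is used
print(b, b2)
```

Only the copy selected in a round is updated, so rounds with small and large dominating sets do not interfere with each other's weights.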
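EXP3-DOM – main Theorem
• Theorem:
$$R_T \le \sum_{b=0}^{\log K} \Big( \frac{2^b \ln K}{\gamma_b} + \gamma_b\, E\Big[\sum_{t \in T_b} \Big(1 + \frac{Q_t^b}{2^{b+1}}\Big)\Big] \Big)$$
• tuning $\gamma_b$
$$R_T = O\Big( (\ln K)\, E\Big[\sqrt{\sum_{t=1}^{T} \big(4|D_t| + Q_t^{b_t}\big)}\Big] + (\ln K) \ln(KT) \Big)$$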