(Deep) Reinforcement Learning - Computer Vision Lab
TRANSCRIPT
(Deep) Reinforcement Learning
Kee-Eung Kim, School of Computer Science, KAIST
Machine Learning Tasks
Unsupervised Learning
• Exploratory machine learning: find patterns (structures) in the input, $\hat{y} = g(\vec{x})$
Supervised Learning
• Learning from a teacher: the desired target for each input is known; given training data $\{\vec{x}_i, \vec{y}_i\}_{i=1}^{N}$, learn $\hat{y} = f(\vec{x})$
Reinforcement Learning
• Learning from a critic: the learning algorithm makes a prediction $\hat{y}$, and then the evaluated loss $L(\hat{y}, \vec{y})$ is given as feedback
Google Allo
Machine Learning for Smart Reply
Think about…
• Reply generation: random? max-score replies from a trained model?
• Learn from user interaction? Clicks (i.e. feedback rather than direct specification). If no click, then…?
Need to engage actively with the user and learn from feedback
Reinforcement Learning
An area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, stochastic optimization, multi-agent systems, … [adapted from Wikipedia]
[Figure: agent-environment interaction loop (agent takes actions; environment returns reward & state), from Russell & Norvig, Artificial Intelligence: A Modern Approach]
Multi-Armed Bandits
Stochastic bandits [Robbins 1952]
• $K$: number of arms (or actions)
• $T$: number of rounds
• $r_{i,t} \sim P(r_{i,t}; \theta_i)$: reward random variable for arm $i$ in round $t$ (unknown $\theta_i$)
Bandit algorithm $A$ implements a policy (allocation strategy) that chooses which arm to play based on the sequence of past plays and rewards.
Goal: maximize cumulative reward (= minimize expected regret)
$$R_T = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{I_t,t}\right]$$
(first term: always play the best arm; second term: actual play by the algorithm)
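To make the setup concrete, here is a minimal sketch (not from the slides; class and function names are my own) of a Bernoulli-reward stochastic bandit together with the expected-regret computation above:

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit with Bernoulli rewards: r_{i,t} ~ Ber(theta_i)."""
    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas)          # unknown to the learner
        self.rng = np.random.default_rng(seed)
        self.K = len(self.thetas)

    def pull(self, i):
        """Play arm i and return a 0/1 reward."""
        return float(self.rng.random() < self.thetas[i])

def expected_regret(thetas, plays):
    """R_T = T * max_i theta_i - sum_t theta_{I_t} (rewards in expectation)."""
    thetas = np.asarray(thetas)
    return len(plays) * thetas.max() - thetas[plays].sum()
```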
Regret
Regret measures the online performance of a learning algorithm:
$$R_T = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{I_t,t}\right]$$
(first term: always play the best arm; second term: actual play by the algorithm)
[Figure: regret curves over rounds $T$]
• Sublinear regret implies convergence to the optimal policy
Exploration vs Exploitation Trade-off
Fundamental challenge in RL!
• Always playing the arm that has the best empirical reward (exploit) won't find the true best arm, so very large regret: $O(T)$
• Always playing the arm that was least played (explore) will find the true best arm, but also very large regret: $O(T)$
Formal result [Lai & Robbins 1985]
• Suppose $r_{i,t} \sim \mathrm{Ber}(\mu_i)$ and $\Delta_i = \mu^* - \mu_i$. Then,
$$\liminf_{T \to \infty} \frac{R_T}{\ln T} \ge \sum_{i:\Delta_i > 0} \frac{\Delta_i}{D_{\mathrm{KL}}(\mu_i, \mu^*)}$$
• Roughly, no algorithm can achieve regret that grows slower than $\ln T$.
Upper-Confidence Bound (UCB)
UCB1 policy [Auer et al. 2002]
• Select the arm with the best upper-confidence value
• Optimism in the face of uncertainty (OFU) principle for exploration
• In round $t$, play
$$I_t = \arg\max_{i \in \{1,\dots,K\}} \left[\hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right]$$
where
$$\hat\mu_{i,t-1} = \frac{1}{T_i(t-1)} \sum_{s=1}^{T_i(t-1)} r_{i,s} \quad \text{[empirical mean of reward for arm } i\text{]}$$
$$T_i(t-1) = \sum_{s=1}^{t-1} \delta_{I_s,i} \quad \text{[number of plays of arm } i \text{ in rounds } 1,\dots,t-1\text{]}$$
Formal result:
$$R_T \le \sum_{i \ne i^*} \min\left[\frac{10}{\Delta_i}\ln T,\; T\Delta_i\right]$$
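As one concrete illustration (reusing the hypothetical BernoulliBandit sketch from earlier), UCB1 reduces to a few lines; each arm is played once first so that $T_i(t-1) > 0$:

```python
import numpy as np

def ucb1(bandit, T):
    """UCB1 [Auer et al. 2002]: play the arm maximizing mu_hat + sqrt(2 log t / T_i)."""
    counts = np.zeros(bandit.K)          # T_i(t-1): number of plays of arm i
    means = np.zeros(bandit.K)           # empirical mean reward of arm i
    plays = []
    for t in range(1, T + 1):
        if t <= bandit.K:
            i = t - 1                    # initialization: play each arm once
        else:
            i = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        r = bandit.pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        plays.append(i)
    return plays
```

For example, `expected_regret([0.2, 0.5, 0.7], ucb1(BernoulliBandit([0.2, 0.5, 0.7]), 10000))` should grow roughly logarithmically in the horizon, consistent with the bound above.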
Can You Beat the Bandit?
http://iosband.github.io/2015/07/28/Beat-the-bandit.html
Contextual Bandits [Langford & Zhang 2008]
Real-world scenarios: e.g. recommendation systems, context-sensitive ads, personalized search ranking, medical decision making, …
• Need to learn while interacting with the real world
• Once again, the exploration vs. exploitation trade-off
Contextual Bandits [Langford & Zhang 2008]
For $t = 1, \dots, T$:
• Observe some context $x_t \in X$
• Play arm $a_t \in \{1, \dots, K\}$
• Receive reward $r_{a_t,t} \sim P(r_{a_t,t}; x_t, \theta_{a_t})$
Regret
• Strategy: a mapping $g : X \to \{1, \dots, K\}$
$$R_T = \max_{g : X \to \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^{T} r_{g(x_t),t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{I_t,t}\right]$$
(first term: always play the best strategy; second term: actual play by the algorithm)
Formal result
• The S-Exp3 algorithm achieves $R_T \le \sqrt{2T|S|K \ln K}$ (with $S$ the finite set of contexts)
LinUCB [Chu et al 2011]
Contextual bandit with a linear payoff assumption:
$$\mathbb{E}[r \mid x, a] \approx Q_\theta(x, a) = \phi(x, a)^\top \theta$$
Estimate parameters by least-squares regression:
$$A_t = \sum_{\tau=1}^{t} \phi(x_\tau, a_\tau)\phi(x_\tau, a_\tau)^\top, \qquad b_t = \sum_{\tau=1}^{t} \phi(x_\tau, a_\tau)\, r_\tau, \qquad \theta_t = A_t^{-1} b_t$$
Compute an upper-confidence bound for each action:
$$\max_{\theta \in \mathcal{E}} Q_\theta(x, a) = \phi(x, a)^\top \theta + c\sqrt{\phi(x, a)^\top A^{-1} \phi(x, a)}$$
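A sketch of LinUCB as written above, with one shared design matrix over joint features $\phi(x,a)$; the feature map, bonus constant `c`, and ridge term are my own placeholder choices (Chu et al. also describe a per-arm variant):

```python
import numpy as np

class LinUCB:
    """LinUCB with shared parameters over joint features phi(x, a)."""
    def __init__(self, dim, c=1.0, reg=1.0):
        self.A = reg * np.eye(dim)   # A_t = sum of phi phi^T (ridge-regularized)
        self.b = np.zeros(dim)       # b_t = sum of phi * r
        self.c = c                   # width of the confidence bonus

    def select(self, features):
        """features: one phi(x, a) vector per candidate action."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b       # least-squares estimate theta_t
        scores = [phi @ theta + self.c * np.sqrt(phi @ A_inv @ phi)
                  for phi in features]
        return int(np.argmax(scores))

    def update(self, phi, r):
        self.A += np.outer(phi, phi)
        self.b += r * phi
```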
Thompson Sampling [Thompson 1933]
Probability matching heuristic: select each action randomly according to its probability of being optimal:
$$\Pr(a \mid x_t) = \int \mathbb{I}\left( \mathbb{E}[r_a \mid x_t, \theta_a] = \max_{a'} \mathbb{E}[r_{a'} \mid x_t, \theta_{a'}] \right) p(\theta \mid D)\, d\theta$$
Implementation using sampling: draw $\tilde\theta \sim p(\theta \mid D)$, then play the arm that is optimal under $\tilde\theta$.
Formal analysis is also available.
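For Bernoulli rewards the posterior $p(\theta \mid D)$ is Beta, so probability matching reduces to sampling one $\tilde\theta_i$ per arm and playing the argmax. A minimal sketch (naming is my own, reusing the BernoulliBandit class from earlier):

```python
import numpy as np

def thompson_sampling(bandit, T, seed=1):
    """Beta-Bernoulli Thompson sampling: play the argmax of posterior samples."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(bandit.K)               # Beta(1, 1) uniform prior per arm
    beta = np.ones(bandit.K)
    plays = []
    for _ in range(T):
        theta_hat = rng.beta(alpha, beta)   # one posterior sample per arm
        i = int(np.argmax(theta_hat))       # act greedily w.r.t. the sample
        r = bandit.pull(i)
        alpha[i] += r                       # conjugate posterior update
        beta[i] += 1.0 - r
        plays.append(i)
    return plays
```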
Markov Decision Processes (MDPs) [Puterman 1995]
A classical model for sequential decision problems, defined as $M = \langle S, A, R, T, \gamma \rangle$:
• set of possible environment states $s \in S$
• set of possible actions $a \in A$
• reward function $R(s, a)$
• probabilistic transition function $T(s, a, s') = \Pr(s' \mid s, a)$
• discount rate $\gamma \in [0, 1)$
Given MDP $M$ and policy $\pi : S \to A$:
• How good is the policy for each state? The discounted return of a trajectory is
$$R^\pi([s, \pi(s), s', \pi(s'), s'', \pi(s''), \dots]) = R(s, \pi(s)) + \gamma R(s', \pi(s')) + \gamma^2 R(s'', \pi(s'')) + \cdots$$
• State-value function:
$$V^\pi(s) = \mathbb{E}_{s', s'', \dots}\left[R^\pi([s, \pi(s), s', \pi(s'), s'', \pi(s''), \dots])\right] = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^\pi(s')$$
• Action-value function:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$$
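Since the state-value recursion above is linear in $V^\pi$, small tabular MDPs can be evaluated exactly. A sketch under my own array conventions (R[s, a] for rewards, T[s, a, s'] for transitions, pi[s] for a deterministic policy):

```python
import numpy as np

def policy_evaluation(R, T, pi, gamma):
    """Solve V = R_pi + gamma * T_pi V exactly for a deterministic policy pi."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]        # R(s, pi(s))
    T_pi = T[np.arange(S), pi, :]     # T(s, pi(s), s')
    # (I - gamma * T_pi) V = R_pi
    V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
    Q = R + gamma * (T @ V)           # Q(s, a) = R(s, a) + gamma * sum_s' T V
    return V, Q
```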
Optimal Policy
The optimal policy maximizes the value function for each state:
$$\pi^* = \arg\max_\pi V^\pi(s), \quad \forall s \in S$$
The value function of the optimal policy should satisfy:
• Optimal state-value function: $V^*(s) = \max_\pi V^\pi(s), \; \forall s \in S$
• Optimal action-value function: $Q^*(s, a) = \max_\pi Q^\pi(s, a), \; \forall s \in S, \forall a \in A(s)$
• Further, the Bellman optimality equation:
$$V^*(s) = \max_a Q^*(s, a) = \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$
• We need to find the solution to this system of nonlinear equations
Value Iteration
Suppose that we know the value function of the optimal policy.
• It should satisfy the Bellman optimality equation:
$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$
• Then the optimal policy can be derived by
$$\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$
This readily suggests an iterative algorithm [Bellman update]:
$$V_{i+1}(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V_i(s') \right] \quad \Leftrightarrow \quad V_{i+1} \leftarrow B V_i$$
• As $i \to \infty$, $V_{i+1}$ converges to the utility of the optimal policy
• Proof sketch: the Bellman update $B$ is a contraction mapping
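The Bellman update transcribes directly into code; this sketch follows the same R[s, a], T[s, a, s'] conventions as the policy-evaluation example above:

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-8):
    """Iterate V <- max_a [R(s,a) + gamma * sum_s' T(s,a,s') V(s')] to a fixed point."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (T @ V)       # one-step lookahead from current V_i
        V_next = Q.max(axis=1)        # Bellman update: V_{i+1} = B V_i
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_next
```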
Planning vs Learning
Planning: any computational process that uses a model to create or improve a policy
• The agent has a priori knowledge about the model of the environment (e.g. an MDP)
Reinforcement Learning
• The agent doesn't have a priori knowledge about the model
• It may explicitly learn (i.e. construct) the model from the interaction, but this is not a necessity
• Exploration vs. exploitation trade-off
Category of RL Algorithms
• Model-based algorithms: from experience $\langle s, a, r, s' \rangle$, learn the model $T, R$, then derive a policy $\pi$
• Value-based algorithms: from experience, learn $Q^*(s, a)$ directly, then act greedily
• Policy-search algorithms: from experience, learn the policy $\pi^*(s)$ directly
Littman, MLSS 2009
Model based: MDP Learning
On experience $\langle s_t, a_t, r_t, s_{t+1} \rangle$:
• $R(s_t, a_t) \leftarrow R(s_t, a_t) + \alpha_t (r_t - R(s_t, a_t))$
• $T(s_t, a_t, s_{t+1}) \leftarrow T(s_t, a_t, s_{t+1}) + \alpha_t (1 - T(s_t, a_t, s_{t+1}))$
• $T(s_t, a_t, s') \leftarrow T(s_t, a_t, s') + \alpha_t (0 - T(s_t, a_t, s')) \quad \forall s' \ne s_{t+1}$
• $Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$
If:
• every $\langle s, a \rangle$ is visited infinitely often
• $\sum_t \alpha_t = \infty, \; \sum_t \alpha_t^2 < \infty$
Then:
• $Q(s, a) \to Q^*(s, a)$ [Littman 1996]
Littman, MLSS 2009
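The three model updates above can be written as one step over the learner's estimates R_hat and T_hat (my own naming); the transition update keeps each row of T_hat normalized, and Q can then be computed from the learned model with e.g. the value-iteration sketch above:

```python
def update_model(R_hat, T_hat, s, a, r, s_next, alpha):
    """One model-learning step on experience <s, a, r, s_next>."""
    R_hat[s, a] += alpha * (r - R_hat[s, a])          # move reward estimate toward r
    T_hat[s, a, :] += alpha * (0.0 - T_hat[s, a, :])  # shrink all next-state probs...
    T_hat[s, a, s_next] += alpha                      # ...then bump the observed one
```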
Value based: Q-Learning
On experience $\langle s_t, a_t, r_t, s_{t+1} \rangle$:
• $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)$
If:
• every $\langle s, a \rangle$ is visited infinitely often
• $\sum_t \alpha_t = \infty, \; \sum_t \alpha_t^2 < \infty$
Then:
• $Q(s, a) \to Q^*(s, a)$ [Watkins & Dayan 1992]
Littman, MLSS 2009
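A tabular Q-learning sketch; the environment interface (reset() returning a state, step(a) returning state, reward, done) is an assumption, as is the ε-greedy exploration from a later slide:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, alpha=0.1, eps=0.1,
               episodes=1000, seed=0):
    """Tabular Q-learning [Watkins & Dayan 1992] with eps-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(n_actions))      # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])     # TD update toward target
            s = s_next
    return Q
```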
Policy Search: Policy Gradient Methods
Represent the policy with parameter $\theta$, and do
$$\theta_{i+1} \leftarrow \theta_i + \alpha \nabla_\theta V_\theta$$
where
$$\begin{aligned}
\nabla_\theta V_\theta &= \nabla_\theta \mathbb{E}_\tau[R(\tau)] = \nabla_\theta \sum_\tau p_\theta(\tau)\, R(\tau) = \sum_\tau \nabla_\theta p_\theta(\tau)\, R(\tau) \\
&= \sum_\tau p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \approx \frac{1}{N} \sum_{\tau \in D} \nabla_\theta \log p_\theta(\tau)\, R(\tau)
\end{aligned}$$
Note:
• $p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
• $\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Self-play to improve the policy network in AlphaGo
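A minimal REINFORCE-style sketch of the estimator above for a tabular softmax policy (the theta[s, a] parameterization and the env interface are assumptions, matching the Q-learning sketch); for a softmax, grad log pi(a|s) is one-hot(a) minus the action probabilities:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(a | s) for a tabular softmax parameterization theta[s, a]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_episode(env, theta, gamma, alpha, rng):
    """One step of theta += alpha * R(tau) * grad log p_theta(tau)."""
    s, done = env.reset(), False
    traj, ret, disc = [], 0.0, 1.0
    while not done:
        probs = softmax_policy(theta, s)
        a = int(rng.choice(len(probs), p=probs))
        s_next, r, done = env.step(a)
        traj.append((s, a))
        ret += disc * r                     # accumulate discounted return R(tau)
        disc *= gamma
        s = s_next
    for s, a in traj:
        grad = -softmax_policy(theta, s)    # grad log pi = one-hot(a) - probs
        grad[a] += 1.0
        theta[s] += alpha * ret * grad
    return ret
```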
Exploration vs. Exploitation Trade-off
Uniformly selecting actions (!)
• $\Pr(a \mid s) = 1 / |A|$
ε-greedy policy
• $\Pr(a \mid s) = \begin{cases} 1/|A| & \text{with prob. } \epsilon \\ 1 & \text{with prob. } 1-\epsilon, \text{ if } a = \arg\max_{a'} Q(s, a') \\ 0 & \text{with prob. } 1-\epsilon, \text{ if } a \ne \arg\max_{a'} Q(s, a') \end{cases}$
ε-soft policy
• Choose some constant $k \ge 1$
• $\Pr(a \mid s) = \dfrac{k^{Q(s,a)}}{\sum_{a'} k^{Q(s,a')}}$
Thompson sampling is gaining popularity [Osband & Van Roy 2013]
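The three selection rules above, as code over a row Q_s = Q(s, ·); the ε-soft rule uses $k^{Q} = \exp(Q \ln k)$, computed in a numerically stable way:

```python
import numpy as np

def uniform_action(Q_s, rng):
    return int(rng.integers(len(Q_s)))      # Pr(a|s) = 1/|A|

def eps_greedy_action(Q_s, eps, rng):
    if rng.random() < eps:                  # explore uniformly with prob. eps
        return int(rng.integers(len(Q_s)))
    return int(np.argmax(Q_s))              # exploit otherwise

def eps_soft_action(Q_s, k, rng):
    """Pr(a|s) = k^Q(s,a) / sum_a' k^Q(s,a'), for a constant k >= 1."""
    logits = np.asarray(Q_s) * np.log(k)
    p = np.exp(logits - logits.max())       # stabilized exponentiation
    return int(rng.choice(len(Q_s), p=p / p.sum()))
```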
Continuous or Large State-Spaces
Represent states as feature vectors:
• $\phi(s) = [\phi_1(s), \phi_2(s), \dots, \phi_K(s)]^\top$
• e.g. linear model for value: $V_t(s) = \theta_t^\top \phi(s) = \sum_{i=1}^{K} \theta_{t,i}\, \phi_i(s)$
• $\nabla_{\theta_t} V_t(s) = \phi(s)$
Stochastic Gradient Descent
• Mean squared error: $\mathrm{MSE}(\theta_t) = \frac{1}{2} \int_s p(s)\, [V^\pi(s) - V_t(s)]^2\, ds$
• Gradient: $\nabla_{\theta_t} \mathrm{MSE}(\theta_t) = -\int_s p(s)\, [V^\pi(s) - V_t(s)]\, \nabla_{\theta_t} V_t(s)\, ds$
• Training rule: $\theta_{t+1} \leftarrow \theta_t + \alpha\, [v_t - V_t(s_t)]\, \nabla_\theta V_t(s_t)$
• If $\mathbb{E}[v_t] = V^\pi(s_t)$, then $\theta_t$ will converge to a local optimum
• Q-learning: $v_t = r_t + \gamma \max_a Q(s_{t+1}, a)$ and $V_t(s_t) = Q(s_t, a_t)$
The foundation of Deep Q-Network (DQN) and the value network in AlphaGo (roughly speaking)
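The training rule with the Q-learning target, for a linear model over state-action features (phi is an assumed feature function; this is the function-approximation analogue of the tabular Q-learning sketch earlier):

```python
import numpy as np

def linear_q_update(theta, phi, s, a, r, s_next, actions, gamma, alpha):
    """One SGD step: theta += alpha * [v_t - Q_theta(s,a)] * grad Q_theta(s,a)."""
    q_sa = phi(s, a) @ theta
    v_t = r + gamma * max(phi(s_next, b) @ theta for b in actions)  # bootstrap target
    theta += alpha * (v_t - q_sa) * phi(s, a)    # gradient of a linear Q is phi(s, a)
    return theta
```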
Research Directions
Bayesian Reinforcement Learning
• Bayes-optimal actions = optimal explore-exploit trade-off!
$$\pi^*(s) = \arg\max_a \int Q(s, a \mid M)\, p(M \mid D)\, dM$$
• Computationally intractable, but good approximations exist
• [Dearden et al 1998], [Duff 2002], [Poupart et al 2006], [Wang et al 2012], [Kim et al 2015], [Lee & Kim 2015], …
Inverse Reinforcement Learning
• The inverse problem of reinforcement learning
• [Figure: expert's behavior → IRL → reward function; reward function + dynamics model → RL → policy]
• [Ng & Russell 2000], [Ziebart et al 2008], [Choi & Kim 2015], …
Learning from environment (reward) and trainer (feedback)
• Motivated by the behaviorism of Skinner [1965]
• [Knox & Stone 2012], [Griffith et al 2013], [Loftin et al 2016], [Hong et al 2016]