(Deep) Reinforcement Learning - Computer Vision Lab
TRANSCRIPT
(Deep) Reinforcement Learning
Kee-Eung Kim, School of Computer Science, KAIST
Machine Learning Tasks
Unsupervised Learning
• Exploratory machine learning: find patterns (structures) in the input, $\hat{y} = g(\vec{x})$
Supervised Learning
• Learning from a teacher: the desired target for each input is known; given training data $\{\vec{x}_i, \vec{y}_i\}_{i=1}^{N}$, learn $\hat{y} = f(\vec{x})$
Reinforcement Learning
• Learning from a critic: the learning algorithm makes a prediction $\hat{y}$, and then the evaluated loss $L(\hat{y}, \vec{y})$ is given as feedback
Google Allo
Machine Learning for Smart Reply
Think about…
• Reply generation: random? max-score replies from a trained model?
• Learn from user interaction? Clicks (i.e. feedback rather than direct specification). If no click, then…?
Need to engage actively with the user and learn from feedback
Reinforcement Learning
An area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, stochastic optimization, multi-agent systems, … [adapted from Wikipedia]
[Figure: agent-environment interaction loop (agent takes actions; environment returns reward & state), from Russell & Norvig, Artificial Intelligence: A Modern Approach]
Multi-Armed Bandits
Stochastic bandits [Robbins 1952]
• $K$: number of arms (or actions)
• $T$: number of rounds
• $r_{i,t} \sim P(r_{i,t}; \theta_i)$: reward random variable for arm $i$ in round $t$ (unknown $\theta_i$)
Bandit algorithm $A$ implements a policy (allocation strategy) that chooses which arm to play based on the sequence of past plays and rewards.
Goal: maximize cumulative reward (= minimize expected regret)
$$R_T = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{I_t,t}\right]$$
(first term: always play the best arm; second term: actual play by the algorithm)
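To make the setup concrete, here is a minimal sketch (not from the slides; class and function names are my own) of a Bernoulli-reward stochastic bandit together with the expected-regret computation above:

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit with Bernoulli rewards: r_{i,t} ~ Ber(theta_i)."""
    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas)          # unknown to the learner
        self.rng = np.random.default_rng(seed)
        self.K = len(self.thetas)

    def pull(self, i):
        """Play arm i and return a 0/1 reward."""
        return float(self.rng.random() < self.thetas[i])

def expected_regret(thetas, plays):
    """R_T = T * max_i theta_i - sum_t theta_{I_t} (rewards in expectation)."""
    thetas = np.asarray(thetas)
    return len(plays) * thetas.max() - thetas[plays].sum()
```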
Regret
Regret measures the online performance of a learning algorithm:
$$R_T = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{I_t,t}\right]$$
(first term: always play the best arm; second term: actual play by the algorithm)
[Figure: regret curves over rounds $T$]
• Sublinear regret implies convergence to the optimal policy
Exploration vs Exploitation Trade-off
Fundamental challenge in RL!
• Always playing the arm that has the best empirical reward (exploit) won't find the true best arm, so very large regret: $O(T)$
• Always playing the arm that was least played (explore) will find the true best arm, but also very large regret: $O(T)$
Formal result [Lai & Robbins 1985]
• Suppose $r_{i,t} \sim \mathrm{Ber}(\mu_i)$ and $\Delta_i = \mu^* - \mu_i$. Then,
$$\liminf_{T \to \infty} \frac{R_T}{\ln T} \ge \sum_{i:\Delta_i > 0} \frac{\Delta_i}{D_{\mathrm{KL}}(\mu_i, \mu^*)}$$
• Roughly, no algorithm can achieve regret that grows slower than $\ln T$.
Upper-Confidence Bound (UCB)
UCB1 policy [Auer et al. 2002]
• Select the arm with the best upper-confidence value
• Optimism in the face of uncertainty (OFU) principle for exploration
• In round $t$, play
$$I_t = \arg\max_{i \in \{1,\dots,K\}} \left[\hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right]$$
where
$$\hat\mu_{i,t-1} = \frac{1}{T_i(t-1)} \sum_{s=1}^{T_i(t-1)} r_{i,s} \quad \text{[empirical mean of reward for arm } i\text{]}$$
$$T_i(t-1) = \sum_{s=1}^{t-1} \delta_{I_s,i} \quad \text{[number of plays of arm } i \text{ in rounds } 1,\dots,t-1\text{]}$$
Formal result:
$$R_T \le \sum_{i \ne i^*} \min\left[\frac{10}{\Delta_i}\ln T,\; T\Delta_i\right]$$
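As one concrete illustration (reusing the hypothetical BernoulliBandit sketch from earlier), UCB1 reduces to a few lines; each arm is played once first so that $T_i(t-1) > 0$:

```python
import numpy as np

def ucb1(bandit, T):
    """UCB1 [Auer et al. 2002]: play the arm maximizing mu_hat + sqrt(2 log t / T_i)."""
    counts = np.zeros(bandit.K)          # T_i(t-1): number of plays of arm i
    means = np.zeros(bandit.K)           # empirical mean reward of arm i
    plays = []
    for t in range(1, T + 1):
        if t <= bandit.K:
            i = t - 1                    # initialization: play each arm once
        else:
            i = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        r = bandit.pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        plays.append(i)
    return plays
```

For example, `expected_regret([0.2, 0.5, 0.7], ucb1(BernoulliBandit([0.2, 0.5, 0.7]), 10000))` should grow roughly logarithmically in the horizon, consistent with the bound above.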
Can You Beat the Bandit?
http://iosband.github.io/2015/07/28/Beat-the-bandit.html
Contextual Bandits [Langford & Zhang 2008]
Real-world scenarios: e.g. recommendation systems, context-sensitive ads, personalized search ranking, medical decision making, …
• Need to learn while interacting with the real world
• Once again, the exploration vs. exploitation trade-off
Contextual Bandits [Langford & Zhang 2008]
For $t = 1, \dots, T$:
• Observe some context $x_t \in X$
• Play arm $a_t \in \{1, \dots, K\}$
• Receive reward $r_{a_t,t} \sim P(r_{a_t,t}; x_t, \theta_{a_t})$
Regret
• Strategy: a mapping $g : X \to \{1, \dots, K\}$
$$R_T = \max_{g : X \to \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^{T} r_{g(x_t),t}\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{I_t,t}\right]$$
(first term: always play the best strategy; second term: actual play by the algorithm)
Formal result
• The S-Exp3 algorithm achieves $R_T \le \sqrt{2T|S|K \ln K}$ (with $S$ the finite set of contexts)
LinUCB [Chu et al 2011]
Contextual bandit with a linear payoff assumption:
$$\mathbb{E}[r \mid x, a] \approx Q_\theta(x, a) = \phi(x, a)^\top \theta$$
Estimate parameters by least-squares regression:
$$A_t = \sum_{\tau=1}^{t} \phi(x_\tau, a_\tau)\phi(x_\tau, a_\tau)^\top, \qquad b_t = \sum_{\tau=1}^{t} \phi(x_\tau, a_\tau)\, r_\tau, \qquad \theta_t = A_t^{-1} b_t$$
Compute an upper-confidence bound for each action:
$$\max_{\theta \in \mathcal{E}} Q_\theta(x, a) = \phi(x, a)^\top \theta + c\sqrt{\phi(x, a)^\top A^{-1} \phi(x, a)}$$
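A sketch of LinUCB as written above, with one shared design matrix over joint features $\phi(x,a)$; the feature map, bonus constant `c`, and ridge term are my own placeholder choices (Chu et al. also describe a per-arm variant):

```python
import numpy as np

class LinUCB:
    """LinUCB with shared parameters over joint features phi(x, a)."""
    def __init__(self, dim, c=1.0, reg=1.0):
        self.A = reg * np.eye(dim)   # A_t = sum of phi phi^T (ridge-regularized)
        self.b = np.zeros(dim)       # b_t = sum of phi * r
        self.c = c                   # width of the confidence bonus

    def select(self, features):
        """features: one phi(x, a) vector per candidate action."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b       # least-squares estimate theta_t
        scores = [phi @ theta + self.c * np.sqrt(phi @ A_inv @ phi)
                  for phi in features]
        return int(np.argmax(scores))

    def update(self, phi, r):
        self.A += np.outer(phi, phi)
        self.b += r * phi
```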
Thompson Sampling [Thompson 1933]
Probability matching heuristic: select each action randomly according to its probability of being optimal:
$$\Pr(a \mid x_t) = \int \mathbb{I}\left( \mathbb{E}[r_a \mid x_t, \theta_a] = \max_{a'} \mathbb{E}[r_{a'} \mid x_t, \theta_{a'}] \right) p(\theta \mid D)\, d\theta$$
Implementation using sampling: draw $\tilde\theta \sim p(\theta \mid D)$, then play the arm that is optimal under $\tilde\theta$.
Formal analysis is also available.
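For Bernoulli rewards the posterior $p(\theta \mid D)$ is Beta, so probability matching reduces to sampling one $\tilde\theta_i$ per arm and playing the argmax. A minimal sketch (naming is my own, reusing the BernoulliBandit class from earlier):

```python
import numpy as np

def thompson_sampling(bandit, T, seed=1):
    """Beta-Bernoulli Thompson sampling: play the argmax of posterior samples."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(bandit.K)               # Beta(1, 1) uniform prior per arm
    beta = np.ones(bandit.K)
    plays = []
    for _ in range(T):
        theta_hat = rng.beta(alpha, beta)   # one posterior sample per arm
        i = int(np.argmax(theta_hat))       # act greedily w.r.t. the sample
        r = bandit.pull(i)
        alpha[i] += r                       # conjugate posterior update
        beta[i] += 1.0 - r
        plays.append(i)
    return plays
```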
Markov Decision Processes (MDPs) [Puterman 1995]
A classical model for sequential decision problems, defined as $M = \langle S, A, R, T, \gamma \rangle$:
• set of possible environment states $s \in S$
• set of possible actions $a \in A$
• reward function $R(s, a)$
• probabilistic transition function $T(s, a, s') = \Pr(s' \mid s, a)$
• discount rate $\gamma \in [0, 1)$
Given MDP $M$ and policy $\pi : S \to A$:
• How good is the policy for each state? The discounted return of a trajectory is
$$R^\pi([s, \pi(s), s', \pi(s'), s'', \pi(s''), \dots]) = R(s, \pi(s)) + \gamma R(s', \pi(s')) + \gamma^2 R(s'', \pi(s'')) + \cdots$$
• State-value function:
$$V^\pi(s) = \mathbb{E}_{s', s'', \dots}\left[R^\pi([s, \pi(s), s', \pi(s'), s'', \pi(s''), \dots])\right] = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^\pi(s')$$
• Action-value function:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^\pi(s')$$
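Since the state-value recursion above is linear in $V^\pi$, small tabular MDPs can be evaluated exactly. A sketch under my own array conventions (R[s, a] for rewards, T[s, a, s'] for transitions, pi[s] for a deterministic policy):

```python
import numpy as np

def policy_evaluation(R, T, pi, gamma):
    """Solve V = R_pi + gamma * T_pi V exactly for a deterministic policy pi."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]        # R(s, pi(s))
    T_pi = T[np.arange(S), pi, :]     # T(s, pi(s), s')
    # (I - gamma * T_pi) V = R_pi
    V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
    Q = R + gamma * (T @ V)           # Q(s, a) = R(s, a) + gamma * sum_s' T V
    return V, Q
```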
Optimal Policy
The optimal policy maximizes the value function for each state:
$$\pi^* = \arg\max_\pi V^\pi(s), \quad \forall s \in S$$
The value function of the optimal policy should satisfy:
• Optimal state-value function: $V^*(s) = \max_\pi V^\pi(s), \; \forall s \in S$
• Optimal action-value function: $Q^*(s, a) = \max_\pi Q^\pi(s, a), \; \forall s \in S, \forall a \in A(s)$
• Further, the Bellman optimality equation:
$$V^*(s) = \max_a Q^*(s, a) = \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$
• We need to find the solution to this system of nonlinear equations
Value Iteration
Suppose that we know the value function of the optimal policy.
• It should satisfy the Bellman optimality equation:
$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$
• Then the optimal policy can be derived by
$$\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^*(s') \right]$$
This readily suggests an iterative algorithm [Bellman update]:
$$V_{i+1}(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V_i(s') \right] \quad \Leftrightarrow \quad V_{i+1} \leftarrow B V_i$$
• As $i \to \infty$, $V_{i+1}$ converges to the utility of the optimal policy
• Proof sketch: the Bellman update $B$ is a contraction mapping
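The Bellman update transcribes directly into code; this sketch follows the same R[s, a], T[s, a, s'] conventions as the policy-evaluation example above:

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-8):
    """Iterate V <- max_a [R(s,a) + gamma * sum_s' T(s,a,s') V(s')] to a fixed point."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (T @ V)       # one-step lookahead from current V_i
        V_next = Q.max(axis=1)        # Bellman update: V_{i+1} = B V_i
        if np.max(np.abs(V_next - V)) < tol:
            return V_next, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_next
```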
Planning vs Learning
Planning: any computational process that uses a model to create or improve a policy
• The agent has a priori knowledge about the model of the environment (e.g. an MDP)
Reinforcement Learning
• The agent doesn't have a priori knowledge about the model
• It may explicitly learn (i.e. construct) the model from the interaction, but this is not a necessity
• Exploration vs. exploitation trade-off
Category of RL Algorithms
• Model-based algorithms: from experience $\langle s, a, r, s' \rangle$, learn the model $T, R$, then derive a policy $\pi$
• Value-based algorithms: from experience, learn $Q^*(s, a)$ directly, then act greedily
• Policy-search algorithms: from experience, learn the policy $\pi^*(s)$ directly
Littman, MLSS 2009
Model based: MDP Learning
On experience $\langle s_t, a_t, r_t, s_{t+1} \rangle$:
• $R(s_t, a_t) \leftarrow R(s_t, a_t) + \alpha_t (r_t - R(s_t, a_t))$
• $T(s_t, a_t, s_{t+1}) \leftarrow T(s_t, a_t, s_{t+1}) + \alpha_t (1 - T(s_t, a_t, s_{t+1}))$
• $T(s_t, a_t, s') \leftarrow T(s_t, a_t, s') + \alpha_t (0 - T(s_t, a_t, s')) \quad \forall s' \ne s_{t+1}$
• $Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$
If:
• every $\langle s, a \rangle$ is visited infinitely often
• $\sum_t \alpha_t = \infty, \; \sum_t \alpha_t^2 < \infty$
Then:
• $Q(s, a) \to Q^*(s, a)$ [Littman 1996]
Littman, MLSS 2009
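The three model updates above can be written as one step over the learner's estimates R_hat and T_hat (my own naming); the transition update keeps each row of T_hat normalized, and Q can then be computed from the learned model with e.g. the value-iteration sketch above:

```python
def update_model(R_hat, T_hat, s, a, r, s_next, alpha):
    """One model-learning step on experience <s, a, r, s_next>."""
    R_hat[s, a] += alpha * (r - R_hat[s, a])          # move reward estimate toward r
    T_hat[s, a, :] += alpha * (0.0 - T_hat[s, a, :])  # shrink all next-state probs...
    T_hat[s, a, s_next] += alpha                      # ...then bump the observed one
```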
Value based: Q-Learning
On experience $\langle s_t, a_t, r_t, s_{t+1} \rangle$:
• $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)$
If:
• every $\langle s, a \rangle$ is visited infinitely often
• $\sum_t \alpha_t = \infty, \; \sum_t \alpha_t^2 < \infty$
Then:
• $Q(s, a) \to Q^*(s, a)$ [Watkins & Dayan 1992]
Littman, MLSS 2009
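A tabular Q-learning sketch; the environment interface (reset() returning a state, step(a) returning state, reward, done) is an assumption, as is the ε-greedy exploration from a later slide:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, alpha=0.1, eps=0.1,
               episodes=1000, seed=0):
    """Tabular Q-learning [Watkins & Dayan 1992] with eps-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(n_actions))      # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])     # TD update toward target
            s = s_next
    return Q
```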
Policy Search: Policy Gradient Methods
Represent the policy with parameter $\theta$, and do
$$\theta_{i+1} \leftarrow \theta_i + \alpha \nabla_\theta V_\theta$$
where
$$\begin{aligned}
\nabla_\theta V_\theta &= \nabla_\theta \mathbb{E}_\tau[R(\tau)] = \nabla_\theta \sum_\tau p_\theta(\tau)\, R(\tau) = \sum_\tau \nabla_\theta p_\theta(\tau)\, R(\tau) \\
&= \sum_\tau p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \approx \frac{1}{N} \sum_{\tau \in D} \nabla_\theta \log p_\theta(\tau)\, R(\tau)
\end{aligned}$$
Note:
• $p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
• $\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Self-play to improve the policy network in AlphaGo
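A minimal REINFORCE-style sketch of the estimator above for a tabular softmax policy (the theta[s, a] parameterization and the env interface are assumptions, matching the Q-learning sketch); for a softmax, grad log pi(a|s) is one-hot(a) minus the action probabilities:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(a | s) for a tabular softmax parameterization theta[s, a]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_episode(env, theta, gamma, alpha, rng):
    """One step of theta += alpha * R(tau) * grad log p_theta(tau)."""
    s, done = env.reset(), False
    traj, ret, disc = [], 0.0, 1.0
    while not done:
        probs = softmax_policy(theta, s)
        a = int(rng.choice(len(probs), p=probs))
        s_next, r, done = env.step(a)
        traj.append((s, a))
        ret += disc * r                     # accumulate discounted return R(tau)
        disc *= gamma
        s = s_next
    for s, a in traj:
        grad = -softmax_policy(theta, s)    # grad log pi = one-hot(a) - probs
        grad[a] += 1.0
        theta[s] += alpha * ret * grad
    return ret
```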
Exploration vs. Exploitation Trade-off
Uniformly selecting actions (!)
• $\Pr(a \mid s) = 1 / |A|$
ε-greedy policy
• $\Pr(a \mid s) = \begin{cases} 1/|A| & \text{with prob. } \epsilon \\ 1 & \text{with prob. } 1-\epsilon, \text{ if } a = \arg\max_{a'} Q(s, a') \\ 0 & \text{with prob. } 1-\epsilon, \text{ if } a \ne \arg\max_{a'} Q(s, a') \end{cases}$
ε-soft policy
• Choose some constant $k \ge 1$
• $\Pr(a \mid s) = \dfrac{k^{Q(s,a)}}{\sum_{a'} k^{Q(s,a')}}$
Thompson sampling is gaining popularity [Osband & Van Roy 2013]
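The three selection rules above, as code over a row Q_s = Q(s, ·); the ε-soft rule uses $k^{Q} = \exp(Q \ln k)$, computed in a numerically stable way:

```python
import numpy as np

def uniform_action(Q_s, rng):
    return int(rng.integers(len(Q_s)))      # Pr(a|s) = 1/|A|

def eps_greedy_action(Q_s, eps, rng):
    if rng.random() < eps:                  # explore uniformly with prob. eps
        return int(rng.integers(len(Q_s)))
    return int(np.argmax(Q_s))              # exploit otherwise

def eps_soft_action(Q_s, k, rng):
    """Pr(a|s) = k^Q(s,a) / sum_a' k^Q(s,a'), for a constant k >= 1."""
    logits = np.asarray(Q_s) * np.log(k)
    p = np.exp(logits - logits.max())       # stabilized exponentiation
    return int(rng.choice(len(Q_s), p=p / p.sum()))
```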
Continuous or Large State-Spaces
Represent states as feature vectors:
• $\phi(s) = [\phi_1(s), \phi_2(s), \dots, \phi_K(s)]^\top$
• e.g. linear model for value: $V_t(s) = \theta_t^\top \phi(s) = \sum_{i=1}^{K} \theta_{t,i}\, \phi_i(s)$
• $\nabla_{\theta_t} V_t(s) = \phi(s)$
Stochastic Gradient Descent
• Mean squared error: $\mathrm{MSE}(\theta_t) = \frac{1}{2} \int_s p(s)\, [V^\pi(s) - V_t(s)]^2\, ds$
• Gradient: $\nabla_{\theta_t} \mathrm{MSE}(\theta_t) = -\int_s p(s)\, [V^\pi(s) - V_t(s)]\, \nabla_{\theta_t} V_t(s)\, ds$
• Training rule: $\theta_{t+1} \leftarrow \theta_t + \alpha\, [v_t - V_t(s_t)]\, \nabla_\theta V_t(s_t)$
• If $\mathbb{E}[v_t] = V^\pi(s_t)$, then $\theta_t$ will converge to a local optimum
• Q-learning: $v_t = r_t + \gamma \max_a Q(s_{t+1}, a)$ and $V_t(s_t) = Q(s_t, a_t)$
The foundation of Deep Q-Network (DQN) and the value network in AlphaGo (roughly speaking)
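The training rule with the Q-learning target, for a linear model over state-action features (phi is an assumed feature function; this is the function-approximation analogue of the tabular Q-learning sketch earlier):

```python
import numpy as np

def linear_q_update(theta, phi, s, a, r, s_next, actions, gamma, alpha):
    """One SGD step: theta += alpha * [v_t - Q_theta(s,a)] * grad Q_theta(s,a)."""
    q_sa = phi(s, a) @ theta
    v_t = r + gamma * max(phi(s_next, b) @ theta for b in actions)  # bootstrap target
    theta += alpha * (v_t - q_sa) * phi(s, a)    # gradient of a linear Q is phi(s, a)
    return theta
```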
Research Directions
Bayesian Reinforcement Learning
• Bayes-optimal actions = optimal explore-exploit trade-off!
$$\pi^*(s) = \arg\max_a \int Q(s, a \mid M)\, p(M \mid D)\, dM$$
• Computationally intractable, but good approximations exist
• [Dearden et al 1998], [Duff 2002], [Poupart et al 2006], [Wang et al 2012], [Kim et al 2015], [Lee & Kim 2015], …
Inverse Reinforcement Learning
• The inverse problem of reinforcement learning
• [Figure: expert's behavior → IRL → reward function; reward function + dynamics model → RL → policy]
• [Ng & Russell 2000], [Ziebart et al 2008], [Choi & Kim 2015], …
Learning from environment (reward) and trainer (feedback)
• Motivated by the behaviorism of Skinner [1965]
• [Knox & Stone 2012], [Griffith et al 2013], [Loftin et al 2016], [Hong et al 2016]