Introduction of Reinforcement Learning
Deep Reinforcement Learning
Reference
• Textbook: Reinforcement Learning: An Introduction
  • http://incompleteideas.net/sutton/book/the-book.html
• Lectures of David Silver
  • http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1:30 each)
  • http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
• Lectures of John Schulman
  • https://youtu.be/aUrX-rP_ss4
Scenario of Reinforcement Learning
[Figure: the Agent observes the Environment (observation = state), takes an Action that changes the environment, and receives a Reward, here "Don't do that".]
Scenario of Reinforcement Learning
[Figure: after the state changes, the agent acts again and this time receives the reward "Thank you".]
Agent learns to take actions maximizing expected reward.
Image source: https://yoast.com/how-to-clean-site-structure/
Machine Learning ≈ Looking for a Function
[Figure: the Observation is the function input, the Action is the function output, and the Reward is used to pick the best function.]
Action = π(Observation)
The function π is called the Actor (or Policy).
Learning to play Go
[Figure: the Observation is the current board position; the Action is the next move.]
Learning to play Go
• If win, reward = 1
• If loss, reward = -1
• reward = 0 in most cases (most moves receive no immediate reward)
Agent learns to take actions maximizing expected reward.
Learning to play Go
• Supervised: learning from a teacher
  • Next move: "5-5"
  • Next move: "3-3"
• Reinforcement Learning: learning from experience
  • First move …… many moves …… Win!
  • (Two agents play with each other.)
AlphaGo is supervised learning + reinforcement learning.
Learning a chat-bot
• Machine obtains feedback from user
• Chat-bot learns to maximize the expected reward
  • "How are you?" → "Bye bye ☺": reward -10
  • "Hello" → "Hi ☺": reward +3
Image sources: https://image.freepik.com/free-vector/variety-of-human-avatars_23-2147506285.jpg, http://www.freepik.com/free-vector/variety-of-human-avatars_766615.htm
Learning a chat-bot
• Let two agents talk to each other (sometimes they generate good dialogue, sometimes bad)
Bad dialogue: "How old are you?" / "See you." / "See you." / "See you."
Good dialogue: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
Learning a chat-bot
• By this approach, we can generate a lot of dialogues (Dialogue 1, Dialogue 2, …, Dialogue 8).
• Use some pre-defined rules to evaluate the goodness of a dialogue.
• Machine learns from the evaluation.
Deep Reinforcement Learning for Dialogue Generation: https://arxiv.org/pdf/1606.01541v3.pdf
Learning a chat-bot
• Supervised: learning from a teacher
  • "Hello" → Say "Hi"
  • "Bye bye" → Say "Good bye"
• Reinforcement: two agents chat ("Hello ☺" …… many turns ……), and the whole dialogue is judged (e.g. "Bad") at the end
More applications
• Flying Helicopter
  • https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
  • https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Robot
  • https://www.youtube.com/watch?v=370cT-OAzzM
• Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
  • http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
• Text generation
  • https://www.youtube.com/watch?v=pbQ4qe8EwLo
Example: Playing Video Game
• Widely studied:
  • Gym: https://gym.openai.com/
  • Universe: https://openai.com/blog/universe/
• Machine learns to play video games as human players do
  • Machine learns to take proper actions itself
  • What the machine observes is pixels
Example: Playing Video Game
• Space Invaders
  [Figure: game screen with the player's spaceship, shields, the aliens, and the score (reward); the action shown is "fire".]
  • Kill the aliens to obtain score (reward)
  • Termination: all the aliens are killed, or your spaceship is destroyed.
Example: Playing Video Game
• Space Invaders
  • Play yourself: http://www.2600online.com/spaceinvaders.html
  • How about machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw
Example: Playing Video Game
Start with observation s_1 → Observation s_2 → Observation s_3 → ……
• Action a_1: "right"; obtain reward r_1 = 0
• Action a_2: "fire" (kill an alien); obtain reward r_2 = 5
Usually there is some randomness in the environment.
Example: Playing Video Game
After many turns: Action a_T; obtain reward r_T; Game Over (spaceship destroyed).
The whole sequence from start to game over is an episode.
Learn to maximize the expected cumulative reward per episode.
Properties of Reinforcement Learning
• Reward delay
  • In Space Invaders, only "fire" obtains a reward
    • although the moves before "fire" are important
  • In Go, it may be better to sacrifice immediate reward to gain more long-term reward
• Agent's actions affect the subsequent data it receives
  • E.g. exploration
Outline
• Policy-based approach: Learning an Actor
• Value-based approach: Learning a Critic
• Actor + Critic
• Model-based vs. model-free approaches
AlphaGo: policy-based + value-based + model-based
Policy-based Approach: Learning an Actor
Three Steps for Deep Learning
• Step 1: define a set of functions (Neural Network as Actor)
• Step 2: goodness of function
• Step 3: pick the best function
Deep Learning is so simple ……
Neural Network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g. the pixels of the game screen)
• Output of the neural network: each action corresponds to a neuron in the output layer, giving the probability of taking that action
  • e.g. left 0.1, right 0.2, fire 0.7
• What is the benefit of using a network instead of a lookup table? Generalization: the network can act sensibly even on observations it has never seen. A sketch of such an actor network is given below.
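To make the actor concrete, here is a minimal PyTorch sketch (my own illustration, not from the lecture): observation in, one output neuron per action, softmax on top. The `Actor` class name, layer sizes, and `obs_dim` are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: observation -> probability of each action."""
    def __init__(self, obs_dim: int, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
            nn.Softmax(dim=-1),   # outputs sum to 1: a distribution over actions
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

actor = Actor(obs_dim=64)       # e.g. a tiny 8x8 screen, flattened (illustrative)
probs = actor(torch.rand(64))   # -> something like [0.1, 0.2, 0.7]
action = torch.distributions.Categorical(probs).sample()  # stochastic actor
```

Because the output is a distribution, the actor is stochastic: sampling from it is one source of the randomness discussed later.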
Three Steps for Deep Learning — now Step 2: goodness of function.
Goodness of Actor
• Review: supervised learning
  • Given a set of parameters θ, each training example (e.g. an image x_1, x_2, …, x_256 fed through the network, softmax outputs y_1, y_2, …, y_10, one-hot target "1" = (1, 0, 0, …)) contributes a loss l measuring how close the output and the target are.
  • Total loss over N training examples: L = Σ_{n=1}^N l_n, and we pick the function minimizing it.
Goodness of Actor
• Given an actor π_θ(s) with network parameter θ
• Use the actor π_θ(s) to play the video game:
  • Start with observation s_1
  • Machine decides to take a_1
  • Machine obtains reward r_1
  • Machine sees observation s_2
  • Machine decides to take a_2
  • Machine obtains reward r_2
  • Machine sees observation s_3
  • ……
  • Machine decides to take a_T
  • Machine obtains reward r_T
Total reward: R_θ = Σ_{t=1}^T r_t
Even with the same actor, R_θ is different each time, because of randomness in the actor and the game.
We define R̄_θ as the expected value of R_θ.
R̄_θ evaluates the goodness of an actor π_θ(s).
Goodness of Actor
• A trajectory: τ = {s_1, a_1, r_1, s_2, a_2, r_2, …, s_T, a_T, r_T}
P(τ|θ) = p(s_1) p(a_1|s_1, θ) p(r_1, s_2|s_1, a_1) p(a_2|s_2, θ) p(r_2, s_3|s_2, a_2) ⋯
       = p(s_1) ∏_{t=1}^T p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)
• p(a_t|s_t, θ) is controlled by your actor π_θ; p(r_t, s_{t+1}|s_t, a_t) is the environment and is not related to your actor.
• Example: if the actor π_θ outputs left 0.1, right 0.2, fire 0.7 at s_t, then p(a_t = "fire"|s_t, θ) = 0.7.
Goodness of Actor
• An episode is considered as a trajectory τ
  • τ = {s_1, a_1, r_1, s_2, a_2, r_2, …, s_T, a_T, r_T}
  • R(τ) = Σ_{t=1}^T r_t
• If you use an actor to play the game, each τ has a probability of being sampled
  • The probability depends on the actor parameter θ: P(τ|θ)
R̄_θ = Σ_τ R(τ) P(τ|θ)   (sum over all possible trajectories)
In practice, use π_θ to play the game N times, obtaining τ^1, τ^2, …, τ^N; this is sampling τ from P(τ|θ) N times:
R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n)
A rollout-based estimate of this kind is sketched below.
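A sketch of this sampling estimate, assuming the `Actor` sketched earlier and a classic OpenAI Gym-style environment (the 4-tuple `step` API; newer gymnasium versions differ). All names here are illustrative.

```python
import torch

def play_one_episode(env, actor):
    """Roll out one episode tau with the current actor; return R(tau)."""
    obs, total_reward, done = env.reset(), 0.0, False
    while not done:
        probs = actor(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.distributions.Categorical(probs).sample().item()
        obs, reward, done, _ = env.step(action)   # classic Gym API assumed
        total_reward += reward                    # R(tau) = sum of r_t
    return total_reward

def estimate_expected_reward(env, actor, n_episodes=100):
    # R_bar ~= (1/N) * sum_n R(tau^n)
    returns = [play_one_episode(env, actor) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```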
Three Steps for Deep Learning — now Step 3: pick the best function.
Gradient Ascent
• Problem statement: θ* = arg max_θ R̄_θ
• Gradient ascent:
  • Start with θ^0
  • θ^1 ← θ^0 + η ∇R̄_{θ^0}
  • θ^2 ← θ^1 + η ∇R̄_{θ^1}
  • ……
• Here θ = {w_1, w_2, …, b_1, …} and ∇R̄_θ is the vector of partial derivatives [∂R̄_θ/∂w_1, ∂R̄_θ/∂w_2, …, ∂R̄_θ/∂b_1, …]
Policy Gradient
R̄_θ = Σ_τ R(τ) P(τ|θ)    ∇R̄_θ = ?
∇R̄_θ = Σ_τ R(τ) ∇P(τ|θ)
      = Σ_τ R(τ) P(τ|θ) ∇P(τ|θ)/P(τ|θ)
      = Σ_τ R(τ) P(τ|θ) ∇log P(τ|θ)
      ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ)   (use π_θ to play the game N times, obtaining τ^1, τ^2, …, τ^N)
• R(τ) does not have to be differentiable; it can even be a black box.
• This uses the identity d log f(x)/dx = (1/f(x)) df(x)/dx (verified numerically below).
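As a quick sanity check of the log-derivative identity (a side illustration, not part of the lecture), autograd reproduces f′(x)/f(x) for an arbitrary positive f:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
f = x ** 2 + 1.0                # any positive f(x)
torch.log(f).backward()         # d log f / dx
print(x.grad)                   # 2x / (x^2 + 1) = 4/5 = 0.8
```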
Policy Gradient
• τ = {s_1, a_1, r_1, s_2, a_2, r_2, …, s_T, a_T, r_T}
P(τ|θ) = p(s_1) ∏_{t=1}^T p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)
∇log P(τ|θ) = ?
log P(τ|θ) = log p(s_1) + Σ_{t=1}^T [log p(a_t|s_t, θ) + log p(r_t, s_{t+1}|s_t, a_t)]
∇log P(τ|θ) = Σ_{t=1}^T ∇log p(a_t|s_t, θ)   (ignore the terms not related to θ)
Policy Gradient
∇R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ)
Since ∇log P(τ|θ) = Σ_{t=1}^T ∇log p(a_t|s_t, θ), this becomes
∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)
and the update is θ^new ← θ^old + η ∇R̄_{θ^old}.
• If in τ^n the machine takes a_t^n when seeing s_t^n and R(τ^n) is positive, tune θ to increase p(a_t^n|s_t^n); if R(τ^n) is negative, tune θ to decrease p(a_t^n|s_t^n).
• It is very important to weight by the cumulative reward R(τ^n) of the whole trajectory τ^n instead of the immediate reward r_t^n. What if we replaced R(τ^n) with r_t^n? Then only actions with immediate reward (e.g. "fire" in Space Invaders) would ever be reinforced, and the moves leading up to them would not. A sketch of this update is given below.
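Below is a sketch of one such update, under the assumption that trajectories have been collected as `(states, actions, R)` tuples; `policy_gradient_step` and its arguments are illustrative names. Note the minus sign: PyTorch optimizers minimize, while we want to ascend R̄_θ.

```python
import torch

def policy_gradient_step(actor, optimizer, trajectories):
    """trajectories: list of (states, actions, R) with R = R(tau^n),
    states shaped (T, obs_dim) and actions shaped (T,)."""
    loss = 0.0
    N = len(trajectories)
    for states, actions, R in trajectories:
        probs = actor(states)                                  # (T, n_actions)
        log_p = torch.distributions.Categorical(probs).log_prob(actions)
        # ascend (1/N) sum_n sum_t R(tau^n) * log p(a_t^n | s_t^n, theta)
        loss = loss - R * log_p.sum() / N
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```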
Policy Gradient
The algorithm iterates between two phases, given actor parameter θ:
• Data collection: use π_θ to play the game, recording
  • τ^1: (s_1^1, a_1^1) with R(τ^1), (s_2^1, a_2^1) with R(τ^1), ……
  • τ^2: (s_1^2, a_1^2) with R(τ^2), (s_2^2, a_2^2) with R(τ^2), ……
• Model update: θ ← θ + η ∇R̄_θ, where
  ∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)
Policy Gradient — Considered as a Classification Problem
[Figure: a network takes s as input and outputs probabilities for left/right/fire; the target is one-hot, e.g. left = 1, right = 0, fire = 0.]
Minimize the cross-entropy −Σ_{i=1}^3 ŷ_i log y_i, i.e. maximize log y_i = log P("left"|s):
θ ← θ + η ∇log P("left"|s)
Policy Gradient
θ ← θ + η ∇R̄_θ,  ∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)
Given actor parameter θ, the collected data
  τ^1: (s_1^1, a_1^1) with R(τ^1), (s_2^1, a_2^1) with R(τ^1), ……
  τ^2: (s_1^2, a_1^2) with R(τ^2), (s_2^2, a_2^2) with R(τ^2), ……
is treated as classification examples, e.g.:
  s_1^1 → NN → target a_1^1 = left, one-hot (1, 0, 0) over (left, right, fire)
  s_1^2 → NN → target a_1^2 = fire, one-hot (0, 0, 1)
Policy Gradient
θ ← θ + η ∇R̄_θ,  ∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)
Each training example is weighted by R(τ^n): e.g. if R(τ^1) = 2 and R(τ^2) = 1, the example (s_1^1, a_1^1 = left) is effectively counted twice, while (s_1^2, a_1^2 = fire) is counted once. A sketch of this weighted classification loss is given below.
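A sketch of this weighted-classification view (the function name and the use of `nll_loss` are my assumptions): the per-example cross-entropy is simply multiplied by the return of the trajectory the example came from.

```python
import torch
import torch.nn.functional as F

def weighted_classification_loss(actor, states, actions, weights):
    """weights[i] = R(tau^n) of the trajectory that example i came from."""
    log_probs = torch.log(actor(states))                     # (B, n_actions)
    nll = F.nll_loss(log_probs, actions, reduction="none")   # per-example CE
    return (weights * nll).mean()   # minimizing this ascends R_bar_theta
```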
Add a Baseline
θ^new ← θ^old + η ∇R̄_{θ^old}
∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} (R(τ^n) − b) ∇log p(a_t^n|s_t^n, θ)
• It is possible that R(τ^n) is always positive (in many games rewards are never negative).
• In the ideal case this is fine: actions a, b, c all have their probabilities pushed up, just by different amounts, and normalization sorts out the ranking.
• But this is sampling: an action that happens not to be sampled receives no positive update, so its probability will decrease even if it is a good action.
• Subtracting a baseline b makes the weight negative for below-baseline trajectories, which avoids this problem. A minimal sketch is given below.
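A minimal sketch of a baseline; using the batch-mean return as b is a common simple choice (an assumption of mine, the lecture does not commit to any particular b).

```python
import torch

returns = torch.tensor([3.0, 1.0, 2.0])   # R(tau^1), R(tau^2), R(tau^3)
baseline = returns.mean()                 # b
advantages = returns - baseline           # can now be negative
# use advantages[n] in place of R(tau^n) in the gradient above
```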
Value-based Approach: Learning a Critic
Critic
• A critic does not determine the action.
• Given an actor π, it evaluates how good the actor is.
• An actor can be found from a critic, e.g. by Q-learning.
Image source: http://combiboilersleeds.com/picaso/critics/critics-4.html
Critic
• State value function V^π(s): a scalar output
  • When using actor π, the cumulated reward expected to be obtained after seeing observation (state) s
[Figure: two Space Invaders screens — one where V^π(s) is large, one where V^π(s) is smaller.]
• The value depends on the actor π, not just the state. (In Hikaru no Go: V^π(the large knight's move played by the earlier, weaker Hikaru) = bad, but V^π(the same move played by Hikaru after he becomes stronger) = good.)
How to estimate V^π(s)
• Monte-Carlo based approach
  • The critic watches π playing the game.
  • After seeing s_a, the cumulated reward until the end of the episode is G_a → train so that V^π(s_a) ↔ G_a
  • After seeing s_b, the cumulated reward until the end of the episode is G_b → train so that V^π(s_b) ↔ G_b
How to estimate V^π(s)
• Temporal-difference approach
  • From a single transition ⋯ s_t, a_t, r_t, s_{t+1} ⋯ we know V^π(s_t) = r_t + V^π(s_{t+1})
  • So train the two estimates so that V^π(s_t) − V^π(s_{t+1}) ↔ r_t
• Some applications have very long episodes, so delaying all learning until an episode's end is too slow; TD learns from every transition, as sketched below.
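A tabular TD(0) sketch (the learning rate α and the dict-based table are my assumptions; the lecture's network critic would instead regress on the same target):

```python
from collections import defaultdict

V = defaultdict(float)   # V^pi(s), initialized to 0
alpha = 0.1              # learning rate (assumed)

def td0_update(s_t, r_t, s_next, terminal):
    """Move V(s_t) toward r_t + V(s_{t+1}); no discount, as in the slides."""
    target = r_t + (0.0 if terminal else V[s_next])
    V[s_t] += alpha * (target - V[s_t])
```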
MC v.s. TD
• MC: V^π(s_a) ↔ G_a. Since G_a sums many random rewards, it has larger variance, but it is unbiased.
• TD: V^π(s_t) ↔ r_t + V^π(s_{t+1}). The single reward r_t has smaller variance, but the estimate V^π(s_{t+1}) may be inaccurate, so the target may be biased.
MC v.s. TD
• The critic has observed the following 8 episodes (the actions are ignored here) [Sutton, v2, Example 6.4]:
  • s_a, r = 0, s_b, r = 0, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 0, END
• V^π(s_b) = 3/4 (six of the eight visits to s_b are followed by reward 1)
• V^π(s_a) = ? 0? 3/4?
  • Monte-Carlo: V^π(s_a) = 0 (the only episode containing s_a has total reward 0)
  • Temporal-difference: V^π(s_a) = r + V^π(s_b) = 0 + 3/4 = 3/4
A quick numeric check follows this list.
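The numbers can be checked directly; the snippet below recomputes the Monte-Carlo estimates for this toy data (the list encoding of the episodes is mine):

```python
# Each episode is a list of (state, reward-after-leaving-state) pairs.
episodes = [[("sa", 0), ("sb", 0)]] + [[("sb", 1)]] * 6 + [[("sb", 0)]]

def mc_value(state):
    """Average the total remaining reward over every visit to `state`."""
    G = [sum(r for _, r in ep[i:]) for ep in episodes
         for i, (s, _) in enumerate(ep) if s == state]
    return sum(G) / len(G)

print(mc_value("sb"))   # 0.75
print(mc_value("sa"))   # 0.0  <- the MC answer
# TD instead bootstraps: V(sa) = r + V(sb) = 0 + 3/4 = 3/4
```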
Another Critic
• State-action value function Q^π(s, a): a scalar output
  • When using actor π, the cumulated reward expected to be obtained after seeing observation s and taking action a
• Two network designs: take both s and a as input and output a scalar; or take only s as input and output Q^π(s, a) for every action, e.g. Q^π(s, a = left), Q^π(s, a = right), Q^π(s, a = fire) (for discrete actions only)
Q-Learning
1. π interacts with the environment
2. Learn Q^π(s, a) (by TD or MC)
3. Find a new actor π′ "better" than π
4. π = π′, and repeat
Q-Learning
• Given Q^π(s, a), find a new actor π′ "better" than π
• "Better": V^{π′}(s) ≥ V^π(s), for all states s
  π′(s) = arg max_a Q^π(s, a)
• π′ does not have extra parameters; it depends only on Q (see the sketch below).
• Not suitable for a continuous action a, since the arg max cannot be enumerated.
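A sketch of extracting the new actor from Q for discrete actions; `q_network` and the action list are illustrative assumptions:

```python
import torch

def actor_from_q(q_network, state, actions=("left", "right", "fire")):
    """pi'(s) = argmax_a Q^pi(s, a); no parameters of its own."""
    q_values = q_network(state)     # Q^pi(s, a) for every discrete action
    return actions[int(torch.argmax(q_values))]
```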
Deep Reinforcement Learning: Actor-Critic
Actor-Critic
1. π interacts with the environment
2. Learn Q^π(s, a) and V^π(s) (by TD or MC)
3. Update the actor from π to π′ based on Q^π(s, a) and V^π(s)
4. π = π′, and repeat
Actor-Critic
• Tips
  • The parameters of the actor π(s) and the critic V^π(s) can be shared: a common network reads the observation, then one branch outputs the action probabilities (left/right/fire) and another outputs V^π(s). A sketch follows.
Source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9
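A PyTorch sketch of such a shared network (class name and layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk, then a policy head and a value head."""
    def __init__(self, obs_dim: int, n_actions: int = 3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy_head = nn.Sequential(nn.Linear(128, n_actions),
                                         nn.Softmax(dim=-1))
        self.value_head = nn.Linear(128, 1)   # scalar V^pi(s)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h)
```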
Asynchronous (A3C)
Each worker (with its own copy θ^1, θ^2, … of the parameters) repeatedly:
1. Copies the global parameters
2. Samples some data
3. Computes gradients Δθ
4. Updates the global model: θ ← θ + η Δθ
(Other workers also update the global model at the same time.)
Demo of A3C
• Racing Car (DeepMind)
  • https://www.youtube.com/watch?v=0xo1Ldx3L5Q
Demo of A3C
• Visual Doom AI Competition @ CIG 2016
  • https://www.youtube.com/watch?v=94EPSjQH38Y
Concluding Remarks
• Policy-based approach: Learning an Actor
• Value-based approach: Learning a Critic
• Actor + Critic
• Model-based vs. model-free approaches