Introduction to Reinforcement Learning
Prof. Junni Zou
Institute of Media, Information and Network
Dept. of Computer Science and Engineering
Shanghai Jiao Tong University
http://min.sjtu.edu.cn
Fall, 2019
Prof. Zou (MIN, SJTU) Reinforcement Learning Spring, 2019 1 / 37
Outline
1 Course Information
2 About Reinforcement Learning
3 The Reinforcement Learning Problem
4 Elements of RL
Course Information
About Me
About Class
Instructor: Junni Zou ([email protected]), SEIEE Building 3-437
Course website: http://www.cs.sjtu.edu.cn/~zou-jn/teaching.html
Reference book: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition
TA: Yuankun Jiang ([email protected])
Grading Policy: project 50%, technical report 50%
We use the lecture slides of Prof. David Silver as a reference.
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
Online Courses of Reinforcement Learning
Reinforcement Learning, David Silver, UCL
Reinforcement Learning, Emma Brunskill, Stanford
Deep Reinforcement Learning, Sergey Levine, UC Berkeley
Deep Reinforcement Learning and Control, Katerina Fragkiadaki, CMU
About Reinforcement Learning
Development of Reinforcement Learning
Branches of Machine Learning
[Figure: Machine Learning branches into Supervised Learning, Unsupervised Learning and Reinforcement Learning]
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
Agent’s actions affect the subsequent data it receives
Many Faces of Reinforcement Learning
[Figure: Reinforcement Learning sits at the intersection of several fields, each with its own version of the problem: Computer Science (Machine Learning), Engineering (Optimal Control), Mathematics (Operations Research), Economics (Bounded Rationality), Psychology (Classical/Operant Conditioning) and Neuroscience (Reward System)]
Examples of RL
Defeat the world champion at Backgammon
Manage an investment portfolio
Control a power station
Make a humanoid robot walk
Play many different Atari games better than humans
The Reinforcement Learning Problem Reward
Rewards
A reward signal defines the goal in a reinforcement learning problem
A reward $R_t$ is a scalar feedback signal
It indicates how well the agent is doing at step t
The agent’s sole objective is to maximize the total reward it receives over the long run (cumulative reward)
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximization of expected cumulative reward
Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ value for following desired trajectory
− value for crashing
Defeat the world champion at Backgammon
+/− value for winning/losing a game
Manage an investment portfolio
+ value for each $ in bank
Control a power station
+ value for producing power
− value for exceeding safety thresholds
Make a humanoid robot walk
+ value for forward motion
− value for falling over
Play many different Atari games better than humans
+/− value for increasing/decreasing score
Sequential Decision Making
Goal: select actions to maximize total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
Examples:
A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many moves from now)
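To make the trade-off between immediate and long-term reward concrete, the sketch below compares two hypothetical plans over four time steps. The reward numbers are invented for illustration and do not come from the lecture.

```python
# Hypothetical reward sequences (made-up numbers, one entry per time step).
greedy_rewards = [1, 1, 1, 1]       # take a small reward at every step
patient_rewards = [-2, 0, 0, 10]    # pay a cost now, collect a large reward later


def total_reward(rewards):
    """Total (undiscounted) reward collected over the whole sequence."""
    return sum(rewards)


greedy_total = total_reward(greedy_rewards)
patient_total = total_reward(patient_rewards)

# The patient plan looks worse after the first step but is better overall.
print("after step 1:", greedy_rewards[0], "vs", patient_rewards[0])
print("total:", greedy_total, "vs", patient_total)
```

An agent that only compared the first-step rewards would pick the greedy plan, which is why the goal is stated in terms of total future reward.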
The Reinforcement Learning Problem Environments
Agent and Environment
[Figure: the agent sends action $A_t$ to the environment and receives observation $O_t$ and reward $R_t$ from it]
At each step t the agent:
Executes action $A_t$
Receives observation $O_t$
Receives scalar reward $R_t$
The environment:
Receives action $A_t$
Emits observation $O_{t+1}$
Emits scalar reward $R_{t+1}$
t increments at every environment step
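The interaction described above can be sketched as a simple loop. The environment and agent classes here are hypothetical toys written for this illustration (a coin-guessing task and a random agent); only the order of the exchange follows the slide.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: reward 1 if the action matches a coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # Receives action A_t, emits observation O_{t+1} and scalar reward R_{t+1}.
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else 0.0
        return coin, reward


class RandomAgent:
    """Hypothetical agent that ignores its inputs and acts uniformly at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation, reward):
        # Receives O_t and R_t, executes action A_t.
        return self.rng.choice([0, 1])


def run(num_steps=5):
    env, agent = CoinFlipEnvironment(), RandomAgent()
    observation, reward = 0, 0.0  # placeholder first observation and reward
    trajectory = []
    for t in range(1, num_steps + 1):
        action = agent.act(observation, reward)   # agent: O_t, R_t -> A_t
        trajectory.append((t, observation, reward, action))
        observation, reward = env.step(action)    # env: A_t -> O_{t+1}, R_{t+1}
    return trajectory


trajectory = run()
for row in trajectory:
    print(row)
```

Each printed tuple is (t, O_t, R_t, A_t); the time index advances once per call to the environment's `step`.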
The Reinforcement Learning Problem State
History and State
The history is the sequence of observations, actions, rewards
$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
i.e. all observable variables up to time t
i.e. the sensorimotor stream of a robot or embodied agent
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:
$S_t = f(H_t)$
Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state $S_t$ is Markov if and only if
$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$
The future is independent of the past given the present
$H_{1:t} \to S_t \to H_{t+1:\infty}$
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
The environment state $S^e_t$ is Markov
The history $H_t$ is Markov
Rat Example
What if agent state = last 3 items in sequence?
What if agent state = counts for lights, bells and levers?
What if agent state = complete sequence?
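The three questions above can be explored with a small sketch. The event sequences and outcomes below are assumptions made for illustration, since the slide's figure is not reproduced in the text; the point is only that different choices of state function lead to different predictions for the same history.

```python
from collections import Counter

# Illustrative sequences (assumed, not taken from the text of the slides).
# Each past sequence is paired with the outcome that followed it.
past = [
    (["light", "light", "lever", "bell"], "shock"),
    (["bell", "light", "lever", "lever"], "cheese"),
]
new_sequence = ["lever", "light", "lever", "bell"]


def last_three(history):
    """Agent state = last 3 items in the sequence."""
    return tuple(history[-3:])


def counts(history):
    """Agent state = counts for lights, bells and levers."""
    return frozenset(Counter(history).items())


def complete(history):
    """Agent state = complete sequence."""
    return tuple(history)


def predict(state_fn):
    """Return the outcome of a past sequence whose state matches, or None."""
    target = state_fn(new_sequence)
    for history, outcome in past:
        if state_fn(history) == target:
            return outcome
    return None


predictions = {fn.__name__: predict(fn) for fn in (last_three, counts, complete)}
print(predictions)
```

With these assumed sequences, the last-three state matches the first history, the count state matches the second, and the complete-sequence state matches neither, so what the agent expects depends entirely on its state representation.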
Elements of RL
Major Components of an RL Agent
An RL agent may include one or more of these components:
Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment
Elements of RL Policy
Policy
A policy is the agent’s behaviour
It is a map from state to action, e.g.
Deterministic policy: $a = \pi(s)$
Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
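As a minimal sketch of the two kinds of policy, a deterministic policy can be stored as a lookup table and a stochastic policy as a table of action probabilities that is sampled from. The state names and probabilities below are hypothetical.

```python
import random

ACTIONS = ["N", "E", "S", "W"]

# Deterministic policy a = pi(s): a plain lookup table (hypothetical states).
deterministic_policy = {"s0": "E", "s1": "S"}

# Stochastic policy pi(a|s): for each state, a probability for every action.
stochastic_policy = {
    "s0": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "s1": {"N": 0.0, "E": 0.25, "S": 0.75, "W": 0.0},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Samples an action according to pi(.|state)."""
    probs = stochastic_policy[state]
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=1)[0]


rng = random.Random(0)
samples = [act_stochastic("s0", rng) for _ in range(1000)]
print(act_deterministic("s0"), samples.count("E") / len(samples))
```

The empirical frequency of "E" in the samples should be close to the 0.7 assigned by the table, while the deterministic policy returns "E" every time.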
Elements of RL Value Function
Value Function
A value function is a prediction of future reward
Used to evaluate the goodness/badness of states
And therefore to select between actions, e.g.
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$
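The quantity inside the expectation is a discounted sum of future rewards. The sketch below computes that sum for a finite reward list and, as one possible way to approximate the expectation, averages it over a few sampled reward sequences. The averaging step and the sample data are assumptions for illustration, not a method given on the slide.

```python
def discounted_return(rewards, gamma):
    """R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite list."""
    g = 0.0
    # Work backwards so each step applies one extra factor of gamma.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


def estimate_value(sampled_reward_sequences, gamma):
    """Average return over sampled sequences that all start from the same state."""
    returns = [discounted_return(r, gamma) for r in sampled_reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed after visiting the same state s.
episodes = [[0, 0, 1], [0, 1], [1]]
value_estimate = estimate_value(episodes, gamma=0.5)
print(value_estimate)
```

With gamma = 0.5 the three returns are 0.25, 0.5 and 1.0, so a reward that arrives later contributes less to the value of the state.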
Elements of RL Model
Model
A model predicts what the environment will do next
Transitions: $\mathcal{P}$ predicts the next state
Rewards: $\mathcal{R}$ predicts the next (immediate) reward, e.g.
$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
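One simple way to obtain such a model, sketched here under the assumption of a small discrete state and action space, is to estimate both quantities from counts over observed transitions. The experience tuples are hypothetical, and count-based estimation is only one of many options.

```python
from collections import defaultdict

# Hypothetical experience: (state, action, reward, next_state) tuples.
experience = [
    ("s0", "E", -1.0, "s1"),
    ("s0", "E", -1.0, "s1"),
    ("s0", "E", -1.0, "s0"),   # the move sometimes fails
    ("s1", "S", -1.0, "goal"),
]

transition_counts = defaultdict(lambda: defaultdict(int))
reward_sums = defaultdict(float)
visit_counts = defaultdict(int)

for s, a, r, s_next in experience:
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1


def transition_prob(s, a, s_next):
    """Estimate of P[S_{t+1} = s' | S_t = s, A_t = a]; (s, a) must have been visited."""
    return transition_counts[(s, a)].get(s_next, 0) / visit_counts[(s, a)]


def expected_reward(s, a):
    """Estimate of E[R_{t+1} | S_t = s, A_t = a] as a sample mean."""
    return reward_sums[(s, a)] / visit_counts[(s, a)]


print(transition_prob("s0", "E", "s1"), expected_reward("s0", "E"))
```

Because the estimates come from a finite sample, a model built this way can be wrong about transitions it has rarely or never seen.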
Elements of RL Maze Example
Maze Example
[Figure: a grid maze with marked Start and Goal cells]
Rewards: −1 per time-step
Actions: N, E, S, W
States: Agent’s location
Maze Example: Policy
[Figure: the maze grid, with an arrow drawn in each cell]
Arrows represent policy $\pi(s)$ for each state s
Maze Example: Value Function
[Figure: the maze grid with a value in each open cell. Values by grid row, wall cells omitted:]
-14 -13 -12 -11 -10 -9
-16 -15 -12 -8
-16 -17 -6 -7
-18 -19 -5
-24 -20 -4 -3
-23 -22 -21 -22 -2 -1
Numbers represent value $v_\pi(s)$ of each state s
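With a reward of −1 per time-step, the value of a cell under a shortest-path policy is minus the number of steps needed to reach the goal. The sketch below computes these values by breadth-first search on a small hypothetical maze (not the one in the lecture), assuming deterministic moves and no discounting.

```python
from collections import deque

# Hypothetical 3x4 maze: '#' is a wall, 'G' the goal, '.' an open cell.
MAZE = [
    "...G",
    ".#..",
    "....",
]
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # N, E, S, W


def shortest_path_values(maze):
    """v(s) = -(number of steps to the goal) under a shortest-path policy."""
    rows, cols = len(maze), len(maze[0])
    goal = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "G")
    values = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and maze[nr][nc] != "#" and (nr, nc) not in values:
                values[(nr, nc)] = values[(r, c)] - 1  # one more step, one more -1
                queue.append((nr, nc))
    return values


values = shortest_path_values(MAZE)
for r in range(len(MAZE)):
    print(" ".join(f"{values.get((r, c), '#'):>3}" for c in range(len(MAZE[0]))))
```

As in the slide's figure, cells next to the goal get −1 and the values decrease by one with each additional step away from it.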
Maze Example: Model
[Figure: the agent's model of the maze, with −1 marked in each modelled cell]
Agent may have an internal model of the environment
Dynamics: how actions change the state
Rewards: how much reward from each state
The model may be imperfect
Grid layout represents transition model $\mathcal{P}^a_{ss'}$
Numbers represent immediate reward $\mathcal{R}^a_s$ from each state s (same for all a)
Elements of RL Categorization
Categorizing RL Approaches (1)
How to choose actions?
Value Based: No Policy (Implicit), Value Function
Policy Based: Policy, No Value Function
Actor-Critic: Policy, Value Function
Can we learn a model?
Model Free: Policy and/or Value Function, No Model
Model Based: Policy and/or Value Function, Model
Categorizing RL Approaches (2)
[Figure: RL agent taxonomy. Three overlapping sets (Value Function, Policy, Model) with regions labelled Value-Based, Policy-Based, Actor-Critic, Model-Free and Model-Based]