introduction to reinforcement learning - sjtuzou-jn/reinforcement_learning_master/l1.pdf ·...

��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)

,&�$)./01 2�� 33 % �45.6�$7�.+�+� .8��.�!!�+)�9.&:.;<<<=>?@[email protected]@.FGHI>@ICDBA5J�)��).1�8� #�*)�!)KLMNONPNQ.RS.TQUOVWKLSRXYVNORLW.VLU.ZQN[RX\ ]� ̂

_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Introduction to Reinforcement Learning

Prof. Junni Zou

Institute of Media, Information and NetworkDept. of Computer Science and Engineering

Shanghai Jiao Tong Universityhttp://min.sjtu.edu.cn

Fall, 2019

Prof. Zou (MIN, SJTU) Reinforcement Learning Spring, 2019 1 / 37

http://min.sjtu.edu.cn

��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Outline

1 Course Information

2 About Reinforcement Learning

3 The Reinforcement Learning Problem

4 Elements of RL


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Course Information

Table of Contents




4 Elements of RL


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Course Information

About Me


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Course Information

About Class

Instructor: Junni Zou ([email protected]), SEIEE Building 3-437

Course website:http://www.cs.sjtu.edu.cn/~zou-jn/teaching.html

Reference book: Richard S. Sutton and Andrew G. Barto,Reinforcement Learning: An Introduction, Second Edition

TA: Yuankun Jiang ([email protected])

Grading Policy:project: 50%, technical report: 50%

We use the lecture slides of Prof. David Silver as a reference.http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html


http://www.cs.sjtu.edu.cn/~zou-jn/teaching.html

http://incompleteideas.net/book/RLbook2018.pdf

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Course Information

Online Courses of Reinforcement Learning

Reinforcement Learning, David Silver, UCL

Reinforcement Learning, Emma Brunskill, Stanford

Deep Reinforcement Learning, Sergey Levine, UC Berkeley

Deep Reinforcement Learning and Control, Katerina Fragkiadaki,CMU


http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

http://web.stanford.edu/class/cs234/index.html

http://rail.eecs.berkeley.edu/deeprlcourse/

https://katefvision.github.io/

��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

About Reinforcement Learning

Table of Contents




4 Elements of RL


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Development of Reinforcement Learning


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Branches of Machine Learning

Lecture 1: Introduction to Reinforcement Learning

About RL

Branches of Machine Learning

Reinforcement Learning

Supervised Learning

Unsupervised Learning

MachineLearning


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Characteristics of Reinfrocement Learning

What makes reinforcement learning different from other machinelearning paradigms?

There is no supervisor, only a reward signal

Feedback is delayed, not instantaneous

Time really matters (sequential, non i.i.d data)

Agent’s actions affect the subsequent data it receives


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Many Faces of Reinfrocement Learning


About RL

Many Faces of Reinforcement Learning

Computer Science

Economics

Mathematics

Engineering Neuroscience

Psychology

Machine Learning

Classical/OperantConditioning

Optimal Control

RewardSystem

Operations Research

BoundedRationality

Reinforcement Learning


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Examples of RL

Defeat the world champion at Backgammon

Manage an investment portfolio

Control a power station

Make a humanoid robot walk

Play many different Atari games better than humans


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

The Reinforcement Learning Problem

Table of Contents




4 Elements of RL


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

The Reinforcement Learning Problem Reward

Rewards

A reward signal defines the goal in a reinforcement learning problem

A reward Rt is a scalar feedback signal

The agent’s sole objective is to maximize the total reward it receivesover the long run (cumulative reward)

Indicates how well the agent is doing at step t

Reinforcement learning is based on the reward hypothesis

Definition (Reward Hypothesis)

All goals can be described by the maximization of expected cumulativereward


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Examples of Rewards

Fly stunt manoeuvres in a helicopter

+ value for following desired trajectory− value for crashing

Defeat the world champion at Backgammon

+/− value for winning/losing a game

Manage an investment portfolio

+ value for each $ in bank

Control a power station

+ value for producing power− value for exceeding safety thresholds

Make a humanoid robot walk

+ value for forward motion− value for falling over

Play many different Atari games better than humans

+/− value for increasing/decreasing score


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Sequential Decision Making

Goal: select actions to maximize total future reward

Actions may have long term consequences

Reward may be delayed

It may be better to sacrifice immediate reward to gain morelong-term rewardExamples

A financial investment (may take months to mature)Refuelling a helicopter (might prevent a crash in several hours)Blocking opponent moves (might help winning chances many movesfrom now)


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

The Reinforcement Learning Problem Environments

Agent and Environment


The RL Problem

Environments


observation

reward

action

At

Rt

Ot


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

The Reinforcement Learning Problem Environments



The RL Problem

Environments


observation

reward

action

At

Rt

Ot At each step t the agent:

Executes action At

Receives observation Ot

Receives scalar reward Rt

The environment:

Receives action At

Emits observation Ot+1

Emits scalar reward Rt+1

t increments at env. step

At each step t the agent:Executes action At

Receives observation Ot

Receives scalar reward Rt

The environment:Receives action At

Emits observation Ot+1

Emits scalar reward Rt+1

t increments at every step


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

The Reinforcement Learning Problem State

History and State

The history is the sequence of observations, actions, rewards

Ht = O1,R1,A1, · · · ,At−1,Ot ,Rt

i.e. all observable variables up to time t

i.e. the sensorimotor stream of a robot or embodied agentWhat happens next depends on the history:

The agent selects actionsThe environment selects observations/rewards

State is the information used to determine what happens next

Formally, state is a function of the history:

St = f (Ht)


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Information State

An information state (a.k.a. Markov) contains all useful informationfrom the history.

Definition

A state St is Markov if and only if

P[St+1|St ] = P[St+1|S1, · · · , St ]

The future is independent of the past given the present

H1:t → St → Ht+1:∞

Once the state is known, the history may be thrown away

i.e. The state is a sufficient statistic of the future

The environment state Set is Markov

The history Ht is Markov


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Rat ExampleLecture 1: Introduction to Reinforcement Learning

The RL Problem

State

Rat Example

What if agent state = last 3 items in sequence?

What if agent state = counts for lights, bells and levers?

What if agent state = complete sequence?


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Rat ExampleLecture 1: Introduction to Reinforcement Learning

The RL Problem

State

Rat Example



What if agent state = complete sequence?What if agent state = last 3 items in sequence?


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Rat Example


The RL Problem

State

Rat Example




What if agent state = counts for lights, bell and levers?


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Rat Example


The RL Problem

State

Rat Example




What if agent state = counts for lights, bell and levers?

What if agent state = complete sequence?


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL

Table of Contents




4 Elements of RL


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL

Major Components of an RL Agent

An RL agent may include one or more of these components:

Policy: agent’s behaviour function

Value function: how good is each state and/or action

Model: agent’s representation of the environment


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL Policy

Policy

A policy is the agent’s behaviour

It is a map from state to action, e.g.

Deterministic policy: a = π(s)

Stochastic policy: π(a|s) = P[At = a|St = s]


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL Value Function

Value Function

Value function is a prediction of future reward

Used to evaluate the goodness/ badness of states

And therefore to select between actions, e.g.

vπ(s) = Eπ

[Rt+1 + γRt+2 + γ2Rt+3 + · · · |St = s

]


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL Model

Model

A model predicts what the environment will do next

Transitions: P predicts the next state

Rewards: R predicts the next (immediate) reward, e.g.

Pass′ = P[St+1 = s ′|St = s,At = a]

Ras = E[Rt+1|St = s,At = a]


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL Maze Example

Maze Example


Inside An RL Agent

Maze Example

Start

Goal

Rewards: -1 per time-step

Actions: N, E, S, W

States: Agent’s location

Rewards: −1 per time-step

Actions: N, E, S, W

States: Agent’s location


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Maze Example: PolicyLecture 1: Introduction to Reinforcement Learning

Inside An RL Agent

Maze Example: Policy

Start

Goal

Arrows represent policy π(s) for each state sArrows represent policy π(s) for each state s


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Maze Example: Value FunctionLecture 1: Introduction to Reinforcement Learning

Inside An RL Agent

Maze Example: Value Function

-14 -13 -12 -11 -10 -9

-16 -15 -12 -8

-16 -17 -6 -7

-18 -19 -5

-24 -20 -4 -3

-23 -22 -21 -22 -2 -1

Start

Goal

Numbers represent value vπ(s) of each state sNumbers represent value function vπ(s) for each state s


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+


Maze Example: ModelLecture 1: Introduction to Reinforcement Learning

Inside An RL Agent

Maze Example: Model

-1 -1 -1 -1 -1 -1

-1 -1 -1 -1

-1 -1 -1

-1

-1 -1

-1 -1

Start

Goal

Agent may have an internalmodel of the environment

Dynamics: how actionschange the state

Rewards: how much rewardfrom each state

The model may be imperfect

Grid layout represents transition model Pass′

Numbers represent immediate reward Ras from each state s

(same for all a)

Agent may have an internalmodel of the environment

Dynamics: how actions changethe state

Rewards: how much rewardfrom each state

The model may be imperfect

Grid layout represents transitionmodel Pa

ss′

Numbers represent immediatereward Ra∫ from each state s(same for all a)


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL Categorization

Categorizing RL Approaches (1)

How to choose actions?Value Based

No Policy (Implicit)Value Function

Policy BasedPolicyNo Value Function

Actor-CriticPolicyValue Function

Whether can we learn amodel?

Model freePolicy and/or Value FunctionNo Model

Model BasedPolicy and/or Value FunctionModel


��

��

�� !" #�$ �� %$&'(!�)(�*� %��+'� ,!)(-()(�� ,&�$)


_�'!��.)�.0*�)()$)�.�4./�9(�̀.0*4� ��)(�*̀.�*9.1�)8� abc(d*�'.% �!��(*d.e �$+1�)8� a.#��$*(!�)(�*.e �$+#��+$)� .f(�(�*.e �$+

Elements of RL Categorization

Categorizing RL Approaches (2)


Inside An RL Agent

RL Agent Taxonomy

Model

Value Function PolicyActorCritic

Value-Based Policy-Based

Model-Free

Model-Based


introduction to reinforcement learning - sjtuzou-jn/reinforcement_learning_master/l1.pdf ·...

Documents