REINFORCEMENT LEARNING
❐ The Problem
❐ Reinforcement Learning Methods
 Monte-Carlo learning
 Temporal-difference learning
 TD(λ)
❐ Function approximation
David Silver
1. THE PROBLEM
❐ What is reinforcement learning?
Learn from Experience To ...
❐ Fly stunt manoeuvres in a helicopter
❐ Defeat the world champion at Backgammon/Checkers/Go
❐ Manage an investment portfolio
❐ Control a power station
❐ Walk a robot through an obstacle course
The Agent-Environment Interface
Agent and environment interact at discrete time steps: t = 0, 1, 2, …
 Agent observes state at step t: st ∈ S
 produces action at step t: at ∈ A(st)
 gets resulting reward: rt+1 ∈ ℝ
 and resulting next state: st+1

. . . st, at → rt+1, st+1, at+1 → rt+2, st+2, at+2 → rt+3, st+3, at+3 . . .
Textbook
❐ Sutton & Barto, “Reinforcement Learning: An Introduction”
❐ http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
Slides adapted from: http://webdocs.cs.ualberta.ca/~sutton/RLslides.tgz
The Agent Learns a Policy

Policy at step t, πt: a mapping from states to action probabilities
πt(s, a) = probability that at = a when st = s
❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can over the long run.
Returns

The return Rt is the total reward from time t onwards.

Episodic tasks: interaction breaks naturally into terminating episodes, e.g., games, trips through a maze.

Rt = rt+1 + rt+2 + ... + rT

Continuing tasks: interaction does not have natural episodes. Use the discounted return:

Rt = rt+1 + γrt+2 + γ²rt+3 + ...

where γ, 0 ≤ γ ≤ 1, is the discount rate:

shortsighted 0 ← γ → 1 farsighted
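The two definitions of return can be sketched in a few lines of Python (a minimal illustration; the helper names are ours, not from the slides):

```python
def episodic_return(rewards):
    """Undiscounted return for an episodic task: Rt = rt+1 + rt+2 + ... + rT."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return for a continuing task: Rt = sum_k gamma^k * rt+k+1."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

For rewards [1, 1, 1] and γ = 0.9 the discounted return is 1 + 0.9 + 0.81 = 2.71; as γ approaches 1 the agent becomes more farsighted.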
Example 1: Cart Pole
Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
As an episodic task where the episode ends upon failure:
reward = +1 for each step before failure
⇒ return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
⇒ return = −γ^k, for k steps before failure
In either case, return is maximized by avoiding failure for as long as possible.
Example 2: Computer Go
Play Go against a human opponent.

The computer player is the agent.
The human opponent is the environment.

As an episodic task where each game is an episode:
reward = +1 if the computer player wins, 0 if the human wins, 0 for every non-terminal move
⇒ return = win/lose game

Return is maximized by beating the opponent.
The Markov Property
❐ The state is the information available to the agent at step t
❐ The state can be immediate observations, or can be built up over time from sequences of observations
❐ Ideally, a state should summarize the information contained in all past observations
❐ This is known as the Markov Property:

Pr(st+1 = s′, rt+1 = r | st, at, rt, st−1, at−1, ..., r1, s0, a0)
  = Pr(st+1 = s′, rt+1 = r | st, at)
Markov Decision Processes
❐ If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
❐ If state and action sets are finite, it is a finite MDP. ❐ To define a finite MDP, you need to give:
 state and action sets
 one-step “dynamics” defined by transition probabilities:

Pss′^a = Pr{ st+1 = s′ | st = s, at = a }  for all s, s′ ∈ S, a ∈ A(s)

 and expected rewards:

Rss′^a = E{ rt+1 | st = s, at = a, st+1 = s′ }  for all s, s′ ∈ S, a ∈ A(s)
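A finite MDP is fully specified by these tables. As an illustration (a hypothetical two-state example, not from the slides), the dynamics can be stored as plain dictionaries:

```python
# Hypothetical two-state MDP. P[s][a] maps successor states to probabilities;
# R[s][a][s2] is the expected one-step reward for that transition.
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}

P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}},
}
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}},
}

# Sanity check: transition probabilities out of every (s, a) pair sum to one.
ok = all(abs(sum(P[s][a].values()) - 1.0) < 1e-12
         for s in states for a in actions[s])
```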
Value Functions
❐ The value of a state is the expected return starting from that state; it depends on the agent’s policy.

State-value function for policy π:

Vπ(s) = Eπ{ Rt | st = s } = Eπ{ Σk=0..∞ γ^k rt+k+1 | st = s }

❐ The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

Qπ(s, a) = Eπ{ Rt | st = s, at = a } = Eπ{ Σk=0..∞ γ^k rt+k+1 | st = s, at = a }
Gridworld
❐ Actions: north, south, east, west; deterministic.
❐ If an action would take the agent off the grid: no move, but reward = −1
❐ Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.

State-value function for the equiprobable random policy; γ = 0.9
Optimal Value Functions

❐ There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them by π*.

❐ Optimal policies share the same optimal state-value function:

V*(s) = maxπ Vπ(s)  for all s ∈ S

❐ Optimal policies also share the same optimal action-value function:

Q*(s, a) = maxπ Qπ(s, a)  for all s ∈ S and a ∈ A(s)

This is the expected return for taking action a in state s and thereafter following an optimal policy.
Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy π*.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:
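Given V* and the one-step dynamics, greedy action selection is a single backup per action. A sketch (function and variable names are ours, with transition probabilities and expected rewards held in nested dictionaries):

```python
def greedy_action(s, V, P, R, actions, gamma=0.9):
    """One-step-ahead search: pick the action maximising
    sum over s2 of P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    def backup(a):
        return sum(p * (R[s][a][s2] + gamma * V[s2])
                   for s2, p in P[s][a].items())
    return max(actions[s], key=backup)

# Tiny deterministic example: from "s0", action "go" reaches the valuable state.
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}}}
R = {"s0": {"stay": {"s0": 0.0}, "go": {"s1": 0.0}}}
V = {"s0": 0.0, "s1": 10.0}
actions = {"s0": ["stay", "go"]}
best = greedy_action("s0", V, P, R, actions)
```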
Summary
❐ Many important control problems can be formalised as reinforcement learning problems
❐ The agent’s goal is to maximise its return (long-term reward) given its experience
❐ The value function summarises the long-term consequences of following a policy
❐ There is a unique optimal value function for any optimal policy
❐ The policy evaluation problem is to find Vπ(s)
❐ The control problem is to find V*(s)
2. MONTE-CARLO METHODS
❐ Learning from sample episodes
Short History of Monte-Carlo Methods
❐ Developed in Los Alamos during World War II
 by John von Neumann and Stanislaw Ulam
 central to the success of the Manhattan Project
❐ Solve massive problems with no analytical solution
❐ Using random sampling instead of exhaustive computation
❐ John von Neumann chose Monte-Carlo as the codename
Monte-Carlo Evaluation
❐ Estimates the value of policy π, Vπ(s)
❐ Learns from complete episodes of experience
❐ Model-free: doesn’t need to know the MDP dynamics
❐ Sample-based: samples state transitions at random, rather than exhaustively searching all possible state transitions
Monte-Carlo Evaluation
❐ Monte-Carlo evaluation is very simple:
 Remember all returns observed from each state s
 V(s) is the average return starting from state s and following π
❐ V(s) converges to Vπ(s) if s is visited infinitely often
Incremental Averaging

❐ Don’t have to remember all returns:

V(st) ← V(st) + αt [Rt − V(st)]

❐ If αt = 1 / N(st), then this computes the average return from s
❐ If α is constant, then this computes a recent average return
 useful for tracking non-stationary values
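The incremental average is easy to verify in code (a minimal sketch; names are ours):

```python
def incremental_update(v, target, alpha):
    """V <- V + alpha * (target - V)."""
    return v + alpha * (target - v)

# With alpha_t = 1 / N(s), the update reproduces the exact running average.
observed_returns = [4.0, 8.0, 6.0]
v, n = 0.0, 0
for G in observed_returns:
    n += 1
    v = incremental_update(v, G, 1.0 / n)
# v now equals mean(observed_returns) = 6.0
```

With a constant α instead, older returns are exponentially down-weighted, which is what makes the rule useful for non-stationary values.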
Monte-Carlo in Blackjack
❐ Object: have your card sum be greater than the dealer’s without exceeding 21.
❐ States (200 of them):
 player’s current sum
 dealer’s showing card
 useable ace?
❐ Reward: +1 for winning, 0 for a draw, −1 for losing
❐ Actions: stick (no more cards), twist (receive another card)
❐ Policy: stick if my sum is 20 or 21, else twist
Blackjack value functions
Monte Carlo Estimation of Action Values (Q)
❐ Q(s, a) is the average return starting from state s and action a, following π
❐ Q(s, a) converges to Qπ(s, a) if every state–action pair is visited infinitely many times
❐ e.g., use an ε-greedy policy:
 Select the greedy action with probability 1 − ε
 Select a random action with probability ε
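ε-greedy selection in code (a sketch with our own names):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore at random; otherwise exploit
    the action with the highest current estimate Q[(s, a)]."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {("s", "stick"): 1.0, ("s", "twist"): 2.0}
choice = epsilon_greedy(Q, "s", ["stick", "twist"], 0.0)  # epsilon = 0: purely greedy
```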
Monte-Carlo Control
❐ Evaluation: evaluate the current policy using MC simulation
❐ Improvement: update the policy to act greedily with respect to the action-value function
❐ Converges on the optimal policy if ε is slowly reduced to 0
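The evaluate/improve loop can be sketched in a few lines (our own names; `sample_episode` is a hypothetical helper standing in for whatever environment simulator is available):

```python
import random

def mc_control(sample_episode, actions, n_episodes=2000, epsilon=0.1):
    """Monte-Carlo control sketch: evaluate the current epsilon-greedy policy
    from complete sampled episodes, improving greedily via the Q estimates.
    `sample_episode(policy)` is assumed to return a list of
    (state, action, return-from-that-point) triples."""
    Q, N = {}, {}
    for _ in range(n_episodes):
        def policy(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q.get((s, a), 0.0))
        for s, a, G in sample_episode(policy):
            N[(s, a)] = N.get((s, a), 0) + 1
            # Incremental average of observed returns (alpha = 1 / N).
            Q[(s, a)] = Q.get((s, a), 0.0) + (G - Q.get((s, a), 0.0)) / N[(s, a)]
    return Q
```

On a trivial one-state "bandit" task where one action always pays 1 and the other 0, the learned Q values separate the two actions.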
Blackjack example continued
Summary
❐ Monte-Carlo evaluation is the simplest policy evaluation method
 Model-free
 Sample-based
 Complete episodes (no bootstrapping)
❐ Monte-Carlo control is the simplest control method
 Monte-Carlo evaluation
 ε-greedy policy improvement
❐ Simple and effective solution to many problems
❐ Good convergence properties
3. TEMPORAL-DIFFERENCE LEARNING
❐ Learning a guess from a guess
TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ.

Recall the simple every-visit Monte Carlo method:

V(st) ← V(st) + α [Rt − V(st)]
(target: the actual return after time t)

The simplest TD method, TD(0):

V(st) ← V(st) + α [rt+1 + γV(st+1) − V(st)]
(target: an estimate of the return; fully incremental)
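The TD(0) backup in code (a sketch; names are ours):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """TD(0): move V(s) toward the bootstrapped target rt+1 + gamma * V(st+1).
    When the successor state is terminal, the target is just the reward."""
    target = r if terminal else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = {"a": 0.0, "b": 1.0}
td0_update(V, "a", 0.5, "b", alpha=0.5)  # target = 0.5 + 1.0; V["a"] becomes 0.75
```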
Monte-Carlo Simulation
V(st) ← V(st) + α [Rt − V(st)]
where Rt is the actual return following state st.

[Backup diagram: a complete sample episode from st down to a terminal state]
Temporal-Difference Learning
V(st) ← V(st) + α [rt+1 + γV(st+1) − V(st)]

[Backup diagram: a single sample transition from st, via rt+1, to st+1]
cf. Full-width backups
V(st) ← Eπ{ rt+1 + γV(st+1) }

[Backup diagram: a full-width backup over all actions from st and all successor states st+1, as in dynamic programming]
Advantages of TD Learning
❐ TD methods do not require a model of the environment, only experience
❐ TD, but not MC, methods can be fully incremental
 You can learn before knowing the final outcome
 You can learn without the final outcome, from incomplete sequences
❐ TD reduces variance but increases bias
 MC depends on long sequences of random variables
 TD depends only on one step
❐ Both MC and TD converge on the optimal value
TD and MC on the Random Walk
Learning An Action-Value Function
Estimate Qπ for the current behavior policy π.
After every transition from a nonterminal state st, do:

Q(st, at) ← Q(st, at) + α [rt+1 + γQ(st+1, at+1) − Q(st, at)]

If st+1 is terminal, then Q(st+1, at+1) = 0.
Sarsa: On-Policy TD Control
Policy evaluation: temporal-difference learning
Policy improvement: ε-greedy
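The Sarsa backup in code (a sketch; names are ours). Its defining feature is that the target uses the action at+1 actually selected by the behaviour policy in st+1:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    """On-policy backup: Q(st, at) moves toward rt+1 + gamma * Q(st+1, at+1),
    with Q(st+1, at+1) = 0 when st+1 is terminal."""
    q_next = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])
    return Q

Q = {("s", "a"): 0.0, ("s2", "b"): 2.0}
sarsa_update(Q, "s", "a", 1.0, "s2", "b", alpha=0.5)  # target = 1 + 2 = 3
```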
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
Results of Sarsa on the Windy Gridworld
Q-Learning: Off-Policy TD Control
One-step Q-learning:

Q(st, at) ← Q(st, at) + α [rt+1 + γ maxa Q(st+1, a) − Q(st, at)]
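In code, the only change from Sarsa is the max over next actions (a sketch; names are ours):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0,
                      terminal=False):
    """Off-policy backup: the target bootstraps from the best available
    action in st+1, regardless of what the behaviour policy actually does."""
    q_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])
    return Q

Q = {("s", "a"): 0.0, ("s2", "a"): 1.0, ("s2", "b"): 3.0}
q_learning_update(Q, "s", "a", 0.0, "s2", ["a", "b"], alpha=0.5)  # target = 3
```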
Cliffwalking
ε-greedy, ε = 0.1
Summary
❐ Temporal-difference learning is:
 Model-free
 Sample-based
 Bootstrapping (learns from incomplete episodes)
❐ Can use TD for control:
 On-policy (Sarsa)
 Off-policy (Q-learning)
❐ TD is lower variance but higher bias than MC
❐ Usually more efficient in practice
4. TD(λ)
❐ Between MC and TD
n-step TD Prediction
❐ Idea: look farther into the future when you do the TD backup (1, 2, 3, …, n steps)
n-Step TD Learning on Random Walks
❐ Task: 19-state random walk

❐ The optimal n depends on the task
TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
 weight by λ^(n−1) (time since visitation)
 λ-return:

Rtλ = (1 − λ) Σn=1..∞ λ^(n−1) Rt(n)

❐ Backup using the λ-return:

ΔVt(st) = α [Rtλ − Vt(st)]
λ-Return Weighting Function
Rtλ = (1 − λ) Σn=1..T−t−1 λ^(n−1) Rt(n)  +  λ^(T−t−1) Rt

The first term weights the n-step returns until termination; after termination, the complete return Rt receives all the remaining weight, λ^(T−t−1).
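The episodic λ-return can be computed directly from the list of n-step returns (a sketch; names are ours). Note the edge cases: λ = 0 leaves only the one-step return, λ = 1 only the complete return:

```python
def lambda_return(n_step_returns, lam):
    """n_step_returns[n-1] holds Rt(n) for n = 1 .. T-t; the last entry is the
    complete return Rt. Weights: (1 - lam) * lam**(n-1) until termination,
    and lam**(T-t-1) on the complete return after termination."""
    T_minus_t = len(n_step_returns)
    truncated = sum((1 - lam) * lam ** (n - 1) * R
                    for n, R in enumerate(n_step_returns[:-1], start=1))
    return truncated + lam ** (T_minus_t - 1) * n_step_returns[-1]

rs = [2.0, 3.0, 5.0]   # Rt(1), Rt(2), Rt(3) = Rt
```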
TD(λ) on the Random Walk
❐ Same 19-state random walk as before
Relation to TD(0) and MC
❐ The λ-return can be rewritten as:

Rtλ = (1 − λ) Σn=1..T−t−1 λ^(n−1) Rt(n)  +  λ^(T−t−1) Rt
(until termination) (after termination)

❐ If λ = 1, you get MC:

Rtλ = (1 − 1) Σn=1..T−t−1 1^(n−1) Rt(n) + 1^(T−t−1) Rt = Rt

❐ If λ = 0, you get TD(0):

Rtλ = (1 − 0) Σn=1..T−t−1 0^(n−1) Rt(n) + 0^(T−t−1) Rt = Rt(1)
Unified View
5. FUNCTION APPROXIMATION
❐ RL in large state spaces
Function Approximation
❐ Can’t store the value of every state
 Too many states / continuous state space
 Limited memory
 Too slow to learn individual values / no generalisation
❐ Use function approximation to approximate the value function
 Using a small number of parameters
 Generalising between similar states
❐ Adjust parameters by gradient descent (or a similar idea)
Value Prediction with FA

As usual, policy evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ.

Parameters: θt = (θt(1), θt(2), …, θt(n))^T  (T denotes transpose)

Value function approximation: Vt(s) = f(s, θt)

e.g., θt could be the vector of connection weights of a neural network.
Monte-Carlo Training Examples
Monte-Carlo backup: V(st) ← V(st) + αt [Rt − V(st)]

As a training example:
{ description of st ,  Rt }
  (input)              (target output)
TD Training Examples
e.g., the TD(0) backup: V(st) ← V(st) + α [rt+1 + γV(st+1) − V(st)]

As a training example:
{ description of st ,  rt+1 + γV(st+1) }
  (input)              (target output)
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point θt in this space is:

∇θ f(θt) = ( ∂f(θt)/∂θ(1), ∂f(θt)/∂θ(2), …, ∂f(θt)/∂θ(n) )^T

[Figure: contours of f over a 2-D parameter space θt = (θt(1), θt(2))^T]

Iteratively move down the gradient:

θt+1 = θt − α ∇θ f(θt)
Linear Methods
Represent states as feature vectors:
for each s ∈ S:  φs = (φs(1), φs(2), …, φs(n))^T

Vt(s) = θt^T φs = Σi=1..n θt(i) φs(i)

∇θ Vt(s) = φs
Linear Temporal-Difference Learning

Update parameters in the direction that reduces the TD-error:

V(s) = φ(s)^T θ
δt = rt+1 + γV(st+1) − V(st)
Δθ = α δt φ(st)
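Putting these equations together as one parameter update (a sketch; names are ours, with plain lists standing in for vectors):

```python
def linear_td_update(theta, phi_s, phi_next, r, alpha=0.1, gamma=1.0,
                     terminal=False):
    """Linear TD(0): V(s) = phi(s)^T theta, so the gradient is phi(s) and the
    update is theta <- theta + alpha * delta_t * phi(st)."""
    v = sum(t * f for t, f in zip(theta, phi_s))
    v_next = 0.0 if terminal else sum(t * f for t, f in zip(theta, phi_next))
    delta = r + gamma * v_next - v            # TD-error delta_t
    return [t + alpha * delta * f for t, f in zip(theta, phi_s)]

theta = linear_td_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 1.0, alpha=0.5)
```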
Coarse Coding
Tile Coding
❐ Binary feature for each tile
❐ Number of features present at any one time is constant
❐ Binary features mean the weighted sum is easy to compute
❐ Easy to compute the indices of the features present
Mountain-Car Task
Mountain Car with Radial Basis Functions
Convergence
❐ Monte-Carlo learning converges with:
 On-policy or off-policy learning
 Arbitrary smooth function approximators
❐ Temporal-difference learning converges with:
 On-policy learning (e.g. not Q-learning)
 Linear function approximation
❐ Gradient temporal-difference learning converges with:
 On-policy or off-policy learning
 Arbitrary smooth function approximators
 (Sutton et al., ICML 2009)
SUMMARY
❐ Reinforcement learning: learn from experience to achieve more reward
❐ The problem can be reduced to:
 Policy evaluation (prediction)
 Control
❐ Model-free solution methods:
 Monte-Carlo simulation and Monte-Carlo control
 Temporal-difference learning, Sarsa, Q-learning
 TD(λ) provides a spectrum between MC and TD
❐ Function approximation:
 Scale up MC and TD to large state spaces
QUESTIONS
❐ “The only stupid question is the one you never asked” – Richard Sutton