Reinforcement Learning
Institute for Theoretical Computer Science http://www.igi.tugraz.at/haeusler
Machine Learning B 708.062 08W 1st KU
Dr. Stefan Häusler
Institut für Grundlagen der Informationsverarbeitung
Technische Universität Graz
Course WS 2010/11
Häusler MLB KU 708.062 WS 2010/11 2
Second homework set
Theory and practice of reinforcement learning (RL):
● Two theoretical problems
● On- vs. off-policy learning
● Function approximation
● RL with self-play
Assignment 4
Assignment 4
Corollary to
– Comparing policies through value functions
– Shows that we can concentrate on deterministic policies
Assignment 5
Assignment 5
Optimal policy:
– Highest possible value / return at every state
– Return: R_t = Σ_{k=0}^∞ γ^k r_{t+k+1}
– Value function: V^π(s) = E_π[R_t | s_t = s]
Modifications:
– Linear scaling of all rewards
– Change only maximum / minimum reward
Hint: Carefully read preconditions (γ<1, continuing,…)
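These modifications can be probed numerically. Below is a small value-iteration sketch in Python on a randomly generated continuing MDP (entirely my own illustrative setup, not part of the assignment): with γ < 1 in a continuing task, a positive linear scaling of all rewards, and even a constant shift, leaves the greedy policy unchanged, which is exactly why the preconditions in the hint matter.

```python
import numpy as np

# Hypothetical random continuing MDP, purely illustrative of Assignment 5.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s,a] -> dist over s'
R = rng.normal(size=(n_states, n_actions))                        # expected reward r(s,a)

def greedy_policy(R, P, gamma, iters=500):
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * P @ V   # Q[s,a] = r(s,a) + gamma * sum_s' P[s,a,s'] V(s')
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

pi = greedy_policy(R, P, gamma)
pi_scaled = greedy_policy(3.0 * R, P, gamma)    # linear scaling: r -> 3r
pi_shifted = greedy_policy(R + 10.0, P, gamma)  # constant shift (continuing, gamma < 1)

print(pi, pi_scaled, pi_shifted)
```

Scaling multiplies all values by the same positive constant, and a constant shift adds the same c/(1−γ) to every value, so neither changes the argmax; in episodic tasks the shift argument breaks down.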
Example
γ = 1: Optimal policy
Example
New optimal policy!
Example
New optimal policy!
No change if we simultaneously change both goal rewards
Example
Modification of the negative reward does not change anything here
New optimal policy!
Programming tasks
MATLAB Reinforcement Learning Toolbox by Jose Antonio Martin
(http://www.dacya.ucm.es/jam/download.htm)
You need to modify the MATLAB code
– set up state and action space
– define reward function
– select learning algorithm + parameters
Not using the toolbox is OK - but more work for you!
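Since skipping the toolbox is allowed, here is a sketch of what the three ingredients look like outside MATLAB: a minimal Python grid-world environment (layout, rewards, and all names are my own choices, not from the toolbox).

```python
# A minimal Python alternative to the toolbox: the same three ingredients
# (state/action space, reward function, learner plugged in on top).

class GridWorld:
    """Tiny deterministic grid-world; states are (row, col) cells."""
    def __init__(self, rows=4, cols=4, goal=(3, 3)):
        self.rows, self.cols, self.goal = rows, cols, goal
        self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def step(self, state, a):
        r, c = state
        dr, dc = self.actions[a]
        nxt = (min(max(r + dr, 0), self.rows - 1),
               min(max(c + dc, 0), self.cols - 1))
        reward = 0.0 if nxt == self.goal else -1.0   # -1 per step until the goal
        return nxt, reward, nxt == self.goal

env = GridWorld()
s2, r, done = env.step((3, 2), 3)   # moving right from (3, 2) reaches the goal
print(s2, r, done)
```

Any tabular learner (SARSA, Q-learning) can then be written against `step` plus a choice of parameters.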
Assignment 6
RL algorithms
Policy Evaluation (the prediction problem): For a given policy π, compute the state-value function Vπ.
V^\pi(s) = E_\pi[R_t \mid s_t = s]
         = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]
         = E_\pi\!\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right]
         = E_\pi\!\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \,\middle|\, s_t = s\right]
Rt … actual return following time t
γ … discount factor
Eπ[·] … expectation value w.r.t. policy π
Monte Carlo methods
Dynamic Programming
Temporal difference learning
Temporal difference learning
Recall:
Simple every-visit Monte Carlo method
The simplest TD method, TD(0):
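The two update rules omitted by the transcript are standard (cf. Sutton & Barto):

```latex
\text{every-visit MC:}\quad V(s_t) \leftarrow V(s_t) + \alpha\,\big[R_t - V(s_t)\big]
\text{TD(0):}\quad V(s_t) \leftarrow V(s_t) + \alpha\,\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]
```

MC waits for the actual return R_t; TD(0) bootstraps from the current estimate of the successor state.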
Learning a Q-function
Estimate Q^π for the current policy π
After each transition from a non-terminal state s_t, do the update
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) := 0
Compare: the TD(0) update for V
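The Q-update referred to above, in its standard one-step form:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big],
\qquad Q(s_{t+1}, a_{t+1}) := 0 \ \text{if } s_{t+1} \text{ is terminal}
```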
SARSA: On-policy TD learning
One-step SARSA-learning:
Example: Windy grid-world
Undiscounted, episodic, reward = –1 until the goal is reached.
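As a concrete on-policy baseline, here is a compact SARSA implementation of this windy grid-world in Python (the 7x10 layout and wind strengths follow the standard Sutton & Barto version; step size, ε, and episode count are my own choices):

```python
import numpy as np

# Windy grid-world: 7x10 grid, upward wind per column, reward -1 per step,
# undiscounted (gamma = 1), episodic.
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    r, c = s
    dr, dc = ACTIONS[a]
    r = min(max(r + dr - WIND[c], 0), ROWS - 1)   # wind pushes upward
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c)

rng = np.random.default_rng(0)
Q = np.zeros((ROWS, COLS, 4))
alpha, eps = 0.5, 0.1

def eps_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(4))
    return int(Q[s].argmax())

for episode in range(300):
    s, a = START, eps_greedy(START)
    while s != GOAL:
        s2 = step(s, a)
        a2 = eps_greedy(s2)
        # On-policy target: uses the action actually taken next (gamma = 1).
        Q[s][a] += alpha * (-1 + Q[s2][a2] - Q[s][a])
        s, a = s2, a2

# Follow the greedy policy from the start state.
s, steps = START, 0
while s != GOAL and steps < 100:
    s = step(s, int(Q[s].argmax()))
    steps += 1
print("greedy episode length:", steps)
```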
Q-learning: Off-policy TD learning
One-step Q-learning:
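The omitted update rule, in its standard form:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big]
```

Off-policy: the target uses the greedy action in s_{t+1}, regardless of which action the behavior policy actually takes.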
Example: Cliff-walking
Eligibility traces
Extract more information within one trial about how to get to the goal.
Can considerably accelerate learning.
Sarsa(λ): On-policy TD learning with eligibility traces
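The Sarsa(λ) equations omitted by the transcript, in their standard accumulating-trace form:

```latex
\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
e_t(s,a) = \gamma \lambda\, e_{t-1}(s,a) + \mathbf{1}[s = s_t,\, a = a_t]
Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a) \quad \text{for all } s, a
```

Every state-action pair is updated in proportion to how recently (and how often) it was visited.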
Assignment 7
Gradient descent methods
Define a parameterized value function V_t(s) with weight vector θ_t, and minimize the squared error to V^π(s) by gradient descent.
Linear value function approximation (I)
Gradient descent on the value function
\theta_{t+1} = \theta_t - \tfrac{\alpha}{2}\, \nabla_{\theta_t} \big[V^\pi(s_t) - V_t(s_t)\big]^2
             = \theta_t + \alpha \big[V^\pi(s_t) - V_t(s_t)\big]\, \nabla_{\theta_t} V_t(s_t)
Linear value function approximation (II)
Linear value function approximation (III)
\theta_{t+1} = \theta_t + \alpha \big[V^\pi(s_t) - V_t(s_t)\big]\, \nabla_{\theta_t} V_t(s_t),
\quad \text{i.e., for linear } V_t(s) = \theta_t^\top \phi_s:\quad
\theta_{t+1} = \theta_t + \alpha \big[V^\pi(s_t) - V_t(s_t)\big]\, \phi_{s_t}
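To see the linear update in action, here is a minimal semi-gradient TD(0) sketch in Python on a 5-state random walk (setup and constants are my own illustration; with one-hot features the linear case reduces to the tabular one, so the exact values are recoverable):

```python
import numpy as np

# Semi-gradient TD(0) with a linear value function V_t(s) = theta^T phi_s.
# 5 non-terminal states 0..4, terminals on both ends; reward +1 only on the
# right exit; true values are 1/6, 2/6, ..., 5/6.
n = 5
gamma, alpha = 1.0, 0.05
phi = np.eye(n)            # one-hot feature vector phi_s for each state
theta = np.zeros(n)
rng = np.random.default_rng(1)

for _ in range(20000):
    s = 2                                  # start in the middle
    while True:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        if s2 < 0:
            r, v2, done = 0.0, 0.0, True   # left terminal
        elif s2 >= n:
            r, v2, done = 1.0, 0.0, True   # right terminal
        else:
            r, v2, done = 0.0, phi[s2] @ theta, False
        # theta <- theta + alpha [r + gamma V(s') - V(s)] grad V(s),
        # and for linear V the gradient is just phi_s.
        theta += alpha * (r + gamma * v2 - phi[s] @ theta) * phi[s]
        if done:
            break
        s = s2

print(np.round(theta, 2))   # close to the true values 1/6 ... 5/6
```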
What feature vectors to choose? (I)
Tile coding:
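As a sketch of the idea, here is a minimal one-dimensional tile coder in Python (tiling count, tile width, and offsets are my own example choices): several overlapping uniform grids, each contributing exactly one active binary feature, so nearby inputs share most of their active features.

```python
# Minimal single-variable tile coder (illustrative only).
def tile_features(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Return the index of the active tile in each tiling for x in [lo, hi)."""
    active = []
    width = (hi - lo) / n_tiles
    for t in range(n_tilings):
        offset = t * width / n_tilings          # each tiling is shifted slightly
        idx = int((x - lo + offset) / width)
        idx = min(idx, n_tiles)                 # clamp the shifted overflow tile
        active.append(t * (n_tiles + 1) + idx)  # globally unique index per tiling
    return active

print(tile_features(0.3))
```

The returned indices select the components of a sparse binary feature vector; a linear learner then holds one weight per tile.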
What feature vectors to choose? (II)
RBF coding:
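RBF features replace the binary tiles with smooth Gaussian bumps; a minimal sketch (centers and width are my own example choices):

```python
import numpy as np

# RBF coding: phi_i(s) = exp(-(s - c_i)^2 / (2 sigma^2)), smooth generalization
# instead of binary tiles.
centers = np.linspace(0.0, 1.0, 10)   # centers c_i on a uniform grid
sigma = 0.1

def rbf_features(s):
    return np.exp(-((s - centers) ** 2) / (2 * sigma ** 2))

phi = rbf_features(0.42)
print(np.round(phi, 3))
```

The feature nearest to the input dominates, and activations fall off smoothly with distance, so the learned value function is smooth in s.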
How to turn this into a control problem
How can we estimate the Q-function?
Convergence is no longer guaranteed.
Use a different parameter vector θ^a for each action a; update only the one belonging to a_t.
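Written out, the per-action parameterization reads (a standard formulation consistent with the slide's prescription):

```latex
Q_t(s, a) = (\theta^a_t)^\top \phi_s, \qquad
\theta^{a_t}_{t+1} = \theta^{a_t}_t + \alpha\,\big[\text{target}_t - Q_t(s_t, a_t)\big]\,\phi_{s_t},
\qquad \theta^{a}_{t+1} = \theta^{a}_t \ \text{for } a \neq a_t
```

where the target is r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) for SARSA or r_{t+1} + γ max_a Q_t(s_{t+1}, a) for Q-learning.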
Assignment 8
Learning with self-play
RL is increasingly being used in computer games, e.g. Black & White
Biggest Success Story of RL: World-class Backgammon program [Tesauro,1995]
• TD-Gammon learns autonomously
• Little prior knowledge
• Most computer games rely heavily on the AI designer's knowledge
• Training by playing against itself, i.e. self-play
TD Gammon: How it works (I)
Evaluation Function:
– Represented by Neural Network
– NN yields probability of winning for every position
Input representation:
– binary representation of current board position
– Later: complex Backgammon-features
TD Gammon: How it works (II)
Training
– Plays training matches against itself
– Rewards: +1 for victory, –1 for loss, otherwise 0
– undiscounted: γ= 1
– TD(λ) learning improves estimation for non-terminal positions
– Neural network is trained (via backprop) toward the new estimated winning probabilities
Initially random weights
Backpropagation of TD-Error
Performance
– Increases with number of training matches
– Up to 1.5 million training matches required for learning
TD Gammon: Results
Original version (no prior knowledge)
– plays comparable to other programs
Improvement: Feature Definition, hidden Neurons
– Use prior backgammon knowledge to define sensible features (not raw board position)
– Among top 3 players in the world (human and computer)
Signs of Creativity
– TD-Gammon played some opening moves differently than human grandmasters
– Statistical evaluation: TD-Gammon moves are better
– Today humans play the same moves!
Task for assignment 8
Find a 2-player game with a small state set, e.g. Tic-Tac-Toe, Blackjack, Nim
– suitable for tabular learning (no approximation)
Let your program learn with self-play.
Evaluate playing strength.
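A tabular self-play skeleton can be very small. Below is my own tiny example in the spirit of the listed games: the subtraction game "take 1 to 3 matches; whoever takes the last one wins". Both players share one Q-table, always read from the viewpoint of the player to move, and the bootstrapped target negates the opponent's best value (a negamax-style update; all constants are illustrative choices).

```python
import numpy as np

# Tabular self-play on the subtraction game (a one-pile Nim variant).
N = 21
rng = np.random.default_rng(0)
Q = np.zeros((N + 1, 3))          # Q[n, a-1]: value of taking a from a pile of n
alpha, eps = 0.1, 0.2

def legal(n):
    return [a for a in (1, 2, 3) if a <= n]

for _ in range(20000):
    n = N
    while n > 0:
        acts = legal(n)
        if rng.random() < eps:                      # epsilon-greedy self-play
            a = int(rng.choice(acts))
        else:
            a = acts[int(np.argmax([Q[n, x - 1] for x in acts]))]
        if a == n:                  # took the last match: current player wins
            target = 1.0
        else:                       # opponent moves next, so negate their best value
            target = -max(Q[n - a, x - 1] for x in legal(n - a))
        Q[n, a - 1] += alpha * (target - Q[n, a - 1])
        n -= a

# Known solution: a pile n is winning for the player to move iff n % 4 != 0.
print([max(Q[n, a - 1] for a in legal(n)) > 0 for n in range(1, 9)])
```

Playing strength can then be evaluated exactly against the known optimal rule, or empirically against a random opponent.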