Reinforcement Learning
Michael Roberts
With Material From: Reinforcement Learning: An Introduction
Sutton & Barto (1998)
What is RL?
• Trial & error learning
  – without model
  – with model
• Structure
[Figure: agent-environment loop through states s1, s2, s3, s4 with rewards r1, r2, r3]
RL vs. Supervised Learning
• Evaluative vs. Instructional feedback
• Role of exploration
• On-line performance
K-armed Bandit Problem
[Figure: 4-armed bandit. The agent selects among four actions whose average rewards are 10, -5, 100, and 0; two sample reward sequences: 0, 0, 5, 10, 35 and 5, 10, -15, -15, -10]
K-armed Bandit Cont.
• Greedy exploration
• ε-greedy
• Softmax
Average reward (sample average of the first k rewards):

  Q_k = (r_1 + r_2 + … + r_k) / k

Incremental formula:

  Q_{k+1} = Q_k + α (r_{k+1} − Q_k)

where α = 1 / (k+1)

Probability of choosing action a (softmax with temperature τ):

  P(a) = e^{Q(a)/τ} / Σ_b e^{Q(b)/τ}
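The ε-greedy strategy and the incremental update above can be sketched as follows. This is a minimal illustration, not code from the lecture; the Gaussian reward noise, ε = 0.1, and step count are assumptions.

```python
import random

def run_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """epsilon-greedy action selection with the incremental
    sample-average update Q_{k+1} = Q_k + (1/(k+1)) (r - Q_k)."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k          # estimated value of each arm
    counts = [0] * k       # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit (greedy)
        r = rng.gauss(true_means[a], 1.0)          # noisy reward (assumed)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]             # incremental mean, alpha = 1/(k+1)
    return Q

# The slide's example arms have average rewards 10, -5, 100, 0.
estimates = run_bandit([10, -5, 100, 0])
```

With even a small ε, the agent discovers the 100-reward arm and its estimate converges to the true mean.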
More General Problems
• More than one state
• Delayed rewards
• Markov Decision Process (MDP)
  – Set of states
  – Set of actions
  – Reward function
  – State transition function
• Table or Function Approximation
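The four MDP components can be written down directly as a data structure. A minimal sketch, assuming a tabular representation; the one-state example is hypothetical, not from the slides.

```python
from typing import Callable, Dict, List, NamedTuple

class MDP(NamedTuple):
    """The four components of an MDP listed above."""
    states: List[str]
    actions: List[str]
    # reward(s, a) -> expected immediate reward
    reward: Callable[[str, str], float]
    # transition(s, a) -> {next_state: probability}
    transition: Callable[[str, str], Dict[str, float]]

# Hypothetical one-state example, just to show the shape.
trivial = MDP(
    states=["s0"],
    actions=["stay"],
    reward=lambda s, a: 1.0,
    transition=lambda s, a: {"s0": 1.0},
)
```

For small state sets a table like this suffices; larger problems use function approximation, as the slide notes.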
Example: Recycling Robot
Recycling Robot: Transition Graph
Dynamic Programming
Backup Diagram
[Figure: backup diagram with branch probabilities 0.25/0.25/0.25, 0.5/0.5, 0.3/0.7, 0.6/0.4 and rewards 10, 5, 200, 200, -10, 1000]
Dynamic Programming: Optimal Policy
Backup for Optimal Policy
Performance Metrics
• Eventual convergence to optimality
• Speed of convergence to optimality
• Regret
(Kaelbling, L., Littman, M., & Moore, A. 1996)
Gridworld Example
Initialize V(s) arbitrarily, e.g. V(s) = 0, for all s ∈ S
Repeat
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)
Output a deterministic policy π such that:
  π(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]
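The value-iteration pseudocode above can be sketched directly in code. This is a minimal illustration on a hypothetical two-state MDP (not the recycling robot or the gridworld from the slides); R(s,a) here is the expected immediate reward, a simplification of R(s,a,s').

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-8):
    """Sweep all states, backing up
    V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a) + gamma V(s')]
    until the largest change Delta falls below theta."""
    V = {s: 0.0 for s in states}            # initialize V arbitrarily

    def backup(s, a):
        return sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)])

    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(backup(s, a) for a in actions)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # greedy deterministic policy from the converged values
    pi = {s: max(actions, key=lambda a: backup(s, a)) for s in states}
    return V, pi

# Hypothetical two-state, two-action MDP.
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {("s0", "a0"): [("s0", 0.7), ("s1", 0.3)], ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("s0", 1.0)], ("s1", "a1"): [("s1", 0.6), ("s0", 0.4)]}
R = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 5.0, ("s1", "a1"): 2.0}
V, pi = value_iteration(states, actions, P, R)
```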
Temporal Difference Learning
• RL without a model
• Issue of temporal credit assignment
• Bootstraps like DP
• TD(0):
  V(s) ← V(s) + α [r + γ V(s') − V(s)]
TD Learning
• Again, TD(0):
  V(s) ← V(s) + α [r + γ V(s') − V(s)]
• TD(λ): for all states s,
  V(s) ← V(s) + α [r + γ V(s') − V(s)] e(s)
  where e is called an eligibility trace
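The TD(λ) update with eligibility traces can be sketched as follows. A minimal illustration on a hypothetical deterministic three-state chain (s0 → s1 → terminal s2, which pays reward 1); the chain, step size, and accumulating-trace style are assumptions, not from the slides.

```python
def td_lambda(episodes, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) on a deterministic chain s0 -> s1 -> s2.
    Each step: delta = r + gamma V(s') - V(s); every state is then
    moved by alpha * delta * e(s), with e its eligibility trace."""
    V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}   # V(terminal) stays 0
    for _ in range(episodes):
        e = {s: 0.0 for s in V}             # traces reset each episode
        s = "s0"
        while s != "s2":
            s_next, r = ("s1", 0.0) if s == "s0" else ("s2", 1.0)
            delta = r + gamma * V[s_next] - V[s]
            e[s] += 1.0                     # accumulating trace
            for x in V:
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam         # decay all traces
            s = s_next
    return V

values = td_lambda(500)
```

With γ = 0.9 the true values are V(s1) = 1.0 and V(s0) = 0.9, and the estimates converge toward them.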
Backup Diagram for TD(λ)
TD-Gammon (Tesauro)
Additional Work
• POMDPs
• Macros
• Multi-agent RL
• Multiple reward structures