Lirong Xia, Reinforcement Learning (2), Tue, March 21, 2014
TRANSCRIPT
Lirong Xia
Reinforcement Learning (2)
Tue, March 21, 2014
Reminder
• Project 2 due tonight
• Project 3 is online (more later)
  – due in two weeks
Recap: MDPs
• Markov decision processes:
  – States S
  – Start state s0
  – Actions A
  – Transitions p(s’|s,a) (or T(s,a,s’))
  – Rewards R(s,a,s’) (and discount γ)
• MDP quantities:
  – Policy = choice of action for each (MAX) state
  – Utility (or return) = sum of discounted rewards
Optimal Utilities
• The value of a state s:
  – V*(s) = expected utility starting in s and acting optimally
• The value of a Q-state (s,a):
  – Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
• The optimal policy:
  – π*(s) = optimal action from state s
Solving MDPs
• Value iteration
  – Start with V1(s) = 0
  – Given Vi, calculate the values for all states at depth i+1:
    $V_{i+1}(s) = \max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma V_i(s')\right]$
  – Repeat until convergence
  – Use Vi as the evaluation function when computing Vi+1
• Policy iteration
  – Step 1: policy evaluation: calculate utilities for some fixed policy
  – Step 2: policy improvement: update the policy using one-step look-ahead with the resulting utilities as future values
  – Repeat until the policy converges
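The value-iteration loop above fits in a few lines of Python. The tiny two-state MDP below (states, actions, and the transition table) is invented purely for illustration, not taken from the course code.

```python
# Minimal value iteration on a toy two-state MDP (illustrative example).
# T maps (state, action) to a list of (next_state, prob, reward) triples.
GAMMA = 0.9

T = {
    ("A", "go"): [("B", 1.0, 0.0)],
    ("B", "go"): [("B", 0.5, 1.0), ("A", 0.5, 0.0)],
}
states = ["A", "B"]
actions = {"A": ["go"], "B": ["go"]}

V = {s: 0.0 for s in states}  # start with V_1(s) = 0
for _ in range(200):          # repeat until (approximately) converged
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for (s2, p, r) in T[(s, a)])
            for a in actions[s]
        )
        for s in states
    }
```

With γ = 0.9 this converges to V*(B) = 0.5/0.145 ≈ 3.45 and V*(A) = γ·V*(B) ≈ 3.10; each sweep is exactly the Bellman update written above.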
Reinforcement Learning
• Don’t know T and/or R, but can observe R
  – Learn by doing
  – Can have multiple episodes (trials)
The Story So Far: MDPs and RL
Things we know how to do:
• If we know the MDP
  – Compute V*, Q*, π* exactly
  – Evaluate a fixed policy π
• If we don’t know the MDP
  – We can estimate the MDP, then solve
  – We can estimate V for a fixed policy π
  – We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:
• Computation
  – Value and policy iteration
  – Policy evaluation
• Model-based RL
  – Sampling
• Model-free RL
  – Q-learning
Model-Free Learning
• Model-free (temporal difference) learning
  – Experience the world through episodes (s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’,…)
  – Update estimates at each transition (s,a,r,s’)
  – Over time, the updates will mimic Bellman updates

Q-Value Iteration (model-based, requires known MDP):
$Q_{i+1}(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right]$

Q-Learning (model-free, requires only experienced transitions):
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$
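The model-free update is a one-liner once Q-values live in a table. A minimal sketch, with an illustrative learning rate and discount (the state and action names in the example call are invented):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0   # illustrative learning rate and discount

Q = defaultdict(float)    # unseen Q-states default to 0

def q_update(s, a, r, s2, actions_in_s2):
    """Apply one Q-learning update for the observed transition (s, a, r, s2)."""
    sample = r + GAMMA * max(Q[(s2, a2)] for a2 in actions_in_s2)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample
```

For example, calling `q_update("s0", "right", 1.0, "s1", ["left", "right"])` on an empty table moves Q(s0, right) halfway toward the sample, to 0.5.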
Detour: Q-Value Iteration
• We’d like to do Q-value updates to each Q-state:
  $Q_{i+1}(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right]$
  – But we can’t compute this update without knowing T and R
• Instead, compute the average as we go
  – Receive a sample transition (s,a,r,s’)
  – This sample suggests $Q(s,a) \approx r + \gamma \max_{a'} Q(s',a')$
  – But we want to merge the new observation with the old ones
  – So keep a running average:
  $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$
Q-Learning Properties
• Will converge to the optimal policy
  – If you explore enough (i.e., visit each Q-state many times)
  – If you make the learning rate small enough
  – Basically doesn’t matter how you select actions (!)
• Off-policy learning: learns optimal Q-values, not the values of the policy you are following
Q-Learning
• Q-learning produces tables of Q-values.
Exploration / Exploitation
• Random actions (ε-greedy)
  – Every time step, flip a coin
  – With probability ε, act randomly
  – With probability 1-ε, act according to the current policy
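The coin-flip rule above can be sketched directly; Q is assumed here to be a dict keyed by (state, action) pairs, which is an implementation choice, not something the slide specifies.

```python
import random

def epsilon_greedy(state, legal_actions, Q, epsilon=0.1):
    """With probability epsilon explore randomly; otherwise exploit Q."""
    if random.random() < epsilon:
        return random.choice(legal_actions)          # explore
    # exploit: pick the action with the highest current Q-value
    return max(legal_actions, key=lambda a: Q.get((state, a), 0.0))
```

Note that setting ε = 0 recovers the purely greedy policy, and ε = 1 gives uniform random action selection.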
Today: Q-Learning with state abstraction
• In realistic situations, we cannot possibly learn about every single state!
  – Too many states to visit them all in training
  – Too many states to hold the Q-tables in memory
• Instead, we want to generalize:
  – Learn about some small number of training states from experience
  – Generalize that experience to new, similar states
  – This is a fundamental idea in machine learning, and we’ll see it over and over again
Example: Pacman
• Let’s say we discover through experience that a particular state is bad.
• In naive Q-learning, that tells us nothing about a nearly identical state or its Q-states.
• Or even one that differs only slightly!
Feature-Based Representations
• Solution: describe a state using a vector of features (properties)
  – Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  – Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to dot)²
    • Is Pacman in a tunnel? (0/1)
    • …etc.
    • Is it the exact state on this slide?
  – Can also describe a Q-state (s,a) with features (e.g., action moves closer to food)
• Similar to an evaluation function
Linear Feature Functions
• Using a feature representation, we can write a Q function (or value function) for any state using a few weights:
  $Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \ldots + w_n f_n(s,a)$
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but actually be very different in value!
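A linear Q-function is just a dot product between a weight vector and a feature vector. In the sketch below the feature names, toy feature values, and weights are all invented for illustration; a real extractor would compute the features from the game state.

```python
# A linear Q-function as a dot product of weights and features.
# Feature names and values are toy stand-ins for real extractors.
def features(s, a):
    # A real extractor would compute these from the game state (s, a).
    return {"dist-to-dot": 0.5, "dist-to-ghost": 1.0}

weights = {"dist-to-dot": 4.0, "dist-to-ghost": -1.0}

def q_value(s, a):
    f = features(s, a)
    return sum(weights[k] * f[k] for k in f)  # Q(s,a) = sum_k w_k f_k(s,a)
```

With these toy numbers, `q_value` returns 4.0·0.5 + (−1.0)·1.0 = 1.0 regardless of the state passed in; only the weights need to be stored and learned, not a table over all states.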
Function Approximation
• Q-learning with linear Q-functions:
  transition = (s,a,r,s’)
  difference = $\left[r + \gamma \max_{a'} Q(s',a')\right] - Q(s,a)$
  Exact Q’s: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\text{difference}$
  Approximate Q’s: $w_m \leftarrow w_m + \alpha\,\text{difference}\,f_m(s,a)$
• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, disprefer all states with that state’s features
• Formal justification: online least squares
Example: Q-Pacman
$Q(s,a) = 4.0\,f_{\text{DOT}}(s,a) - 1.0\,f_{\text{GST}}(s,a)$

a = NORTH, r = -500

$f_{\text{DOT}}(s,\text{NORTH}) = 0.5$
$f_{\text{GST}}(s,\text{NORTH}) = 1.0$
$Q(s,\text{NORTH}) = +1$

$R(s,a,s') = -500$
difference = -501

$w_{\text{DOT}} \leftarrow 4.0 + \alpha\,[-501]\,0.5$
$w_{\text{GST}} \leftarrow -1.0 + \alpha\,[-501]\,1.0$

$Q(s,a) = 3.0\,f_{\text{DOT}}(s,a) - 3.0\,f_{\text{GST}}(s,a)$
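The slide's arithmetic can be replayed in code. One caveat: the slide does not state α explicitly, but the final weights (4.0 → 3.0 and −1.0 → −3.0) imply α = 2/501 ≈ 0.004, so that value is inferred here rather than given.

```python
# Replaying the Q-Pacman update. ALPHA is inferred from the slide's
# final weights, not stated on the slide itself.
GAMMA = 1.0
ALPHA = 2 / 501

weights = {"DOT": 4.0, "GST": -1.0}
feats = {"DOT": 0.5, "GST": 1.0}    # f(s, NORTH)
r, max_next_q = -500.0, 0.0         # observed reward; no future value

q = sum(weights[k] * feats[k] for k in feats)   # Q(s, NORTH) = +1
difference = (r + GAMMA * max_next_q) - q       # -501
for k in feats:
    weights[k] += ALPHA * difference * feats[k]
# weights is now approximately {"DOT": 3.0, "GST": -3.0}, as on the slide
```

Both features were active, so both weights get pushed down by the large negative difference, and the DOT weight moves half as far because its feature value is 0.5.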
Linear Regression
Ordinary Least Squares (OLS)
Minimizing Error
Imagine we had only one point x with features f(x):

$\text{error}(w) = \frac{1}{2}\left(y - \sum_k w_k f_k(x)\right)^2$

$\frac{\partial\,\text{error}(w)}{\partial w_m} = -\left(y - \sum_k w_k f_k(x)\right) f_m(x)$

$w_m \leftarrow w_m + \alpha\left(y - \sum_k w_k f_k(x)\right) f_m(x)$

Approximate Q update explained:

$w_m \leftarrow w_m + \alpha\left[\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{“target”}} - \underbrace{Q(s,a)}_{\text{“prediction”}}\right] f_m(s,a)$
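The single-point weight update above is one stochastic-gradient step on the squared error. A minimal sketch with list-indexed weights (the function name and argument layout are illustrative):

```python
# One gradient step on error(w) = 1/2 * (y - sum_m w_m f_m(x))^2,
# following the derivation above.
def sgd_step(weights, f_x, y, alpha):
    """Nudge each w_m by alpha * (y - prediction) * f_m(x); in-place."""
    prediction = sum(w * f for w, f in zip(weights, f_x))
    for m in range(len(weights)):
        weights[m] += alpha * (y - prediction) * f_x[m]
    return prediction
```

Repeated steps on the same point drive the prediction toward the target y; the approximate-Q update is exactly this step with y = r + γ max_{a'} Q(s',a') as the target and Q(s,a) as the prediction.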
How many features should we use?
• As many as possible?
  – computational burden
  – overfitting
• Feature selection is important
  – requires domain expertise
Overfitting
The Elements of Statistical Learning, Fig. 2.11.
Overview of Project 3
• MDPs
  – Q1: value iteration
  – Q2: find parameters that lead to a certain optimal policy
  – Q3: similar to Q2
• Q-learning
  – Q4: implement the Q-learning algorithm
  – Q5: implement ε-greedy action selection
  – Q6: try the algorithm
• Approximate Q-learning and state abstraction
  – Q7: Pacman
  – Q8 (bonus): design and implement a state-abstraction Q-learning algorithm
• Hints
  – make your implementation general
  – try to use class methods as much as possible