Lirong Xia, Reinforcement Learning (2), Tue, March 21, 2014
TRANSCRIPT
Lirong Xia
Reinforcement Learning (2)
Tue, March 21, 2014
Reminder
• Project 2 due tonight
• Project 3 is online (more later)
  – due in two weeks
Recap: MDPs
• Markov decision processes:
  – States S
  – Start state s0
  – Actions A
  – Transitions p(s’|s,a) (or T(s,a,s’))
  – Rewards R(s,a,s’) (and discount γ)
• MDP quantities:
  – Policy = choice of action for each (MAX) state
  – Utility (or return) = sum of discounted rewards
Optimal Utilities
• The value of a state s:
  – V*(s) = expected utility starting in s and acting optimally
• The value of a Q-state (s,a):
  – Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
• The optimal policy:
  – π*(s) = optimal action from state s
Solving MDPs
• Value iteration
  – Start with V1(s) = 0
  – Given Vi, calculate the values for all states at depth i+1:
    $V_{i+1}(s) = \max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma V_i(s')\right]$
  – Repeat until convergence
  – Use Vi as the evaluation function when computing Vi+1
• Policy iteration
  – Step 1: policy evaluation: calculate utilities for some fixed policy
  – Step 2: policy improvement: update the policy using one-step look-ahead with the resulting utilities as future values
  – Repeat until the policy converges
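The value-iteration loop above fits in a few lines of Python. The tiny two-state MDP below (states, actions, and the transition table) is invented purely for illustration, not taken from the course code.

```python
# Minimal value iteration on a toy two-state MDP (illustrative example).
# T maps (state, action) to a list of (next_state, prob, reward) triples.
GAMMA = 0.9

T = {
    ("A", "go"): [("B", 1.0, 0.0)],
    ("B", "go"): [("B", 0.5, 1.0), ("A", 0.5, 0.0)],
}
states = ["A", "B"]
actions = {"A": ["go"], "B": ["go"]}

V = {s: 0.0 for s in states}  # start with V_1(s) = 0
for _ in range(200):          # repeat until (approximately) converged
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for (s2, p, r) in T[(s, a)])
            for a in actions[s]
        )
        for s in states
    }
```

With γ = 0.9 this converges to V*(B) = 0.5/0.145 ≈ 3.45 and V*(A) = γ·V*(B) ≈ 3.10; each sweep is exactly the Bellman update written above.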
Reinforcement Learning
• Don’t know T and/or R, but can observe R
  – Learn by doing
  – Can have multiple episodes (trials)
The Story So Far: MDPs and RL
Things we know how to do:
• If we know the MDP
  – Compute V*, Q*, π* exactly
  – Evaluate a fixed policy π
• If we don’t know the MDP
  – We can estimate the MDP, then solve
  – We can estimate V for a fixed policy π
  – We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:
• Computation
  – Value and policy iteration
  – Policy evaluation
• Model-based RL
  – Sampling
• Model-free RL
  – Q-learning
Model-Free Learning
• Model-free (temporal difference) learning
  – Experience the world through episodes (s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’,…)
  – Update estimates at each transition (s,a,r,s’)
  – Over time, the updates will mimic Bellman updates

Q-Value Iteration (model-based, requires known MDP):
$Q_{i+1}(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right]$

Q-Learning (model-free, requires only experienced transitions):
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$
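The model-free update is a one-liner once Q-values live in a table. A minimal sketch, with an illustrative learning rate and discount (the state and action names in the example call are invented):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0   # illustrative learning rate and discount

Q = defaultdict(float)    # unseen Q-states default to 0

def q_update(s, a, r, s2, actions_in_s2):
    """Apply one Q-learning update for the observed transition (s, a, r, s2)."""
    sample = r + GAMMA * max(Q[(s2, a2)] for a2 in actions_in_s2)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample
```

For example, calling `q_update("s0", "right", 1.0, "s1", ["left", "right"])` on an empty table moves Q(s0, right) halfway toward the sample, to 0.5.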
Detour: Q-Value Iteration
• We’d like to do Q-value updates to each Q-state:
  $Q_{i+1}(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right]$
  – But we can’t compute this update without knowing T and R
• Instead, compute the average as we go
  – Receive a sample transition (s,a,r,s’)
  – This sample suggests $Q(s,a) \approx r + \gamma \max_{a'} Q(s',a')$
  – But we want to merge the new observation with the old ones
  – So keep a running average:
  $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$
Q-Learning Properties
• Will converge to the optimal policy
  – If you explore enough (i.e., visit each Q-state many times)
  – If you make the learning rate small enough
  – Basically doesn’t matter how you select actions (!)
• Off-policy learning: learns optimal Q-values, not the values of the policy you are following
Q-Learning
• Q-learning produces tables of Q-values.
Exploration / Exploitation
• Random actions (ε-greedy)
  – Every time step, flip a coin
  – With probability ε, act randomly
  – With probability 1-ε, act according to the current policy
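The coin-flip rule above can be sketched directly; Q is assumed here to be a dict keyed by (state, action) pairs, which is an implementation choice, not something the slide specifies.

```python
import random

def epsilon_greedy(state, legal_actions, Q, epsilon=0.1):
    """With probability epsilon explore randomly; otherwise exploit Q."""
    if random.random() < epsilon:
        return random.choice(legal_actions)          # explore
    # exploit: pick the action with the highest current Q-value
    return max(legal_actions, key=lambda a: Q.get((state, a), 0.0))
```

Note that setting ε = 0 recovers the purely greedy policy, and ε = 1 gives uniform random action selection.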
Today: Q-Learning with state abstraction
• In realistic situations, we cannot possibly learn about every single state!
  – Too many states to visit them all in training
  – Too many states to hold the Q-tables in memory
• Instead, we want to generalize:
  – Learn about some small number of training states from experience
  – Generalize that experience to new, similar states
  – This is a fundamental idea in machine learning, and we’ll see it over and over again
Example: Pacman
• Let’s say we discover through experience that a particular state is bad.
• In naive Q-learning, that tells us nothing about a nearly identical state or its Q-states.
• Or even one that differs only slightly!
Feature-Based Representations
• Solution: describe a state using a vector of features (properties)
  – Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  – Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to dot)²
    • Is Pacman in a tunnel? (0/1)
    • …etc.
    • Is it the exact state on this slide?
  – Can also describe a Q-state (s,a) with features (e.g., action moves closer to food)
• Similar to an evaluation function
Linear Feature Functions
• Using a feature representation, we can write a Q function (or value function) for any state using a few weights:
  $Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \ldots + w_n f_n(s,a)$
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but actually be very different in value!
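A linear Q-function is just a dot product between a weight vector and a feature vector. In the sketch below the feature names, toy feature values, and weights are all invented for illustration; a real extractor would compute the features from the game state.

```python
# A linear Q-function as a dot product of weights and features.
# Feature names and values are toy stand-ins for real extractors.
def features(s, a):
    # A real extractor would compute these from the game state (s, a).
    return {"dist-to-dot": 0.5, "dist-to-ghost": 1.0}

weights = {"dist-to-dot": 4.0, "dist-to-ghost": -1.0}

def q_value(s, a):
    f = features(s, a)
    return sum(weights[k] * f[k] for k in f)  # Q(s,a) = sum_k w_k f_k(s,a)
```

With these toy numbers, `q_value` returns 4.0·0.5 + (−1.0)·1.0 = 1.0 regardless of the state passed in; only the weights need to be stored and learned, not a table over all states.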
Function Approximation
• Q-learning with linear Q-functions:
  transition = (s,a,r,s’)
  difference = $\left[r + \gamma \max_{a'} Q(s',a')\right] - Q(s,a)$
  Exact Q’s: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\text{difference}$
  Approximate Q’s: $w_m \leftarrow w_m + \alpha\,\text{difference}\,f_m(s,a)$
• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, disprefer all states with that state’s features
• Formal justification: online least squares
Example: Q-Pacman
$Q(s,a) = 4.0\,f_{\text{DOT}}(s,a) - 1.0\,f_{\text{GST}}(s,a)$

a = NORTH, r = -500

$f_{\text{DOT}}(s,\text{NORTH}) = 0.5$
$f_{\text{GST}}(s,\text{NORTH}) = 1.0$
$Q(s,\text{NORTH}) = +1$

$R(s,a,s') = -500$
difference = -501

$w_{\text{DOT}} \leftarrow 4.0 + \alpha\,[-501]\,0.5$
$w_{\text{GST}} \leftarrow -1.0 + \alpha\,[-501]\,1.0$

$Q(s,a) = 3.0\,f_{\text{DOT}}(s,a) - 3.0\,f_{\text{GST}}(s,a)$
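The slide's arithmetic can be replayed in code. One caveat: the slide does not state α explicitly, but the final weights (4.0 → 3.0 and −1.0 → −3.0) imply α = 2/501 ≈ 0.004, so that value is inferred here rather than given.

```python
# Replaying the Q-Pacman update. ALPHA is inferred from the slide's
# final weights, not stated on the slide itself.
GAMMA = 1.0
ALPHA = 2 / 501

weights = {"DOT": 4.0, "GST": -1.0}
feats = {"DOT": 0.5, "GST": 1.0}    # f(s, NORTH)
r, max_next_q = -500.0, 0.0         # observed reward; no future value

q = sum(weights[k] * feats[k] for k in feats)   # Q(s, NORTH) = +1
difference = (r + GAMMA * max_next_q) - q       # -501
for k in feats:
    weights[k] += ALPHA * difference * feats[k]
# weights is now approximately {"DOT": 3.0, "GST": -3.0}, as on the slide
```

Both features were active, so both weights get pushed down by the large negative difference, and the DOT weight moves half as far because its feature value is 0.5.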
Linear Regression
Ordinary Least Squares (OLS)
Minimizing Error
Imagine we had only one point x with features f(x):

$\text{error}(w) = \frac{1}{2}\left(y - \sum_k w_k f_k(x)\right)^2$

$\frac{\partial\,\text{error}(w)}{\partial w_m} = -\left(y - \sum_k w_k f_k(x)\right) f_m(x)$

$w_m \leftarrow w_m + \alpha\left(y - \sum_k w_k f_k(x)\right) f_m(x)$

Approximate Q update explained:

$w_m \leftarrow w_m + \alpha\left[\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{“target”}} - \underbrace{Q(s,a)}_{\text{“prediction”}}\right] f_m(s,a)$
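The single-point weight update above is one stochastic-gradient step on the squared error. A minimal sketch with list-indexed weights (the function name and argument layout are illustrative):

```python
# One gradient step on error(w) = 1/2 * (y - sum_m w_m f_m(x))^2,
# following the derivation above.
def sgd_step(weights, f_x, y, alpha):
    """Nudge each w_m by alpha * (y - prediction) * f_m(x); in-place."""
    prediction = sum(w * f for w, f in zip(weights, f_x))
    for m in range(len(weights)):
        weights[m] += alpha * (y - prediction) * f_x[m]
    return prediction
```

Repeated steps on the same point drive the prediction toward the target y; the approximate-Q update is exactly this step with y = r + γ max_{a'} Q(s',a') as the target and Q(s,a) as the prediction.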
How many features should we use?
• As many as possible?
  – computational burden
  – overfitting
• Feature selection is important
  – requires domain expertise
Overfitting
The Elements of Statistical Learning, Fig. 2.11.
Overview of Project 3
• MDPs
  – Q1: value iteration
  – Q2: find parameters that lead to a certain optimal policy
  – Q3: similar to Q2
• Q-learning
  – Q4: implement the Q-learning algorithm
  – Q5: implement ε-greedy action selection
  – Q6: try the algorithm
• Approximate Q-learning and state abstraction
  – Q7: Pacman
  – Q8 (bonus): design and implement a state-abstraction Q-learning algorithm
• Hints
  – make your implementation general
  – try to use class methods as much as possible