Lirong Xia Reinforcement Learning (1) Tue, March 18, 2014


TRANSCRIPT

Page 1: Title slide

Lirong Xia

Reinforcement Learning (1)

Tue, March 18, 2014

Page 2: Reminder

• Midterm

– mean 82/99

– Problem 5(b)2 is now worth 8 points

– your grade is on LMS

• Project 2 due / Project 3 out this Friday


Page 3: Last time

• Markov decision processes

• Computing the optimal policy

– value iteration

– policy iteration


Page 4: Grid World

• The agent lives in a grid
• Walls block the agent's path
• The agent's actions do not always go as planned (see the sketch below):
– 80% of the time, the action North takes the agent North (if there is no wall there)
– 10% of the time, North takes the agent West; 10% East
– If there is a wall in the direction the agent would have taken, the agent stays put for this turn
• Small "living" reward each step
• Big rewards come at the end
• Goal: maximize the sum of rewards
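To make the noisy action model concrete, here is a minimal Python sketch of the 80/10/10 rule above. The grid size, the wall set, the function name transition_probs, and the extension of North's noise pattern to the other directions are illustrative assumptions, not details given on the slide.

```python
# Minimal sketch of the grid-world action noise described above.
# Assumptions: states are (x, y) tuples, "north" of (x, y) is (x, y + 1),
# and the 4x3 grid plus the wall set below are made up for illustration.

NOISE = {  # intended action -> [(actual move, probability)]
    "north": [((0, 1), 0.8), ((-1, 0), 0.1), ((1, 0), 0.1)],
    "south": [((0, -1), 0.8), ((-1, 0), 0.1), ((1, 0), 0.1)],
    "east":  [((1, 0), 0.8), ((0, 1), 0.1), ((0, -1), 0.1)],
    "west":  [((-1, 0), 0.8), ((0, 1), 0.1), ((0, -1), 0.1)],
}

WIDTH, HEIGHT = 4, 3          # illustrative 4x3 grid
WALLS = {(2, 2)}              # illustrative interior wall


def transition_probs(state, action):
    """Return {next_state: probability} for taking `action` in `state`.

    If the chosen direction is blocked by a wall or the grid edge,
    the agent stays put for that outcome, as the slide says."""
    probs = {}
    for (dx, dy), p in NOISE[action]:
        nx, ny = state[0] + dx, state[1] + dy
        blocked = not (1 <= nx <= WIDTH and 1 <= ny <= HEIGHT) or (nx, ny) in WALLS
        target = state if blocked else (nx, ny)
        probs[target] = probs.get(target, 0.0) + p
    return probs


if __name__ == "__main__":
    # From (1, 1), "north" reaches (1, 2) w.p. 0.8, slips east to (2, 1) w.p. 0.1,
    # and stays at (1, 1) w.p. 0.1 (the westward slip hits the boundary).
    print(transition_probs((1, 1), "north"))
```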

Page 5: Markov Decision Processes

• An MDP is defined by:
– A set of states s ∈ S
– A set of actions a ∈ A
– A transition function T(s, a, s')
• Probability that a from s leads to s', i.e., p(s'|s, a)
• Sometimes called the model
– A reward function R(s, a, s')
• Sometimes just R(s) or R(s')
– A start state (or distribution)
– Maybe a terminal state
• MDPs are a family of nondeterministic search problems
– Reinforcement learning (next class): MDPs where we don't know the transition or reward functions
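To make the definition concrete, here is a minimal sketch of one way an MDP could be written down in Python, with T and R stored as plain dictionaries. The tiny two-state MDP, its numbers, and every name in it are made up for illustration; they are not from the slides.

```python
# A tiny, made-up MDP encoded as dictionaries, matching the definition above:
# states S, actions A, transition function T(s, a, s') = p(s' | s, a),
# reward function R(s, a, s'), and a discount gamma.

STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]

# T[(s, a)] is a dict {s': probability}; each row sums to 1.
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}

# R[(s, a, s')] is the reward for that transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}

GAMMA = 0.9  # discount


def expected_reward(s, a):
    """E[R(s, a, s')] under the model: sum_s' p(s'|s,a) * R(s,a,s')."""
    return sum(p * R[(s, a, s2)] for s2, p in T[(s, a)].items())


if __name__ == "__main__":
    print(expected_reward("cool", "fast"))  # 0.5*2.0 + 0.5*2.0 = 2.0
```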

Page 6: Recap: Defining MDPs

• Markov decision processes:
– States S
– Start state s0
– Actions A
– Transition p(s'|s,a) (or T(s,a,s'))
– Reward R(s,a,s') (and discount γ)
• MDP quantities so far:
– Policy = choice of action for each (MAX) state
– Utility (or return) = sum of discounted rewards

Page 7: Optimal Utilities

• Fundamental operation: compute the values (optimal expectimax utilities) of states s
• Define the value of a state s:
– V*(s) = expected utility starting in s and acting optimally
• Define the value of a Q-state (s,a):
– Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
• Define the optimal policy:
– π*(s) = optimal action from state s

Page 8: The Bellman Equations

• One-step lookahead relationship amongst optimal utility values:

$V^*(s) = \max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$

$V^*(s) = \max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$

Page 9: Solving MDPs

• We want to find the optimal policy

• Proposal 1: modified expectimax search, starting from each state s:

$\pi^*(s) = \arg\max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$

$V^*(s) = \max_a Q^*(s,a)$

Page 10: Value Iteration

• Idea:
– Start with V1(s) = 0
– Given Vi, calculate the values for all states for depth i+1:

$V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V_i(s')]$

– Repeat until convergence
– Use Vi as the evaluation function when computing Vi+1
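Below is a minimal sketch of the value-iteration loop above, written against the dictionary MDP encoding from the earlier sketch (states, actions, T, R, gamma). The stopping tolerance and the greedy-policy extraction at the end are illustrative choices, not prescribed by the slide.

```python
# Sketch of value iteration:
# V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_i(s')]

def value_iteration(states, actions, T, R, gamma, tol=1e-8):
    """Return (V, policy): values from repeated Bellman backups, plus a greedy policy."""
    V = {s: 0.0 for s in states}                      # start with V_1(s) = 0
    while True:
        new_V = {}
        for s in states:
            # One-step lookahead using the previous iterate as evaluation function.
            new_V[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions
            )
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            V = new_V
            break
        V = new_V

    # Extract a greedy policy from the converged values.
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                 for s2, p in T[(s, a)].items()))
        for s in states
    }
    return V, policy
```

Calling value_iteration(STATES, ACTIONS, T, R, GAMMA) on the toy MDP sketched earlier returns the optimal values and a greedy policy read off from them.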

Page 11: Example: Value Iteration

• Information propagates outward from terminal states and eventually all states have correct value estimates

Page 12: Policy Iteration

• Alternative approach:
– Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!)
– Step 2: policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
– Repeat steps until policy converges
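A minimal sketch of the two alternating steps above, again assuming the dictionary MDP encoding from the earlier sketches; the initial policy and the tolerance are arbitrary illustrative choices.

```python
# Sketch of policy iteration: iterative policy evaluation + greedy improvement.

def policy_evaluation(states, policy, T, R, gamma, tol=1e-8):
    """Iteratively compute V^pi for a fixed policy (not optimal utilities!)."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {
            s: sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                   for s2, p in T[(s, policy[s])].items())
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V


def policy_iteration(states, actions, T, R, gamma):
    """Alternate evaluation and one-step-lookahead improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        V = policy_evaluation(states, policy, T, R, gamma)
        new_policy = {
            s: max(actions,
                   key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                     for s2, p in T[(s, a)].items()))
            for s in states
        }
        if new_policy == policy:                      # policy converged
            return policy, V
        policy = new_policy
```

Evaluation here is itself iterative (the fixed-policy Bellman update), and improvement is a one-step lookahead, exactly as the slide describes.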

Page 13: Today: Reinforcement learning

• Still have an MDP:
– A set of states S
– A set of actions (per state) A
– A model T(s,a,s')
– A reward function R(s,a,s')
• Still looking for an optimal policy π*(s)
• New twist: don't know T and/or R, but can observe R
– Learn by doing
– Can have multiple episodes (trials)


Page 14: Example: animal learning

• Studied experimentally for more than 60 years in psychology
– Rewards: food, pain, hunger, drugs, etc.
• Example: foraging
– Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies


Page 15: What can you do with RL

• Stanford autonomous helicopter

http://heli.stanford.edu/


Page 16: Reinforcement learning methods

• Model-based learning
– Learn the model of the MDP (transition probabilities and rewards)
– Compute the optimal policy as if the learned model were correct
• Model-free learning
– Learn the optimal policy without explicitly learning the transition probabilities
– Q-learning: learn the Q-values Q(s,a) directly


Page 17: Model-Based Learning

• Idea:
– Learn the model empirically in episodes
– Solve for values as if the learned model were correct
• Simple empirical model learning (see the sketch below)
– Count outcomes for each (s,a)
– Normalize to give an estimate of T(s,a,s')
– Discover R(s,a,s') when we experience (s,a,s')
• Solving the MDP with the learned model
– Iterative policy evaluation, for example:

$V^{\pi}_{i+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s')\,[R(s,\pi(s),s') + \gamma V^{\pi}_i(s')]$
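A minimal sketch of the "count outcomes, then normalize" idea above, with the learned model stored in plain dictionaries. The input format, a flat list of observed (s, a, s', r) transitions, and the averaging of rewards are illustrative assumptions.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate T(s,a,s') and R(s,a,s') from observed (s, a, s', r) tuples.

    T is estimated by counting outcomes for each (s, a) and normalizing;
    R is recorded (averaged here) for each (s, a, s') we actually experience."""
    counts = defaultdict(lambda: defaultdict(int))      # counts[(s, a)][s'] = n
    reward_sum = defaultdict(float)                     # reward_sum[(s, a, s')] = total r
    reward_n = defaultdict(int)

    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r
        reward_n[(s, a, s2)] += 1

    T_hat = {
        sa: {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
        for sa, outcomes in counts.items()
    }
    R_hat = {sas: reward_sum[sas] / reward_n[sas] for sas in reward_sum}
    return T_hat, R_hat
```

The resulting T_hat and R_hat can then be handed to the value- or policy-iteration sketches above, i.e. the MDP is solved as if the learned model were correct.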

Page 18: Example: Model-Based Learning

• Episodes (γ = 1):

Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)

Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)

• Estimates from these episodes:

T(<3,3>, right, <4,3>) = 1/3

T(<2,3>, right, <3,3>) = 2/2

Page 19: Model-based vs. Model-Free

• Want to compute an expectation weighted by probability p(x) (e.g. expected utility):

• Model-based: estimate p(x) from samples, compute expectation

• Model-free: estimate expectation directly from samples

• Why does this work? Because samples appear with the right frequencies!

$E[f(x)] = \sum_x p(x)\,f(x)$

Page 20: Example

• Flip a biased coin

– if heads, you get $10

– if tails, you get $1

– you don't know the probability of heads/tails

– What is your expected gain?

• 8 episodes

– h, t, t, t, h, t, t, h

• Model-based: p(h)=3/8, so E(gain) = 10*3/8+1*5/8=35/8

• Model-free: (10+1+1+1+10+1+1+10)/8=35/8

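The same arithmetic in a few lines of Python, with the eight observed payoffs hard-coded; both estimators land on 35/8 = 4.375.

```python
# Eight observed episodes of the biased coin: payoff 10 for heads, 1 for tails.
payoffs = [10, 1, 1, 1, 10, 1, 1, 10]                   # h, t, t, t, h, t, t, h

# Model-based: estimate p(heads) from the samples, then compute the expectation.
p_heads = payoffs.count(10) / len(payoffs)              # 3/8
model_based = 10 * p_heads + 1 * (1 - p_heads)          # 35/8 = 4.375

# Model-free: average the observed payoffs directly.
model_free = sum(payoffs) / len(payoffs)                # 35/8 = 4.375

print(model_based, model_free)
```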

Page 21: Sample-Based Policy Evaluation?

• Approximate the expectation with samples (drawn from an unknown T!)

$V^{\pi}_{i+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s')\,[R(s,\pi(s),s') + \gamma V^{\pi}_i(s')]$

$V^{\pi}_{i+1}(s) \leftarrow \frac{1}{k} \sum_{j=1}^{k} \text{sample}_j$, where $\text{sample}_j = R(s,\pi(s),s'_j) + \gamma V^{\pi}_i(s'_j)$

• Almost! But we cannot rewind time to get samples from s in the same episode

Page 22: Temporal-Difference Learning

• Big idea: learn from every experience!
– Update V(s) each time we experience (s,a,s',R)
– Likely s' will contribute updates more often
• Temporal difference learning
– Policy still fixed

Sample of V(s): $\text{sample} = R(s,\pi(s),s') + \gamma V^{\pi}(s')$

Update to V(s): $V^{\pi}(s) \leftarrow (1-\alpha)\,V^{\pi}(s) + \alpha \cdot \text{sample}$

Same update: $V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,(\text{sample} - V^{\pi}(s))$
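A minimal sketch of this TD update for a fixed policy: each observed transition (s, s', r) nudges V(s) toward the sample by a learning rate α. The defaultdict table, the parameter values, and the toy transitions in the demo are illustrative.

```python
from collections import defaultdict


def td_update(V, s, s_next, reward, gamma=0.9, alpha=0.1):
    """One temporal-difference update of V for a fixed policy.

    sample = R(s, pi(s), s') + gamma * V(s')
    V(s)  <- (1 - alpha) * V(s) + alpha * sample
    """
    sample = reward + gamma * V[s_next]
    V[s] = (1 - alpha) * V[s] + alpha * sample
    return V


if __name__ == "__main__":
    V = defaultdict(float)                 # unknown states start at 0
    # Feed in experienced transitions (s, s', r) as they happen, e.g.:
    for s, s_next, r in [("A", "B", -1.0), ("B", "exit", 10.0), ("A", "B", -1.0)]:
        td_update(V, s, s_next, r)
    print(dict(V))
```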

Page 23: Exponential Moving Average

• Exponential moving average
– Makes recent samples more important
– Forgets about the past (distant past values were wrong anyway)
– Easy to compute from the running average
• Decreasing learning rate can give converging averages

$\bar{x}_n = (1-\alpha)\,\bar{x}_{n-1} + \alpha\,x_n$
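A short sketch contrasting the exponential moving average with the same update run under a decreasing learning rate α_n = 1/n, which reduces to the plain sample mean and therefore converges; the data and the constant α value are illustrative.

```python
def ema(samples, alpha=0.1):
    """Exponential moving average: x_bar <- (1 - alpha) * x_bar + alpha * x_n."""
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x
    return x_bar


def running_mean(samples):
    """Same update with a decreasing learning rate alpha_n = 1/n,
    which makes the running estimate equal the plain average."""
    x_bar = 0.0
    for n, x in enumerate(samples, start=1):
        x_bar = (1 - 1 / n) * x_bar + (1 / n) * x
    return x_bar


if __name__ == "__main__":
    data = [10, 1, 1, 1, 10, 1, 1, 10]
    print(ema(data), running_mean(data))   # recency-weighted estimate vs. exact mean 4.375
```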

Page 24: Problems with TD Value Learning

• TD value learning is a model-free way to do policy evaluation

• However, if we want to turn values into a (new) policy, we’re sunk:

$\pi^*(s) = \arg\max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*(s')]$

– Extracting a policy from values this way needs T and R, i.e. a model

• Idea: learn Q-values directly
• Makes action selection model-free too!

Page 25: Active Learning

• Full reinforcement learning
– You don't know the transitions T(s,a,s')
– You don't know the rewards R(s,a,s')
– You can choose any actions you like
– Goal: learn the optimal policy
• In this case:
– Learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
• Exploration: try new actions
• Exploitation: focus on optimal actions based on the current estimation
– This is NOT offline planning! You actually take actions in the world and see what happens…

Page 26: Detour: Q-Value Iteration

• Value iteration: find successive approximations of the optimal values
– Start with V0*(s) = 0
– Given Vi*, calculate the values for all states for depth i+1:

$V^*_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma V^*_i(s')]$

• But Q-values are more useful
– Start with Q0*(s,a) = 0
– Given Qi*, calculate the Q-values for all Q-states for depth i+1:

$Q^*_{i+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma \max_{a'} Q^*_i(s',a')]$
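A minimal sketch of the Q-value-iteration backup above, using the same dictionary MDP encoding as the earlier sketches; the tolerance is an illustrative choice.

```python
def q_value_iteration(states, actions, T, R, gamma, tol=1e-8):
    """Compute Q*(s,a) by repeated backups:
    Q_{i+1}(s,a) = sum_{s'} T(s,a,s') [R(s,a,s') + gamma * max_{a'} Q_i(s',a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}      # start with Q_0 = 0
    while True:
        new_Q = {}
        for s in states:
            for a in actions:
                new_Q[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T[(s, a)].items()
                )
        if max(abs(new_Q[k] - Q[k]) for k in Q) < tol:
            return new_Q
        Q = new_Q
```

Once Q* is available, the greedy policy argmax_a Q*(s,a) and the value max_a Q*(s,a) can be read off without consulting T or R again, which is why Q-values are the more useful target for learning.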

Page 27: Q-Learning

• Q-Learning: sample-based Q-value iteration

$Q^*(s,a) = \sum_{s'} T(s,a,s')\,[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]$

• Learn Q*(s,a) values
– Receive a sample (s,a,s',R)
– Consider your old estimate: Q(s,a)
– Consider your new sample estimate:

$\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$

– Incorporate the new estimate into a running average:

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sample}$
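A minimal sketch of the sample-based update above: each experienced (s, a, s', R) blends the old estimate with the new sample estimate. The defaultdict Q-table, the parameter values, and the made-up demo transition are illustrative.

```python
from collections import defaultdict


def q_learning_update(Q, actions, s, a, s_next, reward, gamma=0.9, alpha=0.1):
    """One Q-learning update from a single sample (s, a, s', R):

    sample  = R + gamma * max_{a'} Q(s', a')
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * sample
    """
    sample = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q


if __name__ == "__main__":
    actions = ["north", "south", "east", "west", "exit"]
    Q = defaultdict(float)                 # unseen Q-states start at 0
    # Apply updates as experience arrives, e.g. one made-up transition:
    q_learning_update(Q, actions, (3, 3), "east", (4, 3), -1.0)
    print(Q[((3, 3), "east")])             # 0.1 * (-1.0) = -0.1
```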

Page 28: Q-Learning Properties

• Amazing result: Q-learning converges to the optimal policy
– If you explore enough
– If you make the learning rate small enough
– …but not decrease it too quickly!
– Basically it doesn't matter how you select actions (!)
• Neat property: off-policy learning
– Learn the optimal policy without following it (some caveats)

Page 29: Q-Learning

• Q-learning produces tables of Q-values, one entry Q(s,a) for each Q-state

Page 30: Exploration / Exploitation

• Several schemes for forcing exploration
– Simplest: random actions (ε-greedy; see the sketch below)
• Every time step, flip a coin
• With probability ε, act randomly
• With probability 1-ε, act according to the current policy
– Problems with random actions?
• You do explore the space, but keep thrashing around once learning is done
• One solution: lower ε over time
• Another solution: exploration functions
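A minimal sketch of the ε-greedy rule above; the Q-table lookup convention and the default ε are illustrative.

```python
import random


def epsilon_greedy(Q, actions, s, epsilon=0.1):
    """Flip a coin: with probability epsilon act randomly (explore),
    otherwise act greedily with respect to the current Q-values (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

Lowering ε over time, as the slide suggests, amounts to passing a decreasing value (for example ε_t = 1/t) instead of a constant.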

Page 31: Exploration Functions

• When to explore
– Random actions: explore a fixed amount
– Better idea: explore areas whose badness is not (yet) established
• Exploration function
– Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. (exact form not important):

$f(u,n) = u + k/n$

– Used inside the Q-learning sample, so rarely tried actions look better:

$\text{sample} = R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big)$

where N(s',a') counts how often the Q-state (s',a') has been visited.
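A minimal sketch of an exploration function of the form f(u,n) = u + k/n plugged into the Q-learning sample, assuming a visit-count table N(s,a). The constant k, the infinite bonus for unvisited pairs, and all names here are illustrative assumptions.

```python
def explore_value(u, n, k=1.0):
    """Optimistic utility f(u, n) = u + k / n; unvisited pairs get +infinity
    (one simple choice to avoid dividing by zero)."""
    return float("inf") if n == 0 else u + k / n


def exploration_sample(Q, N, actions, s_next, reward, gamma=0.9):
    """Sample target that prefers rarely tried actions in the next state:
    sample = R + gamma * max_{a'} f(Q(s', a'), N(s', a'))."""
    return reward + gamma * max(
        explore_value(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0))
        for a2 in actions
    )
```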