CS 182 / CogSci 110 / Ling 109, Spring 2008
Reinforcement Learning: Algorithms
4/1/2008
Srini Narayanan – ICSI and UC Berkeley
Lecture Outline
Introduction: Basic Concepts
  Expectation, Utility, MEU
  Neural correlates of reward-based learning
  Utility theory from economics: Preferences, Utilities
Reinforcement Learning: the AI approach
  The problem
  Computing total expected value with discounting
  Q-values, Bellman's equation
  TD-Learning
Reinforcement Learning
Basic idea:
  Receive feedback in the form of rewards.
  The agent's utility is defined by the reward function.
  The agent must learn to act so as to maximize expected utility.
  Change the rewards, change the behavior.
DEMO
Elements of RL
Transition model T: how an action influences the state
Reward R: the immediate value of a state-action transition
Policy π: maps states to actions
[Figure: agent-environment loop. The agent observes the state and reward from the environment and selects an action according to its policy, producing a trajectory s0 -a0/r0-> s1 -a1/r1-> s2 -a2/r2-> ...]
Markov Decision Processes
A Markov decision process (MDP) consists of:
  A set of states s ∈ S
  A model T(s, a, s') = P(s' | s, a): the probability that action a in state s leads to s'
  A reward function R(s, a, s') (sometimes just R(s) for leaving a state, or R(s') for entering one)
  A start state (or distribution)
  Maybe a terminal state
MDPs are the simplest case of reinforcement learning. In general reinforcement learning, we don't know the model or the reward function.
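As a concrete illustration (a sketch, not part of the original slides), a small MDP can be written down directly as Python dictionaries. The state names, transition probabilities, and reward values below are invented for illustration.

```python
# A minimal MDP sketch as plain Python data (all values are illustrative).
# T[(s, a)] maps each successor state s' to P(s' | s, a).
# R[(s, a, s_next)] is the reward for that transition.

T = {
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("cool", "slow"): {"cool": 1.0},
    ("warm", "fast"): {"overheated": 1.0},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
}

R = {
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("cool", "slow", "cool"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
}

states = {"cool", "warm", "overheated"}   # "overheated" is terminal: no actions defined
actions = {"fast", "slow"}
start_state = "cool"
```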
Elements of RL
[Figure: r(state, action) immediate reward values for a six-state grid world with absorbing goal state G. Transitions into G have reward 100; all other transitions have reward 0.]
Reward Sequences
In order to formalize optimality of a policy, we need to understand utilities of reward sequences.
We typically assume stationary preferences: if I prefer one state sequence starting today, I would prefer the same sequence starting tomorrow.
Theorem: there are only two ways to define stationary utilities:
  Additive utility: U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
  Discounted utility: U([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...
Elements of RL
Value function V: maps states to state values
Discount factor γ ∈ [0, 1) (here γ = 0.9)
  V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
[Figure: the same grid world, showing the immediate reward values r(state, action) and the resulting V*(state) values. Top row: 90, 100, 0 (the goal G); bottom row: 81, 90, 100, computed with γ = 0.9.]
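A minimal sketch (not from the slides) of how a discounted return is computed from a reward sequence, using γ = 0.9 as on the slide:

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Rewards collected along a path that reaches G on its third step: 0, 0, 100.
print(discounted_return([0, 0, 100]))   # 81.0, matching the 81 entry in the V* figure
```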
RL task (restated)
Execute actions in the environment, observe the results.
Learn an action policy π: state → action that maximizes the expected discounted reward
  E[ r(t) + γ r(t+1) + γ² r(t+2) + ... ]
from any starting state in S.
Hyperbolic discounting (Ainslie 1992)
Short-term rewards are treated differently from long-term rewards.
Used in many animal discounting models.
Has been used to explain procrastination and addiction.
Evidence from neuroscience (next lecture).
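A small sketch (not from the slides) contrasting standard exponential discounting with hyperbolic discounting. The hyperbolic form V = A / (1 + kD) is the commonly cited one; the reward size, delays, and k value here are made up for illustration.

```python
# Compare how a reward of size A delivered after a delay of D steps is valued
# under exponential discounting (standard RL) and hyperbolic discounting.

def exponential_value(amount, delay, gamma=0.9):
    return amount * (gamma ** delay)

def hyperbolic_value(amount, delay, k=0.5):
    return amount / (1.0 + k * delay)

for delay in (0, 1, 10, 50):
    print(delay,
          round(exponential_value(100, delay), 2),
          round(hyperbolic_value(100, delay), 2))
# At long delays the exponential value eventually drops below the hyperbolic one;
# this kind of preference reversal is what is used to model procrastination and
# addiction.
```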
MDP Solutions
In deterministic single-agent search, we want an optimal sequence of actions from the start to a goal.
In an MDP, we want an optimal policy π*(s):
  A policy gives an action for each state.
  The optimal policy maximizes expected utility (i.e., expected rewards) if followed.
[Figure: the optimal policy for the 4x3 grid world when R(s, a, s') = -0.04 for all non-terminal states s]
Example Optimal Policies
[Figure: optimal policies for the 4x3 grid world under four living rewards: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01]
Utility of a State
Define the utility of a state under a policy π:
  V^π(s) = expected total (discounted) rewards starting in s and following π
Recursive definition (one-step look-ahead):
  V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
Also called policy evaluation.
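A minimal sketch of iterative policy evaluation under this one-step look-ahead definition, assuming the dictionary-style T and R from the MDP sketch earlier; the function name and parameters are made up for illustration.

```python
def evaluate_policy(policy, states, T, R, gamma=0.9, iterations=100):
    """Repeatedly apply the one-step look-ahead definition of V^pi.
    policy maps each non-terminal state to an action; T and R are the
    dictionaries from the MDP sketch above."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_V = {}
        for s in states:
            a = policy.get(s)
            if a is None or (s, a) not in T:        # terminal or undefined: value 0
                new_V[s] = 0.0
                continue
            new_V[s] = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                           for s2, p in T[(s, a)].items())
        V = new_V
    return V

# Example: evaluate the policy that always chooses "slow".
# print(evaluate_policy({"cool": "slow", "warm": "slow"}, states, T, R))
```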
Bellman's Equation for Selecting Actions
The definition of utility leads to a simple relationship among optimal utility values:
  optimal utility = maximize over the first action, then follow the optimal policy.
Formally, Bellman's equation:
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
"That's my equation!"
[Figure: the grid world with its r(state, action) immediate reward values (100 into G, 0 elsewhere), the corresponding Q(state, action) values (72, 81, 90, 100), and the V*(state) values (81, 90, 100, with 0 at G), computed with γ = 0.9]
Q-values
The expected utility of taking a particular action a in a particular state s: the Q-value of the pair (s, a).
Representation:
  Explicit: a table of Q-values, one entry per (state, action) pair
  Implicit: a weighted linear function or neural network, with classical weight updating

  State   Action      Q(s, a)
  2       MoveLeft    81
  2       MoveRight   100
  ...     ...         ...

A table of values for each action: Q-Functions
A Q-value is the value of a (state, action) pair under a policy: the utility of starting in state s, taking action a, and then following π thereafter.
The Bellman Equations
The definition of utility leads to a simple relationship among optimal utility values:
  optimal utility = maximize over the first action, then follow the optimal policy.
Formally:
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Optimal Utilities
Goal: calculate the optimal utility of each state
  V*(s) = expected (discounted) rewards with optimal actions
Why: given the optimal utilities, maximum expected utility (MEU) tells us the optimal policy.
MDP solution methods
If we know T(s, a, s') and R(s, a, s'), we can solve the MDP to find the optimal policy in a number of ways (dynamic programming and iterative estimation methods):
  Value iteration: assume 0 initial values for each state, repeatedly update using the Bellman equation, and pick actions from the converged values.
  Policy iteration: evaluate a given policy (find V^π(s) for that policy), then change it using Bellman updates until there is no improvement in the policy.
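A minimal value iteration sketch under the same dictionary-style T and R as the earlier MDP sketch; the function name and stopping tolerance are illustrative.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Start from V(s) = 0 and repeatedly apply the Bellman update
    V(s) <- max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in T[(s, a)].items())
                        for a in actions if (s, a) in T]
            best = max(q_values) if q_values else 0.0   # terminal states keep value 0
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Once V* has converged, the optimal policy can be read off with one more
# one-step look-ahead: pick the action whose expected value achieves the max.
```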
Reinforcement Learning
Reinforcement learning: we have an MDP:
  A set of states s ∈ S
  A set of actions (per state) A
  A model T(s, a, s')
  A reward function R(s, a, s')
We are looking for a policy π(s), but we don't know T or R: we don't know which states are good or what the actions do. We must actually try out actions and states to learn.
Reinforcement Learning
The target function is π: state → action.
However, we have no training examples of the form <state, action>.
Training examples are of the form <<state, action>, new-state, reward>.
Passive Learning
Simplified task:
  You don't know the transitions T(s, a, s').
  You don't know the rewards R(s, a, s').
  You are given a policy π(s).
  Goal: learn the state values (and maybe the model).
In this case there is no choice about what actions to take: just execute the policy and learn from experience.
Example: Direct Estimation (Simple Monte Carlo)
Setup: γ = 1, living reward R = -1, exit rewards +100 and -100.

Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)

Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)

Estimates from averaging the observed returns:
U(1,1) ≈ (92 + -106) / 2 = -7
U(3,3) ≈ (99 + 97 + -102) / 3 = 31.3
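A minimal sketch of direct estimation (every-visit Monte Carlo) over episodes like the two above; the episode encoding as (state, reward) pairs is my own for illustration.

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Average the actual returns observed after each visit to a state.
    Each episode is a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the return that follows each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

episode1 = [((1, 1), -1), ((1, 2), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
            ((3, 3), -1), ((3, 2), -1), ((3, 3), -1), ((4, 3), 100)]
episode2 = [((1, 1), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
            ((3, 3), -1), ((3, 2), -1), ((4, 2), -100)]

U = direct_estimation([episode1, episode2])
print(U[(1, 1)])            # -7.0
print(round(U[(3, 3)], 1))  # 31.3
```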
Full Estimation (Dynamic Programming)
  V(s_t) ← E[ r_{t+1} + γ V(s_{t+1}) ]
[Figure: full backup diagram. The estimate for state s_t is updated from the expected value over all actions and all successor states one step ahead, which requires the model.]
Simple Monte Carlo
  V(s_t) ← V(s_t) + α [ R_t − V(s_t) ], where R_t is the actual return following state s_t.
[Figure: Monte Carlo backup diagram. A single sampled trajectory from s_t all the way to a terminal state.]
Combining DP and MC
  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
[Figure: TD backup diagram. A single sampled step from s_t to s_{t+1}, with reward r_{t+1}.]
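A minimal sketch of this temporal-difference (TD(0)) update; the function name, step size, and defaultdict value table are illustrative.

```python
from collections import defaultdict

V = defaultdict(float)   # state values, initialized to 0

def td0_update(s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update toward the observed successor:
    V(s) <- V(s) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s))."""
    V[s] += alpha * (r_next + gamma * V[s_next] - V[s])

# Usage: while following the fixed policy, after each observed transition
# (s, a, r_next, s_next), call td0_update(s, r_next, s_next).
```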
Reinforcement Learning
The target function is π: state → action.
However, we have no training examples of the form <state, action>.
Training examples are of the form <<state, action>, new-state, reward>.
Model-Free Learning
Big idea: why bother learning T? Update each time we experience a transition; frequent outcomes will contribute more updates (over time).
Temporal difference learning (TD): the policy is still fixed! Move values toward the value of whatever successor occurs.
[Figure: transition diagram s → (s, a) → s']
Q-Learning
Learn Q*(s, a) values:
  Receive a sample (s, a, s', r).
  Consider your old estimate: Q(s, a).
  Consider your new sample estimate: sample = r + γ max_{a'} Q(s', a').
  Nudge the old estimate towards the new sample: Q(s, a) ← (1 − α) Q(s, a) + α · sample.
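A minimal Q-learning update sketch matching the rule above, with a defaultdict Q-table; the names and step size are illustrative.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)], initialized to 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Nudge the old estimate toward the new sample:
    sample = r + gamma * max_a' Q(s', a')
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * sample."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```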
Any problems with this?
What if the starting policy doesn't let you explore the state space? Then T(s, a, s') is unknown and never estimated, and the value of unexplored states is never computed.
How do we address this problem? It is a fundamental problem in RL and in biology. AI solutions include:
  ε-greedy
  Softmax
Evidence from neuroscience (next lecture).
Exploration / Exploitation
Several schemes exist for forcing exploration. The simplest is random actions (ε-greedy):
  Every time step, flip a coin.
  With probability ε, act randomly.
  With probability 1 − ε, act according to the current policy (e.g., the best Q-value).
Problems with random actions? You do explore the space, but you keep thrashing around once learning is done.
  One solution: lower ε over time.
  Another solution: exploration functions.
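A minimal ε-greedy action-selection sketch, assuming a Q-table like the defaultdict above; the names and the default ε are illustrative.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise take the action
    with the best current Q-value in state s."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])

# Lowering epsilon over time (e.g., epsilon = 1/k on the k-th episode) keeps
# exploration early on while reducing thrashing once learning has converged.
```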
Q-Learning Features
  On-line, incremental
  Bootstrapping (like DP, unlike MC)
  Model-free
  Converges to an optimal policy:
    on average when α is small;
    with probability 1 when α is high in the beginning and low at the end (say 1/k).
Reinforcement Learning
Basic idea: receive feedback in the form of rewards; the agent's utility is defined by the reward function; the agent must learn to act so as to maximize expected utility. Change the rewards, change the behavior.
Examples:
  Learning your way around: reward for reaching the destination.
  Playing a game: reward at the end for winning or losing.
  Vacuuming a house: reward for each piece of dirt picked up.
  Automated taxi: reward for each passenger delivered.
DEMO
Demo of Q Learning
Demo: arm control. Parameters:
  α (learning rate)
  γ (discounted reward; high weight on future rewards)
  ε (exploration; should decrease with time)
MDP:
  Reward = number of pixels moved to the right per iteration
  Actions: arm up and down (yellow line), hand up and down (red line)