reinforcement learning

Reinforcement Learning Based on Slides by Avi Pfeffer and David Parkes

Post on 05-Jan-2016

Page 1: Reinforcement Learning

Reinforcement Learning

Based on slides by Avi Pfeffer and David Parkes

Page 2: Reinforcement Learning

Closed Loop Interactions

[Figure: closed-loop diagram linking the Agent (Sensors, Actuators) and the Environment (State, Mechanism) through Percepts, Reward, and Actions.]

Page 3: Reinforcement Learning

Reinforcement Learning

• When the mechanism (= model) is unknown

• When the mechanism is known, but the model is too hard to solve

Page 4: Reinforcement Learning

Basic Idea

• Select an action using some sort of action selection process

• If it leads to a reward, reinforce taking that action in future

• If it leads to a punishment, avoid taking that action in future

Page 5: Reinforcement Learning

But It’s Not So Simple

• Rewards and punishments may be delayed
– credit assignment problem: how do you figure out which actions were responsible?
– e.g. study -> get degree -> get job

• How do you choose an action?
– exploration versus exploitation

• What if the state space is very large so you can’t visit all states?

Page 6: Reinforcement Learning

Model-Based RL

Page 7: Reinforcement Learning

Model-Based Reinforcement Learning

• Mechanism is an MDP

• Approach:
– learn the MDP
– solve it to determine the optimal policy

• Works when the model is unknown but not too large to store and solve

Page 8: Reinforcement Learning

Learning the MDP

• We need to learn the parameters of the reward and transition models
• We assume the agent plays every action in every state a number of times
• Let R^a_i = total reward received for playing a in state i
• Let N^a_i = number of times a was played in state i
• Let N^a_{ij} = number of times j was reached when a was played in state i
• R(i,a) = R^a_i / N^a_i
• T^a_{ij} = N^a_{ij} / N^a_i
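The counting estimates above can be sketched in a few lines of Python. This is a minimal illustration, assuming experience is stored as <i, a, r, j> tuples; the function name `learn_mdp` and the toy data are hypothetical and not part of the slides.

```python
from collections import defaultdict

def learn_mdp(experience):
    """Estimate R(i,a) and T^a_ij from <i, a, r, j> tuples by counting."""
    reward_sum = defaultdict(float)   # R^a_i: total reward for playing a in i
    n_sa = defaultdict(int)           # N^a_i: times a was played in i
    n_saj = defaultdict(int)          # N^a_ij: times j was reached from (i, a)
    for i, a, r, j in experience:
        reward_sum[(i, a)] += r
        n_sa[(i, a)] += 1
        n_saj[(i, a, j)] += 1
    R = {sa: reward_sum[sa] / n for sa, n in n_sa.items()}
    T = {(i, a, j): n / n_sa[(i, a)] for (i, a, j), n in n_saj.items()}
    return R, T

# Toy experience (made-up numbers): action "a" in state 0 is played three times.
experience = [(0, "a", 1.0, 1), (0, "a", 0.0, 0), (0, "a", 1.0, 1), (1, "b", 2.0, 0)]
R, T = learn_mdp(experience)
```

Pairs that were never played simply have no entry in `R` or `T`, which is why the slides assume every action is played in every state a number of times.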

Page 9: Reinforcement Learning

Note

• Learning and solving the MDP need not be a one-off thing

• Instead, we can repeatedly re-evaluate the model and re-solve it to get better and better policies

• How often should we solve the MDP?
– depends on how expensive solving is compared to acting in the world

Page 10: Reinforcement Learning

Model-Based Reinforcement Learning Algorithm

Let π0 be arbitrary
k ← 0
Experience ← ∅
Repeat
    k ← k + 1
    Begin in state i
    For a while:
        Choose action a based on πk-1
        Receive reward r and transition to j
        Experience ← Experience ∪ <i, a, r, j>
        i ← j
    Learn MDP M from Experience
    Solve M to obtain πk
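The slides do not say how M is solved. One common choice is value iteration, sketched below under stated assumptions: the model is given as dictionaries `R[(i, a)]` and `T[(i, a, j)]`, and a discount factor `gamma` is introduced (the slides do not use one) so that values stay finite on the toy example. All names here are illustrative.

```python
def value_iteration(states, actions, R, T, gamma=0.9, tol=1e-8):
    """Solve a tabular MDP; return (V, policy).

    R[(i, a)] is the expected reward, T[(i, a, j)] the transition probability.
    gamma is an assumed discount factor (not part of the slides).
    """
    V = {i: 0.0 for i in states}

    def q(i, a):
        # One-step lookahead value of playing a in state i under the current V.
        return R.get((i, a), 0.0) + gamma * sum(
            T.get((i, a, j), 0.0) * V[j] for j in states)

    while True:
        delta = 0.0
        for i in states:
            best = max(q(i, a) for a in actions)
            delta = max(delta, abs(best - V[i]))
            V[i] = best
        if delta < tol:
            break
    policy = {i: max(actions, key=lambda a: q(i, a)) for i in states}
    return V, policy

# Toy two-state MDP (made up): staying in state 1 pays 1, everything else pays 0.
states, actions = [0, 1], ["stay", "go"]
R = {(0, "stay"): 0.0, (0, "go"): 0.0, (1, "stay"): 1.0, (1, "go"): 0.0}
T = {(0, "stay", 0): 1.0, (0, "go", 1): 1.0, (1, "stay", 1): 1.0, (1, "go", 0): 1.0}
V, policy = value_iteration(states, actions, R, T)
```

With `gamma = 0.9` the values approach V(1) = 10 and V(0) = 9, and the resulting policy moves from state 0 to state 1 and then stays there.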

Page 11: Reinforcement Learning

Credit Assignment

• How does model-based RL deal with the credit assignment problem?

• By learning the MDP, the agent knows which states lead to which other states

• Solving the MDP ensures that the agent plans ahead and takes the long-run effects of actions into account

• So the problem is solved optimally, relative to the learned model

Page 13: Reinforcement Learning

Action Selection

• The line in the algorithm, “Choose action a based on πk-1”, is not specific

• How do we choose the action?

• Obvious answer: the policy tells us the action to perform

• But is that always what we want to do?

Page 14: Reinforcement Learning

Exploration versus Exploitation

• Exploit: use your learning results to play the action that maximizes your expected utility, relative to the model you have learned

• Explore: play an action that will help you learn the model better

Page 15: Reinforcement Learning

Questions

• When to explore
• How to explore
– simple answer: play an action you haven’t played much yet in the current state
– more sophisticated: play an action that will probably lead you to part of the space you haven’t explored much
• How to exploit
– we know the answer to this: follow the learned policy

Page 16: Reinforcement Learning

Conditions for Optimality

To ensure that the optimal policy will eventually be reached, we need to ensure that

1. Every action is taken in every state infinitely often in the long run

2. The probability of exploitation tends to 1

Page 20: Reinforcement Learning

Possible Exploration Strategies: 1

• Explore until time T, then exploit
• Why is this bad?
– We may not explore long enough to get an accurate model
– As a result, the optimal policy will not be reached
• But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy
• Works well for learning from simulation and performing in the real world

Page 24: Reinforcement Learning

Possible Exploration Strategies: 2

• Explore with a fixed probability of p
• Why is this bad?
– Does not fully exploit when learning has converged to the optimal policy
• When could this approach be useful?
– If the world is changing gradually
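Exploring with a fixed probability p can be sketched as follows. This is only an illustration: it assumes exploration means picking an action uniformly at random and exploitation means picking the action with the highest estimated value; `choose_action` and the value table are hypothetical.

```python
import random

def choose_action(q_values, p, rng=random):
    """With probability p explore (uniform random action); otherwise exploit."""
    actions = list(q_values)
    if rng.random() < p:
        return rng.choice(actions)
    return max(actions, key=q_values.get)

rng = random.Random(0)
q_values = {"left": 0.2, "right": 0.7}   # made-up value estimates for one state
picks = [choose_action(q_values, p=0.1, rng=rng) for _ in range(1000)]
```

Because p never shrinks, roughly 5% of the picks here keep going to the worse action no matter how long the agent has been learning, which is the drawback the slide points out.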

Page 25: Reinforcement Learning

Boltzmann Exploration

• In state i, choose action a with probability
  P(a | i) = exp(Q(i,a)/T) / Σ_b exp(Q(i,b)/T)
  where Q(i,a) is the estimated value of playing a in state i
• T is called the temperature
• High temperature: more exploration
• T should be cooled down to reduce the amount of exploration over time
• Sensitive to the cooling schedule
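A small sketch of Boltzmann action selection, assuming the usual softmax form in which the probability of an action is proportional to exp(Q/T). The function names and the value table are illustrative only.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Return P(a) proportional to exp(Q(a) / T) for each action."""
    m = max(q_values.values())  # subtract the max for numerical stability
    weights = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def boltzmann_choice(q_values, temperature, rng=random):
    """Sample one action from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]

q_values = {"a": 1.0, "b": 0.0}                       # made-up estimates
hot = boltzmann_probs(q_values, temperature=100.0)    # close to uniform
cold = boltzmann_probs(q_values, temperature=0.1)     # close to greedy
```

At T = 100 the two actions are almost equally likely, while at T = 0.1 nearly all probability mass sits on the better action, which matches the "cool down over time" advice above.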

Page 26: Reinforcement Learning

Guarantee

• If:
– every action is taken in every state infinitely often
– probability of exploration tends to zero
• Then:
– Model-based reinforcement learning will converge to the optimal policy with probability 1

Page 27: Reinforcement Learning

Pros and Cons

• Pro:
– makes maximal use of experience
– solves model optimally given experience
• Con:
– assumes model is small enough to solve
– requires expensive solution procedure

Page 28: Reinforcement Learning

R-Max

• Assume R(s,a) = R-max (the maximal possible reward)
– Called optimism bias
• Assume a special “heavens” state
– R(heavens) = R-max
– Tr(heavens,a,heavens) = 1
• Solve and act optimally
• When N^a_i > c, update R(i,a) and Tr(i,a,j)
• After each update, re-solve
• If you choose c properly, converges to the optimal policy in a polynomial number of iterations
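The optimistic model that R-Max plans with can be sketched as below. This is one reading of the slide rather than a definitive implementation: it assumes that a state-action pair played at most c times is modeled as jumping straight to the “heavens” state with reward R-max, and that counts are kept in plain dictionaries. The name `rmax_model` and the constant `HEAVEN` are hypothetical.

```python
HEAVEN = "heaven"  # hypothetical name for the fictitious absorbing state

def rmax_model(states, actions, n_sa, reward_sum, n_saj, r_max, c):
    """Build an optimistic model: pairs played at most c times look maximally rewarding."""
    R, T = {}, {}
    for i in states:
        for a in actions:
            n = n_sa.get((i, a), 0)
            if n > c:  # known pair: use the empirical estimates
                R[(i, a)] = reward_sum[(i, a)] / n
                for j in states:
                    count = n_saj.get((i, a, j), 0)
                    if count:
                        T[(i, a, j)] = count / n
            else:      # unknown pair (assumption): leads to heaven with reward r_max
                R[(i, a)] = r_max
                T[(i, a, HEAVEN)] = 1.0
    for a in actions:  # heaven is absorbing and always pays r_max
        R[(HEAVEN, a)] = r_max
        T[(HEAVEN, a, HEAVEN)] = 1.0
    return R, T

# Toy counts (made up): (0, "a") was played 3 times, (1, "a") never.
R, T = rmax_model(
    states=[0, 1], actions=["a"],
    n_sa={(0, "a"): 3}, reward_sum={(0, "a"): 1.5}, n_saj={(0, "a", 1): 3},
    r_max=1.0, c=2)
```

Solving this model makes under-explored pairs look attractive, so acting greedily on it drives the agent to try them until their counts exceed c.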

Page 29: Reinforcement Learning

Model-Free RL

Page 30: Reinforcement Learning

Monte Carlo Sampling

• If we want to estimate y = E_{x~D}[f(x)] we can
– Generate random samples x1,…,xN from D
– Estimate y ≈ (1/N) Σ_n f(x_n)
– Guaranteed to converge to correct estimate with sufficient samples
– Requires keeping count of # of samples
• Alternative, update average:
– Generate random samples x1,…,xN from D
– Estimate y ← (1-α) y + α f(x_n) after each sample
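The two estimators can be compared side by side. The sketch below assumes the "update average" variant is the constant-step rule y ← (1-α) y + α f(x), the same form used later for value updates; the function names and the sampled distribution are made up for illustration.

```python
import random

def running_mean(samples):
    """Exact sample mean, keeping a count n of the samples seen so far."""
    y, n = 0.0, 0
    for fx in samples:
        n += 1
        y += (fx - y) / n
    return y

def constant_step_mean(samples, alpha, y0=0.0):
    """Update y <- (1 - alpha) * y + alpha * f(x); no sample count is needed."""
    y = y0
    for fx in samples:
        y = (1 - alpha) * y + alpha * fx
    return y

rng = random.Random(0)
# f(x) = x^2 with x ~ N(5, 1), so the true expectation is 5^2 + 1 = 26.
samples = [rng.gauss(5.0, 1.0) ** 2 for _ in range(10_000)]
exact = running_mean(samples)
approx = constant_step_mean(samples, alpha=0.01)
```

The running mean weighs all samples equally, while the constant-step version forgets old samples geometrically, which is useful when the quantity being estimated drifts over time.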

Page 31: Reinforcement Learning

Estimating the Value of a Policy Using Monte-Carlo Sampling

• Fix a policy π
• When starting in state i, taking action a according to π, getting reward r and transitioning to j, we get a sample of r + Vπ(j)
• So we can update Vπ(i) ← (1-α)Vπ(i) + α(r + Vπ(j)), called bootstrapping: we use V to update itself
• The initial value of Vπ(j) can be 0 or some guess

Page 32: Reinforcement Learning

Temporal Difference Algorithm

For each state i:
    V(i) ← 0
Begin in state i
Repeat:
    Apply action a based on current policy
    Receive reward r and transition to j
    V(i) ← (1-α)V(i) + α(r + V(j))
    i ← j
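A runnable sketch of the temporal difference loop, using the undiscounted update V(i) ← (1-α)V(i) + α(r + V(j)) from the previous slide. The environment interface `step(i, a) -> (r, j)`, the episodic restart, and the toy chain are assumptions made for the example.

```python
def td0(step, policy, start, terminal, alpha=0.1, episodes=500):
    """Tabular TD policy evaluation, undiscounted as in the slides.

    step(i, a) -> (r, j) is a hypothetical environment function.
    """
    V = {}
    for _ in range(episodes):
        i = start
        while i not in terminal:
            a = policy(i)
            r, j = step(i, a)
            v_next = 0.0 if j in terminal else V.get(j, 0.0)
            V[i] = (1 - alpha) * V.get(i, 0.0) + alpha * (r + v_next)
            i = j
    return V

# Toy chain 0 -> 1 -> 2 (terminal); only the last step pays reward 1.
def step(i, a):
    return (1.0 if i == 1 else 0.0), i + 1

V = td0(step, policy=lambda i: "forward", start=0, terminal={2})
```

The reward is earned on the step out of state 1, yet V(0) also approaches 1: the value is passed backwards one state at a time through the bootstrapped target.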

Page 33: Reinforcement Learning

Credit Assignment

• By linking values to those of the next state, rewards and punishments are eventually propagated backwards

• We wait until end of game and then propagate backwards in reverse order

• Long term impact of a choice is inherent in the definition of value function

Page 34: Reinforcement Learning

But how do we learn to act?

• We want to implement something like policy iteration
• This requires learning the Q function: Qπ(i,a) = R(i,a) + Σ_j T^a_{ij} Vπ(j)
• We use a TD method, known as SARSA, to estimate the Q function w.r.t. the current policy
• We can then update the policy as usual (policy improvement)

Page 35: Reinforcement Learning

TD for Control: SARSA

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action a, observe r, s’
        Choose a’ from s’ using policy derived from Q (e.g., ε-greedy)
        Update: Q(s,a) ← (1-α)Q(s,a) + α(r + Q(s’,a’))
        s ← s’, a ← a’
    Until s is terminal
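The SARSA loop can be sketched in Python as follows. The sketch assumes an undiscounted update to match the other update rules in these slides, an ε-greedy behaviour policy, and a hypothetical environment function `step(s, a) -> (r, s')`; the corridor world is made up.

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng):
    """Pick a random action with probability eps, else the greedy one."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa(step, actions, start, terminal, alpha=0.1, eps=0.1, episodes=2000, seed=0):
    """Tabular SARSA, undiscounted; step(s, a) -> (r, s') is a hypothetical environment."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        s = start
        a = epsilon_greedy(Q, s, actions, eps, rng)
        while s not in terminal:
            r, s2 = step(s, a)
            a2 = epsilon_greedy(Q, s2, actions, eps, rng)
            target = r + Q.get((s2, a2), 0.0)   # Q of a terminal state stays 0
            Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
            s, a = s2, a2
    return Q

# Toy corridor: states 0..3, state 3 is terminal, every step costs -1.
def step(s, a):
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return -1.0, s2

Q = sarsa(step, actions=["left", "right"], start=0, terminal={3})
```

The target uses the action a’ that the agent actually takes next, so the learned values describe the ε-greedy policy being followed, exploration mistakes included.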

Page 36: Reinforcement Learning

Off-Policy vs. On-Policy

• On-policy learning: learn only the value of actions used in the current policy. SARSA is an example of an on-policy method. We learn the Q values w.r.t. the policy we are currently using

• Off-policy learning: can learn the value of a policy different from the one used, separating learning from control. Q-learning is an example. It learns about the optimal policy by using a different policy (e.g., ε-greedy policy).

Page 37: Reinforcement Learning

Q-Learning

• Don’t learn the model, learn the optimal Q-function, Q*, directly
• Works particularly well when model is too large to store, to solve or to learn
– size of model: O(|States|^2)
– cost of solution by policy iteration: O(|States|^3)
– size of Q function: O(|Actions|*|States|)

Page 38: Reinforcement Learning

Recursive Formulation of Q Function

Q*(i,a) = R(i,a) + Σ_j T^a_{ij} max_b Q*(j,b)

Page 41: Reinforcement Learning

Learning the Q Values

• We don’t know T^a_i and we don’t want to learn an explicit model
• If only we knew that our future Q values were accurate…
• …every time we applied a in state i and transitioned to j, receiving reward r, we would get a sample of R(i,a) + max_b Q(j,b)
• So we pretend that they are accurate
– (after all, they get more and more accurate)

Page 43: Reinforcement Learning

Q Learning Update Rule

• On transitioning from i to j, taking action a, receiving reward r, update
  Q(i,a) ← (1-α)Q(i,a) + α(r + max_b Q(j,b))
• α is the learning rate
• Large α:
– learning is quicker
– but may not converge
• α is often decreased over the course of learning

Page 44: Reinforcement Learning

Q Learning Algorithm

For each state i and action a:
    Q(i,a) ← 0
Begin in state i
Repeat:
    Choose action a based on the Q values for state i for all actions
    Receive reward r and transition to j
    Q(i,a) ← (1-α)Q(i,a) + α(r + max_b Q(j,b))
    i ← j
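A runnable sketch of the Q learning loop, assuming the undiscounted update Q(i,a) ← (1-α)Q(i,a) + α(r + max_b Q(j,b)), ε-greedy action choice, and a hypothetical environment function `step(i, a) -> (r, j)`. The corridor world and all names are made up for the example.

```python
import random

def q_learning(step, actions, start, terminal, alpha=0.1, eps=0.1, episodes=2000, seed=0):
    """Tabular Q learning, undiscounted; step(i, a) -> (r, j) is a hypothetical environment."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        i = start
        while i not in terminal:
            if rng.random() < eps:                      # explore
                a = rng.choice(actions)
            else:                                       # exploit current estimates
                a = max(actions, key=lambda b: Q.get((i, b), 0.0))
            r, j = step(i, a)
            best_next = max(Q.get((j, b), 0.0) for b in actions)  # 0 for terminal j
            Q[(i, a)] = (1 - alpha) * Q.get((i, a), 0.0) + alpha * (r + best_next)
            i = j
    return Q

# Toy corridor: states 0..3, state 3 is terminal, every step costs -1.
def step(i, a):
    j = min(i + 1, 3) if a == "right" else max(i - 1, 0)
    return -1.0, j

Q = q_learning(step, actions=["left", "right"], start=0, terminal={3})
```

In this corridor the optimal values for moving right are -3, -2, and -1 from states 0, 1, and 2. Starting from zero, the estimates can only decrease toward those values, and the greedy action is expected to be "right" in every state once learning has settled.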

Page 45: Reinforcement Learning

Choosing Which Action to Take

• Once you have learned the Q function, you can use it to determine the policy
– in state i, choose action a that has highest estimated Q(i,a)
• But we need to combine exploitation with exploration
– same methods as before

Page 46: Reinforcement Learning

Guarantee

• If:
– every action is taken in every state infinitely often
– α is sufficiently small
• Then Q learning will converge to the optimal Q values with probability 1
• If also:
– probability of exploration tends to zero
• Then Q learning will converge to the optimal policy with probability 1

Page 47: Reinforcement Learning

Credit Assignment

• By linking Q values to those of the next state, rewards and punishments are eventually propagated backwards

• But may take a long time

• Idea: wait until end of game and then propagate backwards in reverse order

Page 48: Reinforcement Learning

Q-learning (α = 1)

[Figure: state diagram over S1–S9. From S1, action a leads toward the chain S2, S3, S4, S5 and action b toward the chain S6, S7, S8, S9; the remaining transitions are labeled a,b. Reward labels: 0, 0, 0, 1 on one chain and 0, 0, 0, -1 on the other.]

After playing aaaa:
Q(S4,a) = 1, Q(S4,b) = 0
Q(S3,a) = 1, Q(S3,b) = 0
Q(S2,a) = 1, Q(S2,b) = 0
Q(S1,a) = 1, Q(S1,b) = 0

After playing bbbb:
Q(S8,a) = 0, Q(S8,b) = -1
Q(S7,a) = 0, Q(S7,b) = 0
Q(S6,a) = 0, Q(S6,b) = 0
Q(S1,a) = 1, Q(S1,b) = 0

Page 49: Reinforcement Learning

Bottom Line

• Q learning makes an optimistic assumption about the future

• Rewards will be propagated back in linear time, but punishments may take exponential time to be propagated

• But eventually, Q learning will converge to optimal policy

Page 50: Reinforcement Learning

SARSA vs. Q-learning

How will each perform here?

Page 51: Reinforcement Learning

Q-Learning learns the optimal path along the edge of the cliff, but falls off every now and then due to ε-greedy action selection.

Sarsa learns the safe path because it takes the action selection method into account when learning.

Because Sarsa learns the safe path, it receives higher average reward per trial than Q-Learning even though it does not walk the optimal path.

Page 52: Reinforcement Learning

Issue: Generalization

• What if state space is very large?

• Then we can’t visit every state

• We need to generalize from states we have seen to states we haven’t seen

• This is just like learning from a training set and generalizing to the future

Page 53: Reinforcement Learning

State Space And Variables

• When we looked at reinforcement learning, state space was monolithic
– e.g. in darts, just a number
• In many domains, state consists of a number of variables
– e.g. in backgammon, number of pieces at each location
• Size of state space is exponential in number of variables
• We also need to consider continuous state spaces
– e.g. helicopter

Page 58: Reinforcement Learning

Value Function Approximation

• Define features X1,…,Xn of the state
• Instead of learning V(s) for every state, learn an approximation V’(s)
• V’(s) depends only on the features
• Represent V’ compactly
– e.g. using a neural network
• Works when state space is large but mechanism is known
– e.g. backgammon

Page 63: Reinforcement Learning

Q Function Approximation

• Define features X1,…,Xn of the state
• Instead of learning Q(s,a) for every state, learn an approximation Q’(s,a)
• Q’(s,a) depends only on the features
• Represent Q’ compactly
– e.g. using a neural network
• Works when state space is large and mechanism is unknown
– e.g. helicopter

Page 64: Reinforcement Learning

Value Function Approximation Update Rule

• On transitioning from i to j, taking action a, receiving reward r:
• Create a training instance in which
– Inputs are features of i
– Output is r + V’(j)
• Run forward propagation and back propagation on this instance

Page 65: Reinforcement Learning

Basic Approach

• Define features that summarize the state
– state represented by features X1,…,Xn
• Assume that the value of a state approximately depends only on the features
– V’(s) = f(x1,…,xn)
• Assume that f can be compactly represented
• Learn f from experience
– how to learn such a function will be a major topic of this course

Page 66: Reinforcement Learning

E.g. Samuel’s Checkers Player

• Features:

– x1: number of black pieces on board

– x2: number of red pieces on board

– x3: number of black kings on board

– x4: number of red kings on board

– x5: number of black pieces threatened

– x6: number of red pieces threatened

• f(x1,…,x6) = w1x1+w2x2+w3x3+w4x4+w5x5+w6x6

• w1,…,w6 are learnable parameters

Page 67: Reinforcement Learning

Training Data

• Each time agent transitions from i to j, taking action a and receiving reward r, we get an estimate v = r + V’(j) for V(i)

• Let the features of i be x1,…,xn

• We get a training instance <x1,…,xn,v>

• We use this instance to update our model of f
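For a linear f such as the checkers evaluator, one simple way to use a training instance <x1,…,xn,v> is a single gradient step on the squared error. The sketch below is an assumption about how the update could be done, not the method prescribed by the slides; the learning rate `lr`, the function names, and the feature values are all made up.

```python
def predict(w, x):
    """Linear value estimate V'(s) = w1*x1 + ... + wn*xn."""
    return sum(wi * xi for wi, xi in zip(w, x))

def update(w, x, v, lr=0.001):
    """One gradient step on the squared error (v - V'(s))^2 / 2 for instance <x, v>."""
    error = v - predict(w, x)
    return [wi + lr * error * xi for wi, xi in zip(w, x)]

# Hypothetical transition i -> j with reward r; the feature values are made-up numbers.
w = [0.0] * 6
x_i = [12, 12, 0, 0, 1, 2]    # features of state i
x_j = [12, 11, 0, 0, 0, 1]    # features of state j
r = 1.0
v = r + predict(w, x_j)       # training target v = r + V'(j)
w_new = update(w, x_i, v)
```

After the step, the prediction for state i moves from 0 toward the target v = 1. Only the weights of features that are non-zero in state i change, which is how experience in one state generalizes to other states sharing those features.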

Page 69: Reinforcement Learning

Applications of MDPs, POMDPs and Reinforcement Learning

• TD-Gammon: world champion level backgammon player
• Robotics and control: e.g. helicopter
• Industrial: e.g. job shop scheduling
• Business: e.g. internet advertising
• Military: e.g. target identification
• Medical: e.g. testing and diagnosis