ECE 517: Reinforcement Learning in Artificial Intelligence - Lecture 9: Monte Carlo Methods


Page 1: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


ECE 517: Reinforcement Learning in Artificial Intelligence

Lecture 9: Monte Carlo Methods

Dr. Itamar Arel

College of Engineering, Department of Electrical Engineering and Computer Science

The University of Tennessee, Fall 2011

September 22, 2011

Page 2: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Outline

Introduction

MC Policy Evaluation (Blackjack example)

MC Estimation of the Action-Value Function

MC Control

Incremental Implementation

Page 3: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Introduction

Monte Carlo methods solve the RL problem without requiring a model of the environment

Often a model is not available
Storage requirements are substantially relaxed

Require only experience – sample sequences of states, actions and rewards

Experience can come in two general forms:
Online interaction with an environment – no prior knowledge needed to yield optimal behavior
Simulated interaction – requires a model, but relies only on samples

Many times it is easy to generate samples but hard to obtain the probability distribution of the state transitions

Monte Carlo methods are ways of solving the RL problem based on averaging sample returns

Page 4: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Introduction (cont.)

We consider mainly episodic (terminating) tasks
Experience is divided into episodes that terminate regardless of the actions taken

MC methods are incremental in an episode-by-episode manner
Different from the step-by-step scheme employed by DP

Otherwise, same as DP – policy evaluation, policy improvement and generalized policy iteration

We start with policy evaluation – the computation of a value function corresponding to a given policy

Page 5: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Monte Carlo Policy Evaluation

We begin with learning the state-value function for a given policy
An obvious way to estimate it from experience: average all returns observed after visiting a given state

This is the main idea behind MC methods

Each occurrence of state s in an episode is called a visit to s
The first-visit MC method averages just the returns following first visits to s
The every-visit MC method averages the returns following all visits to s in a set of episodes
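As an illustration (not from the slides), a minimal first-visit MC prediction sketch; the episode-generation helper and the (state, reward) episode format are assumptions:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) under `policy` by averaging the returns that follow the
    first visit to s in each episode (gamma=1 matches the blackjack setup).

    `generate_episode(policy)` is an assumed helper returning one episode as a
    list of (state, reward) pairs, where `reward` follows the visit to `state`.
    """
    returns = defaultdict(list)      # state -> returns observed after first visits
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so that G is the return from time t to the end
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:   # first visit to s in this episode
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

An every-visit variant would simply drop the first-visit check and record G for every occurrence of s.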

Page 6: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Blackjack Example

The object is to obtain cards whose sum is as large as possible without exceeding 21

All face cards count as 10
An ace can count either as 1 or as 11
The player competes independently against the dealer

The game begins with two cards dealt to both the dealer and the agent
Outline of the rules:

If the player has 21 immediately (e.g. an ace and a 10-card) he wins, unless the dealer has the same, in which case the game is a draw.

Otherwise, the player can request additional cards, one by one, until he either stops or exceeds 21 (loses).

If he stops, it becomes the dealer's turn. The dealer's strategy is simple and fixed: he stops on any sum of 17 or greater, and continues otherwise.

If the dealer exceeds 21, then the player wins; otherwise, the outcome (win, lose, or draw) is determined by whose final sum is closer to 21.

Page 7: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Blackjack Example (cont.)

Playing blackjack is naturally formulated as an episodic finite MDP

Each game of blackjack is an episode
Rewards of +1, -1, and 0 are given for winning, losing, and drawing, respectively

We do not discount rewards (i.e. γ = 1), and since all intermediate rewards are zero, the terminal reward equals the return

Action set = {hit (get another card), stick (stop)}

States depend on the player's cards and the visible card of the dealer

We assume cards are dealt from an infinite deck
No advantage to keeping track of cards dealt

If the player has an ace that is usable (i.e. counting it as 11 keeps the sum at or below 21), he uses it

Decisions are based on three variables: the current sum (12-21), the dealer's displayed card (ace-10), and whether the player has a usable ace (a total of 200 states)
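To make the formulation concrete, a rough, self-contained episode generator for this infinite-deck blackjack MDP (illustrative only: the function names are made up, and the special handling of a "natural" 21 on the deal is glossed over):

```python
import random

def draw_card():
    # Infinite deck: draws are independent; face cards count as 10, ace as 1 here
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (sum, usable_ace); an ace counts as 11 if that keeps the sum <= 21."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(player_policy):
    """One episode under `player_policy`, which maps a state to 'hit' or 'stick'.

    State = (player sum 12-21, dealer's showing card 1-10, usable ace).
    Returns [(state, action, reward), ...]; only the terminal reward is
    nonzero (+1 win, -1 loss, 0 draw), matching the lecture's reward scheme.
    """
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    episode = []

    # Sums below 12 are always hit (no bust risk), so they are not decision states
    while hand_value(player)[0] < 12:
        player.append(draw_card())

    while True:
        total, usable = hand_value(player)
        state = (total, dealer[0], usable)
        action = player_policy(state)
        if action == 'stick':
            break
        player.append(draw_card())
        if hand_value(player)[0] > 21:          # player busts and loses
            episode.append((state, 'hit', -1))
            return episode
        episode.append((state, 'hit', 0))

    # Dealer's fixed strategy: hit on any sum below 17, stick otherwise
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())

    player_total, dealer_total = hand_value(player)[0], hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        reward = 1
    elif player_total == dealer_total:
        reward = 0
    else:
        reward = -1
    episode.append((state, 'stick', reward))
    return episode
```

For example, the policy evaluated on the next slide would be `lambda s: 'stick' if s[0] >= 20 else 'hit'`.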

Page 8: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Blackjack Example (cont.)

Consider a policy that stops if the player's sum is 20 or 21

Assume many blackjack games are played
We're looking for the value function that fits this policy

Note that here the same state is never revisited within an episode (why?)

Page 9: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Blackjack Example (cont.)

Although we have complete knowledge of the environment, it would not be easy to apply DP policy evaluation

Very hard to calculate the transition probabilities P^a_{ss'} and expected rewards R^a_{ss'}
e.g. the player's sum is 14 – what is the expected reward as a function of the dealer's displayed card?

Since all of these calculations must be done prior to running DP, it is often an impractical approach

In contrast, generating sample games (for MC) is much easier to do

Surprising insight: even if the environment's dynamics are known, MC is often a more efficient method to apply

Value estimates are independent (no "bootstrapping")
The optimal policy's trajectory corresponds to a small subset of the states

Page 10: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Monte Carlo vs. Dynamic Programming

Can the backup diagram be applied to MC methods?
Recall that backup diagrams show the top node to be updated and, below it, all the transitions and leaf nodes that contribute to the update

For MC, the root is a state node and below it are all the nodes visited until the terminal node is reached
Shows only the transitions sampled on that one episode

DP focuses on one-step transitions, whereas MC goes all the way to the end of the episode

Updating the values of states is independent in MC
The computational complexity of updating one node is independent of |S|
An attractive feature, since one can estimate the values of only a subset of the nodes (no need to do them all)

Page 11: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Monte Carlo estimation of action values

If a model is not available, then it is particularly useful to estimate action values rather than state values

A value function alone is not enough to establish a policy
Thus, we will focus our attention on estimating Q*(s,a)

We'll need to estimate Q_π(s,a)
The same MC approach will be employed: the agent records the rewards received after taking action a at state s
However, here we must ensure exploration, otherwise some actions will never be evaluated
One way to achieve this: exploring starts – every state-action pair has a nonzero probability of being the start of an episode

We'll consider the general stochastic approach later

Page 12: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Monte Carlo Control

Next, we'll outline how MC methods can be used in control – i.e. to approximate an optimal policy

The idea is to follow generalized policy iteration (GPI)
Approximated policy and value functions are maintained

Let's consider the basic policy iteration method for MC
Policy evaluation is achieved by recording the outcomes of many episodes
Policy improvement is done by selecting greedy actions, i.e. π(s) = argmax_a Q(s,a)

Page 13: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Monte Carlo Control (cont.)

Policy improvement is achieved by constructing π_{k+1} as the greedy policy with respect to Q_{π_k}

The policy improvement theorem then holds, since for all s

Q_{π_k}(s, π_{k+1}(s)) = max_a Q_{π_k}(s, a) ≥ Q_{π_k}(s, π_k(s)) = V_{π_k}(s)

If the two policies are equal, they are both optimal
This way MC can lead to optimal policies with no model

This assumes exploring starts and an infinite number of episodes for MC policy evaluation
To remove the latter assumption:

Update only to a given level of performance
Alternate between evaluation and improvement per episode

Page 14: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Monte Carlo ES (Exploring Starts)
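The algorithm box on this slide is an image in the original deck; below is a hedged first-visit sketch of Monte Carlo ES, where `generate_episode_from(s0, a0, policy)` is an assumed helper that starts an episode at the given state-action pair and then follows `policy`:

```python
from collections import defaultdict
import random

def monte_carlo_es(states, actions, generate_episode_from, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit version).

    `generate_episode_from(s0, a0, policy)` is assumed to run one episode that
    begins with the given state-action pair and then follows `policy`,
    returning [(state, action, reward), ...].
    """
    Q = defaultdict(float)
    returns = defaultdict(list)                           # (s, a) -> sampled returns
    policy = {s: random.choice(actions) for s in states}  # arbitrary initial policy

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has a nonzero probability of starting
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode_from(s0, a0, lambda s: policy[s])

        G = 0.0
        pairs = [(s, a) for s, a, _ in episode]
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                   # first visit to (s, a)
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Policy improvement: act greedily w.r.t. the current Q
                policy[s] = max(actions, key=lambda a_: Q[(s, a_)])
    return policy, Q
```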

Page 15: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


On-Policy MC Control

How can we avoid the unlikely assumption of exploring starts?

Can't simply improve the policy by following a greedy policy, since no exploration will take place

We employ an ε-greedy policy instead
With "high" probability we choose the greedy action
Otherwise – explore uniformly

Solution: generate soft policies, i.e. π(s,a) > 0 for all s and a

e.g. an ε-soft policy moves the policy toward the greedy one:
π(s,a) = ε/|A(s)| for the non-max actions
π(s,a) = 1 - ε + ε/|A(s)| for the greedy (max) action
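A small sketch of these ε-soft action probabilities in code (names are illustrative; Q is assumed to map (state, action) pairs to estimated values):

```python
import random

def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Action probabilities of an epsilon-soft policy derived from Q:
    epsilon/|A(s)| for non-greedy actions, 1 - epsilon + epsilon/|A(s)| for the greedy one."""
    n = len(actions)
    greedy = max(actions, key=lambda a: Q.get((state, a), 0.0))
    return {a: (1 - epsilon + epsilon / n) if a == greedy else epsilon / n
            for a in actions}

def epsilon_greedy_action(Q, state, actions, epsilon):
    probs = epsilon_greedy_probs(Q, state, actions, epsilon)
    return random.choices(list(probs), weights=list(probs.values()))[0]
```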

Page 16: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


On-Policy vs. Off-Policy methods

Two methods for ensuring that the agent selects all actions infinitely often (as the number of episodes goes to infinity)

On-Policy Methods
Evaluate/improve the policy that is used to make decisions
The policy is generally soft (e.g. an ε-soft policy)

Off-Policy Methods
Value estimation and policy evaluation are separated
Two policies are considered:
Behavior policy – the policy used to generate the actions
Estimation policy – the policy evaluated and improved
As we will see, this has some key advantages …

Page 17: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


On-Policy MC Control (cont.)

Let π' denote the ε-greedy policy
We want to show that its value function is at least as good as π's

The condition for the policy improvement theorem to hold for all states is Q_π(s, π'(s)) ≥ V_π(s)

Page 18: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


On-Policy MC Control (cont.)
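The derivation on this slide is an image in the original deck; a reconstruction of the standard argument (the key step is that a weighted average of Q-values, with nonnegative weights summing to 1, is at most their maximum) is:

$$
\begin{aligned}
Q_\pi(s, \pi'(s)) &= \sum_a \pi'(s,a)\, Q_\pi(s,a) \\
&= \frac{\varepsilon}{|A(s)|} \sum_a Q_\pi(s,a) + (1-\varepsilon) \max_a Q_\pi(s,a) \\
&\geq \frac{\varepsilon}{|A(s)|} \sum_a Q_\pi(s,a) + (1-\varepsilon) \sum_a \frac{\pi(s,a) - \frac{\varepsilon}{|A(s)|}}{1-\varepsilon}\, Q_\pi(s,a) \\
&= \sum_a \pi(s,a)\, Q_\pi(s,a) = V_\pi(s)
\end{aligned}
$$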

Page 19: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Evaluating one policy while following another (cont.)

Suppose we wanted to evaluate a policy by viewing episodes generated from a different policy

We require that π(s,a) > 0 implies π'(s,a) > 0

Assumptions and definitions:
We view episodes generated under policy π'
Consider the i-th first visit to state s and the complete sequence of states and actions following that visit
Let p_i(s) and p'_i(s) denote the probabilities of that complete sequence occurring under policies π and π', respectively
Let R_i(s) denote the i-th observed return from state s

To average these returns to obtain an unbiased estimate of V_π(s), we need only weight each return by its relative probability of occurring under each policy

Page 20: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Evaluating one policy while following another (cont.)

The desired MC estimate after observing n_s returns from state s is given by

$$ V(s) = \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)} R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}} $$

The above assumes knowledge of p_i(s) and p'_i(s), which are usually unknown

However, their ratio can be determined without a model, since

$$ p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\, P^{a_k}_{s_k s_{k+1}} $$

from which we have

$$ \frac{p_i(s_t)}{p'_i(s_t)} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)} $$
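A hedged sketch of these two quantities in code, assuming each episode is stored as (state, action, reward) triples starting at the visit to s, and that both policies expose their action probabilities:

```python
def importance_weight(episode, target_policy, behavior_policy):
    """Relative probability of an episode's action sequence under the target
    policy vs. the behavior policy; the transition probabilities cancel, so
    no model is needed.  Policies map (state, action) -> probability."""
    w = 1.0
    for state, action, _ in episode:
        w *= target_policy(state, action) / behavior_policy(state, action)
    return w

def off_policy_value_estimate(episodes_from_s, target_policy, behavior_policy):
    """Weighted-average estimate of V(s) from returns observed under the
    behavior policy, as in the formula above (undiscounted returns)."""
    num, den = 0.0, 0.0
    for episode in episodes_from_s:
        G = sum(r for _, _, r in episode)     # return following the visit to s
        w = importance_weight(episode, target_policy, behavior_policy)
        num += w * G
        den += w
    return num / den if den > 0 else 0.0
```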

Page 21: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Off-Policy Monte Carlo Control

Recall that in on-policy methods, the value of a policy is estimated while it is used for control
In off-policy methods the two functions are separated

Behavior policy – the policy used to generate behavior
Estimation policy – the policy evaluated and improved

Off-policy MC methods employ the technique shown in the previous couple of slides

They follow the behavior policy while learning about and improving the estimation policy

Requires the behavior policy to be soft (e.g. ε-soft)
Advantage: the estimation policy can be deterministic (e.g. greedy)

We will next look at an off-policy method for computing Q*

The estimation policy is greedy with respect to Q
The behavior policy is ε-soft

Page 22: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Off-Policy Monte Carlo Control (cont.)
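The algorithm on this slide is also an image; the following sketch implements the scheme just described (greedy estimation policy, soft behavior policy, weighted importance sampling), with assumed helper names; `generate_episode()` here is taken to return both the sampled (state, action, reward) list and a function giving the behavior policy's action probabilities:

```python
from collections import defaultdict

def off_policy_mc_control(actions, generate_episode, num_episodes):
    """Off-policy MC control with weighted importance sampling (undiscounted).

    The estimation (target) policy is greedy with respect to Q; the behavior
    policy that produced the episodes is assumed to be soft.
    """
    N = defaultdict(float)    # numerator:   sum of w * return
    D = defaultdict(float)    # denominator: sum of w
    Q = defaultdict(float)
    target = {}               # greedy estimation policy

    for _ in range(num_episodes):
        episode, behavior_prob = generate_episode()
        G, W = 0.0, 1.0
        # Process the tail of the episode backwards; stop once the behavior
        # action disagrees with the greedy estimation policy, since its
        # probability under that deterministic policy would be zero.
        for state, action, reward in reversed(episode):
            G += reward
            N[(state, action)] += W * G
            D[(state, action)] += W
            Q[(state, action)] = N[(state, action)] / D[(state, action)]
            target[state] = max(actions, key=lambda a: Q[(state, a)])
            if action != target[state]:
                break
            W *= 1.0 / behavior_prob(state, action)
    return target, Q
```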

Page 23: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Incremental Implementation

Monte Carlo can be implemented incrementally
Substantial savings in memory

Suppose we want to compute the weighted average of the observed returns

Non-incremental form:

$$ V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k} $$

Incremental equivalent:

$$ V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}} \left( R_{n+1} - V_n \right), \qquad W_{n+1} = W_n + w_{n+1}, \qquad W_0 = 0 $$
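A direct transcription of the incremental update as a small class (a sketch; the names are illustrative):

```python
class WeightedRunningAverage:
    """Incremental form of the weighted average above: only the current
    estimate V_n and the cumulative weight W_n are stored, not the returns."""

    def __init__(self):
        self.V = 0.0   # current estimate V_n
        self.W = 0.0   # cumulative weight W_n (W_0 = 0)

    def update(self, ret, weight):
        """Apply V_{n+1} = V_n + (w_{n+1} / W_{n+1}) * (R_{n+1} - V_n)."""
        self.W += weight
        if self.W > 0:
            self.V += (weight / self.W) * (ret - self.V)
        return self.V
```

With all weights equal to 1 this reduces to the ordinary sample-average update used by first-visit MC.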

Page 24: ECE 517: Reinforcement Learning in Artificial Intelligence  Lecture 9:  Monte Carlo Methods


Summary

MC has several advantages over DP:
Can learn directly from interaction with the environment
No need for full models
No need to learn about ALL states
Less harm from violations of the Markov property (more detail later)

MC methods provide an alternative policy evaluation process
One issue to watch for: maintaining sufficient exploration
Exploring starts, soft policies

Introduced the distinction between on-policy and off-policy methods
No bootstrapping (as opposed to DP)