
Page 1

Soft Computing

Lecture 13

Reinforcement learning

Page 2

Examples of reinforcement learning

• A master chess player makes a move. The choice is informed both by planning (anticipating possible replies and counterreplies) and by immediate, intuitive judgments of the desirability of particular positions and moves.

• An adaptive controller adjusts parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.

• A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.

• A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.

• Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a complex web of conditional behavior and interlocking goal-subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series of eye movements to obtain information and to guide reaching and locomotion. Rapid judgments are continually made about how to carry the objects or whether it is better to ferry some of them to the dining table before obtaining others. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately obtaining nourishment.

Page 3

The agent-environment interaction in reinforcement learning
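The original slide shows this interaction only as a figure. Below is a minimal sketch of the loop it depicts: at each step the agent selects an action, the environment returns a reward and a new state, and the agent updates itself. The `Environment` and `Agent` classes are hypothetical stand-ins, not part of the lecture.

```python
# Sketch of the agent-environment interaction loop (illustrative stand-ins only).
import random

class Environment:
    def reset(self):
        return 0                      # initial state (illustrative)
    def step(self, action):
        next_state = random.choice([0, 1, 2])
        reward = 1.0 if next_state == 2 else 0.0
        done = next_state == 2
        return next_state, reward, done

class Agent:
    def select_action(self, state):
        return random.choice([0, 1])  # placeholder policy
    def learn(self, state, action, reward, next_state):
        pass                          # value/policy updates would go here

env, agent = Environment(), Agent()
state = env.reset()
done = False
while not done:
    action = agent.select_action(state)              # agent acts on the environment
    next_state, reward, done = env.step(action)      # environment returns reward and new state
    agent.learn(state, action, reward, next_state)   # agent adapts from the feedback
    state = next_state
```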

Page 4

Example: Pick-and-Place Robot

Consider using reinforcement learning to control the motion of a robot arm in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the learning agent will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages. The actions in this case might be the voltages applied to each motor at each joint, and the states might be the latest readings of joint angles and velocities. The reward might be +1 for each object successfully picked up and placed. To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment "jerkiness" of the motion.

Page 5

Example: Recycling Robot

A mobile robot has the job of collecting empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin; it runs on a rechargeable battery. The robot's control system has components for interpreting sensory information, for navigating, and for controlling the arm and gripper. High-level decisions about how to search for cans are made by a reinforcement learning agent based on the current charge level of the battery. This agent has to decide whether the robot should
• actively search for a can for a certain period of time,
• remain stationary and wait for someone to bring it a can, or
• head back to its home base to recharge its battery.
This decision has to be made either periodically or whenever certain events occur, such as finding an empty can. The agent therefore has three actions, and its state is determined by the state of the battery. The rewards might be zero most of the time, but then become positive when the robot secures an empty can, or large and negative if the battery runs all the way down. In this example, the reinforcement learning agent is not the entire robot. The states it monitors describe conditions within the robot itself, not conditions of the robot's external environment. The agent's environment therefore includes the rest of the robot, which might contain other complex decision-making systems, as well as the robot's external environment.
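The example's state, action, and reward structure can be written down compactly. The specific reward numbers below are illustrative assumptions, not values given on the slide.

```python
# State, action, and reward structure of the recycling-robot example.
STATES = ["high", "low"]                    # battery charge level
ACTIONS = ["search", "wait", "recharge"]    # the three high-level decisions

def reward(event):
    """Map the events described on the slide to example reward values (assumed)."""
    return {
        "collected_can": 1.0,     # positive when the robot secures a can
        "nothing": 0.0,           # zero most of the time
        "battery_dead": -10.0,    # large and negative if the battery runs down
    }[event]
```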

Page 6

Features of reinforcement learning and main concepts

• Learning is combined with working (acting)
• Working is a sequence of actions
• A plan of actions is called a policy
• The plan (policy) may be corrected at every time step
• An action is selected from the policy (or not) according to an estimate of the state of the environment (or an estimate of the action in that state) and the reward received from the environment
• The estimate of the environment is determined by the goal (target)

Page 7

Action-Value Methods

We denote the true (actual) value of action $a$ as $Q^{*}(a)$, and the estimated value at the $t$-th play as $Q_t(a)$. Recall that the true value of an action is the mean reward received when that action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. In other words, if prior to the $t$-th play action $a$ has been chosen $k_a$ times, yielding rewards $r_1, r_2, \ldots, r_{k_a}$, then its value is estimated to be

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}.$$
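A minimal sketch of this sample-average estimate combined with ε-greedy action selection. The bandit environment, the true means, and the value of ε are assumptions for illustration, not part of the lecture.

```python
import random

N_ACTIONS = 3
EPSILON = 0.1
TRUE_MEANS = [0.2, 0.5, 0.8]        # hypothetical true action values Q*(a)

Q = [0.0] * N_ACTIONS               # sample-average estimates Q_t(a)
counts = [0] * N_ACTIONS            # k_a: how often each action has been chosen

def pull(a):
    """Hypothetical bandit arm: noisy reward with mean TRUE_MEANS[a]."""
    return random.gauss(TRUE_MEANS[a], 1.0)

for t in range(10_000):
    if random.random() < EPSILON:                       # explore
        a = random.randrange(N_ACTIONS)
    else:                                               # exploit current estimates
        a = max(range(N_ACTIONS), key=lambda i: Q[i])
    r = pull(a)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]                      # incremental form of the sample average

print(Q)   # estimates should approach TRUE_MEANS
```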

Page 8

Reinforcement Comparison

A central intuition underlying reinforcement learning is that actions followed by large rewards should be made more likely to recur, whereas actions followed by small rewards should be made less likely to recur. But how is the learner to know what constitutes a large or a small reward? If an action is taken and the environment returns a reward of 5, is that large or small? To make such a judgment one must compare the reward with some standard or reference level, called the reference reward. A natural choice for the reference reward is an average of previously received rewards. In other words, a reward is interpreted as large if it is higher than average, and small if it is lower than average. Learning methods based on this idea are called reinforcement comparison methods. These methods are sometimes more effective than action-value methods.
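A sketch of one reinforcement-comparison scheme along these lines: action preferences are nudged up or down by how much the received reward exceeds the running reference reward, and actions are drawn from a softmax over the preferences. The step sizes, softmax policy, and bandit environment are assumptions for illustration.

```python
import math, random

N_ACTIONS = 3
ALPHA = 0.1                    # step size for the reference reward
BETA = 0.1                     # step size for the action preferences
TRUE_MEANS = [0.2, 0.5, 0.8]   # hypothetical true mean rewards

prefs = [0.0] * N_ACTIONS      # action preferences p(a)
ref_reward = 0.0               # reference reward: running average of received rewards

def softmax_choice(p):
    exps = [math.exp(x) for x in p]
    total = sum(exps)
    r, acc = random.random() * total, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(p) - 1

for t in range(10_000):
    a = softmax_choice(prefs)
    reward = random.gauss(TRUE_MEANS[a], 1.0)
    prefs[a] += BETA * (reward - ref_reward)        # larger-than-average reward -> more likely
    ref_reward += ALPHA * (reward - ref_reward)     # update the reference (average) reward

print(prefs)
```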

Page 9

Incremental implementation

The general form of the incremental update rule is

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \bigl[ \text{Target} - \text{OldEstimate} \bigr].$$

The expression $[\text{Target} - \text{OldEstimate}]$ is an error in the estimate. It is reduced by taking a step toward the "Target." The target is presumed to indicate a desirable direction in which to move, though it may be noisy.
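The same rule as a one-line helper, with a tiny worked example (the function name and reward stream are illustrative):

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Running mean of a stream of rewards: a step size of 1/k yields the sample average.
estimate = 0.0
for k, r in enumerate([1.0, 0.0, 2.0, 1.0], start=1):
    estimate = incremental_update(estimate, r, 1.0 / k)
print(estimate)   # 1.0, the mean of the four rewards
```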

Page 10

Dynamic programming

Bellman optimality equations:

$$V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \bigl[ R^{a}_{ss'} + \gamma V^{*}(s') \bigr]$$

or

$$Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \bigl[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \bigr]$$

Here $V^{*}(s)$ is the optimal estimate of the value of a state of the environment and $Q^{*}(s,a)$ the optimal value of an action in a state; $P^{a}_{ss'}$ and $R^{a}_{ss'}$ are the transition probabilities and expected rewards, and $\gamma$ is the discount rate.

Page 11

Dynamic programming (2)
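The algorithm figure from this slide is not reproduced in the transcript. As a concrete illustration of DP prediction, here is a minimal sketch of iterative policy evaluation, sweeping the Bellman expectation backup until the value function stops changing; the two-state toy MDP and the equiprobable policy are assumptions.

```python
# Iterative policy evaluation: V(s) <- sum_a pi(a|s) sum_s' P [R + gamma * V(s')],
# repeated until the largest change per sweep falls below THETA.
GAMMA, THETA = 0.9, 1e-6
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# P[(s, a)] = list of (probability, next_state, reward) -- assumed toy dynamics
P = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(1.0, "s1", 1.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(1.0, "s1", 0.5)],
}
policy = {s: {a: 0.5 for a in ACTIONS} for s in STATES}   # equiprobable policy

V = {s: 0.0 for s in STATES}
while True:
    delta = 0.0
    for s in STATES:
        v_new = sum(
            policy[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            for a in ACTIONS
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < THETA:
        break
print(V)
```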

Page 12

Policy improvement
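The slide's figure is not in the transcript. A sketch of the policy-improvement step: make the policy greedy with respect to the current value function. It reuses the same model structure (`P`, state and action lists) as the DP sketch above; all names are illustrative.

```python
GAMMA = 0.9

def improve_policy(V, P, states, actions):
    """Return the deterministic policy that is greedy with respect to V.

    V:       dict state -> value
    P:       dict (state, action) -> list of (probability, next_state, reward)
    """
    new_policy = {}
    for s in states:
        # One-step lookahead value of each action under the model P.
        q = {
            a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        }
        new_policy[s] = max(q, key=q.get)     # greedy action choice
    return new_policy
```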

Page 13

Dynamic programming (3)

• Compared with other methods, DP methods are actually quite efficient

• DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables. Large state sets do create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method

Page 14

Monte Carlo method

First-visit MC method for estimating $V^{\pi}$: the value of a state $s$ is estimated as the average of the returns observed in a set of episodes obtained by following $\pi$ and passing through $s$, counting only the return that follows the first visit to $s$ in each episode.
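A sketch of first-visit MC prediction along these lines. The episode generator is a hypothetical stand-in for running the policy π in the environment; the discount rate and episode shape are assumptions.

```python
import random
from collections import defaultdict

GAMMA = 1.0

def generate_episode():
    """Hypothetical stand-in for following pi; returns a list of (state, reward) pairs."""
    episode, state = [], "A"
    while state != "TERMINAL":
        next_state = random.choice(["A", "B", "TERMINAL"])
        reward = 1.0 if next_state == "TERMINAL" else 0.0
        episode.append((state, reward))
        state = next_state
    return episode

returns = defaultdict(list)       # Returns(s): returns observed after first visits to s
V = {}

for _ in range(5_000):
    episode = generate_episode()
    G = 0.0
    first_visit_returns = {}
    # Walk the episode backwards accumulating the return G; the last overwrite for each
    # state corresponds to its *first* visit in the episode.
    for state, reward in reversed(episode):
        G = GAMMA * G + reward
        first_visit_returns[state] = G
    for state, G in first_visit_returns.items():
        returns[state].append(G)
        V[state] = sum(returns[state]) / len(returns[state])

print(V)
```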

Page 15

Monte Carlo method (2)

Monte Carlo ES: A Monte Carlo control algorithm assuming exploring starts

Page 16

Temporal Difference Learning. TD prediction

A simple every-visit Monte Carlo update of the state-value estimate is

$$V(s_t) \leftarrow V(s_t) + \alpha \bigl[ R_t - V(s_t) \bigr],$$

where $R_t$ is the actual return following time $t$ and $\alpha$ is a constant step-size parameter. This method is named constant-$\alpha$ MC. Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(s_t)$ (only then is $R_t$ known), TD methods need wait only until the next time step. At time $t+1$ they immediately form a target and make a useful update using the observed reward $r_{t+1}$ and the estimate $V(s_{t+1})$. The simplest TD method, known as TD(0), is

$$V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr].$$

Page 17

TD(0) algorithm
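The algorithm box on this slide is not in the transcript. A sketch of tabular TD(0) using the update from the previous page; the five-state random-walk environment, step size, and episode count are assumptions for illustration.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 1.0
N_EPISODES = 5_000

def env_step(state):
    """Hypothetical 5-state random walk: +1 for exiting on the right, 0 on the left."""
    next_state = state + random.choice([-1, 1])
    if next_state == 5:
        return "TERMINAL", 1.0
    if next_state == -1:
        return "TERMINAL", 0.0
    return next_state, 0.0

V = defaultdict(float)            # V("TERMINAL") stays 0.0

for _ in range(N_EPISODES):
    state = 2                     # start in the middle
    while state != "TERMINAL":
        next_state, reward = env_step(state)
        # TD(0): move V(s) toward the one-step target r + gamma * V(s')
        V[state] += ALPHA * (reward + GAMMA * V[next_state] - V[state])
        state = next_state

print({s: round(V[s], 2) for s in range(5)})   # roughly 1/6, 2/6, ..., 5/6
```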

Page 18

Advantages of TD-methods

• TD methods have an advantage over dynamic programming methods in that they do not require a model of the environment, of its reward and next-state probability distributions

• An advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an on-line, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step

• TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks

Page 19

Q-learning – off-policy TD control
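The Q-learning algorithm box is not reproduced in the transcript. A sketch of the standard off-policy update, $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, with ε-greedy behavior; the chain environment and all parameter values are assumptions.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["left", "right"]

def env_step(state, action):
    """Hypothetical 5-state chain: reaching state 4 on the right ends the episode with +1."""
    next_state = max(0, state + (1 if action == "right" else -1))
    if next_state == 4:
        return "TERMINAL", 1.0
    return next_state, 0.0

Q = defaultdict(float)                       # Q[(state, action)]

def epsilon_greedy(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    qvals = {a: Q[(state, a)] for a in ACTIONS}
    best = max(qvals.values())
    return random.choice([a for a, q in qvals.items() if q == best])   # random tie-break

for _ in range(2_000):
    state = 0
    while state != "TERMINAL":
        action = epsilon_greedy(state)
        next_state, reward = env_step(state, action)
        # Off-policy target: max over next actions, independent of the action taken next.
        best_next = 0.0 if next_state == "TERMINAL" else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

print({(s, a): round(Q[(s, a)], 2) for s in range(4) for a in ACTIONS})
```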

Page 20

R-learning for Undiscounted Continuing Tasks

R-learning is an off-policy control method for the advanced version of the reinforcement learning problem in which one neither discounts nor divides experience into distinct episodes with finite returns. In this case one seeks to obtain the maximum reward per time step.

Page 21

R-learning algorithm
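The algorithm box is missing from the transcript. A sketch of R-learning's average-reward updates: relative action values are adjusted against an estimate ρ of the average reward per step, and ρ itself is adjusted only when a greedy action was taken. The two-state continuing task, step sizes, and details of the greedy test are assumptions.

```python
import random
from collections import defaultdict

ALPHA, BETA, EPSILON = 0.1, 0.01, 0.1
ACTIONS = ["a0", "a1"]

def env_step(state, action):
    """Hypothetical two-state continuing task; 'a1' in state 1 pays the most on average."""
    reward = 1.0 if (state == 1 and action == "a1") else 0.1
    return random.choice([0, 1]), reward

R = defaultdict(float)        # relative action values R(s, a)
rho = 0.0                     # estimate of the average reward per time step
state = 0

for _ in range(100_000):
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: R[(state, a)])
    greedy = R[(state, action)] == max(R[(state, a)] for a in ACTIONS)
    next_state, reward = env_step(state, action)
    best_next = max(R[(next_state, a)] for a in ACTIONS)
    best_here = max(R[(state, a)] for a in ACTIONS)
    # Relative-value update: rewards are measured against the average reward rho.
    R[(state, action)] += ALPHA * (reward - rho + best_next - R[(state, action)])
    if greedy:                # adjust rho only when a greedy action was taken
        rho += BETA * (reward - rho + best_next - best_here)
    state = next_state

print(round(rho, 2))          # should approach the best achievable average reward per step
```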

Page 22

Actor-Critic methods

Page 23

Actor-Critic methods (2)

The critic is a state-value function. After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is the TD error:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$

where $V$ is the current value function implemented by the critic.
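A sketch of a simple tabular actor-critic built on this idea: the critic learns $V$ with TD(0), and the same TD error $\delta_t$ nudges the actor's action preferences up or down. The chain environment, step sizes, and softmax actor are assumptions for illustration.

```python
import math, random
from collections import defaultdict

ALPHA_V, ALPHA_P, GAMMA = 0.1, 0.1, 0.9
ACTIONS = ["left", "right"]

def env_step(state, action):
    """Hypothetical chain: reach state 3 for reward +1, then the episode ends."""
    next_state = max(0, state + (1 if action == "right" else -1))
    if next_state == 3:
        return "TERMINAL", 1.0
    return next_state, 0.0

V = defaultdict(float)                        # critic: state values
prefs = defaultdict(float)                    # actor: action preferences p(s, a)

def softmax_action(state):
    exps = {a: math.exp(prefs[(state, a)]) for a in ACTIONS}
    total = sum(exps.values())
    r, acc = random.random() * total, 0.0
    for a, e in exps.items():
        acc += e
        if r <= acc:
            return a
    return ACTIONS[-1]

for _ in range(3_000):
    state = 0
    while state != "TERMINAL":
        action = softmax_action(state)
        next_state, reward = env_step(state, action)
        v_next = 0.0 if next_state == "TERMINAL" else V[next_state]
        delta = reward + GAMMA * v_next - V[state]   # TD error computed by the critic
        V[state] += ALPHA_V * delta                  # critic update
        prefs[(state, action)] += ALPHA_P * delta    # actor: reinforce if delta > 0
        state = next_state

print({s: round(V[s], 2) for s in range(3)})
```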

Page 24

Advantages of Actor-Critic Methods

• They require minimal computation in order to select actions. Consider a case where there are an infinite number of possible actions--for example, a continuous-valued action. Any method learning just action values must search through this infinite set in order to pick an action. If the policy is explicitly stored, then this extensive computation may not be needed for each action selection.

• They can learn an explicitly stochastic policy; that is, they can learn the optimal probabilities of selecting various actions. This ability turns out to be useful in competitive and non-Markov cases

Page 25

Gradient-Descent Methods

Here the value function is a parameterized functional form with parameter vector $\vec{\theta}_t$, adjusted after each example by gradient descent:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \bigl[ v_t - V_t(s_t) \bigr] \nabla_{\vec{\theta}_t} V_t(s_t),$$

where $\nabla_{\vec{\theta}_t} V_t(s_t)$ denotes the vector of partial derivatives of $V_t(s_t)$ with respect to the components of $\vec{\theta}_t$, and $v_t$ is the training target for $V_t(s_t)$.
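A sketch of this update with a linear approximator, using the one-step TD(0) target in place of $v_t$. The one-hot feature function, random-walk task, and parameter values are assumptions for illustration.

```python
import random

ALPHA, GAMMA, N_FEATURES = 0.05, 0.9, 5
theta = [0.0] * N_FEATURES          # parameter vector of the approximate value function

def features(state):
    """Hypothetical one-hot features for a 5-state chain."""
    x = [0.0] * N_FEATURES
    x[state] = 1.0
    return x

def value(state):
    return sum(t * x for t, x in zip(theta, features(state)))

def env_step(state):
    """Random walk: exit right of state 4 for +1, exit left of state 0 for 0."""
    nxt = state + random.choice([-1, 1])
    if nxt < 0:
        return "TERMINAL", 0.0
    if nxt > 4:
        return "TERMINAL", 1.0
    return nxt, 0.0

for _ in range(5_000):
    state = 2
    while state != "TERMINAL":
        next_state, reward = env_step(state)
        v_next = 0.0 if next_state == "TERMINAL" else value(next_state)
        target = reward + GAMMA * v_next             # bootstrapped target playing the role of v_t
        error = target - value(state)
        grad = features(state)                       # gradient of a linear V with respect to theta
        theta = [t + ALPHA * error * g for t, g in zip(theta, grad)]
        state = next_state

print([round(t, 2) for t in theta])
```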

Page 26

n-Step TD Prediction
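The formulas on this slide did not survive the transcript. For reference, the standard definition of the n-step return and the corresponding n-step TD update (not copied from the slide) is:

```latex
% n-step return: n real rewards, then bootstrap from the current value estimate.
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_t(s_{t+n}),
\qquad
\Delta V_t(s_t) = \alpha \bigl[ R_t^{(n)} - V_t(s_t) \bigr]
```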

Page 27

n-Step TD Prediction (2): λ-return
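The slide's formula is not reproduced here; the standard definition of the λ-return, as a geometrically weighted average of the n-step returns above, is:

```latex
% lambda-return: weights (1 - lambda) * lambda^(n-1) sum to one over n = 1, 2, ...
R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}
```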

Page 28

TD-Gammon

Page 29

TD-Gammon (2)

Page 30

TD-Gammon (3)

TD-Gammon used the gradient-descent form of the TD(λ) algorithm.

Page 31

Difficulties of using reinforcement learning and attempts to overcome them

• Exploration versus exploitation
– use random actions
– favor states not visited recently
– separate the learning phase from the use phase
– employ a teacher to guide exploration
• Slow time to convergence
– combine learning with prior knowledge; use estimates of Q values (rather than random values) initially
– use a hierarchy of actions; learn primitive actions first, freeze the useful sequences into macros, and then learn how to use the macros
– employ a teacher; use graded "lessons", starting near the rewards and then backing away, and use examples of good behavior [Lin, 1992]
– use more efficient computations; e.g. do several updates per episode [Moore & Atkeson, 1993]

Page 32

Difficulties of using reinforcement learning and attempts to overcome them (2)

• Large state spaces
– Use hand-coded features
– Use neural networks
– Use nearest-neighbor methods [Moore, 1990]
• Temporal discounting problems. Using a small discount rate γ can make the learner too greedy for present rewards and indifferent to the future; but using a large γ slows down learning
– Use a learning method based on average rewards [Schwartz, 1993]
• No "transfer" of learning. What is learned depends on the reward structure; if the rewards change, learning has to start over.
– Separate the learning into two parts: learn an "action model" which predicts how actions change states (and is constant over all problems), and then learn the "values" of states by reinforcement learning for each different set of rewards. Sometimes the reinforcement learning part can be replaced by a "planner" that uses the action model to produce plans to achieve goals