
1

Reinforcement Learning

Chapter 13

• What is Reinforcement Learning?
• Q-Learning
• Examples

2

Machine Learning Categories

3

What is Reinforcement Learning?

• An autonomous agent should learn to choose optimal actions in each state to achieve its goals.

• The agent learns how to achieve that goal by trial-and-error interactions with its environment.

4

Example: Learning to ride a bike

• Suppose: In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right.

• At this point, there are two possible actions:

– turn the handlebars right:
• crashing to the ground (a negative reinforcement)

– turn the handlebars left:
• crashing to the ground (a negative reinforcement)

5

Example: Learning to ride a bike

• At this point, the RL system has learned not only that turning the handlebars right or left when tilted 45 degrees to the right is bad, but also that the "state" of being tilted 45 degrees to the right is bad.

• Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. ……

6

Reinforcement Learning: Suitable for state-action problems

• Board games: e.g., backgammon, chess, 8-puzzle, … (Reinforcement Learning in Board Games, Imran Ghory, 2004)

[Figure: example state–action transition graph (states s0–s8, actions a1–a7)]

7

What is Reinforcement Learning?

[Figure: agent-environment interaction loop. The agent receives state st and reward rt from the environment and responds with action at, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, …]

s : state

a : action

r : reward function

control policy π : S -> A
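This interaction loop can be written directly as a short program. A minimal sketch in Python, assuming a hypothetical env object with reset() and step() methods in the style of common RL toolkits, and a policy given as a plain function from states to actions:

```python
def run_episode(env, policy, max_steps=100):
    """Run one agent-environment interaction episode.

    env    : hypothetical environment object with reset() -> state and
             step(action) -> (next_state, reward, done)
    policy : control policy pi: S -> A, given here as a plain function
    """
    state = env.reset()                          # observe initial state s0
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                   # a_t = pi(s_t)
        state, reward, done = env.step(action)   # environment returns s_{t+1} and r_t
        rewards.append(reward)
        if done:                                 # stop at a terminal state
            break
    return rewards
```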

8

Example: TD-Gammon

• Tesauro (1995)

• Used RL to learn to play backgammon at world-championship level

• Immediate reward

– +100 if win

– -100 if lose

– 0 for all other states

• Trained by playing 1.5 million games against itself

• Now approximately equal to best human player

9

An Example of Reward Function

10

The Goal in Reinforcement Learning

• Goal: learn to choose actions that maximize:

r0 + γ r1 + γ² r2 + … ,

• where 0 ≤ γ < 1

• The discount factor γ is used to exponentially decrease the weight of reinforcements received in the future

• It’s called: Discounted Cumulative Reward
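As a quick numerical illustration (made-up reward values, not from the slides), the discounted cumulative reward can be computed directly from a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. a reward of 100 received two steps in the future is worth 0.81 * 100 today
print(discounted_return([0, 0, 100], gamma=0.9))   # 81.0
```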

11

Discounted Cumulative Reward

γ = 0.9

12

Other Options

• Finite-horizon model:

• Average-reward model:

• Average discounted reward model:
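The equations for these models appear to have been images on the original slide and are missing from this transcript. As a rough reconstruction of the commonly used definitions (not necessarily the exact forms the slide showed), the first two can be written as:

```latex
\[
  \text{finite-horizon model:}\quad
    E\!\left[\sum_{t=0}^{h} r_t\right]
  \qquad\qquad
  \text{average-reward model:}\quad
    \lim_{h\to\infty} E\!\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]
\]
```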

13

Different Types of Learning Tasks

• Agent’s actions:
  – Deterministic, or
  – Nondeterministic

• The agent may or may not be able to predict the next state that will result from each action

• Trainer of the agent:
  – An expert (who shows it examples of optimal action sequences), or
  – The agent itself (training itself by performing actions of its own choice)

14

Q-Learning for Simple Deterministic Worlds
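The algorithm on this slide was an image and did not survive the transcript. A rough Python sketch of tabular Q-learning for a deterministic world, using the update rule Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a') that the example on the next slide applies (the environment interface here is a hypothetical placeholder):

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a simple deterministic world.

    env     : hypothetical environment with reset() -> state and
              step(action) -> (next_state, reward, done)
    actions : list of all possible actions
    """
    Q = defaultdict(float)                 # Q_hat(s, a), initialized to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = random.choice(actions)              # simplest exploration: uniform random
            next_state, reward, done = env.step(action)  # execute a, observe r and s'
            # deterministic-world update: Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(state, action)] = reward + gamma * max(Q[(next_state, a)] for a in actions)
            state = next_state
    return Q
```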

15

Example

Q̂(s1, a_right) ← r + γ max_a′ Q̂(s2, a′)

            = 0 + 0.9 × max{63, 81, 100}

            = 90
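The update in this example can be checked with a one-line computation:

```python
gamma = 0.9
q_values_in_s2 = [63, 81, 100]            # Q_hat(s2, a') for the actions available in s2
print(0 + gamma * max(q_values_in_s2))    # 90.0
```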

16

RL as a function approximation method

• Learning the control policy (π) is very similar to the function approximation problem, except:

1. Delayed reward
   – In RL, the trainer provides only a sequence of immediate reward values => the problem of temporal credit assignment.

2. Exploration or exploitation (next slide)
   – Exploration to collect new information, or exploitation of what it has already learned to maximize the cumulative reward.
   – In RL, the agent influences the distribution of training examples through the action sequences it chooses.

17

Explore or Exploit?

• Q-learning itself does not specify how to choose an action among the possible actions; some options:

– Uniform random selection

– Selecting the action with the highest Q-value

– Selection based on the following probability (see the sketch below):

– Small k => exploration, large k => exploitation

– Common choice: small k at the beginning of the learning process, then gradually increasing k
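The probability rule referenced above seems to have been an image on the slide. In Mitchell's treatment it is a ratio of powers of the current Q_hat values, P(a_i | s) = k^Q_hat(s, a_i) / Σ_j k^Q_hat(s, a_j), which matches the remark that small k favors exploration and large k favors exploitation. A minimal sketch of such a selection rule, treating that formula as the intended one:

```python
import random

def select_action(Q, state, actions, k=2.0):
    """Probabilistic action selection with P(a_i) proportional to k ** Q_hat(s, a_i).

    Small k  -> probabilities close to uniform (exploration);
    large k  -> probability mass concentrates on high-Q actions (exploitation).
    """
    weights = [k ** Q.get((state, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights)[0]
```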

18

RL vs. other function approximation (continued)

3. Partially observable states
   – In many practical situations, the sensors provide only partial information (like the camera in front of a robot).
   – Solution: consider previous observations together with the current sensor data (a minimal sketch follows below).

4. Life-long learning
   – Unlike the usual function approximation task, in RL a robot needs to learn many tasks simultaneously and keep learning online throughout its lifetime.
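One simple way to realize the "previous observations together with the current sensor data" idea, sketched under the assumption of fixed-length numeric observations (illustrative only, not the chapter's method):

```python
from collections import deque

N = 3                      # number of recent observations to remember
history = deque(maxlen=N)

def augmented_state(observation):
    """Use the concatenation of the last N observations as the learner's state."""
    history.append(observation)
    return tuple(x for obs in history for x in obs)

# usage with made-up sensor readings
print(augmented_state((1.0, 2.0)))   # (1.0, 2.0)
print(augmented_state((3.0, 4.0)))   # (1.0, 2.0, 3.0, 4.0)
```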

19

RL Convergence

• Proved on pp. 377–378 of Mitchell.

• Three conditions for convergence:

– Deterministic Markov Decision Process (MDP)

– Positive, bounded immediate rewards

– The agent visits every state-action pair infinitely often

20

Markov Decision Process

• Finite set of states: S; set of actions: A
  – t: discrete time step
  – st: the state at time t
  – at: the action at time t

• At each discrete time step, the agent observes state st ∈ S and chooses action at ∈ A.
• It then receives immediate reward rt, and the state changes to st+1.

• Markov assumption: st+1 = δ(st, at), rt = r(st, at)
  – i.e., rt and st+1 depend only on the current state and action

• Functions δ and r may be nondeterministic
• Functions δ and r are not necessarily known to the agent

[Figure: interaction sequence st →(at, rt)→ st+1 →(at+1, rt+1)→ st+2 →(at+2, rt+2)→ …]
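To make these definitions concrete, a deterministic MDP can be represented by two lookup tables, one for the transition function δ and one for the reward function r; a minimal sketch with a made-up two-state world:

```python
# Hypothetical two-state deterministic MDP: delta and r as tables keyed by (state, action).
actions = ["left", "right"]

delta = {                       # transition function: delta(s, a) -> next state
    ("s0", "right"): "s1",
    ("s0", "left"):  "s0",
    ("s1", "right"): "s1",
    ("s1", "left"):  "s0",
}

r = {                           # reward function: r(s, a) -> immediate reward
    ("s0", "right"): 0,
    ("s0", "left"):  0,
    ("s1", "right"): 100,       # made-up goal reward
    ("s1", "left"):  0,
}

# one step of the process starting from s0
state, action = "s0", "right"
print(delta[(state, action)], r[(state, action)])   # s1 0
```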

21

Other issues in RL (pp. 381–386)

• Reinforcement learning for nondeterministic rewards and actions

• Temporal Difference Learning

• Generalizing from examples

• Relationship to dynamic programming

• Continuous reinforcement learning (state-of-the-art)

22

Homework

• 13.3 – Tic-Tac-Toe
