ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: Case Studies and Policy Gradient Methods
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015
October 29, 2015
Introduction
We’ll discuss several case studies of reinforcement learning
Illustrate some of the trade-offs and issues that arise in real-world applications
For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem
We also highlight the representation issues that are so often critical to successful applications
Applications of RL are still far from routine and typically require as much art as science
Making applications easier and more straightforward is one of the goals in current RL research
TD-Gammon (Tesauro, 1992, 1994, 1995, …)
One of the most impressive early applications of RL to date is Gerry Tesauro's (IBM) backgammon player
TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters
The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation
Function approximation using a feedforward neural network (FFNN) trained by backpropagating TD errors
There are probably more professional backgammon players than there are professional chess players
Backgammon is in part a game of chance, and it can be viewed as a large MDP
TD-Gammon (cont.)
The game is played with 15 white and 15 black pieces on a board of 24 locations, called points
Here’s a typical position early in the game, seen from the perspective of the white player
TD-Gammon (cont.)
White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps
The objective is to advance all pieces to points 19–24, and then off the board
Hitting – a lone opponent piece on a point can be hit and sent to the bar
30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20)
Effective branching factor of ~400: there are 21 distinct dice rolls, each allowing roughly 20 ways of playing
TD-Gammon - details
Although the game is highly stochastic, a complete description of the game's state is available at all times
The estimated value of any state was meant to predict the probability of winning starting from that state
Reward: 0 at all times, except when the game is won, in which case it is 1
Episodic (game = episode), undiscounted
Non-linear form of TD(λ) using a feedforward neural network
Backpropagation of the TD error
4 input units for each point: a unary encoding of the number of white pieces, plus other features
Use of "after-states"
Learning during self-play, fully incremental (a sketch of the update follows below)
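To make the update rule concrete, here is a minimal Python sketch of nonlinear TD(λ) with backpropagated TD errors, in the spirit of TD-Gammon. It is an illustration, not Tesauro's code: the network size, the board encoding, and all hyperparameters are assumptions.

    import numpy as np

    # Illustrative nonlinear TD(lambda), in the spirit of TD-Gammon.
    class ValueNet:
        def __init__(self, n_inputs, n_hidden=40, alpha=0.1, lam=0.7):
            rng = np.random.default_rng(0)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
            self.W2 = rng.normal(0.0, 0.1, n_hidden)
            self.alpha, self.lam = alpha, lam
            self.reset_traces()

        def reset_traces(self):
            # One eligibility trace per weight; cleared at the start of each game
            self.e1 = np.zeros_like(self.W1)
            self.e2 = np.zeros_like(self.W2)

        def forward(self, x):
            # V(s): estimated probability that White wins; activations are cached
            self.x, self.h = x, np.tanh(self.W1 @ x)
            self.v = 1.0 / (1.0 + np.exp(-self.W2 @ self.h))
            return self.v

        def accumulate_traces(self):
            # e <- lambda * e + grad_w V(s), at the last forwarded state (gamma = 1)
            dv = self.v * (1.0 - self.v)                  # sigmoid derivative
            self.e2 = self.lam * self.e2 + dv * self.h
            self.e1 = self.lam * self.e1 + np.outer(
                dv * self.W2 * (1.0 - self.h ** 2), self.x)

        def apply_td(self, delta):
            # Backpropagated TD update: w <- w + alpha * delta * e
            self.W1 += self.alpha * delta * self.e1
            self.W2 += self.alpha * delta * self.e2

    # Schematic self-play step (encode() and the after-state search are assumed):
    #   v = net.forward(encode(board)); net.accumulate_traces()
    #   board = the legal after-state maximizing net.forward(encode(...))
    #   target = 1.0 if White has won else net.forward(encode(board))
    #   net.apply_td(target - v)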
TD-Gammon – Neural Network Employed
Summary of TD-Gammon Results
Two players played against each other
Each had no prior knowledge of the game
Only the rules of the game were imposed
Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players
Rebuttal on TD-Gammon
For an alternative view, see “Why did TD-Gammon Work?”, Jordan Pollack and Alan Blair, NIPS (1997)
Claim: it was the “co-evolutionary training strategy, playing games against itself, which led to the success”
No need to deal with the exploration/exploitation trade-off (the dice rolls provide randomness)
Almost any such self-play approach would work for backgammon
The value is not very sensitive to the exact state, so the success does not extend to other problems
e.g., Tetris and maze-type problems, where the exploration issue comes up
The Acrobot
Robotic application of RL
Roughly analogous to a gymnast swinging on a high bar
The first joint (corresponding to the hands on the bar) cannot exert torque
The second joint (corresponding to the gymnast bending at the waist) can
This system has been widely studied by control engineers and machine learning researchers
The Acrobot (cont.)
One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint by an amount equal to the length of one of the links, in minimum time
In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque
A reward of –1 is given on all time steps until the goal is reached, which ends the episode. No discounting is used
Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps)
Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context
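As a rough illustration of that on-line, model-free setup, the sketch below implements linear Sarsa(λ) with replacing traces over binary features, in the style of Sutton (1996). The tile coder active_tiles (mapping the continuous state to feature indices) and the environment interface are assumed, and all constants are illustrative.

    import numpy as np

    N_FEATURES = 4096            # assumed tile-coding table size
    ACTIONS = (-1.0, 0.0, 1.0)   # fixed-magnitude torque at the second joint
    alpha, gamma, lam = 0.1, 1.0, 0.9   # undiscounted, as in the task above

    # With all rewards -1, initializing w to zero is optimistic, so even
    # greedy action selection keeps exploring.
    w = np.zeros((len(ACTIONS), N_FEATURES))

    def q(tiles, a):
        return w[a, tiles].sum()          # linear value over the active tiles

    def greedy(tiles):
        return int(np.argmax([q(tiles, a) for a in range(len(ACTIONS))]))

    def run_episode(env, active_tiles):
        z = np.zeros_like(w)              # eligibility traces
        tiles = active_tiles(env.reset())
        a = greedy(tiles)
        done = False
        while not done:
            s, r, done = env.step(ACTIONS[a])   # r = -1 until the goal
            delta = r - q(tiles, a)
            z[a, tiles] = 1.0             # replacing traces
            if not done:
                tiles = active_tiles(s)
                a = greedy(tiles)
                delta += gamma * q(tiles, a)
            w[:] += alpha * delta * z     # in-place update of the weights
            z *= gamma * lam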
Acrobot Learning Curves for Sarsa(λ)
Typical Acrobot Learned Behavior
RL in Robotics
Robot motor capabilities have been investigated using RL
Walking, grabbing, and delivering (MIT Media Lab)
RoboCup competitions – robot soccer games
Sony AIBOs are commonly employed
Maze-type problems
Balancing on an unstable platform
Multi-dimensional input streams
Policy Gradient Methods
Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn}
Running the policy with a particular θ results in a reward, rθ
Estimate the reward gradient, ∂r/∂θi, for each θi, and update

    θi ← θi + α ∂r/∂θi

where α is another learning rate
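A minimal sketch of this update, assuming a black-box evaluate(theta) that runs the policy and returns its reward; the gradient is estimated by central finite differences, so each step costs 2n policy evaluations.

    import numpy as np

    def policy_gradient_step(theta, evaluate, alpha=0.01, eps=0.05):
        # Estimate dr/dtheta_i by central differences, one parameter at a time
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            up, down = theta.copy(), theta.copy()
            up[i] += eps
            down[i] -= eps
            grad[i] = (evaluate(up) - evaluate(down)) / (2.0 * eps)
        # theta_i <- theta_i + alpha * dr/dtheta_i
        return theta + alpha * grad

In practice each evaluate call is a noisy rollout, so several rollouts are usually averaged per parameter setting.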
Policy Gradient Methods (cont.)
This results in hill-climbing in policy space
So, it's subject to all the problems of hill-climbing
But we can also use tricks from search theory, like random starting points and momentum terms
This is a good approach if you have a parameterized policy
Let's assume we have a "reasonable" starting policy
Typically faster than value-based methods
“Safe” exploration, if you have a good policy
Learns locally-best parameters for that policy
An Example: Learning to Walk
RoboCup four-legged league
Walking quickly is a big advantage
Historically, gaits were tuned manually
Robots have a parameterized gait controller
12 parameters, controlling step length, height, etc.
The robot walks across the soccer field and is timed
Reward is a function of the time taken
The robots know when to stop (distance measure)
An Example: Learning to Walk (cont.)
Basic idea:
1. Pick an initial θ = {θ1, θ2, ..., θ12}
2. Generate N test parameter settings by perturbing θ:
   θj = {θ1 + δ1, θ2 + δ2, ..., θ12 + δ12}, with each δi ∈ {−ε, 0, +ε}
3. Test each setting and observe the reward: θj → rj
4. For each θi ∈ θ, calculate Avg(+ε, i), Avg(0, i), and Avg(−ε, i) – the average reward over the settings in which θi was perturbed by +ε, 0, and −ε, respectively – and set
   θ'i = θi + ε if Avg(+ε, i) is largest
   θ'i = θi if Avg(0, i) is largest
   θ'i = θi − ε if Avg(−ε, i) is largest
5. Set θ ← θ' and go to step 2 (a code sketch follows below)
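A hedged Python sketch of steps 1–5, assuming a black-box evaluate_gait(theta) that times a walk with the given parameters and returns its reward; N, ε, and the iteration count are illustrative.

    import numpy as np

    def learn_gait(theta, evaluate_gait, N=15, eps=0.05, n_iters=100):
        theta = np.asarray(theta, dtype=float)
        for _ in range(n_iters):
            # 2. N random perturbations, each delta_i drawn from {-eps, 0, +eps}
            deltas = np.random.choice([-eps, 0.0, eps], size=(N, theta.size))
            # 3. Test each setting and record its reward
            rewards = np.array([evaluate_gait(theta + d) for d in deltas])
            # 4. Per parameter, step toward the best-scoring perturbation class
            new_theta = theta.copy()
            for i in range(theta.size):
                avgs = [rewards[deltas[:, i] == s].mean()
                        if (deltas[:, i] == s).any() else -np.inf
                        for s in (eps, 0.0, -eps)]
                best = int(np.argmax(avgs))
                new_theta[i] += (eps, 0.0, -eps)[best]
            theta = new_theta          # 5. theta <- theta', and repeat
        return theta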
An Example: Learning to Walk (cont.)
[Videos: the initial gait vs. the final learned gait]
Value Function or Policy Gradient?
When should I use policy gradient?
When there's a parameterized policy
When there’s a high-dimensional state space
When we expect the gradient to be smooth
Typically on episodic tasks (e.g. AIBO walking)
When should I use a value-based method?
When there is no parameterized policy
When we have no idea how to solve the problem (i.e. no known structure)
Summary
RL is a powerful tool that can support a wide range of applications
There is an art to defining the observations, states, rewards, and actions
Main goal: formulate as simple a representation as possible
Policy Gradient methods directly search in policy space