An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning
Haitham Bou Ammar, Eric Eaton, Gerhard Weiss, Kurt Driessens, Karl Tuyls, Decebal Mocanu, Matthew Taylor
Introduction
Reinforcement learning (RL) is a key technique for learning through interaction with the environment.
Problem Definition:
RL problems are formalized as Markov Decision Processes (MDPs) $\langle S, A, P, R, \gamma \rangle$:
• $S$: state space
• $A$: action space
• $P$: transition probability
• $R$: reward function
• $\gamma$: discount factor
Goal
Learn an optimal policy by maximizing
$$Q(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R_t \right]$$
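As a minimal illustration of this objective (the function and variable names below are illustrative, not from the slides), the discounted return of a single sampled trajectory can be computed as:

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * R_t for one trajectory of rewards."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# A reward of 1 received two steps in the future is worth gamma^2 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```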
Problem: Reinforcement learners are slow to learn.
Possible Solution: Reuse knowledge from other sources, which has already produced impressive results.
Motivation
• Learning from Demonstration
• Transfer Learning
Transfer Learning
Transfer knowledge from a pool of source tasks (from the same domain) to a new target task.
Questions to answer:
1. How to transfer? (lots of approaches)
2. What to transfer? (lots of approaches)
3. When to transfer? (less progress has been achieved; this requires a task similarity measure)
RBDist: Similarity Measure Between MDPs
Our measure is based on Restricted Boltzmann Machines (RBMs):
• A set of visible units
• A set of hidden units
[Figure: an RBM whose visible layer receives sampled trajectories capturing the source dynamics, fully connected to a hidden layer]
An RBM defines an energy function and an associated probability distribution; its weights are trained using contrastive divergence.
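For reference, the standard energy function and probability distribution of a binary-unit RBM (visible vector $v$, hidden vector $h$, weight matrix $W$, bias vectors $a$ and $b$) are:
$$E(v, h) = -a^\top v - b^\top h - v^\top W h, \qquad p(v, h) = \frac{e^{-E(v, h)}}{Z}$$
where $Z = \sum_{v, h} e^{-E(v, h)}$ is the partition function.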
Step 1: Train an RBM to approximate the source task's dynamics.
Separate sampled source trajectories into i.i.d. $\langle s, a, s' \rangle$ tuples and train the RBM on them.
[Figure: source tuples presented to the RBM's visible layer]
The RBM learns a generative model that captures the source dynamics.
Key Idea: If the dynamics of a source and target domain are similar, the RBM trained on the source task should be able to reconstruct trajectories from the target task.
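To make Step 1 concrete, here is a minimal Python sketch, not the authors' implementation: a binary-unit RBM trained with one-step contrastive divergence (CD-1) on $\langle s, a, s' \rangle$ tuples, assuming state and action features have been normalized to [0, 1]; all hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def to_tuples(states, actions):
    """Flatten a trajectory into i.i.d. <s, a, s'> rows (features assumed in [0, 1])."""
    return np.hstack([states[:-1], actions[:-1], states[1:]])

class RBM:
    """Minimal binary-unit RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible biases
        self.b = np.zeros(n_hidden)    # hidden biases

    def train(self, data, lr=0.05, epochs=20):
        for _ in range(epochs):
            for v0 in data:
                # Positive phase: hidden activations given a data vector.
                ph0 = sigmoid(self.b + v0 @ self.W)
                h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
                # Negative phase: one Gibbs step back down and up again.
                pv1 = sigmoid(self.a + h0 @ self.W.T)
                ph1 = sigmoid(self.b + pv1 @ self.W)
                # CD-1 updates approximate the log-likelihood gradient.
                self.W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
                self.a += lr * (v0 - pv1)
                self.b += lr * (ph0 - ph1)

    def reconstruct(self, v):
        """One up-down pass: the RBM's reconstruction of input v."""
        ph = sigmoid(self.b + v @ self.W)
        return sigmoid(self.a + ph @ self.W.T)
```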
Step 2: Reconstruct target task trajectories by sampling the trained RBM.
[Figure: $\langle s, a, s' \rangle$ tuples from target-task trajectories are clamped to the visible layer; sampling yields reconstructions of the target trajectories based on the source task's dynamics]
Step 3: Measure the reconstruction error of the sampled target trajectories as RBDist:
$$\mathrm{RBDist} = \frac{1}{n} \sum_{k=1}^{n} e_k, \qquad e_k = L_2\!\left( \left\langle s_2^{(k)}, a_2^{(k)}, s_2'^{(k)} \right\rangle_{\text{orig}},\ \left\langle s_2^{(k)}, a_2^{(k)}, s_2'^{(k)} \right\rangle_{\text{recon}} \right)$$
where the subscript 2 denotes the target task, the first argument is the original tuple, and the second is its reconstruction.
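Putting the three steps together, a sketch of the full measure under the same assumptions (reusing the illustrative `RBM` class above; `source_tuples` and `target_tuples` are assumed to be row-stacked $\langle s, a, s' \rangle$ vectors):

```python
def rbdist(source_tuples, target_tuples, n_hidden=16):
    """Mean L2 reconstruction error of target tuples under a source-trained RBM."""
    rbm = RBM(n_visible=source_tuples.shape[1], n_hidden=n_hidden)
    rbm.train(source_tuples)  # Step 1: fit the source task's dynamics
    # Steps 2-3: reconstruct each target tuple and measure its L2 error.
    recon = np.array([rbm.reconstruct(t) for t in target_tuples])
    errors = np.linalg.norm(target_tuples - recon, axis=1)
    return float(errors.mean())
```

Lower RBDist values indicate more similar dynamics; the experiments below relate this to transfer performance.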
Experiments & Results
Dynamical Systems & Benchmarks
• Inverted Pendulum: swing up and balance the pole by applying torques
• Cart Pole: balance the pole upright by applying linear forces
• Mountain Car: control the car to reach the goal by oscillating around the valley
Results: Dynamical Phases
RBDist can automatically cluster dynamical phases.
[Figure: phase-clustering results for the Inverted Pendulum, Mountain Car, and Cart Pole domains]
Results: Transfer Performance
RBDist correlates with transfer performance.
[Figure: jump start results, plotting Reward against Different Cartpoles, with panels for the Inverted Pendulum, Mountain Car, and Cart Pole domains]
Thank you!
This work was supported in part by ONR N00014-11-1-0139, AFOSR FA8750-14-1-0069, and NSF IIS-1149917.
Please send correspondence to:
Haitham Bou Ammar ([email protected]) and Eric Eaton ([email protected])