TRANSCRIPT
Advice Taking and Transfer Learning:
Naturally-Inspired Extensions to Reinforcement Learning
Lisa Torrey, Trevor Walker, Richard Maclin*, Jude Shavlik
University of Wisconsin - Madison
University of Minnesota - Duluth*
Reinforcement Learning

Agent ↔ Environment: the agent observes a state, takes an action, and receives a reward (rewards may be delayed).
Q-Learning

Update Q-function incrementally
Follow current Q-function to choose actions
Converges to accurate Q-function

Q-function: (state, action) → value
policy(state) = argmax_action Q(state, action)
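The update-and-act loop above can be sketched in a few lines. This is a minimal tabular version for illustration only (the talk itself uses function approximation, not a table); the transition values are made up.

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch. States and actions are any hashable
# values; Q maps (state, action) pairs to estimated values.
def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """Incremental Q-function update after observing one transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def policy(Q, state, actions):
    """policy(state) = argmax over actions of Q(state, action)."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = defaultdict(float)
q_learning_step(Q, state=1, action="pass", reward=1.0, next_state=2,
                actions=["pass", "shoot"])
print(policy(Q, 1, ["pass", "shoot"]))  # "pass" now has the higher Q-value
```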
Limitations

Agents begin without any information
Random exploration required in early stages of learning
Long training times can result
Naturally-Inspired Extensions

Advice Taking: a human teacher provides knowledge to the RL agent
Transfer Learning: a source-task agent provides knowledge to a target-task agent
Potential Benefits

[Chart: performance vs. training, comparing learning with knowledge to learning without knowledge. Knowledge can give a higher start, a higher slope, and a higher asymptote.]
Outline

RL in a complex domain
Extension #1: Advice Taking
Extension #2: Transfer Learning
    Skill Transfer
    Macro Transfer
    MLN Transfer
The RoboCup Domain

KeepAway: +1 per time step
MoveDownfield: +1 per meter
BreakAway: +1 upon goal
The RoboCup Domain

State features:
distBetween(a0, Player)
distBetween(a0, GoalPart)
distBetween(Attacker, goalCenter)
distBetween(Attacker, ClosestDefender)
distBetween(Attacker, goalie)
angleDefinedBy(topRight, goalCenter, a0)
angleDefinedBy(GoalPart, a0, goalie)
angleDefinedBy(Attacker, a0, ClosestDefender)
angleDefinedBy(Attacker, a0, goalie)
timeLeft

Actions:
move(ahead), move(away), move(left), move(right)
pass(Teammate)
shoot(GoalPart)
Q-Learning

Q-function: (state, action) → value
policy(state) = argmax_action Q(state, action)

State  Action  Q
1      1       0.5
1      2       -0.5
1      3       0
2      1       0.3
…      …       …

Function approximation
Approximating the Q-function

Linear support-vector regression:
Q-value = (weight vector)ᵀ ● (feature vector)

Feature vector: distBetween(a0, a1), distBetween(a0, a2), distBetween(a0, goalie), …
Weight vector: 0.2, -0.1, 0.9, …

Set weights to minimize:
ModelSize + C × DataMisfit
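Evaluating such a linear Q-function is just a dot product per action. A small sketch, using the slide's example feature values and weights (the second action's weights are made up for contrast):

```python
import numpy as np

# Linear Q-function: one weight vector per action; Q-value = w . x,
# the dot product of the action's weights with the state features.
features = np.array([15.0, 5.0, 20.0])   # distBetween(a0,a1), (a0,a2), (a0,goalie)
weights = {"pass(a1)": np.array([0.2, -0.1, 0.9]),   # weights from the slide
           "shoot":    np.array([-0.3, 0.0, 0.5])}   # illustrative weights

q_values = {action: float(w @ features) for action, w in weights.items()}
best = max(q_values, key=q_values.get)
print(best, q_values[best])  # pass(a1) 20.5
```

The policy then simply picks the action with the highest dot product, exactly the argmax from the Q-learning slide.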
RL in 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000) for standard RL in 3-on-2 BreakAway]
Extension #1: Advice Taking
IF an opponent is near
AND a teammate is open
THEN pass is the best action
Advice in RL

Advice sets constraints on Q-values under specified conditions:

IF an opponent is near me
AND a teammate is open
THEN pass has a high Q-value

Apply as soft constraints in optimization, setting weights to minimize:
ModelSize + C × DataMisfit + μ × AdviceMisfit
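To make the three-term objective concrete, here is a toy gradient-descent sketch. This is not the support-vector solver used in the actual work; the data, the advice state, and the hyperparameter values are all illustrative.

```python
import numpy as np

# Toy sketch of advice as a soft constraint. The objective is
#   ModelSize + C * DataMisfit + mu * AdviceMisfit
# = ||w||^2 + C * sum_i (w.x_i - y_i)^2
#   + mu * sum_j max(0, q_min - w.x_j)^2
# so advice ("Q should be at least q_min in these states") is penalized
# when violated rather than enforced as a hard constraint.
def fit_with_advice(X, y, X_advice, q_min, C=1.0, mu=10.0, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * w                                   # ModelSize term
        grad += 2 * C * X.T @ (X @ w - y)              # DataMisfit term
        slack = np.maximum(0.0, q_min - X_advice @ w)  # AdviceMisfit term
        grad -= 2 * mu * X_advice.T @ slack
        w -= lr * grad
    return w

# Two toy states; the data says the second state's Q is -0.5, but advice
# says Q should be at least 0 there, so the fit is pulled upward.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0.5, -0.5])
X_advice = np.array([[0.0, 1.0]])
w = fit_with_advice(X, y, X_advice, q_min=0.0)
print(X_advice @ w)   # near 0 rather than -0.5
```

Because the constraint is soft, strong contrary data can still override bad advice; raising μ makes the advice harder to override.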
Advice Performance
Extension #2: Transfer
3-on-2 BreakAway
3-on-2 KeepAway
3-on-2 MoveDownfield
Relational Transfer

First-order logic describes relationships between objects:

distBetween(a0, Teammate) > 10
distBetween(Teammate, goalCenter) < 15

We want to transfer relational knowledge:
Human-level reasoning
General representation
Skill Transfer

Learn advice about good actions from the source task:

good_action(pass(Teammate)) :-
    distBetween(a0, Teammate) > 10,
    distBetween(Teammate, goalCenter) < 15.

Example 1:
distBetween(a0, a1) = 15
distBetween(a0, a2) = 5
distBetween(a0, goalie) = 20
…
action = pass(a1)
outcome = caught(a1)

Select positive and negative examples of good actions and apply inductive logic programming to learn rules.
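A hypothetical Python rendering of the learned rule shows how it would classify an example like the one above. The teammate-to-goal distance below is assumed for illustration; it is not given on the slide.

```python
# Python rendering of the ILP-learned rule
#   good_action(pass(Teammate)) :- distBetween(a0, Teammate) > 10,
#                                  distBetween(Teammate, goalCenter) < 15.
def good_pass(dist_a0_teammate, dist_teammate_goal):
    return dist_a0_teammate > 10 and dist_teammate_goal < 15

# Example 1 gives distBetween(a0, a1) = 15; suppose a1 is 12 m from
# goalCenter (assumed value). The caught pass then matches the rule,
# so it would serve as a positive training example.
print(good_pass(15, 12))  # True
print(good_pass(5, 12))   # False: the teammate is too close to a0
```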
User Advice in Skill Transfer

There may be new skills in the target that cannot be learned from the source (e.g., shooting in BreakAway)
We allow users to add their own advice about these new skills
User advice simply adds to transfer advice
Skill Transfer to 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000), comparing:]
Standard RL
Skill Transfer from 2-on-1 BreakAway
Skill Transfer from 3-on-2 MoveDownfield
Skill Transfer from 3-on-2 KeepAway
Macro Transfer

Learn a strategy from the source task:
Find an action sequence that separates good games from bad games
Learn first-order rules to control transitions along the sequence

move(ahead) → pass(Teammate) → shoot(GoalPart)
Transfer via Demonstration

Target-task games 0 to ~100: execute the macro strategy; the agent learns an initial Q-function
Remaining games: perform standard RL; the agent adapts to the target task
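The two-phase schedule above can be sketched as a simple switch on the game number. The Macro and Agent classes here are stand-in stubs, not the actual system's classes, and the 100-game cutoff is the approximate figure from the slide.

```python
# Sketch of transfer via demonstration: for the first ~100 target-task games
# the agent executes the transferred macro strategy (learning an initial
# Q-function from those games), then switches to standard RL to adapt.
class Macro:
    def next_action(self, state):
        return "pass"           # placeholder macro step

class Agent:
    def policy(self, state):
        return "shoot"          # placeholder learned policy

def choose_action(game_number, state, macro, rl_agent, demo_games=100):
    if game_number < demo_games:
        return macro.next_action(state)   # demonstration phase
    return rl_agent.policy(state)         # standard RL phase

print(choose_action(0, None, Macro(), Agent()))    # macro phase
print(choose_action(150, None, Macro(), Agent()))  # RL phase
```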
Macro Transfer to 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000), comparing:]
Standard RL
Skill Transfer from 2-on-1 BreakAway
Macro Transfer from 2-on-1 BreakAway
MLN Transfer

Learn a Markov Logic Network to represent the source-task policy relationally
Apply the policy via demonstration in the target task

MLN Q-function: (state, action) → value
Markov Logic Networks

A Markov network models a joint distribution
A Markov Logic Network combines probability with logic:
    Template: a set of first-order formulas with weights
    Each grounded predicate in a formula becomes a node
    Predicates in a grounded formula are connected by arcs

Probability of a world: (1/Z) exp(Σᵢ Wᵢ Nᵢ)
MLN Q-function

Formula 1 (W1 = 0.75, N1 = 1 teammate):
IF distance(me, Teammate) < 15
AND angle(me, goalie, Teammate) > 45
THEN Q ∈ (0.8, 1.0)

Formula 2 (W2 = 1.33, N2 = 3 goal parts):
IF distance(me, GoalPart) < 10
AND angle(me, goalie, GoalPart) > 45
THEN Q ∈ (0.8, 1.0)

Probability that Q ∈ (0.8, 1.0):
exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
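Plugging in the slide's weights and grounding counts, the bin probability is just a logistic function of the weighted counts:

```python
import math

# Probability that Q falls in a bin, given formula weights W_i and the
# number of true groundings N_i of each formula in the current state:
# sigma(sum_i W_i * N_i), the logistic form from the slide.
def bin_probability(weights, counts):
    z = sum(w * n for w, n in zip(weights, counts))
    return math.exp(z) / (1 + math.exp(z))

# W1 = 0.75 with N1 = 1 teammate; W2 = 1.33 with N2 = 3 goal parts.
p = bin_probability([0.75, 1.33], [1, 3])
print(round(p, 3))  # about 0.991
```

With both formulas satisfied and positive weights, the MLN is highly confident that Q lies in the (0.8, 1.0) bin.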
Using an MLN Q-function

Q ∈ (0.8, 1.0): P1 = 0.75
Q ∈ (0.5, 0.8): P2 = 0.15
Q ∈ (0, 0.5):   P3 = 0.10

Q = P1 ● E[Q | bin1] + P2 ● E[Q | bin2] + P3 ● E[Q | bin3]

where E[Q | bin] is the Q-value of the most similar training example in the bin
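Reading a Q-value out of the MLN is then a probability-weighted sum over bins. The per-bin estimates below are assumed for illustration; on each query they would come from the most similar training example in that bin.

```python
# Expected Q-value from the MLN: sum over bins of
# P(bin) * E[Q | bin], using the slide's bin probabilities.
bin_probs = [0.75, 0.15, 0.10]      # P1, P2, P3 from the slide
bin_estimates = [0.9, 0.65, 0.25]   # assumed E[Q | bin] values

q = sum(p * e for p, e in zip(bin_probs, bin_estimates))
print(q)
```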
MLN Transfer to 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000), comparing:]
MLN Transfer
Macro Transfer
Value-function Transfer
Standard RL
Conclusions

Advice and transfer can provide RL agents with knowledge that improves early performance
Relational knowledge is desirable because it is general and involves human-level reasoning
More detailed knowledge produces larger initial benefits, but is less widely transferable
Acknowledgements

DARPA grant HR0011-04-1-0007
DARPA grant HR0011-07-C-0060
DARPA grant FA8650-06-C-7606
NRL grant N00173-06-1-G002
Thank You