TRANSCRIPT
Advice Taking and Transfer Learning:
Naturally-Inspired Extensions to Reinforcement Learning
Lisa Torrey, Trevor Walker, Richard Maclin*, Jude Shavlik
University of Wisconsin - Madison
University of Minnesota - Duluth*
Reinforcement Learning

Agent ↔ Environment: the agent observes a state, takes an action, and receives a reward (rewards may be delayed).
Q-Learning

Update Q-function incrementally
Follow current Q-function to choose actions
Converges to accurate Q-function

Q-function: (state, action) → value
policy(state) = argmax_action Q(state, action)
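The update-and-act loop above can be sketched in a few lines. This is a minimal tabular version for illustration only (the talk itself uses function approximation, not a table); the transition values are made up.

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch. States and actions are any hashable
# values; Q maps (state, action) pairs to estimated values.
def q_learning_step(Q, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9):
    """Incremental Q-function update after observing one transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def policy(Q, state, actions):
    """policy(state) = argmax over actions of Q(state, action)."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = defaultdict(float)
q_learning_step(Q, state=1, action="pass", reward=1.0, next_state=2,
                actions=["pass", "shoot"])
print(policy(Q, 1, ["pass", "shoot"]))  # "pass" now has the higher Q-value
```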
Limitations

Agents begin without any information
Random exploration required in early stages of learning
Long training times can result
Naturally-Inspired Extensions

Advice Taking: a human teacher provides knowledge to the RL agent
Transfer Learning: a source-task agent provides knowledge to a target-task agent
Potential Benefits

[Chart: performance vs. training, comparing learning with knowledge to learning without knowledge. Knowledge can give a higher start, a higher slope, and a higher asymptote.]
Outline

RL in a complex domain
Extension #1: Advice Taking
Extension #2: Transfer Learning
    Skill Transfer
    Macro Transfer
    MLN Transfer
The RoboCup Domain

KeepAway: +1 per time step
MoveDownfield: +1 per meter
BreakAway: +1 upon goal
The RoboCup Domain

State features:
distBetween(a0, Player)
distBetween(a0, GoalPart)
distBetween(Attacker, goalCenter)
distBetween(Attacker, ClosestDefender)
distBetween(Attacker, goalie)
angleDefinedBy(topRight, goalCenter, a0)
angleDefinedBy(GoalPart, a0, goalie)
angleDefinedBy(Attacker, a0, ClosestDefender)
angleDefinedBy(Attacker, a0, goalie)
timeLeft

Actions:
move(ahead), move(away), move(left), move(right)
pass(Teammate)
shoot(GoalPart)
Q-Learning

Q-function: (state, action) → value
policy(state) = argmax_action Q(state, action)

State  Action  Q
1      1       0.5
1      2       -0.5
1      3       0
2      1       0.3
…      …       …

Function approximation
Approximating the Q-function

Linear support-vector regression:
Q-value = (weight vector)ᵀ ● (feature vector)

Feature vector: distBetween(a0, a1), distBetween(a0, a2), distBetween(a0, goalie), …
Weight vector: 0.2, -0.1, 0.9, …

Set weights to minimize:
ModelSize + C × DataMisfit
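Evaluating such a linear Q-function is just a dot product per action. A small sketch, using the slide's example feature values and weights (the second action's weights are made up for contrast):

```python
import numpy as np

# Linear Q-function: one weight vector per action; Q-value = w . x,
# the dot product of the action's weights with the state features.
features = np.array([15.0, 5.0, 20.0])   # distBetween(a0,a1), (a0,a2), (a0,goalie)
weights = {"pass(a1)": np.array([0.2, -0.1, 0.9]),   # weights from the slide
           "shoot":    np.array([-0.3, 0.0, 0.5])}   # illustrative weights

q_values = {action: float(w @ features) for action, w in weights.items()}
best = max(q_values, key=q_values.get)
print(best, q_values[best])  # pass(a1) 20.5
```

The policy then simply picks the action with the highest dot product, exactly the argmax from the Q-learning slide.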
RL in 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000) for standard RL in 3-on-2 BreakAway]
Extension #1: Advice Taking
IF an opponent is near
AND a teammate is open
THEN pass is the best action
Advice in RL

Advice sets constraints on Q-values under specified conditions:

IF an opponent is near me
AND a teammate is open
THEN pass has a high Q-value

Apply as soft constraints in optimization, setting weights to minimize:
ModelSize + C × DataMisfit + μ × AdviceMisfit
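To make the three-term objective concrete, here is a toy gradient-descent sketch. This is not the support-vector solver used in the actual work; the data, the advice state, and the hyperparameter values are all illustrative.

```python
import numpy as np

# Toy sketch of advice as a soft constraint. The objective is
#   ModelSize + C * DataMisfit + mu * AdviceMisfit
# = ||w||^2 + C * sum_i (w.x_i - y_i)^2
#   + mu * sum_j max(0, q_min - w.x_j)^2
# so advice ("Q should be at least q_min in these states") is penalized
# when violated rather than enforced as a hard constraint.
def fit_with_advice(X, y, X_advice, q_min, C=1.0, mu=10.0, lr=0.01, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * w                                   # ModelSize term
        grad += 2 * C * X.T @ (X @ w - y)              # DataMisfit term
        slack = np.maximum(0.0, q_min - X_advice @ w)  # AdviceMisfit term
        grad -= 2 * mu * X_advice.T @ slack
        w -= lr * grad
    return w

# Two toy states; the data says the second state's Q is -0.5, but advice
# says Q should be at least 0 there, so the fit is pulled upward.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0.5, -0.5])
X_advice = np.array([[0.0, 1.0]])
w = fit_with_advice(X, y, X_advice, q_min=0.0)
print(X_advice @ w)   # near 0 rather than -0.5
```

Because the constraint is soft, strong contrary data can still override bad advice; raising μ makes the advice harder to override.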
Advice Performance
Extension #2: Transfer
3-on-2 BreakAway
3-on-2 KeepAway
3-on-2 MoveDownfield
Relational Transfer

First-order logic describes relationships between objects:

distBetween(a0, Teammate) > 10
distBetween(Teammate, goalCenter) < 15

We want to transfer relational knowledge:
Human-level reasoning
General representation
Skill Transfer

Learn advice about good actions from the source task:

good_action(pass(Teammate)) :-
    distBetween(a0, Teammate) > 10,
    distBetween(Teammate, goalCenter) < 15.

Example 1:
distBetween(a0, a1) = 15
distBetween(a0, a2) = 5
distBetween(a0, goalie) = 20
…
action = pass(a1)
outcome = caught(a1)

Select positive and negative examples of good actions and apply inductive logic programming to learn rules.
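A hypothetical Python rendering of the learned rule shows how it would classify an example like the one above. The teammate-to-goal distance below is assumed for illustration; it is not given on the slide.

```python
# Python rendering of the ILP-learned rule
#   good_action(pass(Teammate)) :- distBetween(a0, Teammate) > 10,
#                                  distBetween(Teammate, goalCenter) < 15.
def good_pass(dist_a0_teammate, dist_teammate_goal):
    return dist_a0_teammate > 10 and dist_teammate_goal < 15

# Example 1 gives distBetween(a0, a1) = 15; suppose a1 is 12 m from
# goalCenter (assumed value). The caught pass then matches the rule,
# so it would serve as a positive training example.
print(good_pass(15, 12))  # True
print(good_pass(5, 12))   # False: the teammate is too close to a0
```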
User Advice in Skill Transfer

There may be new skills in the target that cannot be learned from the source (e.g., shooting in BreakAway)
We allow users to add their own advice about these new skills
User advice simply adds to transfer advice
Skill Transfer to 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000), comparing:]
Standard RL
Skill Transfer from 2-on-1 BreakAway
Skill Transfer from 3-on-2 MoveDownfield
Skill Transfer from 3-on-2 KeepAway
Macro Transfer

Learn a strategy from the source task:
Find an action sequence that separates good games from bad games
Learn first-order rules to control transitions along the sequence

move(ahead) → pass(Teammate) → shoot(GoalPart)
Transfer via Demonstration

Target-task games 0 to ~100: execute the macro strategy; the agent learns an initial Q-function
Remaining games: perform standard RL; the agent adapts to the target task
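The two-phase schedule above can be sketched as a simple switch on the game number. The Macro and Agent classes here are stand-in stubs, not the actual system's classes, and the 100-game cutoff is the approximate figure from the slide.

```python
# Sketch of transfer via demonstration: for the first ~100 target-task games
# the agent executes the transferred macro strategy (learning an initial
# Q-function from those games), then switches to standard RL to adapt.
class Macro:
    def next_action(self, state):
        return "pass"           # placeholder macro step

class Agent:
    def policy(self, state):
        return "shoot"          # placeholder learned policy

def choose_action(game_number, state, macro, rl_agent, demo_games=100):
    if game_number < demo_games:
        return macro.next_action(state)   # demonstration phase
    return rl_agent.policy(state)         # standard RL phase

print(choose_action(0, None, Macro(), Agent()))    # macro phase
print(choose_action(150, None, Macro(), Agent()))  # RL phase
```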
Macro Transfer to 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000), comparing:]
Standard RL
Skill Transfer from 2-on-1 BreakAway
Macro Transfer from 2-on-1 BreakAway
MLN Transfer

Learn a Markov Logic Network to represent the source-task policy relationally
Apply the policy via demonstration in the target task

MLN Q-function: (state, action) → value
Markov Logic Networks

A Markov network models a joint distribution
A Markov Logic Network combines probability with logic:
    Template: a set of first-order formulas with weights
    Each grounded predicate in a formula becomes a node
    Predicates in a grounded formula are connected by arcs

Probability of a world: (1/Z) exp(Σᵢ Wᵢ Nᵢ)
MLN Q-function

Formula 1 (W1 = 0.75, N1 = 1 teammate):
IF distance(me, Teammate) < 15
AND angle(me, goalie, Teammate) > 45
THEN Q ∈ (0.8, 1.0)

Formula 2 (W2 = 1.33, N2 = 3 goal parts):
IF distance(me, GoalPart) < 10
AND angle(me, goalie, GoalPart) > 45
THEN Q ∈ (0.8, 1.0)

Probability that Q ∈ (0.8, 1.0):
exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
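Plugging in the slide's weights and grounding counts, the bin probability is just a logistic function of the weighted counts:

```python
import math

# Probability that Q falls in a bin, given formula weights W_i and the
# number of true groundings N_i of each formula in the current state:
# sigma(sum_i W_i * N_i), the logistic form from the slide.
def bin_probability(weights, counts):
    z = sum(w * n for w, n in zip(weights, counts))
    return math.exp(z) / (1 + math.exp(z))

# W1 = 0.75 with N1 = 1 teammate; W2 = 1.33 with N2 = 3 goal parts.
p = bin_probability([0.75, 1.33], [1, 3])
print(round(p, 3))  # about 0.991
```

With both formulas satisfied and positive weights, the MLN is highly confident that Q lies in the (0.8, 1.0) bin.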
Using an MLN Q-function

Q ∈ (0.8, 1.0): P1 = 0.75
Q ∈ (0.5, 0.8): P2 = 0.15
Q ∈ (0, 0.5):   P3 = 0.10

Q = P1 ● E[Q | bin1] + P2 ● E[Q | bin2] + P3 ● E[Q | bin3]

where E[Q | bin] is the Q-value of the most similar training example in the bin
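Reading a Q-value out of the MLN is then a probability-weighted sum over bins. The per-bin estimates below are assumed for illustration; on each query they would come from the most similar training example in that bin.

```python
# Expected Q-value from the MLN: sum over bins of
# P(bin) * E[Q | bin], using the slide's bin probabilities.
bin_probs = [0.75, 0.15, 0.10]      # P1, P2, P3 from the slide
bin_estimates = [0.9, 0.65, 0.25]   # assumed E[Q | bin] values

q = sum(p * e for p, e in zip(bin_probs, bin_estimates))
print(q)
```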
MLN Transfer to 3-on-2 BreakAway

[Chart: Probability of Goal (0–0.6) vs. Training Games (0–3000), comparing:]
MLN Transfer
Macro Transfer
Value-function Transfer
Standard RL
Conclusions

Advice and transfer can provide RL agents with knowledge that improves early performance
Relational knowledge is desirable because it is general and involves human-level reasoning
More detailed knowledge produces larger initial benefits, but is less widely transferable
Acknowledgements

DARPA grant HR0011-04-1-0007
DARPA grant HR0011-07-C-0060
DARPA grant FA8650-06-C-7606
NRL grant N00173-06-1-G002
Thank You