Multi-agent Reinforcement Learning in a Dynamic Environment
The research goal is to enable multiple agents to learn suitable behaviors in a dynamic environment using reinforcement learning.
We found that this approach can create cooperative behavior among the agents without any prior knowledge.
Footnote: work done by Sachiyo Arai and Katia Sycara.
Reinforcement Learning Approach
Overview:
- TD [Sutton 88], Q-learning [Watkins 92]: The agent can estimate a model of the state transition probabilities of E (the environment), if E has a fixed state transition probability (i.e., E is an MDP).
- Profit sharing [Grefenstette 88]: The agent can estimate a model of the state transition probabilities of E even though E does not have a fixed state transition probability.
- c.f. Dynamic programming: The agent needs a perfect model of the state transition probabilities of E.
Feature: the reward is not given immediately after the agent's action; usually it is given only after achieving the goal. This delayed reward is the only clue for the agent's learning (illustrated in the sketch below the figure).
[Figure: agent architecture. The agent consists of a State Recognizer, an Action Selector, a LookUp Table W(S, a), and a Learner; it receives input and reward from the environment E and sends actions back to E.]
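To make the delayed-reward setting concrete, here is a minimal Python sketch of collecting one episode; the toy dynamics and names are assumptions for illustration, not taken from this work:

```python
import random

def run_episode(policy, max_steps=50, goal_reward=100.0):
    """Collect one episode in the delayed-reward setting: the reward is
    0 at every step and is given only on reaching the goal."""
    state, history = "start", []
    for _ in range(max_steps):
        action = policy(state)
        next_state = random.choice(["start", "mid", "goal"])  # toy dynamics
        reward = goal_reward if next_state == "goal" else 0.0
        history.append((state, action, reward))
        if next_state == "goal":
            break
        state = next_state
    return history  # every reward is 0.0 except possibly the final one

episode = run_episode(lambda s: "move")  # a trivial one-action policy
```

The learner's entire task is to spread that single final reward back over the state-action pairs recorded in the episode.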
Our Approach: Profit Sharing Plan (PSP)
w_{n+1}(x_t, a_t) = w_n(x_t, a_t) + f(r_T, t)
Usually, a multi-agent environment is non-Markovian, because the transition probability from S_t to S_{t+1} can vary due to the agents' concurrent learning and perceptual aliasing.
PSP is robust against such non-Markovian environments, because PSP does not require the environment to have a fixed transition probability from S_t to S_{t+1}.
f: reinforcement function for temporal credit assignment.
[Rationality Theorem] To suppress ineffective rules:
∀t = 1, 2, ..., T:  L · Σ_{j=0}^{t} f_j < f_{t+1}
(L: the number of available actions at each time step.)
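Here is a minimal Python sketch of checking this condition for a candidate reinforcement function; the geometric sequence with decay ratio 1/(L+1) used below is one choice that satisfies the theorem, not necessarily the function used in this work:

```python
def satisfies_rationality(f, L):
    """Rationality Theorem check: L * sum_{j=0}^{t} f[j] < f[t+1] for all t,
    where f[t] is the reward assigned at time step t of the episode."""
    return all(L * sum(f[: t + 1]) < f[t + 1] for t in range(len(f) - 1))

# Assign rewards backward from the terminal reward r_T, shrinking by a
# factor of 1/(L+1) per step away from the goal (an illustrative choice).
L, T, r_T = 2, 4, 100.0
f = [r_T / (L + 1) ** (T - t) for t in range(T + 1)]
print(satisfies_rationality(f, L))                  # True
print(satisfies_rationality([100.0] * (T + 1), L))  # False: equal credit is irrational
```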
Example:
[Figure: an episode x_1, ..., x_t, x_{t+1}, ..., x_T = G. The real reward of each time step is 0 except r_T at the goal; the assigned reward f_t of each time step is distributed backward in time from r_T.]
r_T: reward at time T (the goal).
w_n: weight of the state-action pair after n episodes.
(x_t, a_t): state and action at time t of the n-th episode.
[Figure: a state S with a self-loop action a1 and an action a2 leading to the goal G with reward 100.]
Episode: (S, a1) - (S, a1) - (S, a1) - (S, a2) - (G), with assigned rewards r1, r2, r3, r4.
Irrational assignment: (r1 + r2 + r3) >= r4. E.g., r1 = r2 = r3 = r4 = 100 gives W(S, a1) > W(S, a2).
Rational assignment: (r1 + r2 + r3) < r4. E.g., r1 = 12, r2 = 25, r3 = 50, r4 = 100 gives W(S, a1) < W(S, a2).
In this environment, a1 should be reinforced less than a2.
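A short Python sketch of the PSP weight update w_{n+1}(x_t, a_t) = w_n(x_t, a_t) + f(r_T, t), replaying the episode above under both assignments (the numbers come from the slide; the code itself is an illustrative reconstruction):

```python
from collections import defaultdict

def profit_sharing_update(W, episode, assigned_rewards):
    """Add the assigned reward of each time step onto the weight of the
    state-action pair visited at that step."""
    for (state, action), r in zip(episode, assigned_rewards):
        W[(state, action)] += r
    return W

episode = [("S", "a1"), ("S", "a1"), ("S", "a1"), ("S", "a2")]

# Rational assignment: the goal-reaching a2 ends up with the larger weight.
W = profit_sharing_update(defaultdict(float), episode, [12, 25, 50, 100])
print(W[("S", "a1")], W[("S", "a2")])  # 87.0 < 100.0

# Irrational assignment: the useless self-loop a1 outweighs a2.
W = profit_sharing_update(defaultdict(float), episode, [100, 100, 100, 100])
print(W[("S", "a1")], W[("S", "a2")])  # 300.0 > 100.0
```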
Our Experiments
[Figure: hunters and prey; initial state and goal state of each game.]
1. Pursuit game: 4 hunters and 1 prey in a torus grid world. Requires the agents' cooperative work to capture the prey.
2. Pursuit game: 3 hunters and multiple prey in a torus triangular world. The required cooperative work includes task scheduling to capture the prey. [Figure: initial state, first goal state, and second goal state.]
3. Neo "Block World" domain: 3 groups of evacuees and 3 shelters of varying degrees of safety in a grid world. The required cooperative work includes conflict resolution and information sharing to evacuate.
Experiment 1: 4 Hunters and 1 Prey Pursuit Game
Objective: to confirm that cooperative behavior emerges through Profit Sharing.
Hypothesis: cooperative behaviors such as result sharing, task sharing, and conflict resolution will emerge.
Setting: torus grid world, size 15x15; sight size of each agent 5x5.
- Each hunter modifies its own lookup table by PSP independently.
- The hunters and the prey are located randomly at the initial state of each episode.
- The hunters learn by PSP, and the prey moves randomly.
Modeling: each hunter consists of a State Recognizer, an Action Selector, a LookUp Table, and a PSP module as a learner.
[Figure: hunter-agent architecture. Each hunter consists of a State Recognizer, an Action Selector, a LookUp Table W(S, a), and a Profit Sharing module; it receives input and reward and outputs an action. Four hunter agents share the environment with one prey.]
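Here is a sketch of how each hunter's 5x5 sight on the 15x15 torus could be extracted; the grid encoding is an assumption for illustration:

```python
import numpy as np

def local_view(grid, center, sight=5):
    """Return the sight x sight observation centered on the agent;
    indices wrap around the edges, since the world is a torus."""
    n, half = grid.shape[0], sight // 2
    rows = [(center[0] + d) % n for d in range(-half, half + 1)]
    cols = [(center[1] + d) % n for d in range(-half, half + 1)]
    return grid[np.ix_(rows, cols)]

world = np.zeros((15, 15), dtype=int)
world[0, 0] = 2                       # prey encoded as 2 (assumed encoding)
print(local_view(world, (14, 14)))    # the prey is visible across the wrap
```

The wraparound is what makes every position equivalent on the torus, so a hunter can only locate the prey and the other hunters relative to itself.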
Experiment 1: Results
1. Emerged Behavior
[Figure: traces of Hunter 1, Hunter 2, Hunter 3, Hunter 4, and the prey, with numbered steps showing the hunters surrounding the prey.]
2. Convergence
[Figure: learning curves for 4 hunters and 1 prey. X axis: number of episodes (0-100,000); Y axis: required steps to capture the prey (0-1,000). Curves: PSP with environment sizes 15x15, 10x10, 7x7, and 5x5, plus Q-learning (QL) with environment size 10x10.]
3. Discussion
1. A hunter takes advantage of the other hunters as landmarks to capture the prey. (Result sharing)
2. Each hunter plays its own role in capturing the prey. (Task sharing)
3. After learning, no deadlock or conflict situations occur when each hunter follows its own strategy. (Conflict resolution)
Experiment 2: 3 Hunters and Multiple Prey Pursuit Game
Objective: to confirm that task-scheduling knowledge emerges through PSP in an environment with conjunctive multiple goals.
Which proverb holds for reinforcement learning agents?
Proverb 1: He who runs after two hares will catch neither.
Proverb 2: Kill two birds with one stone.
Hypothesis: if the agents know the locations of the prey and the other agents, they realize proverb 2, but sensory limitations make them behave like proverb 1.
Setting: torus triangular world with 7 triangles on each edge.
- Sight size: 5 triangles on each edge, or 7 triangles on each edge.
- The prey move randomly.
- Each hunter modifies its own lookup table by PSP independently.
Modeling: each hunter consists of a State Recognizer, an Action Selector, a LookUp Table, and a PSP module as a learner.
[Figure: hunter-agent architecture, as in Experiment 1: State Recognizer, Action Selector, LookUp Table W(S, a), and Profit Sharing module. Three hunter agents share the environment with two prey.]
Comparison "Required steps to Capture the 1st Prey "
0
10
20
30
40
50
60
70
80
90
100
0 200000 400000 600000 800000 1000000
Number of Episodes
Re
qu
ire
d S
tep
s t
o C
ap
ture
th
e 1
st
Pre
y
Steps to Capture the 1st Prey in H3P1 Env.Steps to Capture the 1st Prey in H3P2 Env.Steps to Capture the 1st Prey in H3P3 Env.
1. ConvergenceExperiment 2 : Results
Comparison "Required Steps to Capture One Prey"
0
10
20
30
40
50
60
70
80
90
100
0 200000 400000 600000 800000 1000000Number of Episodes
Re
qu
ire
d S
tep
s to
Ca
ptu
re O
ne
Pre
y Steps to Capture the 1st Prey in H3P3 Env.Additional Steps to Capture the 2nd Prey in H3P3 Env.Additional Steps to Capture the 3rd Prey in H3P3 Env.c.f. Steps to Capture the 1st Prey in H3P1 Env.
2. Discussion
1. Without a global scheduling mechanism, the hunters capture the prey in a reasonable order (e.g., the closest prey first).
2. The larger the number of prey in the environment, the more steps are required to capture the 1st prey, because it becomes more difficult to coordinate each hunter's choice of target. This implies that the hunters' targets are scattered. (Proverb 1)
3. Fewer steps are required to capture the last prey in the multiple-prey environment than to capture the 1st prey in the single-prey environment. This implies that the hunters pursue multiple prey simultaneously. (Proverb 2)
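One way to check the "closest prey first" ordering from point 1 offline is to compare torus distances; a rough Python sketch, using a square torus for simplicity even though the actual world is triangular:

```python
def torus_distance(a, b, size=7):
    """Shortest Manhattan-style distance between two cells on a torus."""
    dx = min(abs(a[0] - b[0]), size - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), size - abs(a[1] - b[1]))
    return dx + dy

def expected_capture_order(hunters, preys, size=7):
    """Rank the prey by their summed distance to the hunters, i.e. the
    order a 'closest prey first' strategy would capture them in."""
    return sorted(preys,
                  key=lambda p: sum(torus_distance(h, p, size) for h in hunters))

print(expected_capture_order([(0, 0), (1, 1), (2, 0)], [(2, 2), (5, 5)]))
```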
Experiment 3: Neo "Block World" Domain -No.1-
Objective: to confirm that opportunistic knowledge emerges through PSP in an environment with disjunctive multiple goals. When there is more than one alternative for obtaining rewards in the environment, can the agents behave reasonably?
Hypothesis: if the agents correctly know the locations of the safe places, each agent can select the best place to evacuate to, but sensory limitations make them move back and forth in confusion.
Setting: graph world, size 15x15; sight size 7x7.
- 2 groups of evacuees, 2 shelters.
- Each group of evacuees learns by PSP independently.
- The groups and the shelters are located randomly at the initial state of each episode.
Input of a group: the agent's own 7x7 input; no input sharing.
Output of a group: {walk-north, walk-south, walk-east, walk-west, stay}.
Reward:
- Each group gets a reward only when it moves into a shelter.
- The amount of reward depends on the degree of the shelter's safety.
- Shelters have unlimited capacity.
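A minimal sketch of the reward rule above, assuming a base reward of 100 and hypothetical shelter names; the 1 : 0.01 safety ratio is one of the settings that appears in the results table:

```python
SAFETY = {"shelter1": 1.0, "shelter2": 0.01}  # 1 : 0.01 reward-ratio setting

def reward(entered_shelter, base=100.0):
    """A group is rewarded only when it moves into a shelter, and the
    amount scales with that shelter's degree of safety."""
    if entered_shelter is None:
        return 0.0                 # no reward anywhere else (delayed reward)
    return base * SAFETY[entered_shelter]

print(reward("shelter1"), reward("shelter2"), reward(None))  # 100.0 1.0 0.0
```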
Modeling: each group consists of a State Recognizer, an Action Selector, a LookUp Table, and a PSP module as a learner.
Experiment 3: Results
1. Convergence
[Figure: Neo domain preliminary experiment. X axis: number of episodes (0-50,000); Y axis: required steps to the shelter (0-100). Curves: degree of safety very different vs. degree of safety the same.]
Required number of steps to move to the shelter:

Reward ratio (Shelter1 : Shelter2) | After 5,000 episodes | After 10,000 episodes | After 50,000 episodes
1 : 1                              | 15.3 (16.8)          | 10.3 (10.9)           | 8.5 (7.5)
1 : 0.01                           | 9.5 (6.6)            | 8.7 (4.7)             | 7.3 (3.1)
2. Discussion
1. The agents learned to obtain the larger amount of reward: if the reward amounts of Shelter1 and Shelter2 are the same, they learned stochastic policies; if the difference between the amounts is large, they learned deterministic policies that appear nearly optimal.
2. In the latter case (large reward difference), the other agent works as a landmark for finding the shelter.
[Figure legend: available path, unavailable path, safe node, a group of evacuees.]
Experiment 4: Neo "Block World" Domain
Objective: to examine the effects of sharing sensory information on the agents' learning and behaviors.
Hypothesis: sharing sensory input increases the size of the state space and the time required to converge, but the agents' policies become closer to optimal than those of agents without information sharing, because sharing reduces the agents' perceptual aliasing problem.
Setting: graph world, size 15x15; sight size 7x7.
- 3 groups of evacuees, 3 shelters.
- Each group of evacuees learns by PSP independently.
- The groups are located randomly at the initial state of each episode.
Input of a group: the agent's own 7x7 input, plus information from the blackboard.
Output of a group: {walk-north, walk-south, walk-east, walk-west, stay}.
Reward:
- Each group gets a reward only when it moves into a shelter.
- The degree of safety is the same for each shelter.
- Rewards are not shared among the agents.
- Shelters have unlimited capacity.
Experiment 4: Neo "Block World" Domain -No.2-
Modeling (Model 1): each group consists of a State Recognizer, an Action Selector, a LookUp Table, and a PSP module as a learner. The agents share their sensory input by means of a blackboard and combine it with their own input.
[Figure: agent architecture with sharing. The LookUp Table W_n(O, a) has size m*l, where the observation at time t is O_t = {O_1, O_2, ..., O_m} and the action a_t is chosen from A_t = {a_1, a_2, ..., a_l} (t = 1, ..., T). Profit Sharing applies f(R_n, O_j) (j = 1, ..., T) using the reward R_n received at t = T. The other agents' observations reach each agent through the blackboard.]
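Here is a sketch of how blackboard sharing could enlarge the state key used to index W(O, a); the representation is an assumption for illustration:

```python
def state_key(own_obs, blackboard=None):
    """State used to index the lookup table.  Without sharing, the key is
    the agent's own observation (generalized, but aliased); with sharing,
    it also includes the other agents' posted observations, which
    multiplies the number of distinguishable states."""
    if blackboard is None:
        return (own_obs,)
    return (own_obs, tuple(sorted(blackboard.items())))

blackboard = {"group2": "near-shelter-east", "group3": "wall-north"}
print(state_key("open-area"))              # non-sharing key
print(state_key("open-area", blackboard))  # sharing key: finer-grained state
```

The larger key space is exactly the trade-off stated in the hypothesis: slower convergence early on, but less perceptual aliasing once learned.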
Experiment 4: Results
1. Convergence
[Figure: comparison of sharing vs. without sharing in the initial stage of learning. X axis: number of episodes (0-1,000); Y axis: required steps of a group to the shelter (0-400). Curves: with sharing, without sharing.]
[Figure: comparison of sharing vs. without sharing in the final stage of learning. X axis: number of episodes (99,000-100,000); Y axis: required steps of an agent to the shelter (0-100). Curves: with sharing, without sharing.]
2. Discussion
1. In the initial stage: the required steps to the shelter decrease faster for a non-sharing agent than for a sharing agent. A non-sharing agent seems able to generalize over states and behave rationally even in unexperienced states. A sharing agent, on the other hand, must experience the discriminated state space, which is larger than the generalized one, so it takes longer to reduce the number of steps.
2. In the final stage: a sharing agent performs better than a non-sharing agent. A non-sharing agent seems to overgeneralize the state space and be confused by aliases, whereas a sharing agent seems to refine its policy successfully and is hard to confuse.
Future Works
- Development of a structured mechanism of reinforcement learning. Hypothesis: a structured mechanism facilitates knowledge transfer.
  - The agent learns the appropriate generalization level of the state space.
  - The agent learns the appropriate amount of communication with other agents.
- Competitive learning: agents compete for resources. We need to resolve the structural credit assignment problem.
Conclusion
- Agents learn suitable behaviors in a dynamic environment that includes multiple agents and goals, provided there is no aliasing caused by sensory limitations, the concurrent learning of other agents, or the existence of multiple sources of reward.
- A strict division of the state space causes state explosion and worse performance in the early stage of learning.