Cooperative Q-Learning
Lars Blackmore and Steve Block
Tan, M. "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents." Proceedings of the 10th International Conference on Machine Learning, 1993.
Ahmadabadi, M. N.; Asadpour, M. "Expertness Based Cooperative Q-Learning." IEEE Transactions on Systems, Man and Cybernetics, Part B, Volume 32, Issue 1, Feb. 2002, Pages 66–76.
Eshgh, S. M.; Ahmadabadi, M. N. "An Extension of Weighted Strategy Sharing in Cooperative Q-Learning for Specialized Agents." Proceedings of the 9th International Conference on Neural Information Processing, 2002, Volume 1, Pages 106–110.
Overview
• Single agent reinforcement learning
  – Markov Decision Processes
  – Q-learning
• Cooperative Q-learning
  – Sharing state, sharing experiences and sharing policy
• Sharing policy through Q-values
  – Simple averaging
• Expertness based cooperative Q-learning
  – Expertness measures and weighting strategies
  – Experimental results
• Expertness with specialised agents
  – Scope of specialisation
  – Experimental results
Markov Decision Processes
• Goal: find the optimal policy π*(s) that maximises lifetime reward

[Diagram: grid world with goal state G; reaching G earns reward 100, other transitions earn 0]

• Framework (a minimal code sketch follows):
  – States S
  – Actions A
  – Rewards R(s,a)
  – Probabilistic transition function T(s,a,s')
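Below is a minimal sketch of this (S, A, R, T) framework for a toy 1-D grid world; the layout and names are illustrative assumptions, not taken from the talk.

```python
STATES = [0, 1, 2, 3]              # S: grid cells; cell 3 is the goal G
ACTIONS = [-1, +1]                 # A: move left or right

def reward(s, a):                  # R(s, a): 100 for reaching the goal, else 0
    return 100 if s + a == 3 else 0

def transition(s, a):              # T(s, a, s'): deterministic here for brevity;
    return min(max(s + a, 0), 3)   # a probabilistic T would sample s' instead
```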
Reinforcement Learning
• Want to find π* through experience – Reinforcement Learning
  – Intuitive approach; similar to human and animal learning
  – Use some policy π for motion
  – Converge to the optimal policy π*
• An algorithm for reinforcement learning…
Q-Learning
• Define Q*(s,a):
– “Total reward if an agent in state s takes action a, then acts optimally at all subsequent time steps”
• Optimal policy: π*(s) = argmax_a Q*(s,a)
• Q(s,a) is an estimate of Q*(s,a)
• Q-learning motion policy: π(s) = argmax_a Q(s,a)
• Update Q recursively:

  Q(s,a) ← r + γ max_a' Q(s',a'),  0 ≤ γ ≤ 1
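This backup is a one-liner in code. A minimal sketch, reusing the toy grid-world helpers (`reward`, `transition`, `ACTIONS`) assumed in the earlier MDP sketch:

```python
from collections import defaultdict

GAMMA = 0.9                        # discount factor gamma, 0 <= gamma <= 1
Q = defaultdict(float)             # tabular Q(s,a) estimates, initially 0

def q_update(s, a):
    """One backup: Q(s,a) <- r + gamma * max over a' of Q(s',a')."""
    r, s_next = reward(s, a), transition(s, a)
    Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    return s_next                  # continue the trial from s'
```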
Q-learning
• Update Q recursively:

  Q(s,a) ← r + γ max_a' Q(s',a')

[Diagram: the agent takes action a in state s and arrives in state s']
Q-learning
• Update Q recursively:

  Q(s,a) ← r + γ max_a' Q(s',a')

• Worked example (γ = 0.9):
  – Q(s',a1') = 25 and Q(s',a2') = 50, so max_a' Q(s',a') = 50
  – r = 100
  – Q(s,a) ← 100 + 0.9 × 50 = 145
Q-Learning
• Define Q*(s,a):
– “Total reward if agent is in state s, takes action a, then acts optimally at all subsequent time steps”
• Optimal policy: π*(s) = argmax_a Q*(s,a)
• Q(s,a) is an estimate of Q*(s,a)
• Q-learning motion policy: π(s) = argmax_a Q(s,a)
• Update Q recursively:

  Q(s,a) ← r + γ max_a' Q(s',a')

• Optimality theorem:
  – "If each (s,a) pair is updated an infinite number of times, Q converges to Q* with probability 1"
Cooperative Q-Learning
• An example situation:
  – Mobile robots
• Why cooperate?
• Learning framework (a rough sketch follows):
  – Individual learning for t_i trials
  – Each trial starts from a random state and ends when the robot reaches the goal
  – Next, all robots switch to cooperative learning
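A rough sketch of this two-phase protocol; `run_trial` and `share_q_tables` are hypothetical helpers standing in for one Q-learning trial and for whichever sharing rule is in use:

```python
def run_protocol(agents, individual_trials, cooperative_trials):
    for agent in agents:                   # phase 1: independent learning
        for _ in range(individual_trials):
            agent.run_trial()              # random start, ends at the goal
    for _ in range(cooperative_trials):    # phase 2: cooperative learning
        for agent in agents:
            agent.run_trial()
        share_q_tables(agents)             # e.g. averaging or weighted sharing
```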
Cooperative Q-learning
• How should information be shared?
• Three fundamentally different approaches:
  – Expanding state space
  – Sharing experiences
  – Sharing policy through Q-values
• Methods for sharing additional state information and experiences are straightforward
  – These showed some improvement in testing
• Best method for sharing Q-values is not obvious
  – This area offers the greatest challenge and the greatest potential for innovation

Tan, M. "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents." Proceedings of the 10th International Conference on Machine Learning, 1993.
Sharing Q-values
• An obvious approach?
  – Simple averaging: each agent replaces its Q-table with the mean over all n agents (a code sketch follows)

  Q_i(s,a) ← (1/n) Σ_{j=1..n} Q_j(s,a)

• This was shown to yield some improvement
• What are some of the problems?
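A minimal sketch of the averaging step, with each agent's Q-table represented as a dict keyed by (state, action):

```python
def share_by_averaging(q_tables):
    """Replace every agent's Q(s,a) with the mean over all n agents."""
    keys = set().union(*q_tables)          # all (s, a) pairs any agent has seen
    n = len(q_tables)
    avg = {k: sum(q.get(k, 0.0) for q in q_tables) / n for k in keys}
    for q in q_tables:
        q.clear()
        q.update(avg)                      # every agent now holds the same table
```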
Problems with Simple Averaging
• All agents have the same Q table after sharing and hence the same policy
  – Different policies allow agents to explore the state space differently
• Convergence rate may be reduced
Problems with Simple Averaging
• Convergence rate may be reduced
  – Without co-operation:

| Trial # | Agent 1 Q(s,a) | Agent 2 Q(s,a) |
|---------|----------------|----------------|
| 0       | 0              | 0              |
| 1       | 10             | 0              |
| 2       | 10             | 10             |
| 3       | 10             | 10             |
Problems with Simple Averaging
• Convergence rate may be reduced
  – With simple averaging:

| Trial # | Agent 1 Q(s,a) | Agent 2 Q(s,a) |
|---------|----------------|----------------|
| 0       | 0              | 0              |
| 1       | 5              | 5              |
| 2       | 7.5            | 7.5            |
| 3       | 8.625          | 8.625          |
| …       | …              | …              |
|         | 10             | 10             |
Problems with Simple Averaging
• All agents have the same Q table after sharing and hence the same policy
  – Different policies allow agents to explore the state space differently
• Convergence rate may be reduced
  – Highly problem specific
• Slows adaptation in dynamic environment
• Overall performance is task specific
Expertness
• Idea: value more highly the knowledge of agents who are 'experts'
  – Expertness based cooperative Q-learning
• New Q-sharing equation (sketched in code below):

  Q_i(s,a) ← Σ_{j=1..n} W_ij Q_j(s,a)

• Agent i assigns an importance weight W_ij to the Q data held by agent j
• These weights are based on the agents' relative expertness values e_i and e_j

Ahmadabadi, M. N.; Asadpour, M. "Expertness Based Cooperative Q-Learning." IEEE Transactions on Systems, Man and Cybernetics, Part B, Volume 32, Issue 1, Feb. 2002, Pages 66–76.
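A minimal sketch of the weighted sharing step; W is assumed to be an n×n matrix where W[i][j] is the weight agent i assigns to agent j:

```python
def share_weighted(q_tables, W):
    """Q_i <- sum over j of W[i][j] * Q_j, applied to every (s, a) pair."""
    keys = set().union(*q_tables)
    new_tables = [{k: sum(W[i][j] * q.get(k, 0.0)
                          for j, q in enumerate(q_tables))
                   for k in keys}
                  for i in range(len(q_tables))]
    for q, new in zip(q_tables, new_tables):
        q.clear()
        q.update(new)                      # agent i now holds its weighted blend
```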
Expertness Measures
• Need to define the expertness of agent i
  – Based on the reinforcement signals agent i has received
• Various definitions (a code sketch follows):
  – Algebraic sum (Nrm): e_i = Σ_t r_i(t)
  – Absolute value (Abs): e_i = Σ_t |r_i(t)|
  – Positive (P): e_i = Σ_t r_i(t), summing positive signals only
  – Negative (N): e_i = Σ_t |r_i(t)|, summing negative signals only
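A sketch of the four measures over an agent's reward history; the exact positive/negative forms are assumptions consistent with the names above:

```python
def expertness(rewards, measure="Abs"):
    """Expertness of one agent from its sequence of reinforcement signals."""
    if measure == "Nrm":                   # algebraic sum of all signals
        return sum(rewards)
    if measure == "Abs":                   # sum of absolute values
        return sum(abs(r) for r in rewards)
    if measure == "P":                     # rewards (positive signals) only
        return sum(r for r in rewards if r > 0)
    if measure == "N":                     # magnitude of punishments only
        return sum(-r for r in rewards if r < 0)
    raise ValueError(f"unknown measure: {measure}")
```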
Weighting Strategies
• How do we come up with weights based on the expertness values?
• Alternative strategies (both are sketched in code below):
  – 'Learn from all': every agent's knowledge counts in proportion to its expertness:

    W_ij = e_j / Σ_{k=1..n} e_k

  – 'Learn from experts': agent i learns only from agents more expert than itself, in proportion to the expertness difference, keeping the remaining weight for its own table:

    W_ij = (e_j − e_i) / Σ_{k=1..n} (e_k − e_i)   if e_j > e_i
    W_ii = 1 − Σ_{j≠i} W_ij
    W_ij = 0   otherwise
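A sketch of both strategies; the residual self-weight in `learn_from_experts` follows the reconstruction above and should be read as an assumption:

```python
def learn_from_all(e, i):
    """W_ij = e_j / sum_k e_k: weight every agent by its expertness."""
    total = sum(e)
    return [ej / total for ej in e]

def learn_from_experts(e, i):
    """Weight only agents more expert than agent i; keep the rest on i itself."""
    n = len(e)
    W = [0.0] * n
    denom = sum(ek - e[i] for ek in e)     # sum_k (e_k - e_i)
    if denom > 0:
        for j in range(n):
            if e[j] > e[i]:
                W[j] = (e[j] - e[i]) / denom
    W[i] = 1.0 - sum(W[j] for j in range(n) if j != i)   # residual self-weight
    return W
```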
Experimental Setup
• Mobile robots in hunter-prey scenario
• Individual learning phase, in one of two settings:
  1. All robots carry out the same number of trials
  2. Robots carry out different numbers of trials
• Followed by cooperative learning
• Parameters to investigate:
  – Cooperative learning vs individual
  – Similar vs different initial expertise levels
  – Different expertness measures
  – Different weight assigning mechanisms
• Performance measured by number of steps
Results

[Bar charts: panels 'Equal Experience' and 'Different Experience'; percentage improvement relative to individual learning (−55% to +15%) for the expertness measures SA, Nrm, Abs, P and N, comparing 'Learning from All' with 'Learning from Experts'.]
Results

[Bar chart: 'Different Experience'; percentage improvement relative to individual learning (0–16%) for the measures Abs, P and N, comparing 'Learning from All' with 'Learning from Experts'.]
Results

[Bar charts: panels 'Different Experience (Random Prey)' and 'Different Experience (Intelligent Prey)'; percentage improvement relative to individual learning (0–16%) for the measures Abs, P and N, 'Learning from Experts' only.]
Conclusions
• Without expertness measures, cooperation is detrimental
  – Simple averaging shows a decrease in performance
• Expertness based cooperative learning is shown to be superior to individual learning
  – Only true when agents have significantly different expertness values (necessary but not sufficient)
• Expertness measures Abs, P and N show the best performance
  – Of these three, Abs provides the best compromise
• 'Learning from Experts' weighting strategy shown to be superior to 'Learning from All'
What about this situation?
• Both agents have accumulated the same rewards and punishments
• Which is the most expert?

[Diagram: grid world with several reward locations R]
Specialised Agents
• An agent may have explored one area a lot but another area very little
  – The agent is an expert in one area but not in another
• Idea – specialised agents:
  – Agents can be experts in certain areas of the world
  – A learnt policy is more valuable if the agent is more expert in that particular area

Eshgh, S. M.; Ahmadabadi, M. N. "An Extension of Weighted Strategy Sharing in Cooperative Q-Learning for Specialized Agents." Proceedings of the 9th International Conference on Neural Information Processing, 2002, Volume 1, Pages 106–110.
Specialised Agents
• Scope of specialisation:
  – Global
  – Local
  – State

[Diagram: grid world with several reward locations R]
Specialised Agents
• New Q-sharing equation (sketched in code below):

  Q_i(s,a) ← Σ_{j=1..n} W_ij^k Q_j(s,a),  for states s in region k

• Agent i assigns an importance weight W_ij^k to the Q data held by agent j, valid for a region k
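A sketch of region-scoped sharing; `region_of` is an assumed helper mapping a state to its region k, and W[i][j][k] holds the region-specific weights:

```python
def share_weighted_by_region(q_tables, W, region_of):
    """Q_i(s,a) <- sum_j W[i][j][k] * Q_j(s,a), with k the region of state s."""
    keys = set().union(*q_tables)
    new_tables = []
    for i in range(len(q_tables)):
        new = {}
        for (s, a) in keys:
            k = region_of(s)               # scope of specialisation for state s
            new[(s, a)] = sum(W[i][j][k] * q.get((s, a), 0.0)
                              for j, q in enumerate(q_tables))
        new_tables.append(new)
    for q, new in zip(q_tables, new_tables):
        q.clear()
        q.update(new)
```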
Experimental Setup
• Mobile robots in a grid world
• World is approximately segmented into three regions by obstacles
  – One goal per region
• Individual learning followed by cooperative learning as before

[Diagram: grid world divided into three regions, each containing reward locations R]
• Performance measured by number of steps to reach a goal.
Results

[Bar chart: percentage improvement relative to individual learning (−20% to +80%) for the scopes of specialisation Global, Local and State, with one bar per expertness measure (Abs, P, N).]
Overall Conclusions
• Expertness based cooperative learning without specialised agents can improve performance but can also be detrimental
• Cooperative learning with specialised agents greatly improved performance
• Correct choice of expertness measure is crucial
  – Test case highlights robustness of Abs to the problem-specific nature of reinforcement signals
Future Directions
• Communication errors
  – Very sensitive to corrupted information
  – Can we distinguish bad advice?
• What are the costs associated with cooperative Q-learning?
  – Computation
  – Communication
• How can the scope of specialisation be selected?
  – Automatic scoping
  – Dynamic scoping
• Dynamic environments
  – Discounted expertness
Any questions …