

Learning the Point Gathering Task Using Shared Value Function in Mobile Robots

Sriram Raghavan Network Systems Laboratory

Dept. of Computer Science and Engg. Indian Institute of Technology Madras

[email protected]

Raghavan S V Network Systems Laboratory

Dept. of Computer Science and Engg. Indian Institute of Technology Madras

[email protected]

Abstract— Managing the actions of several agents to perform tasks that require coordination and cooperation poses significant research challenges. One such challenge is to synchronize the agents’ view of the system to help them take the ‘right’ actions. In this paper, we propose an algorithm called MRCPG (Mobile Robot Coordination Point Gathering Algorithm) for coordinating the actions of a team of mobile robots. The aim is to gather these robots at a particular location in a 2-dimensional plane, a location that is determined only during execution. The robots are randomly deployed in the plane and achieve the goal by communicating periodically. In addition, we impose a Reinforcement Learning framework, and the robots learn a Shared Value Function (SVF) based on the scalar rewards received. The SVF is used to select the best possible action in each state until at least one robot reaches the goal; a Reach-distance heuristic is then used to direct the remaining robots to the goal. The algorithm was analyzed through simulations for up to 5 robots, and the analysis indicates that communication helped robots complete the task significantly faster than when they acted independently, measured using the path-length of the first robot to reach the goal as the metric. We also observed that increasing team size enhances the effect of communication and hastens task completion.

Keywords- Multi-Agent Robot Systems, Point Gathering Task, Communication, Coordination, Shared Value Function, Multi-Agent Reinforcement Learning

I. INTRODUCTION

The multiagent point gathering task is a classic example of a coordination task that illustrates the usefulness of communication and its role in enhancing the learning rate of an intelligent multiagent system. This paper discusses robots in the context of multiple agents that are required to assemble at a point in a 2-D plane that is not fixed a priori. These robots are autonomous, anonymous, and capable of direct communication with each other. Owing to their autonomous nature, the coordination mechanisms employed to perform a given task must be decentralized, i.e., no central control may be used. Gathering is one of the basic interaction primitives in systems of autonomous mobile robots, and has been studied in both robotics and artificial intelligence. The point gathering task has been formally defined in [7] as follows:

Definition 1: Given n robots r1, r2, …, rn, arbitrarily placed in the plane, with no two robots at the same position, make them gather at one point in a finite number of cycles.

In most cases, the problem has been approached from an experimental point of view: algorithms were designed using heuristics [2, 4, 5, 6, 10, 11] and then tested either by means of computer simulations or with real robots. Limited research has been reported on providing theoretical proofs of correctness for such algorithms or on analyzing the relationships between the task, the robot capabilities, and the robots' knowledge of the environment. In recent times, concerns about the computability and complexity of coordination problems have motivated algorithmic investigations, and the problems have also been approached from a computational point of view [2, 3, 5, 9].

The techniques employed to solve the problem must clearly be derived from the capabilities of the mobile robots intended for such a task. Pre-existing literature has identified a very weak model of autonomous robots [5, 8, 11, 12]: the robots are anonymous (i.e., identical), have a common coordinate system, and the environment is only partially observable to each robot (i.e., each robot has only a local view of the environment). It has been shown in [9] that such a simple robot model is not sufficient to solve the point gathering task; the problem is unsolvable under it. It is hence essential that the robots be empowered with some additional capability to solve the task in finite time. In this paper, a robot is permitted to communicate with the others in a periodic fashion.

Since the robots can communicate periodically, they share information and experiences among themselves, and this is incorporated by means of a shared value function (SVF). While all robots share this function, when a decision is to be taken, a robot observes only the portion of this utility function that is pertinent to its state. This ensures that memory requirements are not a major concern as long as the robots are able to store and retrieve this information from some shared source.

This paper is organized as follows. In Section 2, we survey the forerunners in the field of coordination in mobile robots and, more specifically, the gathering task; this section also highlights how our paper differs from existing literature. In Section 3, we describe the robot model under which the algorithm has been designed. In Section 4, we describe the MRCPG algorithm. In Section 5, we describe the experiments performed, and in Section 6 we analyze the results obtained and draw the main inferences. In Section 7, we conclude by providing a summary of the work and outlining the scope for future work.

II. FORERUNNERS

Coordinating a team of mobile robots for a task involves overcoming several challenges, and a substantial body of work has been reported. Gathering mobile robots at a point not fixed a priori in space often requires the robots to understand what is happening globally so that each robot can optimize its resources, both computational and physical. The original version of this problem was first studied in [1, 2], where the authors analyze the problem of gathering several small anonymous, autonomous, and asynchronous robots at a point on the 2-dimensional plane. Several algorithms have also been proposed [2, 4, 6, 10, 11] to solve different versions of the gathering problem. These works also treat the robots as small, anonymous, autonomous, and asynchronous, but assume that the robots can view the plane whenever they wake up, recognize the locations of the other robots, and detect multiplicity at a particular location. Typically, these algorithms use different heuristics to detect multiplicity and then use a deterministic algorithm to direct movement towards the point of multiplicity. While this problem has been solved for up to 5 robots in the plane for various start locations, a generalized algorithm is yet to be proposed as the number of robots and the plane size scale upwards.

The work in [7] assumes that the robots are autonomous, anonymous, and oblivious (i.e., they have no memory of their past actions) but capable of observing the positional information of the other robots when required. The robots also have a duty cycle of sleep, wake, compute, and move. Based on the observed positions, a new destination is computed and the robot moves to that location deterministically. Having arrived at the required destination, the robot sleeps for an indefinite period. The process is repeated each time the robot wakes up, until the goal is reached.

Balch and Arkin [15, 16] have studied formation and navigation problems in multi-robot teams. In particular, in [17] the problem of specifying the behavior for the navigation of a mobile unit is analyzed, and results of both computer simulations and real experiments are reported. In [16], the approach is extended to multi-robot teams that navigate the environment while maintaining particular formations; in particular, the cases of a line, column, diamond, and wedge are examined. In their study, the authors assumed that the path along which the group of robots has to move is known in advance to every unit.

A similar problem is studied in [18], where the author derives equations describing navigational strategies for robots moving in formation and following the movement described by one (or more) leader. In the studied framework, the robots have identities; hence their positions in the formation are fixed. Moreover, in order for the i-th robot to compute its position at time t, it has to know the position of either the (i-1)-th robot or the leader at time t. Hence, some degree of synchrony has to be introduced in order to implement these strategies.

In this paper, we adopt a similarly weak model for the robots but assume that the robots execute their actions from start until they reach the goal. While there are several algorithms to solve the point gathering task, our algorithm emphasizes the importance of coordination through communication. Furthermore, this paper employs a distributed multiagent reinforcement learning framework to enable the robots to learn to pick the right actions while coordinating. The algorithm promotes quick completion of the task by tracking the path length to the goal as the metric. The algorithm also does not permit multiplicity at any location other than the final location. In the context of the task, congregation at a point other than the goal is considered sub-optimal, as it restricts the robot movements, and must be avoided.

III. MOBILE ROBOT MODEL

A robot is a mobile computational unit provided with sensors, and it is viewed as a point in the plane. The robots are anonymous, meaning that they are a priori indistinguishable by their appearance and do not have any unique identifiers that can be used during the computation. In this work, we assume that each robot shares the same origin and coordinate system with the rest of the robot team (e.g., Cartesian). Once activated, the sensors of a robot can sense the robot’s current position within that common coordinate system (by means of GPS) and also estimate the future positions for each possible action in that location. This forms the current local view of the robot. Hence, the state s of a robot is described by its positional information, and its actions are the directions in which the robot is capable of moving from that state. The robot then calculates its destination (i.e., computes the most favorable direction and hence the new position) according to the value function Q(s, a). The robot then executes the action and moves to the new point; if the goal location is reached by the current movement, the robot terminates its actions and communicates this location to the other robots. For the sake of simplicity, we assume that all robots are within communication range at all times. All actions performed by a robot take finite time; specifically, in our model, each action requires one unit of simulation time to complete. All actions of a robot are deterministic, except in cases where an action could take the robot out of the grid; in such cases, no action is performed and the robot remains stationary. Finally, the robots have direct communication capability; they communicate every action chosen and the corresponding scalar reward obtained.
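To make the model concrete, the following minimal Python sketch illustrates this state-action representation, assuming a 12 × 12 grid as in Section V. The Robot class, the action names, and the boundary handling shown here are our own illustrative choices, not the implementation used in the experiments.

GRID_SIZE = 12  # the simulation in Section V uses a 12 x 12 grid

# The four cardinal actions and their effect on an (x, y) grid position.
ACTIONS = {
    "north": (0, 1),
    "south": (0, -1),
    "east": (1, 0),
    "west": (-1, 0),
}

class Robot:
    def __init__(self, x, y):
        # The state s is simply the robot's position in the common coordinate system.
        self.x, self.y = x, y

    def next_position(self, action):
        # Estimate the position reached by 'action' from the current state.
        dx, dy = ACTIONS[action]
        nx, ny = self.x + dx, self.y + dy
        # An action that would take the robot off the grid is not performed:
        # the robot remains stationary (deterministic no-op).
        if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE:
            return nx, ny
        return self.x, self.y

    def move(self, action):
        # Execute the action; each move takes one unit of simulation time.
        self.x, self.y = self.next_position(action)
        return self.x, self.y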

IV. MRCPG ALGORITHM

The MRCPG algorithm for the multirobot point gathering task is described in Figure 1. At a high level, the strategy of the algorithm is as follows. Initially, all robots are in distinct locations; that is, in the initial configuration, there is strictly no point with multiplicity. Each robot picks its most valuable action and moves until one of them reaches the goal. On reaching it, the location is communicated to the other robots. The remaining robots then use a heuristic to arrive at the goal.


MRCPG Algorithm

1. Initialize action preferences AP(s, a) and value function Q(s, a) to zero.
Repeat (for each episode):
2. Spread the robots in the region and initialize state s for each robot i.
3. For each robot i, do steps 4 and 5.
4. While (robot_not_at_goal equals TRUE) do
   a. Select direction ‘a’ with the highest action preference from AP(s, a)
   b. If all actions are equally likely, select action ‘a’ randomly
   c. Perform action ‘a’, obtain reward ‘r’
   d. Communicate the selected action ‘a’ and the reward obtained ‘r’ to the other robots
   e. Perform the value function update for Q:
      i.   δ ← α(r + γQ(s', a') − Q(s, a))
      ii.  Q(s, a) ← Q(s, a) + δ
      iii. AP(s, a) ← AP(s, a) + δ
   f. On receiving a message, perform a similar update on Q(s, a) for self
   g. s ← s'
   Until s is terminal
   Endwhile
5. Until (goal_reached equals TRUE) Endfor
6. If (goal_reached equals TRUE), communicate the location information to the other robots.
7. If (goal_location_msg received equals TRUE)
   a. Compute the Reach-distance heuristic measure
   b. While (Reach-distance NOT EQUALS zero)
      Move towards the goal
      Until (goal_reached equals TRUE)
   Endwhile

Figure 1: MRCPG Algorithm

A robot maintains a vector of action preferences over all permissible actions (north, east, west, and south) from a given state and picks the action with the highest preference value. In addition, the reinforcement learning framework employs an explore-exploit strategy with a threshold of 0.2: 20% of the time, a robot picks a random action in an attempt to explore the state space. On picking the action, the robot performs the corresponding movement and is rewarded by the environment as an evaluation of its action. This information doublet (action selected, reward) is broadcast to all robots. The reward is used to update the SVF associated with this action selection strategy. This value function maintains the usefulness of a given state. In our simulations, we assumed that each location in the plane is a state defined by the SVF. This ensures that the preferences are modified to reflect the values associated with the actions and thereby guide the robot towards the target location. Since all robots have information on the action selected and the corresponding reward received by a particular robot, this update is performed identically for all robots. The value function update for executing an action ‘a’ in state ‘s’, with next state ‘s′’ and next-state action ‘a′’, having obtained reward ‘r’, is given by the following three equations:

δ ← α(r + γQ(s', a') − Q(s, a))   (1)

Q(s, a) ← Q(s, a) + δ   (2)

AP(s, a) ← AP(s, a) + δ   (3)

…where α is the learning rate, γ is a scalar discount factor applied to rewards obtained in future states, and δ is the resulting update applied to both the value function and the action preferences.
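As an illustration, the following Python sketch shows one way the epsilon-greedy action selection (with the 0.2 explore threshold) and the update of Eqs. (1)-(3) could be realized. The learning rate ALPHA, the discount GAMMA, and the dictionary-based tables are assumptions made for this sketch; the paper itself fixes only the explore threshold.

import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # ALPHA and GAMMA are assumed values
ACTIONS = ("north", "south", "east", "west")

Q = {}    # shared value function, keyed by (state, action)
AP = {}   # action preferences, keyed by (state, action)

def select_action(state):
    # Epsilon-greedy selection over the action-preference vector.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)            # explore 20% of the time
    prefs = [AP.get((state, a), 0.0) for a in ACTIONS]
    if len(set(prefs)) == 1:                     # all actions equally preferred
        return random.choice(ACTIONS)
    return ACTIONS[prefs.index(max(prefs))]      # otherwise exploit

def svf_update(state, action, reward, next_state, next_action):
    # Eqs. (1)-(3); every robot runs the same update when it receives the
    # broadcast (action, reward) message, so the value function stays shared.
    delta = ALPHA * (reward
                     + GAMMA * Q.get((next_state, next_action), 0.0)
                     - Q.get((state, action), 0.0))             # Eq. (1)
    Q[(state, action)] = Q.get((state, action), 0.0) + delta    # Eq. (2)
    AP[(state, action)] = AP.get((state, action), 0.0) + delta  # Eq. (3)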

When at least one robot has reached the goal, it senses this and terminates further movements from that location. This robot then communicates its position information to the other robots so that they can direct their movements towards the goal. This is achieved by means of the Reach-distance heuristic (R-distance), which provides both a qualitative and a quantitative measure of the distance of each robot from the goal. The R-distance for a robot at location (x, y) in the Cartesian coordinate system is computed as:

X-dist ← goal_x_pos − x   (4)

Y-dist ← goal_y_pos − y   (5)

…where goal_x_pos and goal_y_pos are the (X, Y) positional values of the goal in the same coordinate system.

Having computed these, it is easy to see that the sign of each distance measure indicates the direction of the goal along that axis, while the magnitude indicates the minimum number of steps needed to reach the goal along that axis.
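A small Python sketch of the R-distance computation of Eqs. (4)-(5), and of how a robot could use it to choose its next move, is given below; the function names and the convention that north increases y are our own assumptions.

def r_distance(robot_xy, goal_xy):
    # Signed per-axis distance from a robot to the announced goal (Eqs. 4-5).
    x, y = robot_xy
    gx, gy = goal_xy
    x_dist = gx - x   # sign gives the direction, magnitude the minimum steps
    y_dist = gy - y
    return x_dist, y_dist

def step_towards_goal(robot_xy, goal_xy):
    # Pick one cardinal move that reduces the R-distance.
    x_dist, y_dist = r_distance(robot_xy, goal_xy)
    if x_dist != 0:
        return "east" if x_dist > 0 else "west"
    if y_dist != 0:
        return "north" if y_dist > 0 else "south"
    return None  # R-distance is zero: the robot has reached the goal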

V. SIMULATION SETUP

The environment used in the simulated analysis of the MRCPG algorithm is depicted in Figure 2. It consists of a 12 × 12 grid with the structure shown. The robots were deployed at random points, and the goal was to rendezvous at a point, indicated by the letter G in the figure. A reward of +10 was awarded to a robot on reaching the goal, and a step reward of -1 was given for each step that did not reach the goal. The shaded region in the center represents marshy land, which the robots prefer to avoid: every transition into a marshy cell yields an additional negative reward depending on the depth of the marsh at that point, as indicated in the figure. The reinforcement learning framework maintained the explore-exploit threshold at 0.2 even after identifying the optimal action in a state.

The information about the exact location of the goal was available to the robot team only when at least one of them reached it. The coordination algorithm was compared, in terms of steps to completion, with algorithms that achieve the same task without coordination, i.e., purely based on individual experience.

Each robot is deployed at a random location on the grid and has a simple view of the state, i.e., a robot has knowledge only of its current location and its immediate next states for all permissible actions from that state. For this experiment, the robots were allowed to move along the four cardinal directions within the described environment, but no robot was allowed to leave the grid.

Figure 2: Simulation environment for the point gathering task (12 × 12 grid; G marks the target location; the shaded marsh cells carry additional penalties of -2, -3, and -4).

All robots were scattered in the grid, and the SVF and the action preferences were initialized to zero. One run of the algorithm refers to one completion of the point gathering task, i.e., all robots reach the goal location once. This experiment was performed for 10000 runs, and the SVF was carried forward from one run to the next.
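For reference, the reward structure of this environment can be sketched as follows in Python; the goal cell and the marsh layout used here are placeholders, since the exact geometry is given only pictorially in Figure 2.

GOAL = (9, 9)     # placeholder goal cell; the actual cell is marked G in Figure 2
MARSH = {         # placeholder marsh cells mapped to their depth penalties
    (5, 5): -2, (5, 6): -3, (6, 5): -3, (6, 6): -4,
}

def reward(next_cell):
    # Scalar reward returned by the environment after a move into next_cell.
    if next_cell == GOAL:
        return 10                  # reaching the rendezvous point
    r = -1                         # step penalty for every non-goal move
    r += MARSH.get(next_cell, 0)   # additional penalty inside the marsh
    return r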

VI. RESULTS AND DISCUSSIONS

In this section, we present the results of the studies carried out to validate the MRCPG algorithm. The algorithm strictly forbids multiplicity at any location other than the goal and is therefore not directly comparable with the other point gathering algorithms discussed in the literature [2, 4, 6, 10, 11]. Besides, our algorithm emphasizes quick task completion, whereas other algorithms implementing similar models include a duty cycle and emphasize mainly completion. Figure 3 shows the plot of average steps to completion for the 2-robot point gathering task in the environment described in Section 5, and Figure 4 presents the same graph in a magnified view to highlight the difference. For the 2-robot case, the mean number of steps to completion was 86 without information sharing, and this dropped to 13 when information was shared. This difference can be attributed to the fact that communication, and hence the SVF, reflected the effects of a bad action in a state, which helped the agents almost always pick the goal-directed action.

Figure 3: Plot comparing average path length to completion for 2 robots with and without communication (series: 2 Agents No Comm, 2 Agents with Comm).

Figure 4: Magnified view of Figure 3 showing the improvement in average path-length with communication.

We note that it takes the robots several runs of the simulation setup to learn to pick the right actions. This can be attributed to the explorative nature of the reinforcement learning framework, which forces the robots to take sub-optimal actions from time to time. On the other hand, this strategy enables the robots to explore the grid quickly and learn to pick the goal-directed action from each state. Figure 5 presents the results of the simulation for 5 robots on the same grid with and without communication.

Figure 5: Plot comparing average path length to completion for 5 robots with and without communication (series: 5 Agents No Comm, 5 Agents with Comm).

Since there was a negative reward for every action that did not reach the goal, some goal-directed actions were discounted, and in such cases a general direction was selected. When the robots did not communicate, a robot had to personally experience each state in the grid several times before it learnt to pick the best action in that state. When communication was possible, however, robots were able to take the right decision even in states they had not previously visited, since the experiences of the other robots were shared through the SVF.

It is also interesting to note that increasing the team size enhanced the effect of communication on the speed of task completion: while 2-robot teams completed the task in about 13 steps, the task was achieved in about 8 steps when 5-robot teams were deployed. Our study assumed that communication between the robots always takes the same amount of time, which is not always the case, especially in unfriendly terrains. While it would be interesting to study where the system saturates, the size of the grid restricted our ability to extend beyond this team size.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a distributed coordination algorithm, MRCPG, for coordinating the actions of autonomous mobile robots in gathering them at an unknown point. The robots take decisions based on local environment information and communicate their state information periodically. The robots use a vector of action preferences to select an action in any state and communicate the selected action to the other robots. Each action selected is evaluated by a reinforcement learning environment that awards a scalar reward. This message is used to update the action preferences of all robots and hence speed up completion. The system model was presented and tested in a simulated rescue-mission environment with up to 5 robots. We conclude that communication significantly speeds up task completion, measured as path length to the goal. In addition, we also conclude that increasing team size enhances the effect of communication on task completion.

It would be interesting to study and understand the convergence guarantees of such systems when communication becomes on-demand or otherwise constrained, especially in unfriendly terrains, and to study the wireless networking aspects of communication for coordination. Our work is currently focused on extending this idea. While this work assumes homogeneous robots in the point gathering task, typical scenarios have various other requirements, and not all of them can be satisfied with identical robots. Future research is also planned on extending this algorithm to heterogeneous robots.

REFERENCES

[1] G. Prencipe, "The Effect of Synchronicity on the Behavior of Autonomous Mobile Robots", Theory of Computing Systems (TOCS), 2004.

[2] P. Flocchini, G. Prencipe, N. Santoro and P. Widmayer, "Gathering of Asynchronous Robots with Limited Visibility", Theoretical Computer Science (TCS), 2004.

[3] V. Gervasi and G. Prencipe, "Coordination Without Communication: The Case of the Flocking Problem", Discrete Applied Mathematics, Vol. 143, 2003, pp. 203-223.

[4] P. Flocchini, G. Prencipe, N. Santoro and P. Widmayer, "Pattern Formation by Autonomous Mobile Robots", InterJournal of Complex Systems, Article 395.

[5] G. Prencipe, "Basic Algorithms for the MRMW PRAM Model", Calcolo, Vol. 34, 1997, pp. 135-144.

[6] G. Prencipe, "On the Feasibility of Gathering by Autonomous Mobile Robots", in Proc. XI International Colloquium on Structural Information and Communication Complexity (SIROCCO 2005).

[7] M. Cieliebak, P. Flocchini, G. Prencipe and N. Santoro, "Solving the Robots Gathering Problem", in 30th International Colloquium on Automata, Languages and Programming (ICALP 2003), pp. 1181-1196, Eindhoven, The Netherlands, 2003.

[8] M. Cieliebak and G. Prencipe, "Gathering Autonomous Mobile Robots", in Proc. VIII International Colloquium on Structural Information and Communication Complexity (SIROCCO 2002), pp. 57-72, Andros, Greece, 2002.

[9] G. Prencipe, "Instantaneous Actions vs. Full Asynchronicity: Controlling and Coordinating a Set of Autonomous Mobile Robots", in VII Italian Conference on Theoretical Computer Science (ICTCS 2001), LNCS 2202, pp. 154-171, Torino, Italy, 2001.

[10] G. Prencipe, "CORDA: Distributed Coordination of a Set of Autonomous Mobile Robots", in Proc. Fourth European Research Seminar on Advances in Distributed Systems (ERSADS 2001), pp. 185-190, Bertinoro, Italy, May 2001.

[11] P. Flocchini, G. Prencipe, N. Santoro and P. Widmayer, "Distributed Coordination of a Set of Autonomous Mobile Robots", in IEEE Intelligent Vehicle Symposium (IV 2000), pp. 480-485, Dearborn, USA, October 3-5, 2000.

[12] P. Flocchini, G. Prencipe, N. Santoro and P. Widmayer, "Hard Tasks for Weak Robots: The Role of Common Knowledge in Pattern Formation by Autonomous Mobile Robots", in Proc. 10th Annual International Symposium on Algorithms and Computation (ISAAC 99), LNCS 1741, pp. 93-102, Chennai, India, December 16-18, 1999.

[13] H. Ando, Y. Oasa, I. Suzuki and M. Yamashita, "A Distributed Memoryless Point Convergence Algorithm for Mobile Robots with Limited Visibility", IEEE Transactions on Robotics and Automation, 15(5):818-828, 1999.

[14] D. Jung, G. Cheng and A. Zelinsky, "Experiments in Realizing Cooperation between Autonomous Mobile Robots", in ISER, 1997.

[15] T. Balch and R. C. Arkin, "Behavior-based Formation Control for Multi-robot Teams", IEEE Transactions on Robotics and Automation, 14(6), 1998.

[16] T. Balch and R. C. Arkin, "Behavior-based Formation Control for Multi-robot Teams", IEEE Transactions on Robotics and Automation, 14(6), 1998, pp. 926-939.

[17] R. C. Arkin, "Motor Schema-Based Mobile Robot Navigation", International Journal of Robotics Research, 8(4), 1989, pp. 92-112.

[18] P. K. C. Wang, "Navigation Strategies for Multiple Autonomous Mobile Robots Moving in Formation", Journal of Robotic Systems, 8(2), 1991, pp. 177-195.
