
ScoutBot: A Dialogue System for Collaborative Navigation

Stephanie M. Lukin1, Felix Gervits2*, Cory J. Hayes1, Anton Leuski3, Pooja Moolchandani1, John G. Rogers III1, Carlos Sanchez Amaro1, Matthew Marge1, Clare R. Voss1, David Traum3

1 U.S. Army Research Laboratory, Adelphi, MD 20783
2 Tufts University, Medford, MA 02155

3 USC Institute for Creative Technologies, Playa Vista, CA 90094
[email protected], [email protected], [email protected]

* Contributions were primarily performed during an internship at the Institute for Creative Technologies.

Abstract

ScoutBot is a dialogue interface to physical and simulated robots that supports collaborative exploration of environments. The demonstration will allow users to issue unconstrained spoken language commands to ScoutBot. ScoutBot will prompt for clarification if the user's instruction needs additional input. It is trained on human-robot dialogue collected from Wizard-of-Oz experiments, where robot responses were initiated by a human wizard in previous interactions. The demonstration will show a simulated ground robot (Clearpath Jackal) in a simulated environment supported by ROS (Robot Operating System).

1 Introduction

We are engaged in a long-term project to create an intelligent, autonomous robot that can collaborate with people in remote locations to explore the robot's surroundings. The robot's capabilities will enable it to effectively use language and other modalities in a natural manner for dialogue with a human teammate. This demo highlights the current stage of the project: a data-driven, automated system, ScoutBot, that can control a simulated robot with verbally issued, natural language instructions within a simulated environment, and can communicate in a manner similar to the interactions observed in prior Wizard-of-Oz experiments. We used a Clearpath Jackal robot (shown in Figure 1a), a small, four-wheeled unmanned ground vehicle with an onboard CPU and inertial measurement unit, equipped with a camera and light detection and ranging (LIDAR) mapping that readily allows for automation.


The robot's task is to navigate through an urban environment (e.g., rooms in an apartment or an alleyway between apartments), and communicate discoveries to a human collaborator (the Commander). The Commander verbally provides instructions and guidance for the robot to navigate the space.

The reasons for an intelligent robot collaborator, rather than one teleoperated by a human, are twofold. First, the human resources needed for completely controlling every aspect of robot motion (including low-level path-planning and navigation) may not be available. Natural language allows for high-level tasking, specifying a desired plan or end-point, such that the robot can figure out the details of how to execute these natural language commands in the given context and report back to the human as appropriate, requiring less time and cognitive load from the human. Second, the interaction should be robust to low-bandwidth and unreliable communication situations (common in disaster relief and search-and-rescue scenarios), where it may be impossible or impractical to exercise real-time control or see full video streams. Natural language interaction coupled with low-bandwidth, multimodal updates addresses both of these issues, reducing the need for high-bandwidth, reliable communication and for the full attention of a human controller.

We have planned the development of ScoutBot in five conceptual phases, each culminating in an experiment to validate the approach and collect evaluation data to inform the development of the subsequent phase. These phases are:

1. Typed wizard interaction in real-world environment
2. Structured wizard interaction in real environment
3. Structured wizard interaction in simulated environment
4. Automated interaction in simulated environment
5. Automated interaction in real environment


Figure 1: Jackal Robot. (a) Real-World Jackal; (b) Simulated Jackal.

The first two phases are described in Marge et al. (2016) and Bonial et al. (2017), respectively. Both consist of Wizard-of-Oz settings in which participants believe that they are interacting with an autonomous robot, a common tool in Human-Robot Interaction for supporting not yet fully realized algorithms (Riek, 2012). In our two-wizard design, one wizard (the Dialogue Manager or DM-Wizard) handles the communication aspects, while another (the Robot Navigator or RN-Wizard) handles robot navigation. The DM-Wizard acts as an interpreter between the Commander and robot, passing simple and full instructions to the RN-Wizard for execution based on the Commander instruction (e.g., the Commander instruction, Now go and make a right um 45 degrees, is passed to the RN-Wizard for execution as, Turn right 45 degrees). In turn, the DM-Wizard informs the Commander of instructions the RN-Wizard successfully executes or of problems that arise during execution. Unclear instructions from the Commander are clarified through dialogue with the DM-Wizard (e.g., How far should I turn?). Additional discussion between Commander and DM-Wizard is allowed at any time. Note that because of the aforementioned bandwidth and reliability issues, it is not feasible for the robot to start turning or moving and wait for the Commander to tell it when to stop; this may cause the robot to move too far, which could be dangerous in some circumstances and confusing in others. In the first phase, the DM-Wizard uses unconstrained texting to send messages to both the Commander and RN-Wizard. In the second phase, the DM-Wizard uses a click-button interface that facilitates faster messaging. The set of DM-Wizard messages in this phase was constrained based on the messages from the first phase.

This demo introduces ScoutBot automation of the robot to be used in the upcoming phases: a simulated robot and simulated environment to support the third and fourth project phases, and initial automation of the DM and RN roles, to be used in the fourth and fifth phases. Simulation and automation were based on analyses of data collected in the first two phases. Together, the simulated environment and robot allow us to test the automated system in a safe setting, where people, the physical robot, and the real-world environment are not at risk due to communication or navigation errors.

2 System Capabilities

ScoutBot engages in collaborative exploration dialogues with a human Commander, and controls a robot to execute instructions and navigate through and observe an environment. The real-world Jackal robot measures 20 in x 17 in x 10 in, and weighs about 37 lbs (pictured in Figure 1a). Both it and its simulated counterpart (seen in Figure 1b) can drive around the environment, but cannot manipulate objects or otherwise interact with the environment. While navigating, the robot uses LIDAR to construct a map of its surroundings as well as to indicate its position in the map. The Commander is shown this information, as well as static photographs of the robot's frontal view in the environment (per request) and dialogue responses in a text interface.

Interactions with ScoutBot are collaborative in that ScoutBot and the Commander exchange information through dialogue. Commanders issue spoken navigation commands to the robot, request photos, or ask questions about its knowledge and perception. ScoutBot responds to indicate when commands are actionable, and gives status feedback on its processing, action, or inability to act. When ScoutBot accepts a command, it is carried out in the world.

Figure 2 shows a dialogue between a Commander and ScoutBot. The right two columns show how the Commander's language was interpreted, and the action the robot takes. In this dialogue, the Commander begins by issuing a single instruction, Move forward (utterance #1), but the end-point is underspecified, so ScoutBot responds by requesting additional information (#2). The Commander supplies the information in #3, and the utterance is understood as an instruction to move forward 3 feet. ScoutBot provides feedback to the Commander as the action begins (#4) and upon completion (#5). Another successful action is executed in #6-8. Finally, the Commander's request to know what the robot sees in #9 is interpreted as a request for a picture from the robot's camera, which is taken and sent to the Commander.


#  Commander               ScoutBot                                       Interpreted Action             Robot Action
1  Move forward
2                          You can tell me to move a certain distance
                           or to move to an object.
3  move forward 3 feet                                                     Go forward 3 feet
4                          Moving…                                                                        (starts moving)
5                          Done.                                                                          (stops moving)
6  turn right 45 degrees                                                   Pivot right about 45 degrees
7                          Turning…                                                                       (starts turning)
8                          Done.                                                                          (stops turning)
9  What do you see?                                                        send image                     (takes and sends image)

Figure 2: Dialogue between a Commander and ScoutBot


ScoutBot accepts unconstrained spoken language, and uses a statistical text classifier, trained on annotated data from the first two phases of the project, to interpret Commander instructions. Dialogue management policies generate both messages to the Commander and interpreted actions for automated robot navigation. Our initial dialogue management policies are fairly simple, yet are able to handle a wide range of phenomena seen in our domain.

3 System Overview

ScoutBot consists of several software systems, running on multiple machines and operating systems, using two distributed agent platforms. The architecture utilized in simulation is shown in Figure 3. A parallel architecture exists for a real-world robot and environment. Components primarily involving language processing use parts of the Virtual Human (VH) architecture (Hartholt et al., 2013), and communicate using the Virtual Human Messaging (VHMSG), a thin layer on top of the publicly available ActiveMQ message protocol (http://activemq.apache.org). Components primarily involving the real or simulated world, robot locomotion, and sensing are embedded in ROS (http://www.ros.org). To allow the VHMSG and ROS modules to interact with each other, we created ros2vhmsg, software that bridges the two messaging architectures. The components are described in the remainder of this section.


Figure 3: ScoutBot architecture interfacing with a simulated robot and environment (components shown: Push-to-Talk Microphone, Google Chrome Client, Google ASR, NPCEditor, ros2vhmsg Bridge, Automated Robot Navigator, Jackal Simulator, Environment Simulator, and the Commander Display). Solid arrows represent communications over a local network; dashed arrows indicate connections with external resources. Messaging for the spoken language interface is handled via VHMSG, while robot messages are facilitated by ROS. A messaging bridge, ros2vhmsg, connects the components.

The system includes several distinct workstation displays for human participants. The Commander display is the view of the collaborative partner for the robot (lower left corner of Figure 3). This display shows a map of the robot's local surroundings, the most recent photo the robot has sent, and a chat log showing text utterances from the robot. The map is augmented as new areas are explored, and updated according to the robot's position and orientation. There are also displays for experimenters to monitor (and, in the case of Wizards-of-Oz, engage in) the interaction. These displays show real-time video updates of what the robot can see, as well as internal communication and navigation aids.


3.1 VHMSG Components

VHMSG includes several protocols that implement parts of the Virtual Human architecture. We use the protocols for speech recognition, as well as component monitoring and logging. The NPCEditor and other components that use these protocols are available as part of the ICT Virtual Human Toolkit (Hartholt et al., 2013). These protocols have also been used in systems, as reported by Hill et al. (2003) and Traum et al. (2012, 2015). In particular, we used the adaptation of Google's Automated Speech Recognition (ASR) API used in Traum et al. (2015). The NPCEditor (Leuski and Traum, 2011) was used for Natural Language Understanding (NLU) and dialogue management. The new ros2vhmsg component for bridging the messages was used to send instructions from the NPCEditor to the automated RN.

3.1.1 NPCEditor

We implemented NLU using the statistical text classifier included in the NPCEditor. The classifier learns a mapping from inputs to outputs from training data using cross-language retrieval models (Leuski and Traum, 2011). The dialogues collected from our first two experimental phases served as training data, and consisted of 1,500 pairs of Commander (user) utterances and the DM-Wizard's responses. While this approach limits responses to the set seen in the training data, in practice it works well in our domain, achieving 86% accuracy on a held-out test set. The system is particularly effective at translating actionable commands to the RN for execution (e.g., Take a picture). It is robust at handling commands that include pauses, fillers, and other disfluent features (e.g., Uh move um 10 feet). It can also handle simple metric-based motion commands (e.g., Turn right 45 degrees, Move forward 10 feet) as well as action sequences (e.g., Turn 180 degrees and take a picture every 45). The system can interpret some landmark-based instructions requiring knowledge of the environment (e.g., Go to the orange cone), but the automated RN component does not yet have the capability to generate the low-level, landmark-based instructions for the robot.
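The NPCEditor classifier is based on cross-language retrieval models; purely as an illustrative stand-in, the sketch below conveys the retrieval flavor of the approach with a much simpler TF-IDF nearest-neighbor lookup over a handful of invented (utterance, response) pairs. The training pairs, threshold, and responses are assumptions for the example, not values from our system.

# Illustrative sketch only: a TF-IDF nearest-neighbor stand-in for the
# retrieval-style NLU described above (NPCEditor itself uses cross-language
# retrieval models; the training pairs below are invented examples).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (Commander utterance, DM response) training pairs.
TRAINING_PAIRS = [
    ("move forward three feet", "Moving..."),
    ("uh move um ten feet", "Moving..."),
    ("turn right forty five degrees", "Turning..."),
    ("take a picture", "Sending image..."),
    ("move forward",
     "You can tell me to move a certain distance or to move to an object."),
]

utterances = [u for u, _ in TRAINING_PAIRS]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
index = vectorizer.fit_transform(utterances)

def respond(user_utterance, threshold=0.2):
    """Return the response paired with the closest training utterance,
    or a clarification request if nothing is similar enough."""
    query = vectorizer.transform([user_utterance])
    scores = cosine_similarity(query, index)[0]
    best = scores.argmax()
    if scores[best] < threshold:
        return "Could you rephrase that?"
    return TRAINING_PAIRS[best][1]

# Retrieves the response paired with the closest training utterance.
print(respond("please move forward three feet"))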

Although the NPCEditor supports simultaneous participation in multiple conversations, extensions were needed for ScoutBot to support multi-floor interaction (Traum et al., 2018), in which two conversations are linked together. Rather than just responding to input in a single conversation, the DM-Wizard in our first project phases often translates input from one conversational floor to another (e.g., from the Commander to the RN-Wizard, or vice versa), or responds to input with messages to both the Commander and the RN. These responses need to be consistent (e.g., translating a command to the RN should be accompanied by positive feedback to the Commander, while a clarification to the Commander should not include an RN action command). Using the dialogue relation annotations described in Traum et al. (2018), we trained a hybrid classifier, including translations to RN if they existed, and negative feedback to the Commander where they did not. We also created a new dialogue manager policy that accompanies RN commands with appropriate positive feedback to the Commander, e.g., a response of "Moving..." vs. "Turning...", as seen in #4 and #7 in Figure 2.
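As a minimal sketch of this pairing constraint, and not the actual NPCEditor dialogue manager, the following policy function always couples an RN command with positive Commander feedback and never attaches an RN command to a clarification; the intent labels and message strings are hypothetical.

# Minimal sketch (not the actual NPCEditor policy) of a multi-floor dialogue
# policy: each interpreted input yields a Commander-facing message and, only
# when an action is involved, a paired RN command. Labels are invented.
from typing import NamedTuple, Optional

class PolicyOutput(NamedTuple):
    commander_msg: str           # feedback sent on the Commander floor
    rn_command: Optional[str]    # translated command for the RN floor, if any

def dialogue_policy(intent: str, rn_translation: Optional[str]) -> PolicyOutput:
    """Keep the two floors consistent: an RN command is always accompanied by
    positive feedback, and clarifications never carry an RN command."""
    if intent == "move" and rn_translation:
        return PolicyOutput("Moving...", rn_translation)
    if intent == "turn" and rn_translation:
        return PolicyOutput("Turning...", rn_translation)
    if intent == "image" and rn_translation:
        return PolicyOutput("Sending image...", rn_translation)
    if intent == "underspecified-move":
        return PolicyOutput(
            "You can tell me to move a certain distance or to move to an object.",
            None)
    return PolicyOutput("I'm not sure how to do that.", None)

# Example: utterance #3 from Figure 2, after NLU has produced an intent
# label and an RN translation.
print(dialogue_policy("move", "Go forward 3 feet"))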

3.1.2 ros2vhmsg

ros2vhmsg is a macOS application that bridges the VHMSG and ROS components of ScoutBot. It can be run from a command line or through a graphical user interface that simplifies the configuration of the application. Both VHMSG and ROS are publisher-subscriber architectures with central broker software. Each component connects to the broker; components publish messages by delivering them to the broker, and receive messages by telling the broker which messages they want to receive. When the broker receives a relevant message, it is delivered to the subscribers.

ros2vhmsg registers itself as a client for both the VHMSG and ROS brokers, translating and reissuing the messages. For example, when ros2vhmsg receives a message from the ROS broker, it converts the message into a compatible VHMSG format and publishes it with the VHMSG broker. Within ros2vhmsg, the message names to translate, along with the corresponding ROS message types, must be specified. Currently, the application supports translation for messages carrying either text or robot navigation commands. The application is flexible and easily extendable with additional message conversion rules. ros2vhmsg annotates its messages with a special metadata symbol and uses that information to avoid processing messages that it already published.
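ros2vhmsg itself is a macOS application, and its VHMSG and ROS client code is not reproduced here; the fragment below is only a schematic of the relay-and-tag behavior just described, with both brokers reduced to placeholder publish functions and an assumed marker string standing in for the real metadata symbol.

# Schematic illustration only of the relay logic described above: messages
# received from one broker are converted and reissued on the other, and a
# metadata tag prevents the bridge from re-processing its own messages.
# The publish functions below are placeholders, not real client APIs.
BRIDGE_TAG = "[ros2vhmsg]"   # hypothetical marker; the real metadata symbol differs

def publish_to_vhmsg(name: str, body: str) -> None:
    print("VHMSG <- %s: %s" % (name, body))   # placeholder for a VHMSG publish

def publish_to_ros(topic: str, body: str) -> None:
    print("ROS   <- %s: %s" % (topic, body))  # placeholder for a ROS publish

def on_ros_message(topic: str, body: str) -> None:
    """Called for each subscribed ROS message; reissue it on the VHMSG side."""
    if body.startswith(BRIDGE_TAG):
        return                                # already published by the bridge
    publish_to_vhmsg(topic, "%s %s" % (BRIDGE_TAG, body))

def on_vhmsg_message(name: str, body: str) -> None:
    """Called for each subscribed VHMSG message; reissue it on the ROS side."""
    if body.startswith(BRIDGE_TAG):
        return
    publish_to_ros(name, "%s %s" % (BRIDGE_TAG, body))

# Example: an NLU result flowing from the VHMSG side to the ROS side.
on_vhmsg_message("rn_instruction", "Turn left 90 degrees")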


Figure 4: Real-world and simulated instances of the same environment.

3.2 ROS Components

ROS provides backend support for robots to operate in both real-world and simulated environments. Here, we describe our simulated testbed and automated navigation components. Developing automated components in simulation allows for a safe test space before software is transitioned to a physical robot in the real world.

3.2.1 Environment Simulator

Running operations under ROS makes the transition between real-world and simulated testing straightforward. Gazebo (http://gazebosim.org/) is a software package compatible with ROS for rendering high-fidelity virtual simulations, and it supports communications in the same manner as in real-world environments. Simulated environments were modeled in Gazebo after their real-world counterparts, as shown in Figure 4. Replication took into consideration the general dimensions of the physical space, and the location and size of objects that populate the space. Matching the realism and fidelity of the real-world environment in simulation comes with a rendering trade-off: objects requiring a higher polygon count due to their natural geometries result in slower rendering in Gazebo and a lag when the robot moves in the simulation. As a partial solution, objects were constructed starting from their basic geometric representations, which could be optimized accordingly; e.g., the illusion of depth was added with detailed textures or shading on flat-surfaced objects. Environments rendered in Gazebo undergo a complex workflow to support the aforementioned requirements. Objects and their vertices are placed in an environment, and properties about collisions and gravity are encoded in the simulated environment.
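Our environments were modeled by hand in Gazebo, but as a hedged illustration of how an object with collision geometry, mass, and gravity can be placed into a running simulation, the sketch below spawns a simple box through the standard gazebo_ros spawn service; the model name, dimensions, and pose are invented for the example.

#!/usr/bin/env python
# Hedged sketch: one way to place a simple box (with collision geometry,
# mass, and gravity enabled) into a running Gazebo world via the gazebo_ros
# spawn service. Object name, size, and pose are illustrative only.
import rospy
from gazebo_msgs.srv import SpawnModel
from geometry_msgs.msg import Pose

BOX_SDF = """
<sdf version="1.6">
  <model name="crate">
    <link name="link">
      <gravity>true</gravity>
      <inertial><mass>1.0</mass></inertial>
      <collision name="collision">
        <geometry><box><size>0.5 0.5 0.5</size></box></geometry>
      </collision>
      <visual name="visual">
        <geometry><box><size>0.5 0.5 0.5</size></box></geometry>
      </visual>
    </link>
  </model>
</sdf>
"""

if __name__ == "__main__":
    rospy.init_node("spawn_demo_object")
    rospy.wait_for_service("/gazebo/spawn_sdf_model")
    spawn = rospy.ServiceProxy("/gazebo/spawn_sdf_model", SpawnModel)
    pose = Pose()
    pose.position.x, pose.position.y, pose.position.z = 2.0, 1.0, 0.25
    spawn(model_name="crate", model_xml=BOX_SDF, robot_namespace="",
          initial_pose=pose, reference_frame="world")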

3.2.2 Jackal Simulator

Clearpath provides a ROS package with a simulated model of the Jackal robot (Figure 1b) and customizable features to create different simulated configurations. We configured our simulated Jackal to have the default inertial measurement unit, a generic camera, and a Hokuyo LIDAR laser scanner.

As the simulated Jackal navigates through the Gazebo environment, data from the sensors is relayed to the workstation views through rviz (http://wiki.ros.org/rviz), a visualization tool included with ROS that reads and displays sensor data in real time. Developers can select the data to display by adding a panel, selecting the data type, and then selecting the specific ROS data stream. Both physical and simulated robot sensors are supported by the same rviz configuration, since rviz only processes the data sent from these sensors; this means that an rviz configuration created for data from a simulated robot will also work for a physical robot if the data stream types remain the same.

3.2.3 Automated Robot Navigator

Automated robot navigation is implemented with a Python script and the rospy package (http://wiki.ros.org/rospy). The script runs on a ROS-enabled machine running the simulation. The script subscribes to messages from NPCEditor, which are routed through ros2vhmsg. These messages contain the NLU classifier's text output, which issues instructions to the robot based on the user's unconstrained speech.

A mapping of pre-defined instructions was created, with keys matching the strings passed from NPCEditor via ros2vhmsg (e.g., Turn left 90 degrees). For movement, the map values are ROS TWIST messages that specify the linear motion and angular rotation payload for navigation. These TWIST messages are the only way to manipulate the robot; measures in units of feet and degrees cannot be directly translated. The mapping is straightforward to define for metric instructions. Included is basic coverage of moving forward or backward between 1 and 10 feet, and turning right and left 45, 90, 180, or 360 degrees. Pictures from the robot's simulated camera can be requested by sending a ROS IMAGE message. The current mapping does not yet support collision avoidance or landmark-based instructions that require knowledge of the surrounding environment (e.g., Go to the nearby chair).
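The sketch below illustrates this kind of mapping in a self-contained rospy node: interpreted instruction strings arrive on a topic (the topic names here are hypothetical, as are the velocities and durations), and each known string is replayed as a timed TWIST payload on the velocity topic. It is a sketch of the approach, not the ScoutBot script itself.

#!/usr/bin/env python
# Hedged sketch of an automated Robot Navigator node: it listens for the
# interpreted instruction strings relayed from NPCEditor and replays a
# pre-defined Twist payload for each known instruction. Topic names,
# velocities, and durations are assumptions for illustration.
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

# instruction text -> (linear velocity m/s, angular velocity rad/s, seconds)
INSTRUCTION_MAP = {
    "Go forward 3 feet":     (0.3,  0.0, 3.0),
    "Go forward 10 feet":    (0.3,  0.0, 10.0),
    "Turn left 90 degrees":  (0.0,  0.5, 3.1),
    "Turn right 45 degrees": (0.0, -0.5, 1.6),
}

cmd_pub = None  # set in main()

def handle_instruction(msg):
    """Look up the interpreted instruction and publish its Twist payload."""
    payload = INSTRUCTION_MAP.get(msg.data)
    if payload is None:
        rospy.logwarn("No mapping for instruction: %s", msg.data)
        return
    linear, angular, duration = payload
    twist = Twist()
    twist.linear.x = linear
    twist.angular.z = angular
    rate = rospy.Rate(10)                    # republish at 10 Hz
    end_time = rospy.Time.now() + rospy.Duration(duration)
    while rospy.Time.now() < end_time and not rospy.is_shutdown():
        cmd_pub.publish(twist)
        rate.sleep()
    cmd_pub.publish(Twist())                 # stop the robot

def main():
    global cmd_pub
    rospy.init_node("automated_robot_navigator")
    cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)
    # Hypothetical topic carrying NLU output strings relayed by ros2vhmsg.
    rospy.Subscriber("/rn_instructions", String, handle_instruction)
    rospy.spin()

if __name__ == "__main__":
    main()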



4 Demo Summary

In the demo, visitors will see the simulated robot dynamically move in the simulated environment, guided by natural language interaction. Visitors will be able to speak instructions to the robot to move in the environment. Commands can be given in metric terms (e.g., #3 and #6 in Figure 2), and images requested (e.g., #9). Undecipherable or incomplete instructions will result in clarification subdialogues rather than robot motion (e.g., #1). A variety of command formulations can be accommodated by the NLU classifier based on the training data from our experiments. Visualizations of the different components can be seen at: http://people.ict.usc.edu/~traum/Movies/scoutbot-acl2018demo.wmv.

5 Summary and Future Work

ScoutBot serves as a research platform to support experimentation. ScoutBot components will be used in our upcoming third through fifth development phases. We are currently piloting phase 3 using ScoutBot's simulated environment, with human wizards. Meanwhile, we are extending the automated dialogue and navigation capabilities.

This navigation task holds potential for studying collaboration policies, such as the amount and type of feedback given, how to negotiate toward successful outcomes when an initial request was underspecified or impossible to carry out, and the impact of miscommunication. More sophisticated NLU methodologies can be tested, including those that recognize specific slots and values or more detailed semantics of spatial language descriptions. The use of context, particularly references to relative landmarks, can be tested by using the assumed context as part of the input to the NLU classifier, transforming the input before classifying, or deferring the resolution and requiring the action interpreter to handle situated, context-dependent instructions (Kruijff et al., 2007).

Some components of the ScoutBot architecture may be substituted to meet different needs, such as different physical or simulated environments, robots, or tasks. Training new DM and RN components can make use of this general architecture, and the resulting components can be aligned back with ScoutBot.

Acknowledgments

This work was supported by the U.S. Army.

References

Claire Bonial, Matthew Marge, Ashley Foots, Felix Gervits, Cory J. Hayes, Cassidy Henry, Susan G. Hill, Anton Leuski, Stephanie M. Lukin, Pooja Moolchandani, Kimberly A. Pollard, David Traum, and Clare R. Voss. 2017. Laying Down the Yellow Brick Road: Development of a Wizard-of-Oz Interface for Collecting Human-Robot Dialogue. In Proc. of AAAI Fall Symposium Series.

Arno Hartholt, David Traum, Stacy C. Marsella, Ari Shapiro, Giota Stratou, Anton Leuski, Louis-Philippe Morency, and Jonathan Gratch. 2013. All Together Now: Introducing the Virtual Human Toolkit. In Proc. of Intelligent Virtual Agents, pages 368–381. Springer.

Randall Hill, Jr., Jonathan Gratch, Stacy Marsella, Jeff Rickel, William Swartout, and David Traum. 2003. Virtual Humans in the Mission Rehearsal Exercise System. KI Embodied Conversational Agents, 17:32–38.

Geert-Jan M. Kruijff, Pierre Lison, Trevor Benjamin, Henrik Jacobsson, and Nick Hawes. 2007. Incremental, Multi-Level Processing for Comprehending Situated Dialogue in Human-Robot Interaction. In Symposium on Language and Robots.

Anton Leuski and David Traum. 2011. NPCEditor: Creating Virtual Human Dialogue Using Information Retrieval Techniques. AI Magazine, 32(2):42–56.

Matthew Marge, Claire Bonial, Kimberly A. Pollard, Ron Artstein, Brendan Byrne, Susan G. Hill, Clare Voss, and David Traum. 2016. Assessing Agreement in Human-Robot Dialogue Strategies: A Tale of Two Wizards. In Proc. of Intelligent Virtual Agents, pages 484–488. Springer.

Laurel Riek. 2012. Wizard of Oz Studies in HRI: A Systematic Review and New Reporting Guidelines. Journal of Human-Robot Interaction, 1(1).

David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Katsamanis, Anton Leuski, Dan Noren, and William Swartout. 2012. Ada and Grace: Direct Interaction with Museum Visitors. In Proc. of Intelligent Virtual Agents.

David Traum, Kallirroi Georgila, Ron Artstein, and Anton Leuski. 2015. Evaluating Spoken Dialogue Processing for Time-Offset Interaction. In Proc. of Special Interest Group on Discourse and Dialogue, pages 199–208. Association for Computational Linguistics.

David Traum, Cassidy Henry, Stephanie Lukin, Ron Artstein, Felix Gervits, Kimberly Pollard, Claire Bonial, Su Lei, Clare Voss, Matthew Marge, Cory Hayes, and Susan Hill. 2018. Dialogue Structure Annotation for Multi-Floor Interaction. In Proc. of Language Resources and Evaluation Conference.