
Shape Game: A Multimodal Game Featuring Incremental Understanding

Alexander Gruenstein, [email protected]

1 Introduction

The goal of much research in speech understanding systems is to design interfaces which allow for humans to interact with computers in a more natural and intuitive fashion. Intuitively, speech is a very natural, convenient, and fast way of communicating information – and people produce and understand speech with little overt effort. Despite the ease with which humans communicate with one another using speech, many find it frustrating to communicate with computers via speech. This frustration can be linked to many limitations of speech understanding systems, including limitations in the accuracy of automatic speech recognition, parsing, referent resolution, and a general lack of world knowledge. But even if these challenges were overcome, the vast majority of speech understanding systems would likely still feel quite unnatural, owing to the strong turn-based model that most are designed around. Typically, speech understanding systems are designed following a model in which users are expected to formulate complete, well-formed ideas (typically in the form of commands or queries) in a fully fluent utterance. The reality is, most people don’t talk this way – those of us who can clearly stand out: television personalities like American Idol host Ryan Seacrest or news anchors like Tom Brokaw come to mind.

The reality is, most of us aren’t very good at staring at a television camera – or a computer monitor – and speaking fluently, without immediate feedback that anyone (or anything) is listening or understanding. In my experience designing and testing several different multimodal speech understanding systems, this strict idea of turn taking has always stood out as being extremely awkward whenever naive users try to interact with a system. One of the most challenging aspects of designing speech understanding systems is giving a sense of what the system can understand, especially conveying to a user why the system has made an error in understanding. Humans are quite good at conveying what they have or have not understood about what another person has said, and one strong mechanism for signaling this is through immediate feedback to a person as they are speaking. This feedback comes in many different forms, including gestures, facial expressions, and backchannelling; and it can play a vital role in helping a speaker know when to rephrase, speak more slowly, and so on. However, the strict turn-taking structure of most speech understanding systems makes this sort of subtle, yet critical, interaction impossible. After attempting to formulate a fluent utterance, users are left to then try to figure out what went wrong and why.

In this report, I describe Shape Game, a novel speech understanding system which provides non-intrusive incremental feedback to users as they speak in the course of the game. While extremely simple, Shape Game affords both a test-bed and framework for the types of capabilities which could be integrated into more complex systems. The system is multimodal, in that it accepts speech input which modifies a game board which is shown to the user on a computer screen. Feedback to the user is provided graphically, as they speak, through the use of highlighting, shading, and flashing. Designing even such a relatively simple game with just a few feedback strategies proved surprisingly challenging, and I believe the work yields some interesting insight which can be incorporated into more complex projects.

In this report I’ll cover some related work which inspired Shape Game. I’ll then describe the various components which make Shape Game possible, and provide an analysis of a small user study. Finally, I wrap up and discuss future directions of exploration.

2 Related Work

James Allen’s group at the University of Rochester has recently published papers on incremental speech understanding, focusing on a domain called the “fruit carts” domain [see, for example, (Allen et al., 2005; Aist et al., 2006)]. Figure 1 shows the on-screen display used in this domain. Subjects are given a card similar to the top half of this figure, in which several different locations contain shapes filled with fruits of various colors, oriented at various rotations. Subjects then speak commands into a microphone, until their (initially unpopulated) map looks like the card they were given. While the group’s goal is to eventually build a speech understanding system which can perform this task, I am unaware of the existence of such a functioning system. At the moment, all data is collected via wizard-of-oz methods, where an unseen human manipulates what the subjects see on the screen in response to their commands.

This domain is particularly interesting because it is ripe for providing incremental feedback to users as they speak. The authors have observed that users tend to give small, incremental commands, and that they use system feedback to judge what future steps should be taken. The following actions are possible in the domain, with examples of how they are expressed [quoted directly from (Allen et al., 2005)]. Figure 2 reproduces an example of speech data collected through the experiments.

1  okay so
2  we’re going to put a large triangle with nothing into morningside
3  we’re going to make it blue
4  and rotate it to the left forty five degrees
5  take one tomato and put it in the center of that triangle
6  take two avocados and put it in the bottom of that triangle
7  and move that entire set a little bit to the left and down
8  mmkay
9  now take a small square with a heart on the corner
10 put it onto the flag area in central park
11 rotate it a little more than forty five degrees to the left
12 now make it brown
13 and put a tomato in the center of it
14 yeah that’s good
15 and we’ll take a square with a diamond on the corner
16 small
17 put it in oceanview terrace
18 rotate it to the right forty five degrees
19 make it orange
20 take two grapefruit and put them inside that square
21 now take a triangle with the star in the center
22 small
23 put it in oceanview just to the left of oceanview terrace
24 and rotate it left ninety degrees
25 okay
26 and put two cucumbers in that triangle
27 and make the color of the triangle purple

Figure 2: Example of a user interacting with the Fruit Carts domain, reproduced from (Aist et al., 2006)

1. Select an object (“take the large plain square”)

2. Move it (“move it to central park”)

3. Rotate it (“and then turn it left a bit that’s good”)

4. Paint it (“and that one needs to be purple”)

5. Fill it (“and there’s a grapefruit inside it”)

3 Shape Game: Functional Description and Design

In designing Shape Game, I used the Fruit Carts domain as inspiration; however, I developed a greatly simplified version of the “game” which could realistically be implemented given the time constraints. By far the best way to understand the game is by watching a video of the interaction, and actually playing the game. Please see http://people.csail.mit.edu/alexgru/shapegame/.

Figure 1: Fruit Carts domain screenshot, reproduced from (Allen et al., 2005)

Figures 3, 4, and 5 show examples of the game being played. The game supports three different commands:

1. Selection, as in “select the small green circle”

2. Put/Drop as in “put it in slot five”

3. Make as in “make it blue”

The game is very easy to play. A player uses these three commands to first select one of the 12 shapes along the bottom (called morphable shapes); she must provide enough attributes to uniquely identify a shape. Once a shape is selected, it is dropped into one of the numbered slots. Then, any number of “make” commands are used to change various attributes of the shape. The goal is to make all six shapes in the numbered slots look like the six shape templates above them. Currently there is no scoring, but it would be very natural to set a time limit, or to count the number of commands used to accomplish the task.

3.1 Incremental Understanding

Shape Game is designed to process incremental recognition results from the speech recognizer, as the user speaks. It tries to indicate that it is understanding (or not understanding) what the user is saying as quickly as possible, giving the user the chance to correct the system’s understanding. Understanding is indicated through visual feedback to the user as she speaks commands, mostly through the use of highlighting. While designing the incremental processing component, it became apparent that there were two stages to the incremental processing. First, the system should indicate that it has understood what has been said so far, while still giving the user a chance to revise that understanding. Only once the user has indicated (usually implicitly) that the system’s understanding is correct should the system actually execute a command.

Each of figures 3, 4, and 5 demonstrates different techniques Shape Game uses to provide incremental visual feedback. Figure 3 shows the most straightforward use of visual feedback: the incremental processing of the following selection command: select the red rectangle [pause] the large red rectangle. As this command is given, highlighting is used to show the set of shapes which match the current description. While speaking, the user immediately notices that “the red rectangle” in fact describes two rectangles, and is able to revise the selection simply by revising the description of the shape. Notably, in a typical speech understanding system, the user would have to hold down the hold-to-talk button, say the command, and then let go before seeing the result. Then she would have to press and hold the button again to make the revision. In Shape Game, she can just hold down the button the whole time she’s speaking, and repair her utterance on the fly.
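The highlighting behavior amounts to filtering the shape set by the attribute constraints heard so far. The following is an illustrative Python sketch, not the actual Game Engine code; the shape records and attribute names are invented for the example:

```python
# Hypothetical sketch of incremental shape matching: as each new attribute
# (size, color, shape type) arrives in a partial recognition result, the
# candidate set is narrowed and would be re-highlighted on screen.

def matching_shapes(shapes, constraints):
    """Return the shapes consistent with every attribute heard so far."""
    return [s for s in shapes
            if all(s.get(k) == v for k, v in constraints.items())]

shapes = [
    {"id": 1, "size": "small", "color": "red", "type": "rectangle"},
    {"id": 2, "size": "large", "color": "red", "type": "rectangle"},
    {"id": 3, "size": "large", "color": "blue", "type": "circle"},
]

# "select the red rectangle" -> two candidates, so both are highlighted
partial = {"color": "red", "type": "rectangle"}
print([s["id"] for s in matching_shapes(shapes, partial)])   # [1, 2]

# "... the large red rectangle" -> the revision narrows it to one
partial["size"] = "large"
print([s["id"] for s in matching_shapes(shapes, partial)])   # [2]
```

Under this scheme, a revision mid-utterance simply adds or overwrites a constraint; no special repair logic is needed for the display to update.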

Figure 4 shows how highlighting is used during a sequence of commands: drop it in slot five slot six and make it small. As soon as the user says “drop,” all of the possible drop slots are highlighted, indicating the system’s expectation that we are in the process of a drop command. Then, once “slot five” is mentioned, this slot is highlighted. However, the drop command is not actually executed at this point – which gives the user a chance to revise. The user changes his mind, noticing that the slot to the right is actually a better match, and makes a correction mid-stream simply by saying “slot six.” This slot is then highlighted. Once the user then begins the “make” command, the actual drop occurs, as moving on to a new command is taken as an implicit signal that the user is happy with the system’s understanding of the previous command. Next, the utterance make it small is given and the triangle is made smaller; however, it is rendered translucently, to indicate that this is still a tentative change. At this point, the user could still hesitate and say, for example, “blue” to indicate that they actually want to change the color instead of the size. Finally, the user releases the hold-to-talk button, signalling that this command is complete, and the triangle becomes opaque, to indicate that the command has been executed.
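The distinction between a tentative command and an executed one can be sketched as a tiny state machine. This is an illustrative Python sketch, not the actual Game Engine code; the event names are invented:

```python
# Illustrative sketch of the two-stage policy described above: a command is
# only *tentative* while it is being spoken, and is executed when the user
# implicitly accepts it by starting the next command or releasing the
# hold-to-talk button.

class CommandRunner:
    def __init__(self):
        self.pending = None      # tentative command, shown but not applied
        self.executed = []       # commands actually carried out

    def on_partial(self, command):
        """A revised hypothesis for the command currently being spoken."""
        self.pending = command   # overwrite: the user may revise freely

    def on_next_command(self, command):
        """Starting a new command commits the previous one."""
        self._commit()
        self.pending = command

    def on_button_release(self):
        """Releasing hold-to-talk commits whatever is still pending."""
        self._commit()

    def _commit(self):
        if self.pending is not None:
            self.executed.append(self.pending)
            self.pending = None

r = CommandRunner()
r.on_partial("drop it in slot five")
r.on_partial("drop it in slot six")      # mid-stream revision, nothing run
r.on_next_command("make it small")       # implicit accept of the drop
r.on_button_release()
print(r.executed)   # ['drop it in slot six', 'make it small']
```

Only the final revision of each command reaches the executed list, which is exactly why mid-utterance corrections like “slot six” are harmless.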

Finally, figure 5 shows the last type of incremental feedback which can be provided in Shape Game. Here, the user first issues this command: select the large triangle. However, this does not uniquely identify a shape, as there are two large triangles. Oblivious, the user begins the next command: drop it in .... However, as soon as the system understands “drop”, it indicates visually that it doesn’t know which shape to drop by flashing question marks in the locations of the two large triangles. Thus, whereas in a typical speech understanding system the user would have to complete the “drop” command before receiving feedback that it can’t be executed, with incremental understanding the previous utterance can be immediately corrected. This feels extremely natural: a human playing the role of the computer would likely try to interrupt the user to indicate the problem before moving on.

4 System Architecture

Now that I’ve introduced Shape Game on a functional level, I’ll describe the architecture underlying it. The overall architecture can be found in figure 6. The interface is based on a client-server architecture, in which the speech and natural language processing components reside on a server. The interface to the game itself is web-based, and can be accessed in a standard web browser. A Java applet streams audio to the speech recognizer, while AJAX (Asynchronous Javascript And XML) techniques are used to dynamically update the game interface shown on the web page. The large green button at the top of the page is a hold-to-talk button,


Figure 3: Screenshots of the system as it incrementally responds to a “selection” command: (1) select (2) the red (3) rectangle (4) [pause] the large (5) red rectangle. (Numbers indicate points in the incremental recognition at which each screenshot occurs.) The user gets immediate feedback about which shapes the given constraints select.


Figure 4: Screenshots of the system as it incrementally responds to a “drop” command followed by a “make” command: drop (1) it in slot five (2) slot six (3) and make (4) it small (5) [release hold-to-talk button] (6). We note that the selected slot for the drop is highlighted, but the user can still change her mind mid-utterance to change it; it is not until the beginning of the “make” command that the actual drop action occurs. Similarly, during the “make” command, a small triangle is shown in a translucent green color until the hold-to-talk button is released, at which time it becomes solidly colored.


Figure 5: Screenshots of the system as it incrementally responds to the following utterance: select the large triangle (1) drop (2) it in. The selection is ambiguous (it selects two shapes); as soon as the user begins the “drop” command, the system flashes question marks over those two shapes to indicate that it doesn’t know which one to drop. This allows the user to make a correction immediately, without having to complete her command.

which the user must hold down while speaking. The architecture was developed for the City Browser application (Gruenstein et al., 2006), and now serves as a platform for several multimodal dialogue systems in our lab.

On the server, we use the SUMMIT automatic speech recognizer (Glass et al., 1999), which interacts with our Java Servlet based Game Engine via the GALAXY framework (Seneff et al., 1998). Incremental speech results are sent from the recognizer to the Game Engine as speech is processed, where processing occurs. Details of various key aspects of the system follow.

Figure 6: Shape Game Architecture (a GALAXY hub connects the ASR to the Game Engine servlet, which updates the browser via AJAX)

4.1 Incremental Recognition

In order to obtain incremental recognition results, I had to make modifications to the SUMMIT recognizer, which is implemented in C++. Typically, in the recognizer, speech is processed as it is received; however, the best (or n-best) recognition hypotheses are not determined until the utterance is complete. In order to calculate the best overall candidate hypothesis for the utterance, it is necessary to have the entire utterance, as the language model will be used as a constraint. However, as we are processing the utterance, it is possible to determine our “best guess” so far. We employ a method which is almost identical to the Viterbi backtrace procedure used once the utterance is complete. However, while the normal backtrace enforces the constraint that the last state be a final state (one that we have evidence should occur at the end of an utterance), this constraint is relaxed when we perform the search while still processing the utterance.
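The relaxed backtrace can be illustrated with a toy example. This is a heavily simplified Python sketch over an invented two-frame trellis; a real decoder stores Viterbi scores and backpointers per (frame, state), and the state names and scores here are made up:

```python
# Toy sketch of the relaxed backtrace idea. Each trellis frame maps
# state -> (score, word_sequence_so_far). `final_states` marks the states
# that are allowed to end a complete utterance.

def best_hypothesis(trellis, final_states, utterance_complete):
    """Backtrace from the best last-frame state.

    When the utterance is complete, the end state must be a legal final
    state; while still decoding, that constraint is relaxed and we simply
    take the best-scoring state seen so far.
    """
    candidates = list(trellis[-1].items())
    if utterance_complete:
        candidates = [(s, h) for s, h in candidates if s in final_states]
    state, (score, words) = max(candidates, key=lambda kv: kv[1][0])
    return words

trellis = [
    {"s_sel": (-1.0, ["select"])},
    {"s_mid": (-1.8, ["select", "the"]),
     "s_end": (-2.5, ["select", "it"])},   # only s_end is a final state
]
final_states = {"s_end"}

print(best_hypothesis(trellis, final_states, utterance_complete=False))
# -> ['select', 'the']  (best guess so far; final-state constraint relaxed)
print(best_hypothesis(trellis, final_states, utterance_complete=True))
# -> ['select', 'it']
```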

The next important question is how often to do the calculations required to find the best partial hypothesis. In a very small domain like Shape Game, I’ve found it to be no problem to update the partial hypothesis as soon as any new chunk of data is received for processing. This means that the calculation occurs quite frequently – sometimes on the order of every few milliseconds. However, new partial results are only sent to the Game Engine if the new partial hypothesis differs from the previous one. This means that in Shape Game, partial results are usually sent after every word, or almost every word. In a larger domain, this might represent a significant overhead, and a balance would have to be found between the need for partial results and computation.
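The filtering described above (forwarding a partial hypothesis only when it differs from the previous one) can be sketched as follows; this is an illustrative Python sketch, not the recognizer’s actual C++ code:

```python
# Minimal sketch of partial-result deduplication: hypotheses may be
# recomputed every few milliseconds, but only ones that differ from the
# previous hypothesis are forwarded to the Game Engine.

def forward_changed_partials(partials):
    sent = []
    last = None
    for hyp in partials:
        if hyp != last:
            sent.append(hyp)   # in the real system: send to the Game Engine
            last = hyp
    return sent

raw = ["select", "select", "select the", "select the", "select the big"]
print(forward_changed_partials(raw))
# -> ['select', 'select the', 'select the big']
```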

4.2 Language Model

I used a very small context free grammar in the Java Speech Grammar Format (JSGF) as the speech recognizer’s language model. Conveniently, it is also possible to embed semantic tags in the grammar which are output alongside the recognized words. This means that I did not have to use a separate parser to understand utterances; instead, I am simply able to discard the actual recognized words and look only at the semantic tags which are output in the stream. The entire grammar is given in figure 7.
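Extracting the semantic tags from a recognition result is then a matter of keeping the bracketed key=value annotations and discarding the words (the real output format appears in figure 8). A hypothetical Python sketch of this step:

```python
# Hypothetical parsing sketch: the recognizer emits words interleaved with
# semantic tags in square brackets. The engine can discard the words and
# keep only the key=value tags, in recognition order.
import re

def semantic_tags(partial):
    return [tuple(m.split("=")) for m in re.findall(r"\[([^\]]+)\]", partial)]

hyp = "select [command=select] the big [size=large] red [color=red]"
print(semantic_tags(hyp))
# -> [('command', 'select'), ('size', 'large'), ('color', 'red')]
```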

The grammar is also notable in that it allows many different types of false starts and corrections. At the same time, it allows an arbitrary number of commands to be looped together. The following utterance illustrates the flexibility of the grammar:

select the large green select the large green blue red triangle red circle and drop it in slot five six slot four and make it green blue a square make it blue select the small large red rectangle

This flexibility allows for a large amount of freedom on the part of the user. The down side of it is that the language model is not very well constrained. In this very narrow domain, this does not cause problems. However, it would likely lead to a degradation of accuracy in a domain with a larger vocabulary and more complex rules. Recognition difficulties could be mitigated by using a probabilistic (or weighted) context free grammar, trained with collected user data (or via intuition). This could be used to lend greater weights to false starts and corrections where they are observed to be more frequent. In addition, I believe that adding terminals like “uh”, “um”, and “i mean”, which typically signal repairs, as well as an overt modeling of pauses, could improve performance. So, too, could the utilization of prosodic information. Alas, this speculation all represents future areas of exploration.

#JSGF V1.0;
grammar ShapeGame;

public <top> = (<command> [and])+ ;

<command> = <select> | <drop> | <make> ;

<select> = (select | choose) {[command=select]} <shape_desc>+ ;

<drop> = (drop | put) {[command=drop]} it in <slot>+ ;

<slot> = slot <number>+ ;

<make> = make {[command=make]} it [a] <shape_attr>+ ;

<shape_desc> = [the] <shape_attr>+ ;

<shape_attr> = <size> | <color> | <shape_type> ;

<size> = <size_adj> [sized] ;

<size_adj> = (small | tiny) {[size=small]}
           | medium {[size=medium]}
           | (large | big) {[size=large]} ;

<color> = red {[color=red]}
        | green {[color=green]}
        | blue {[color=blue]} ;

<shape_type> = (rectangle | bar) {[shape_type=rectangle]}
             | square {[shape_type=square]}
             | circle {[shape_type=circle]}
             | triangle {[shape_type=triangle]}
             | one
             | shape ;

<number> = one {[number=1]}
         | two {[number=2]}
         | three {[number=3]}
         | four {[number=4]}
         | five {[number=5]}
         | six {[number=6]} ;

Figure 7: The context free grammar used for the recognizer’s language model. Words enclosed in < > are nonterminals. Text enclosed inside curly braces { } are semantic tags.

4.3 Game Engine

I have established the mechanism by which the recognizer outputs partial results, and shown how our language model can be used to directly provide “semantic” information. I now turn to the Game Engine, which processes these partial recognition results, embedded with semantic tags, and updates the game display. The Game Engine is implemented as a Java servlet, which updates the game displayed in the browser by sending XML messages to the client browser. The browser uses Javascript to dynamically render the web page, based on the contents of the messages it receives from the servlet.

The Game Engine must process incremental recognition results sent from the recognizer as they occur. Two typical sequences of partial results from the recognizer are shown in figure 8. The game engine processes these partials by throwing away the recognized words and processing the list of semantic tags. The list of semantic tags is segmented into separate lists for each command in the sequence. The index of the currently active command is stored after each partial result is processed. This is critical, as a partial result might contain several commands (if the user holds down the hold-to-talk button for a long time); however, many of these commands may have already been executed.
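The segmentation step can be sketched as follows. This is an illustrative Python sketch, not the servlet’s actual code; the convention that a command= tag opens a new per-command list is taken from the grammar in figure 7:

```python
# Illustrative sketch of segmenting a semantic tag stream into per-command
# lists. A new `command=...` tag starts a new command; the last list is the
# currently active command, whose index the engine stores so that commands
# already executed during a long button press are not re-run.

def segment_commands(tags):
    commands = []
    for key, value in tags:
        if key == "command":
            commands.append([(key, value)])
        elif commands:
            commands[-1].append((key, value))
    return commands

tags = [("command", "select"), ("size", "large"), ("color", "red"),
        ("command", "drop"), ("number", "5")]
cmds = segment_commands(tags)
print(len(cmds), cmds[-1])
# -> 2 [('command', 'drop'), ('number', '5')]
```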

To determine the specifics of each command, a simple method is employed. The Engine simply iterates over each semantic tag in that command’s list, in the order the tags were recognized, and sets its value in a hashtable. That way, the most recent value for each semantic tag will be stored in the hashtable after we finish iterating over all the semantic tags in the list. This simple method accounts for most false starts and corrections, simply by throwing away earlier information whenever it is replaced.
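This last-value-wins resolution is naturally expressed as a fold into a dictionary. An illustrative Python sketch (the real engine uses a Java hashtable):

```python
# Sketch of the "most recent value wins" resolution described above:
# folding a command's tag list into a table naturally discards false starts
# and corrections, because later values overwrite earlier ones.

def resolve(tag_list):
    slots = {}
    for key, value in tag_list:       # in recognition order
        slots[key] = value            # later tags overwrite earlier ones
    return slots

# "drop it in slot five slot six": the correction "six" replaces "five"
tags = [("command", "drop"), ("number", "5"), ("number", "6")]
print(resolve(tags))
# -> {'command': 'drop', 'number': '6'}
```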

While a command is active, the system indicates that it understands the command; however, it doesn’t execute it. This is done multimodally, as was described extensively in section 3. When the user indicates that she is satisfied with the command, either by moving on to a new command or by letting go of the hold-to-talk button, the system executes the command and advances the stored index of the currently active command. This is important both because a user may revise a command, and because the current partial recognition result is often error prone towards the end of the hypothesis. Errors arise because the recognizer is quite aggressive in matching the best word in its vocabulary with what it hears; so, often as you are in the middle of saying a word, the recognizer will already be trying to make its best guess about what that word is. As the partial results in figure 8 show, this often leads to very brief errors in the partial results – we note that “slot three” is initially hypothesized to be “slot six,” but this is subsequently revised. While it might make sense to ignore the last word or two in the hypothesis altogether, in our small domain the recognizer is often, in fact, correct – and this means that sometimes the system feels as though it’s almost reading your mind, understanding words you have just begun to say. Since we can update the display quite rapidly, the user may not even notice if the wrong slot is highlighted for a fraction of a second.

select [command=select]
select [command=select] small [size=small]
select [command=select] the
select [command=select] the big [size=large]
select [command=select] the big [size=large] red [color=red]
select [command=select] the big [size=large] red [color=red] rectangle [shape_type=rectangle]

drop [command=drop] it
drop [command=drop] it in slot
drop [command=drop] it in slot six [number=6]
drop [command=drop] it in slot three [number=3]
drop [command=drop] it in slot three [number=3] slot
drop [command=drop] it in slot three [number=3] slot four [number=4]

Figure 8: Partial recognition results for two commands: select the big red rectangle and drop it in slot three slot four

The Game Engine minimizes traffic over the HTTP connection to the web browser, as this link is potentially slow. In order to do this, it stores the state of the display locally and only sends messages to the client when the display must be changed. In addition, it knows the position of each shape, and generates a random set of shapes at the beginning of each game. Shapes are generated so that no two shapes are the same anywhere on the board. This guarantees that each morphable shape on the bottom can be uniquely identified, and that at least one move is always required to change a morphable shape to match a template.
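The uniqueness constraint on board generation can be met by sampling without replacement from the space of attribute combinations. This is a hypothetical sketch; the attribute values match the grammar in figure 7, but the generation code itself is invented:

```python
# Hypothetical sketch of board generation under the uniqueness constraint:
# sampling shape descriptions without replacement guarantees that no two
# shapes on the board share the same (size, color, type), so every shape
# can be described unambiguously.
import itertools
import random

SIZES = ["small", "medium", "large"]
COLORS = ["red", "green", "blue"]
TYPES = ["rectangle", "square", "circle", "triangle"]

def generate_board(n, seed=None):
    rng = random.Random(seed)
    combos = list(itertools.product(SIZES, COLORS, TYPES))  # 36 possibilities
    return rng.sample(combos, n)   # without replacement -> all distinct

board = generate_board(12, seed=0)
assert len(set(board)) == 12       # all twelve morphable shapes distinct
print(board[0])
```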

4.4 Game Display

The game is displayed as standard HTML in a web browser. Javascript is used to update this display as new messages are pushed from the server. The shapes for the game were created using a script which produced SVG (Scalable Vector Graphics) descriptions of each shape, which were then converted to PNG files suitable for display in a web browser. Clearly, the responsiveness of the system is highly dependent on the speed and latency of the user’s network connection – and given that we are interested in incremental understanding, latency is a huge potential pitfall. I have tested the system in three scenarios: both server and client on the same computer, client on MIT’s wireless network, and client on my cable modem at home. I’ve found the three different locations to be indistinguishable with regards to responsiveness.

5 User Evaluation

I performed a small user evaluation of Shape Game by asking naive users to interact with it. One pilot user tried the system early on, and helped me to identify the need to be more careful to only show understanding until a command is ready to be executed. Early versions of the game immediately executed commands as they came in. While this model works fine for shape selection, it is confusing and error prone for dropping shapes and changing their attributes.

After refining the system based on the pilot user’s experiences, I asked 4 subjects to use Shape Game. Each user was given a brief demonstration of how to use the system, and then asked to play the game in two different modes, marked as “Mode 1” and “Mode 2” on the web page. In Mode 1, incremental understanding was performed as has been described in this report. In Mode 2, no incremental feedback was given to users; instead, they could say as many commands as they liked in a row while pressing the hold-to-talk button, but would not see the results of these commands until releasing the button. Three subjects played the game first in Mode 1 and then Mode 2, while one subject played in the reverse order.

After using the system, I asked each subject to give me feedback about whether they liked the game, and about which mode they preferred. Three indicated that they preferred the mode with incremental feedback, while one thought they were both different but equal. This user perceived the incremental understanding mode as more of a “beginners” mode, while the mode without incremental understanding was thought to be more of an “advanced” mode. I think this observation came out of the fact that users tend to speak a little more slowly in the incremental mode, as they wait for feedback from the system before moving on. In the other mode, they felt as though they were going faster because they weren’t afraid to just say several commands and hope for the best. While accuracy in this domain is high enough that this is a viable strategy, in a larger, more errorful domain, users would likely be afraid to partake in such behavior.

In addition, I received feedback that the incremental mode would be more helpful if the task were more difficult. Along the same lines, I also received feedback that this might be a good game to play in a non-native language. It would be a good way to practice vocabulary, and at the same time the incremental mode might be more useful because the task would be more difficult for a non-native speaker. Unfortunately, all of my subjects were native English speakers.

Finally, I observed that in both modes, people did make use of the system's ability to handle disfluencies, false starts, and corrections. Subjects tended to have more successful interactions because of these capabilities; incremental mode is a nice way for them to gain confidence that this sort of processing is possible.

6 Conclusion and Future Work

I have presented Shape Game, a novel multimodal speech understanding game which features incremental understanding. I have discussed the various components which make this application possible. I have shown that users do tend to prefer the incremental mode, even in this simple task. I believe that in more complex tasks, incremental understanding with graphical feedback has the potential to play a more important role.

There are several promising directions for furthering this work. After getting a very positive response from people in SLS who are interested in games to help people learn new languages, a particularly interesting direction would be to develop the game further in several different languages. The game could be used to help build vocabulary, and need not be limited to shapes. Different sets of vocabulary could be used (animals, foods, cars, etc.) in this game, or in a modified version of it. In addition, a stage in which vocabulary is incrementally introduced could be useful. In any case, the work I've presented here serves as a nice model for a fairly painless way to create new kinds of speech understanding games. It would be very interesting to see how well the partial recognition results would scale up to more complicated games, and to determine whether they are useful in language acquisition. It is especially difficult to formulate a complete, fluent sentence in a foreign language, so the ability to make repairs as you speak, and have the system show that it understands, could be extremely useful.

I am also interested in taking what I've learned here and incorporating it into multimodal dialogue systems which are an order of magnitude more complex than Shape Game. The same principle of showing understanding without immediately executing commands could be extremely useful in more complex domains. Scaling it up could be quite complicated, especially as the partial results would likely become much more error-prone. It would be key to find ways to give feedback about what has been understood so far, without necessarily indicating that you've understood the most recent few words.

7 Acknowledgements

Stephanie Seneff, Chao Wang, Lee Hetherington, and Scott Cyphers all provided valuable help in conceiving and implementing various aspects of the game. The volunteers who played the game also provided valuable feedback: Anna Liess, Sean Liu, Ian McGraw, Asthma Mohammad, and Justin Weinstein-Tull. Finally, thanks to Ali Mohammad, who designed Shape Game's logo after seeing my own pathetic attempt.
