Multimodal Speech-Gesture Interface for Handfree Painting on a Virtual Paper Using Partial Recurrent Neural Networks as Gesture Recognizer

Andrea Corradini, Philip R. Cohen
Oregon Graduate Institute of Science and Technology
Center for Human-Computer Communication
20000 N.W. Walker Rd, Beaverton, OR 97006
[email protected]

ABSTRACT - We describe a pointing and speech alternative to current paint programs based on traditional devices such as mouse, pen or keyboard. We used a simple magnetic-field-tracker-based pointing system as the input device of a painting system, to provide a convenient means for the user to specify paint locations on any virtual paper. The virtual paper itself is determined by the operator as a limited plane surface in three-dimensional space. Drawing occurs with natural human pointing, by using the hand to define a line in space and considering its possible intersection point with this plane. Pointing gestures are recognized by means of a partial recurrent artificial neural network. Gestures, along with several vocal commands, are used to act on the current painting in conformity with a predefined grammar.

Keywords: User-centered Interface, Painting Tool, Pointing Gesture, Speech Recognition, Communication Agent, Multimodal System, Augmented and Virtual Reality, Partial Recurrent Artificial Neural Network.

1. INTRODUCTION

The natural combination of a variety of modalities such as speech, gesture, gaze, and facial expression makes human-human communication easy, flexible and powerful. Similarly, when interacting with computer systems, people seem to prefer a combination of several modes to a single one alone [12, 19]. Despite the strong efforts and deep investigations of the last decade, human-computer interaction (HCI) is still in its infancy, and its ultimate goal of building natural perceptual user interfaces remains a challenging problem.

Two concurrent factors produce this awkwardness. First, current HCI systems impose rigid rules and syntax on the individual modalities involved in the dialogue. Second, speech and gesture recognition, gaze tracking, and other channels are treated in isolation, because we do not understand how to integrate them to maximize their joint benefit [20, 21, 25, 30]. While the first issue is intrinsically difficult (everyone claims to know what a gesture is, but nobody can tell you precisely), progress is being made in combining different modalities into a unified system. Such a multimodal system, allowing interactions that more closely resemble everyday communication, becomes more attractive to users.

1.1 Related Work

Like speech, gestures vary both from instance to instance for a given human being and among individuals. Besides this temporal variability, gestures also vary spatially, which makes them more difficult to deal with. For the recognition of these single modalities, only a few systems make use of connectionist models [3, 7, 17, 27], since such models are not considered well suited to completely address the problems of time alignment and segmentation. However, some neural architectures [10, 14, 29] have been put forward and successfully exploited to partially solve problems involving the generation, learning or recognition of sequences of patterns.

Recently, several research groups have more thoroughly addressed the issue of combining verbal and nonverbal behavior. In this context, most such multimodal systems have been quite successful in combining speech and gesture [4, 6, 26, 28] but, to our knowledge, none exploits artificial neural networks. One of the first such systems is Put-That-There [4], which uses speech recognition and allows simple deictic reference to visible entities. A text editor featuring a multimodal interface that allows users to manipulate text using a combination of speech and pen-based gestures is presented in [28]. QuickSet [6], along with a novel integration strategy, offers mutual compensation between the pen and voice modalities.

Among gestures, pointing is a compelling input modality that has led to friendlier interfaces (such as the mouse-enabled GUI) in the past. Unfortunately, few 3D systems that integrate speech and deictic gesture have been built to detect when a person is pointing without special hardware support and to provide the information necessary to determine the direction of pointing. Most of those systems have been implemented by applying computer vision techniques to observe and track finger and hand motion. The hand-gesture-based pointing interface detailed in [24] tracks the position of the fingertip with which the user points and maps it directly onto 2D cursor movement on the screen. Fukumoto et al. [11] report a glove-free, camera-based system providing pointing input for applications requiring computer control from a distance (such as a slide presentation aid).

Further stereo-camera techniques for the detection of real-time pointing gestures and estimation of the direction of pointing have been exploited in [5, 8, 13]. More recently, [26] describes a bimodal speech/gesture interface integrated into a 3D visual environment for computing in molecular biology. The interface lets researchers interact with 3D graphical objects in a virtual environment using spoken words and simple hand gestures.

In our system, we make use of the Flock of Birds (FOB) [1], a six-degree-of-freedom tracker device based on magnetic fields, to estimate the pointing direction. In an initialization phase, the user is required to set the target coordinates in 3D space that bound his painting region. With natural human pointing behavior, the hand is used to define a line in space, roughly passing through the base and the tip of the index finger. This line does not usually lie in the target plane, but may intersect it at some point. We recognize pointing gestures by means of a hybrid partial recurrent artificial neural network (RNN) consisting of a Jordan network [14] and a static network with buffered input to handle the temporal structure of the movement underlying the gesture. Concurrently, several speech commands can be issued asynchronously; they are recognized using Dragon 4.0, a commercial speech engine. Speech along with gestures is then used to put the system into various modes that affect the appearance of the current painting. Depending on the spoken command, we solve for the intersection point and use it either to directly render ink or to draw a graphical object (e.g. circle, rectangle, or line) at this position in the plane. Since the speech and tracking modules are implemented on different machines, we employed our agent architecture to allow the different modules to exchange messages and information.

2. DEICTIC GESTURES

As far as HCI is concerned, no comprehensive classification of natural gestures currently exists that would help in establishing a methodology for gesture understanding. However, there is general agreement in defining the class of deictic or pointing gestures [9, 15, 18]. The term deictic is used in reference to gestures or words that draw attention to a physical point or area in the course of a conversation. Among natural human gestures, pointing gestures are the easiest to identify and interpret. There are three body parts which can conventionally be used to point: the hands, the head, and the eyes. Here we are concerned only with manual pointing.

In Western society, there are two distinct forms of manual pointing which regularly co-occur with deictic words (like this, that, those, etc.): one-finger pointing, to identify a single object or a group of objects, a place or a direction, and flat-hand pointing, to describe paths or spatial extents such as roads or ranges of hills.

Some researchers [23] argue that pointing has iconic properties and represents a prelinguistic and visually perceivable event. In fact, in face-to-face communication deictic speech never occurs without an accompanying pointing gesture.¹ In a shared visual context, any verbal deictic expression like "there" is unspecified without a parallel pointing gesture. These multiple modes may seem redundant until we consider a pointing gesture as a complement to speech, which helps form semantic units. We can easily realize this by speaking with children: they compensate for their limited vocabulary by pointing more, probably because they cannot convey as much information about an object or location by speaking as they can by directing their interlocutor to perceive it with his own eyes.

2.1 An Empirical Study

As described in the previous section, pointing is an intentional behavior that aims at directing the listener's visual attention to either an object or a direction. It is controlled both by the pointer's eyes and by muscular sense (proprioception). In the real world, our pointing actions are not coupled with cursors, yet our interlocutors can often discern the intended referents by processing the pointing action and the deictic language together.

We conducted an empirical experiment to investigate how precise pointing is when no visual feedback is available. We invited four subjects to point at a target spot on the wall using a laser pointer. They did this task from six different distances from the wall (equally distributed from 0.5 m to 3 m), ten times for each distance. Each time, the subject attempted to point at the target with the beam turned off. Once a subject was convinced that he had directed the laser pointer toward the bull's-eye, we turned on the laser and determined the point the user was really aiming at. We measured the distance between the bull's-eye the user was aiming for and the actual spot he indicated with the laser pointer, and computed the overall error for each distance as the average distance between desired and actual points on the wall over all trials at that distance.

The subjects were requested to perform this experiment twice, in two different ways: in a "natural" way and in an "improved" way. In the natural way we asked the persons involved simply to point at the target naturally, while in the improved way we specifically asked each person to try to achieve the best possible result (some people put the laser pointer right in front of one eye and closed the other, others put it right in front of the nose, etc.). The outcome of the experiment is shown in Figure 1.

¹ This does not happen in sentences that are used for referencing places, objects or events that the interlocutors have clear in their minds because of the dialogue context. E.g., in "Have you been to Italy?" "Yes, I have been there twice", or "I watched Nuovo Cinema Paradiso on TV yesterday. Didn't that film win an Oscar in 1989?", deictic words are not accompanied by pointing gestures. Neither are they in sentences like "There shall come a time", "They all know that Lara is cute", or "The house that she built is huge", where they are used as a conjunction or pronoun.

As expected, the error increases with increasing distance. In addition, when the user pointed at the given spot from a distance of 1 meter, the error decreased from 9.08 to 3.89 centimeters from the natural to the improved way.

Figure 1: target pointing precision (pointing inaccuracy versus distance from the wall, 0.5 m to 3 m, for improved and natural pointing).
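
The error measure used above is simply the mean distance between intended and actual points; the following is a small sketch of that computation, not taken from the paper, with hypothetical names and made-up sample coordinates:

```python
import math

def mean_pointing_error(trials):
    """trials: list of ((x_target, y_target), (x_actual, y_actual)) pairs,
    all for one distance from the wall; returns the mean error in the same units."""
    errors = [math.hypot(ax - tx, ay - ty)
              for (tx, ty), (ax, ay) in trials]
    return sum(errors) / len(errors)

# Hypothetical example: three trials at one distance, coordinates in cm.
trials = [((0.0, 0.0), (2.5, 1.0)),
          ((0.0, 0.0), (-1.0, 3.0)),
          ((0.0, 0.0), (0.5, -2.0))]
print(round(mean_pointing_error(trials), 2))
```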

In light of this experiment, reference resolution of deictic gestures without verbal language is an issue. In particular, when small objects are placed close together, reference resolution via deictic gesture can be impossible without the help of spoken specification. In addition, a direct mapping between the 3D user input (the hand movement) and the user's intention (pointing on the target plane) can be performed reliably only with visual feedback (information on the current position). In the next section, we describe the system that has been built according to these considerations.

3. THE PAINTING SYSTEM

3.1 Estimating the pointing direction

For the whole system to work, the user is required to wear a glove on top of which we place one FOB sensor. The FOB is a six-degree-of-freedom tracker device based on magnetic fields, which we exploit to track the position and orientation of the user's hand with respect to the coordinate system determined by the FOB's transmitter. The hand's position is given by the position vector reported by the sensor at a frequency of approximately 50 Hz. For the orientation, we put the sensor almost at the back of the index finger with its relative x-coordinate axis directed toward the index fingertip. In this way, using the quaternion values reported by the sensor, we can apply transformations within quaternion algebra to determine the unit vector X which unambiguously defines the direction of the sensor, and therefore that of pointing (Figure 2). The reported position, along with the vector X, is then used to determine the equation of the imaginary line passing through that point with direction X.

When the system is started for the first time, the user has to choose the region he wants to paint in.


This is accomplished by letting the user choose three of the vertices of the future rectangular painting region. These points are chosen by pointing at them. However, since this procedure takes place in 3D space, the user has to aim at each of the vertices from two different positions; the two pointing lines are triangulated to select a point as the vertex. In 3D space, two lines will generally not intersect; in such cases, we use the point of minimum distance from both lines.

With natural human pointing behavior, the hand is used to define a line in space, roughly passing through the base and the tip of the index finger. Normally, this line does not lie in the target plane but may intersect it at some point. It is this point that we aim to recover.

For this reason, when the region selected in 3D space is neither a wall screen nor a surface on which the input can be directly output (a tablet, the computer's monitor, etc.), the system can be used properly only when the magnetic sensor is aligned with and used together with a light pointer. However, for this situation we also implemented a rendering module that draws the actual painting on the screen regardless of the target plane chosen in 3D space.
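
The pointing geometry described in this subsection can be summarized in a short sketch; this is not the system's actual code, and the function names and numpy representation are assumptions. The sensor quaternion rotates the local x-axis into the pointing direction X, two pointing rays are "triangulated" via the midpoint of their shortest connecting segment, and paint positions come from intersecting a pointing ray with the target plane.

```python
import numpy as np

def quat_to_direction(q):
    """Rotate the sensor's local x-axis (1, 0, 0) by quaternion q = (w, x, y, z)
    to obtain the unit pointing vector X in world coordinates."""
    w, x, y, z = q
    # First column of the standard quaternion rotation matrix.
    X = np.array([1.0 - 2.0 * (y * y + z * z),
                  2.0 * (x * y + w * z),
                  2.0 * (x * z - w * y)])
    return X / np.linalg.norm(X)

def closest_point_between_lines(p1, d1, p2, d2):
    """Midpoint of the shortest segment between lines p1 + t*d1 and p2 + s*d2
    (used here to triangulate a vertex from two pointing positions).
    All arguments are 3-element numpy arrays."""
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w0 = p1 - p2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:            # nearly parallel pointing rays
        t = 0.0
        s = e / c                    # closest point on the second line to p1
    else:
        t = (b * e - c * d) / denom
        s = (a * e - b * d) / denom
    return 0.5 * ((p1 + t * d1) + (p2 + s * d2))

def ray_plane_intersection(p, X, plane_point, plane_normal):
    """Intersection of the pointing line p + t*X with the target plane;
    plane_normal can be obtained as the cross product of two edges of the
    painting region. Returns None if the line is (almost) parallel to the plane."""
    denom = X @ plane_normal
    if abs(denom) < 1e-9:
        return None
    t = ((plane_point - p) @ plane_normal) / denom
    return p + t * X
```

In this reading, each vertex of the painting region would come from two such rays captured from different standpoints, and the ray-plane intersection gives the position where ink or a graphical object is placed.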

Figure 2: selecting a graphic tablet as target region for painting enables direct visual feedback. The frame of reference of the sensor is shown on the left. On the right, an example painting is shown as it appears on the tablet.

3.2 Motion Detection for Segmentation

In order to describe the motion detector in detail, we first need to give some definitions. We consider the FOB data stream static anytime the sensor attached to the user's hand remains stationary for at least five consecutive FOB reports. In this case, we also refer to the user as being in the resting position. In a similar way, we say the user is moving, and we consider the data stream dynamic, whenever the incoming reports change in their spatial location at least five times in a row.

Static and dynamic data streams are defined in such a way that they are mutually exclusive, but not exhaustive. In other words, if one definition is satisfied, the other is not. However, the converse does not hold: if one definition is not satisfied, this does not imply that the other is. Such non-complementarity makes the motion detector module robust against noisy data.

For real-time performance purposes, the FOB data are currently downsampled to 10 Hz so that both static and
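
A minimal sketch of the static/dynamic segmentation rule above follows; the five-report window comes from the definitions given, while the stillness tolerance and all names are assumptions rather than the system's actual code.

```python
from collections import deque
import numpy as np

class MotionDetector:
    """Labels the FOB report stream as 'static', 'dynamic' or undecided (None),
    following the five-consecutive-report rule described above."""

    def __init__(self, window=5, eps=1e-3):
        self.window = window
        self.eps = eps                         # assumed stillness tolerance
        self.positions = deque(maxlen=window)

    def update(self, position):
        self.positions.append(np.asarray(position, dtype=float))
        if len(self.positions) < self.window:
            return None                        # not enough reports yet
        pts = list(self.positions)
        steps = [np.linalg.norm(b - a) for a, b in zip(pts, pts[1:])]
        if all(s <= self.eps for s in steps):
            return "static"                    # resting position
        if all(s > self.eps for s in steps):
            return "dynamic"                   # the user is moving
        return None                            # neither definition satisfied
```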


neurons compute the sigmoid activation function. The additional output neuron is checked anytime a classification result is required.

We tested the recognizer on four sequences: two of a person performing pointing gestures toward a given virtual paper, and two of the same person gesticulating during a monologue without deictic gestures. Each sequence lasts 10 minutes and is sampled at 10 Hz. One sequence from each class was used for training and one for testing. The recognition rate was up to 89% for pointing gestures, and up to 76% for non-pointing gestures. While only a few sequences from the deictic gesture data set were misrecognized (false negatives), many more movements from the non-pointing gesture data set were misrecognized as pointing gestures (false positives). This is not surprising, since long-lasting gestures, which occur frequently during a monologue or conversation, are very likely to contain segment patterns that are very similar to deictic gestures. Due to the nature of the training data, the performed test looks only at the boundary conditions (false positives/negatives). We plan to collect and transcribe data from users during a conversational event where both deictic and non-deictic gestures occur. Testing the system with this more natural data will permit us to assess the performance of the recognizer more precisely.

3.4 The Speech Agent

We make use of Dragon 4.0, a Microsoft SAPI 4.0 compliant speech engine. This speech recognition engine captures an audio stream and produces a list of text interpretations (with associated probabilities of correct recognition) of that speech audio. These text interpretations are limited by a grammar that is supplied to the speech engine upon startup. The following grammar specifies the possible, self-explanatory sentences:

1: <sentence> = <confirmation> | <color> | <multimodal command> | <unimodal command>
2: <confirmation> = no / yes
3: <color> = green / red / blue / yellow / white / magenta / cyan
4: <multimodal command> = draw on / draw off / zoom in / zoom out / cursor on / cursor off / line begin / paste / select end / select begin / line end / copy / circle end / circle begin / rectangle end / rectangle begin
5: <unimodal command> = exit / help / undo / switch to foreground / save / free buffer / switch to background / send to background / cancel / restart / delete / load

Here, <unimodal command> and <multimodal command> refer to the sets of commands which need to be issued without and with an accompanying pointing gesture, respectively.
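
Purely as an illustration of this split (hypothetical names; the running system relies on the Dragon grammar itself), the command sets can be kept in a lookup that the fusion step consults:

```python
# Command vocabulary from the grammar above, grouped by whether an
# accompanying pointing gesture is required (assumed grouping).
MULTIMODAL = {"draw on", "draw off", "zoom in", "zoom out", "cursor on",
              "cursor off", "line begin", "line end", "paste", "copy",
              "select begin", "select end", "circle begin", "circle end",
              "rectangle begin", "rectangle end"}
UNIMODAL = {"exit", "help", "undo", "switch to foreground", "save",
            "free buffer", "switch to background", "send to background",
            "cancel", "restart", "delete", "load"}
CONFIRMATIONS = {"yes", "no"}
COLORS = {"green", "red", "blue", "yellow", "white", "magenta", "cyan"}

def needs_pointing_gesture(utterance: str) -> bool:
    """True if the recognized utterance must be fused with a pointing gesture."""
    return utterance.lower() in MULTIMODAL
```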

The user uses voice commands to put the system into various modes that remain in effect until he changes them. Speech commands can be entered at any time and are recognized in continuous mode.

3.5 The Fusion Agent

The Fusion Agent is a finite state automaton that is in charge of two major functions: the rendering, and the temporal fusion of speech and gesture information. The rendering is implemented with OpenGL on an SGI machine, utilizing the Virtual Reality Peripheral Network (VRPN) [2] driver for the FOB.

The fusion is based on a time-out variable. Once a pointing gesture is recognized, a valid spoken command must be entered within a given time (currently 4 seconds, as speech usually follows gesture [22]) or another pointing gesture must occur. Eventually, the Fusion Agent either takes the action associated with the speech command (such as changing the drawing color or selecting the first point of a line) or issues an acoustic warning signal. The modal nature of the state machine ensures consistent command sequences (e.g., "line begin" can only be followed by "undo", "cancel" or "line end"). Depending on the performed action, the system may undergo a state change.
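
The time-out fusion just described can be pictured with a minimal sketch; this is not the actual implementation (which is a finite state machine over drawing modes), and all names beyond the 4-second window are assumptions.

```python
import time

TIMEOUT_S = 4.0   # a spoken command must arrive within 4 s of a pointing gesture

class FusionAgent:
    """Pairs the most recent pointing gesture with the next valid spoken command."""

    def __init__(self):
        self.pending_gesture = None              # (timestamp, intersection_point)

    def on_pointing_gesture(self, point, now=None):
        # A new gesture replaces any previous, still-pending one.
        self.pending_gesture = (now if now is not None else time.time(), point)

    def on_speech_command(self, command, needs_gesture, now=None):
        now = now if now is not None else time.time()
        if not needs_gesture:
            return ("execute", command, None)    # e.g. "undo", "save"
        if self.pending_gesture is None:
            return ("warn", command, None)       # acoustic warning signal
        t_gesture, point = self.pending_gesture
        self.pending_gesture = None
        if now - t_gesture > TIMEOUT_S:
            return ("warn", command, None)       # gesture too old
        return ("execute", command, point)       # e.g. "line begin" at point
```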

3.6 Agent Architecture

The modules implemented for tracking, pointing and painting, and speech command recognition need to communicate with each other. Agents communicate by passing Prolog-type ASCII strings (Horn clauses) via TCP/IP.

Figure 4: agent communication within the entire system.

The central agent is the facilitator. Agents can inform the facilitator of their interest in messages which match (logically unify) with a certain expression. Thereafter, when the facilitator receives a matching message from some other agent, it passes it along to the interested agent.
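
As a rough sketch of this subscribe-and-forward behaviour (hypothetical names; the real facilitator matches messages by logical unification over Prolog-type strings, which a plain functor-name lookup only approximates):

```python
from collections import defaultdict

class Facilitator:
    """Forwards messages to agents that registered interest in a matching pattern."""

    def __init__(self):
        self.subscriptions = defaultdict(list)    # functor -> interested agents

    def subscribe(self, agent, functor):
        self.subscriptions[functor].append(agent)

    def receive(self, message):
        functor = message.split("(", 1)[0]        # e.g. "parse_speech"
        for agent in self.subscriptions[functor]:
            agent.handle(message)

class FusionAgentStub:
    def handle(self, message):
        print("fusion agent got:", message)

facilitator = Facilitator()
facilitator.subscribe(FusionAgentStub(), "parse_speech")
facilitator.receive('parse_speech("draw on", 0.91)')
```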


Since ASCII strings and TCP/IP are common across various platforms, agents can be used as software components that communicate across platforms. In this case, the Speech Agent runs on a Windows platform, since the best off-the-shelf speech recognition engines available to us (currently, Dragon) are on Windows. On the other hand, the Flock of Birds and the VRPN server are set up for Unix. Therefore, it makes sense to tie them together with the agent architecture (Figure 4).

Communication is straightforward. The Speech Agent produces messages of the type parse_speech(Message), which the facilitator forwards to the Fusion Agent. The latter, with some simple parsing, can then extract the alternate speech recognition interpretations and their associated probabilities from the message strings. The command associated with the highest probability value above an experimental threshold (currently 0.85) is chosen.

4. CONCLUSIONS AND FUTURE WORK

The presented system is a real-time application for drawing in space on a two-dimensional, limited rectangular surface. This is a first step toward a 3D multimodal speech and gesture system for computer-aided design and cooperative tasks. Such a system might recognize some 3D objects from an iconic library in the user's input and refine the user's drawings accordingly. We anticipate expanding the use of speech to operate with 3D objects. Since the fusion component is an agent, we are going to make it a module in the entire QuickSet Adaptive Agent Architecture [16], to further use it as a sort of virtual mouse for the QuickSet [6] user interface. Possible alternative applications for this system range from hand cursor control by pointing to target selection in virtual environments.

5. ACKNOWLEDGMENTS

This research is supported by the Office of Naval Research, Grants N00014-99-1-0377 and N00014-99-1-0380. Thanks to Rachel Coulston for help editing and Richard M. Wesson for programming support.

6. REFERENCES

[1] http://www.ascension-tech.com
[2] Taylor R.M., VRPN: A Device-Independent, Network-Transparent VR Peripheral System, Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2001.
[3] Boehm K., Broll W., Sokolewicz M., Dynamic Gesture Recognition using Neural Networks: A Fundament for Advanced Interaction Construction, SPIE Conf. Electronic Imaging Science & Technology, 1994.
[4] Bolt R.A., Put-That-There: voice and gesture at the graphics interface, Computer Graphics, Vol. 14, No. 3, 1980, 262-270.
[5] Cipolla R., Hadfield P.A., Hollinghurst N.J., Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface, Proc. of the IAPR Workshop on Machine Vision Applications, 163-166, 1994.
[6] Cohen P.R., et al., QuickSet: Multimodal interaction for distributed applications, Proc. of the 5th Intl. Multimedia Conf., 31-40, 1997.
[7] Corradini A., Gross H.-M., Camera-based Gesture Recognition for Robot Control, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Vol. IV, 133-138, 2000.
[8] Crowley J.L., Berard F., Coutaz J., Finger Tracking as an Input Device for Augmented Reality, Proc. of the Intl. Workshop on Automatic Face and Gesture Recognition, 195-200, 1995.
[9] Efron D., Gesture, Race and Culture, Mouton and Co., 1972.
[10] Elman J.L., Finding Structure in Time, Cognitive Science, 14:179-211, 1990.
[11] Fukumoto M., Mase K., Suenaga Y., Realtime detection of pointing actions for a glove-free interface, Proceedings of the IAPR Workshop on Machine Vision Applications, 473-476, 1992.
[12] Hauptmann A.G., McAvinney P., Gesture with speech for graphics manipulation, International Journal of Man-Machine Studies, Vol. 38, 231-249, February 1993.
[13] Jojic N., et al., Detection and Estimation of Pointing Gestures in Dense Disparity Maps, Proceedings of the International Conference on Automatic Face and Gesture Recognition, 468-474, 2000.
[14] Jordan M., Serial Order: A Parallel Distributed Processing Approach, Advances in Connectionist Theory, Lawrence Erlbaum, 1989.
[15] Kendon A., The Biological Foundations of Gestures: Motor and Semiotic Aspects, Lawrence Erlbaum Associates, 1986.
[16] Kumar S., Cohen P.R., Levesque H.J., The Adaptive Agent Architecture: Achieving Fault-Tolerance Using Persistent Broker Teams, Proc. 4th Intl. Conf. on Multi-Agent Systems, 159-166, 2000.
[17] Lippmann R.P., Review of Neural Networks for Speech Recognition, Neural Computation, 1:1-38, 1989.
[18] McNeill D., Hand and Mind: what gestures reveal about thought, The University of Chicago Press, 1992.
[19] Oviatt S.L., Multimodal interfaces for dynamic interactive maps, Proceedings of the Conference on Human Factors in Computing Systems: CHI, 95-102, 1996.
[20] Oviatt S.L., Cohen P.R., Multimodal interfaces that process what comes naturally, Communications of the ACM, 43(3):45-53, 2000.
[21] Oviatt S.L., et al., Designing the user interface for multimodal speech and gesture applications: State-of-the-art systems and research directions, Human-Computer Interaction, 15(4):263-322, 2000.
[22] Oviatt S., De Angeli A., Kuhn K., Integration and Synchronization of Input Modes during Multimodal HCI, Proceedings of CHI '97, 415-422, 1997.
[23] Place U.T., The Role of the Hand in the Evolution of Language, Psycoloquy, Vol. 11, No. 7, 2000, http://www.cogsci.soton.ac.uk
[24] Quek F., Mysliwiec T.A., Zhao M., FingerMouse: A Freehand Computer Pointing Interface, Proc. of the Intl. Conf. on Automatic Face and Gesture Recognition, 372-377, 1995.
[25] Quek F., et al., Gesture and Speech Multimodal Conversational Interaction, Tech. Rep. VISLab-01-01, University of Illinois, 2001.
[26] Sharma R., et al., Speech/Gesture Interface to a Visual-computing Environment, IEEE Computer Graphics and Applications, 20(2):29-37, 2000.
[27] Tank D.W., Hopfield J.J., Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition, Proc. of the 1st Intl. Conf. on Neural Networks, Vol. IV, 455-468, 1987.
[28] Vo M.T., Waibel A., Multimodal human-computer interface: combination of gesture and speech recognition, InterCHI, 1993.
[29] Waibel A., et al., Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(12):1888-1898, 1989.
[30] Wu L., Oviatt S., Cohen P.R., Multimodal Integration - A Statistical View, IEEE Transactions on Multimedia, 1(4):334-341, 2000.
