
Invited

Multimodal interaction in collaborative virtual environments

Taro Goto, Marc Escher, Christian Zanardi, Nadia Magnenat-Thalmann
MiraLab, University of Geneva
http://www.miralab.unige.ch
E-mail: {goto, escher, zanardi, thalmann}@cui.unige.ch

Abstract

Human interfaces for computer graphics systems are now evolving towards a total multi-modal approach. Information gathered using visual, audio and motion capture systems is becoming increasingly important within user-controlled virtual environments. This paper discusses real-time interaction through the visual analysis of human facial features. The underlying approach to recognize and analyze the facial movements of a real performance is described in detail. The output of the program is directly compatible with MPEG-4 standard parameters and therefore enhances the ability to use the available data in any other MPEG-4 compatible application. The real-time facial analysis system gives the user the ability to control the graphics system by means of facial expressions. This is used primarily with real-time facial animation systems, where the synthetic actor reproduces the animator's expression. The MPEG-4 standard mainly focuses on networking capabilities and therefore offers interesting possibilities for teleconferencing, as the requirements for network bandwidth are quite low.

Keywords: Facial analysis, Real-time feature tracking, MPEG-4, Real-time Facial Animation.

1. Introduction

In the last few years, the number of applications that require a fully multimodal interface with the virtual environment has steadily increased. Within this field of research, recognition of facial expressions is a very complex and interesting subject that has attracted numerous research efforts. For instance, DeCarlo and Metaxas [1] applied an algorithm based on optical flow and a generic face model. This method is robust, but face recognition is slow and does not run in real time. Cosatto and Graf [2] used a sample-based method, which requires building a sample set for each person. Kouadio et al. [3] also used a sample-based database together with face markers to process in real time. However, the use of markers is not always practical, and it is attractive to allow recognition without them. Pandzic et al. [4] use an edge-extraction-based algorithm that runs in real time without markers. This paper describes in detail a method to track facial features in real time without markers or lip make-up. The output is converted to MPEG-4 FAPs; these feature-point parameters are sent to a real-time player that deforms and displays a synthetic 3D animated face.

In the next section, the complete system for interactive facial animation is described. A short description of the MPEG-4 standard and the related facial animation parameters is given in Section 3. In Section 4, the facial feature tracking system is described in detail. The paper concludes with real-time results obtained with a compatible facial animation system.

2. System Overview

Figure 1 sketches the different tasks and interactions needed to generate a real-time virtual dialog between a synthetic clone and an autonomous actor [1]. The video and the speech of the user drive the clone's facial animation, while the autonomous actor uses the information of speech and facial emotions from the user to generate an automatic behavioral response. MPEG-4 Facial Animation Parameters (FAPs) are extracted in real time from the video input of the face. These FAPs can either be used to animate the cloned face or be processed to compute high-level emotions transmitted to the autonomous actor. From the user's speech, phonemes and text information can be extracted. The phonemes are blended with the FAPs from the video to enhance the animation of the clone. The text is sent to the autonomous actor and processed together with the emotions to generate a coherent answer to the user. Our system is compliant with the MPEG-4 definition [2], briefly described in the next section.
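The following minimal Python sketch only illustrates the dataflow just described; every name in it (extract_faps, analyze_speech, blend_mouth_faps, the stub return values) is hypothetical and stands in for the corresponding module of the system rather than reproducing it.

```python
# Hypothetical sketch of the Figure 1 dataflow; the stubs mark where the real
# analysis modules of the system would plug in.

def extract_faps(video_frame):
    # Placeholder: the real system derives MPEG-4 FAPs from the face video.
    return {"open_jaw": 0.2, "raise_l_i_eyebrow": 0.0}

def analyze_speech(audio_chunk):
    # Placeholder: the real system extracts phonemes and recognized text.
    return ["@"], "hello"

def blend_mouth_faps(video_faps, phoneme_faps):
    # One simple blending choice: average the two sources per parameter.
    keys = set(video_faps) | set(phoneme_faps)
    return {k: 0.5 * (video_faps.get(k, 0.0) + phoneme_faps.get(k, 0.0)) for k in keys}

def dialog_step(video_frame, audio_chunk):
    faps = extract_faps(video_frame)                      # drives the user's clone
    phonemes, text = analyze_speech(audio_chunk)
    clone_faps = blend_mouth_faps(faps, {"open_jaw": 0.3})  # phoneme-derived mouth FAPs
    emotions = "neutral"                                  # high-level emotion from FAPs
    answer = f"reply to '{text}' given emotion {emotions}"  # autonomous actor's response
    return clone_faps, answer

print(dialog_step(video_frame=None, audio_chunk=None))
```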


Figure 1 System Overview

3. MPEG-4

The newly standardized MPEG-4 was developed in response to the growing need for a coding method that can facilitate access to visual objects in natural and synthetic video and sound for various applications such as digital storage media, the internet, and various forms of wired or wireless communication. ISO/IEC JTC 1/SC29/WG11 (Moving Pictures Expert Group - MPEG) worked on it to make it an International Standard in March 1999 [5]. It provides support for 3D graphics, synthetic sound, text-to-speech, as well as synthetic faces and bodies. This paper describes the use of facial definition and animation parameters in an interactive real-time animation system.

3.1 Face definition and animation

The Face and Body Animation Ad Hoc Group (FBA) has defined in detail the parameters for both the definition and animation of human faces and bodies. Definition parameters allow a detailed definition of body/face shape, size and texture. Animation parameters allow the definition of facial expressions and body postures. These parameters are designed to cover all natural possible expressions and postures, as well as exaggerated expressions and motions to some extent (e.g. for cartoon characters). The animation parameters are precisely defined in order to allow an accurate implementation on any facial/body model. Here we will mostly discuss facial definitions and animations based on a set of feature points located at morphological places on the face. The following section will shortly describe the Face Animation Parameters (FAP).

3.2 FAP

The FAPs are encoded for low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications. FAPs manipulate key feature control points on a mesh model of the face to produce animated visemes (the visual counterpart of phonemes) for the mouth (lips, tongue, teeth), as well as animation of the head and facial features like the eyes or eyebrows.

All the FAPs involving translational movement are expressed in terms of Facial Animation Parameter Units (FAPU). These units are defined in order to allow the interpretation of FAPs on any facial model in a consistent way, producing reasonable results in terms of expression and speech pronunciation. They correspond to fractions of distances between some essential facial features (e.g. the eye distance). The fractional units used are chosen to allow enough accuracy.
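As an illustration of how a tracked displacement might be expressed in FAPU, the hedged sketch below assumes the usual MPEG-4 convention that a FAPU such as ES0 (eye separation) is the neutral-face distance divided by 1024; the function names and pixel values are illustrative, not taken from the paper.

```python
# Hedged FAPU sketch: the 1024 divisor follows the common MPEG-4 convention for
# FAPU definitions (e.g. ES0); treat it as an assumption if your target decoder
# documents something different.

def fapu_es0(left_eye_x, right_eye_x):
    """Eye-separation FAPU computed from the neutral face, in pixels."""
    return abs(right_eye_x - left_eye_x) / 1024.0

def displacement_to_fap(pixel_displacement, fapu):
    """Encode a feature-point displacement as an integer FAP amplitude."""
    return round(pixel_displacement / fapu)

es0 = fapu_es0(left_eye_x=210, right_eye_x=310)               # 100 px between the eyes
print(displacement_to_fap(pixel_displacement=12, fapu=es0))   # e.g. a mouth-corner move
```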


4. Facial feature tracking system

The facial feature tracking system is described in Figure 2. This system not only recognizes face motion but also animates an MPEG-4 compatible virtual face. Using a camera, facial features are tracked in real time by the computer; the extracted feature motions and shapes are converted to MPEG-4 FAPs and then sent to a virtual environment over the Internet. To obtain real-time tracking, several problems must be solved. The main problem lies in the variety of individual appearances, such as skin color, eye color, beard, glasses, and so on. The facial features are sometimes not separated by sharp edges, or edges appear at unusual places. This diversity increases the difficulty of recognizing faces and tracking facial features.

Figure 2 Tracking (image capture and motion tracking)

In this application, important facial feature characteristics and their associated information are set during an initialization phase that helps solve the face diversity problem. Figure 3(a) shows this simple initialization. The user only moves some feature boxes, like pupil boxes, a mouth box, etc., to intuitive positions. Once the user sets the feature positions, information around the features, edge information, and face color information are extracted automatically, together with face-dependent parameters containing all the relevant information for tracking the face position and its corresponding facial features in real time without any marker.
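A hypothetical sketch of this initialization step is given below; the box layout, the choice of stored statistics (gray-level patch, vertical-gradient edge map, mean color) and all names are assumptions used only to illustrate what could be kept as a reference for later tracking.

```python
import numpy as np

# Hypothetical initialization sketch: for each user-placed feature box, store the
# gray-level patch, a simple edge map, and the mean color as the reference used
# later during tracking. Box format and statistics are illustrative only.

def init_feature(gray, frame_bgr, box):
    x, y, w, h = box                                  # user-placed box (pixels)
    patch = gray[y:y + h, x:x + w].astype(np.float32)
    # Vertical gradient as a crude edge map (horizontal edges such as lip lines).
    edges = np.abs(np.diff(patch, axis=0))
    mean_color = frame_bgr[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0)
    return {"box": box, "patch": patch, "edges": edges, "color": mean_color}

gray = np.random.randint(0, 256, (240, 320)).astype(np.uint8)
frame = np.random.randint(0, 256, (240, 320, 3)).astype(np.uint8)
reference = {name: init_feature(gray, frame, box)
             for name, box in {"mouth": (140, 160, 60, 30),
                               "l_pupil": (120, 90, 20, 20)}.items()}
print(sorted(reference))
```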

Figure 3 Initialization and sample tracking results

The tracking process is separated into two parts: 1) mouth tracking and 2) eye tracking. Edge and gray-level information around the mouth and the eyes is the main information used during tracking. Figure 3(b) displays a sample result of the tracked features superimposed on the face image. The tracking methods for the mouth and the eyes are described in the next two sections.

4.1 Mouth Tracking

The mouth is one of the most difficult facial features to analyze and track. Indeed, the mouth has a very versatile shape and almost every muscle of the face drives its motion. Furthermore, beard, mustache, the tongue or the teeth may appear at times and further complicate the already difficult tracking. Our method takes into account some intrinsic properties of the mouth: 1) the upper teeth are attached to the skull, and therefore their position remains constant; 2) conversely, the lower teeth move down from their initial position according to the rotation of the jaw joint; 3) the basic mouth shape (open vs. closed) depends upon bone movement. From these properties it follows that detecting the positions of hidden or apparent teeth from an image is the best way to build a robust tracking algorithm for the mouth shape and its associated motion.

The system first proceeds with the extraction of all edges crossing the vertical line going from the nose to the jaw. In a second phase, a pattern-matching algorithm is used to compute what we call the energy, which corresponds to the similarity with the initial parameters extracted for the mouth. Finally, among all possible mouth shapes, the best candidate is chosen according to a highest-energy criterion.
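The sketch below illustrates one possible reading of this search: sample the gray levels along the nose-to-jaw line, keep strong gradient positions as candidate edges, and score the profile against reference profiles stored at initialization. The energy definition (negative sum of squared differences) and the edge threshold are assumptions, not the paper's exact formulas.

```python
import numpy as np

# Hedged sketch of the mouth search: candidate lip/teeth edges are strong
# gradients along the vertical nose-to-jaw line, and each reference mouth shape
# is scored by a similarity "energy" against the current profile.

def edge_positions(profile, threshold=25):
    grad = np.abs(np.diff(profile.astype(np.float32)))
    return np.flatnonzero(grad > threshold)            # indices of strong edges

def energy(candidate, reference):
    d = candidate.astype(np.float32) - reference.astype(np.float32)
    return -float(np.dot(d, d))                        # higher = more similar

def best_mouth_candidate(profile, reference_profiles):
    """Pick the reference mouth shape whose profile best matches the current one."""
    scores = {name: energy(profile, ref) for name, ref in reference_profiles.items()}
    return max(scores, key=scores.get), edge_positions(profile)

profile = np.random.randint(0, 256, 80).astype(np.uint8)    # nose -> jaw scan line
refs = {"closed": profile.copy(), "open": np.roll(profile, 5)}
print(best_mouth_candidate(profile, refs))
```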

Figure 4 Edge configuration for possible mouth shapes

Figure 4 presents the gray-level values along the vertical line from the nose to the jaw for different possible mouth shapes, together with the corresponding detected edges:

• Closed mouth: in this case, the center edge appears strong, the other two edges are normally weak, and the teeth are hidden inside. Lip make-up, however, can produce three strong edges, or the center edge may turn out to be the weakest one. Such person-dependent variations in the edge-detection result also occur when the mouth is opened.

• Opened mouth: as shown in the figure, when teeth are present, their edges are stronger than the edges on the outside of the lips, between a lip and the teeth, or between a lip and the inside of the mouth. If the teeth are hidden by the lips (upper or lower), then of course the teeth edges are not detected.

Once this edge detection process is finished, the extracted edge information is compared with the data from a generic shape database, and a first selection of possible corresponding mouth shapes is made, as shown in Figure 5.


Figure 5 Possible candidates

After this selection, the corresponding energy is calculated for every possible candidate position. The position that has the largest energy is defined as the next mouth shape. However, this method fails with a closed mouth, because there is no region inside the lips. To compensate for this problem, the center of the mouth is first calculated as in the open case; then the edge between the upper lip and the lower lip is followed to the left and to the right of the center. Figure 6 shows an example of this lip separation detection.


Figure 6 Detection of the lip separation

The resulting open or closed mouth shapes are then transformed into FAP values to be transmitted over the network to the facial animation system.
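A hypothetical sketch of this lip-separation following is shown below; the "darkest pixel" criterion and the one-row-per-column drift constraint are assumptions introduced only for illustration.

```python
import numpy as np

# Hypothetical closed-mouth sketch: starting from the mouth centre column, follow
# the darkest pixel (the lip separation) one column at a time to the left and to
# the right, letting the line drift by at most one row per step.

def follow_lip_separation(mouth_gray, center_col, center_row):
    h, w = mouth_gray.shape
    path = {center_col: center_row}
    for cols in (range(center_col - 1, -1, -1), range(center_col + 1, w)):
        row = center_row
        for c in cols:
            lo, hi = max(0, row - 1), min(h, row + 2)
            row = lo + int(np.argmin(mouth_gray[lo:hi, c]))  # darkest neighbouring row
            path[c] = row
    return [path[c] for c in sorted(path)]                   # separation row per column

mouth = np.random.randint(0, 256, (30, 60)).astype(np.uint8)
print(follow_lip_separation(mouth, center_col=30, center_row=15)[:10])
```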

4.2 Eye Tracking

The eye tracking system includes the following subsystems: pupil tracking, eyelid position recognition, and eyebrow tracking. These three subsystems are deeply dependent on one another. For example, if the eyelid is closed, the pupil is hidden and it is obviously impossible to detect its position. Some researchers have used a deformable generic model for detecting the eye position and shape. This approach has a serious drawback when it comes to real-time face analysis, because it is usually quite slow. In our first attempt we also considered a generic model of an eye, but the system failed when the eye was closed, and stable results were difficult to obtain. We improved this first method by 1) calculating both pupil positions, 2) calculating the eyebrow positions, 3) extracting the eyelid positions with respect to their possible range, and 4) checking all data for the presence of movement. At the last stage, inconsistencies are checked again and a new best position is chosen if necessary.

For pupil tracking, the same kind of energy functions used for the mouth are applied. The main difference here is that during tracking the pupil may disappear when the eyelids close. Our method takes such cases into account. The eye tracking system first traces a box area around the complete eye and finds the largest energy value; secondly, the position of the pupil point is extracted according to the method described before for the mouth center. When the eyelid is completely closed, the position of the pupil is obviously undefined, but in return the eyelid has a good chance of being detected as closed.
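The following hedged sketch shows one way such a pupil search could look: scan a pupil-sized window over the eye box, keep the darkest placement as the pupil, and compare its score with the value recorded at initialization to flag a possibly closed eye. The darkness score and the 0.6 factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hedged pupil-search sketch: the best placement of a pupil-sized window is the
# darkest one; a large drop of this score relative to the initialization value
# hints that the eyelid may be shut.

def find_pupil(eye_gray, pupil_size=6):
    h, w = eye_gray.shape
    best, best_score = None, -np.inf
    for y in range(h - pupil_size):
        for x in range(w - pupil_size):
            window = eye_gray[y:y + pupil_size, x:x + pupil_size]
            score = 255.0 - float(window.mean())       # darker window -> higher score
            if score > best_score:
                best = (x + pupil_size // 2, y + pupil_size // 2)
                best_score = score
    return best, best_score

eye = np.random.randint(0, 256, (24, 40)).astype(np.uint8)
init_pos, init_score = find_pupil(eye)                 # stored at initialization
pos, score = find_pupil(eye)                           # current frame
maybe_closed = score < 0.6 * init_score                # energy dropped: eyelid likely shut
print(pos, maybe_closed)
```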

Figure 7 Eyebrow search

The method to detect the position of the eyebrow within the small box defined during initialization follows an approach similar to the mouth shape recognition. A vertical line goes down from the forehead until the eyebrow is detected (Figure 7). The eyebrow position is given by the maximum value of the energy. After the center of the eyebrow is found, the edge of the brow is followed to the left and to the right to recognize its shape, as shown in Figure 7.

As soon as the pupil and eyebrow locations are detected using the methods described previously, it is possible to estimate the eyelid location. When the energy of the pupil is large, or almost the same as its initial value, the eye is open.


When the energy is small, the eye may be closed or the person may be looking up. The eyebrow position further narrows the possible eyelid position. When the eyebrow is lower than its initial position, the eye is considered to be closed or half closed. This helps detect the true eyelid position, as opposed to a possible wrong detection that may occur with a wrinkle. After this process, the strongest edge in the considered area is detected and set as the eyelid position.
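The decision logic described above can be summarized by a small rule set; the sketch below is an illustrative reconstruction, with the 0.8 threshold and the state labels chosen as assumptions rather than taken from the paper.

```python
# Hypothetical eyelid decision sketch combining the two cues described above:
# the pupil "energy" relative to its initial value and the eyebrow height
# relative to its initial position (image y grows downwards).

def eyelid_state(pupil_energy, init_pupil_energy, eyebrow_y, init_eyebrow_y):
    pupil_visible = pupil_energy >= 0.8 * init_pupil_energy
    eyebrow_lowered = eyebrow_y > init_eyebrow_y
    if pupil_visible and not eyebrow_lowered:
        return "open"
    if not pupil_visible and eyebrow_lowered:
        return "closed"
    if not pupil_visible:
        return "closed_or_looking_up"
    return "half_closed"

print(eyelid_state(pupil_energy=40.0, init_pupil_energy=140.0,
                   eyebrow_y=62, init_eyebrow_y=55))
```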

Figure 8 Results with FAPs

5. Results and conclusion

Figure 8 presents several examples of the real-time tracking of facial features with different persons, together with the associated animated face. The program runs on Windows NT and 95/98. The recognition speed is 10 to 15 frames per second on a Pentium II 300 MHz processor, while capturing 15 images of 320x240 pixels per second. The recognition time depends upon the size of the feature boxes set during initialization and the range of feature motion that is considered.

This paper has presented a method to track facial features in real time. The recognition method for facial expressions in our system does not use any special markers or make-up. It does not need training, but only a simple initialization of the system, allowing a new user to adapt immediately. The extracted data is transformed into MPEG-4 FAPs that can easily be used by any compatible facial animation system.

6. Acknowledgements

The authors would like to thank every member of MIRALab who helped create this system, especially Dr. I.S. Pandzic, who created the original face tracking software, and W.S. Lee for the cloning system. This research is supported by the European project eRENA (Electronic Arenas for Culture, Performance, Art and Entertainment), ESPRIT IV Project 25379.

7. References

[1] D. DeCarlo, D. Metaxas, "Optical Flow Constraints on Deformable Models with Applications to Face Tracking", CIS Technical Report MS-CIS-97-23.
[2] E. Cosatto, H.P. Graf, "Sample-Based Synthesis of Photo-Realistic Talking Heads", Computer Animation, 1998, pp. 103-110.
[3] C. Kouadio, P. Poulin, P. Lachapelle, "Real-Time Facial Animation based upon a Bank of 3D Facial Expressions", CA 98, 1999, pp. 128-136.
[4] I.S. Pandzic, T.K. Capin, N. Magnenat-Thalmann, D. Thalmann, "Towards Natural Communication in Networked Collaborative Virtual Environments", FIVE '96, December 1996.
[5] SNHC, "Information Technology - Generic Coding of Audio-Visual Objects, Part 2: Visual", ISO/IEC 14496-2, Final Draft of International Standard, Version of 13 Nov. 1998, ISO/IEC JTC1/SC29/WG11 N2502a, Atlantic City, Oct. 1998.
[6] P. Doenges, F. Lavagetto, J. Ostermann, I.S. Pandzic, E. Petajan, "MPEG-4: Audio/Video and Synthetic Graphics/Audio for Mixed Media", Image Communications Journal, Vol. 5, No. 4, May 1997.
[7] I.S. Pandzic, T.K. Capin, E. Lee, N. Magnenat-Thalmann, D. Thalmann, "A Flexible Architecture for Virtual Humans in Networked Collaborative Virtual Environments", Proceedings Eurographics '97, Budapest, Hungary, 1997.
[8] G. Sannier, S. Balcisoy, N. Magnenat-Thalmann, D. Thalmann, "VHD: A System for Directing Real-Time Virtual Actors", The Visual Computer, Springer, 1999.
[9] W.S. Lee, M. Escher, G. Sannier, N. Magnenat-Thalmann, "MPEG-4 Compatible Faces from Orthogonal Photos", CA 99, 1999, pp. 186-194.
