
Voice Puppetry

Matthew Brand

MERL — a Mitsubishi Electric Research Laboratory, 201 Broadway, Cambridge, MA 02139

Frames from a voice-driven animation, computed from a single baby picture and an adult model of facial control. Note the changes in upper facial expression. See figures 5, 6 and 7 for more examples of predicted mouth shapes.

Abstract

We introduce a method for predicting a control signal from another related signal, and apply it to voice puppetry: Generating full facial animation from expressive information in an audio track. The voice puppet learns a facial control model from computer vision of real facial behavior, automatically incorporating vocal and facial dynamics such as co-articulation. Animation is produced by using audio to drive the model, which induces a probability distribution over the manifold of possible facial motions. We present a linear-time closed-form solution for the most probable trajectory over this manifold. The output is a series of facial control parameters, suitable for driving many different kinds of animation ranging from video-realistic image warps to 3D cartoon characters.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation; I.2.9 [Artificial Intelligence]: Robotics—Kinematics and Dynamics; I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Time-varying images; G.3 [Mathematics of Computing]: Probability and Statistics—Time series analysis; E.4 [Data]: Coding and Information Theory—Data compaction and compression; J.5 [Computer Applications]: Arts and Humanities—Performing Arts

Keywords: Facial animation, lip-syncing, control, learning, computer vision and audition.

1 Face-syncing and control

As rendering techniques begin to deliver realistic-looking scenes, people are beginning to expect realistic-looking behavior. Therefore control is an increasingly prominent problem in animation. This is especially true in character facial animation, where good lip-syncing and dynamic facial expression are necessary to make a character look lively and believable. Viewers are highly attentive to facial action, quite sophisticated in their judgments of realism, and easily distracted by facial action that is inconsistent with the voice track.

We introduce methods for learning a control program for speech-based facial action from video, then driving this program with an audio signal to produce realistic whole-face action, including lip-syncing and upper-face expression, with correct dynamics, co-articulation phenomena, and ancillary deformations of facial tissues. In principle this method can be used to reconstruct any process from a related signal, for example, to synthesize dynamically correct 3D body motion from a sequence of shadows. We demonstrate with facial animation because of the obvious complexity of the process being modeled—the human face has many degrees of freedom, many nonlinear couplings, and a rather complicated control program, the mind.

Voice puppetry provides a low-cost, quick-turnaround alternative to motion capture, with the additional flexibility that an actor’s facial manner can be re-used over and over again to “face-sync” completely novel audio by other speakers and at other frame rates. It is fully automatic, but an animator can intercede at any level to add detail. All algorithms have time complexity linear in the length of the input sequence; production time is slightly faster than utterance-time on a contemporary mid-level workstation.

2 Background

Psychologists and storytellers alike have observed that there is a good deal of mutual information between vocal and facial gesture [27]. Facial information can add significantly to the observer’s comprehension of the formal [3] and emotional content of speech, and is considered by some a necessary ingredient of successful speech-based interfaces. Conversely, the difficulty of synthesizing believable faces is a widely-noted obstacle to producing acceptable digital avatars and animations. People are highly specialized for interpreting facial action; a poorly animated face can be disturbing and can even interfere with the comprehension of speech [20].

Lip-syncing alone is a laborious process: The voice track is dissected (often by hand) to identify features such as stops and vowels, then matched mouth poses are scheduled in the animation track, 2–10 per second.


Because it can overwhelm production schedules, lip-syncing has been the focus of many attempts at quasi-automation. Nearly all lip-syncing systems are based on an intermediate phonemic representation, whether obtained by hand [23, 24], from text [9, 12, 1, 17] or, with varying degrees of success, via speech recognition [18, 28, 7, 8, 29]. Typically, phonemic or visemic tokens are mapped directly to lip poses, ignoring dynamical factors. Efforts toward dynamical realism have been heuristic and use limited contextual information (e.g., [10, 7]). Consider the problem of co-articulation—the interaction between nearby speech segments due to latencies in tissue motion. To date, all attempts at co-articulation have depended on heuristic formulæ and ad hoc smoothing methods. E.g., BALDY [9] is a phoneme-driven computer graphics head that uses hand-designed vocal co-articulatory models inspired by the psychological literature. Although VIDEO REWRITE [7] works by re-ordering existing video frames rather than by generating animations, it deserves mention because it partially models vocal (but not facial) co-articulation with triphones—phonemes plus one unit of left and right context. The quality of a video rewrite is determined by the amount of video that is available to provide triphone examples and how successfully it is analyzed; smoothing is necessary because triphones don’t fully constrain the solution and no video will provide an adequate stock of triphones.

Considerable information can be lost when discretizing to phonemic or visemic representations. The international phonetic alphabet is often mistaken for a catalog of the sounds and articulations humans make while communicating; in fact, phonemes are designed only to provide the acoustic features thought necessary to distinguish pronunciations of higher, meaning-carrying language elements. Phonemic representations are useful for analysis but quite suboptimal for synthesis because they obliterate predictive relationships such as those of vocal prosody to upper facial gesture, vocal energy to gesture magnitude, and vocal phrasing to lip articulation. There have been attempts to circumvent phonemes and generate lip poses directly from the audio signal (e.g., [22, 19]) but these are limited to predicting instantaneous vowel shapes.

None of these methods address the actual dynamics of the face. Facial muscles and tissues contract and relax at different rates. There are co-articulations at multiple time-scales—50–250 milliseconds in the vocal apparatus [21], possibly longer on the face. Furthermore, there is evidence that lips alone convey less than half of the visual information that human subjects can use to disambiguate noisy speech [3]. Much of the expressive and emotional content of facial gesture occurs in the upper half of the face. This is not addressed at all in speech-driven systems; some text-driven systems attempt upper-face animation, usually via hand annotations or ad hoc rules that exploit clues to sentence meaning such as punctuation.

We propose a more realistic mapping from voice to face by learning a model of a face’s observed dynamics during speech, then learning a mapping from vocal patterns to facial motion trajectories. Animation is accomplished by using voice information to steer the model. This strategy has several appealing properties:

• Voice is analyzed with regard to learned (equivalently, optimized) categories of facial gesture, rather than with regard to hypothesized categories of speech perception.

• A consistent probabilistic framework allows us to find the optimal face trajectory for a whole utterance, making full use of forward and backward context and avoiding unjustified shortcuts such as smoothing or windowing.

• Video is analyzed just once, for training; the resulting model can be re-used to face-sync any other person or creature to novel audio.

• The puppet animates speech and non-speech sounds.

• It predicts full facial motion from the neck to the hairline.

• The output is a sequence of facial motion vectors that can be used to drive 2D, 3D, or image-based animations.

3 Modeling the facial behavior manifold

It is useful to think of control in terms of the face’s true behavioral manifold—a surface of all possible facial pose and velocity configurations embedded in a high-dimensional measurement space, like crumpled paper in 3-space. Actual performances are trajectories over this manifold. Our learning strategy is to piecewise approximate this manifold with quasi-linear submanifolds, then glue together these pieces with transition probabilities. Approximation is unavoidable because there isn’t enough information in a finite training set to determine the shape of the true manifold. Therefore our control model is a probabilistic finite state machine, in which each state has an “output” probability distribution over facial poses and velocities, including how they covary. E.g., for every instantaneous pose each state predicts a unique most likely instantaneous velocity. The states are glued together with a distribution of transition probabilities that specify state-to-state switching dynamics and, implicitly, expected state durations.

Formally, this is a hidden Markov model (HMM)—Markov because all the context needed to do inference can be summed up in a vector of current state probabilities, and hidden because we never actually observe the states; we must infer them from the signal. For this we have the Viterbi algorithm [13], which identifies the most likely state sequence given a signal. The related Baum-Welch algorithm [2] estimates parameter values, given training data and a prior specification of the model’s finite state machine. Both algorithms are based on dynamic programming and give locally optimal results in linear time. For this reason, HMMs dominate the literature of speech and gesture recognition.

Unfortunately, even for very small problems such as individual phoneme recognition, finding an adequate state machine is a matter of guesswork. This limits the utility of HMMs for more complex modeling because the structure of the state machine (pattern of available transitions) is the most important determinant of a model’s success. Structure also determines the machine’s ability to carry context; the rate at which an HMM forgets context is determined by how easily it can transition between any two states.

Voice puppetry features two significant innovations in the theory of HMMs: In §4.2 we outline the mathematical basis for training algorithms that estimate both the HMM parameter values and the structure of its finite state machine. This substantially generalizes Baum-Welch. In §4.5 we introduce an efficient solution for synthesizing the most probable signal from a state sequence. This can be thought of as an inverse Viterbi. Together, these techniques allow us to turn HMMs—traditionally only good enough for classification—into models of behavioral manifolds that are accurate enough for synthesis. As such, they can be trained to predict any nonrandom time-varying signal from a coordinated signal.

4 System overview

Figure 1 schematically outlines the main phases of voice puppetry. In training (first line), video is analyzed to yield a probabilistic finite state machine; a mapping from states into regions of facial configuration space; and an occupancy matrix giving state probabilities for each frame of the training sequence.


Figure 1: Schematic of the training, remapping, analysis, and synthesis steps of voice puppetry. See §4 overview.

Figure 2: (A) Tracking of facial features for training data. See §4.1. (B) Voice-predicted facial feature locations (blue) superimposed over ground-truth locations (red). See §4.5. (C) A 3D model articulated via vertex motions. See §4.6.

In remapping (second line), the occupancy matrix is combined with the synchronized audio to give each state a dual mapping into acoustic feature space. These two steps define a puppet. In analysis of novel audio (third line), the state machine and the vocal distributions are combined to form an HMM which is used to analyze control audio, resulting in a most likely facial state sequence. This is much like speech recognition, except that the units of interest are facial states rather than phonemes. In synthesis (last line), the system solves for a trajectory through facial configuration space that is optimal with regard to the state sequence and learned facial output distributions.

4.1 Signal processing

To obtain facial articulation data, we developed a computer vision system that simultaneously tracks many individual features on the face, such as corners and creases of the lips. Taking Hager’s SSD texture-based tracker [15] as a starting point, we developed a mesh of such trackers to cover the face. Figure 2A shows the system tracking 26 points on a face. We assigned spring tensions to each edge connecting a pair of trackers, and the entire system was made to relax by simultaneously minimizing the spring energies and the residuals of the individual trackers. If a tracker falls off its landmark feature, spring forces from its neighbors tend to push it back into place. To estimate spring lengths and stiffnesses for a specific sequence, we run the video through the system, record the mean and variance of the distance between pairs of trackers, and use this to re-estimate the spring properties. A few repetitions sufficed to obtain stable and accurate tracking in our training videos. By tracking from two views we can also obtain stereo estimates of 3D depth. Our tracker can track on unmarked faces but depends on high-quality video to deliver facial texture, e.g., wrinkles or beard shadow. Since obtaining accurate data was more important than stress-testing our tracker, we marked low-texture facial areas and asked subjects to reduce head motions.
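The paper does not spell out the tracker's update equations; the following minimal sketch (NumPy only, with hypothetical `estimate_springs` and `relax` helpers and an illustrative step size) shows one way to re-estimate spring properties from observed inter-tracker distances and to relax the mesh against a blend of per-tracker SSD evidence and spring forces:

```python
import numpy as np

def estimate_springs(tracks, edges):
    """Estimate spring rest lengths and stiffnesses from tracked positions.

    tracks: (T, N, 2) tracker positions over T frames.
    edges:  list of (i, j) pairs connecting neighboring trackers.
    Rest length = mean observed distance; stiffness = 1 / (variance + eps),
    so edges whose length varies little are held more rigidly.
    """
    rest, stiff = {}, {}
    for i, j in edges:
        d = np.linalg.norm(tracks[:, i] - tracks[:, j], axis=1)
        rest[(i, j)] = d.mean()
        stiff[(i, j)] = 1.0 / (d.var() + 1e-6)
    return rest, stiff

def relax(points, ssd_targets, edges, rest, stiff, alpha=0.5, iters=20):
    """Relax tracker positions against SSD matches and spring forces.

    points:      (N, 2) current tracker positions.
    ssd_targets: (N, 2) positions proposed by each tracker's SSD matcher.
    """
    p = points.copy()
    for _ in range(iters):
        force = alpha * (ssd_targets - p)            # data term
        for (i, j) in edges:                         # spring terms
            v = p[j] - p[i]
            d = np.linalg.norm(v) + 1e-9
            f = stiff[(i, j)] * (d - rest[(i, j)]) * (v / d)
            force[i] += f
            force[j] -= f
        p += 0.1 * force                             # small gradient step
    return p
```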

To obtain a useful vocal representation, we calculate a mix of LPC and RASTA-PLP audio features [16]. These are known to be useful to speech recognition and somewhat robust to variations between speakers and recording conditions. However, they are designed for phonemic analysis and aren’t necessarily good indicators of facial activity. We also extract some prosodic features such as the formant frequencies and the energy in sonorant frequency bands.
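RASTA-PLP is not reimplemented here, but as a rough stand-in for the front end (an assumption of this rewrite, not the paper's code), the sketch below computes per-frame LPC coefficients by the autocorrelation method plus a log-energy term, using only NumPy and SciPy; frame and hop sizes are illustrative:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_features(audio, sr, order=12, frame_ms=25.0, hop_ms=12.5):
    """Per-frame LPC coefficients and log energy (autocorrelation method)."""
    n = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    win = np.hamming(n)
    feats = []
    for start in range(0, len(audio) - n, hop):
        x = audio[start:start + n] * win
        # autocorrelation r[0..order]
        r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
        if r[0] <= 0:                                 # silent frame
            a = np.zeros(order)
        else:
            # solve the Toeplitz normal equations R a = r[1:]
            a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        log_energy = np.log(r[0] + 1e-10)
        feats.append(np.concatenate([a, [log_energy]]))
    return np.array(feats)                            # (frames, order + 1)
```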

Note that the puppet may work equally well with other representations of vocal and facial signals, and even different kinds of signals. Indeed, the success of our experiments notwithstanding, “Where in the signal is the information?” is still an open question for voice-driven facial animation and more generally for face perception and speech recognition.

4.2 Learning by entropy minimization

The mapping from vocal configurations to facial configurations is many-to-many: Many sounds are compatible with one facial pose; many facial poses are compatible with one sound. Were it not for this ambiguity, we could use a regression method such as a neural network or radial basis function network. Since much of the complexity arises from causal factors such as co-articulation, the best remedy is to use context from before and after the frame of interest. The fact that the disambiguating context has no fixed length or proximity to the current frame strongly recommends that we use a hidden Markov model, which (if properly trained) can make optimal use of context across an entire utterance, regardless of its length. An HMM uses its hidden states to carry contextual information forward and backward in time; a sufficiently powerful training algorithm will naturally assign some states to that task.

Since the hidden state changes in each frame under the influence of the observed data, it is important for the probability matrix governing state transitions to be sparse, otherwise a context-carrying state will easily transition to a data-driven state, and the contextual information will be lost. We have developed a framework for training probabilistic models that minimizes their internal entropy; in HMMs that translates to maximizing compactness, sparsity, capacity to carry contextual information, and specificity of the states. The last property is also important because conventionally trained HMMs typically express the content of a frame as a mixture of states, making it impossible to say that the system was in any one state.

We briefly review the entropic training framework here, and refer readers to [5, 4] for details and derivations. We begin with a dataset X and a model whose parameters and structure are specified by the vector θ. In conventional training, one guesses the sparsity structure of θ in advance and merely re-estimates nonzero parameters to maximize the likelihood function f(X|θ). In entropic training, we learn the size of θ, its sparsity structure, and its parameter values simultaneously by maximizing the posterior probability given by Bayes’ rule,

$$\theta^{*} = \arg\max_{\theta}\big[\, P(\theta|X) \propto f(X|\theta)\, P_{e}(\theta)\,\big] \qquad (1)$$

Bayes’ rule tells us how the probability of a hypothesis θ changes after we have seen some evidence X. The key to entropic estimation is that we derive the prior probability of a hypothesis from its entropy,

$$P_{e}(\theta) \propto e^{-H(\theta)} \qquad (2)$$

where H(·) is an entropy measure defined on the model’s parameters. Entropy measures uncertainty, thus we are seeking the least ambiguous model that can explain the data. The entropic prior can be understood as a mathematization of Occam’s razor: Choose the simplest hypothesis that adequately explains the data. Simpler models are less ambiguous because they allow fewer alternatives. Entropic estimation has interpretations as optimal compression and free energy minimization [5], depending on the formulation of H(θ). The free energy interpretation derives from the differential entropy

$$H_{F}(\theta) = -\int P(X|\theta)\,\log P(X|\theta)\, dX$$

or the more tractable expected entropy; algorithmic complexity theory recommends that we use the compression formulation, which upper-bounds $H_{F}$ with the sum of the entropies of the model’s component distributions. For discrete-state Gaussian-output HMMs, the likelihood function and compression entropy are:

$$f(X|\theta) = \sum_{S} \prod_{t}^{T} \theta_{s(t)|s(t-1)}\,\mathcal{N}\!\left(x_{t};\, \mu_{s(t)}, K_{s(t)}\right), \qquad (3)$$

$$H_{C}(\theta) = \sum_{i}\left[\, \sum_{j} \theta_{j|i} \log \frac{1}{\theta_{j|i}} + \tfrac{1}{2}\log (2\pi e)^{d} |K_{i}| \,\right], \qquad (4)$$

where $\theta_{j|i}$ are transition probabilities and $\mu_{i}, K_{i}$ are the d-dimensional mean and covariance of the ith state’s Gaussian output probability density function.

Given a factorizable model such as an HMM, the maximum a posteriori (MAP) problem decomposes into a separate equation for each independent parameter, each having its own entropic prior. In [5] we present exact solutions for a wide variety of such equations, yielding very fast learning algorithms; the case of HMMs is extensively treated in [4]. MAP estimation extinguishes excess parameters and maximizes the information content of the surviving parameters. Consequently, if we begin with a large fully-connected HMM, the training procedure whittles away all parts of the model that are not in accord with the hidden structure of the signal. This allows us to learn the proper size and sparsity structure of a model. Frequently, entropic estimation of HMMs recovers a finite-state machine that is very close to the mechanism that generated the data.
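As a concrete reading of Eqn. 4, the sketch below (function and parameter names are this rewrite's, natural logarithms assumed) computes the compression entropy of a Gaussian-output HMM; the entropic posterior of Eqns. 1 and 2 then scores a candidate model by its data log-likelihood minus this quantity:

```python
import numpy as np

def compression_entropy(trans, covs, eps=1e-12):
    """Compression entropy H_C of a Gaussian-output HMM (Eqn. 4).

    trans: (N, N) row-stochastic transition matrix, theta_{j|i} = trans[i, j].
    covs:  (N, d, d) output covariances K_i.
    Sums per-state transition entropies plus Gaussian differential
    entropies 0.5 * log((2*pi*e)^d |K_i|).
    """
    N, d = trans.shape[0], covs.shape[1]
    h = 0.0
    for i in range(N):
        p = trans[i]
        p = p[p > eps]
        h += -np.sum(p * np.log(p))                       # transition entropy
        _, logdet = np.linalg.slogdet(covs[i])
        h += 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)  # Gaussian entropy
    return h

# Entropic score of a model: log_likelihood - compression_entropy(trans, covs)
```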

4.2.1 Example

The topmost illustration in figure 4 shows an HMM entropically estimated from very noisy samples of a system that orbits in a figure-eight. The true system is a 2D manifold (phase and its rate of change) embedded in a 4D measurement space (observed 2D position and velocity); the HMM approximates this manifold with neighborhoods of locally consistent curvature in which velocity covaries linearly with position. Note that even though the data is noisy and has a continuation ambiguity where it crosses itself, the entropically estimated HMM recovers the deterministic structure of the system. A conventionally estimated HMM will get “lost” at the crossing, bunching states at the ambiguity and leaving many of them incorrectly over-connected, thus allowing multiple circuits on either loop as well as small circuits on the crossing itself. It is this additional concision and precision that makes entropic models significantly outperform their conventionally estimated counterparts in traditional inference tasks, and enables novel applications such as voice puppetry.

4.3 Training and remapping

Using entropic estimation, we learn a facial dynamical model from the time-series of poses and velocities output by the vision system.

Figure 3: Reuse of the facial HMM’s internal state machine in constructing the vocal HMM. Circles signify hidden states; arrows signify conditional probabilities; icons signify regions of facial and vocal configuration space contained within each output distribution. See §4.3.

The learning algorithm gives us an HMM plus an occupancy matrix $\gamma_{i,t} = \mathrm{Prob}(\text{HMM hidden state } i \text{ explains frame } t)$ of the training video. We use γ to estimate a second set of output probabilities, given each state, of the synchronized audio track. This associates audio features to each facial state, resulting in a new vocal HMM which has the dynamics of the face, but is driven by the voice (figure 3).
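The remapping step amounts to computing γ-weighted sufficient statistics of the audio features; a minimal sketch follows, assuming Gaussian vocal outputs and a diagonal regularizer (details the paper leaves open):

```python
import numpy as np

def remap_outputs(gamma, audio_feats, reg=1e-6):
    """Give each facial state an audio output distribution (see §4.3).

    gamma:       (N, T) occupancy matrix, gamma[i, t] = P(state i | frame t).
    audio_feats: (T, d) synchronized audio features for the same frames.
    Returns per-state means (N, d) and covariances (N, d, d).
    """
    N, T = gamma.shape
    d = audio_feats.shape[1]
    w = gamma / (gamma.sum(axis=1, keepdims=True) + 1e-12)  # normalize per state
    means = w @ audio_feats                                  # (N, d)
    covs = np.empty((N, d, d))
    for i in range(N):
        x = audio_feats - means[i]
        covs[i] = (w[i][:, None] * x).T @ x + reg * np.eye(d)
    return means, covs
```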

4.4 Analysis

Given a new vocal track, we apply the Viterbi algorithm to the vocal HMM to find the most likely sequence of predicted facial states. Although it is steered by information in the new vocal track, the Viterbi sequence is constrained to follow the natural dynamics of the face.
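The analysis step is ordinary Viterbi decoding, with facial states in place of phonemes; a standard log-domain implementation looks like this (array layout and naming are this rewrite's assumptions):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely facial-state sequence for a novel vocal track (see §4.4).

    log_obs:   (T, N) log-likelihood of each frame's audio features under
               each state's vocal output distribution.
    log_trans: (N, N) log transition probabilities of the facial state machine.
    log_init:  (N,) log initial-state probabilities.
    Runs in time linear in T.
    """
    T, N = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (from, to)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_obs[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```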

4.5 Synthesis

We use the facial output probabilities to make a mapping from the Viterbi states to actual facial configurations. Were we to simply pick the most probable configuration for each state—its mean face—the animation would jerk from pose to pose. Most phoneme- and viseme-based lip-sync systems address this problem by interpolating or splining between poses. This might ameliorate the jerkiness, but it is an ad hoc solution that ignores the face’s natural dynamics.

A proper solution should yield a short, smooth trajectory that passes through regions of high probability density in configuration space at the right time—in our framework, a geodesic on the facial behavior manifold. Prior approaches to trajectory estimation typically involve optimizing an objective function having a likelihood term plus penalty terms for excess length and/or kinkiness and/or point clumpiness. The user must choose a parameterization and weighting for each term. This leads to variational algorithms that are often approximate and computationally intensive (e.g., [26]); often the objective function is nonconvex and one cannot tell whether the found optimum is global or mediocre.

Our current setting constrains the problem so significantly that a globally optimal closed-form solution is available. Because we model both pose and velocity, the facial output probabilities alone contain enough information to completely specify the smooth trajectory that is most consistent with the facial dynamics and a given facial state sequence.


The formulation is quite clean: We assume that each state has Gaussian outputs that model positions and velocities. For simplicity of exposition, we’ll assume a single Gaussian per state, but our treatment trivially generalizes to Gaussian mixtures. Let $\mu_{i}, \dot{\mu}_{i}$ be the mean position and velocity for state $i$, and let

$$K^{-1}_{i} = \begin{bmatrix} K^{xx}_{i} & K^{x\dot{x}}_{i} \\ K^{\dot{x}x}_{i} & K^{\dot{x}\dot{x}}_{i} \end{bmatrix}$$

be a full-rank covariance matrix relating positions and velocities in all dimensions. Furthermore, let $s(t)$ be the state governing frame $t$ and let $Y = \{y_{1}, y_{2}, y_{3}, \ldots\}^{\top}$ be the variable of interest, namely, the points the trajectory passes through at frames 1, 2, 3, ... (All vectors in this paper are row-major.) We seek the maximum likelihood trajectory

$$Y^{*} = \arg\max_{Y} \log \prod_{t} \mathcal{N}\!\left(\tilde{y}_{t};\, K_{s(t)}\right) = \arg\min_{Y} \sum_{t} \tilde{y}_{t}\, K^{-1}_{s(t)}\, \tilde{y}_{t}^{\top} / 2 + c \qquad (5)$$

where $\mathcal{N}(x; K)$ is the Gaussian probability of $x$ given covariance $K$, and $\tilde{y}_{t} = \big[\, y_{t} - \mu_{s(t)},\; (y_{t} - y_{t-1}) - \dot{\mu}_{s(t)} \,\big]$ is the vector of mean-subtracted facial position and velocity at time $t$. Eqn. 5 is a quadratic form having a single global optimum. Setting its derivative to zero yields a block-banded system of linear equations:

$$
\begin{bmatrix}
K^{xx}_{s(t)} + \tilde{K}^{x\dot{x}}_{s(t)} \\
\tilde{K}^{x\dot{x}}_{s(t)} + K^{\dot{x}\dot{x}}_{s(t)} \\
\tilde{K}^{x\dot{x}}_{s(t+1)} + K^{\dot{x}\dot{x}}_{s(t+1)} \\
\tilde{K}^{x\dot{x}}_{s(t+1)} \\
K^{\dot{x}\dot{x}}_{s(t+1)}
\end{bmatrix}^{\perp\top}
\begin{bmatrix}
y_{t} - \mu_{s(t)} \\
(y_{t} - y_{t-1}) - \dot{\mu}_{s(t)} \\
y_{t} - y_{t+1} \\
\mu_{s(t+1)} - y_{t} \\
\dot{\mu}_{s(t+1)}
\end{bmatrix} = 0 \qquad (6)
$$

where the block-transpose $\left[\begin{smallmatrix} A & B \\ C & D \end{smallmatrix}\right]^{\perp} = \left[\begin{smallmatrix} A & C \\ B & D \end{smallmatrix}\right] \neq \left[\begin{smallmatrix} A^{\top} & C^{\top} \\ B^{\top} & D^{\top} \end{smallmatrix}\right]$ and $\tilde{K}^{x\dot{x}}_{i} = (K^{x\dot{x}}_{i} + K^{\dot{x}x}_{i})/2$. For $T$ frames and $D = \dim(y_{t})$ dimensions, the system can be LU-decomposed and solved in time $O(TD^{3})$ [14, §4.3.1]. By scaling the velocity terms, one may also solve for this trajectory at frame rates other than that of the training video.
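One way to realize this solve in practice (a sketch under assumptions the paper does not fix, such as how the first frame is anchored) is to stack the position and velocity residuals of Eqn. 5 into one sparse linear map, form the block-banded normal equations, and hand them to a sparse solver:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def synthesize_trajectory(states, mu, mu_dot, prec, y_prev=None):
    """Globally optimal trajectory for a facial state sequence (see §4.5).

    states: length-T array of Viterbi state indices s(t).
    mu:     (N, d) per-state mean positions.
    mu_dot: (N, d) per-state mean velocities.
    prec:   (N, 2d, 2d) per-state inverse covariances over [position, velocity].
    y_prev: pose preceding frame 0 (an assumption of this sketch).
    Minimizes Eqn. 5 by solving the block-banded normal equations.
    """
    T = len(states)
    d = mu.shape[1]
    if y_prev is None:
        y_prev = mu[states[0]]
    rows, cols, vals = [], [], []
    h = np.zeros(2 * T * d)
    for t, s in enumerate(states):
        pr, vr = 2 * t * d, (2 * t + 1) * d        # position / velocity rows
        for k in range(d):
            rows += [pr + k, vr + k]
            cols += [t * d + k, t * d + k]
            vals += [1.0, 1.0]
            if t > 0:                              # velocity couples y_t, y_{t-1}
                rows.append(vr + k)
                cols.append((t - 1) * d + k)
                vals.append(-1.0)
        h[pr:pr + d] = mu[s]
        h[vr:vr + d] = mu_dot[s] + (y_prev if t == 0 else 0.0)
    G = sp.csr_matrix((vals, (rows, cols)), shape=(2 * T * d, T * d))
    W = sp.block_diag([prec[s] for s in states], format="csr")
    A = (G.T @ W @ G).tocsc()                      # block-tridiagonal, SPD
    b = G.T @ (W @ h)
    return spsolve(A, b).reshape(T, d)
```

In this sketch, solving at a frame rate other than that of the training video would amount to scaling `mu_dot` and the velocity rows accordingly.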

Figure 4 shows various ways of estimating trajectories from an HMM model of the manifold of motion in a figure-eight. Increasing the number of HMM states improves the quality of the synthesized trajectory, provided there is sufficient data to support estimates of the additional parameters. Entropic estimation will automatically remove insufficiently supported parameters.

Eqn. 5 is only justified when the Viterbi sequence $V = \{s(1), s(2), \ldots, s(T)\}$ strongly dominates the distribution of probable sequences. The Viterbi sequence, while most likely, may only represent a small fraction of the total probability mass—there may be thousands of slightly different state sequences that are nearly as likely. If this were to happen in the voice puppet, V would be a very poor representation of the relevant information in the audio, and the animation quality would suffer greatly. Consequently, we found that voice puppetry works very poorly with conventionally estimated HMMs. These problems are virtually banished with entropically estimated models because entropy minimization concentrates the probability mass on the optimal Viterbi sequence (see §5, paragraph 2 for an empirical example). In [6] we present a full Bayesian MAP solution which considers all possible state sequences and show that it and the maximum likelihood solution are only valid for low-entropy models, where they give almost identical results inferring 3D full-body pose and motion from shadows.

4.6 Animation

The puppet synthesizes would-be facial tracking data—what most likely would have been seen had the training subject produced the input vocalization.

Figure 4: TOP: An entropically estimated HMM projected onto synthetic training data. An × indicates the mean output of a state; an ellipse indicates its covariance; and arcs indicate allowable transitions. See §4.2.1. SECOND: A trajectory generated using our method based on positional and velocity distributions; the state sequence is obtained from a random walk through the HMM. (Irregularities are due to variations between state dwells in the random walk.) See §4.5. THIRD: If one solves for a geodesic using just positional constraints, all control points clump on the means. BOTTOM: Traditionally, clumpiness is ameliorated by smoothing terms, but the trajectory is still unacceptable. (This could be improved if one is able and willing to hand-tune the objective function.)

This can be used to control a 3D animated head model or to warp a 2D face image to give the appearance of motion. Or, by learning an inverse mapping from tracking data back to training video, we can directly synthesize new video. We chose a versatile solution which provides a surprisingly good illusion—a 2D image such as a photograph is texture-mapped onto a 3D model having a low triangle count—roughly 200 (figure 2C). Deformations of the 3D model give a naturalistic illusion of facial motion while the smooth shading of the image gives the illusion of smooth surfaces. The deformations can be applied directly by moving vertices according to puppet output, or indirectly by projecting synthesized facial configurations onto a basis set of motion vectors (a.k.a. facial action units) that are defined on the model (e.g., [25]).


Figure 5: Visualization of the mean configuration for some of the learned states. Dynamical content is not shown.

The latter approach has the advantage of giving us full 3D control of a model even when the training data is only 2D. Action units are also commonly used for facial animation and image coding, e.g., MPEG-4.
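As an illustration of the indirect route, predicted motion vectors can be projected onto an action-unit basis by least squares; the basis, its correspondence to the tracked points, and the names below are assumptions of this sketch:

```python
import numpy as np

def project_onto_action_units(pred_motion, au_basis):
    """Express predicted facial motion as action-unit activations (see §4.6).

    pred_motion: (T, M) per-frame motion vectors output by the puppet
                 (e.g., stacked 2D displacements of tracked points).
    au_basis:    (K, M) motion vectors of K action units expressed in the
                 same measurement space.
    Returns (T, K) activations a such that a @ au_basis approximates
    pred_motion in the least-squares sense.
    """
    acts, *_ = np.linalg.lstsq(au_basis.T, pred_motion.T, rcond=None)
    return acts.T

# Hypothetical usage: drive a 3D model whose action units are known in 3D.
# deformed = rest_vertices + (acts @ au_basis_3d).reshape(T, -1, 3)
```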

5 Examples

We recorded subjects telling a variety of children’s stories and processed 180 seconds of video, tracking 25 features on the face, mostly around the mouth and eyes. Roughly 60 seconds of the data were modeled with a 26-state entropically estimated HMM. Many of the learned states had mean outputs resembling visemes and common facial morph targets, augmented with dynamical content (figure 5). The perplexity (average branching factor) of the learned facial state machine was 2.08, indicating that the model is carrying context effects such as co-articulation an average of ≈ 4.5 frames (≈ 150 milliseconds) in either temporal direction¹. In practice, we have seen this model carry context over 330 milliseconds, indicating that the system has discovered facial co-articulation phenomena that last longer than vocal co-articulations (and have yet to be mentioned in the speech psychology literature). These properties are due to entropic estimation; an HMM conventionally trained from the same initialization carried context an average of slightly under 2 frames.

Figure 6 shows this model animating Mt. Rushmore under the control of novel voice data. In this synthesis task, the predicted face state sequence had an entropy rate of 0.0315. This means that roughly one out of every 22 predictions in the facial state sequence had a single plausible alternative. By contrast, a conventionally trained HMM yielded an entropy rate of 0.875—roughly 2.4 plausible alternatives for every prediction in the facial state sequence. As expected, the most probable sequence from the conventionally estimated HMM yielded an unacceptably degraded animation, while the properly weighted combination of all such sequences produced only a slight improvement.

5.1 Evaluation via error and coding measures

Remarkably, we found that the training data could be quite accurately reconstructed (via the model) from its most probable state sequence. After string compression, this works out to facial motion coding of less than 4 bits per frame. Reconstruction of facial motion from the vocal track was almost as good.

¹ More precisely, $\log_{2.08} 26 \approx 4.5$ is the average number of transitions the state machine takes to go between any two states. We use this as a heuristic indicator of the model’s memory. The actual amount of time the HMM takes to forget that it was in any particular state is a function of the data and the output distributions, and can be determined empirically from differences between the most likely state sequence and the set of states that are most likely to output the observed data.

We quantified this with a squared error measure of divergence between ground-truth (x) and reconstructed (y) facial motion vectors, weighted to penalize motions in the wrong direction:

$$\mathrm{Err}(\mathbf{x},\mathbf{y}) = \frac{(\mathbf{x}-\mathbf{y})(\mathbf{x}-\mathbf{y})^{\top}}{(\mathbf{x}+\mathbf{y})(\mathbf{x}+\mathbf{y})^{\top}} \qquad (7)$$
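For reference, Eqn. 7 is a one-liner; a small NumPy version (names are this rewrite's, with an averaging example in the final comment) is:

```python
import numpy as np

def weighted_err(x, y):
    """Eqn. 7: squared-error divergence between ground-truth and reconstructed
    motion vectors, normalized so motion in the wrong direction is penalized."""
    return np.dot(x - y, x - y) / np.dot(x + y, x + y)

# mean error over an utterance:
# np.mean([weighted_err(xt, yt) for xt, yt in zip(X, Y)])
```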

We reconstructed facial motion from (1) most probable state sequences of the ground-truth motion; (2) the vocal track; and (3) a minimum squared error coding of the data via activations of action units of the facial action coding system (FACS² [11, 25]). The table below shows mean errors as well as coding costs for storing and transmitting animations:

model coding      Kbytes   bits/frame   reconstruction error (train)   (test)
state sequence    ≈1.1     ≈4           0.1255                         0.1698
vocal features    ≈2.0     <500         0.1731                         0.2115
FACS              ≈0.6     >600         0.4735                         0.4692

The same ranking obtains if one switches to an unweighted squared-error measure. Note that synthesis from voice is significantly better than the reconstruction from action unit codings, indicating that the learned representation of the HMM is superior to the psychologically-motivated but heuristic representation of FACS.

We obtained even better results by training and using separate models for the lower and upper face (e.g., eyes and up). Surprisingly, even with a single model, motion in the upper face is more accurately predicted than motion around the mouth. One possible explanation is that upper facial behavior is a much less complicated phenomenon, even though it seems less directly linked to vocal behavior.

5.2 Evaluation by naive viewers

In order to judge the subjective quality of the animations, we designed a set of blind trials in which naive observers tried to distinguish synthesized from real facial motion. We took 1500 frames of tracked video that had not been used for training, set the tracking data aside, and synthesized new facial motion from the audio. We generated animations from both the synthesized motion and the “ground-truth” tracked motion, broke each animation into three segments, and presented all segments in random order to naive observers. The subjects were asked to select the “more natural” animations. Three observers consistently preferred the synthesized animation; three consistently preferred the ground-truth animation; and one preferred the ground-truth animation in two out of three segments. This modest experiment indicates that while true and synthesized facial action can be distinguished (real facial action is more varied), they are almost equally plausible to naive viewers.

6 Discussion

The main determinant of puppetry quality is the extent and variety of speech behavior in the training video.

² FACS, like phonemes and visemes, was designed for psychological analysis, but has been pressed into service for modeling and coding by computer scientists.


Figure 6: President Jefferson at rest and face-syncing to novel audio. Animation runs from neck to hairline. On contemporary hardware, compute time is less than the duration of the utterance.

With 12 seconds of training video we can produce tolerable animation; with 3 minutes we approach video-realism. The quality of puppeteering degrades gracefully as we increase acoustic noise levels or change to microphones or speakers unlike those in the training set. E.g., when trained on adult men, the puppet has some difficulty with children’s and women’s voices. We would recommend separate models for each group because of large differences in facial manner and spectral profiles between gender and age groups. It is possible to train one large model on all groups, but this requires more data than needed for separate models.

We have used French-trained puppets to produce English animations and English-trained puppets to produce Russian and Japanese animations. This compares favorably with phoneme-based systems, which typically use an English subset of phonemes. We have also found it reasonably easy, via projection, to animate heads with substantially varied geometries, e.g., toddlers and animals (figure 7).

We currently train with NTSC 29.97 Hz video—a sampling rate too low to reliably capture fast facial transients such as plosives and blinks. The puppet can infer most plosives from context, but true film-quality puppetry will probably require higher-resolution training data, tracking a hundred or so points on the face over tens of minutes of 100 Hz video. In addition, we currently make no effort to track and model the shape of the tongue; we are looking into using archival x-ray films to complement the training set. Finally, for photo-realism we must handle wrinkling and changes in skin translucency; we are exploring variants of voice puppetry that predict changes in both the facial geometry and the texture map.

The voice puppet is fully automatic. Animators, on the other hand, want full control of an animation. Aside from adjusting the raw vertex motions predicted by the voice puppet, there are several ways an animator could intercede to customize the animation. Here we list a few, beginning with the easiest: (1) Choose from a palette of puppets, each trained on a different style of speech and facial mannerisms. (2) Increase the variance of the training data, which produces a cartoon-like exaggerated range of motion in facial expression. (3) Add whole-face expression vectors (e.g., a grin) to those generated by the voice puppet. (4) Edit the facial state sequence. Options 3 & 4 are analogous to the present-day practices of superimposing multiple morph targets and editing a phoneme sequence, respectively.

7 Summary

Voice puppetry combines the voice, face, and facial mannerisms of three different people into a realistic speaking animation. Given novel audio, the system accurately generates lip and whole-face motions in the style of the training performance, even reproducing subtle effects such as co-articulation. This purely data-driven approach stands on two innovations: An entropy-minimization algorithm learns extremely compact and accurate probabilistic models of the facial behavior manifold from training video; a closed-form solution for geodesics on this manifold yields facial motion sequences that are optimally compatible with new audio and with learned facial behavior.

8 Acknowledgments

Thanks to interns and code contributors: The HMM and geodesic code was optimized by Ken Shan. The face renderer [25] was graciously provided to us by Robert Forchheimer, and modified for real-time animation by Ilya Baran. The texture tracker [15] was obtained from Greg Hager and modified to include mesh constraints by Ken Shan and Jon Yedidia. Ilya Baran, Jane Maduram, and Ken Mathews performed for training data and helped to process and analyze it. Thanks to the Rickover Science Institute for providing these talented high school interns. The acoustic analysis code [16] was obtained from the Berkeley International Computer Science Institute web site. Finally, thanks to anonymous reviewers and to colleagues who previewed this work at the 1998 Workshop on Perceptual User Interfaces for their comments and questions.

References

[1] J.E. Ball and D.T. Ling. Spoken language processing in the Persona conversational assistant. In Proc. ESCA Workshop on Spoken Dialogue Systems, 1995.

[2] L. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.

[3] C. Benoit, C. Abry, M.-A. Cathiard, T. Guiard-Marigny, and T. Lallouache. Read my lips: Where? How? When? And so... What? In 8th Int. Congress on Event Perception and Action, Marseille, France, July 1995. Springer-Verlag.

[4] M. Brand. Structure discovery in conditional probability models via an entropic prior and parameter extinction. Neural Computation (accepted 8/98), October 1997.

[5] M. Brand. Pattern discovery via entropy minimization. In Proc. Artificial Intelligence and Statistics #7. Morgan Kaufmann Publishers, January 1999.

[6] M. Brand. Shadow puppetry. Submitted to Int. Conf. on Computer Vision, ICCV '99, 1999.

[7] C. Bregler, M. Covell, and M. Slaney. Video Rewrite: Driving visual speech with audio. In Proc. ACM SIGGRAPH '97, 1997.

[8] T. Chen and R. Rao. Audio-visual interaction in multimedia communication. In Proc. ICASSP '97, 1997.


Figure 7: The many faces of a voice puppet.

[9] M.M. Cohen and D.W. Massaro. Modeling coarticulation in synthetic visual speech. In N.M. Thalmann and D. Thalmann, editors, Models and Techniques in Computer Animation. Springer-Verlag, 1993.

[10] S. Curinga, F. Lavagetto, and F. Vignoli. Lip movement synthesis using time delay neural networks. In Proc. EUSIPCO '96, 1996.

[11] P. Ekman and W.V. Friesen. Manual for the Facial Action Coding System. Consulting Psychologists Press, Inc., Palo Alto, CA, 1978.

[12] T. Ezzat and T. Poggio. MikeTalk: A talking facial display based on morphing visemes. In Proc. Computer Animation Conference, June 1998.

[13] G.D. Forney. The Viterbi algorithm. Proc. IEEE, 6:268–278, 1973.

[14] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins, 3rd edition, 1996.

[15] G. Hager and K. Toyama. The XVision system: A general-purpose substrate for portable real-time vision applications. Computer Vision and Image Understanding, 69(1):23–37, 1997.

[16] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4):578–589, October 1994.

[17] I. Katunobu and O. Hasegawa. An active multimodal interaction system. In Proc. ESCA Workshop on Spoken Dialogue Systems, 1995.

[18] J.P. Lewis. Automated lip-sync: Background and techniques. J. Visualization and Computer Animation, 2:118–122, 1991.

[19] D.F. McAllister, R.D. Rodman, and D.L. Bitzer. Speaker independence in lip synchronization. In Proc. CompuGraphics '97, December 1997.

[20] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.

[21] K. Stevens (MIT). Personal communication, 1998.

[22] S. Morishima and H. Harashima. A media conversion from speech to facial image for intelligent man-machine interface. IEEE J. Selected Areas in Communications, 4:594–599, 1991.

[23] F.I. Parke. A parametric model for human faces. Technical Report UTEC-CSc-75-047, University of Utah, 1974.

[24] F.I. Parke. A model for human faces that allows speech synchronized animation. J. Computers and Graphics, 1(1):1–4, 1975.

[25] M. Rydfalk. CANDIDE, a parameterised face. Technical Report LiTH-ISY-I-0866, Department of Electrical Engineering, Linköping University, Sweden, October 1987. Java demo available at http://www.bk.isy.liu.se/candide/candemo.html.

[26] L.K. Saul and M.I. Jordan. A variational principle for model-based interpolation. Technical report, MIT Center for Biological and Computational Learning, 1996.

[27] E.F. Walther. Lipreading. Nelson-Hall Inc., Chicago, 1982.

[28] K. Waters and T. Levergood. DECface: A system for synthetic face applications. Multimedia Tools and Applications, 1:349–366, 1995.

[29] E. Yamamoto, S. Nakamura, and K. Shikano. Lip movement synthesis from speech based on hidden Markov models. In Proc. Int. Conf. on Automatic Face and Gesture Recognition, FG '98, pages 154–159, Nara, Japan, 1998. IEEE Computer Society.