
Lip Movement Synthesis from Text

Shishir Mathur1

1 Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur

July 20, 2017

Shishir Mathur (1Department of Computer Science and Engineering Indian Institute of Technology, Kanpur)Lip Movement Synthesis from Text July 20, 2017 1 / 33

Outline

1 Objective and Motivation

2 Prerequisite Knowledge
    Generative Adversarial Networks
    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
    Generating Videos with Scene Dynamics
    Generative Adversarial Text to Image Synthesis

3 Approach
    Video Preprocessing
    Basic Video Generation Network
    Basic Video Generation with Text Embedding Network
    Modified Video Generation with Embedding

4 Dataset Experiments

5 Result Visualization


Objective and Motivation: Lip Reading

Figure: Lip Reading Procedure


Objective and Motivation: Lip Writing

Figure: Lip Writing Procedure

Hallucinating lip movement for new words

Feature Vector for Lip Reading Tasks


Prerequisite Knowledge: Generative Adversarial Network

An unsupervised machine learning algorithm implemented by two neural networks, a Generator and a Discriminator, that compete against each other in a zero-sum game framework.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
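As a rough illustration of this objective, the sketch below estimates V(D, G) from samples by Monte Carlo averaging. The toy `discriminator` and `generator` functions and the two sampling distributions are hypothetical stand-ins chosen for the example, not the networks or data used in this work.

```python
import math
import random

def discriminator(x):
    """Hypothetical toy discriminator: a sigmoid, so the output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def generator(z):
    """Hypothetical toy generator: shifts the noise sample."""
    return z - 1.0

def value_function(real_samples, noise_samples):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(discriminator(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(
        math.log(1.0 - discriminator(generator(z))) for z in noise_samples
    ) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(2.0, 0.5) for _ in range(1000)]   # stand-in for samples from p_data
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for samples from p_z
v = value_function(real, noise)
```

The discriminator tries to push this value up, while the generator tries to push it down. Both expectations are averages of logarithms of probabilities, so the estimate is never positive.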


Prerequisite Knowledge: Deep Convolutional Generative Adversarial Network

DCGAN was the first attempt at implementing a GAN in a deep convolutional framework.


1 Discriminator Training
    Get image data from the dataset.
    Compute the cross-entropy loss of the discriminator on this data with a true label.
    Generate a sample from the generator.
    Compute the cross-entropy loss of the discriminator on the generated data with a false label.
    Backpropagate the loss through the discriminator and update the discriminator parameters.

2 Generator Training
    Compute the cross-entropy loss of the discriminator on the generated data with a true label.
    Backpropagate the loss through the discriminator to obtain the loss at the image-level representation.
    Backpropagate this image-level loss through the generator network and update its parameters.
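The two loss computations in this procedure can be sketched as below. This is a minimal illustration that assumes the discriminator emits a single probability per sample and that true and false labels are 1 and 0. The values `d_real` and `d_fake` are hypothetical discriminator outputs, and the backpropagation and parameter updates are left out.

```python
import math

def bce(prediction, label):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prediction + eps)
             + (1.0 - label) * math.log(1.0 - prediction + eps))

def discriminator_loss(d_real, d_fake):
    """Real data is scored against the true label, generated data against the false label."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """Generated data is scored against the true label: the generator wants to fool D."""
    return bce(d_fake, 1.0)

# Hypothetical discriminator outputs for one real and one generated image.
d_real, d_fake = 0.9, 0.2
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

The generator loss shrinks as the discriminator's output on generated data approaches 1, which is the signal that is propagated back through the discriminator into the generator.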


Prerequisite Knowledge: Generating Videos with Scene Dynamics

$$G(z) = m(z) \odot f(z) + (1 - m(z)) \odot b(z)$$
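In this formula, m(z) is a mask, f(z) the foreground stream and b(z) the background stream, combined element by element. The sketch below applies the same blend to three hypothetical pixel values stored in flat lists; a real model would apply it to full video tensors.

```python
def composite(mask, foreground, background):
    """Element-wise m * f + (1 - m) * b over flat lists of pixel values."""
    return [m * f + (1.0 - m) * b for m, f, b in zip(mask, foreground, background)]

# Hypothetical values: mask 1 keeps the foreground, mask 0 keeps the background.
mask = [1.0, 0.0, 0.5]
foreground = [0.8, 0.8, 0.8]
background = [0.2, 0.2, 0.2]
video = composite(mask, foreground, background)

# With a mask of all ones the output is the foreground stream alone.
foreground_only = composite([1.0, 1.0, 1.0], foreground, background)
```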


Prerequisite Knowledge: Generative Adversarial Text to Image Synthesis


Approach: Video Preprocessing

Figure: Dataset Preprocessing steps

Approach: Basic Video Generation Network

The lip movement videos had no background, and their only dynamic aspect was the lip movement.

We therefore simplified the network by keeping only the Foreground generation stream of the VideoGAN framework.

Training followed the standard GAN training procedure.


Figure: Basic Video Generation Network Generator and Discriminator


Approach: Basic Video Generation with Text Embedding Network

For video generation from text embeddings, we first set up a model that was simply an amalgamation of our basic video generator model and Scott Reed's method of appending the embeddings.

The embedding is upsampled to a 128-dimensional vector using a fully-connected layer, which is then passed through a LeakyReLU layer. This embedding is then appended to the initial noise vector.

The discriminator is also updated from the base model for the new task. At the layer where the spatio-temporal dimension of the discriminator is 1024 × 4 × 4 × 4, the text embedding is again upsampled to 128 dimensions, passed through a LeakyReLU layer, and then replicated and appended to the discriminator features so that the new dimension is (1024 + 128) × 4 × 4 × 4.
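The replicate-and-append step in the discriminator can be sketched with plain nested lists, as below. The sketch assumes a channels-first layout and that the 128-dimensional embedding is copied to every position of the 4 × 4 × 4 grid before being concatenated along the channel axis. The helper names and the constant values are hypothetical.

```python
def tile_embedding(embedding, depth, height, width):
    """Replicate a length-C vector into a C x depth x height x width nested list."""
    return [
        [[[value for _ in range(width)] for _ in range(height)] for _ in range(depth)]
        for value in embedding
    ]

def concat_channels(features, tiled_embedding):
    """Concatenate two nested lists along the first (channel) axis."""
    return features + tiled_embedding

def shape(tensor):
    """Shape of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

# Stand-in for the 1024 x 4 x 4 x 4 discriminator feature map.
features = [[[[0.0] * 4 for _ in range(4)] for _ in range(4)] for _ in range(1024)]
embedding = [0.1] * 128  # stand-in for the projected text embedding
combined = concat_channels(features, tile_embedding(embedding, 4, 4, 4))
combined_shape = shape(combined)
```

Under these assumptions the combined tensor has 1024 + 128 = 1152 channels over the same 4 × 4 × 4 grid.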


Figure: Basic Video Generation with Text Embedding Generator and Discriminator


Basic Video Generation with Text Embedding Network: Training Procedure

For the Discriminator

1 From the database, get video frames, their corresponding text embeddings, and a set of mismatched ("fake") database videos that have different text embeddings.

2 Calculate the error for the batch in the following way.
    Get the error from the database video with the corresponding text embedding, with label true.
    Get the error from the generated video and the text embedding, with label false.
    Get the error from the mismatched database video and text embedding, with label false.

3 Backpropagate this error through the discriminator network and update the Discriminator parameters.
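The three error terms of the discriminator step can be sketched as follows. The sketch assumes binary cross-entropy as the error and sums the three terms with equal weight. That weighting is an assumption, since the slides do not state how the terms are combined. All discriminator outputs shown are hypothetical.

```python
import math

def bce(prediction, label):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prediction + eps)
             + (1.0 - label) * math.log(1.0 - prediction + eps))

def text_conditioned_discriminator_loss(d_real_matched, d_generated, d_real_mismatched):
    """Sum of the three error terms (equal weighting is an assumption)."""
    real_term = bce(d_real_matched, 1.0)         # database video + its own text: true
    generated_term = bce(d_generated, 0.0)       # generated video + text: false
    mismatch_term = bce(d_real_mismatched, 0.0)  # database video + wrong text: false
    return real_term + generated_term + mismatch_term

# Hypothetical discriminator outputs for the three kinds of input pair.
confident = text_conditioned_discriminator_loss(0.99, 0.01, 0.01)
undecided = text_conditioned_discriminator_loss(0.5, 0.5, 0.5)
```

The mismatched term is what forces the discriminator to check whether the video agrees with the text, rather than only whether the video looks real.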


Basic Video Generation with Text Embedding Network: Training Procedure

For the Generator

1 Pass the generated video used in discriminator training, together with the text embedding, through the discriminator and compute the error against the true label.

2 Backpropagate this error through the discriminator network to obtain the error at the video-level representation.

3 Backpropagate this video-level error through the Generator network and update its parameters.


Approach: Modified Video Generation with Embedding

The results generated by the basic model were decipherable as lip movement, but they were blurry.

We expanded upon the basic model by changing the generator and discriminator architectures as well as the training procedure.


Modified Video Generation with Embedding: Generator

Figure: Modified Generator

Modified Video Generation with Embedding: Discriminator

Figure: Modified Discriminator

Modified Video Generation with Embedding Network: Changes in Training

1 We sampled the generator's input noise from a spherical Gaussian rather than a uniform distribution.

2 We replaced the ReLU layers with LeakyReLU in both the generator and the discriminator.

3 Rather than using the two hard target labels (0, 1) for true and false, we use soft labels: (0-0.3) for true and (0.7-1.2) for false. This leads to better training of the generator and discriminator.

4 The Discriminator was converging towards zero error early in training, which destabilized the Generator. To avoid this, we added Dropout layers to both the generator and the discriminator.
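The first and third changes can be sketched as follows. The sketch assumes that each soft label is drawn uniformly from the quoted range and that the noise vector has independent standard normal components. The noise dimension of 100 is a hypothetical value, as the slides do not give one.

```python
import random

def soft_label(is_true, rng):
    """Draw a soft target from the ranges quoted above (uniform draw is an assumption)."""
    return rng.uniform(0.0, 0.3) if is_true else rng.uniform(0.7, 1.2)

def spherical_gaussian_noise(dim, rng):
    """Noise vector with independent zero-mean, unit-variance components."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)
true_labels = [soft_label(True, rng) for _ in range(8)]
false_labels = [soft_label(False, rng) for _ in range(8)]
z = spherical_gaussian_noise(100, rng)  # 100 is a hypothetical noise dimension
```

Drawing a fresh target for every sample keeps the discriminator from being rewarded for fully saturated outputs.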


Dataset Experiments: Grid Dataset

The dataset has 34 users saying sentences in the format <command> <color> <preposition> <letter> <digit> <adverb>, like "place blue at F 9 now".

Type           Number of Words    Words
command        4                  bin, lay, place, set
color          4                  blue, green, red, white
preposition    4                  at, by, in, with
letter         25                 A-Z excluding W
digit          10                 0-9
adverb         4                  again, now, please, soon


Various Datasets for Generation

Sub Sampling Dataset: We took the 75 frames of each video, subsampled 32 frames from them at regular intervals, and used the full text embedding associated with the video.

Multi Word Dataset: We split the 2 second videos into 2 parts of almost equal size according to the frames in which the words are spoken. Each of the 2 videos was subsampled to 32 frames and paired with its corresponding word embedding.

One Word Dataset: Comprises the frames of people saying a single word, super sampled from the corpus videos and paired with the one-word embedding.
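Picking 32 of 75 frames at regular intervals can be sketched as below. The sketch assumes the first and last frames are always kept and that the fractional positions in between are rounded to the nearest frame. The slides do not describe the exact selection rule, so this is only one plausible reading.

```python
def subsample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly across num_frames."""
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

indices = subsample_indices(75, 32)
```

The same helper could map a shorter clip onto 32 frames by repeating indices, which is one way to read the "super sampled" one-word clips.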


Results: Basic Video Generation with Sub Sampling Dataset

Figure: Basic Video Generation with Sub Sampling Dataset


Results: Basic Embedding model with Sub Sampling Dataset

Figure: Basic Embedding model with Sub Sampling Dataset


Results: Modified Embedding Model with Sub Sampling Dataset

Figure: Modified Embedding Model with Sub Sampling Dataset


Results: Modified Embedding Model with Multi Word Dataset

Figure: Modified Embedding Model with Multi Word Dataset


Results: Modified Embedding Model with One Word Dataset

Figure: Modified Embedding Model with One Word Dataset


Quantitative Results: Structural Similarity Index

The Structural Similarity Index (SSIM), introduced by Z. Wang et al. in 2004, measures the structural similarity between two images.

The SSIM index is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
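A minimal sketch of this formula, computed once over two whole images given as flat lists of pixel intensities, is shown below. The constants c1 = (0.01 L)^2 and c2 = (0.03 L)^2 with dynamic range L = 255 follow the usual convention and are an assumption here, as is the use of population (1/n) statistics. Practical implementations typically evaluate SSIM over local windows and average the results, so this global version will not reproduce the reported scores.

```python
def ssim(x, y, dynamic_range=255.0):
    """Global SSIM between two equal-length lists of pixel intensities."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # conventional stabilizing constants (assumed)
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical four-pixel images.
image_a = [10.0, 50.0, 200.0, 120.0]
image_b = [200.0, 120.0, 10.0, 50.0]
same = ssim(image_a, image_a)
different = ssim(image_a, image_b)
```

An image compared with itself scores 1, and the score drops as the two images decorrelate.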


Word     SSIM Score      Word     SSIM Score      Word     SSIM Score
a        0.4996644682    in       0.5339761656    set      0.5494068596
again    0.5157354864    j        0.4691103475    seven    0.5109698342
at       0.5325048724    k        0.4763904443    sil      0.5107815605
b        0.4859383738    l        0.4938563569    six      0.5021996758
bin      0.5272700028    lay      0.5411741415    soon     0.5374915408
blue     0.5473855642    m        0.4411119419    sp       0.486668573
by       0.498573907     n        0.4966639662    t        0.4368403466
c        0.4942128081    nine     0.5535959598    three    0.5040174459
d        0.4700953782    now      0.5198881993    two      0.539838367
e        0.4768304858    o        0.4852925663    u        0.5290575558
eight    0.4897522701    one      0.5018125984    v        0.4799423712
f        0.4533106431    p        0.4722277124    white    0.5351952106
five     0.4702177553    place    0.4851515854    with     0.5211795036
four     0.5259320842    please   0.5468605807    x        0.4876435969
g        0.5023067994    q        0.4857770015    y        0.4842674642
green    0.5004838687    r        0.4713588346    z        0.4627526471
h        0.4562585593    red      0.5369716873    zero     0.5007434968
i        0.4918623863    s        0.4510213242

Table: SSIM score between real and generated videos


Similar Lip Movement Words

Word1    Word2    Real Videos      Generated Videos
u        blue     0.7734897963     0.7988792887
a        e        0.8125130974     0.7845548215
b        bin      0.7974434175     0.7881345659
blue     two      0.7828630665     0.7928570448
blue     bin      0.8102865591     0.7875543107
in       nine     0.7805094837     0.7997297857

Different Lip Movement Words

Word1    Word2    Real Videos      Generated Videos
four     d        0.7400236139     0.7609438891
seven    t        0.7423081211     0.7207883703
one      e        0.7108516006     0.7364385358
four     k        0.7440401521     0.7207561837
set      place    0.6997665547     0.7223355249
seven    place    0.7540461594     0.7261831424
at       five     0.70381279997    0.73909304439

Table: SSIM score between Similar and Different Lip Movement Words

Qualitative Results

Figure: Four Eight M


Qualitative Results

Figure: Five Blue B


Thank You

Any Questions?

