Lip Movement Synthesis from Text

Shishir Mathur1

1Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur

July 20, 2017

Shishir Mathur (1Department of Computer Science and Engineering Indian Institute of Technology, Kanpur)Lip Movement Synthesis from Text July 20, 2017 1 / 33

Outline

1 Objective and Motivation

2 Prerequisite Knowledge
   Generative Adversarial Networks
   Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
   Generating Videos with Scene Dynamics
   Generative Adversarial Text to Image Synthesis

3 Approach
   Video Preprocessing
   Basic Video Generation Network
   Basic Video Generation with Text Embedding Network
   Modified Video Generation with Embedding

4 Dataset Experiments

5 Result Visualization

Objective and Motivation: Lip Reading

Figure: Lip Reading Procedure

Objective and Motivation: Lip Writing

Figure: Lip Writing Procedure

Hallucinating lip movement for new words

Feature Vector for Lip Reading Tasks

Prerequisite Knowledge: Generative Adversarial Network

An unsupervised machine learning algorithm implemented by two neural networks, a Generator and a Discriminator, that compete against each other in a zero-sum game framework.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
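To make the value function concrete, here is a minimal sketch that estimates V(D, G) from batches of discriminator outputs. The function name `value_function` and the sample values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: D(x) evaluated on samples from the data distribution.
    d_fake: D(G(z)) evaluated on generated samples.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from generated outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -2 log 2.
v_confused = value_function(np.full(8, 0.5), np.full(8, 0.5))

# A discriminator that separates the two well pushes V toward 0 from below.
v_confident = value_function(np.full(8, 0.99), np.full(8, 0.01))
```

The discriminator tries to raise this quantity while the generator tries to lower it, which is the zero-sum game the slide refers to.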

Prerequisite Knowledge: Deep Convolutional Generative Adversarial Network

DCGAN was the first attempt at implementing GAN in a Deep Convolutional framework.

1 Discriminator Training
   Get image data from the dataset.
   Find the cross-entropy loss of the discriminator on this data with a true label.
   Generate a sample from the generator.
   Find the cross-entropy loss of the discriminator on the generated data with a false label.
   Backpropagate the loss through the discriminator and update the discriminator parameters.

2 Generator Training
   Find the cross-entropy loss of the discriminator on the generated data with a true label.
   Backpropagate the loss through the discriminator to find the loss at the image-level representation.
   Backpropagate this image-level loss through the generator network and update its parameters.
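The two loss computations in this procedure can be sketched as follows. This is only an illustration of which label is paired with which input; the discriminator outputs are hypothetical numbers, and the gradient updates themselves are left to whatever framework is used.

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy between discriminator outputs and a constant label."""
    pred = np.clip(np.asarray(pred, dtype=float), 1e-7, 1 - 1e-7)
    return float(np.mean(-(label * np.log(pred) + (1 - label) * np.log(1 - pred))))

def discriminator_loss(d_real, d_fake):
    # Dataset images are scored against the true label,
    # generated images against the false label.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # The generator is scored against the true label: its loss is low
    # when the discriminator takes generated images for real ones.
    return bce(d_fake, 1.0)

d_real = np.array([0.9, 0.8, 0.95])   # hypothetical D outputs on dataset images
d_fake = np.array([0.2, 0.1, 0.3])    # hypothetical D outputs on generated images
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

With these example outputs the discriminator is winning, so `loss_g` is larger than `loss_d`.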

Prerequisite Knowledge: Generating Videos with Scene Dynamics

$$G(z) = m(z) \odot f(z) + (1 - m(z)) \odot b(z)$$
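As a rough illustration of this composition, the sketch below blends a foreground video and a static background image with a per-pixel mask. The array layout (time, height, width, channels) and the sizes are assumptions chosen for the example.

```python
import numpy as np

def compose_video(mask, foreground, background):
    """G(z) = m * f + (1 - m) * b, elementwise.

    mask:       (T, H, W, 1), values in [0, 1]
    foreground: (T, H, W, C), one frame per time step
    background: (H, W, C), a single image repeated over time
    """
    return mask * foreground + (1.0 - mask) * background[None, ...]

rng = np.random.default_rng(0)
T, H, W, C = 32, 64, 64, 3              # illustrative sizes
mask = rng.uniform(size=(T, H, W, 1))
foreground = rng.uniform(-1, 1, size=(T, H, W, C))
background = rng.uniform(-1, 1, size=(H, W, C))
video = compose_video(mask, foreground, background)
```

A mask of all ones returns the foreground unchanged, and a mask of all zeros returns the background repeated in every frame.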

Prerequisite Knowledge: Generative Adversarial Text to Image Synthesis

Approach: Video Preprocessing

Figure: Dataset Preprocessing steps

Approach: Basic Video Generation Network

The lip movement videos had no background; the only dynamic element in them was the lip movement.

We therefore simplified the network by keeping only the foreground generation stream of the VideoGAN framework.

The training procedure was the standard GAN training procedure.

Figure: Basic Video Generation Network Generator and Discriminator

Approach: Basic Video Generation with Text Embedding Network

For video generation from a text embedding, we first set up a model that simply combines our basic video generator with Scott Reed's method of appending the embedding.

The embedding is upsampled to a 128-dimensional vector by a fully connected layer and then passed through a LeakyReLU layer. The result is appended to the initial noise vector.

The discriminator is also updated from the base model for the new task. At the layer where the spatio-temporal dimension of the discriminator is 1024 × 4 × 4 × 4, the text embedding is again upsampled to 128 dimensions, passed through a LeakyReLU layer, and then replicated and appended to the discriminator features so as to make the new dimension (1024 + 128) × 4 × 4.
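The conditioning described above can be sketched in NumPy as follows. The raw embedding size (50), the noise size (100), the LeakyReLU slope (0.2), and the random weights are all assumptions for illustration; the slides give only the 128-dimensional projection and the 1024-channel feature map. The sketch tiles the embedding over every trailing axis of whatever feature map it is given.

```python
import numpy as np

def leaky_relu(x, slope=0.2):           # slope is an assumed value
    return np.where(x > 0, x, slope * x)

def project_embedding(text_emb, weight, bias):
    """Fully connected layer to 128 dimensions followed by LeakyReLU."""
    return leaky_relu(text_emb @ weight + bias)

def generator_input(noise, emb128):
    # Append the projected embedding to the noise vector.
    return np.concatenate([noise, emb128], axis=-1)

def condition_discriminator(features, emb128):
    # features: (channels, *positions). Copy the embedding to every
    # position and stack it along the channel axis.
    positions = features.shape[1:]
    column = emb128.reshape(-1, *([1] * len(positions)))
    tiled = np.broadcast_to(column, (emb128.size, *positions))
    return np.concatenate([features, tiled], axis=0)

rng = np.random.default_rng(0)
emb_dim, noise_dim = 50, 100            # assumed sizes, not given in the slides
text_emb = rng.normal(size=emb_dim)
weight = rng.normal(scale=0.02, size=(emb_dim, 128))
bias = np.zeros(128)

emb128 = project_embedding(text_emb, weight, bias)
g_in = generator_input(rng.normal(size=noise_dim), emb128)
d_feat = condition_discriminator(rng.normal(size=(1024, 4, 4, 4)), emb128)
```

Here `g_in` has 100 + 128 entries and `d_feat` has 1024 + 128 channels, with the same 128 values repeated at every position.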

Figure: Basic Video Generation with Text Embedding Generator Discriminator

Basic Video Generation with Text Embedding Network: Training Procedure

For the Discriminator

1 From the database, get video frames, their corresponding text embeddings, and a set of "fake" database videos, that is, real videos paired with different text embeddings.

2 Calculate the error for the batch as follows:
   Get the error from the database video with its corresponding text embedding, with label true.
   Get the error from the generated video and the text embedding, with label false.
   Get the error from the mismatched database video and text embedding, with label false.

3 Backpropagate this error through the discriminator network and update the discriminator parameters.
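The three error terms in step 2 can be sketched as a single loss. Summing them with equal weight is an assumption made here for simplicity; the slides do not say how the terms are combined, and the function and argument names are hypothetical.

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy between discriminator outputs and a constant label."""
    pred = np.clip(np.asarray(pred, dtype=float), 1e-7, 1 - 1e-7)
    return float(np.mean(-(label * np.log(pred) + (1 - label) * np.log(1 - pred))))

def conditional_discriminator_loss(d_real_match, d_fake_match, d_real_mismatch):
    """Discriminator loss over the three (video, embedding) pairings.

    d_real_match:    D(database video, its own embedding)       -> label true
    d_fake_match:    D(generated video, the embedding)          -> label false
    d_real_mismatch: D(database video, a different embedding)   -> label false
    """
    loss_real = bce(d_real_match, 1.0)
    loss_fake = bce(d_fake_match, 0.0)
    loss_mismatch = bce(d_real_mismatch, 0.0)
    return loss_real + loss_fake + loss_mismatch   # equal weights: an assumption

# Hypothetical discriminator outputs for one small batch.
loss = conditional_discriminator_loss(
    d_real_match=[0.9, 0.85],
    d_fake_match=[0.2, 0.1],
    d_real_mismatch=[0.4, 0.3],
)
```

The mismatched term is what forces the discriminator to check that the video agrees with the text, not only that the video looks real.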

Basic Video Generation with Text Embedding Network: Training Procedure

For the Generator

1 Pass the generated video, together with the text embedding, through the discriminator and find the error with the true label.

2 Backpropagate this error through the discriminator network to find the error at the video-level representation.

3 Backpropagate this video-level error through the generator network and update its parameters.

Approach: Modified Video Generation with Embedding

The results generated by the basic model were decipherable as lip movement, but they were blurry.

We expanded upon the basic model by changing the generator and discriminator architectures as well as the training procedure.

Modified Video Generation with Embedding: Generator

Figure: Modified Generator

Modified Video Generation with Embedding: Discriminator

Figure: Modified Discriminator

Modified Video Generation with Embedding Network: Changes in Training

1 We sampled the generator's input noise from a spherical Gaussian rather than a uniform distribution.

2 We replaced ReLU layers with LeakyReLU in both the generator and the discriminator.

3 Rather than using two hard target labels (0, 1) for true and false, we use soft labels: (0-0.3) for true and (0.7-1.2) for false. This leads to better training of the generator and discriminator.

4 The discriminator was training quickly toward zero error, which destabilized generator training. To avoid this, we added Dropout layers in both the generator and the discriminator.
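The first and third changes can be sketched as two small sampling helpers. Drawing the soft labels uniformly within the stated ranges is an assumption, as are the batch size and the noise dimension; the ranges themselves are the ones given on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(batch, dim):
    """Spherical Gaussian noise: independent N(0, 1) entries."""
    return rng.standard_normal(size=(batch, dim))

def soft_labels(batch, is_true):
    # Ranges as stated on the slide: [0, 0.3] for true, [0.7, 1.2] for false.
    # Sampling uniformly inside each range is an assumption.
    low, high = (0.0, 0.3) if is_true else (0.7, 1.2)
    return rng.uniform(low, high, size=batch)

z = sample_noise(16, 100)               # noise dimension 100 is an assumption
y_true = soft_labels(16, is_true=True)
y_false = soft_labels(16, is_true=False)
```

Each batch then sees slightly different targets, so the discriminator is never pushed toward a fully saturated output.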

Dataset Experiments: Grid Dataset

The dataset has 34 users saying sentences in the format <command> <color> <preposition> <letter> <digit> <adverb>, such as "place blue at F 9 now".

Type          Number of Words   Words
command       4                 bin, lay, place, set
color         4                 blue, green, red, white
preposition   4                 at, by, in, with
letter        25                A-Z excluding W
digit         10                0-9
adverb        4                 again, now, please, soon
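The sentence grammar in the table is small enough to write out directly. The sketch below, with hypothetical names `GRID_VOCAB`, `count_sentences`, and `random_sentence`, counts the sentences the grammar allows (4 × 4 × 4 × 25 × 10 × 4 = 64,000) and draws one at random.

```python
import random

# Word lists copied from the table; dict order follows the sentence format.
GRID_VOCAB = {
    "command": ["bin", "lay", "place", "set"],
    "color": ["blue", "green", "red", "white"],
    "preposition": ["at", "by", "in", "with"],
    "letter": [c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if c != "W"],
    "digit": [str(d) for d in range(10)],
    "adverb": ["again", "now", "please", "soon"],
}

def count_sentences(vocab):
    """Number of distinct sentences: product of the slot sizes."""
    total = 1
    for words in vocab.values():
        total *= len(words)
    return total

def random_sentence(vocab, rng):
    """Pick one word per slot, in the order of the sentence format."""
    return " ".join(rng.choice(words) for words in vocab.values())

total = count_sentences(GRID_VOCAB)     # 64000
example = random_sentence(GRID_VOCAB, random.Random(0))
```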

Various Datasets for Generation

Sub Sampling Dataset: Took the 75 frames of each video, subsampled 32 frames at regular intervals, and used the full text embedding associated with the video.

Multi Word Dataset: Broke down the 2 second videos into 2 parts of almost equal size according to the frames in which the words are spoken. Each of the 2 parts was subsampled to 32 frames and paired with its corresponding word embedding.

One Word Dataset: Comprised frames of people saying a single word, supersampled from the corpus videos and paired with the one-word embedding.
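One way to pick 32 frames at regular intervals from a 75-frame video is sketched below. The exact index rule used for the datasets is not given in the slides, so rounding evenly spaced positions is an assumption, as is the placeholder frame size.

```python
import numpy as np

def subsample_indices(num_frames, num_samples=32):
    """Evenly spaced frame indices that include the first and last frame."""
    return np.round(np.linspace(0, num_frames - 1, num_samples)).astype(int)

idx = subsample_indices(75)
video = np.zeros((75, 64, 64, 3))       # placeholder clip; frame size is an assumption
clip = video[idx]                       # 32 frames selected along the time axis
```

The same index rule applied to a clip shorter than 32 frames would repeat some indices, which is one simple reading of the supersampling used for single-word clips.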

Results: Basic Video Generation with Sub Sampling Dataset

Figure: Basic Video Generation with Sub Sampling Dataset

Results: Basic Embedding model with Sub Sampling Dataset

Figure: Basic Embedding model with Sub Sampling Dataset

Results: Modified Embedding Model with Sub Sampling Dataset

Figure: Modified Embedding Model with Sub Sampling Dataset

Results: Modified Embedding Model with Multi Word Dataset

Figure: Modified Embedding Model with Multi Word Dataset

Results: Modified Embedding Model with One Word Dataset

Figure: Modified Embedding Model with One Word Dataset

Quantitative Results: Structural Similarity Index

SSIM is the Structural Similarity Index, introduced in 2004 by Z. Wang et al. It measures the similarity in structure of two images.

The SSIM index is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
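The formula can be evaluated directly from image statistics. The sketch below computes it once over the whole image rather than over local windows, and uses the commonly quoted constants c1 = (0.01 L)^2 and c2 = (0.03 L)^2 with L the pixel range; whether the reported scores were computed this way is not stated in the slides, so treat both choices as assumptions.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM from global means, variances and covariance of two images."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(64, 64))       # stand-in for a video frame
noisy = np.clip(frame + rng.normal(0, 40, size=frame.shape), 0, 255)

score_same = ssim_global(frame, frame)           # identical images
score_noisy = ssim_global(frame, noisy)          # degraded copy scores lower
```

For a video, one simple option is to average this score over corresponding frames.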

Word    SSIM Score      Word    SSIM Score      Word    SSIM Score
a       0.4996644682    in      0.5339761656    set     0.5494068596
again   0.5157354864    j       0.4691103475    seven   0.5109698342
at      0.5325048724    k       0.4763904443    sil     0.5107815605
b       0.4859383738    l       0.4938563569    six     0.5021996758
bin     0.5272700028    lay     0.5411741415    soon    0.5374915408
blue    0.5473855642    m       0.4411119419    sp      0.486668573
by      0.498573907     n       0.4966639662    t       0.4368403466
c       0.4942128081    nine    0.5535959598    three   0.5040174459
d       0.4700953782    now     0.5198881993    two     0.539838367
e       0.4768304858    o       0.4852925663    u       0.5290575558
eight   0.4897522701    one     0.5018125984    v       0.4799423712
f       0.4533106431    p       0.4722277124    white   0.5351952106
five    0.4702177553    place   0.4851515854    with    0.5211795036
four    0.5259320842    please  0.5468605807    x       0.4876435969
g       0.5023067994    q       0.4857770015    y       0.4842674642
green   0.5004838687    r       0.4713588346    z       0.4627526471
h       0.4562585593    red     0.5369716873    zero    0.5007434968
i       0.4918623863    s       0.4510213242

Table: SSIM score between real and generated videos

Similar Lip Movement Words

Word1   Word2   Real Videos     Generated Videos
u       blue    0.7734897963    0.7988792887
a       e       0.8125130974    0.7845548215
b       bin     0.7974434175    0.7881345659
blue    two     0.7828630665    0.7928570448
blue    bin     0.8102865591    0.7875543107
in      nine    0.7805094837    0.7997297857

Different Lip Movement Words

Word1   Word2   Real Videos     Generated Videos
four    d       0.7400236139    0.7609438891
seven   t       0.7423081211    0.7207883703
one     e       0.7108516006    0.7364385358
four    k       0.7440401521    0.7207561837
set     place   0.6997665547    0.7223355249
seven   place   0.7540461594    0.7261831424
at      five    0.70381279997   0.73909304439

Table: SSIM score between Similar and Different Lip Movement Words
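One simple way to read the two tables above is to compare column averages. The sketch below does that with the scores copied from the tables; the averaging itself is an added illustration and not an analysis reported in the slides.

```python
# (word1, word2): (SSIM on real videos, SSIM on generated videos)
similar = {
    ("u", "blue"): (0.7734897963, 0.7988792887),
    ("a", "e"): (0.8125130974, 0.7845548215),
    ("b", "bin"): (0.7974434175, 0.7881345659),
    ("blue", "two"): (0.7828630665, 0.7928570448),
    ("blue", "bin"): (0.8102865591, 0.7875543107),
    ("in", "nine"): (0.7805094837, 0.7997297857),
}
different = {
    ("four", "d"): (0.7400236139, 0.7609438891),
    ("seven", "t"): (0.7423081211, 0.7207883703),
    ("one", "e"): (0.7108516006, 0.7364385358),
    ("four", "k"): (0.7440401521, 0.7207561837),
    ("set", "place"): (0.6997665547, 0.7223355249),
    ("seven", "place"): (0.7540461594, 0.7261831424),
    ("at", "five"): (0.70381279997, 0.73909304439),
}

def column_means(pairs):
    """Mean SSIM over the word pairs, for real and generated videos."""
    real = [r for r, _ in pairs.values()]
    generated = [g for _, g in pairs.values()]
    return sum(real) / len(real), sum(generated) / len(generated)

sim_real, sim_gen = column_means(similar)
diff_real, diff_gen = column_means(different)
```

In both columns the similar-word pairs average higher than the different-word pairs, so the generated videos keep the ordering seen in the real videos.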

Qualitative Results

Figure: Four Eight M

Qualitative Results

Figure: Five Blue B

Thank You

Any Questions?
