
Published as a conference paper at ICLR 2020

FROM INFERENCE TO GENERATION: END-TO-END FULLY SELF-SUPERVISED GENERATION OF HUMAN FACE FROM SPEECH

*Hyeong-Seok Choi¹, *Changdae Park², Kyogu Lee¹
¹Music and Audio Research Group (MARG), Seoul National University
²Department of Computer Science and Engineering, Hong Kong University of Science and Technology
[email protected], [email protected], [email protected]
*Equal contribution

ABSTRACT

This work seeks the possibility of generating a human face from voice solely based on audio-visual data, without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links the inference stage and the generation stage. First, the inference networks are trained to match the speaker identity between the two different modalities. Then the trained inference networks cooperate with the generation network by giving conditional information about the voice. The proposed method exploits recent developments in GAN techniques and generates the human face directly from the speech waveform, making our system fully end-to-end. We analyze the extent to which the network can naturally disentangle two latent factors that contribute to the generation of a face image - one that comes directly from a speech signal and the other that is not related to it - and explore whether the network can learn to generate a natural human face image distribution by modeling these factors. Experimental results show that the proposed network can not only match the relationship between the human face and speech, but can also generate high-quality human face samples conditioned on its speech. Finally, the correlation between the generated face and the corresponding speech is quantitatively measured to analyze the relationship between the two modalities.

1 INTRODUCTION

Utilizing audio-visual cues together to recognize a person’s identity has been studied in various fields, from neuroscience (Hasan et al., 2016; Tsantani et al., 2019) to practical machine learning applications (Nagrani et al., 2018b;a; Wen et al., 2019a; Shon et al., 2019). For example, some neurological studies have found that in some cortical areas, humans recognize familiar individuals by combining signals from several modalities, such as faces and voices (Hasan et al., 2016). In conjunction with these neurological studies, it is also a well-known fact that the human speech production system is directly related to the shape of the vocal tract (Mermelstein, 1967; Teager & Teager, 1990).

Inspired by the aforementioned scientific evidence, we would like to ask three related questions from the perspective of machine learning: 1) Is it possible to match the identity of faces and voices? (inference) 2) If so, is it possible to generate a face image from a speech signal? (generation) 3) Can we find the relationship between the two modalities only using cross-modal self-supervision with data “in-the-wild”? To answer these questions, we design a two-step approach where the inference and generation stages are trained sequentially. First, the two inference networks for each modality (a speech encoder and a face encoder) are trained to extract useful features and to compute the cross-modal identity matching probability. Then the trained inference networks are transferred to the generation stage to pass information about the speech, which helps the generation network output a face image from the conditioned speech.

We believe, however, that it is impossible to perfectly reconstruct all the attributes in the image of a person’s face from the characteristics of the voice alone. This is due to factors that are clearly unrelated to one’s voice, such as lighting, glasses, and orientation, that also exist in natural face images. To reflect the diverse characteristics presented in the face images “in-the-wild”, we therefore model the generation process by incorporating two latent factors into the neural network. More specifically, we adopted conditional generative adversarial networks (cGANs) (Mirza & Osindero, 2014; Miyato & Koyama, 2018) so that the generator network can produce a face image that is dependent not only on the paired speech condition, but also on a stochastic variable. This allows the latent factors that contribute to the overall facial attributes to be disentangled into two factors: one that is relevant to the voice and the other that is irrelevant.

Adopting cGANs naively still leaves a few problems. For example, the condition in the cGANs framework is typically provided as embedded conditional vectors through an embedding look-up table for one-hot encoded labels (Brock et al., 2019; Miyato & Koyama, 2018). Raw signals such as speech, however, cannot be taken directly from an embedding look-up table, so an encoder module is required. Therefore, the trained speech encoder from the inference step is reused to output a pseudo conditional label that is used to extract meaningful information relevant to the corresponding face. Then the generator and the discriminator are trained in an adversarial way by utilizing the pseudo-embedded conditional vectors obtained from the speech encoder trained in the first step.

Another problem with applying conventional cGANs to generating faces from voice arises from the fact that the distinction between different speakers can be quite subtle, which calls for a more effective conditioning method. To mitigate this problem, we propose a new loss function, the relativistic identity cGANs (relidGANs) loss, a modification of the relativistic GANs loss (Jolicoeur-Martineau, 2019), allowing us to generate faces with more distinct identities.

Each step will be described in greater detail in Section 3.

Our contributions can be summarized as follows:

1. We propose simple but effective end-to-end inference networks trained on audio-visual data without any labels in a self-supervised manner that perform a cross-modal identity matching task.

2. A cGANs-based generation framework is proposed to generate the face from speech, seamlessly integrated with the trained networks from the inference stage.

3. A new loss function, the so-called relidGANs loss, is designed to preserve a more consistent identity between the voices and the generated images.

4. An extensive analysis is conducted on both inference and generation tasks to validate our proposed approaches.

2 RELATED WORKS

There has been an increasing interest in the self-supervised learning framework within the machine learning research community. While the focus of many studies has been on applications of matching audio-visual correspondence, such as the presence or absence of objects in a video or the temporal alignment between two modalities (Chung et al., 2017; Afouras et al., 2018b;a; Ephrat et al., 2018), growing attention has come to matching the speaker identity between the human face and voice.

Inference The cross-modal identity matching task between face and voice has recently been studied in the machine learning community. Nagrani et al. (2018b) proposed a convolutional-neural-network (CNN) based biometric matching network that concatenates the two embedding vectors from each modality and classifies them with an additional classifier network. Though showing promising results, this style of training is limited in that the number of concatenated vectors used in training cannot be changed in the test phase. Next, Nagrani et al. (2018a) modeled the concept of personal identity nodes (Bruce & Young, 1986) by mapping each modality into a shared embedding space and using a triplet loss to train the encoders, allowing more flexible inference. Lastly, Wen et al. (2019a) proposed DIMNet, where two independent encoders are trained to map the speech and face into a shared embedding space, classifying each of them independently utilizing supervised learning signals from labels such as identity, gender, and nationality.


Generation There have been a few concurrent works that tackled the similar problem of generating a face image from speech, which are worth noting here. Duarte et al. (2019) proposed a GANs-based framework to generate a face image from speech, but the intention of their work was not to generate a face for an unseen speaker identity but rather to explore the possibility of cross-modal generation between speech and face itself. Oh et al. (2019) proposed a similar work called Speech2Face. A pre-trained face recognition network and a neural parametric decoder were used to produce a normalized face (Parkhi et al., 2015; Cole et al., 2017). After that, the speech encoder was trained to directly estimate the input parameters of the face decoder. Because the training process still requires pre-trained modules trained with supervision, we do not consider this a fully self-supervised approach. Lastly, the work most similar to ours was recently proposed by Wen et al. (2019b), where a pre-trained speech identification network is used as a speech encoder, and a GANs-based approach produces the face conditioned on the speech embedding.

Our proposed method differs from the abovementioned approaches in the following ways. First, none of the previous approaches has tried to model the stochasticity in the generation process; we address this by incorporating stochasticity in the latent space so that different face images can be sampled even when the speech condition is fixed. Second, modeling the image is more challenging in our work as we aim to train our network to produce a larger image size (128 × 128) compared to other GANs-based works. Also, we trained the model on the AVSpeech dataset (Ephrat et al., 2018), which includes extremely diverse dynamics in the images. Finally, an important scope of this work is to seek the possibility of training the whole inference and generation pipeline using only self-supervised learning, which is the first such attempt to our knowledge. The whole pipeline of the proposed approach is illustrated in Fig. 1.

Figure 1: Overview of the proposed inference and generation stages. (a) Inference stage: the speech encoder S and the face encoder produce a speech embedding (se) and face embeddings (fe1, fe2) for speaker 1 and speaker 2; their inner products followed by a softmax give the cross-modal matching probabilities. (b) Generation stage: the trained speech encoder (fixed) and face encoder are transferred; the speech condition c and the random vector z drive the generator G through AdaIN conditioning, while the discriminator D scores real and fake faces using the inner product with c together with φ and ψ.


3 SELF-SUPERVISED INFERENCE AND GENERATION

3.1 CROSS-MODAL IDENTITY MATCHING

In order to successfully identify the speaker identity from the two audio-visual modalities, we trained two encoders, one for each modality. First, a speech encoder is trained to extract information related to a person’s identity from a segment of speech. To this end, we use a raw-waveform based speech encoder that was shown to extract useful features in the speaker verification task, outperforming conventional features such as mel-frequency cepstral coefficients (MFCCs) even when trained using self-supervised learning in the speech modality (Pascual et al., 2019)¹. The first layer of the speech encoder is SincNet, where parameterized sinc functions act as trainable band-pass filters (Ravanelli & Bengio, 2018). The rest of the layers are composed of 7 stacks of 1d-CNN, batch normalization (BN), and multi-parametric rectified linear unit (PReLU) activation. Next, we used a 2d-CNN based face encoder with residual connections in each layer. Note that the network structure is similar to the discriminator network of Miyato & Koyama (2018). The details of the networks are shown in Appendix C.

¹The speech encoder can be downloaded in the following link: https://github.com/santi-pdp/pase

Based on the speech encoder S(·) and the face encoder F(·), the two encoders are trained to correctly identify whether a given face and speech are paired or not. Specifically, as a cross-modal identity matching task, we consider two settings as in Nagrani et al. (2018b): 1) V-F: given a segment of speech, select one face out of K different faces; 2) F-V: given a face image, select one speech segment out of K different speech segments. The probability that the j-th face f_j is matched to the given speech s, or vice versa, is computed by the inner product of the embedding vectors from each module followed by the softmax function as follows:

1. V-F:  p(y = j | {S(s), F(f_j)}_{j=1}^{K}) = p(y = j | {s_e, f_{e_j}}_{j=1}^{K}) = exp(<s_e, f_{e_j}>) / Σ_{k=1}^{K} exp(<s_e, f_{e_k}>),

2. F-V:  p(y = j | {S(s_j), F(f)}_{j=1}^{K}) = p(y = j | {s_{e_j}, f_e}_{j=1}^{K}) = exp(<s_{e_j}, f_e>) / Σ_{k=1}^{K} exp(<s_{e_k}, f_e>),    (1)

where s denotes a speech segment, f denotes a face image, s_e denotes a speech embedding vector, and f_e denotes a face embedding vector.

Note that using the inner product as a means of computing similarity allows more flexible inference. For example, the proposed method enables both F-V and V-F settings regardless of which setting the model was trained in. It also allows setting a different K in the test phase than in the training phase.

Finally, the two encoders are trained solely using the self-supervised learning signal from the cross-entropy error.
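To make the matching objective concrete, the following is a minimal PyTorch sketch (not the authors' released code) of how the V-F probability of Eq. 1 and its cross-entropy loss could be computed for a minibatch; the tensor shapes, the function name, and the random embeddings standing in for encoder outputs are our own assumptions.

```python
import torch
import torch.nn.functional as F

def vf_matching_loss(speech_emb, face_embs, target_idx):
    """Cross-modal V-F matching loss built from Eq. 1.

    speech_emb: (B, D) speech embeddings s_e = S(s).
    face_embs:  (B, K, D) embeddings of the K candidate faces F(f_j) per speech segment.
    target_idx: (B,) index of the positive face among the K candidates.
    """
    # Inner product between each speech embedding and its K candidate face embeddings.
    logits = torch.einsum('bd,bkd->bk', speech_emb, face_embs)  # (B, K)
    # The softmax of Eq. 1 is folded into the cross-entropy loss.
    return F.cross_entropy(logits, target_idx)

# Toy usage with random tensors standing in for encoder outputs (10-way training setting).
B, K, D = 4, 10, 128
speech_emb = torch.randn(B, D)
face_embs = torch.randn(B, K, D)
target_idx = torch.zeros(B, dtype=torch.long)  # positive face placed at index 0
print(vf_matching_loss(speech_emb, face_embs, target_idx).item())
```

The F-V setting is symmetric: the roles of the single speech embedding and the K candidate face embeddings are simply swapped.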

3.2 GENERATING FACE FROM VOICE

Although we aim to generate the human face from speech, it is only natural to think that not all attributes of the face image are correlated to the speech. Hence, we assume the latent space of the face to be broken down into two parts: the deterministic variable c from the speech encoder and a random variable z sampled from a Gaussian distribution.

Such a latent space is modeled using cGANs, a generative model that allows sampling face images conditioned on speech. More specifically, randomly sampled Gaussian noise z ∈ R^128 and the speech condition c ∈ R^128 are concatenated and used as the input to the generator, f_fake = G(z, c), to sample a face image conditioned on the speech. In addition, the adaptive instance normalization (AdaIN) technique is applied as a more direct conditioning method in each layer of the generator network (Huang & Belongie, 2017; Karras et al., 2019). The details of the network are described in Appendix C.
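As an illustration of this conditioning scheme, below is a hedged sketch of an AdaIN layer whose scale and shift are predicted from the speech condition c, together with the concatenation of z and c at the generator input. The class and attribute names (`AdaIN`, `style_fc`) and the `(1 + gamma)` parameterization are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize the per-channel statistics of the
    feature map, then apply a scale (gamma) and shift (beta) predicted from the
    speech condition c (Huang & Belongie, 2017)."""
    def __init__(self, num_channels, cond_dim=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.style_fc = nn.Linear(cond_dim, num_channels * 2)

    def forward(self, x, c):
        gamma, beta = self.style_fc(c).chunk(2, dim=1)   # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

# Toy usage: concatenate z and c for the generator input, condition a feature map with AdaIN.
z = torch.randn(4, 128)              # stochastic latent
c = torch.randn(4, 128)              # speech condition from the (fixed) speech encoder
gen_input = torch.cat([z, c], dim=1) # (4, 256), fed to the generator's first FC layer
feat = torch.randn(4, 64, 16, 16)    # an intermediate generator feature map
print(AdaIN(64)(feat, c).shape)      # torch.Size([4, 64, 16, 16])
```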

In order to generate face images that are not only close to the real face image distribution but also match the given condition, the condition information must be properly fed into the discriminator. Many studies have suggested such conditioning approaches for the discriminator (Reed et al., 2016; Odena et al., 2017; Miyato & Koyama, 2018); among them, we adopted a recent conditioning method, the projection discriminator (Miyato & Koyama, 2018), which not only suggests a more principled way of conditioning embeddings into the discriminator, but has also been widely used in many successful GANs-related works (Brock et al., 2019; Zhang et al., 2018; Miyato et al., 2018). The study showed that, writing the discriminator function as D(f, c) := A(g(f, c)), the condition information can be effectively provided to the discriminator using the inner product of two vectors, c and φ(f), as follows:

g(f, c) = c^T φ(f) + ψ(φ(f)),    (2)

where A(·) denotes an activation function (sigmoid in our case), c denotes the condition embedding, φ(f) denotes the output from an inner layer of the discriminator, and ψ(·) denotes a function that maps an input vector to a scalar value (a fully-connected layer in our case). Here we focus on the fact that the conditioning signal can be expressed as an inner product of two vectors and replace it with the same inner-product operation used to compute the identity matching probability in the inference framework. Accordingly, we can rewrite Eq. 2 by providing the condition with the trained speech encoder, c = S(s), and substituting φ(·) with the trained face encoder F(·) from Section 3.1 as follows:

g(f, c) = c^T φ(f) + ψ(φ(f)) = S(s)^T F(f) + ψ(F(f)).    (3)
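A minimal sketch of the projection output in Eq. 3, assuming the transferred face encoder returns a feature vector φ(f) with the same dimensionality as the speech condition; `ProjectionHead` and `psi` are placeholder names, and the sigmoid A(·) is applied later inside the loss.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Computes g(f, c) = c^T phi(f) + psi(phi(f)) as in Eq. 3 (projection discriminator)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.psi = nn.Linear(feat_dim, 1)  # unconditional term psi(phi(f))

    def forward(self, phi_f, c):
        # phi_f: (B, D) features from the (transferred) face encoder; c: (B, D) speech condition.
        projection = (c * phi_f).sum(dim=1, keepdim=True)  # c^T phi(f), shape (B, 1)
        return projection + self.psi(phi_f)                # g(f, c), shape (B, 1)

head = ProjectionHead(128)
phi_f, c = torch.randn(8, 128), torch.randn(8, 128)
print(head(phi_f, c).shape)  # torch.Size([8, 1])
```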

Next, we adopted the relativistic GANs (relGANs) loss, which was reported to give stable image generation performance; see Jolicoeur-Martineau (2019) for more details. Combining the condition term c^T φ(f) and the relGANs loss, g(f, c) can be modified to g_rel(f_real, f_fake, c) as follows:

g_rel(f_real, f_fake, c) = g(f_real, c) − g(f_fake, c)
                         = c^T φ(f_real) − c^T φ(f_fake) + ψ(φ(f_real)) − ψ(φ(f_fake)),

g_rel(f_fake, f_real, c) = g(f_fake, c) − g(f_real, c)
                         = c^T φ(f_fake) − c^T φ(f_real) + ψ(φ(f_fake)) − ψ(φ(f_real)).    (4)

Eq. 4 is formulated to produce a face from the paired speech, but it can cause catastrophic forgetting on the trained φ(·) because the discriminator is no longer trained to penalize mismatched face and voice. Thus, we again modify Eq. 4 so that the discriminator relativistically penalizes a mismatched face and voice more than a positively paired face and voice as follows:

g_relid(f_real, f_fake, c_+, c_−) = g_rel(f_real, f_fake, c_+) + c_+^T φ(f_real) − c_−^T φ(f_real),

g_relid(f_fake, f_real, c_+, c_−) = g_rel(f_fake, f_real, c_+) + c_+^T φ(f_fake) − c_−^T φ(f_fake),    (5)

where f_real and c_+ denote the paired face and speech condition from the data distribution, f_fake denotes the generated face sample conditioned on c_+, and c_− denotes a speech condition whose identity is mismatched to f_real, obtained using negative sampling.

Finally, utilizing the non-saturating loss (Goodfellow et al., 2014; Jolicoeur-Martineau, 2019), the proposed objective functions (relidGANs loss) for the discriminator, L_D, and the generator, L_G, become as follows:

L_D = −E_{(f_real, c_+, c_−) ∼ p_data, f_fake ∼ p_gen} [log(A(g_relid(f_real, f_fake, c_+, c_−)))],

L_G = −E_{(f_real, c_+, c_−) ∼ p_data, f_fake ∼ p_gen} [log(A(g_relid(f_fake, f_real, c_+, c_−)))].    (6)
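The sketch below writes Eq. 5 and Eq. 6 directly in terms of precomputed discriminator outputs; it reflects our reading of the loss rather than the authors' code. Here `g_*` stands for the projection output of Eq. 3 with the positive condition c_+, and `proj_*` for the bare c^T φ(f) terms with the positive or negative condition.

```python
import torch
import torch.nn.functional as F

def relid_losses(g_real_pos, g_fake_pos, proj_real_pos, proj_real_neg,
                 proj_fake_pos, proj_fake_neg):
    """relidGANs losses (Eq. 5 and 6).

    g_*_pos:    g(f, c+) = c+^T phi(f) + psi(phi(f)) for real/fake images.
    proj_*_pos: c+^T phi(f) for real/fake images.
    proj_*_neg: c-^T phi(f) with a mismatched (negative) speech condition.
    All tensors have shape (B, 1).
    """
    # Eq. 5: relativistic difference plus the identity penalty term.
    g_relid_d = (g_real_pos - g_fake_pos) + proj_real_pos - proj_real_neg
    g_relid_g = (g_fake_pos - g_real_pos) + proj_fake_pos - proj_fake_neg
    # Eq. 6: non-saturating losses; -log(sigmoid(x)) is written as softplus(-x) for stability.
    loss_d = F.softplus(-g_relid_d).mean()
    loss_g = F.softplus(-g_relid_g).mean()
    return loss_d, loss_g

# Toy usage with random scores standing in for discriminator outputs.
scores = [torch.randn(8, 1) for _ in range(6)]
loss_d, loss_g = relid_losses(*scores)
print(loss_d.item(), loss_g.item())
```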

4 EXPERIMENTS AND RESULTS

Dataset and Sampling Two datasets were used throughout the experiments. The first dataset is the AVSpeech dataset (Ephrat et al., 2018). It consists of 2.8 million YouTube video clips of people actively speaking. Among them, we downloaded about 800k video clips from the train set and 140k clips from the test set. Out of the 800k training samples, we used 50k as a validation set. The face images included in the dataset are extremely diverse since they were extracted from videos “in the wild” (e.g., closing eyes, moving lips, diverse video quality, and diverse facial expressions), making it challenging to train a generation model. In addition, since no speaker identity information is provided, training the model with this dataset is considered fully self-supervised. Because of the absence of speaker identity information, we assumed that each audio-visual clip represents an individual identity. Therefore, a positive audio-visual pair was sampled only within a single clip, while the negative samples were randomly selected from any clips excluding the ones sampled by the positive pair. Note that each speech and face image included in the positive pair was sampled from different time frames within the clip. This was to ensure that the encoded embeddings of each modality do not contain linguistic-related features (e.g., lip movement or phoneme-related features).

The second dataset is the intersection of VoxCeleb (Nagrani & Zisserman, 2017) and VGGFace (Parkhi et al., 2015). VoxCeleb is also a large-scale audio-visual dataset collected from YouTube videos. VGGFace is a collection of images of public figures gathered from the web, which is less diverse and relatively easier to model compared to the images in the AVSpeech dataset. Since both VGGFace and VoxCeleb provide speaker identity information, we used the speakers included in both datasets, resulting in 1,251 speakers. We used face images from both VoxCeleb and VGGFace, and speech audio from VoxCeleb. Note that this dataset provides multiple images and audio clips for a single speaker; therefore, training a model with this dataset cannot be called self-supervised training in a strict sense.

Implementation In every experiment, we used 6 seconds of audio. Speech samples that were shorter or longer than 6 seconds were duplicated or randomly truncated so that they became 6 seconds. At the inference stage, we trained the networks with a stochastic gradient descent (SGD) optimizer. At the generation stage, we trained the networks with the Adam optimizer (Kingma & Ba, 2015). The discriminator network was regularized using the R1 regularizer (Mescheder et al., 2018). More details are described in Appendix B.
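For reference, a hedged sketch of the R1 regularizer of Mescheder et al. (2018), i.e., a penalty on the squared gradient norm of the discriminator output with respect to real images; the weighting `gamma` and the toy discriminator are placeholders, since the paper does not state the coefficient it used.

```python
import torch

def r1_penalty(d_out_real, real_images, gamma=10.0):
    """R1 regularizer: gamma/2 * E[ ||grad_x D(x)||^2 ], applied on real images only.

    d_out_real:  (B, 1) discriminator outputs for real images.
    real_images: (B, C, H, W) real images with requires_grad_(True) set before the forward pass.
    gamma:       placeholder weighting coefficient (not specified in the paper).
    """
    grads = torch.autograd.grad(outputs=d_out_real.sum(), inputs=real_images,
                                create_graph=True)[0]
    penalty = grads.pow(2).reshape(grads.shape[0], -1).sum(dim=1).mean()
    return 0.5 * gamma * penalty

# Toy usage with a linear stand-in for the discriminator.
real = torch.randn(4, 3, 8, 8, requires_grad=True)
d_out = real.reshape(4, -1).sum(dim=1, keepdim=True)  # stand-in for D(real)
print(r1_penalty(d_out, real).item())
```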

4.1 CROSS-MODAL INFERENCE EVALUATION RESULTS

Fully self-supervised inference accuracy results: Here we show the evaluation results of the inference networks trained on the AVSpeech dataset. Again, the model was trained with no additional information about speaker identity. The evaluations were conducted 5 times, selecting different negative samples each time, and we report the average values.

Table 1: Accuracy (%) of cross-modal identity matching task.

Train \ Test    V-F (2-way)   V-F (10-way)   F-V (2-way)   F-V (10-way)
V-F             89.22         54.33          85.10         44.70
F-V             86.94         49.20          88.47         52.55

We trained our model in two settings (1. V-F and 2. F-V), and both were trained in the 10-way setting (1 positive pair and 9 negative pairs). The two trained models were then tested in both V-F and F-V settings, each of which was tested in 2-way and 10-way settings. The top-1 accuracy (%) results of the cross-modal identity matching task are shown in Table 1. The results show that our network can perform the task of cross-modal identity matching with a reasonable accuracy despite being trained in a fully self-supervised manner. In addition, our model can perform F-V inference with a reasonable accuracy even when it is trained in the V-F setting, and vice versa.

Comparison results: We compared our model with two models, SVHFNet (Nagrani et al., 2018b) and DIMNet (Wen et al., 2019a). For fair comparison, we trained our model with the intersection of the VoxCeleb and VGGFace datasets. Our model was trained in two settings (1. V-F and 2. F-V) with a 10-way configuration.

Table 2: Comparison results in terms of accuracy (%) of the cross-modal identity matching task.

Model        V-F (2-way)   V-F (10-way)²   F-V (2-way)   F-V (10-way)²
SVHFNet      81.00         35              79.50         -
DIMNet-IG    84.12         40              84.03         -
Ours         79.90         55.66           80.83         54.84

²The 10-way results are inferred from the graph of the paper DIMNet (Wen et al., 2019a).

The comparison results are shown in Table 2. The results show that our model performs worse than the other models when tested in the 2-way setting. Note that SVHFNet used a pre-trained speaker identification network and face recognition network trained in a supervised manner with identity information, and DIMNet trained its network in a supervised manner using labels such as identity, gender, and nationality, and is therefore not trainable without such labels, whereas our model was trained from scratch without any such labels. Under the more challenging situation where K increases from two to ten, however, our model shows significantly better performance, which may indicate that the proposed method is capable of extracting more informative and reliable features for the cross-modal identity matching task.

4.2 CROSS-MODAL GENERATION EVALUATION RESULTS

We conducted three qualitative analyses (QLAs) and three quantitative analyses (QTAs) to thoroughly examine the relationship between the condition and the generated face samples.

QLA 1. Random samples from the (z, c) plane: The generated face images from diversely sampled z and a fixed speech condition c are shown in Fig. 2. We can observe that each variable shows different characteristics. For example, it is observable that z controls the orientation of the head, background color, hair color, hairstyle, glasses, etc. Alternatively, we observed that c controls gender, age, ethnicity, and the details of the face such as the shape of the face and the shape of the nose.

Figure 2: The illustration of generated face images from interpolated speech conditions. Note that the first column consists of the ground truth images of the speakers³.

³Speech samples: https://drive.google.com/open?id=1n9VTYm9Z–dxNpwS-ELiotmXES6COE4h

QLA 2. Generated samples from interpolated speech conditions: Next, we conducted an experiment to observe how speech conditions affect facial attributes by exploring the latent variable c with z fixed. We sampled speech segments from the AVSpeech test set and linearly interpolated two speech conditions. After that, we generated face images based on the interpolated condition vectors. The generated face images are shown in Fig. 3. We can observe a smooth transition of the speech condition c in the latent space and can confirm that it controls the parts that are directly correlated to the identity of a person, such as gender, age, and ethnicity.

QLA 3. Generated samples from an interpolated stochastic variable: Lastly, we conducted an experiment that shows the role of the stochastic variable in the generation process by interpolating two randomly sampled z vectors with c fixed. The speech segments from the AVSpeech test set were used to make a fixed c for each row. The generated face images are shown in Fig. 4. We can observe that the latent vector z controls the parts that are not directly correlated to the identity of a person, such as head orientation, hair color, attire, and glasses.

QTA 1. Correlation between the generated samples and speech conditions: We performed a quantitative analysis to investigate the relationship between the speech condition and the face image it generates. To this end, we first sampled two different random variables z1, z2 and two different speech condition vectors c1, c2 of different speakers. Next, we generated two face images f_fake1 = G(z1, c1) and f_fake2 = G(z2, c2) using the generator network. Then the two generated samples were encoded using the trained inference network F(·) to extract the face embeddings fe1 = F(f_fake1) and fe2 = F(f_fake2). We then calculated the cosine distance between the speech condition vectors, CD(c1, c2), and the cosine distance between the two face embeddings, CD(fe1, fe2). Finally, we computed the Pearson correlation between CD(c1, c2) and CD(fe1, fe2). This is to see if there exists any positive correlation between the embedding spaces of the two different modalities; that is, to see if closer speech embeddings help generate face images whose embeddings are also closer, even when the random variable z is perturbed.

Figure 3: The illustration of generated face images from interpolated speech condition vectors while fixing z. Note that the images on the very left and right sides of each row are the ground truth face images of the speakers.

Figure 4: The illustration of generated face images from interpolated random vectors while fixing c. Note that the images on the very left side of each row are the ground truth face images of the speakers.

Fig. 5 (a) shows the results using the test set of AVSpeech. We can see a positive correlation between CD(c1, c2) and CD(fe1, fe2), meaning that the generated face images are not randomly sampled.
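A sketch of how this correlation could be computed; scipy is used for the cosine distance and the Pearson correlation, and the random vectors and the number of sampled pairs are stand-ins for the actual encoder outputs and sample count, which the paper does not specify.

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import cosine  # cosine distance = 1 - cosine similarity

rng = np.random.default_rng(0)
cd_speech, cd_face = [], []
for _ in range(1000):                        # number of sampled pairs is an assumption
    c1, c2 = rng.normal(size=128), rng.normal(size=128)    # stand-ins for S(s1), S(s2)
    fe1, fe2 = rng.normal(size=128), rng.normal(size=128)  # stand-ins for F(G(z1,c1)), F(G(z2,c2))
    cd_speech.append(cosine(c1, c2))         # CD(c1, c2)
    cd_face.append(cosine(fe1, fe2))         # CD(fe1, fe2)

corr, _ = pearsonr(cd_speech, cd_face)       # Pearson correlation between the two distance lists
print(f"Pearson correlation: {corr:.4f}")
```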

Furthermore, we examined whether the CD between two speech condition vectors, and between the face images generated from them, gets closer when controlling the gender of the speakers. We tested this on the VoxCeleb dataset as it provides gender labels for each speaker identity. We compared the mean values of CD(c1, c2) and CD(fe1, fe2) in two cases: one setting the genders of the two speakers to be different, and the other setting the genders to be the same. Fig. 5 (b) shows a scatter plot of the CD when the two sampled speakers have different genders, and Fig. 5 (c) shows the case where the genders are the same. We found that the mean value of CD(c1, c2) gets smaller (0.46 → 0.28) when we set the gender of the two speakers the same, and accordingly CD(fe1, fe2) also gets smaller (0.27 → 0.15).

Figure 5: The scatter plots of CD(c1, c2) and CD(fe1, fe2). (a) AVSpeech (correlation: 0.5857); (b) VoxCeleb, different gender (correlation: 0.4131); (c) VoxCeleb, same gender (correlation: 0.4514).

QTA 2. Testing the generated samples with the inference networks: Here we conducted two inference experiments in the V-F setting. In the first experiment, given a positive pair (f, s) from the AVSpeech test set, two face images (f and G(z, S(s))) are passed to the trained inference networks. The inference networks are then used to determine which of the two images yields a greater cross-modal matching probability with the given speech segment s as follows:

(1/N) Σ_{n=1}^{N} 1[ p(y = 1 | {S(s_n), F(f_{n,j})}_{j=1}^{2}) > p(y = 2 | {S(s_n), F(f_{n,j})}_{j=1}^{2}) ],    (7)

where 1[·] denotes the indicator function, n denotes the index of a test sample, f_{n,1} denotes the generated sample G(z, S(s_n)), f_{n,2} denotes the ground truth image paired with the speech s_n, and N denotes the total number of test samples. Note that z was randomly sampled from a Gaussian distribution for each n.
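A short sketch of the accuracy in Eq. 7: with K = 2 candidates, comparing the softmax probabilities reduces to comparing inner products, so the preference rate can be computed directly from the embeddings. The function name and the random placeholder embeddings are our own.

```python
import torch

def preference_rate(speech_emb, gen_face_emb, real_face_emb):
    """Fraction of test samples for which the generated image wins (Eq. 7).

    speech_emb:    (N, D) embeddings S(s_n).
    gen_face_emb:  (N, D) embeddings F(G(z, S(s_n))), i.e. f_{n,1}.
    real_face_emb: (N, D) embeddings F(f_{n,2}) of the paired ground-truth images.
    """
    # With K = 2 candidates, the softmax comparison reduces to comparing inner products.
    score_gen = (speech_emb * gen_face_emb).sum(dim=1)
    score_real = (speech_emb * real_face_emb).sum(dim=1)
    return (score_gen > score_real).float().mean().item()

# Toy usage with random embeddings as placeholders.
N, D = 100, 128
print(preference_rate(torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)))
```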

Surprisingly, we found that the inference networks tend to choose the generated samples over the ground truth face images with a chance of 76.65%, meaning that the generator is able to generate a plausible image given the speech condition c. On the other hand, we found that this chance significantly decreases to 47.2% when we do not use the relidGANs loss (using Eq. 4), showing that the proposed loss function helps generate images reflecting the identity of a speaker. Note, however, that this result does not necessarily say that the generator is capable of generating a more plausible image conditioned on one's voice than the paired face image, as this experiment is bounded by the performance of the inference network. In addition, we believe the distribution of real face images is much more diverse than that of the generated samples, causing the inference network to tend towards the generated samples more.

Our second experiment is similar to the first, but this time we compared the generated images from two different generators: one trained without the mismatched identity loss (Eq. 4) and the other with the proposed relidGANs loss. We found that the inference networks selected the generated sample from the generator trained with the proposed relidGANs loss with a chance of 79.68%, again showing that the proposed loss function helps to generate samples that reflect the identity information encoded in the speech condition.

Next, we conducted one additional experiment in the F-V setting. Given a generated image G(z, S(s1)) from a speech segment s1, we measured the accuracy of the inference network selecting s1 out of two audio segments s1 and s2 as follows:

(1/N) Σ_{n=1}^{N} 1[ p(y = 1 | {S(s_{n,j}), F(f_n)}_{j=1}^{2}) > p(y = 2 | {S(s_{n,j}), F(f_n)}_{j=1}^{2}) ],    (8)

where f_n denotes a generated face image from a speech segment s_{n,1}, and s_{n,2} denotes a negatively selected speech segment. We found that the inference networks select s1 with a chance of 95.14%. Note that the accuracy of the inference networks in the 2-way F-V setting is 88.47%, which means the generator can faithfully generate a face image according to the given speech segment s1.


QTA 3. Face image retrieval:

Lastly, we conducted a face retrieval experiment in which the goal was to accurately retrieve a real face image for the speaker using the generated image from their speech segment as a query. To compose a retrieval dataset, we randomly sampled 100 speakers from the test set of the AVSpeech. For each speaker, we extracted 50 face images from a video clip, resulting in 5,000 images for the retrieval experiment in total.

Table 3: Face retrieval performance.

Model                       Metric   K = 1   K = 2   K = 5   K = 10
Speech2Face                 L1        8.34   13.70   24.66   36.22
                            L2        8.28   13.66   24.66   35.84
                            CD       10.92   17.00   30.60   45.82
Ours (w/o mil)              L1        7.32   12.81   24.41   38.82
                            L2        7.21   12.83   24.34   39.24
                            CD        7.36   13.04   24.78   39.59
Ours (w/ mil (relidGANs))   L1       12.97   20.98   36.56   52.66
                            L2       12.90   21.50   36.84   52.49
                            CD       13.59   21.69   36.94   53.83

Note that this process of composing the retrieval dataset is the same as that of Speech2Face (Oh et al., 2019). To retrieve the closest face image out of the 5,000 samples, the trained face encoder was used to measure the feature distance between the generated image and each of the 5,000 images. We computed three metrics as feature distances (L1, L2, and cosine distance (CD)), and the results are reported in Table 3.
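A hedged sketch of this retrieval step: each generated query embedding is compared against the 5,000 gallery embeddings using cosine distance, and a query counts as a top-K hit if any of the K nearest real images shares its speaker identity. The exact success criterion and the placeholder embeddings and ids are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(query_embs, gallery_embs, query_ids, gallery_ids, k=5):
    """Top-K retrieval: a query counts as correct if any of its K nearest gallery
    images (by cosine distance) shares the query's speaker identity.

    query_embs:   (Q, D) face-encoder embeddings of the generated query images.
    gallery_embs: (G, D) face-encoder embeddings of the real gallery images.
    query_ids:    (Q,) speaker ids of the queries; gallery_ids: (G,) ids of gallery images.
    """
    q = F.normalize(query_embs, dim=1)
    g = F.normalize(gallery_embs, dim=1)
    cosine_dist = 1.0 - q @ g.t()                                  # (Q, G) cosine distances
    topk_idx = cosine_dist.topk(k, dim=1, largest=False).indices   # K smallest distances
    hits = (gallery_ids[topk_idx] == query_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy usage: 100 queries, 5,000 gallery images, random embeddings and ids as placeholders.
Q, G, D = 100, 5000, 128
acc = topk_retrieval_accuracy(torch.randn(Q, D), torch.randn(G, D),
                              torch.randint(0, 100, (Q,)), torch.randint(0, 100, (G,)), k=5)
print(acc)
```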

Table 3 shows that our model achieves higher performance than Speech2Face on all distance metrics. We also measured the performance of the generator trained without the mismatched identity loss (mil) (Eq. 4). The results show that the proposed relidGANs loss function is crucial for generating face images that reflect the identity of the speakers. Examples of the top-5 retrieval results are shown in Fig. 6.

Figure 6: Top-5 retrieval examples. Each row shows the real face, the generated face used as the query, and the top-5 nearest face images.

5 CONCLUSION AND FUTURE WORK

In this work, we proposed a cross-modal inference and generation framework that can be trained in a fully self-supervised way. We trained cGANs by transferring the trained networks from the inference stage so that the speech could be successfully encoded as a pseudo conditional embedding. We also proposed the relidGANs loss to train the discriminator to penalize negatively paired face and speech, so that the generator could produce face images with more distinguishable identities between different speakers. As future work, we would like to address the data bias problem (e.g., ethnicity, gender, age, etc.) that exists in many datasets. This is a significant problem as many publicly available datasets have biased demographic statistics, consequently affecting the results of many algorithms (Buolamwini & Gebru, 2018). We believe that this can be addressed with the use of a better data sampling strategy in an unsupervised manner, such as that of Amini et al. (2019). In addition, we would like to expand the proposed methods to various multi-modal datasets by generalizing the proposed concept to other modalities.

ACKNOWLEDGMENTS

This work was supported partly by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT; Grant No. NRF-2017M3C4A7078548), and partly by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01367, Infant-Mimic Neurocognitive Developmental Machine Learning from Interaction Experience with Real World (BabyMind)).

REFERENCES

Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018a.

Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. Deep lip reading: a comparison of models and an online application. Proc. Interspeech, 2018b.

Alexander Amini, Ava Soleimany, Wilko Schwarting, Sangeeta Bhatia, and Daniela Rus. Uncovering and mitigating algorithmic bias through learned latent structure. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2019.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.

Vicki Bruce and Andy Young. Understanding face recognition. British Journal of Psychology, 77(3):305–327, 1986.

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, 2018.

Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.

Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Freeman. Synthesizing normalized faces from facial identity features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, and Xavier Giro-i Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Trans. Graph., 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Bashar Awwad Shiekh Hasan, Mitchell Valdes-Sosa, Joachim Gross, and Pascal Belin. “Hearing faces and seeing voices”: Amodal coding of person identity in the human brain. Scientific Reports, 6:37494, 2016.

Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations (ICLR), 2019.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Davis E. King. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res., 10:1755–1758, December 2009. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1577069.1755843.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Paul Mermelstein. Determination of the vocal-tract shape from measured formant frequencies. The Journal of the Acoustical Society of America, 41(5):1283–1294, 1967.

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? Proc. International Conference on Machine Learning (ICML), 2018.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In International Conference on Learning Representations (ICLR), 2018.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018.

Arsha Nagrani, Samuel Albanie, and Andrew Zisserman. Learnable PINs: Cross-modal embeddings for person identity. In Proceedings of the European Conference on Computer Vision (ECCV), 2018a.

Arsha Nagrani, Samuel Albanie, and Andrew Zisserman. Seeing voices and hearing faces: Cross-modal biometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.

Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: a large-scale speaker identification dataset. In Proc. Interspeech, 2017.

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proc. International Conference on Machine Learning (ICML). JMLR.org, 2017.

Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T Freeman, Michael Rubinstein, and Wojciech Matusik. Speech2Face: Learning the face behind a voice. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In British Machine Vision Conference, 2015.

Santiago Pascual, Mirco Ravanelli, Joan Serra, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proc. Interspeech, 2019.

Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), 2018.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In Proc. International Conference on Machine Learning (ICML), 2016.

Suwon Shon, Tae-Hyun Oh, and James Glass. Noise-tolerant audio-visual online person verification using an attention-based neural network fusion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

HM Teager and SM Teager. Evidence for nonlinear sound production mechanisms in the vocal tract. In Speech Production and Speech Modelling, pp. 241–261. Springer, 1990.

Maria Tsantani, Nikolaus Kriegeskorte, Carolyn McGettigan, and Lucia Garrido. Faces and voices in the brain: a modality-general person-identity representation in superior temporal sulcus. NeuroImage, 201:116004, 2019.

Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, and Rita Singh. Disjoint mapping network for cross-modal matching of voices and faces. International Conference on Learning Representations (ICLR), 2019a.

Yandong Wen, Bhiksha Raj, Changil Kim, and Rita Singh. Face reconstruction from voice using generative adversarial networks. In Advances in Neural Information Processing Systems, 2019b.

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Proc. International Conference on Machine Learning (ICML), 2018.

APPENDIX

A DATA PREPROCESSING

Image processing: All video clips downloaded from the AVSpeech dataset were resampled to 25 FPS. If more than one face is detected in a frame, the face closest to the coordinates of the target speaker in the first frame was selected. Note that the coordinates of the target speakers are provided by the AVSpeech dataset. Because the size of the speaker's face on the screen varies from video to video, the image was resized to maintain an interocular distance of 55 pixels, as described in Cole et al. (2017). We used publicly available software, Dlib (King, 2009), to detect and crop the face images. The images were cropped to a size of 224 × 224 and resized to 128 × 128 for training. We additionally applied horizontal flipping as a data augmentation strategy. These procedures of cropping, resizing, and flipping were applied to all images in VGGFace and VoxCeleb as well.

Audio processing: All audio samples were resampled to 16 kHz, and stereo audio samples were converted to mono. We used 6 seconds of audio in both the inference and generation models. If the audio was longer than 6 seconds, it was randomly truncated. If the audio was shorter than 6 seconds, the entire audio was duplicated until it became longer than 6 seconds, after which the duplicated audio sample was randomly truncated to 6 seconds. We applied root-mean-square normalization to make the overall amplitude of the speech signals consistent. The reference level was set to 0.01.
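A sketch of the length fitting and RMS normalization described above, assuming the waveform has already been resampled to 16 kHz mono; the function names and the random toy waveform are our own.

```python
import numpy as np

SAMPLE_RATE = 16000
TARGET_LEN = 6 * SAMPLE_RATE      # 6 seconds at 16 kHz
REF_RMS = 0.01                    # reference root-mean-square level stated in the paper

rng = np.random.default_rng(0)

def fit_length(audio, target_len=TARGET_LEN):
    """Duplicate short clips until they reach the target length, then randomly truncate."""
    while len(audio) < target_len:
        audio = np.concatenate([audio, audio])
    start = rng.integers(0, len(audio) - target_len + 1)
    return audio[start:start + target_len]

def rms_normalize(audio, ref_rms=REF_RMS, eps=1e-8):
    """Scale the waveform so that its RMS matches the reference level."""
    rms = np.sqrt(np.mean(audio ** 2))
    return audio * (ref_rms / (rms + eps))

# Toy usage on 3 seconds of random "speech" standing in for a real decoded clip.
wav = rng.normal(scale=0.1, size=3 * SAMPLE_RATE)
wav = rms_normalize(fit_length(wav))
print(wav.shape, float(np.sqrt(np.mean(wav ** 2))))
```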

B IMPLEMENTATION DETAILS

We used stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 5e−4 to optimize the inference networks (speech encoder and face encoder). The learning rate was initialized to 0.001 and decayed by a factor of 10 if the validation loss did not decrease for 1 epoch. Training was stopped once the learning rate decay had occurred three times. The minibatch size was fixed to 32 and 12 for the V-F and F-V training configurations, respectively. For both the training and test phases, negative samples were randomly selected at every step, while pre-defined negative samples for each positive sample were used in the validation phase to ensure stable learning rate decay scheduling.
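A hedged sketch of this schedule using PyTorch's SGD and ReduceLROnPlateau as a stand-in for the authors' scheduling code; the placeholder model and the fake validation loss only illustrate the control flow (decay by 10 on a stalled validation loss, stop after three decays).

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)   # placeholder for the speech and face encoders
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=5e-4)
# Decay the lr by a factor of 10 as soon as the validation loss fails to improve for an epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=0)
num_decays, max_decays = 0, 3
for epoch in range(100):
    # train_one_epoch(model, optimizer)      # placeholder for the actual training epoch
    val_loss = 1.0 + 0.01 * epoch            # placeholder validation loss (stops improving)
    prev_lr = optimizer.param_groups[0]['lr']
    scheduler.step(val_loss)
    if optimizer.param_groups[0]['lr'] < prev_lr:
        num_decays += 1
        if num_decays >= max_decays:
            break                            # stop after three learning-rate decays
print(num_decays, optimizer.param_groups[0]['lr'])
```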


To train the cGANs, we used the Adam optimizer with β1 and β2 of 0.9 and 0.9, respectively. The learning rates of the generator and discriminator were fixed to 0.0001 and 0.00005 during training, respectively. Each time the discriminator was updated twice, the generator was updated once. The batch size was 24, and we trained the model for about 500,000 iterations. Note that we adopted the truncation trick of Brock et al. (2019) in subsection 4.2. In all experiments except QTA 1 and QTA 2, where the truncation trick was not adopted, the truncation threshold was set to 1.0.
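Putting these settings together, a minimal training-loop skeleton with the stated optimizers and the 2:1 discriminator-to-generator update ratio, plus a simple resampling implementation of the truncation trick; the scalar placeholder losses stand in for L_D and L_G of Eq. 6, and the whole block is our sketch rather than the released training code.

```python
import torch

# Placeholder generator/discriminator parameters for illustration.
g_params = [torch.nn.Parameter(torch.randn(10))]
d_params = [torch.nn.Parameter(torch.randn(10))]

opt_g = torch.optim.Adam(g_params, lr=1e-4, betas=(0.9, 0.9))
opt_d = torch.optim.Adam(d_params, lr=5e-5, betas=(0.9, 0.9))

def sample_z(batch_size, dim=128, truncation=None):
    """Sample z ~ N(0, I); optionally resample entries outside the truncation threshold
    (truncation trick of Brock et al., 2019)."""
    z = torch.randn(batch_size, dim)
    if truncation is not None:
        out_of_range = z.abs() > truncation
        while out_of_range.any():
            z[out_of_range] = torch.randn(int(out_of_range.sum()))
            out_of_range = z.abs() > truncation
    return z

for step in range(2):                      # training-loop skeleton
    for _ in range(2):                     # the discriminator is updated twice ...
        opt_d.zero_grad()
        d_loss = (d_params[0] ** 2).mean() # placeholder for L_D of Eq. 6 (+ R1 term)
        d_loss.backward()
        opt_d.step()
    opt_g.zero_grad()                      # ... for every single generator update
    g_loss = (g_params[0] ** 2).mean()     # placeholder for L_G of Eq. 6
    g_loss.backward()
    opt_g.step()

print(bool(sample_z(4, truncation=1.0).abs().max() <= 1.0))
```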

C NETWORK STRUCTURES

The inference network consists of a speech encoder and a face encoder. The network structure of the speech encoder is based on the problem-agnostic speech encoder (PASE) (Pascual et al., 2019), followed by an additional time pooling layer and a fully-connected layer (FC). PASE consists of SincNet and 7 stacked convolutional blocks (ConvBlocks) composed of 1d-CNN, batch normalization, and multi-parametric rectified linear unit (PReLU) activation. Following PASE, average pooling along the time dimension is applied to make the size of the embedding time-invariant. The FC layer is used as the last layer of the speech encoder. The details of the speech encoder are depicted in Fig. 7. For the face encoder, we used a 2d-CNN based residual block (ResBlock) in each layer. Note that the network structure is similar to the discriminator network of Miyato & Koyama (2018). The details of the architecture are shown in Fig. 8.

For the generator, we followed the generator network structure of Miyato & Koyama (2018) with some modifications, such as concatenating z and c as the input and adopting adaptive instance normalization (AdaIN) as the direct conditioning method. The details of the discriminator are almost the same as those of the face encoder, except that it includes an additional FC layer, a projection layer, and a sigmoid activation function. The details of the discriminator and generator are shown in Fig. 8 and Fig. 9, respectively.

Figure 7: The structure of the speech encoder. O, K, S indicate the number of output channels, the kernel size, and the stride, respectively. (a) Speech encoder: s → SincNet, BatchNorm1d, PReLU → ConvBlock (64, 20, 10) → ConvBlock (128, 11, 2) → ConvBlock (128, 11, 1) → ConvBlock (256, 11, 2) → ConvBlock (256, 11, 1) → ConvBlock (512, 11, 2) → ConvBlock (512, 11, 2) → Conv1d (100, 1, 1) → time pooling, ReLU → FC (128). (b) ConvBlock for the speech encoder: Conv1d (O, K, S) → BatchNorm1d → PReLU.


Figure 8: The structure of the discriminator network. The blue colored blocks indicate the network structure of the face encoder F, which is transferred to the discriminator network at the generation stage. The numbers on each block denote the output channels. GSP denotes global sum pooling along the spatial dimension. (a) Discriminator: f → InitResBlock (Down, 64) → ResBlock (Down, 128) → ResBlock (Down, 256) → ResBlock (Down, 512) → ResBlock (Down, 1024) → ResBlock (1024) → ReLU, Conv 3×3 (128) → ReLU, GSP → FC combined with the projection of the condition c → sigmoid. (b) ResBlock for the discriminator: ReLU, Conv 3×3, ReLU, Conv 3×3, Downsample on the residual branch, with a Conv 1×1 + Downsample shortcut. (c) InitResBlock: Conv 3×3, ReLU, Conv 3×3, Downsample on the residual branch, with a Downsample + Conv 1×1 shortcut.

Figure 9: The structure of the generator network. The numbers on each block denote the output channels. (a) Generator: z and c are concatenated → FC (1024×4×4) → ResBlock (Up, 1024) → ResBlock (Up, 512) → ResBlock (Up, 512) → ResBlock (Up, 256) → ResBlock (Up, 128) → ResBlock (64) → ReLU, Conv 3×3 (3) → Tanh. (b) ResBlock for the generator: AdaIN, Leaky ReLU, Upsample, Conv 3×3, AdaIN, Leaky ReLU, Conv 3×3 on the residual branch, with an Upsample + Conv 1×1 shortcut; the AdaIN parameters (γ1, β1) and (γ2, β2) are produced from the condition c by FC layers.


D SUPPLEMENTARY GENERATED IMAGES

Here we provide some additional uncurated generated face images. Fig. 10 and Fig. 11 show generated images with truncation thresholds 0.5 and 1.0, respectively⁴.

Figure 10: The uncurated generated face images of 19 speakers from the test set of the AVSpeech. The very left column is composed of the ground truth face images of the speakers.

⁴Audio samples are available in the following link: https://drive.google.com/open?id=1gWZqh6VfqN33uaKDlmRf8IxmF33FoxAE


Figure 11: The uncurated generated face images of 19 speakers from the test set of the AVSpeech. The very left column is composed of the ground truth face images of the speakers.


E THE EFFECTIVENESS OF TWO-STAGE TRAINING

In this subsection, we would like to address concerns regarding the validity of the two-stage training method as proposed in Fig. 3. Here the effect of the two-stage training method is demonstrated by showing the failure case of training when we skip the inference training stage. Fig. 12 shows that the generator ignores the speech condition vector in the generation process and, consequently, the generated images are only dependent on the random vector z. This shows that the representation learned using the proposed self-supervised method is effective and can be used as an important cue for conditional generation.

Figure 12: The generated face images when skipping the inference training stage. The very left column is composed of the ground truth face images of the speakers.
