


Video Face Swap Based on Autoencoder Generation Network

Shuqi Yan¹, Shaorong He², Xue Lei¹, Guanhua Ye¹, Zhifeng Xie¹,²

¹Department of Film and Television Engineering, Shanghai University; ²Shanghai Engineering Research Center of Motion Picture Special Effects

Shanghai, China [email protected]

Abstract— Video face swap has strong entertainment applications and is also valuable in film post-production. At present, popular face swaps are done manually with Photoshop, while automatic face-swapping techniques produce poor composite results. To address these shortcomings, this paper proposes a video face swap method based on an autoencoder generation network. The network learns the mapping between distorted faces and original faces: the encoder distinguishes and extracts facial information, and each decoder restores its own face. First, the local information of the two faces is fed to the network to obtain an initial model; then, the global information is fed to the network for fine-tuning; finally, the face exchange between A and B is completed with face alignment and alpha fusion. The experimental results show that the quality of the method is significantly improved.

Keywords—autoencoder generation network; migration learning; face alignment; image fusion;

I. INTRODUCTION

Face exchange has a wide range of applications in video synthesis and related fields, and it is a relatively new research direction. Related technologies fall into two types: graphics-based face swap and image-based face swap. The former edits and processes faces by fitting a 3D face model, such as 3DMM, to the 2D image, which is more accurate and natural; however, it requires specialized equipment to capture faces and collect large amounts of data. The image-based method can achieve a simple face change, but the expression is not consistent and the result has an obvious sense of synthesis. With the emergence of generative network models in recent years, many researchers have begun to treat facial expression as a kind of "style" and use GAN-based transfer networks for face exchange. However, the instability of the GAN generator limits face exchange to single images. Several problems remain to be solved in video-based face exchange, such as the stability of the generated facial expression and the fusion between the generated face and the original frame.

To solve the above problems, this paper mainly studies two aspects: 1) how to make the generated images stable, so that facial expressions do not differ greatly between frames and the continuity of the video is preserved after the face swap; 2) how to blend the generated face and the original frame seamlessly, so that the video does not jitter or flicker. Even when the target image is constrained, the facial details generated by a GAN are random and of low quality. As a classic generative model, the autoencoder is relatively stable after training, so we use an autoencoder as the generator network. For the fusion problem, pure alpha fusion produces a prominent edge around the blended region, while the positioning problem of Poisson editing makes the fused region tremble in video. Alpha fusion based on face alignment effectively overcomes the disadvantages of both and makes the composite image naturally coherent.

In summary, this paper proposes a video face swap method based on an autoencoder generation network. First, the method feeds the local facial features $f_{A1}, f_{B1}$ of faces A and B into two autoencoders whose encoder weights are shared and whose decoders are separate. In this way, the distorted inputs $\tilde{f}_{A1}, \tilde{f}_{B1}$ can be restored to the original information, yielding an initialized network model $M$. We then modify the preprocessing parameters so that the complete faces $f_{A2}, f_{B2}$ are sent to the network for fine-tuning. After training, feeding face A/B into the network produces face B/A with the same expression and pose. Finally, face alignment and alpha blending are used to exchange the faces. Moreover, migration (transfer) learning can be used to train any face C against B on top of the model, producing a new model $M_{C \to B}$ and reducing training time.

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61303093 and 61402278, the Innovation Program of the Science and Technology Commission of Shanghai Municipality of China under Grant No. 16511101300, and the Gaofeng Film Discipline Grant of the Shanghai Municipal Education Commission of China.

II. RELATED WORK

As early as 2008, Bitouk et al. [1] automatically substituted an input face with another face selected from a large image database according to similarity of appearance and pose, and further carried out color and lighting adjustments. However, this method is essentially retrieval-based, which is inefficient and cannot control the characteristics of the output face. In 2011, K. Dale et al. [2] proposed face swapping in video, using a three-dimensional multilinear model to track the facial expressions in the two videos; however, processing video requires handling issues such as temporal alignment and face tracking, and still demands a great deal of time and human interaction. In 2015, S. Suwajanakorn et al. [3] used a photo database to perform 3D face reconstruction, tracking, alignment, and texture modeling in order to control another person's facial expression. In July 2015, Matthew Earl [4] used OpenCV and dlib to perform a simple face swap, but the expression could not be kept consistent and there was a clear sense of synthesis. In 2017, Yuval Nirkin et al. [5] proposed a face-swapping algorithm that uses deep learning, but only in the image segmentation step; the face replacement itself is still 3D-based and the algorithm is inefficient. Also in 2017, Iryna Korshunova et al. [6] proposed using a CNN to learn the mapping between human faces to realize face swapping, which truly moved the subject into two dimensions.

In addition to face exchange itself, face recognition and image fusion are also subjects of this study. In recent years, with the in-depth study of deep learning, the accuracy of face recognition has reached 99.77%, and face keypoint localization has progressed from the traditional ASM algorithm [7] to the more recent MTCNN method [8]. Image fusion is likewise an active research direction in image processing, including the alpha fusion algorithm, the Laplacian pyramid image fusion algorithm [9], and the Poisson fusion algorithm [10]. This paper adopts face alignment and the alpha fusion algorithm [4] to achieve seamless face fusion.

III. VIDEO FACE SWAP BASED ON AUTOENCODER GENERATION NETWORK

A. The overall process

First, the method identifies the face region in the video and extracts the local facial information, that is, the facial features, which are distorted and fed into the network to obtain an initial network model. The complete face region is then re-extracted, distorted, and used to fine-tune (migrate) the network model. Finally, face alignment and image fusion are used to complete the face change. The specific process is divided into four parts, as shown in Fig. 1.

Image preprocessing: decode the video, use face recognition to extract the facial regions, and use image processing techniques to distort them into input images.

Autoencoder generation network training: the network learns the mapping between the distorted partial face information and the original information, so that the encoder can recognize and extract the facial information and each decoder can restore its own face. After training, an initial network model is obtained for subsequent fine-tuning.

Fine-tuning: Adjust the pre-processing parameters, input the complete face information into the initialization network, fine-tune the training, and improve the quality of the generated image.

Image post-processing: The face-changing process is completed by using face alignment and image fusion methods.

Fig. 1. The overall process of the method.

B. Image Preprocessing

Our dataset consists of two video segments containing faces A and B. To generate matching facial expressions and poses, the method first decodes each video into an image sequence and detects the human face, then performs facial feature point localization and affine transformation, and finally extracts the facial information, distorts it, and feeds it to the network. We use the FFmpeg decoder to decode the video and Dlib to implement face detection. On the basis of face detection, feature points such as the eyes and nose are located. This paper uses the trained CNN model in the "Face_Recognition_Model" module to extract 68 key feature points; this model essentially extends the ResNet-34 deep residual network [11] for image recognition. Then, using the reference facial feature points, the faces are uniformly aligned, that is, the landmark points in the two images are aligned by an affine transformation. As shown in (1), the least squares method is used to minimize the distance between the transformed feature points and the reference points, where p and q are the landmark matrices of the original image and the reference image, respectively. Finally, the face region is extracted and the image is distorted by remapping, so that the distorted facial image and the corresponding source image can be sent to the network. Fig. 2 shows the image preprocessing process.

$\arg\min_{s,R,T} \sum_{i=1}^{68} \left\| sRp_i^{T} + T - q_i^{T} \right\|^{2}$  (1)


Fig. 2. (a) Face Detection Image; (b) Reference Face Image; (c) Face Alignment Image; (d), (f) Extract Face Area; (e) Distorted Face.
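To make the alignment step concrete, the following is a minimal sketch of how (1) can be solved in closed form with an orthogonal Procrustes analysis and applied with OpenCV, in the spirit of [4]. It assumes dlib's 68-point landmark predictor; the model file path and helper names are illustrative, not from the paper.

```python
import cv2
import dlib
import numpy as np

# dlib's 68-point predictor must be downloaded separately; the path is illustrative.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def get_landmarks(img):
    """Return a 68x2 array of landmark coordinates for the first detected face."""
    rect = detector(img, 1)[0]
    shape = predictor(img, rect)
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float64)

def procrustes_transform(p, q):
    """Solve (1): find scale s, rotation R and translation T minimizing sum ||s*R*p_i + T - q_i||^2."""
    mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
    p0, q0 = p - mu_p, q - mu_q
    sp, sq = p0.std(), q0.std()
    U, _, Vt = np.linalg.svd((p0 / sp).T @ (q0 / sq))
    R = (U @ Vt).T                                # optimal rotation
    s = sq / sp                                   # optimal scale
    T = mu_q - s * (R @ mu_p)                     # optimal translation
    return np.hstack([s * R, T.reshape(2, 1)])    # 2x3 affine matrix for cv2.warpAffine

def align_face(src_img, src_pts, ref_pts, out_h, out_w):
    """Warp src_img so that its landmarks line up with the reference landmarks."""
    M = procrustes_transform(src_pts, ref_pts)
    return cv2.warpAffine(src_img, M, (out_w, out_h), flags=cv2.INTER_LINEAR)
```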

On this basis, the data can be reused for subsequent network fine-tuning by adjusting the parameters to change the size of the extraction area. At the same time, we also used a point cloud matching algorithm [12] and affine transforms to augment the training data.

C. Autoencoder Generation Network

Due to the instability of GAN image generation, GANs cannot be directly applied to video face exchange. The autoencoder is a deep learning algorithm that was originally used for data dimensionality reduction and is gradually being used for image generation. It uses the convolutional downsampling layers of a neural network as the encoder [13], which converts the image into a latent vector z, and the deconvolution/upsampling layers as the decoder, which decodes the feature vector z back into an image, while ensuring that the output is consistent with the input. In this way, the hidden layer of the network retains the characteristics of the input data as much as possible. In this paper, an autoencoder generation network [14] is used to implement video face swap. The generated images are stable and structurally consistent, making the video sequence naturally coherent.

1) Network Architecture

In this study, we use one encoder and two decoders [15]: one decoder reconstructs character A and the other reconstructs character B. The specific network architecture is shown in Fig. 3.

Fig. 3. Autoencoder generation network framework in this paper.

First, a distorted image $A_{map}$ of picture A is input to the encoder, and decoder 1 is used to restore the picture. This requires the autoencoder to learn to create a feature code z and to restore the image $A'$, as shown in (2). Then, the distorted face $B_{map}$ of picture B is sent to the encoder, and decoder 2 is used to restore and generate $B'$, as shown in (3). After training, the two decoders can respectively restore the faces of the two people, and the encoder learns to distinguish the faces by their facial features.

$A' = \mathrm{Decoder}_1(\mathrm{Encoder}(A_{map}))$  (2)

$B' = \mathrm{Decoder}_2(\mathrm{Encoder}(B_{map}))$  (3)

2) Network Model

Autoencoder generation networks are not complicated; the overall network model is shown in Fig. 4 below. The encoder consists of convolutional layers followed by a fully connected layer: the convolutional layers map the input information to the hidden feature space, while the fully connected layer breaks up the spatial dependence of the image and improves network performance. The decoder mainly consists of upscale layers. A typical neural network first uses bilinear interpolation to increase the resolution and then performs a convolution, whereas the upscale layer in this paper convolves and upsamples at the same time. The upscale layer contains a convolution followed by a PixelShuffler operation, an idea that comes mainly from [16], which points out that such an upscale network has stronger learning ability than a network of the same complexity and feature map size that upsamples first and then convolves. The operation reduces the number of channels to a quarter of the original while doubling the height and width.
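The following is a minimal sketch of the encoder and decoder building blocks described above, assuming Keras with a TensorFlow backend; the filter counts and latent size are illustrative, and the PixelShuffler is approximated with tf.nn.depth_to_space.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

IMAGE_SHAPE = (64, 64, 3)   # training resolution used in this paper
LATENT_DIM = 1024           # illustrative size of the latent code z

def conv_block(x, filters):
    """Encoder downsampling block: 5x5 convolution with stride 2 and LeakyReLU(0.1)."""
    x = layers.Conv2D(filters, kernel_size=5, strides=2, padding="same")(x)
    return layers.LeakyReLU(0.1)(x)

def upscale_block(x, filters):
    """Upscale block: convolve to 4x the channels, then PixelShuffler (depth_to_space),
    which trades those channels for twice the height and width."""
    x = layers.Conv2D(filters * 4, kernel_size=3, padding="same")(x)
    x = layers.LeakyReLU(0.1)(x)
    return layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)

def build_encoder():
    inp = layers.Input(shape=IMAGE_SHAPE)
    x = conv_block(inp, 128)
    x = conv_block(x, 256)
    x = conv_block(x, 512)
    x = conv_block(x, 1024)                  # 64x64 -> 4x4 spatially
    x = layers.Flatten()(x)
    x = layers.Dense(LATENT_DIM)(x)          # fully connected layer breaks spatial dependence
    x = layers.Dense(4 * 4 * 1024)(x)
    x = layers.Reshape((4, 4, 1024))(x)
    x = upscale_block(x, 512)                # -> 8x8x512
    return Model(inp, x, name="encoder")

def build_decoder():
    inp = layers.Input(shape=(8, 8, 512))
    x = upscale_block(inp, 256)
    x = upscale_block(x, 128)
    x = upscale_block(x, 64)                 # -> 64x64x64
    out = layers.Conv2D(3, kernel_size=5, padding="same", activation="sigmoid")(x)
    return Model(inp, out, name="decoder")
```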

3) Network Training and Migration Learning

We feed the preprocessed images into the network. Due to limited memory, the image size is set to 64*64, which may cause the generated images to be somewhat unclear. The convolution kernel size is 5 with a stride of 2, and the activation function is LeakyReLU with a slope of 0.1, which avoids the problem of "dead" ReLU units. The Adam optimizer is used for training with a learning rate of lr = 5e-5. The loss is the mean absolute error, as shown in (4) and (5) below, where $A_g$ is a real image and $A'$ is a network-generated image.

$loss_1 = \frac{1}{m}\sum_{i=1}^{m} \left| A_{g_i} - A'_{i} \right|$  (4)

$loss_2 = \frac{1}{m}\sum_{i=1}^{m} \left| B_{g_i} - B'_{i} \right|$  (5)
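Continuing the sketch above, the training setup might look roughly as follows: one shared encoder, two decoders, mean absolute error as in (4) and (5), and Adam with lr = 5e-5. The data arrays and epoch count are placeholders standing in for the distorted/original crop pairs produced by the preprocessing step.

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

# build_encoder(), build_decoder() and IMAGE_SHAPE come from the previous sketch.
encoder = build_encoder()
decoder_A = build_decoder()
decoder_B = build_decoder()

x = layers.Input(shape=IMAGE_SHAPE)
autoencoder_A = Model(x, decoder_A(encoder(x)))   # eq. (2): A' = Decoder1(Encoder(A_map))
autoencoder_B = Model(x, decoder_B(encoder(x)))   # eq. (3): B' = Decoder2(Encoder(B_map))

autoencoder_A.compile(optimizer=Adam(learning_rate=5e-5), loss="mean_absolute_error")  # eq. (4)
autoencoder_B.compile(optimizer=Adam(learning_rate=5e-5), loss="mean_absolute_error")  # eq. (5)

# Placeholder data: warped_* are the distorted crops, target_* the corresponding originals.
warped_A = target_A = np.zeros((16,) + IMAGE_SHAPE, dtype="float32")
warped_B = target_B = np.zeros((16,) + IMAGE_SHAPE, dtype="float32")

EPOCHS = 10   # illustrative; the paper reports roughly 30 hours of training
for epoch in range(EPOCHS):
    loss_A = autoencoder_A.train_on_batch(warped_A, target_A)
    loss_B = autoencoder_B.train_on_batch(warped_B, target_B)
    print(f"epoch {epoch}: loss_A={loss_A:.4f}, loss_B={loss_B:.4f}")

# Swapping: a face of A routed through decoder_B yields B with A's expression and pose.
fake_B = decoder_B.predict(encoder.predict(warped_A))
```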

After the first training pass is completed, we obtain the initialization model $M$. We then adjust some parameters in the image preprocessing to change the area of the input image, send the data to the network again for fine-tuning, and finally obtain the face A to face B conversion model $M_{A \to B}$. The results are shown in Fig. 5; the sharpness is improved and the details of the image are also enhanced.

Fig. 5. (a) The result of training directly using the entire face area; (b) the result of network fine-tuning.


Fig. 4. The overall network model of the method.

Using the migration (transfer) learning technique, the face C to face B transformation model $M_{C \to B}$ can be obtained by continuing training on other data sets of any face C and face B on the basis of the initialization model $M$. The video face swap training process is shown in Fig. 6: the left figure shows the results of the initial model training, the middle figure shows the result of fine-tuning $M_{A \to B}$, and the right figure shows the result of the migrated model $M_{C \to B}$. Each group of three columns shows, from left to right, the encoder input picture, the decoder 1 output picture, and the decoder 2 output picture. The figure shows that the trained codec can easily distinguish the input data and reconstruct the corresponding picture.

Fig. 6. Initialize network training (left); network fine-tuning (middle); migration learning training (right).
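As a rough illustration of the migration-learning step, the weights of the initialization model M can be saved and reused when training on data for faces C and B; this continues the training sketch above, and the file names and placeholder arrays are illustrative.

```python
# Persist the weights of the initialization model M after the first training stage.
encoder.save_weights("encoder_M.h5")
decoder_B.save_weights("decoder_B_M.h5")

# To obtain M_{C->B}, restore those weights and continue training, now feeding
# preprocessed crops of face C through the A-side autoencoder instead of face A.
encoder.load_weights("encoder_M.h5")
decoder_B.load_weights("decoder_B_M.h5")

warped_C = target_C = np.zeros((16,) + IMAGE_SHAPE, dtype="float32")  # placeholder data
for epoch in range(EPOCHS):
    autoencoder_A.train_on_batch(warped_C, target_C)   # decoder_A now learns to reconstruct C
    autoencoder_B.train_on_batch(warped_B, target_B)   # decoder_B keeps refining B
```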

4) Image Fusion

As Fig. 5 shows, the autoencoder generation network only learns the features of the face area. To complete the face change for the entire video, the images must also be blended seamlessly, that is, the face images produced by the network must be integrated into the original frames. The traditional approach is simple channel fusion, but this leaves a strong edge. Poisson-based seamless fusion causes the face to jitter in the video because of its positioning, resulting in poor fusion. This paper adopts a fusion technique based on facial key points: first align the face and blend the skin color, then use the key points to compute the convex hull and create a mask, and finally use the alpha channel method for fusion. The face alignment method is the same as the one described above in the image preprocessing step. For skin color fusion, we use the RGB scaling color correction principle [4] to match the colors, as shown in (6).

$im_2' = im_2 \times \frac{blur(im_1, ksize)}{blur(im_2, ksize)}$  (6)

Here blur refers to a Gaussian filter whose kernel size ksize is a key parameter; in this experiment we chose 0.6 times the interpupillary distance as the Gaussian kernel size. Finally, the convex hull formed by facial key points 18-68 is used to create a mask and obtain the final fused image. Fig. 7 shows the images obtained by several fusion methods; the method proposed in this paper is comparatively good.
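A minimal sketch of the fusion step described above: the colour correction of (6) with a Gaussian kernel of 0.6 times the interpupillary distance, a convex-hull mask over landmark points 18-68, and alpha blending. Landmark indices follow dlib's 68-point convention (0-based here) and the helper names are illustrative.

```python
import cv2
import numpy as np

def correct_colours(frame, face, landmarks):
    """Eq. (6): scale the generated face's colours by blur(frame)/blur(face)."""
    eye1 = landmarks[36:42].mean(axis=0)         # centre of one eye (dlib points 37-42, 1-based)
    eye2 = landmarks[42:48].mean(axis=0)         # centre of the other eye (points 43-48)
    ksize = int(0.6 * np.linalg.norm(eye1 - eye2))
    ksize += (ksize % 2 == 0)                    # Gaussian kernel size must be odd
    blur_frame = cv2.GaussianBlur(frame, (ksize, ksize), 0).astype(np.float64)
    blur_face = cv2.GaussianBlur(face, (ksize, ksize), 0).astype(np.float64) + 1e-6
    corrected = face.astype(np.float64) * blur_frame / blur_face
    return np.clip(corrected, 0, 255).astype(np.uint8)

def face_mask(shape, landmarks):
    """Soft alpha mask from the convex hull of landmark points 18-68 (17:68 zero-based)."""
    mask = np.zeros(shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(landmarks[17:68].astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    mask = cv2.GaussianBlur(mask, (11, 11), 0)   # feather the edge slightly
    return (mask.astype(np.float64) / 255.0)[..., np.newaxis]

def alpha_blend(generated_face, frame, landmarks):
    """Paste the colour-corrected generated face into the frame with an alpha mask.
    generated_face is assumed to be already aligned to the frame and of the same size."""
    corrected = correct_colours(frame, generated_face, landmarks)
    mask = face_mask(frame.shape, landmarks)
    blended = mask * corrected + (1.0 - mask) * frame
    return blended.astype(np.uint8)
```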


Fig. 7. (a) Direct alpha channel fusion; (b) mask Gaussian filtered fusion; (c) Poisson


D. Experimental results

The experiments use the Keras framework under Ubuntu to train the network model, and the image processing part is written in Python. The computer is configured with two TITAN XP graphics cards and 62.8 GiB of memory. Training the model and the fine-tuned model takes about 30 hours, and processing a single image takes about 5~8 s.

Experimental results: the video face exchange results are shown in Fig. 8. (a) is the source video sequence; (b) is the face image generated by the network in this paper, which is consistent with the lighting and pose of the source face and retains the personal characteristics; (c) shows face exchange results using the deepfakes method [14], where the image fusion is poor, with a clear boundary and color difference; (d) shows an experiment that improves the image fusion by applying a Gaussian blur to the mask on the basis of deepfakes; the edge difference is reduced, but the image is blurred; (e) adopts seamless (Poisson) fusion, which looks better on a single image, but the faces in the video jitter because of the positioning of the seamless fusion; (f) shows the results of the method in this paper: on the basis of deepfakes, we change the input image area for fine-tuning and then use facial key points for fusion. The experimental results show that our method is more effective.

Subjective evaluation: because the method in this paper is a kind of image generation technique and the field lacks an objective evaluation standard, we use a subjective evaluation to assess the experiment. Fifty people were asked to rate our results. The observers were asked in particular to judge how well this paper addresses the two problems raised earlier, namely the stability of the video image and the image fusion effect, and the scores were then summarized and analyzed. The subjective evaluation results are shown in Fig. 9, and the scoring criteria are shown in Table I.

IV. CONCLUSION

This paper presents a video face swap method based on an autoencoder generation network. After the given data sets are preprocessed and fed into the network for training, the network learns the information of the two faces and gains the ability to generate one face from the other. The resulting images are then fused based on facial features to obtain the final face-swapped frames.


Fig. 8. (a) Source video sequence; (b) network generation; (c) deepfakes results; (d) results after mask Gaussian blur; (e) results after seamless fusion; (f) results of this paper.


TABLE I. SCORING CRITERIA

level     score
optimal   4~5
good      3~4
general   2~3
poor      1~2
bad       0~1


Fig. 9. The results of subjective evaluation. (a) The score of video coherence; (b) The score of image fusion.

This method removes the complicated manual interaction of using Photoshop to change the faces of video characters while keeping the expression and pose consistent. However, in this experiment two characters were used as the training data set, and the test video also had to contain those two characters. It is a limitation of this model that only the two trained people can have their faces swapped, which is a common limitation of deep-learning-based face-swapping techniques. How to generalize the model to the exchange of any two faces is a direction for future research.

REFERENCES

[1] D. Bitouk, N. Kumar, S. Dhillon, P. Belhumeur, and S.K. Nayar, 2008. "Face swapping: automatically replacing faces in photographs." ACM Transactions on Graphics (TOG), 27(3), p.39.

[2] K. Dale, K. Sunkavalli, M.K. Johnson, D. Vlasic, W. Matusik, and H. Pfister, 2011. “Video face replacement.” ACM Transactions on Graphics (TOG), 30(6), p.130.

[3] S. Suwajanakorn, S.M. Seitz, and I. Kemelmacher-Shlizerman, 2015. "What makes Tom Hanks look like Tom Hanks." In IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp. 3952-3960.

[4] Matthew Earl. “Switching Eds: Face swapping with Python Dlib and OpenCV”. [Online] Available: https://matthewearl.github.io/2015/07/28/switching-eds-with-python/. (July 28, 2015)

[5] Y. Nirkin, I. Masi, A.T. Tran, T. Hassner, and G. Medioni, 2017. “On Face Segmentation, Face Swapping, and Face Perception.” arXiv preprint arXiv:1704.06729.

[6] I. Korshunova, W. Shi, J. Dambre, and L. Theis, 2017. “Fast face-swap using convolutional neural networks.” In The IEEE International Conference on Computer Vision.

[7] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, 1995. "Active shape models-their training and application." Computer Vision and Image Understanding, 61(1), pp.38-59.

[8] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, 1995. “Active shape models-their training and application.” Computer vision and image understanding, 61(1), pp.38-59.

[9] M. Brown and D.G. Lowe, 2007. "Automatic panoramic image stitching using invariant features." International Journal of Computer Vision, 74(1), pp.59-73.

[10] M. Brown, and D.G. Lowe, 2007. “Automatic panoramic image stitching using invariant features.” International journal of computer vision, 74(1), pp.59-73.

[11] K. He, X. Zhang, S. Ren, and J. Sun, 2016. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.

[12] P.J. Besl and N.D. McKay, 1992. "A method for registration of 3-D shapes." IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2), pp.239-256.

[13] P.J. Besl, and N.D. McKay, 1992, April. “Method for registration of 3-D shapes.” In Sensor Fusion IV: Control Paradigms and Data Structures (Vol. 1611, pp. 586-607). International Society for Optics and Photonics.

[14] deepfakes. "faceswap." [Online] Available: https://github.com/deepfakes/faceswap

[15] Gaurav Oberoi. "Exploring DeepFakes". [Online] Available: https://www.kdnuggets.com/2018/03/exploring-deepfakes.html. (March 2018)

[16] W. Shi, J. Caballero, F. Huszár, J. Totz, A.P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, 2016. “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1874-1883).