In Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), Colorado Springs, CO, USA, June 20-25, 2011.

Photorealistic 3D Face Modeling on a Smartphone

Won Beom Lee [email protected]

Man Hee Lee [email protected]

In Kyu Park [email protected]

School of Information and Communication Engineering, Inha University, Incheon 402-751, Korea

Abstract

In this paper, we propose an efficient method for creating a photorealistic 3D face model on a smartphone. Major features of the human face, such as the eyes, nose, lips, cheeks, chin, and profile boundary, are extracted automatically from frontal and profile images, employing ACM (active contour model) and deformable ICP (iterative closest point) methods. A 3D face model is generated by deforming a generic model so that it corresponds correctly to the extracted facial features. A skin texture map is created from the input image and mapped onto the deformed 3D face model. All procedures are implemented and optimized efficiently on a smartphone with limited processing power and memory capacity. Experimental results show that photorealistic 3D face models are created successfully for a variety of test samples, taking about 6 seconds on an off-the-shelf smartphone.

1. Introduction

In the computer vision and graphics communities, numerous researchers have attempted to reconstruct photorealistic 3D models of human faces. 3D face modeling can be applied widely, in applications including computer games, animation, virtual plastic surgery, and virtual beauty shops. However, many challenges remain in achieving high-quality and high-speed 3D face modeling, especially on a mobile device such as a smartphone. Across the diverse platforms ranging from personal computers to mobile phones, the cost of 3D face acquisition increases in proportion to the quality and speed of photorealistic modeling. Therefore, a proper technique should be selected for a particular application, considering the characteristics of the hardware platform as well as the quality and speed requirements.

Recently, widely distributed smartphones have become computing workhorses for personal multimedia applications. A smartphone is equipped with a high-resolution camera, a flash light, and several sensors for estimating the global position and orientation of the device. In addition, CPU capability and display quality are constantly advancing. Therefore, the feasibility of running existing and new applications on a smartphone is rapidly increasing. However, it is impractical to port existing algorithms directly to a smartphone because of the different capabilities and resources of the hardware and operating systems. This paper addresses the problem of reconstructing a photorealistic 3D face model on a smartphone, which has not been handled intensively in previous works.

Existing methods for 3D face modeling are generally categorized into active and passive techniques. Methods in the first category acquire face shape directly using a laser range scanner [6] or use hybrid devices such as cameras with an additional projector or controlled illumination. Golovinskiy et al. acquired detailed facial geometry using a photometric stereo technique in a controlled environment with a spherical dome and dozens of flashes [7]. Bickel et al. presented a method to obtain detailed facial shape including wrinkles and achieved highly natural deformation during animation [3]. Je et al. acquired 3D facial shape using color-coded structured lighting [8], while Zhang et al. obtained accurate facial shape changes using spacetime stereo and optical flow [17]. Wang et al. achieved high-speed shape acquisition by developing a structured light system with phase shifting embedded in the R, G, and B channels of a customized DLP projector [16]. Note that although these methods can capture high-resolution and highly accurate facial geometry, they are computationally expensive and impractical to use on a mobile device like a smartphone.

On the other hand, passive methods use only images and videos, without any additional active device. Due to the simplicity of capturing the input data, this kind of technique is well suited to 3D face modeling on a smartphone. Blanz et al. generated a 3D face model by deforming a generic model selected from a database using a single input photo [4]. However, a single frontal image does not provide useful shape information about the face profile. Pighin et al. deformed a generic model based on facial features that are manually extracted from multiple images, and improved the photorealism by creating a high-quality texture map [13]. Akimoto, Park, and Ansari et al. automatically extracted facial features from frontal and profile images and then generated a customized 3D face model by deforming a generic model to fit these features [1][12][2]. They were able to obtain a photorealistic texture and accurate shape on the side of the face. However, these methods involve a large number of floating point operations, which makes them difficult to port to a mobile device.

In this paper, we propose an efficient algorithm for generating a photorealistic 3D face model on a mobile device, especially a smartphone. The proposed algorithm does not require any calibration of the smartphone camera. In addition, it is quite robust to the common environments in which smartphones are used. ACM (active contour model) and deformable ICP (iterative closest point) are customized and optimized for the smartphone's computing environment, and extract the facial feature contours, such as the eyes, nose, lips, cheeks, chin, and profile boundary, automatically. The 3D generic mesh model is deformed to fit the extracted facial features using the RBF (radial basis function) interpolation method [5]. A set of optimization techniques is carefully deployed when the algorithm is implemented on the smartphone. Figure 1 shows the overall block diagram of the proposed system.

Figure 1. The overview of the proposed system.

2. Facial Feature Extraction

Facial region detection, feature point detection, and feature contour fitting are the three main steps of extracting the facial features. The feature points are used in model deformation and texture map generation by setting up proper correspondences to the vertices of the 3D mesh. In order to extract facial features robustly, the facial region is first detected by employing the Viola and Jones face detector [15]. The shapes of the initial contours are created according to the generic 3D mesh model, so that 2D features can be matched with 3D vertices. In this procedure, the positions of the initial feature points are estimated using relative and average locations on human faces. The detected feature points are interpolated by NURBS (non-uniform rational B-splines) to form initial contours. The initial contours are forced to converge to the correct positions using the ACM method [9]. Note that the ACM method repeatedly searches for the optimal position by minimizing an energy function defined as the sum of internal and external energy terms, as sketched below.
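To make the convergence step concrete, below is a minimal sketch of one greedy iteration of the snake in [9]: every contour point moves to the lowest-energy cell in its 3x3 neighborhood, where the internal energy combines continuity and curvature terms and the external energy rewards strong edges. The weights alpha and beta and all identifiers are illustrative assumptions, not the paper's actual code.

```cpp
#include <vector>

struct Pt { float x, y; };

// External energy: negative edge-gradient magnitude, so strong edges have
// low energy. Out-of-bounds cells are heavily penalized.
static float externalEnergy(const std::vector<std::vector<float>>& grad, int x, int y) {
    if (y < 0 || y >= (int)grad.size() || x < 0 || x >= (int)grad[0].size()) return 1e9f;
    return -grad[y][x];
}

// Internal energy: continuity (distance to the previous point) plus a
// curvature term from the discrete second derivative.
static float internalEnergy(const Pt& prev, const Pt& cur, const Pt& next,
                            float alpha, float beta) {
    float dx = cur.x - prev.x, dy = cur.y - prev.y;
    float cx = prev.x - 2.f * cur.x + next.x, cy = prev.y - 2.f * cur.y + next.y;
    return alpha * (dx * dx + dy * dy) + beta * (cx * cx + cy * cy);
}

// One greedy pass over a closed contour: each point moves to the cell in its
// 3x3 neighborhood that minimizes internal + external energy.
void snakeIteration(std::vector<Pt>& c, const std::vector<std::vector<float>>& grad,
                    float alpha, float beta) {
    const int n = (int)c.size();
    for (int i = 0; i < n; ++i) {
        const Pt& prev = c[(i + n - 1) % n];
        const Pt& next = c[(i + 1) % n];
        Pt best = c[i];
        float bestE = 1e30f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                Pt cand{c[i].x + dx, c[i].y + dy};
                float e = internalEnergy(prev, cand, next, alpha, beta)
                        + externalEnergy(grad, (int)cand.x, (int)cand.y);
                if (e < bestE) { bestE = e; best = cand; }
            }
        c[i] = best;
    }
}
```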

2.1. Frontal Feature Extraction

In eye contour detection, the initial contour is first used to find the center position of the eye by means of rigid template matching, using only the external energy on the edge gradient magnitude; this rigid search is sketched below. Then, the contour is deformed nonrigidly using ACM. Figure 2 demonstrates the procedure of extracting the eye contours.

Figure 2. Detecting eye contours. (a) Initial contours. (b) Edge gradient magnitude. (c) Final contours.
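The rigid localization can be as simple as scoring shifted copies of the initial contour by their summed edge support and keeping the best offset. A sketch under that assumption (all identifiers are hypothetical):

```cpp
#include <vector>

struct Pt { float x, y; };
struct Offset { int dx, dy; };

// Slide the initial contour over a (2*radius+1)^2 search window and keep the
// offset whose points lie on the strongest edges (external energy only).
Offset bestRigidOffset(const std::vector<Pt>& contour,
                       const std::vector<std::vector<float>>& grad, int radius) {
    Offset best{0, 0};
    float bestScore = -1e30f;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            float score = 0.f;
            for (const Pt& p : contour) {
                int x = (int)p.x + dx, y = (int)p.y + dy;
                if (y >= 0 && y < (int)grad.size() && x >= 0 && x < (int)grad[0].size())
                    score += grad[y][x];  // accumulate edge-gradient magnitude
            }
            if (score > bestScore) { bestScore = score; best = {dx, dy}; }
        }
    return best;
}
```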

In the case of lip contour detection, it is difficult to discriminate the lip pixels from the surrounding skin in grayscale intensity. In this paper, we convert the color space to R/G and B/G for better discrimination, as in [12]. Then, binary K-means clustering is applied to separate the lips from the skin, as sketched below. The edge gradient extracted from the binary image is used to compute the external energy. Similar to eye contour detection, the initial lip contour is first template matched rigidly with the external energy only, which is followed by fine deformation with ACM. Figure 3 shows the procedure of extracting the lip contour.

Figure 3. Detecting lip contour. (a) Initial contour. (b) Binary clustered image. (c) Final contour.
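A sketch of the chromatic clustering step: pixels are mapped to (R/G, B/G) features and split into two groups with binary k-means. The seed values and the fixed iteration count are assumptions for illustration.

```cpp
#include <vector>

struct Feat { float rg, bg; };   // (R/G, B/G) chromatic features

Feat toFeat(float r, float g, float b) {
    float gg = (g > 1e-6f) ? g : 1e-6f;   // guard against division by zero
    return {r / gg, b / gg};
}

static float dist2(const Feat& a, const Feat& b) {
    float d1 = a.rg - b.rg, d2 = a.bg - b.bg;
    return d1 * d1 + d2 * d2;
}

// Binary k-means: returns label 0/1 per pixel (e.g. skin vs. lip).
std::vector<int> binaryKMeans(const std::vector<Feat>& px, int iters) {
    Feat c[2] = {{0.8f, 0.9f}, {1.3f, 0.7f}};   // rough seeds (assumed values)
    std::vector<int> label(px.size(), 0);
    for (int it = 0; it < iters; ++it) {
        Feat sum[2] = {{0.f, 0.f}, {0.f, 0.f}};
        int cnt[2] = {0, 0};
        for (size_t i = 0; i < px.size(); ++i) {
            label[i] = (dist2(px[i], c[0]) <= dist2(px[i], c[1])) ? 0 : 1;
            sum[label[i]].rg += px[i].rg;
            sum[label[i]].bg += px[i].bg;
            ++cnt[label[i]];
        }
        for (int k = 0; k < 2; ++k)   // recompute cluster centroids
            if (cnt[k]) c[k] = {sum[k].rg / cnt[k], sum[k].bg / cnt[k]};
    }
    return label;
}
```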


The left and right nose contours are independently modeled and optimized by applying ACM, and then merged into a single contour. Initial localization and ACM convergence are performed similarly to the eye and lip cases. Figure 4 illustrates the procedure of extracting the nose contour.

Figure 4. Detecting nose contour. (a) Initial contours. (b) Edge gradient magnitude. (c) Final contours.

The chin and cheek contour is located on the boundary between the background and the skin. In order to eliminate interference from the background, the initial contour is forced to shrink inside the face region; it is defined as the minimal convex curve enclosing the extracted eye and lip contours. Then, it is expanded by ACM to fit the final contour. The external energy is computed on the edge gradient magnitude. When the edge gradient is computed, a small weight is added to the skin pixels to detect the skin-background boundary more clearly. Skin pixel detection is performed using a conventional classifier [10], as sketched below. Figure 5 shows the procedure of extracting the cheek and chin contour.

Figure 5. Detecting cheek and chin contour. (a) Initial contour. (b) Edge gradient magnitude. (c) Final contour.
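The classifier of [10] is commonly summarized as a plain RGB rule (its uniform-daylight variant); a sketch:

```cpp
#include <algorithm>
#include <cstdlib>

// RGB skin rule attributed to Kovac et al. [10] (uniform daylight variant).
bool isSkinPixel(int r, int g, int b) {
    int mx = std::max({r, g, b});
    int mn = std::min({r, g, b});
    return r > 95 && g > 40 && b > 20 &&
           (mx - mn) > 15 &&
           std::abs(r - g) > 15 &&
           r > g && r > b;
}
```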

A few examples of overall frontal feature extraction are shown in Figure 6.

Figure 6. Extracted frontal feature contours.

2.2. Profile Feature Extraction

The profile contour is hard to obtain accurately since there is no prior information on the facial feature locations in the profile image. In our approach, the initial contour is first detected using global template matching over a candidate region, which is estimated from the spatial distribution of skin pixels. The template model is obtained by projecting the profile view of the 3D face model and extracting the silhouette. Note that the background is assumed to be fairly uniform. Similar to the chin and cheek contour detection procedure, the external energy is computed on the edge gradient magnitude of the grayscale image, with more weight on the skin pixels. In order to increase the flexibility of controlling the deformation of the profile contour, additional control points are inserted between the initial feature points. A total of 33 control points is used in our approach, which results in 99 degrees of freedom in total.

In order to optimize the position and orientation of each control point, we employ the non-rigid registration technique of [11], referred to here as deformable ICP. In our approach, the non-rigid registration is formulated as a nonlinear least-squares optimization problem and solved with the conventional Levenberg-Marquardt algorithm; a simplified sketch follows the figure captions below. Figure 7 demonstrates the procedure of extracting the profile contour, and a few examples of profile feature extraction are shown in Figure 8.

Figure 7. Extracting profile feature contour. (a) The result of the initial location search over the candidate region. (b) Deformable ICP with more control points. (c) Final contour as a NURBS curve.

Figure 8. Extracted profile feature contours.
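The paper solves this with Levenberg-Marquardt; the sketch below keeps the deformable-ICP structure but substitutes a simpler alternating update (closest-point assignment, then a damped move with a neighbor-smoothness term) so the loop stays readable. All parameter values and names are illustrative assumptions.

```cpp
#include <vector>

struct P2 { float x, y; };

static float d2(const P2& a, const P2& b) {
    return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
}

// Deformable ICP sketch: per iteration, (1) assign each control point its
// closest target (edge) point, (2) move it toward that match while a
// smoothness term keeps neighboring control points coherent.
// Assumes 'target' is non-empty and 'ctrl' is an open curve.
void deformableICP(std::vector<P2>& ctrl, const std::vector<P2>& target,
                   int iters, float lambda, float step) {
    const int n = (int)ctrl.size();
    for (int it = 0; it < iters; ++it) {
        std::vector<P2> next = ctrl;
        for (int i = 0; i < n; ++i) {
            // (1) closest-point assignment
            P2 best = target[0];
            for (const P2& t : target)
                if (d2(t, ctrl[i]) < d2(best, ctrl[i])) best = t;
            // (2) data term pulls toward the match; smoothness term pulls
            //     toward the neighbor average (ends are clamped)
            const P2& l = ctrl[i > 0 ? i - 1 : i];
            const P2& r = ctrl[i < n - 1 ? i + 1 : i];
            P2 smooth{(l.x + r.x) * 0.5f, (l.y + r.y) * 0.5f};
            next[i].x += step * ((best.x - ctrl[i].x) + lambda * (smooth.x - ctrl[i].x));
            next[i].y += step * ((best.y - ctrl[i].y) + lambda * (smooth.y - ctrl[i].y));
        }
        ctrl = next;
    }
}
```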

3. 3D Face Model Deformation

In this paper, the generic 3D mesh model is deformed to approximate the facial geometry of a subject. This is done by warping the generic mesh such that the feature contours on the 3D mesh coincide with the 2D feature contours when the 3D mesh is projected onto the frontal and side view planes. Therefore, exact camera calibration is unnecessary. Warping of the generic mesh is performed by RBF interpolation [5]. The key point correspondences in the RBF interpolation are set up from the feature correspondences in the frontal view (for adjusting the x and y coordinates) and in the profile view (for adjusting the z coordinates). All the other vertices are moved by interpolating the key point movements, weighted by the geodesic distance to the key points. The mathematical representation of this procedure is expressed as follows.

f(p) = c_0 + [c_1\ c_2\ c_3]\,p + \sum_{i=1}^{N} \Lambda_i\, \varphi\big(|p - u_i|_{\text{geodesic}}\big) \tag{1}

where p is a 3D vertex and ϕ is the RBF for the key points u_i (1 ≤ i ≤ N). In our approach, ϕ(r) = e^{-r/K}, where K is a predetermined coefficient that defines the influencing range of the key points. c_0, c_1, c_2, c_3, and Λ_i are all vector coefficients, which are determined from the correspondences of the key point movements. The geodesic distance |·|_geodesic between vertices is precomputed using the Floyd-Warshall all-pairs shortest path algorithm. A sketch of the fitting and evaluation steps is given below. Figure 9 shows examples of 3D face model deformation.

Figure 9. Deformation of initial mesh using RBF. (a) Frontal face images. (b) Deformed mesh models. (c) Profile face images. (d) Deformed face models.
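A minimal sketch of fitting and evaluating the RBF of Eq. (1), with two simplifications for brevity: the affine part c_0 + [c_1 c_2 c_3]p is dropped, and the Euclidean distance stands in for the precomputed geodesic distance (the paper uses the full form). Weights are fitted once per displacement component (x, y, or z); naive Gaussian elimination suffices for the small number of key points.

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct V3 { float x, y, z; };

static float dist(const V3& a, const V3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}
static float phi(float r, float K) { return std::exp(-r / K); }   // as in Eq. (1)

// Solve A w = d by Gaussian elimination with partial pivoting
// (fine for a few dozen key points).
static std::vector<float> solve(std::vector<std::vector<float>> A, std::vector<float> d) {
    int n = (int)d.size();
    for (int i = 0; i < n; ++i) {
        int piv = i;
        for (int k = i + 1; k < n; ++k)
            if (std::fabs(A[k][i]) > std::fabs(A[piv][i])) piv = k;
        std::swap(A[i], A[piv]); std::swap(d[i], d[piv]);
        for (int k = i + 1; k < n; ++k) {
            float f = A[k][i] / A[i][i];
            for (int j = i; j < n; ++j) A[k][j] -= f * A[i][j];
            d[k] -= f * d[i];
        }
    }
    std::vector<float> w(n);
    for (int i = n - 1; i >= 0; --i) {
        float s = d[i];
        for (int j = i + 1; j < n; ++j) s -= A[i][j] * w[j];
        w[i] = s / A[i][i];
    }
    return w;
}

// Fit weights so that sum_i w_i * phi(|u_j - u_i|) equals the observed
// displacement at every key point u_j (one scalar component at a time).
std::vector<float> fitRBF(const std::vector<V3>& keys,
                          const std::vector<float>& disp, float K) {
    int n = (int)keys.size();
    std::vector<std::vector<float>> A(n, std::vector<float>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) A[i][j] = phi(dist(keys[i], keys[j]), K);
    return solve(A, disp);
}

// Interpolated displacement component at an arbitrary mesh vertex p.
float evalRBF(const V3& p, const std::vector<V3>& keys,
              const std::vector<float>& w, float K) {
    float s = 0.f;
    for (size_t i = 0; i < keys.size(); ++i) s += w[i] * phi(dist(p, keys[i]), K);
    return s;
}
```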

4. Texture Map Generation

Facial texture represents the visual appearance, such as skin color, wrinkles, and blemishes. It plays an important role in the photorealistic modeling of a 3D object. We represent the facial texture using a cylindrical mapping: the vertices of the 3D mesh are radially projected onto the cylinder surface so that texture coordinates are defined for each vertex, and folded vertices are manually flattened. The frontal input image is orthogonally projected onto the cylinder to fill the texture map with the appropriate color pixels; a sketch of the coordinate computation follows. In our approach, the texture region outside the frontal face region is filled with an artificial skin texture. The main advantages are that (1) we can exclude the hair texture, which would appear far from realistic since we do not recover hair geometry, and (2) we can avoid detecting the ear in the profile image, which is very difficult to do. In this context, we do not use the profile image for texture generation.
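The cylindrical texture coordinates can be computed per vertex as below; the axis choice and normalization are assumptions, since the paper does not spell them out.

```cpp
#include <cmath>

struct UV { float u, v; };

// Radial projection onto a cylinder around the vertical (y) axis:
// u comes from the azimuth, v from the height. Assumes yMax > yMin.
UV cylindricalUV(float x, float y, float z, float yMin, float yMax) {
    const float pi = 3.14159265358979f;
    float theta = std::atan2(x, z);            // azimuth in [-pi, pi]
    return { (theta + pi) / (2.f * pi),        // u normalized to [0, 1]
             (y - yMin) / (yMax - yMin) };     // v normalized to [0, 1]
}
```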

In order to generate a natural-looking texture map, the boundary between the frontal face region and the artificial skin texture is alpha-blended. In addition, the color distributions of the two regions are adjusted using the color transfer technique of [14], sketched below. An example of a generated texture map is shown in Figure 10.

Figure 10. Texture map with and without overlaid mesh.
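The color transfer of [14] reduces to matching per-channel means and standard deviations between two regions; the paper applies it in Lab space, and the color-space conversion is left to the caller here. A sketch:

```cpp
#include <cmath>
#include <vector>

struct Stats { float mean, stdev; };

// Per-channel statistics of a pixel region. Assumes v is non-empty.
Stats channelStats(const std::vector<float>& v) {
    float m = 0.f;
    for (float x : v) m += x;
    m /= v.size();
    float s = 0.f;
    for (float x : v) s += (x - m) * (x - m);
    return { m, std::sqrt(s / v.size()) };
}

// Shift and scale one channel of the source region so its mean and standard
// deviation match the target region's.
void transferChannel(std::vector<float>& src, const Stats& s, const Stats& t) {
    float scale = (s.stdev > 1e-6f) ? t.stdev / s.stdev : 1.f;
    for (float& x : src) x = (x - s.mean) * scale + t.mean;
}
```

To adjust the artificial skin region toward the facial region, one would compute Stats per channel for each region and call transferChannel on the artificial-skin pixels.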

5. Mobile Implementation and Optimization

It is not trivial to implement a complex application on a mobile device, which has limited computing and storage resources. Therefore, the initial implementation needs continued optimization to improve the performance. In this paper, we deploy several steps of algorithm and code optimization.

Since the CPU in the target smartphone does not support hardware floating point operations, it is expensive to keep heavy computations in floating point format. In our implementation, a few computationally heavy steps are rewritten using fixed point operations, as sketched below. The fixed point format is carefully selected so that the residual error is minimized while avoiding overflow.
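For example, a Q16.16 fixed-point layout (an illustrative choice, not necessarily the paper's exact format) replaces float multiplies and divides with integer operations, widening the intermediate to 64 bits to avoid overflow:

```cpp
#include <cstdint>

typedef int32_t fx;                 // Q16.16 fixed-point value
static const int FX_SHIFT = 16;     // 16 fractional bits

inline fx fxFromFloat(float f) { return (fx)(f * (1 << FX_SHIFT)); }
inline float fxToFloat(fx a)   { return (float)a / (1 << FX_SHIFT); }

// 64-bit intermediate keeps the product from overflowing before the shift.
inline fx fxMul(fx a, fx b)    { return (fx)(((int64_t)a * b) >> FX_SHIFT); }

// Pre-shift the dividend to preserve fractional precision. Assumes b != 0.
inline fx fxDiv(fx a, fx b)    { return (fx)(((int64_t)a << FX_SHIFT) / b); }
```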

In the ACM convergence of feature contour detection, the number of iterations is strongly affected by the accuracy of the initial contour. Input images captured by a smartphone tend to have reasonably well-sized faces in the middle of the frame, which ensures reliable face detection performance and good initialization of each feature candidate region. Based on this, the initial contours are positioned accurately with template matching within a small but highly probable search area. In this paper, the initial contours are accurate enough that fewer than 5 iterations are needed in most cases.

Texture map generation is accelerated on the GPU using OpenGL ES 1.1, both when the texture map is rendered to the front frame buffer and when the pixel values are read back. In addition, in the texture color adjustment in Lab color space, the expensive mathematical functions are precomputed and stored in lookup tables. The artificial skin texture is stored in an indexed color format so that the color transfer operation is performed only for the colors in the index table, as sketched below. This enables us to avoid millions of heavy floating point computations over all the texture pixels.
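The indexed-color trick means the expensive transfer runs once per palette entry rather than once per texture pixel. A sketch with hypothetical names:

```cpp
#include <cstdint>
#include <vector>

struct Rgb { uint8_t r, g, b; };

// Apply an expensive per-color operation to the palette only
// (e.g. 256 entries instead of millions of pixels).
void transferIndexed(std::vector<Rgb>& palette,
                     Rgb (*transferOneColor)(Rgb)) {
    for (Rgb& c : palette) c = transferOneColor(c);
}
// Pixels store palette indices, so they need no per-pixel work at all.
```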


Note that the optimization reduces the processing time by approximately 50% (from 12.3 seconds to 6.2 seconds) compared to the unoptimized implementation, as shown in Table 1.

Steps                        Before Optimization   After Optimization
Front Feature Extraction     5.6                   2.81
Profile Feature Extraction   4.2                   2.15
3D Model Deformation         0.7                   0.12
Texture Map Generation       1.8                   1.13
Sum                          12.3                  6.21

Table 1. Processing time in seconds.

6. Experimental Results

A Samsung GT-S8500 smartphone is used in our experiments. It is equipped with an S5PC110 application processor, which contains a 1 GHz ARM Cortex-A8 CPU, and 256 MB of RAM. Figure 11 shows a few examples of the resulting 3D face models rendered on the smartphone.

Input images are taken under normal illumination without camera calibration. When capturing profile images, we assume a fairly uniform background. The generic face model consists of 7,426 vertices and 12,349 triangles. Table 1 presents the processing time of the complete modeling pipeline; on average, the whole modeling procedure takes approximately 6.2 seconds. The final 3D face models are shown in Figure 12, and they resemble the subjects closely. The skin texture needs to be improved when there is a shadowed region in the lower part of the face, which is common when the lighting is above the subject.

Figure 11. 3D face modeling result captured on a smartphone.

Figure 12. Constructed 3D face models for several test samples on the Samsung GT-S8500. All images are captured by the smartphone and all algorithms run entirely on the smartphone.

7. Conclusion

In this paper, we proposed an automatic method for modeling a 3D face on a smartphone using frontal and profile captures. The proposed method does not need any camera calibration. It is efficient enough to run on the smartphone in a few seconds and robust enough to operate in the general environments where smartphones are commonly used, without strict acquisition constraints. A set of optimization techniques is carefully deployed when the algorithm is implemented on the smartphone.

Acknowledgement

This work was supported by Samsung Electronics Co., Ltd. This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(C1090-1111-0003)).

References

[1] T. Akimoto, Y. Suenaga, and R. S. Wallace. Automatic creation of 3D facial models. IEEE Computer Graphics and Applications, 13(5):16-22, September 1993.

[2] A. Ansari and M. Abdel-Mottaleb. Automatic facial feature extraction and 3D face modeling using two orthogonal views with application to 3D face recognition. Pattern Recognition, 38(12):2549-2563, December 2005.

[3] B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, and M. Gross. Multi-scale capture of facial geometry and motion. ACM Trans. on Graphics, 26(3):Article 33, July 2007.

[4] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. of SIGGRAPH, pages 187-194, August 1999.

[5] J. Carr, R. Beatson, J. Cherrie, T. Mitchell, W. Fright, B. McCallum, and T. Evans. Reconstruction and representation of 3D objects with radial basis functions. In Proc. of SIGGRAPH, pages 67-76, August 2001.

[6] A. D. Crocombe, A. D. Linney, J. Campos, and R. Richards. Non-contact anthropometry using projected laser line distortion: Three dimensional graphic visualisation and applications. Optics and Lasers in Engineering, 28(2):137-155, September 1997.

[7] A. Golovinskiy, W. Matusik, H. Pfister, S. Rusinkiewicz, and T. Funkhouser. A statistical model for synthesis of detailed facial geometry. ACM Trans. on Graphics, 25(3):1025-1034, July 2006.

[8] C. S. Je, S. W. Lee, and R. H. Park. High-contrast color-stripe pattern for rapid structured-light range imaging. In Proc. of European Conference on Computer Vision, pages 95-107, May 2004.

[9] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: active contour models. International Journal of Computer Vision, 1(4):321-331, January 1988.

[10] J. Kovac, P. Peer, and F. Solina. Human skin colour clustering for face detection. In Proc. of EUROCON, volume 2, pages 144-148, September 2003.

[11] H. Li, R. W. Sumner, and M. Pauly. Global correspondence optimization for non-rigid registration of depth scans. Computer Graphics Forum, 27(5):1421-1430, October 2008.

[12] I. K. Park, H. Zhang, and V. Vezhnevets. Image-based 3D face modeling system. EURASIP Journal on Applied Signal Processing, 2005(13):2072-2090, August 2005.

[13] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. Salesin. Synthesizing realistic facial expressions from photographs. In Proc. of SIGGRAPH, pages 75-84, September 1998.

[14] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34-41, September/October 2001.

[15] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, May 2004.

[16] Y. Wang, X. Huang, C. S. Lee, S. Zhang, Z. Li, D. Samaras, D. Metaxas, A. Elgammal, and P. Huang. High resolution acquisition, learning and transfer of dynamic 3-D facial expressions. Computer Graphics Forum, 23(3):677-686, September 2004.

[17] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz. Spacetime faces: high resolution capture for modeling and animation. ACM Trans. on Graphics, 23(3):548-558, August 2004.