
HMM-Based Surface Reconstruction from Single Images

Takayuki Nagai,¹ Masaaki Ikehara,² and Akira Kurematsu¹

¹Department of Electronic Engineering, The University of Electro-Communications, Chofu, 182-8585 Japan

²Department of Electronics and Electrical Engineering, Keio University, Yokohama, 223-8522 Japan

SUMMARY

In this paper, a novel method of surface reconstruction from a single monocular image is proposed. Our proposed approach (called “shape from object-specific knowledge”) is based on knowledge of objects, acquired by learning from a number of samples. To achieve this, we investigate the use of the Subband Pseudo 2D Hidden Markov Model (SPHMM), which is an extended version of normal Pseudo 2D HMMs. SPHMMs can model the correspondence between an intensity image and its depth information. We have applied our algorithm to 3D face, 3D hand, and 3D car reconstruction from single images, and the results show the effectiveness of the proposed method. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(11): 80–89, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.10685. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J86-D-II, No. 9, September 2003, pp. 1286–1296.

Key words: HMM; 3D shape recovery; object recognition; object-specific knowledge.

1. Introduction

Recovery of 3D surfaces from 2D images is an attractive research field and has many applications, including 3D object recognition and 3D modeling for graphics. One common approach in this field is stereo processing, which uses simultaneously captured images. Stereo processing is mainly a correspondence problem if geometrical information about the cameras is already known [1].

On the other hand, the shape from shading and shape from texture methods are concerned with recovering surface orientations from local variations in the observed intensity image [1–3]. This is not an easy task, since only a single monocular image is used for the surface reconstruction. A method of stable 3D reconstruction from single images is still an open issue.

Humans inherently recognize the surface of an object in a picture with ease, provided that it is a familiar one, such as a human face. This observation highlights two aspects of surface reconstruction. One aspect is that the picture contains enough information to allow satisfactory recovery of 3D surfaces; the other is the use of knowledge in recovering the surface. It seems that humans categorize objects in everyday life and store the mean shape of each category. The true shapes, which are required for constructing the mean shape, can be acquired by stereo processing, tactile information, and so forth. To reconstruct the 3D information on an object in a picture, we can first recognize the object and then map the mean shape of that object onto the object in the picture.

In this paper, a novel method of depth reconstruction from a single image is proposed. Our proposed approach, which we call “shape from knowledge,” is based on knowledge of the object. To model this knowledge, a subband extension of the Pseudo 2D Hidden Markov Model (PHMM) is proposed. Recently, HMMs have attracted wide attention for their effectiveness in image processing tasks [4–6] because of their elastic matching capability. By using the proposed subband PHMM framework, the correspondence between a 2D image and its depth information can be acquired efficiently by learning from multiple samples. Simultaneously, recognition of the object and surface reconstruction are achieved within a unified HMM framework.

In Ref. 7, a recognition-based 3D face recovery method called “shape from recognition” is proposed. In this method, the results of facial part recognition and localization constrain the solution of shape from shading. The method recovers 3D shapes of faces efficiently. However, it deals only with facial images, and generalization of the method appears to be rather awkward. In Ref. 8, the authors proposed a 3D face recovery method based on the idea that the target 3D face can be represented by a linear combination of other 3D faces. The method gives very accurate 3D information on the input 2D face, but it is specialized for 3D face recovery. In addition, many accurate 3D models are required.

This paper is organized as follows. In the next section, an overview of the proposed algorithm is presented and pseudo 2D HMMs are introduced. In Section 3, the subband extension of the PHMM is proposed, and then some details of the proposed algorithm are described. Section 4 gives some experimental results to show the effectiveness of the proposed method. Section 5 concludes the paper.

2. Preliminaries

2.1. Shape from knowledge [9]

The objective of this paper is to estimate the shape of an object in an input intensity image. The proposed method, which we call “shape from knowledge,” is based on knowledge about objects. Here, the “knowledge” consists of appearance models of objects and the corresponding models of 3D shapes. Figure 1 shows an overview of the proposed approach. First, features are extracted from an input image. Then the object in the image is recognized by using the extracted features and the prepared appearance (intensity image) models. This process corresponds to the selection of the shape model (knowledge) that is used in the shape reconstruction process. By using the selected shape model, the 3D surface of the object is reconstructed. In this paper, a framework using Hidden Markov Models (HMMs) for recognition and for shape reconstruction from the knowledge model is proposed. Since HMMs have an elastic matching capability, they are suitable for these purposes. The knowledge is acquired by learning from training samples containing pairs of an intensity image and its depth information.

2.2. Pseudo 2D Hidden Markov Model

To represent a 2D signal, a 2D HMM is desirable. Owing to the high complexity of fully connected 2D HMMs, a pseudo 2D HMM (PHMM) is used in this approach. The PHMM has been shown to be effective for character recognition [4] and for face recognition [5]. The PHMM consists of a set of super-states and a set of normal states (embedded states), which are embedded in the super-states. The super-states model the vertical direction, and the horizontal direction is modeled by the embedded states. Figure 2 illustrates the basic idea of the PHMM. The PHMM is parametrized by the initial super-state distribution P_s, the super-state transition probability matrix A_s = {a_kj}, the initial embedded-state distribution P_e^(k), and the embedded-state transition probability matrix A_e^(k) = {a_ij^(k)}. The embedded states are characterized by Gaussian mixture densities with mixture weights C_e^(k) = {c_jm^(k)}, mean vectors M_e^(k) = {m_jm^(k)}, and covariance matrices U_e^(k) = {S_jm^(k)}. Hereafter, the model of the PHMM is represented by the parameter set λ defined in Eqs. (1)–(3).

Fig. 2. Pseudo 2D HMM.

Fig. 1. Basic idea of the proposed “shape from knowledge.”

In Eqs. (1)–(3) (the displayed equations are not reproduced in this transcript), N_s denotes the number of super-states. Left-to-right PHMMs (for both directions) are used in this paper, since the spatial information about objects can be represented by a one-way state transition.
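Based on the parameter definitions above and the embedded-HMM notation of Ref. 5, Eqs. (1)–(3) plausibly take a form along the lines of the following sketch (the exact grouping used in the original paper may differ):

```latex
\lambda = \left(P_s,\, A_s,\, \Lambda\right), \qquad
\Lambda = \left\{\lambda_e^{(k)},\ 1 \le k \le N_s\right\}, \qquad
\lambda_e^{(k)} = \left(P_e^{(k)},\, A_e^{(k)},\, C_e^{(k)},\, M_e^{(k)},\, U_e^{(k)}\right)
```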

3. 3D Surface Reconstruction

3.1. Overview of the proposed approach

Since the proposed method is based on an HMM, it has a shape reconstruction phase and a training phase. Figures 3(a) and 3(b) show the reconstruction phase and the training phase, respectively.

In the training phase, pairs of an intensity image and the corresponding depth image are prepared and feature vectors are extracted. In this algorithm, various kinds of features can be used. For example, in Ref. 5 the lower frequency coefficients of the 2D-DCT are used for PHMM-based facial recognition. Gaussian, Laplacian, and first-order derivatives can be other candidates for image features. From the point of view of image reconstruction, the pixel value itself must be a good choice here. In the proposed algorithm, we use nine lower coefficients of the 5 × 5 2D DCT as the feature vector of intensity images. For depth images, the Gaussian, Laplacian, and first-order derivatives and the pixel value itself are employed. These features are chosen because this combination gave the best results in preliminary tests. It should be noted that the feature vector is extracted on a pixel basis: that is, the window is shifted by one pixel in each direction. Then the feature vectors of the paired intensity and depth images are combined into one vector, and the parameters of the PHMMs are estimated by using the segmental k-means algorithm [4]. After training, each PHMM is divided into subband PHMMs. In Fig. 3, PHMM-I and PHMM-S denote the PHMMs for the intensity image and the depth image, respectively.
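As a concrete illustration of this feature extraction step, the sketch below (Python with NumPy/SciPy, not from the paper) computes a 9-dimensional intensity feature from the 5 × 5 2D-DCT around each pixel and a 5-dimensional depth feature (Gaussian, Laplacian, horizontal/vertical derivatives, pixel value). The particular 3 × 3 kernels and the choice of the upper-left 3 × 3 DCT block as the “nine lower coefficients” are assumptions made here for illustration.

```python
import numpy as np
from scipy.fft import dctn
from scipy.ndimage import convolve

# Illustrative 3x3 kernels (assumptions; the paper does not specify them).
GAUSS = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0
LAPLACE = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
DX = np.array([[0, 0, 0], [-0.5, 0, 0.5], [0, 0, 0]])   # horizontal derivative
DY = np.array([[0, -0.5, 0], [0, 0, 0], [0, 0.5, 0]])   # vertical derivative

def intensity_feature_map(image):
    """9-D intensity feature O_t^0 per pixel: low-frequency coefficients of the
    5x5 2D-DCT of the window centered at each pixel (window shifted by one pixel)."""
    img = image.astype(float)
    h, w = img.shape
    feats = np.zeros((h, w, 9))
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            win = img[y - 2:y + 3, x - 2:x + 3]
            feats[y, x] = dctn(win, norm='ortho')[:3, :3].ravel()
    return feats

def depth_feature_map(depth):
    """5-D depth feature O_t^1 per pixel: Gaussian, Laplacian, horizontal and
    vertical derivatives, and the pixel value itself."""
    d = depth.astype(float)
    return np.stack([convolve(d, GAUSS), convolve(d, LAPLACE),
                     convolve(d, DX), convolve(d, DY), d], axis=-1)

# For training, the two channels are concatenated per pixel (9 + 5 = 14 dimensions):
#   combined = np.concatenate([intensity_feature_map(I), depth_feature_map(D)], axis=-1)
```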

In the reconstruction phase, an observed 2D image is decoded by a doubly embedded Viterbi algorithm [4] using PHMM-I. The most probable PHMM-S corresponding to that PHMM-I is selected as the object model according to the result of the 2D image decoding. Then the depth information which corresponds to the observed 2D image is calculated through the output of the selected PHMM-S with the identical state transitions of PHMM-I.

3.2. Subband PHMMs

To model the correspondence between an intensity image and its depth information, we extend the PHMM to a subband PHMM. Here the subbands correspond to the intensity channel (intensity image) and the depth channel (depth image); hence, the number of subbands is always two. The basic idea of the proposed subband PHMMs is illustrated in Fig. 4. The subband PHMMs share common state transition probabilities between the subbands (for both the super-states and the embedded states), whereas the mean vectors and covariance matrices differ in each subband. Now, let j, k, and l be the embedded-state index, the super-state index, and the subband index (l = 0, 1), respectively. O_t^l represents an observed vector in subband l at pixel location t (the pixels are assumed to be in lexicographic order). The size of O_t^0 is 9 × 1 and that of O_t^1 is 5 × 1.

Fig. 3. Block diagram of the proposed subband PHMM-based 3D reconstruction.

In each state, the output probability is written using a mixture of normal distributions, Eq. (4), where N(·), m_jml^(k), S_jml^(k), and c_jm^(k) represent the normal distribution, the mean vector, the covariance matrix, and the mixture weight, respectively, and M is the number of mixtures. Moreover, the constraints of Eq. (5) hold, and S_jml^(k) is assumed to be diagonal. Here the decomposition of the PHMM λ into subband models λ_l using Eqs. (6) and (7) is considered. Marginalizing Eq. (4) and summing over S, the set of all possible state transitions (with τ the number of pixels), shows that the likelihood of the observation O^0 can be evaluated by the decomposed model λ_0; O^1 can likewise be evaluated by the decomposed model λ_1. In the training phase, features are extracted from all subbands and are combined according to Eq. (5) so that the same transition probabilities are shared between the λ_l's. A set of parameters of the PHMM λ_o for a category o is obtained by the segmental k-means algorithm (the EM algorithm is also applicable) using the combined feature vectors. Then, applying Eqs. (6) and (7) to λ_o inversely yields the SPHMMs of object o. It is worth noting that the λ_l^o's share the same transition probability matrices A_s and A_e^(k).
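Equation (4) itself is not reproduced in this transcript. Given the description above (shared mixture weights, subband-specific diagonal Gaussians), it plausibly has a form along the lines of the following sketch:

```latex
b_j^{(k)}(O_t) \;=\; \sum_{m=1}^{M} c_{jm}^{(k)} \prod_{l=0}^{1}
  N\!\left(O_t^{\,l};\, m_{jml}^{(k)},\, S_{jml}^{(k)}\right)
```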

3.3. Shape reconstruction from subband PHMMs

The problem of shape reconstruction is to obtain the depth image O^1 which maximizes the conditional probability P(O^1 | O^0). P(O^1 | O^0) can be rewritten as Eq. (12), where λ̄_o^1 and λ̄_o^0 each represent a set consisting of the model, the state transition sequence q, and the sequence of mixture indices m, as defined in Eq. (13). Equation (12) implies that we must seek the O^1 (a depth image) which maximizes the likelihood P for a given O^0 (intensity image) over all combinations of λ̄_o^0 and λ̄_o^1. In practice, it seems that P(O^1 | O^0) is dominated by the value of P(O^1, λ̄_o^1, λ̄_o^0 | O^0) in which λ̄_o^0 represents the observed vector O^0 best and λ̄_o^1 represents the correct depth information O^1 best. Therefore, the following problem of finding the most likely path, Eq. (14), is solved instead. Since this maximization is still intractable, we further rewrite Eq. (14) using conditional independence, which gives Eq. (15).
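Equations (12)–(15) are not reproduced in this transcript. From the step-by-step maximization described below, the conditional-independence factorization of Eq. (15) plausibly reads:

```latex
\max_{O^1,\ \bar{\lambda}_o^1,\ \bar{\lambda}_o^0}
  P\!\left(O^1, \bar{\lambda}_o^1, \bar{\lambda}_o^0 \mid O^0\right)
\;\approx\;
\max_{O^1,\ \bar{\lambda}_o^1,\ \bar{\lambda}_o^0}
  P\!\left(O^1 \mid \bar{\lambda}_o^1\right)
  P\!\left(\bar{\lambda}_o^1 \mid \bar{\lambda}_o^0\right)
  P\!\left(\bar{\lambda}_o^0 \mid O^0\right)
```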

[Equations (4)–(14) are not reproduced in this transcript.]

Fig. 4. Illustration of subband PHMMs.

Hence, instead of the direct maximization of Eq. (14), Eq. (15) is maximized step by step from the rightmost term on the right-hand side.

First, the maximization of P(λ̄_o^0 | O^0) can be seen as the recognition part if we assume a constant prior. Therefore, the input signal is decoded using all SPHMM-I's of the observable signals, and then the most probable model λ_o^0, state transition sequence q^0, and sequence of mixture indices m^0 are selected. Maximization of the second term, P(λ̄_o^1 | λ̄_o^0), is the problem of finding the most likely λ̄_o^1 for the λ̄_o^0 found in the foregoing maximization of P(λ̄_o^0 | O^0). It is clear that this term is maximized when the depth model corresponding to the selected object model λ_o^0 is chosen together with exactly the same state sequence q^1 = q^0 and mixture sequence m^1 = m^0, since the PHMMs are trained under the constraint that they share common state transition probabilities.

The remaining problem is the maximization of P(O^1 | λ̄_o^1). Since the set of λ_o^1, q^1, and m^1 is given by the preceding step, it is clear that the most probable O^1 can be obtained as the sequence of the mean vectors of each selected state. However, this results in a discontinuous image, since no information among pixels is taken into account. This problem is considered in the next section.
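As a minimal sketch of this readout step (not the authors' code), assume Viterbi decoding of the intensity model has produced a per-pixel alignment of (super-state, embedded-state, mixture) triples, shared with the depth model, and that the depth-channel means of the selected SPHMM are stored in an array indexed by those triples. The raw depth estimate is then just the sequence of selected mean values; the names and array layout here are hypothetical.

```python
import numpy as np

def raw_depth_from_alignment(depth_means, alignment, image_shape):
    """Read off the raw depth estimate as the depth-channel means of the states
    selected by decoding the intensity model (q^1 = q^0, m^1 = m^0).

    depth_means : array of shape (n_super, n_embedded, n_mix, 5); the last
                  feature component is assumed to be the pixel value p_t.
    alignment   : iterable of (super_state, embedded_state, mixture) triples,
                  one per pixel in lexicographic order.
    image_shape : (height, width) of the depth image to reconstruct.
    """
    h, w = image_shape
    depth = np.zeros(h * w)
    for t, (k, j, m) in enumerate(alignment):
        depth[t] = depth_means[k, j, m, -1]   # mean of the pixel-value feature
    return depth.reshape(h, w)
```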

3.4. Shape reconstruction from PHMMs

To reconstruct a smooth object surface, interpixel information should be taken into consideration when we maximize the above-mentioned likelihood function. Recall that the Gaussian, Laplacian, and derivatives are included in the feature vectors of the depth image; they retain information on the eight neighbors of the target pixel.

Let p_t be the pixel value of the depth image at location t. The feature vector O_t^1 at position t is given by Eq. (16), where G(⋅), L(⋅), H(⋅), and V(⋅) represent the Gaussian, Laplacian, horizontal-derivative, and vertical-derivative operations, respectively. Consequently, the problem here becomes that of finding p = [p_1 p_2 ⋅ ⋅ ⋅ p_τ]^T (lexicographic ordering of pixels) that maximizes the likelihood P(O^1 | λ̄_o^1) under the constraint of Eq. (17), where the matrix W consists of the coefficients of the Gaussian, Laplacian, horizontal-derivative, and vertical-derivative kernels (see the Appendix). We then have Eq. (18), where m_1 represents a mean vector constructed by arranging the means of the selected states according to q^1 and m^1, and S_1 is a diagonal covariance matrix whose diagonal elements consist of the variances of the selected states according to q^1 and m^1. Equation (18) is maximized when Eq. (19) holds, since Eq. (18) is a quadratic form in p. Hence, we can obtain the optimal depth image by solving the linear equation (20), whose coefficient matrix is denoted by G.

Since the size of the matrix G is quite large, solving Eq. (20) directly may be problematic. Thus, we solve the linear equation iteratively by using the conjugate gradient method [10]. In addition, the amount of memory required can be small, since the matrix G is highly sparse. We can prove that the inverse of G always exists as follows. Since the matrix S_1 is diagonal, S_1^−1 has full rank. From Eqs. (16) and (17), it is clear that all unit vectors are included in the matrix W, so that p_t is directly output. Therefore, the rank of the matrix W is τ (the size of p). These facts show that G has full rank and that the inverse matrix always exists.
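The following sketch (again illustrative, not the authors' implementation) builds a sparse W that maps the depth pixels p to the stacked features [G(p_t), L(p_t), H(p_t), V(p_t), p_t] with zero padding at the border, and then solves (Wᵀ S_1⁻¹ W) p = Wᵀ S_1⁻¹ m_1 with conjugate gradients. The 3 × 3 kernels are the same assumed ones as in the earlier feature-extraction sketch.

```python
import numpy as np
from scipy.sparse import lil_matrix, diags
from scipy.sparse.linalg import cg

# Assumed 3x3 kernels: Gaussian, Laplacian, horizontal/vertical derivatives,
# plus the unit kernel that outputs the pixel value p_t itself.
KERNELS = [
    np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0,
    np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]]),
    np.array([[0, 0, 0], [-0.5, 0, 0.5], [0, 0, 0]]),
    np.array([[0, -0.5, 0], [0, 0, 0], [0, 0.5, 0]]),
    np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]]),
]

def build_W(h, w):
    """Sparse W of size (5*tau) x tau such that (W @ p)[5t:5t+5] is the depth
    feature vector O_t^1, with zero padding outside the image border."""
    tau = h * w
    W = lil_matrix((5 * tau, tau))
    for y in range(h):
        for x in range(w):
            t = y * w + x
            for f, ker in enumerate(KERNELS):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            W[5 * t + f, yy * w + xx] += ker[dy + 1, dx + 1]
    return W.tocsr()

def smooth_depth(W, m1, var1):
    """Solve (W^T S_1^{-1} W) p = W^T S_1^{-1} m_1 by conjugate gradients.

    m1, var1 : stacked means and variances of the selected states (length 5*tau),
               arranged according to q^1 and the mixture sequence m^1.
    """
    S1_inv = diags(1.0 / var1)
    G = (W.T @ S1_inv @ W).tocsr()     # sparse, symmetric, full rank (see text)
    b = W.T @ (S1_inv @ m1)
    p, _ = cg(G, b)
    return p
```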

4. Experiments

We applied the proposed method to three categories (faces, hands, and cars). For faces we used the XM2VTS database [11], consisting of 293 texture images and VRML models. For hands, 11 pairs of intensity and depth images were downloaded from the WWW [12]. For the car data, a commercial CG database [13] was used in the experiment. Each CG model was rendered using CG software, and the 2D image and its depth information (Z buffer) were captured. The total number of cars was 56, and various types of cars were included. In addition, the database contains both photorealistic and synthetic CG models. The directions of all objects were roughly aligned manually.

[Equations (15)–(20) are not reproduced in this transcript.]

4.1. Surface representation using subband PHMM

We trained the subband PHMM using certain persons' images and reconstructed those persons' depth information to verify that the proposed model was able to represent the relationship between intensity and depth images. The model had 6 super-states and 5 embedded states, and the number of mixtures in each state was 4. Figure 5 illustrates the results. Panels (a) and (b) are the original intensity image and depth image, respectively. Panels (c) and (d) are the results of reconstruction from panel (b) and PHMM-S. Panel (c) is the result obtained without using information among pixels, and the result with the interpixel information is given in panel (d). It is obvious from the figure that the depth information has been successfully reconstructed by the PHMM. The effectiveness of the interpixel information in the shape reconstruction process is confirmed as well.

4.2. Results of surface reconstruction

Three subband PHMMs, λ_face, λ_hand, and λ_car, were trained using the face, hand, and car databases, respectively. After that, surface reconstruction was performed using open data which were not included in the training set. For the training we randomly selected 80 faces, 10 hands, and 50 cars from the above-mentioned databases. In Fig. 6, a face result is illustrated. Panels (a), (b), and (c) depict the input image, the 3D face with correct depth information, and the 3D face with reconstructed depth information, respectively. From the figure one can see that a reasonable result is obtained by the proposed method. However, the difference between panels (b) and (c) is obvious. The difference comes from the fact that the proposed method models the average shape of a number of faces and maps the mean shape onto the input image; therefore, no individuality is reflected in the reconstructed result. Panels (d), (e), and (f) illustrate the true depth image, the depth image estimated without using interpixel information, and the result obtained using interpixel information, respectively. These results show that interpixel information helps to reconstruct a smooth surface. In panel (e), a significant estimation error can be seen under the left eye. The immediate cause of this error is the weakness of the edges on the right side of the nose. In the decoding process the HMM erroneously recognized the nose region as the under-eye region, and the left edge of the nose was treated as the beginning of the nose region. Therefore, the bright region which is supposed to be in the nose region appears under the left eye. Although the underlying problem must be the choice of state transition sequences, panel (f) indicates that the problem is solved to some extent by using the interpixel information. Furthermore, the use of more sophisticated features and/or modeling of the state duration may resolve the problem radically and improve the performance.

Figures 7 and 8 show other examples, of a man with a mustache and of a reconstructed 3D hand, respectively. In those figures, panels (a), (b), and (c) depict the input image, the reconstructed depth image, and the texture-mapped reconstructed 3D shape, respectively. Figure 9 is an example of a car; panels (a), (b), (c), and (d) represent the input image, the correct depth image, the depth image reconstructed without using interpixel information, and the depth image reconstructed using interpixel information. In Fig. 10(a) the texture-mapped 3D car with true depth information is given, and Fig. 10(b) shows the estimated 3D car based on the result in Fig. 9(d). These results show the validity of our proposed method.

Fig. 5. Depth information represented by a subband PHMM. (a) Input image. (b) True depth image. (c) Recovered depth image without interpixel information. (d) Recovered depth image with interpixel information.

Fig. 6. Result of surface recovery. (a) Input image. (b) Texture-mapped true 3D face. (c) Reconstructed 3D face. (d) True depth image. (e) Recovered depth image without interpixel information. (f) Recovered depth image with interpixel information.

Fig. 7. Example of a man with a mustache. (a) Input image. (b) Recovered depth image. (c) Reconstructed 3D face.

The proposed method evaluates the likelihood of the intensity models to select the appropriate model for surface reconstruction. To confirm that the selection of the appropriate model works well, the log likelihoods of each intensity model for several input images are plotted in Fig. 11. Panel (a) shows the result for the car images; the horizontal axis denotes the image index. Panels (b) and (c) are the results for faces and hands, respectively. In every case, the target object model gives the highest log likelihood, which means that the selection of the model works properly. It should be noted that the number of categories is small enough in this experiment that the objects can be recognized correctly. Further research is required for handling a large number of categories, including objects with similar appearance.

4.3. Comparison with the shape from shading method

For an objective performance evaluation, the proposed method was compared with the shape from shading (SFS) method in terms of RMS error. For SFS, the method proposed in Ref. 3 was utilized. The same face database as above was used for the experiment. The results are given in Table 1. From the table one can confirm that the proposed method outperforms SFS in terms of RMS error. Furthermore, the use of interpixel information decreases the RMS error.

Fig. 8. Example of a recovered 3D hand. (a) Input image. (b) Recovered depth image. (c) Reconstructed 3D hand with texture.

Fig. 9. Example of a car. (a) Input image. (b) True depth image. (c) Recovered depth image without interpixel information. (d) Recovered depth image with interpixel information.

Fig. 10. Recovered 3D car. (a) Texture-mapped true 3D car. (b) Reconstructed 3D car.

Fig. 11. Log likelihood of the input images for all models. (a) Cars. (b) Faces. (c) Hands.

4.4. Number of training samples and structure of PHMMs

To see the dependence on the number of training samples and the structure of the PHMMs, we tested the algorithm under various conditions. The face database was used in the experiment. Five PHMMs with different structures were selected. Then the log likelihood for each input intensity image and the RMS error of the reconstructed surface were calculated. The number of training samples ranged from 1 to 100.

• HMM(a): 6 super-states, 5 embedded states, 1 mixture

• HMM(b): 6 super-states, 5 embedded states, 2 mixtures

• HMM(c): 6 super-states, 5 embedded states, 4 mixtures

• HMM(d): 12 super-states, 5 embedded states, 1 mixture

• HMM(e): 3 super-states, 5 embedded states, 4 mixtures

Figure 12 shows the results. In Fig. 12(a) the log likelihood of the input image for each condition is plotted. For every model, the log likelihood increases as the number of training samples increases from 1 to 5, and it tends to converge to a constant level when the number of training samples reaches about 40. It can also be seen that a model with more degrees of freedom gives a higher likelihood. Figure 12(b) illustrates the RMS error of the reconstructed surface. Although the result varies to some extent according to the model, the RMS error decreases as the number of training samples increases in the first stage and then converges to a steady value when the number of training samples reaches about 40. Fluctuation across models appears when the number of training samples is small, since that number is not sufficient for training a model with a high number of degrees of freedom. Next, we tested the performance with various numbers of embedded states while the number of super-states was fixed. The number of training samples was 50. The result is given in Fig. 13. Figures 13(a) and 13(b) show the log likelihood and RMS error, respectively. In every case, one can see that more states give better results, but that more than 10 states does not improve the performance.

Table 1. Comparison with the shape from shading method

Fig. 12. Dependency on the number of training samples. (a) Number of samples versus log likelihood. (b) Number of samples versus RMS error.

Fig. 13. Dependency on the number of embedded states. (a) Number of states versus log likelihood. (b) Number of states versus RMS error.

From these results it can be seen that the RMS error converges as the number of samples increases, independently of the HMM's structure. Although the overall performance improves as the number of degrees of freedom of the model increases, an excessive number of parameters yields little further improvement.

5. Conclusions

This paper has presented an HMM-based surface reconstruction method. The proposed approach is based on knowledge of the object; that is, the mean shape of the object is mapped onto the input image to reconstruct the surface. In order to acquire and represent such knowledge about each object, we have proposed subband PHMMs. Experimental results show the effectiveness of the proposed method.

In the experiments, the poses of the objects were aligned and the lighting conditions were similar for all training samples. This means that reconstruction of surfaces under different conditions, for example a face in profile, is questionable, and this problem should be addressed in the future. It could be resolved to some extent by including different poses and lighting conditions in the training samples or, alternatively, by constructing PHMMs for different poses and lighting conditions. Integration with other cues, such as shading, texture information, and stereo processing, is also a future issue. Finding an optimal structure for the HMMs is an open issue as well.

REFERENCES

1. Matsuyama T, Hisano Y, Imamiya J. Computer vision. Shingijyutsu Communications; 1998. (in Japanese)

2. Horn BKP. Robot vision. MIT Press; 1986.

3. Zhang R, Tsai PS, Cryer JE, Shah M. Shape from shading: A survey. IEEE Trans Pattern Anal Machine Intell 1999;21:690–706.

4. Kuo SS, Agazzi OE. Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models. IEEE Trans Pattern Anal Machine Intell 1994;16:842–848.

5. Nefian AV, Hayes MH III. An embedded HMM-based approach for face detection and recognition. Proc Int Conf on Acoustics, Speech and Signal Processing 1999;6:3553–3556.

6. Brand M, Kettnaker V. Discovery and segmentation of activities in video. IEEE Trans Pattern Anal Machine Intell 2000;22:844–851.

7. Nandy D, Ben-Arie J. Recovery of 3D-face structure using recognition. Proc Int Conf on Pattern Recognition 2000;1:1104–1108.

8. Blanz V, Vetter T. A morphable model for the synthesis of 3D faces. Proc SIGGRAPH 1999, p 187–194.

9. Nagai T, Naruse T, Ikehara M, Kurematsu A. HMM-based surface reconstruction from single images. Proc Int Conf on Image Processing 2002;1:856–859.

10. Fukui Y, Nodera T, Kubota K, Togawa H. Numerical computation. Kyoritsu Shuppan; 1999. (in Japanese)

11. Messer K, Matas J, Kittler J, Luettin J, Maitre G. XM2VTSDB: The extended M2VTS database. Proc Int Conf on Audio and Video-Based Biometric Person Authentication, 1999.

12. OSU Range Image Database, http://sampl.eng.ohio-state.edu/~sampl/data/3DDB/RID/index.html.

13. Expressions Tools Homepage, http://www.ex-tools.co.jp/index.html.

APPENDIX

Here we describe the matrix W in Eq. (17). In Eq. (17), the vector p lists the pixels of the depth image in lexicographic order, and O^1 consists of the O_t^1 of Eq. (16). Since the feature vectors are extracted by shifting the kernels one pixel at a time, W is a matrix whose components are small blocks W_b shifted one by one, as in Eq. (A.1). By Eq. (16), the small block W_b is given by Eq. (A.2).

[Equations (A.1) and (A.2) are not reproduced in this transcript.]
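With 3 × 3 kernels and lexicographic pixel ordering, the block W_b plausibly has the structure sketched below (an assumption reconstructed from the description; the exact zero-run lengths and row order used in the original are not preserved in this transcript):

```latex
W_b =
\begin{bmatrix}
g_{11} & g_{12} & g_{13} & 0_{w-3} & g_{21} & g_{22} & g_{23} & 0_{w-3} & g_{31} & g_{32} & g_{33}\\
l_{11} & l_{12} & l_{13} & 0_{w-3} & l_{21} & l_{22} & l_{23} & 0_{w-3} & l_{31} & l_{32} & l_{33}\\
h_{11} & h_{12} & h_{13} & 0_{w-3} & h_{21} & h_{22} & h_{23} & 0_{w-3} & h_{31} & h_{32} & h_{33}\\
v_{11} & v_{12} & v_{13} & 0_{w-3} & v_{21} & v_{22} & v_{23} & 0_{w-3} & v_{31} & v_{32} & v_{33}\\
0 & 0 & 0 & 0_{w-3} & 0 & 1 & 0 & 0_{w-3} & 0 & 0 & 0
\end{bmatrix}
```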

where g_ij, l_ij, h_ij, and v_ij are the (i, j) components of the Gaussian, Laplacian, horizontal-derivative, and vertical-derivative kernels, respectively, 0_a is a 1 × a vector of zeros, and w is the width of the input image. It should be noted that the matrix W_b has a different form at the border of the image. Since zero-padding is used in our algorithm, the Gaussian kernel becomes as shown in Eq. (A.3) at the image border (the same is true for the other kernels).

Therefore, W_b for p_1 can be written as in Eq. (A.4). [Equations (A.3) and (A.4) are not reproduced in this transcript.]

AUTHORS (from left to right)

Takayuki Nagai received his B.E., M.E., and D.Eng. degrees from the Department of Electrical Engineering, Keio University, in 1993, 1995, and 1997 and became a visiting researcher there. Since 1998, he has been affiliated with the University of Electro-Communications, where he is currently a research associate in the Department of Electronic Engineering. His research interests include digital signal processing and multirate systems.

Masaaki Ikehara received his B.E., M.E., and D.Eng. degrees in electrical engineering from Keio University in 1984, 1986, and 1989 and became a lecturer at Nagasaki University. In 1992, he joined the Faculty of Engineering at Keio University, and is currently an associate professor in the Department of Electronics and Electrical Engineering. His research interests are in the areas of multirate signal processing, wavelet image coding, and filter design problems.

Akira Kurematsu received his B.E. degree in electrical communications engineering from Waseda University in 1961 and joined the Research and Development Laboratories of KDD. In 1971, he received a Ph.D. degree from Waseda University. In 1983, he was appointed deputy director of the KDD R&D labs. From 1986 to 1993, he was president of the ATR Interpreting Telephony Research Laboratories. Since 1993, he has been a professor in the Department of Electronic Engineering, University of Electro-Communications. His research interests are in the areas of speech recognition, intelligent human interface, and spoken language understanding.
