
Two-Point Perspective 3D Modeling from a Single Image: A Tour into the Picture Experience

    Faustinus Kevin Gozali

Department of Electrical & Computer Engineering, Carnegie Mellon University

    [email protected]

Abstract

Single-view geometry is a common practice in computational photography for reconstructing a 3D model from one image. In this paper, we present an interactive method that produces a general two-point perspective tour into the picture from just a single image. Our method focuses on user experience rather than full automation: we require the user to supply the image and to assist the system so that all information about the perspective world depicted in the image is readily available. Since the system is user-driven, the accuracy of the generated models depends on the supplied data. Fortunately, with a few general assumptions, our method is able to produce artistic 3D models through which users can fly in an interactive experience.

Keywords: image-based rendering, vanishing point, single-view geometry

1. Introduction

Imagine a 3D visualization of a scene from everyday life. A large number of 3D models are generated and put together in one big scene, allowing viewers to navigate through it. This is simply a tour in a 3D world, consisting of model generation and fly-through navigation. However, people in the industry spend countless hours modeling such detailed environments, which can be troublesome for most people. The question is: what if someone with little or no background in 3D modeling would like to create a similar tour experience? It would be nice if anyone could reconstruct any scene depicted in their photographs as a fairly accurate 3D model and then navigate through it. It would be nice if they could place markers on their pictures and pass all of this information to a system that generates a correct model for them.

In this paper, we present one method for such reconstruction. All we need is a single image depicting a perspective world. The underlying assumption is that every object in the picture lies on a ground plane, allowing us to select a few reference lines that give the locations of such objects in the actual 3D world. Not all details can be reconstructed perfectly, since the target model assumes vertical walls extruded from the specified reference lines, and we realize that not everything in the world has this vertical-extrusion property; still, our method is able to generate a decent reconstruction in general.

Because of the focus on an interactive user experience, our method requires user input to assist the system. Users only need to provide a few reference lines and points before the system can automatically compute all of the perspective properties of the scene captured in the input image. This differs from other approaches in which the world properties are computed fully automatically.

To further improve the tour experience, our implementation performs simple texture manipulations before the actual 3D model is rendered to the viewer. We allow a few input images of the same scene with different exposures, and combine them into one final texture to be used during rendering. With just this simple idea, the user experience improves: anyone can now create a virtual 3D scene at a different time of day or exposure and navigate through it.

Related Work

Our method is essentially an extension of the simpler one-point perspective tour into the picture using a spidery mesh [5]. We capture the world using a more general assumption, the two-point perspective, since in most cases the world cannot be partitioned into a simple five-sided 3D box. A similar approach to ours is described in [1], where panorama images are used as input, allowing a full 360-degree reconstruction of the scene. There is also an automated method of single-view reconstruction described in [2], in which all perspective properties of the input image are calculated fully automatically. Our method focuses on the interactive experience; hence it is semi-automatic, with the help of user inputs.

2. Method

Our method requires knowledge of all properties of the perspective world in the picture: the horizon line, baselines, camera focal length, etc. We obtain this information with the help of user inputs.

We partition our method into four major steps: vanishing point and horizon calculation, 3D location calculation, surface and texture generation, and 3D modeling. The first two steps require user input; the third and fourth are fully automated, producing a concrete 3D model ready to be navigated. Note that in all steps we assume the camera parameters are fixed.

2.1 Vanishing Points and Horizon

Recall our assumption of a two-point perspective world: two finite vanishing points exist, while the third lies at infinity. All parallel lines converge to one of these two points as seen in the image plane. The horizon line is perfectly horizontal, and all vertical lines stay vertical. Figure 1 illustrates a two-point perspective world containing a 3D box as the object.

Figure 1 – A 3D box viewed in two-point perspective

Also recall our assumptions about the camera and image plane. The image is aligned so that it is parallel to the x- and z-axes [4]; thus, depth is measured along the y-axis. The camera (eye) is located at (0, 0, h), where h denotes the actual camera height in the 3D world. The camera view converges to the camera vanishing point (veye), located at the center of the horizon line. In our implementation, we take h to be the vertical distance from veye to the bottom of the image plane:

$h = \mathit{horizon}_y - \mathit{bottom}_y$

Consequently, the center of the image plane is at (0, f, h) in the real 3D world, where f denotes the focal length of the camera. In our implementation, we use f obtained from the image's EXIF data, converted to pixels.

Semi-Automatic Vanishing Point Calculation

To obtain the two vanishing points, we ask for user input. We assume that the first vanishing point lies to the left of veye, while the second lies to the right. To compute each of them, the user specifies a set of parallel lines in the image that are supposed to converge to a single point. The number of lines provided directly affects the accuracy of the computed vanishing point, so we ask the user to supply as many lines as possible.

Figure 2 – An illustration of parallel lines and their convergence to vanishing points

We then compute the intersection of all these lines using a least-squares method. One approach, suggested by Bob Collins in [6], uses an eigenvalue decomposition to find the best-fit intersection point. Doing this gives us the (x, y) location of each vanishing point, denoted vpl and vpr. We thus obtain the estimated locations of the vanishing points in image coordinates.

Horizon Line

Recall our assumption that the horizon is a perfectly horizontal line. Both vanishing points must lie on the horizon line, but the computed vpl and vpr may have different y components. Therefore, to produce an estimated horizon line, we take the average y component of the two vanishing points:

$\mathit{horizon}_y = \frac{vpl_y + vpr_y}{2}$

Figure 3 – Our system asks the user to input several lines (red) to compute one of the vanishing points and the horizon (yellow)

This method gives us a fairly accurate location of the horizon line. For flexibility, we also allow the user to hand-pick the horizon line if its exact location is already known. All of this gives us the horizon location to be used in the next steps.
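As an illustrative sketch of this step, the intersection of the user-supplied line segments can be estimated by stacking the homogeneous line equations and taking the singular vector associated with the smallest singular value, in the spirit of the eigen-decomposition approach in [6]. The Python/NumPy code below is only a sketch under assumed conventions (pixel coordinates, one call per vanishing point, finite vanishing points); the function names are illustrative and not part of the original implementation.

```python
import numpy as np

def vanishing_point(segments):
    """Estimate a vanishing point from user-drawn line segments.

    segments: list of ((x1, y1), (x2, y2)) pairs in image coordinates.
    Each segment is converted to a homogeneous line l = p1 x p2; the
    vanishing point v minimizes sum_i (l_i . v)^2 subject to ||v|| = 1,
    i.e. the eigenvector of A^T A with the smallest eigenvalue (via SVD).
    """
    lines = []
    for (x1, y1), (x2, y2) in segments:
        p1 = np.array([x1, y1, 1.0])
        p2 = np.array([x2, y2, 1.0])
        lines.append(np.cross(p1, p2))   # homogeneous line through the two points
    A = np.vstack(lines)
    _, _, vt = np.linalg.svd(A)
    v = vt[-1]                           # right singular vector of smallest singular value
    return v[:2] / v[2]                  # dehomogenize; assumes a finite vanishing point

def horizon_row(vpl, vpr):
    """Horizon height as the average y of the two vanishing points."""
    return 0.5 * (vpl[1] + vpr[1])
```

Calling `vanishing_point` once with the left-converging segments and once with the right-converging segments yields vpl and vpr, from which `horizon_row` gives the estimated horizon.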

2.2 3D Location Calculation

To obtain the 3D point coordinates of the scene, we ask the user to specify several baselines, which should lie on the ground plane (z = 0). Here, we assume that baselines always lie below the horizon line. For each baseline point, we need to predict its 3D location, i.e. its depth along the y-axis and its horizontal displacement along the x-axis. We first calculate the depth (3Dy). If we take a side view of the scene, we get something similar to the following diagram:

    Figure 4 – Side view for depth computation

This is in fact very similar to the spidery mesh method described in [1] and [5]. Based on Figure 4, since we know the camera height h, the focal length f, and the vertical distance dy from the point of interest to the horizon line (measured in the input image), we can easily compute 3Dy:

$3D_y = \frac{f \cdot h}{d_y}$

Now that we have the depth of the point of interest, we can compute its horizontal displacement (3Dx). The displacement measured in the image plane is amplified according to the distance from the camera, i.e. the depth; farther points are amplified more. Figure 5 illustrates the distance measures in the image plane. Overall, given 2Dx (the x-location in the image plane), f, veye, and 3Dy, we compute 3Dx as follows:

$3D_x = \frac{(2D_x - v_{eye,x}) \cdot 3D_y}{f}$

Hence, we can obtain the 3D coordinates for all baseline points, denoted by (3Dx, 3Dy, 0).
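A minimal Python/NumPy sketch of the two formulas above, under the stated coordinate assumptions (dy is the image-space distance of the point below the horizon; the function name and argument order are illustrative only, not the original code):

```python
import numpy as np

def baseline_point_3d(u, v, horizon_y, veye_x, f, h):
    """Back-project one baseline pixel (u, v) onto the ground plane z = 0.

    u, v      : pixel coordinates of the baseline point (below the horizon)
    horizon_y : image row of the horizon line
    veye_x    : image column of the camera vanishing point veye
    f         : focal length in pixels
    h         : camera height in world units

    Returns (3Dx, 3Dy, 0), with 3Dy the depth along the viewing direction.
    """
    d_y = abs(v - horizon_y)          # image distance from the point to the horizon
    depth = f * h / d_y               # 3Dy = f * h / dy
    x = (u - veye_x) * depth / f      # 3Dx = (2Dx - veye_x) * 3Dy / f
    return np.array([x, depth, 0.0])
```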

Height Calculation for Vertical Walls

We again ask the user to specify the height of the vertical walls, i.e. the actual 3D vertical distance from the top of each wall to its baseline. To do this, we compute a height scale for each baseline point. This scale is the ratio of the point's distance from the horizon in the image plane to the camera height.

Figure 5 – Top view of the scene for the horizontal displacement calculation

We set the baseline point closest to the camera as the reference. The user specifies the height at that point, and all other heights in the image plane are computed automatically.

    Figure 6 – Specifying heights in our implementation

The actual 3D height is then computed based on this ratio; note that the value is uniform across all baseline points. With this, we obtain the 3D coordinates of all surfaces needed for the reconstruction.

2.3 Texture Generation

There are two types of texture to consider: the ground texture and the vertical wall textures. For the ground texture, we need to define a mask so that it includes only the visible ground region; for wall textures, no special masking is necessary. We also define point correspondences from each target surface to the original image so that textures can be warped properly using a homography matrix calculation.

Ground Texture

To define the mask, we look at the lowest baseline position for each horizontal pixel. For each pixel column, only the region from the bottom of the image up to the baseline is considered.

If no baseline is captured by a column, we consider the region up to the vertical pixel location of the farthest baseline point. Given this, we obtain the mask for our box image:

    Figure 7 – Ground texture before warping
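A rough sketch of this per-column masking, assuming the baselines have already been rasterized into an array giving, for each column, the image row of the lowest baseline (or -1 where no baseline crosses the column); the helper name and its inputs are illustrative only:

```python
import numpy as np

def ground_mask(height, width, baseline_rows, farthest_row):
    """Build a binary ground mask one pixel column at a time.

    baseline_rows : array of length `width`; baseline_rows[c] is the image row
                    of the lowest baseline in column c, or -1 if none crosses it.
    farthest_row  : row of the farthest baseline point, used as the cut-off for
                    columns with no baseline.
    """
    mask = np.zeros((height, width), dtype=bool)
    for c in range(width):
        cut = baseline_rows[c] if baseline_rows[c] >= 0 else farthest_row
        mask[cut:, c] = True    # keep pixels from the baseline down to the image bottom
    return mask
```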

We then need to define our ground surface and its point correspondences. The surface is assumed to be a perfect rectangle on the z = 0 plane. However, we still need to consider the depth of each pixel; this is done by finding the maximum width of the rectangle based on the farthest pixel location of the ground texture. The point correspondence setup must consider the depth as well.

Vertical Wall Texture

Each vertical wall is a vertical rectangle. Its height is already specified by the 3D height calculation in the previous step. The width of this rectangle, however, must be calculated from the 3D (x, y) locations of its endpoints; the width is simply the Cartesian distance between them. Setting up point correspondences is then straightforward. After everything is set up, it is just a matter of finding a homography matrix and warping each texture properly. In our implementation, we introduce a texture scaling factor to speed up the process. In some cases, the resulting geometry is too large, requiring a lot of time just to warp the textures. With this scaling, the target texture dimensions are scaled down, producing a slightly less sharp texture but not really affecting the result.
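The per-surface warping step can be sketched as follows, here using OpenCV's homography estimation and perspective warp as one possible realization (the original implementation is not specified; the corner ordering, the scaling parameter name, and the use of OpenCV are assumptions):

```python
import cv2
import numpy as np

def warp_texture(image, src_pts, dst_size, scale=0.5):
    """Warp a quadrilateral region of `image` into a rectangular texture.

    src_pts  : 4x2 array of surface corners in the input image, ordered
               top-left, top-right, bottom-right, bottom-left
    dst_size : (width, height) of the target surface in world units
    scale    : texture scaling factor; smaller values trade sharpness for speed
    """
    w, h = int(dst_size[0] * scale), int(dst_size[1] * scale)
    dst_pts = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    H, _ = cv2.findHomography(np.asarray(src_pts, dtype=np.float32), dst_pts)
    return cv2.warpPerspective(image, H, (w, h))
```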

2.4 3D Modeling

Given the ground texture and the wall textures, all that is left is to create the 3D model and render it. Since all of the 3D point locations are calculated, it is straightforward to render each surface and apply the appropriate texture. Note that the ground mask has to be applied properly for a correct model to be displayed. Once the model is set up, it is just a matter of rendering it to the screen; the viewer can now navigate through the reconstructed 3D scene based on the single image. Our 3D reconstruction scheme is thus complete. The scene can in fact be enhanced with a little manipulation of the textures, discussed in the next section.

Figure 8 – Reconstructed 3D model from different views

3. Simple Texture Manipulation

We further improve the tour-into-the-picture experience by allowing texture manipulations before the 3D model is constructed. Two simple methods are possible: interpolation and blending. In both cases, multiple images of the exact same view with different exposures are used as the input.

    Figure 9 – Sample images with different exposures

Image Interpolation / Dissolve

For instance, suppose we provide a morning view and a night view of a scene. Using interpolation, we can control the contribution from each image, combine them, and produce a synthetic texture (e.g. an evening view). We only need to control the parameter α, the contribution factor:

$\mathit{texture} = \alpha \cdot \mathit{MorningView} + (1 - \alpha) \cdot \mathit{NightView}$

We then use this "evening view" for our modeling and observe an interesting effect.
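A minimal sketch of this cross-dissolve, assuming the exposures are already aligned and stored as 8-bit NumPy arrays (the function name is illustrative):

```python
import numpy as np

def interpolate_exposures(morning, night, alpha):
    """Cross-dissolve two aligned exposures of the same view.

    alpha = 1 reproduces the morning view, alpha = 0 the night view;
    values in between synthesize an intermediate ("evening") texture.
    """
    blended = alpha * morning.astype(np.float32) + (1.0 - alpha) * night.astype(np.float32)
    return np.clip(blended, 0, 255).astype(np.uint8)
```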

    Figure 10 – Results with texture interpolation

Blending

We can also blend the two images by defining a mask for each. A simple pyramid blending technique is used. With this, we can simulate a half-morning, half-night effect. The 3D model generated with this texture gives the unique experience of seeing the same scene at different exposures all at once. Perhaps combining this with the interpolation method would produce even more interesting textures.
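One common way to realize such pyramid blending is a Laplacian pyramid blend. The sketch below uses OpenCV's pyrDown/pyrUp and is only an illustration of the idea, not necessarily the code used here; the mask convention and level count are assumptions.

```python
import cv2
import numpy as np

def pyramid_blend(img_a, img_b, mask, levels=5):
    """Blend two aligned, same-size images with a soft mask via Laplacian pyramids.

    mask is a float image in [0, 1] selecting img_a; a grayscale mask is
    replicated across channels if the inputs are color images.
    """
    a = img_a.astype(np.float32)
    b = img_b.astype(np.float32)
    m = mask.astype(np.float32)
    if m.ndim == 2 and a.ndim == 3:
        m = cv2.merge([m, m, m])

    # Gaussian pyramids of both images and the mask
    ga, gb, gm = [a], [b], [m]
    for _ in range(levels):
        ga.append(cv2.pyrDown(ga[-1]))
        gb.append(cv2.pyrDown(gb[-1]))
        gm.append(cv2.pyrDown(gm[-1]))

    # Laplacian pyramids (the coarsest level keeps the Gaussian residual)
    la = [ga[i] - cv2.pyrUp(ga[i + 1], dstsize=ga[i].shape[1::-1]) for i in range(levels)] + [ga[-1]]
    lb = [gb[i] - cv2.pyrUp(gb[i + 1], dstsize=gb[i].shape[1::-1]) for i in range(levels)] + [gb[-1]]

    # Blend each level with the corresponding mask level, then collapse
    blended = [gm[i] * la[i] + (1 - gm[i]) * lb[i] for i in range(levels + 1)]
    out = blended[-1]
    for i in range(levels - 1, -1, -1):
        out = cv2.pyrUp(out, dstsize=blended[i].shape[1::-1]) + blended[i]
    return np.clip(out, 0, 255).astype(np.uint8)
```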

    Figure 11 – Results with texture blending

4. Gallery

Here we present some of our reconstruction results. For each set, the first row shows the inputs and the second row shows the results. All of these can be found at http://www.cs.cmu.edu/afs/andrew/scs/cs/15-463/2006/pub/www/projects/fproj/fkg/ .

    Figure 12 – Building Scene

    Figure 13 – Indoor Bridge Scene

    Figure 14 – Indoor Scene

    Figure 15 – Nature Scene

5. Conclusion

A two-point perspective tour into the picture based on single-view reconstruction indeed allows us to present an interactive user experience. By assisting the system with a few inputs, anyone can easily generate a 3D model from a single photograph and navigate through it. Our method constructs fairly realistic models, provided the image depicts the two-point perspective world accurately enough. However, the accuracy of the model really depends on the accuracy of the user inputs; a more automated method could be applied to tolerate this human-error factor. Furthermore, more advanced texture manipulation techniques could be applied to produce more variants of the same scene. All of this is geared toward a fascinating user experience as well as appealing reconstructions. Of course, our method could also be extended to multi-view geometry, where multiple input images work together to produce more complete scenes. This is something to consider in a future implementation.

References

[1] Chu, Siu-Hang. Animating Chinese Landscape Paintings and Panoramas. A thesis submitted to the Hong Kong University of Science and Technology, August 2001.

[2] Hoiem, Derek; Efros, Alexei A.; Hebert, Martial. Automatic Photo Pop-up. Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA. http://www.cs.cmu.edu/~dhoiem/projects/popup

[3] Single View Reconstruction. Lecture slides. http://graphics.cs.cmu.edu/courses/15-463/2006_fall/www/Lectures/SingleViewReconstruction.pdf

[4] Perspective Drawing. An online tutorial. http://www.lems.brown.edu/vision/people/leymarie/SkiP/May98/Boehm1.html

[5] Horry, Yoichi; Anjyo, Ken-ichi; Arai, Kiyoshi. Tour Into the Picture: Using a Spidery Mesh Interface to Make Animation from a Single Image. Hitachi, Ltd.

[6] Collins, Bob. A guide to compute vanishing points. http://www.cs.cmu.edu/~ph/869/www/notes/vanishing.txt