
CG2Real: Improving the Realism of Computer Generated Images using a Large Collection of Photographs

Micah K. Johnson, Member, IEEE, Kevin Dale, Student Member, IEEE, Shai Avidan, Member, IEEE, Hanspeter Pfister, Senior Member, IEEE, William T. Freeman, Fellow, IEEE, and Wojciech Matusik

Abstract—Computer-generated (CG) images have achieved high levels of realism. This realism, however, comes at the cost of long and expensive manual modeling, and often humans can still distinguish between CG and real images. We introduce a new data-driven approach for rendering realistic imagery that uses a large collection of photographs gathered from online repositories. Given a CG image, we retrieve a small number of real images with similar global structure. We identify corresponding regions between the CG and real images using a mean-shift cosegmentation algorithm. The user can then automatically transfer color, tone, and texture from matching regions to the CG image. Our system only uses image processing operations and does not require a 3D model of the scene, making it fast and easy to integrate into digital content creation workflows. Results of a user study show that our hybrid images appear more realistic than the originals.

Index Terms—Image enhancement, image databases, image-based rendering.


1 INTRODUCTION

The field of image synthesis has matured to the point where photo-realistic computer-generated (CG) images can be produced with commercially available software packages (e.g., RenderMan and POV-Ray). However, reproducing the details and quality of a natural image requires a considerable amount of time by a skilled artist. Even with large budgets and many man-hours of work, it is sometimes surprisingly easy to distinguish CG images from photographs.

CG images differ from photographs in three major ways. First, color distributions of CG images are often overly saturated and exaggerated. Second, multi-scale image statistics, such as histograms of filter outputs at different scales, rarely match the statistics of photographs. Finally, CG images often lack details (e.g., high frequencies, texture, and noise) and consequently look too pristine.

To improve realism in computer graphics, rather than introduce more detailed geometric models and complex textures, this work proposes a new rendering pipeline that leverages a large collection of photographs. Large image collections can be readily acquired from photo-sharing sites, such as Flickr, and such collections have been the basis for data-driven methods for improving photographs [1]. This work demonstrates that large image collections can be used in the context of computer graphics to synthesize realistic imagery without the need for complex models.

• M.K. Johnson and W.T. Freeman are with the Massachusetts Institute of Technology. E-mail: {kimo, billf}@mit.edu.
• K. Dale and H. Pfister are with Harvard University. E-mail: {kdale, pfister}@seas.harvard.edu.
• S. Avidan is with Tel Aviv University and Adobe Systems. E-mail: [email protected].
• W. Matusik is with Disney Research, Zurich, Switzerland. E-mail: [email protected].

CG2Real takes a CG image, retrieves and aligns a small number of similar photographs from a database, and transfers the color, tone, and texture from the real images to the CG image. A key ingredient in the system is a mean-shift cosegmentation algorithm that matches regions in the CG image with regions in the real images. After cosegmentation, we transfer real-image textures to the CG image and perform local color and tone transfers between image regions. Local transfers offer an improvement in quality over global transfers based on histogram matching. Color and tone transfers are completely automatic, and texture transfer can be controlled by adjusting a few parameters. In addition, all operations are reasonably fast: an average computer can run the cosegmentation and all three transfer operations in less than 15 seconds for a 600 × 400 pixel image.

CG2Real is useful in a variety of artistic scenarios. In its current form, the system can synthesize natural scenes using a database of outdoor images. The CG image is used as a template that gets filled in with realistic textures. The user can also choose to preserve structures in the CG image. For example, a user working on a 3D model may wish to render it with a photorealistic background. This problem occurs in architectural modeling where an architect has a detailed model for a house but not for the surroundings. In this scenario, CG2Real can be configured to preserve the rendered model and synthesize a background using real image textures (e.g., grass, trees, and sky). CG2Real allows a user to control the amount of image content to be replaced by real textures, enabling artists to create hybrid images and enabling researchers to study the cues important to the perception of realism.

Fig. 1. Given an input CG image (left), our system finds the most similar photographs (not shown) to the input image. Next, the system identifies similar regions between the CG image and photographs, transfers these regions into the CG image (center), and uses seamless compositing to blend the regions. Finally, it transfers local color and gradient statistics from the photographs to the input image to create a color and tone adjusted image (right).

The primary contribution of this paper is a novel data-driven approach for rendering realistic imagery based on a CG input. Within this system, several individual operations also further the state of the art, including (1) an improved image search tuned for matching global image structure between CG and real images; (2) an image cosegmentation algorithm that is both fast and sufficiently accurate for color and tone transfer; and (3) methods for local transfer of color, tone, and texture that take advantage of region correspondences. As a final contribution, we describe several user studies that demonstrate that our hybrid images appear more realistic than the originals, providing insight into the cues that people use to distinguish CG from real.

2 PREVIOUS WORK

The appearance of a computer-generated image can be improved by adding realistic texture. In their seminal work, Heeger and Bergen [2] proposed a novel approach for synthesizing textures. Their method starts with a random noise image and iteratively adjusts its statistics at different scales to match those of the target texture, leading to new instances of the target texture. This approach was later extended by De Bonet [3] to use joint multi-scale statistics. Alternatively, one can take an exemplar-based approach to texture synthesis. This idea was first illustrated in the work of Efros and Leung [4] and was later extended to work on patches instead of pixels [5], [6]. The image analogies framework [7] extends non-parametric texture synthesis by learning a mapping between a given exemplar pair of images and applying the mapping to novel images. Freeman et al. [8] proposed a learning-based approach to solve a range of low-level image processing problems (e.g., image super-resolution) that relies on having a dictionary of corresponding patches that is used to process a given image.

Unfortunately, these approaches require correspondence between the source and target images (or patches), a fairly strong assumption that cannot always be satisfied. Rosales et al. [9] relaxed this assumption by framing the problem as a large inference problem, where both the position and appearance of the patches are inferred from a pair of images without correspondence. While the results look convincing for a variety of applications, the specific problem of improving realism in CG images was not addressed.

Instead of requiring corresponding images (or patches) to learn a mapping, one can take a global approach that transfers color or style between images. Reinhard et al. [10] modified the color distribution of an image to give it the appearance of another image. They showed results on both photographic and synthetic images. Alternatively, Pitie et al. [11] posed color transfer as a problem of estimating a continuous N-dimensional transfer function between two probability distribution functions and presented an iterative non-linear algorithm. Bae et al. [12] used a two-scale nonlinear decomposition of an image to transfer style between images. In their approach, histograms of each layer were modified independently and then recombined to obtain the final output image. Finally, Wen et al. [13] demonstrated a stroke-based interface for performing local color transfer between images. In this system, the user provides the target image and additional input in the form of stroke pairs.

We build on and extend this line of work with several important distinctions. First, the works discussed so far do not consider how the model images are chosen, and instead assume that the user provides them. We believe a system should be able to obtain model images with a minimum of user assistance. Given a large collection of photographs, we find images with similar global structure (e.g., trees next to mountains below a blue sky). Because the photographs are semantically and contextually similar to the CG image, we can find corresponding regions using cosegmentation, and thus can more easily apply local style transfer methods to improve realism.

Fig. 2. An overview of our system. We start by querying a large collection of photographs to retrieve the most similar images. The user selects the k closest matches and the images, both real and CG, are cosegmented to identify similar regions. Finally, local style transfer algorithms use the real images to upgrade the color, tone, and/or texture of the CG image.

Recently, several authors have demonstrated the use of large collections of images for image editing operations. The system of Hays and Efros [1] uses a large collection of images to complete missing information in a target image. Their system retrieves a number of images that are similar to the query image and uses them to complete a user-specified region. We take a different approach by automatically identifying matching regions and by stitching together regions from multiple images. Liu et al. [14] perform example-based image colorization using images from the web that is robust to illumination differences. However, their method involves image registration between search results and input and requires exact scene matches. Our approach instead uses an image search based on image data, and our transfers only assume similar content between CG input and real search results. Finally, Dale et al. [15] show that large image collections can be used to find context-specific priors for images and demonstrate the utility of these priors for global corrections to exposure, contrast and white balance. In this work, we consider CG images as input and address local transfers of color, tone and texture.

In work based on annotated image datasets, Lalonde and Efros [16] use image regions drawn from the LabelMe database [17] to populate the query image with new objects. Johnson et al. [18] allow the user to create novel composite images by typing in a few nouns at different image locations. A similar system called Sketch2Photo [19] takes sketches with text labels and synthesizes an image from the sketch. These systems rely on image annotations to identify image regions and region correspondences. They can produce decent results when replacing well-defined objects, but are more difficult to use for replacing image regions that cannot be defined by a simple search term. In contrast, our approach uses an image-based search descriptor with an automatic cosegmentation algorithm for identifying local regions and inter-region correspondences.

Researchers have also studied the characteristics of natural versus synthetic images. For example, in digital forensics, Lyu and Farid [20] examine high-order image statistics to distinguish between synthetic and natural images. Lalonde and Efros [21] use color information to predict if a composite image will look natural or not. Others have focused solely on learning a model for the statistics of natural images [22], [23]. These works suggest that natural images have relatively consistent statistical properties and that these properties can be used to distinguish between synthetic and natural images. Based on this observation, our color and tone transfer algorithms work statistically, adjusting color and gradient distributions to match corresponding distributions from real images.

3 IMAGE AND REGION MATCHING

Fig. 2 shows an overview of our system. First, we retrieve the N closest real images to the query CG image. The N images are shown to the user, who selects the k most relevant images; typically, N = 30 and k = 5. Next, we perform a cosegmentation of the k real images with the CG image to identify similar image regions. Once the images are segmented, the user chooses among three different types of transfer from the real images to the CG image: texture, color and tone. We find that all three types of style transfer can improve the realism of low-quality CG images.


3.1 Image Database

Our system leverages a database of 4.5 million natural images crawled from the photo-sharing site Flickr using keywords related to outdoor scenes, such as ‘beach’, ‘forest’, ‘city’, etc. Each image, originally of Flickr’s large size with a maximum dimension of 1024 pixels, was downsampled to approximately 75% its original size and stored in PNG format (24-bit color) to minimize the impact of JPEG compression artifacts on our algorithms.

Similar to other work that uses large image collections [1], [24], we chose to focus on a specific scene class; in our case, outdoor scenes. We have found that the performance of our system improves with larger database sizes, and while it is straightforward to build a large database for a targeted image class, it is still computationally impractical to build a large database that densely samples the space of imagery. While our techniques could work on other scene types (e.g., indoor), the variability would require a much larger database to yield usable matching images [15].

To search the database for matching images, we need an efficient, yet descriptive, image representation. The gist scene descriptor [25] is one choice of representation that has been used successfully for image matching tasks [1]. The gist descriptor uses histograms of Gabor filter responses at a single level. We used gist in an early implementation of our system and were not fully satisfied with the results. In a recent study, Gabor-based descriptors, such as gist, were outperformed by SIFT-based descriptors for texture classification [26], justifying our decision to use a more detailed image representation.

Our representation is based on visual words, or quantized SIFT features [27], and the spatial pyramid matching scheme of Lazebnik et al. [28]. This approach has been shown to perform well for semantic scene classification and for finding context-specific priors for real images [15]. To build a descriptor robust to differences between CG and real images, we use a descriptor with significant spatial resolution that favors global structural alignment, and we use small visual word vocabularies to coarsely quantize appearance.

Specifically, we use two vocabularies of 10 and 50 words and grid resolutions of 1 × 1, for the 10-word vocabulary, and 1 × 1, 2 × 2, 4 × 4, and 8 × 8, for the 50-word vocabulary, for a final pyramid descriptor with 4260 elements. This representation has some redundancy, since a visual word will occur multiple times across pyramid levels. The weighting scheme specified by the pyramid match kernel [29] accounts for this; it also effectively provides term-frequency (tf) weighting. We also apply inverse document frequency (idf) weighting to the pyramid descriptor.
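To make the construction above concrete, the following sketch (our own illustration, not the authors' code) builds a spatial pyramid from a precomputed visual-word map and applies idf weighting afterwards. The helper names and the exact per-level weights are assumptions; the weights simply follow the pyramid match kernel idea of down-weighting coarser grids.

```python
import numpy as np

def spatial_pyramid(word_map, vocab_size, grids=(1, 2, 4, 8)):
    """Concatenate per-cell visual-word histograms over several grid
    resolutions. word_map is an HxW array of visual-word indices for one
    image and one vocabulary. Per-level weights (finer grids count more)
    follow the pyramid match kernel idea; the exact values are our guess."""
    h, w = word_map.shape
    parts = []
    for level, g in enumerate(grids):
        weight = 2.0 ** (level - len(grids) + 1)   # e.g. 1/8, 1/4, 1/2, 1
        for i in range(g):
            for j in range(g):
                cell = word_map[i * h // g:(i + 1) * h // g,
                                j * w // g:(j + 1) * w // g]
                hist = np.bincount(cell.ravel(), minlength=vocab_size)
                parts.append(weight * hist)
    return np.concatenate(parts).astype(np.float64)

def idf_weights(descriptors):
    """Inverse document frequency over a stack of descriptors (rows = images)."""
    df = (descriptors > 0).sum(axis=0)
    return np.log(descriptors.shape[0] / (1.0 + df))

# A 10-word vocabulary at 1x1 plus a 50-word vocabulary at 1, 2, 4, 8 grids
# gives 10 + 50 * (1 + 4 + 16 + 64) = 4260 elements, matching the text.
```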

In addition, we represent a rough spatial layout of color with an 8 × 8 downsampled version of the image in CIE L*a*b* space (192 elements). Since the search is part of an interactive system, we use principal component analysis (PCA) to reduce the descriptor dimensionality to allow for an efficient in-core search. We keep 700 elements for the pyramid term and 48 for the color term and L2-normalize each. The final descriptor is the concatenation of the spatial pyramid and color terms, weighted by α and (1 − α), respectively, for α ∈ [0, 1]. Similarity between two images is measured by Euclidean distance between their descriptors.

Fig. 3. Results from our cosegmentation algorithm. In each row, the CG image is shown on the left and two real image matches, on the right. Note that in all cases, segment correspondences are correct, and the images are not over-segmented.

For low quality CG images, texture is only a weak cue, so smaller α values achieve a better balance of color versus texture cues. We found that presenting the user with 15 results obtained with α = 0.25 and 15 with α = 0.75 yielded a good balance between the quality of matches, robustness to differences in fidelity of CG inputs, and time spent by the user during selection. We use a kd tree-based exact nearest-neighbor search, which requires about 2 seconds per query on a 3 GHz dual-core machine.
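As a rough sketch of this search step (our own illustration; the 700- and 48-element sizes and the α weighting come from the text, while the function names and the use of SciPy's kd-tree are assumptions), the weighted descriptor and the query could look like this:

```python
import numpy as np
from scipy.spatial import cKDTree

def combined_descriptors(pyramid_pca, color_pca, alpha):
    """pyramid_pca: (n_images, 700) PCA-reduced spatial pyramid terms.
    color_pca: (n_images, 48) PCA-reduced 8x8 L*a*b* layout terms.
    Each term is L2-normalized, then weighted by alpha and (1 - alpha)."""
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    return np.hstack([alpha * l2norm(pyramid_pca),
                      (1.0 - alpha) * l2norm(color_pca)])

def query_similar(db_desc, query_desc, n=15):
    """Exact Euclidean nearest neighbours via a kd-tree, as in the text."""
    tree = cKDTree(db_desc)
    dist, idx = tree.query(query_desc, k=n)
    return idx

# Presenting 15 results with alpha = 0.25 and 15 with alpha = 0.75 then
# amounts to two queries against two differently weighted descriptor sets.
```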

While more complex than gist [25], we find that our descriptor consistently returns better images for the same query image. To quantify the improvement in search results, we conducted a user study where the users were asked to judge the top 20 search results from each algorithm. The full details of this study are described in Section 6.2.

3.2 Cosegmentation

Global transfer operations between two images, such as color and tone transfer, work best when the images have similarly-sized regions, e.g., when there are similar amounts of sky, ground, or buildings. If the images have different regions, or if one image contains a large region that is not in the other image, global transfers can fail. Similar to Tai et al. [30], we find that segmenting the images and identifying regional correspondences before color transfer greatly improves the quality and robustness of the results. But in contrast to their work, we use cosegmentation to segment and match regions in a single step. This approach is better than segmenting each image independently and matching regions after the fact because the content of all images is taken into account during cosegmentation and matching regions are automatically produced as a byproduct [31].

Fig. 4. Transferring image color using cosegmentation. On the left are CG images and real images that serve as color models; white lines are superimposed on the images to denote the cosegmentation boundaries. On the right are results from two color transfer algorithms: a global algorithm based on N-dimensional histogram matching [11], and our local color transfer algorithm. In the top example, the global result has a bluish color cast. In the bottom example, the global result swaps the colors of the building and the sky. Local color transfer yields better results on both examples.

The cosegmentation approach of Rother et al. [31] uses an NP-hard energy function with terms to encode both spatial coherency and appearance histograms. To optimize it, they present a novel scheme that they call trust-region graph cuts. It uses an approximate minimization technique to obtain an initial estimate and then refines the estimate in the dual space to the original problem.

While our goal is similar to that of Rother et al., we use the simple approach of Dale et al. [15] that is based on the mean-shift framework [32]. We define a feature vector at every pixel p that is the concatenation of the pixel color in L*a*b* space, the normalized x and y coordinates at p, and a binary indicator vector (i_0, ..., i_k) such that i_j is 1 when pixel p is in the jth image and 0 otherwise. Note that the problem of segmenting a set of related images is different from the problem of segmenting video: there is no notion of distance across the image index dimension as there is in a video stream (i.e., there is no time dimension). Thus, the final components of the feature vector only differentiate between pixels that come from the same image versus those that come from different images.

The distance between feature vectors at pixels p_1 and p_2 in images I_j and I_k is a weighted Euclidean distance:

$$ d(p_1, I_j, p_2, I_k)^2 = \sigma_c^2\, d_c(I_j(p_1), I_k(p_2))^2 + \sigma_s^2\, d_s(p_1, p_2)^2 + \sigma_b^2\, \delta(j - k), \qquad (1) $$

where d_c(I_j(p_1), I_k(p_2)) is the L*a*b* color distance between pixel p_1 in image I_j and pixel p_2 in image I_k, and d_s(p_1, p_2) is the spatial distance between pixels p_1 and p_2. The delta function encodes the distance between the binary components of the feature vector.

The scalars σ_c, σ_s and σ_b in Eqn. 1 serve as weights to balance the color, spatial, and binary index components. The default values are σ_c = 0.4, σ_s = 0.5 and σ_b = 0.1, and we find that the weights and the mean-shift bandwidth parameter do not need to be adjusted for individual image sets to achieve the types of segmentations that are useful to our color and tone transfer algorithms.

A disadvantage of mean-shift is that it can be costly to compute at every pixel of an image without using specific assumptions about feature vectors or kernel [33]. Since we are after coarse regional correspondences, we reduce the size of the image by a factor of 8 along each dimension and use a standard mean-shift algorithm with the feature vectors described above. We then upsample the cosegmentation maps to full resolution using joint bilateral upsampling [34].
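A minimal sketch of this step, assuming scikit-image and scikit-learn as stand-ins for the authors' implementation (the bandwidth value is our guess, and the joint bilateral upsampling of the label maps is omitted):

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.transform import resize
from sklearn.cluster import MeanShift

def cosegment(images, sigma_c=0.4, sigma_s=0.5, sigma_b=0.1, bandwidth=0.3):
    """Cluster pixels from all images jointly. Each feature is
    [sigma_c * L*a*b*, sigma_s * (x, y), sigma_b * image indicator], so that
    Euclidean distances reproduce the weighted distance of Eqn. 1. Images are
    downsampled by 8 first; upsampling the label maps is not shown."""
    k = len(images)
    feats, shapes = [], []
    for j, img in enumerate(images):
        small = resize(img, (img.shape[0] // 8, img.shape[1] // 8),
                       anti_aliasing=True)
        lab = rgb2lab(small) / 100.0                  # roughly [0, 1] range
        h, w = small.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        pos = np.dstack([xs / float(w), ys / float(h)])
        ind = np.zeros((h, w, k))
        ind[..., j] = 1.0                             # which image the pixel is from
        f = np.dstack([sigma_c * lab, sigma_s * pos, sigma_b * ind])
        feats.append(f.reshape(-1, f.shape[-1]))
        shapes.append((h, w))
    labels = MeanShift(bandwidth=bandwidth,
                       bin_seeding=True).fit_predict(np.vstack(feats))
    maps, start = [], 0
    for h, w in shapes:                               # split back per image
        maps.append(labels[start:start + h * w].reshape(h, w))
        start += h * w
    return maps
```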

In Fig. 3, we show three cosegmentation results, each with three images (one CG, two real). In the first two cases, the algorithm segments the images into sky and non-sky. In the last case, the images are segmented into three regions: ground, sky, and water. In all cases, the segment correspondences are correct, and although our color and tone transfer algorithms are robust to it, the images have not been over-segmented.

4 LOCAL STYLE TRANSFER OPERATORS

After cosegmentation, we apply local style transfer operations for color, tone and texture.


Fig. 5. Transferring image tone. From left to right: an input CG image, a real image that will serve as a color and tone model, the result of region-based transfer of subband distributions, and a close-up view of before and after tone transfer.

4.1 Local Color Transfer

The simplest style transfer is color transfer, where colors of the real images are transferred to the CG image by transferring the statistics of a multi-dimensional histogram. This method was shown to work quite well for color transfer between real images, but it often fails when applied to CG images. The main difficulty is that the color histogram of CG images is typically different from the histogram of real images: it is much more sparse (fewer colors are used). The sparsity and simplicity of the color distributions can lead to instability during global transfer where colors are mapped arbitrarily, as shown in the bottom row of Fig. 4.

We mitigate these problems by a combination of joint bilateral upsampling and local color transfer. We downsample the images, compute the color transfer offsets per region from the lower resolution images, and then smooth and upsample the offsets using joint bilateral upsampling. Working on regions addresses the problem of images that contain different proportions of colors, and joint bilateral upsampling smooths color transfer in the spatial domain.

Within each sub-sampled region, our color transfer algorithm uses 2D histogram matching on the a* and b* channels, and 1D histogram matching on the L* channel. The advantage of histogram matching methods is that they do not require per-pixel correspondences, which we do not have. Unfortunately, unlike 1D histogram matching, there is no closed form solution for 2D histogram transfer. We use an iterative algorithm by Pitie et al. [11] that projects the 2D histogram onto random 1D axes, performs standard 1D histogram matching, and reprojects the data back to 2D. The algorithm typically converges in fewer than 10 iterations. We found that marginalizing the distributions and performing the remapping independently for the a* and b* channels produces inferior results.
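A sketch of the per-region remapping described above, assuming flattened per-region pixel samples as input (our own simplified illustration, not the authors' code): the 1D step is standard quantile matching, and the 2D step follows the random-projection idea of Pitie et al. [11].

```python
import numpy as np

def match_1d(src, ref):
    """Classic 1-D histogram matching via quantile mapping."""
    ranks = np.argsort(np.argsort(src))
    quantiles = (ranks + 0.5) / src.size
    return np.interp(quantiles, np.linspace(0.0, 1.0, ref.size), np.sort(ref))

def match_2d(src_ab, ref_ab, n_iter=10, seed=0):
    """Iterative 2-D distribution transfer in the spirit of Pitie et al. [11]:
    project the (N, 2) a*b* samples of a region onto randomly rotated axes,
    match each 1-D marginal, rotate back, and repeat."""
    rng = np.random.default_rng(seed)
    out = src_ab.astype(float).copy()
    ref = ref_ab.astype(float)
    for _ in range(n_iter):
        theta = rng.uniform(0.0, np.pi)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        p_src, p_ref = out @ R, ref @ R        # project onto rotated axes
        for d in range(2):
            p_src[:, d] = match_1d(p_src[:, d], p_ref[:, d])
        out = p_src @ R.T                      # rotate back
    return out
```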

In Fig. 4, we show two examples of color transfer. From left to right, we show the original CG images, the real images used as color models, the results of global N-dimensional color transfer (in L*a*b* space), and results of our region-based transfer. In the top example, the global algorithm produces a blue color cast over the entire image because the real image has significantly more sky than the CG image. In the bottom example, the house and sky are influenced by the opposite regions in the model image: the house becomes blue and the sky becomes a neutral gray. These problems are avoided by local color transfer.

4.2 Local Tone Transfer

In addition to transferring the color histogram from the photographs to the CG image we are also interested in adjusting gradient histograms at different scales. To this end, we apply a method similar to Bae et al. [12] to match filter statistics of the luminance channel of the CG image to the photographs. Our method, however, transfers detail locally within cosegmentation regions and uses a 4-level pyramid based on a quadrature mirror filter [35] instead of a bilateral filter.

Our tone transfer algorithm is similar to the algorithm for color transfer. First, we decompose the luminance channel of the CG image and one or more real images using a QMF pyramid. Next, we use 1D histogram matching to match the subband statistics of the CG image to the real images in every region. After transferring on subbands, we model the effect of histogram transfer on subband signals as a change in gain:

$$ s'_i(p) = g_i(p)\, s_i(p), \qquad (2) $$

where s_i(p) is the level-i subband coefficient at pixel p, s'_i(p) is the corresponding subband coefficient after regional histogram matching, and g_i(p) is the gain. Gain values greater than one will amplify detail in that subband and gain values less than one will diminish detail.


To avoid halos or other artifacts, we employ the gain-scaling strategies described by Li et al. [36] to ensure that lower subbands are not amplified beyond higher subbands and that the gain signals are smooth near zero-crossings.
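The gain bookkeeping of Eqn. 2 could be sketched as follows, with simple clamping and smoothing standing in for the full gain-scaling rules of Li et al. [36]; the clamp limit, smoothing sigma, and epsilon are our assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def apply_subband_gain(s, s_matched, eps=1e-4, g_max=4.0, sigma=2.0):
    """s: original subband of the CG luminance; s_matched: the same subband
    after regional 1D histogram matching. The gain g = s'/s (Eqn. 2) is
    computed on magnitudes, clamped, and smoothed so that detail near
    zero-crossings is not amplified wildly, then reapplied to s."""
    gain = (np.abs(s_matched) + eps) / (np.abs(s) + eps)
    gain = np.clip(gain, 1.0 / g_max, g_max)
    gain = gaussian_filter(gain, sigma)   # smooth the gain map, not the signal
    return gain * s
```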

The results of color and tone transfer are shown in Fig. 5. As can be seen, the appearance of the CG image changes subtly. Since color and tone transfers do not fundamentally change the structure of the image, they can be used even when the image matches returned from the database are poor or when the CG image is already close to being photorealistic.

4.3 Local Texture Transfer

In addition to color and tone transfer, we also transfer texture from photographs, which improves realism especially for low-quality CG images. This is different from texture synthesis, where the goal is to synthesize more of the same texture given an example texture. In our case, we do not want to reuse the same region many times because this often leads to visual artifacts in the form of repeated regions. Instead, we rely on the k similar photographs we retrieved from the database to provide us with a set of textures to help upgrade the realism of the CG image.

We start local texture transfer by aligning the CG image with the real images. Transferring texture on a region-by-region basis, as was done for color and tone, does not work well because region boundaries do not always correspond to strong edges in the image. As a result, slightly different textures can be transferred to neighboring regions, leading to noticeable visual artifacts. To reduce these artifacts, we align multiple shifted copies of each real image to the different regions of the CG image and transfer textures using graph-cut. The result is a coherent texture transfer that respects strong scene structure.

We perform the region-based alignment as follows. For each cosegmented region in the CG image, we use cross-correlation of edge maps (magnitudes of gradients) to find the real image, and the optimal shift, that best matches the CG image for that particular region. We repeat the process in a greedy manner until all regions in the CG image are completely covered. To reduce repeated textures, we only allow up to c shifted copies of an image to be used for texture transfer (typically c = 2).
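As an illustration of the alignment step for a single region and candidate image (our own sketch; the masking details and the greedy loop over regions are simplified), cross-correlating gradient-magnitude maps gives the best shift:

```python
import numpy as np
from scipy.ndimage import sobel
from scipy.signal import fftconvolve

def edge_map(gray):
    """Gradient magnitude, normalized to [0, 1]."""
    g = np.hypot(sobel(gray, axis=0), sobel(gray, axis=1))
    return g / (g.max() + 1e-12)

def best_shift(cg_gray, real_gray, region_mask):
    """Score every translation of the real image against the CG image's
    edges inside one cosegmentation region and return the best (dy, dx)."""
    e_cg = edge_map(cg_gray) * region_mask
    e_re = edge_map(real_gray)
    corr = fftconvolve(e_cg, e_re[::-1, ::-1], mode='same')  # cross-correlation
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return dy - cg_gray.shape[0] // 2, dx - cg_gray.shape[1] // 2, corr.max()
```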

Fig. 6. CG image (upper left) with regions transferred from three reference photographs (upper right to lower right). The composite after Poisson blending and color and tone transfer (bottom left).

Once the alignment step is over we have a set of ck shifted real images that we can now use for texture transfer. We model the problem as label assignment over a Markov random field (MRF) and solve it using graph cuts. The set of labels at each pixel location consists of up to ck + 1 labels, corresponding to c shifted versions of each of the k real images, as well as a copy of the CG image, in case no matching texture was found. We look for the best label assignment to optimize an objective function C(L) that consists of a data term C_d over all pixels p and an interaction term C_i over all pairs of pixels p, q. Specifically:

$$ C(L) = \sum_p C_d(p, L(p)) + \sum_{p,q} C_i(p, q, L(p), L(q)) \qquad (3) $$

where the data penalty term C_d(p, L(p)), which measures the distance between a 3 × 3 patch around pixel p in the CG image and a real image, is given by:

$$ C_d(p, L(p)) = \alpha_1 D_c(p, L(p)) + \alpha_2 D_g(p, L(p)) + \alpha_3 D_l(p, L(p)). \qquad (4) $$

The term D_c(p, L(p)) is the average distance in L*a*b* space between the 3 × 3 patch centered around pixel p in the CG image and the patch centered around pixel p in the image associated with label L(p). Similarly, D_g(p, L(p)) is the average distance between the magnitudes of the gradients of the patches. Note that both of these terms are zero when the label L(p) corresponds to the CG image.

The region label term D_l(p, L(p)) in Eqn. 4 controls the error of transferring textures between different cosegmentation regions and also provides an error for choosing the original CG image as a source of texture:

$$ D_l(p, L(p)) = \begin{cases} \gamma & L(p) = \text{CG label} \\ 0 & p \text{ and } L(p) \text{ from the same region} \\ 1 & \text{otherwise} \end{cases} \qquad (5) $$

If the distance for all of the real images is greater than α_3 γ, the graph-cut algorithm will prefer keeping the CG texture.

The weights in Eqn. 4 balance the contributions of the three terms and they are normalized, Σ_i α_i = 1. The default values of 0.2, 0.7, and 0.1 give more weight to the gradient term than the other terms, but these weights can be adjusted by the user.
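A sketch of the per-pixel data cost of Eqns. 4 and 5 for one candidate label, computed densely with box filters (our own illustration; the authors' implementation details are not given in the text, and the warped candidate segment map is assumed to be available):

```python
import numpy as np
from scipy.ndimage import uniform_filter, sobel

def data_cost(cg_lab, cand_lab, cg_seg, cand_seg, is_cg_label,
              alphas=(0.2, 0.7, 0.1), gamma=1.0):
    """Per-pixel C_d of Eqn. 4 for one label: a 3x3 color distance D_c, a 3x3
    gradient-magnitude distance D_g, and the region term D_l of Eqn. 5.
    cand_* hold the shifted candidate image and its segment map."""
    a1, a2, a3 = alphas
    if is_cg_label:
        # D_c and D_g are zero for the CG label; D_l is the constant gamma.
        return a3 * gamma * np.ones(cg_seg.shape)
    d_c = uniform_filter(np.linalg.norm(cg_lab - cand_lab, axis=2), size=3)
    grad = lambda lab: np.hypot(sobel(lab[..., 0], 0), sobel(lab[..., 0], 1))
    d_g = uniform_filter(np.abs(grad(cg_lab) - grad(cand_lab)), size=3)
    d_l = (cg_seg != cand_seg).astype(float)
    return a1 * d_c + a2 * d_g + a3 * d_l
```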

The interaction term C_i(p, q, L(p), L(q)) in Eqn. 3 is zero if the labels L(p) and L(q) are the same, and a constant modulated by a blurred and inverted edge map of the CG image otherwise. Specifically:

$$ C_i(p, q, L(p), L(q)) = \begin{cases} 0 & L(p) = L(q) \\ \lambda M(p) & \text{otherwise} \end{cases} \qquad (6) $$

where M(p) is zero near strong edges in the CG image and near 1 in smooth regions. The scalar λ affects the amount of texture switching that can occur. For low values of λ, the algorithm will prefer small patches of textures from many images, and for high values of λ the algorithm will choose large blocks of texture from the same image. The typical range of this parameter is 0.1 to 0.4.
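The modulation M(p) and the pairwise cost of Eqn. 6 can be sketched as follows (the blur sigma is our assumption; λ is the user parameter discussed above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def interaction_weight(cg_gray, lam=0.2, sigma=3.0):
    """lam * M(p): a blurred, inverted edge map of the CG image that is near
    zero at strong edges (label changes are cheap there) and near lam in
    smooth areas (label changes are expensive). This weight is applied
    whenever neighbouring pixels take different labels."""
    g = np.hypot(sobel(cg_gray, axis=0), sobel(cg_gray, axis=1))
    g = gaussian_filter(g, sigma)
    m = 1.0 - g / (g.max() + 1e-12)
    return lam * m
```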

Once the graph cut has determined the label assignment per pixel (i.e., the image from which to transfer texture), we copy gradients from the selected image into the CG image and then reconstruct the image by solving Poisson's equation with Neumann boundary constraints. To minimize reconstruction errors near region boundaries, we use a mixed guidance field, selecting the gradient based on its norm [37].

An example of texture transfer from three real images to a CG image is shown in Fig. 6. The input CG image is shown in the upper left and two matches returned from our database are shown in the upper right; we allowed two shifted copies of each real image (c = 2). Gradients were copied from the real images into the CG image in regions specified by the graph-cut. The resulting image, after color and tone adjustment, is shown in the lower left.

5 RESULTS

Figs. 9 and 11 show results generated by our system for a number of different scenes. These examples include city scenes and various landscapes, e.g., forest, meadow, lake, and beach. To generate these results, we applied all three stages of our system, transferring texture first and then altering the color and tone of the input CG images to match the real images. We perform texture transfer first because gradient-domain manipulations can cause color shifts that deviate from natural colors. The color and tone transfer steps correct these unnatural colors by shifting them towards colors in the real images.

In our system, color and tone transfer operations are completely automatic, and for the examples shown in Fig. 11, texture transfer only requires the user to specify a few parameters. The most important parameters are γ and λ, described in Section 4.3. The parameter γ controls the preference of the algorithm between real and CG textures. Setting this parameter to a high value will cause the algorithm to recreate the CG image, as best it can, with only real patches. Setting it to a low value will cause the algorithm to only choose real patches if they are close to the CG patch, otherwise preserving the original patch. The λ parameter controls the size of the transferred texture patches. Because all transfer operations are reasonably fast, these parameters can be adjusted to synthesize different versions of an image.

As shown in Figs. 6 and 11, many natural scenes can be synthesized using an automatic approach. These scenes are primarily composed of stationary textures, and the global structure of the image is conveyed by the composition, boundaries, and scale of the textures within each image. For scenes with more structure, however, such as the examples in Fig. 7(a) and (b), automatic texture transfer can fail. For these scenes, differences in perspective, scale, and scene geometry between the CG input and real matches can lead to objectionable artifacts in the final result. These problems are due to the fact that our MRF-based texture transfer makes local decisions, which are based solely on 2D image data. Therefore, properties of the image that are strongly dependent on the geometry of the original scene can be mishandled.

Fortunately, a few operations are sufficient to extend our method to a larger class of scenes, including those with significant structure. In particular, we can use the spidery mesh of Horry et al. [38] or simple planar homographies to correct major structural differences. Once warped, texture transfer proceeds as in Sec. 4.3.

In Fig. 9, the real images were warped to align geometry before texture transfer. In the top row, the road from a real image was warped using the spidery mesh tool to more closely match the road of the CG image. Trees, mountains, and the sky from other (unwarped) image matches were added to produce the final result. In the second row, textures of buildings were warped by specifying planes in both the real and CG images. The color and tone of the image was also modified to match a real image (not shown). In the bottom row, the user adjusted the blending mask within the red lines to preserve the dashed line in the center of the road.

In some cases, a user may not want a region of the image to be modified substantially. For example, if the user has spent hours editing a 3D model they probably want the model to appear exactly as they have designed it. In these cases, the user can specify an alpha mask and the system will operate outside the mask. In this scenario, the system provides an easy way to synthesize a realistic background for a 3D model. Two examples of 3D models with backgrounds synthesized by our system are shown in Fig. 8.

6 EVALUATION

We conducted a user study to evaluate the effectiveness of our system. From a set of 10 example CG images we generated 10 CG2Real results automatically, i.e., without manual adjustments. A third set of 10 images was selected from the real photographs used to enhance the corresponding CG2Real results.

Twenty subjects, between 20 and 30 years old, participated in the user study. They were paid $10 for their time. Each participant viewed a sequence of 10 images drawn from the set of 30. Each test sequence was selected randomly, with the constraint that the sequence contained at least 3 images from each category, and that multiple instances of images from the same example did not appear in the sequence. Participants were instructed to identify each image as ‘real’ if they felt that the image was captured by a camera and ‘fake’ if they felt it was generated by a computer program. They were also informed that their responses would be timed but that they should respond accurately rather than quickly.

Fig. 7. Failure cases. (a & b) The texture transfer algorithm does not account for viewpoint or geometric differences between objects in the source images. As a result, images of structured scenes are difficult to blend without user interaction. (c) Differences in scale can also lead to unrealistic composites. The cows from the image in Fig. 4 appear to be the size of mice because the texture is at the wrong scale.

Fig. 8. Realistic backgrounds for CG models. In some cases, a user may not want regions of the image replaced. The user can specify an alpha mask (middle row) and the system will operate outside the mask, thus creating a realistic context for a rendered model (bottom).

Here we report a number of findings. With unlimited viewing time, the subjects classified:

• 97% of the CG images as fake;
• 52% of the CG2Real images as fake;
• 17% of the real images as fake.

As can be seen, users identified 97% of the CG images as fake, and this number dropped to 52% for the CG2Real images, indicating that some of the CG2Real images were considered real. With only 5 seconds of viewing, the subjects classified:

• 82% of the CG images as fake;
• 27% of the CG2Real images as fake;
• 12% of the real images as fake.

In this case, most of the CG images were still identified as fake, but more of the CG2Real images were considered real. These results suggest that our color, tone and texture transfer give an initial impression of realism, but with unlimited viewing time, inconsistencies, such as scale and perspective differences, become apparent.

Fig. 10(a) shows the complete results. It describes the percentage of images marked as fake as a function of maximum response time.

6.1 Realism vs. Size

What cues are viewers using to distinguish CG from real? To begin to answer this question, we repeated the real vs. fake discrimination task at various scales. By changing the size of the images, we change the amount of high-frequency information, and our goal was to quantify the effect of this information on the perception of realism.

We presented three sets of images (CG, CG2Real, and Real) and we fixed the presentation time at five seconds. We varied the image width in powers of two from 32 to 512 pixels and asked the viewer to identify the images as ‘real’ or ‘fake’ using the same definitions as the previous study.

We used the 30 images from the previous study (10 of each type) and collected 20 responses per image at 5 different sizes. Due to the size of this study (30 × 5 × 20 = 3000 responses), we used Amazon's Mechanical Turk [39]. Mechanical Turk is an online marketplace for human intelligence tasks. Requestors can publish tasks and the rate they are willing to pay per task; workers can browse for tasks they are willing to perform. We divided the study into experiments by image size, collecting 20 responses for each image in each experiment. The average time per experiment was twelve minutes and the average number of contributing workers was thirty-nine.

Fig. 9. CG2Real results on structured scenes. (a) Images of structured scenes are difficult to blend without user interaction. (b) The spidery-mesh tool (top) and rectangle tool (middle) were used to align geometry. In the last row of (b), a scribble tool was used to modify the mask produced by our texture transfer algorithm (the modifications are outlined). (c) Final results using edited textures.

The results are shown in Fig. 10(b). At 32 pixels, most images were labeled as ‘real,’ though there was some ability to distinguish between the three types of images even at this scale. As the image size increased, the task became easier, though the ability to detect a CG image as fake increased more dramatically with size than the ability to detect a CG2Real image (based on the slope of the curves between 32 and 256 pixels). In addition, the viewer's confidence in real images increased after 128 pixels; they mislabeled fewer real images as fake. This study suggests that high frequencies contribute to the perception of realism in CG images, though they do not account for all of it.

In our system, we apply three types of transfer and the previous experiments have established that all three of these operations together improve realism for CG images. But how much realism is due to texture transfer and how much is due to color and tone transfer? To address this question, we ran another Mechanical Turk task showing images for five seconds and asking viewers to label the images as real or fake. For this experiment, we introduced CG2Real images without texture transfer (only color and tone), but kept the other parameters of the experiment the same. We found that 69% of color and tone images were classified as fake, which is between the results for CG and CG2Real (full pipeline) found in previous studies. This result suggests that color and tone transfer improve realism somewhat, but most of the gain shown in Figs. 10(a) and 10(b) comes from texture transfer.

Fig. 10. (a) Percentage of images marked as fake as a function of maximum response time for real photos, CG images, and CG2Real images produced by our method. (b) Percentage of images marked as fake as a function of image width for real photos, CG images, and CG2Real images produced by our method.

6.2 Search Descriptors

As we mentioned in Section 3.1, we originally used the gist descriptor for image search but were not satisfied with the results. We found that the spatial pyramid descriptor consistently returned better matches, but we sought to quantify this observation. We used Mechanical Turk again for this task but with a different experimental setup.

We selected 50 CG images from our collection at random and performed queries on the same image database using two different descriptors: our spatial pyramid descriptor and gist. We hired 10 workers per image for a total of 1000 tasks (50 inputs × 10 workers per input × 2 descriptors).

In a given task, we presented the user with a CG image and twenty image matches. Users were instructed to “select all images that were good matches to the query image.” There were additional instructions to clarify that a good match is an image that depicts a similar scene. Users responded via checkboxes, and their responses were not timed.

Here we consider a match “good” if 3 or more users (out of 10) considered it so. Across 50 images, at this 30% acceptance threshold, the spatial pyramid descriptor returned an average of 4.6 good matches per image, while gist returned 1.7 good matches per image.

7 LIMITATIONS

In general, our system is only as good as the database of photographs upon which it is built. We have found that a database of 4.5 million images yields around 4 good matching images for CG landscapes, but more images would be required to match more complex scenes. In future work, we hope to increase the database size to determine how the number of matching images and the match quality affect the final result. We also hope to expand the database to include a wider variety of photographs, including indoor images. But even with more images, texture transfer may fail to be realistic if there are geometric or perspective differences or if the scale is incorrect (Fig. 7(c)). Some of these examples can be improved with texture warping (Fig. 9), and it is possible that an appropriate warp could be determined automatically in some cases.

8 CONCLUSIONS

Our system takes a radically different approach to synthesizing photorealistic CG images. Instead of adding geometric and texture details or using more sophisticated rendering and lighting models, we take an image-based approach that exploits the growing number of real images that are available online. We transfer realistic color, tone, and texture to the CG image and we show that these transfers improve the realism of CG images. To improve upon these results, future work will have to consider other operations, perhaps perspective and scale adjustments. While we have found that high frequencies in real-image texture contribute to perceived realism, what other cues are people using to distinguish CG from real? Are they high or low-level? Future work on this problem could provide valuable insight into the mysteries of human vision.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under Grants No. PHY-0835713 and 0739255, the John A. and Elizabeth S. Armstrong Fellowship at Harvard, and through generous support from Adobe Systems. We would like to thank David Salesin and Tom Malloy at Adobe for their continuing support. This work was completed while S. Avidan and W. Matusik were Senior Research Scientists at Adobe Systems.

Fig. 11. Results of our CG2Real system. In each section, the CG input image appears above the output image. All of these results were synthesized without the use of scribbles or geometric adjustment.

REFERENCES

[1] J. Hays and A. A. Efros, “Scene completion using millions of photographs,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 26, no. 3, pp. 4 (1–7), Aug. 2007.

[2] D. J. Heeger and J. R. Bergen, “Pyramid-based texture analysis/synthesis,” in Proceedings of ACM SIGGRAPH, 1995, pp. 229–238.

[3] J. S. De Bonet, “Multiresolution sampling procedure for analysis and synthesis of texture images,” in Proceedings of ACM SIGGRAPH, 1997, pp. 361–368.

[4] A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric sampling,” in Proc. IEEE Int. Conf. Computer Vision, 1999, pp. 1033–1038.

[5] A. A. Efros and W. T. Freeman, “Image quilting for texture synthesis and transfer,” in Proceedings of ACM SIGGRAPH, 2001, pp. 341–346.

[6] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick, “Graphcut textures: Image and video synthesis using graph cuts,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 22, no. 3, pp. 277–286, Jul. 2003.

[7] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin, “Image analogies,” in Proceedings of ACM SIGGRAPH, Aug. 2001, pp. 327–340.

[8] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” IEEE Comput. Graph. Appl., vol. 22, no. 2, pp. 56–65, Mar./Apr. 2002.

[9] R. Rosales, K. Achan, and B. Frey, “Unsupervised image translation,” in Proc. IEEE Int. Conf. Computer Vision, 2003, pp. 472–478.

[10] E. Reinhard, M. Ashikhmin, B. Gooch, and P. S. Shirley, “Color transfer between images,” IEEE Comput. Graph. Appl., vol. 21, no. 5, pp. 34–41, Sep./Oct. 2001.

[11] F. Pitie, A. C. Kokaram, and R. Dahyot, “N-dimensional probability density function transfer and its application to colour transfer,” in Proc. IEEE Int. Conf. Computer Vision, 2005, pp. 1434–1439.

[12] S. Bae, S. Paris, and F. Durand, “Two-scale tone management for photographic look,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 25, no. 3, pp. 637–645, Jul. 2006.

[13] C.-L. Wen, C.-H. Hsieh, B.-Y. Chen, and M. Ouhyoung, “Example-based multiple local color transfer by strokes,” Computer Graphics Forum (Proc. of Pacific Graphics), vol. 27, no. 7, pp. 1765–1772, 2008.

[14] X. Liu, L. Wan, Y. Qu, T.-T. Wong, S. Lin, C.-S. Leung, and P.-A. Heng, “Intrinsic colorization,” ACM Transactions on Graphics (Proc. of SIGGRAPH Asia), vol. 27, no. 3, pp. 152 (1–9), 2008.

[15] K. Dale, M. K. Johnson, K. Sunkavalli, W. Matusik, and H. Pfister, “Image restoration using online photo collections,” in Proc. IEEE Int. Conf. Computer Vision, 2009, pp. 2217–2224.

[16] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi, “Photo clip art,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 26, no. 3, pp. 3 (1–10), Aug. 2007.

[17] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: a database and web-based tool for image annotation,” Int. Journal of Computer Vision, vol. 77, no. 1–3, pp. 157–173, May 2008.

[18] M. Johnson, G. J. Brostow, J. Shotton, O. Arandjelovic, V. Kwatra, and R. Cipolla, “Semantic photo synthesis,” Computer Graphics Forum, vol. 25, no. 3, pp. 407–414, Sep. 2006.

[19] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu, “Sketch2Photo: Internet image montage,” ACM Transactions on Graphics (Proc. of SIGGRAPH Asia), vol. 28, no. 5, pp. 124 (1–10), 2009.

[20] S. Lyu and H. Farid, “How realistic is photorealistic?” IEEE Trans. Signal Process., vol. 53, no. 2, pp. 845–850, 2005.

[21] J.-F. Lalonde and A. A. Efros, “Using color compatibility for assessing image realism,” in Proc. IEEE Int. Conf. Computer Vision, 2007, pp. 1–8.

[22] Y. Weiss and W. T. Freeman, “What makes a good model of natural images?” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[23] S. Roth and M. J. Black, “Fields of experts: A framework for learning image priors,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 860–867.

[24] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman, “Segmenting scenes by matching image composites,” in Advances in Neural Information Processing Systems, 2009, pp. 1580–1588.

[25] A. Oliva and A. Torralba, “Modeling the shape of the scene: a holistic representation of the spatial envelope,” Int. Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[26] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” Int. Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, 2007.

[27] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[28] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2169–2178.

[29] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in Proc. IEEE Int. Conf. Computer Vision, vol. 2, 2005, pp. 1458–1465.

[30] Y.-W. Tai, J. Jia, and C.-K. Tang, “Local color transfer via probabilistic segmentation by expectation-maximization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 747–754.

[31] C. Rother, T. Minka, A. Blake, and V. Kolmogorov, “Cosegmentation of image pairs by histogram matching—incorporating a global constraint into MRFs,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 993–1000.

[32] K. Fukunaga and L. D. Hostetler, “The estimation of the gradient of a density function, with applications in pattern recognition,” IEEE Trans. Inf. Theory, vol. 21, no. 1, pp. 32–40, 1975.

[33] S. Paris and F. Durand, “A topological approach to hierarchical segmentation using mean shift,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[34] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 26, no. 3, pp. 96 (1–5), 2007.

[35] E. H. Adelson, E. P. Simoncelli, and R. Hingorani, “Orthogonal pyramid transforms for image coding,” in Proc. SPIE Visual Communications and Image Processing II, vol. 845, 1987, pp. 50–58.

[36] Y. Li, L. Sharan, and E. H. Adelson, “Compressing and companding high dynamic range images with subband architectures,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 24, no. 3, pp. 836–844, 2005.

[37] P. Perez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on Graphics (Proc. ACM SIGGRAPH), vol. 22, no. 3, pp. 313–318, Jul. 2003.

[38] Y. Horry, K.-I. Anjyo, and K. Arai, “Tour into the picture: using a spidery mesh interface to make animation from a single image,” in Proceedings of ACM SIGGRAPH, 1997, pp. 225–232.

[39] Amazon.com, “Amazon Mechanical Turk,” http://www.mturk.com, 2009.

Micah K. Johnson is a Postdoctoral Fellow at MIT. His PhD work focused on image forensics, i.e., detecting tampered images, but he has since worked on a variety of techniques for improving realism in computer graphics. He holds undergraduate degrees in Mathematics and Music Theory from the University of New Hampshire, an AM in Electro-Acoustic Music from Dartmouth College, and a PhD in Computer Science from Dartmouth College.

Kevin Dale received his BS degree in computer science from the University of North Carolina at Chapel Hill in 2004 and his MCS degree from the University of Virginia in 2006. He is currently a PhD student at Harvard University. His research interests are in computational photography and graphics hardware.

Shai Avidan received the PhD degree from the School of Computer Science at the Hebrew University, Jerusalem, Israel, in 1999. He is a professor at Tel-Aviv University, Israel. He was a postdoctoral researcher at Microsoft Research, a project leader at MobilEye, a research scientist at Mitsubishi Electric Research Labs (MERL), and a senior research scientist at Adobe Systems. He has published extensively in the fields of object tracking in video sequences and 3D object modeling from images. Recently, he has been working on Internet vision applications such as privacy-preserving image analysis, distributed algorithms for image analysis, and media retargeting, the problem of properly fitting images and video to displays of various sizes.

Hanspeter Pfister is the Gordon McKay Professor of the Practice in Harvard University's School of Engineering and Applied Sciences. His research interests are at the intersection of visualization, computer graphics, and computer vision. Pfister has a PhD in computer science from the State University of New York at Stony Brook. He is a senior member of the IEEE Computer Society and a member of ACM, ACM Siggraph, and Eurographics.

William T. Freeman received the PhD degree from the Massachusetts Institute of Technology in 1992. He is a professor of electrical engineering and computer science at the Massachusetts Institute of Technology (MIT), working in the Computer Science and Artificial Intelligence Laboratory (CSAIL). He has been on the faculty at MIT since 2001. From 1992 to 2001, he worked at Mitsubishi Electric Research Labs (MERL), Cambridge, Massachusetts. Prior to that, he worked at Polaroid Corporation, and in 1987-1988 was a foreign expert at the Taiyuan University of Technology, China. His research interests include machine learning applied to problems in computer vision and computational photography.

Wojciech Matusik is a senior research scientist at Disney Research Zurich. Before coming to Disney, he worked at Mitsubishi Electric Research Laboratories and at Adobe Systems. He received a BS in EECS from the University of California at Berkeley in 1997, an MS in EECS from MIT in 2001, and a PhD in 2003. His primary research is in the area of computer graphics with broad applications in other disciplines such as digital communications, materials science, and biomechanics.