Making Images Real Again: A Comprehensive Survey on Deep Image Composition

Li Niu*, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang

Abstract—As a common image editing operation, image composition aims to cut the foreground from one image and paste it on another image, resulting in a composite image. However, there are many issues that could make the composite images unrealistic. These issues can be summarized as the inconsistency between foreground and background, which includes appearance inconsistency (e.g., incompatible color and illumination) and geometry inconsistency (e.g., unreasonable size and location). Previous works on image composition target one or more issues. Since each individual issue is a complicated problem, there are some research directions (e.g., image harmonization, object placement) which focus on only one issue. By putting all the efforts together, we can acquire realistic composite images. Sometimes, we expect the composite images to be not only realistic but also aesthetic, in which case aesthetic evaluation needs to be considered. In this survey, we summarize the datasets and methods for the above research directions. We also discuss the limitations and potential directions to facilitate future research on image composition. Finally, as a double-edged sword, image composition may also have a negative effect on our lives (e.g., fake news), and thus it is imperative to develop algorithms to fight against composite images. Datasets and codes for image composition are summarized at https://github.com/bcmi/Awesome-Image-Composition.

I. INTRODUCTION

Image composition aims to cut the foreground from one image and paste it on another image to form a composite image, as illustrated in Fig. 1(a). More generally, image composition can be used for combining multiple visual elements from different sources to construct a realistic image. As a common image editing operation, image composition has a wide range of applications like augmented reality, artistic creation, and e-commerce. One example is virtual entertainment. One person can change the backgrounds of self-portraits to tourist attractions. Multiple people can upload their self-portraits to compose a group photo with the desired background scenery. Another example is advertising picture generation. Given an image containing the target product, the extracted product foreground can be pasted on a new background together with other decorative visual elements to compose an appealing advertising picture automatically. For artists, image composition can be used to create fantastic artworks that originally exist only in the imagination.

After compositing a new image with foreground and background, there exist many issues that could make the composite image unrealistic and thus significantly degrade its quality.

Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, and Liqing Zhang are with the MOE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China (email: {ustcnewly, plcwyam17320, shirlley, Hy2628982280, bo-zhang, leungjing, lqzhang}@sjtu.edu.cn).

* means the corresponding author.

Fig. 1. Subfigure (a) illustrates the process of obtaining a composite image. Specifically, we extract the foreground from one image using image segmentation or matting techniques and paste the foreground on another background image. There could be many issues (e.g., unreasonable location and size, incompatible color and illumination, missing shadow) in the obtained composite image. These issues can be addressed by image composition techniques (e.g., object placement, image blending, image harmonization, shadow generation) as shown in Subfigure (b), in which the composite image and foreground mask are taken as input to produce a realistic-looking image.
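
As an illustration of the cut-and-paste step in Subfigure (a), the following minimal sketch (our own illustration, not code from the surveyed works; all file names are hypothetical) composites a foreground onto a background with a binary mask.

```python
import numpy as np
from PIL import Image

# Hypothetical file names; the mask is a single-channel image with 255 on the foreground.
bg = Image.open("background.jpg").convert("RGB")
fg = Image.open("foreground.png").convert("RGB").resize(bg.size)
mask = Image.open("foreground_mask.png").convert("L").resize(bg.size)

bg_arr = np.asarray(bg, dtype=np.float32)
fg_arr = np.asarray(fg, dtype=np.float32)
alpha = np.asarray(mask, dtype=np.float32)[..., None] / 255.0  # (H, W, 1), broadcast over RGB

# Naive cut-and-paste: foreground pixels where the mask is on, background pixels elsewhere.
composite = alpha * fg_arr + (1.0 - alpha) * bg_arr
Image.fromarray(composite.astype(np.uint8)).save("composite.jpg")
```

Such a naive composite typically exhibits exactly the issues listed above (unnatural boundary, incompatible color, missing shadow), which motivates the research directions surveyed below.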

These issues can be summarized as the inconsistency between foreground and background, which can be divided into appearance inconsistency and geometry inconsistency [19]. Both appearance inconsistency and geometry inconsistency involve a number of issues to be solved. Previous works [135, 3, 164, 163] on image composition target one or more issues. Since each individual issue could be very challenging, several research directions have emerged, with each research direction focusing on only one issue (see Fig. 1(b)). Next, we will introduce the appearance inconsistency and the geometry inconsistency separately.

The appearance inconsistency includes but is not limited to: 1) unnatural boundary between foreground and background; 2) incompatible color and illumination statistics between foreground and background; 3) missing or implausible shadow and reflection of the foreground. For the first issue, the foreground is usually extracted using segmentation [104] or matting [157] algorithms. However, the foregrounds may not be precisely delineated, especially at the boundaries.



TABLE I
The issues to be solved in the image composition task and the corresponding deep learning methods to solve each issue. Note that some methods only focus on one issue while some methods attempt to solve multiple issues simultaneously. "Boundary" means refining the boundary between foreground and background. "Appearance" means adjusting the color and illumination of the foreground. "Shadow" means generating shadow for the foreground. "Geometry" means seeking reasonable location, size, and shape for the foreground. "Occlusion" means coping with unreasonable occlusion.

Issues addressed                        Methods
boundary                                [167]
appearance                              [139, 31, 25, 27, 56, 129, 54, 88, 147]
boundary + appearance                   [148, 170]
appearance + geometry                   [162]
appearance + shadow                     [163]
shadow                                  [91, 172, 127, 58]
geometry                                [3, 38, 51, 73, 85, 135, 138, 164, 169]
geometry + occlusion                    [3, 164, 136]
boundary + appearance + geometry        [19]

When pasting the foreground with jagged boundaries on the background, there would be obvious color artifacts along the boundary. To solve this issue, image blending [148, 170] aims to address the unnatural boundary between foreground and background, so that the foreground can be seamlessly blended with the background. For the second issue, since the foreground and background may be captured in different conditions (e.g., weather, season, time of day, camera setting), the obtained composite image could be inharmonious (e.g., foreground captured in the daytime and background captured at night). To solve this issue, image harmonization [139, 31, 25] aims to adjust the color and illumination statistics of the foreground to make it more compatible with the background, so that the whole image looks more harmonious. For the third issue, when pasting the foreground on the background, the foreground may also affect the background with shadow or reflection. To solve this issue, shadow or reflection generation [91, 172, 127] focuses on generating plausible shadow or reflection for the foreground according to both foreground and background information. To the best of our knowledge, there are few works on generating reflection for the inserted object, probably due to the limited application scenarios, so we focus on shadow generation in this paper.

The geometry inconsistency includes but is not limited to: 1) the foreground object is too large or too small; 2) the foreground object does not have supporting force (e.g., hanging in the air); 3) the foreground object appears in a semantically unreasonable place (e.g., a boat on the land); 4) unreasonable occlusion; 5) inconsistent perspectives between foreground and background. In summary, the location, size, and shape of the foreground may be irrational. Object placement and spatial transformation [3, 38, 73, 138, 169] seek reasonable location, size, and shape for the foreground to avoid the abovementioned inconsistencies. Object placement [169, 138] mainly translates and resizes the foreground, while spatial transformation [73, 85] involves more complicated transformations (e.g., perspective transformation) of the foreground. In this paper, we use the term object placement to indicate arbitrary geometric transformation for ease of description.

Most previous methods seek reasonable placement to avoid unreasonable occlusions, while some methods [3, 164, 136] aim to fix unreasonable occlusion by removing the occluded regions of the foreground based on the estimated depth information.

After addressing the appearance inconsistency and geometry inconsistency between foreground and background, the composite image will look more realistic. Nevertheless, sometimes realism is not enough and aesthetics is also highly desired. For example, when placing a vase on a table during image composition, there could be numerous reasonable locations and sizes. However, with a certain location and size, the composite image may be more visually appealing considering composition rules and aesthetic criteria. In this case, we need to further evaluate the aesthetic value of a composite image. There have been many works on aesthetic evaluation [134, 21, 92, 126], which can judge an image from multiple aesthetic aspects (e.g., interesting lighting, harmony, vivid color, depth of field, rule of thirds, symmetry). In this paper, we focus on composition-aware aesthetic evaluation (e.g., rule of thirds, symmetry), because we can adjust the location and size of the foreground to achieve higher composition quality when compositing an image. Note that the term composition in composition-aware aesthetic evaluation and in image composition refers to different concepts: the former means the visual layout of an image while the latter means the process of combining foreground and background.

Image composition targets combining visual elements from different real images to produce a realistic image, which is complementary to unconditional or conditional image generation. For example, with the goal of inserting a cat into a complex background, image generation methods can hardly produce an image with the cat exactly as expected, even if detailed conditional information is provided. Instead, we can paste the desired cat on the background and employ image composition techniques to obtain a natural-looking image. Despite the attractive and promising applications (e.g., augmented reality, artistic creation, e-commerce), as a double-edged sword, image composition may also be utilized to do evil (e.g., fake news, Internet rumors, insurance fraud, blackmail). Thus, there is a great need to develop techniques to fight against composite images.


Existing image splicing or manipulation detection methods [74, 153, 124, 83] can be used to detect composite images. Compared with image splicing, image manipulation covers a wider range of image forgery. Image manipulation detection methods can identify the composite image and localize the inserted foreground. Besides, these methods [124, 74] can also be used as discriminators to enhance the ability of image composition methods, like the generator and discriminator competing with each other in a generative adversarial network (GAN) [53, 105].

In the following sections, we will first introduce different research directions to solve different issues: object placement in Section II, image blending in Section III, image harmonization in Section IV, and shadow generation in Section V, which roughly aligns with the sequential order of creating a composite image. As mentioned before, image composition has many issues to be solved. Some methods target only one issue while some methods target multiple issues simultaneously. For better illustration, we summarize all issues and the corresponding methods to solve each issue in Table I. Then, we will introduce composition-aware aesthetic evaluation in Section VI and image manipulation detection in Section VII. Furthermore, we will point out some limitations and suggest some potential directions in Section VIII. Finally, we will conclude the whole paper in Section IX. The contributions of this paper can be summarized as follows:

• To the best of our knowledge, this is the first comprehensive survey on deep image composition.

• We summarize the issues to be solved in the image composition task as well as the corresponding research directions and methods, giving rise to a big picture for deep image composition.

• For completeness, we also discuss the higher demand of image composition (i.e., from realistic to aesthetic) and the approaches to fight against image composition.

II. OBJECT PLACEMENT

Object placement aims to paste the foreground on the background with suitable location, size, and shape. As shown in Fig. 2, the cases of unreasonable object placement include but are not limited to: 1) the foreground object is too large or too small; 2) the foreground object does not have supporting force (e.g., hanging in the air); 3) the foreground object appears in a semantically unreasonable place (e.g., a boat on the land); 4) unreasonable occlusion; 5) inconsistent perspectives between foreground and background. Considering the above unreasonable cases, object placement is still a challenging task.

A. Datasets and Evaluation Metrics

In previous works [138, 42], object placement is used as a data augmentation strategy to facilitate downstream tasks (e.g., object detection, segmentation), so there is no benchmark dataset for the object placement task itself. Instead, prevalent segmentation datasets [86, 40, 28, 50] with annotated segmentation masks are used to create composite images. For these datasets, there are two common methods, i.e., splicing and copy-move, to generate new composite images.

Splicing crops the foreground object from one image and pastes it on another background image, as in [51, 138]. Copy-move removes the foreground object with inpainting techniques [160, 93, 161] and pastes the foreground object at another place in the same image, as in [135, 42]. For spatial transformation, which needs to adjust the shape of the foreground, some typical applications are placing glasses/hats on human faces or inserting furniture into a room [85]. Catering to these applications, some specific datasets with segmentation masks [94, 131, 18] are used.

To evaluate the quality of generated composite images, previous object placement works usually adopt the following three schemes: 1) Some works measure the similarity between real images and composite images. For example, Tan et al. [135] score the correlation between the distributions of predicted boxes and ground-truth boxes. Zhang et al. [169] calculate the Frechet Inception Distance (FID) [57] between composite images and real images. However, these metrics cannot evaluate each individual composite image. 2) Some works [138, 42] utilize the performance improvement of downstream tasks (e.g., object detection) to evaluate the quality of composite images, where the training sets of the downstream tasks are augmented with generated composite images. However, the evaluation cost is quite high and the improvement in downstream tasks may not reliably reflect the quality of composite images, because it has been revealed in [52] that randomly generated unrealistic composite images could also boost the performance of downstream tasks. 3) Another common evaluation strategy is user study, where people are asked to score the rationality of object placement [78, 135]. User study complies with human perception and each composite image can be evaluated individually. However, due to the subjectivity of user study, the gauge in different papers may be dramatically different. There is no unified benchmark dataset and the results in different papers cannot be directly compared.
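
For concreteness, the sketch below (our own illustration) shows the Frechet distance computation behind FID. In practice the feature vectors come from a pretrained Inception network applied to many real and composite images; here random placeholders are used.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """Frechet Inception Distance between two sets of Inception features, shapes (N, D) and (M, D)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can create tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy usage with random feature vectors standing in for Inception features.
real_feats = np.random.randn(500, 64)
composite_feats = np.random.randn(500, 64) + 0.1
print("FID:", fid(real_feats, composite_feats))  # lower is better
```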

B. Traditional Methods

Some object placement methods leverage prior knowledge of background geometry or appearance to find a reasonable location for the foreground object. For example, Remez et al. [120] paste the cropped objects on the background along the same horizontal scanline, without changing the scale of the foreground object. Wang et al. [141] propose the instance-switching strategy to generate new training images by switching instances of the same class while taking the shape and scale information of instances into account, where the position and location are determined by the original real image. To better refine the position where the object is pasted, Fang et al. [42] apply an appearance consistency heatmap to guide the placement with surrounding visual context, where the appearance consistency heatmap evaluates similarity in RGB space. For scale and rotation, the corresponding parameters are sampled independently from a uniform distribution. In the above methods, location and position are calculated based on the images and their associated annotations. This prior knowledge is incomplete and sometimes inaccurate, which falls far short of what is required to handle the diverse and complicated challenges in the object placement task.


Fig. 2. Examples of unreasonable object placements. The inserted foreground objects are marked with red outlines. From left to right: (a) objects with inappropriate size; (b) objects hanging in the air; (c) objects appearing in a semantically unreasonable place; (d) unreasonable occlusion; (e) inconsistent perspectives.


C. Deep Learning Methods

Apart from the above methods which leverage prior knowledge to calculate and select a reasonable placement for the foreground object, some methods [162, 3, 164, 136] employ deep learning techniques to predict the placement and generate the composite image automatically. One group of methods relies on explicit rules to find a reasonable location for the foreground object, with the help of pretrained deep learning models. For example, given the background and its depth map, Georgakis et al. [51] combine support surface detection and semantic segmentation to produce regions for placing the objects. Then, the size of the object is determined using the depth of the selected position and the original scale of the object.

Other methods train networks to automatically learn reasonable object placement, which can be further divided into supervised and unsupervised learning methods, where supervised methods leverage the size/location of the foreground object in the original image as ground-truth while unsupervised methods do not use the ground-truth size/location. For supervised learning methods [135, 38, 173, 169, 78], one common solution is to predict the bounding box or transformation of the foreground object based on the foreground and background features. Dvornik et al. [38] learn a model to estimate the likelihood of a particular object category being present inside a box given its neighborhood and then automatically find suitable locations on images to place new objects. Tan et al. [135] predict the location and size of each potential person instance, where the bounding box prediction is formulated as a joint classification problem by discretizing the spatial and size domains of the bounding box space. In addition, through adversarial learning, Zhang et al. [169] combine the encoded object and background with a random variable to predict object placement, the plausibility of which is then checked by a discriminator conditioned on the foreground and the background. Besides, they preserve the diversity of object placement via the pairwise distance between predicted placements and the corresponding random variables. Given an input semantic map, Lee et al. [78] propose a network consisting of two generative modules, in which one determines where the inserted object mask should be and the other determines what the object mask shape should look like.

Besides, the two modules are connected via a spatial transformation network for joint training.

For unsupervised learning, Tripathi et al. [138] propose a 3-way competition among the synthesizer, target, and real-image discriminator networks. Given the background and foreground object, the synthesizer learns the affine transformations to produce the composite image. Then, with the composite images, the target network aims to correctly classify/detect all foreground objects in the composite images. Furthermore, considering the interactions between objects, Azadi et al. [3] train Compositional GAN by enforcing a self-consistency constraint [179] with decomposition as the supervisory signal. At test time, they fine-tune the weights of the composition network on each given test example with the self-supervision provided by the decomposition network. ST-GAN [85] is proposed to iteratively warp a foreground instance into a background scene with a spatial transformer network [67] under the GAN framework, aiming to push composite images closer to the natural image manifold. Kikuchi et al. [73] attempt to replace the iterative transformation procedure in [85] with a one-shot transformation.
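
To make the spatial-transformation-based placement pipeline concrete, the sketch below (our simplified illustration under stated assumptions, not the actual ST-GAN implementation) shows a network that predicts affine transformation parameters from concatenated background and foreground inputs and warps the RGBA foreground with a spatial transformer before compositing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlacementNet(nn.Module):
    """Predicts a 2x3 affine transform for the foreground and composites it over the background."""
    def __init__(self):
        super().__init__()
        # Input: background (3 channels) + foreground RGBA (4 channels) concatenated.
        self.encoder = nn.Sequential(
            nn.Conv2d(7, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, 6)
        # Initialize to the identity transform so training starts from "no change".
        nn.init.zeros_(self.fc.weight)
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, background, foreground_rgba):
        x = torch.cat([background, foreground_rgba], dim=1)
        theta = self.fc(self.encoder(x)).view(-1, 2, 3)
        # Warp the RGBA foreground with the predicted affine transform (spatial transformer).
        grid = F.affine_grid(theta, foreground_rgba.size(), align_corners=False)
        warped = F.grid_sample(foreground_rgba, grid, align_corners=False)
        # Alpha-composite the warped foreground over the background.
        rgb, alpha = warped[:, :3], warped[:, 3:4]
        return alpha * rgb + (1 - alpha) * background

net = PlacementNet()
bg = torch.rand(2, 3, 128, 128)
fg = torch.rand(2, 4, 128, 128)  # RGBA foreground with its alpha mask in the last channel
composite = net(bg, fg)          # (2, 3, 128, 128)
```

In an adversarial setup, the composite would be fed to a discriminator or downstream task network that provides the training signal for the predicted transformation.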

Finally, we briefly discuss the occlusion issue. Most of the above methods seek reasonable placement to avoid the occurrence of unreasonable occlusions, i.e., there is no unreasonable occlusion surrounding the inserted foreground. In contrast, some methods [3, 164, 136] intend to address the unreasonable occlusion when it occurs. Specifically, they consider removing the occluded regions of the foreground based on the estimated depth information.

III. IMAGE BLENDING

During image composition, the foreground is usually extracted using segmentation [104] or matting [157] methods. However, the segmentation or matting results may be noisy and the foregrounds are not precisely delineated. When the foreground with jagged boundaries is pasted on the background, there will be abrupt intensity changes between the foreground and background. To refine the boundary and reduce the fuzziness, image blending techniques have been developed.

A. Datasets and Evaluation Metrics

To the best of our knowledge, there are only a few deep learning methods on image blending and there is no unified benchmark dataset. The existing works use segmentation datasets with annotated segmentation masks or manually crop regions from images to create composite images with unnatural boundaries.



The existing deep image blending works [148, 170] adopt the following two evaluation metrics: 1) calculating a realism score using the pretrained model [178], which reflects the realism of a composite image; 2) conducting a user study by asking engaged users to select the most realistic images.

B. Traditional Methods

One simple way to make the transition smooth is alpha blending [115], which assigns alpha values to boundary pixels indicating what fraction of the colors comes from the foreground or the background. Although alpha blending can smooth the transition over the boundary to some extent, it may blur fine details or produce ghosting effects. Besides, Laplacian pyramid blending is proposed in [16] to leverage information from different scales. The most popular blending technique is enforcing gradient domain smoothness [44, 72, 79, 133], in which the earliest work is Poisson image blending [113]. Poisson image blending [113] enforces gradient domain consistency with respect to the source image containing the foreground, where the gradient of the inserted foreground is computed and propagated from the boundary pixels in the background. The above methods based on gradient domain smoothness perform more favorably than alpha blending, but the colors of the background often seep through the foreground too much and cause a significant loss of foreground content.
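
For readers who want to experiment with these traditional techniques, the short sketch below (our own illustration; file names are placeholders) contrasts plain alpha blending with Poisson-style gradient-domain blending via OpenCV's seamlessClone.

```python
import cv2
import numpy as np

# Hypothetical inputs: background, foreground, and a binary foreground mask of the same size.
bg = cv2.imread("background.jpg")
fg = cv2.imread("foreground.jpg")
mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)

# 1) Alpha blending: a linear mix controlled by the (possibly soft) mask.
alpha = (mask.astype(np.float32) / 255.0)[..., None]
alpha_blend = (alpha * fg + (1 - alpha) * bg).astype(np.uint8)

# 2) Gradient-domain (Poisson) blending: paste the masked foreground at the mask centroid.
ys, xs = np.nonzero(mask)
center = (int(xs.mean()), int(ys.mean()))
poisson_blend = cv2.seamlessClone(fg, bg, mask, center, cv2.NORMAL_CLONE)

cv2.imwrite("alpha_blend.jpg", alpha_blend)
cv2.imwrite("poisson_blend.jpg", poisson_blend)
```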

C. Deep Learning Methods

As mentioned above, only a few deep learning methods [148, 170, 167] have been proposed for image blending. Among them, the works [148, 170] not only enable smooth transition over the boundary, but also reduce the color and illumination discrepancy between foreground and background, the latter of which is the goal of image harmonization in Section IV. In this section, we only introduce the way they enable smooth transition over the boundary. The works [148, 170] are both inspired by [113]. Specifically, they add a gradient domain constraint to the objective function according to the Poisson equation, which can produce a smooth blending boundary with gradient domain consistency. Different from [148, 170], the work by Zhang et al. [167] is inspired by Laplacian pyramid blending [16]. In [167], two separate encoders are used to extract multi-scale features from foreground and background respectively, which are fused to produce a seamlessly blended image.

IV. IMAGE HARMONIZATION

Given a composite image, its foreground and background are likely to be captured in different conditions (e.g., weather, season, time of day, camera setting), and thus have distinctive color and illumination characteristics, which make them look incompatible. Image harmonization aims to adjust the appearance of the composite foreground according to the background so that the two are compatible.

A. Datasets and Evaluation Metrics

A large number of composite images can be easily obtained by pasting the foreground from one image on another background image [158, 139], but it is extremely difficult to obtain the ground-truth harmonized image for a composite image. Manually adjusting the foreground according to the background to obtain the ground-truth is time-consuming, labor-intensive, and unreliable. However, for the supervision of harmonization models and faithful evaluation of harmonization results, we require adequate pairs of composite images and ground-truth harmonized images. In existing works, the constructed datasets containing such pairs can be divided into three groups: real-world datasets, synthetic real datasets, and rendered datasets.

Real-world dataset: The most natural way is collecting a set of images of the same scene in different capture conditions. Then, we can exchange the foregrounds within the same group to create pairs of composite images and ground-truth images. For example, the Transient Attributes Database [76] contains 101 sets, in which each set has well-aligned images of the same scene captured in different conditions (e.g., weather, time of day, season). This dataset has been used to construct pairs in [148, 25]. However, collecting a dataset like [76] requires capturing the same scene with a fixed camera for a long time, which is hard to realize in practice. One easy alternative is adjusting the artificial lighting condition in a well-controlled environment, but such lighting conditions are different from natural lighting conditions and the diversity is limited. To obtain foregrounds in different capture conditions, an interesting way is proposed in [130]. Specifically, they place the same physical model (foreground) in different lighting conditions to capture different images and align the foregrounds in different images. Nevertheless, the collection cost is still very high and the diversity of foregrounds is restricted.

Synthetic real dataset: Instead of adjusting the foreground of a composite image to get a harmonized image, the inverse approach is adjusting the foreground of a real image to get a synthetic composite image. This approach has been adopted in [139, 31, 25] to construct harmonization datasets. They treat a real image as the harmonized image, segment a foreground region, and adjust this foreground region to be inconsistent with the background, yielding a synthesized composite image. Then, pairs of synthesized composite images and real images can be used to supersede pairs of composite images and harmonized images. In [25], they adopt the aforementioned strategy and also draw inspiration from the real-world dataset [76], releasing the first large-scale image harmonization dataset iHarmony4 with 73,146 pairs of synthesized composite images and ground-truth real images. iHarmony4 consists of four sub-datasets: HCOCO, HAdobe5k, HFlickr, and Hday2night. 1) The HCOCO sub-dataset is synthesized based on Microsoft COCO [86]. In HCOCO, the composite images are synthesized from real images by adjusting the segmented foregrounds using color transfer methods [119, 155, 45, 114]. 2) The HAdobe5k sub-dataset is generated based on the MIT-Adobe FiveK dataset [17]. The composite image is generated by exchanging the segmented foreground between the real image and its five different renditions.


Fig. 3. Some examples from the iHarmony4 dataset [25] in the first row and from the rendered image harmonization dataset [26] in the second row. In each example, from left to right are the composite image, the foreground mask, and the ground-truth harmonized image.

3) The HFlickr sub-dataset is synthesized based on images crawled from the Flickr website. The composite images are synthesized similarly to HCOCO. 4) The Hday2night sub-dataset is generated based on the Transient Attributes Database [76]. The composite images are synthesized similarly to HAdobe5k, where the foreground is exchanged between images captured in different conditions. Some examples from iHarmony4 are shown in the first row of Fig. 3.
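
To illustrate the synthetic real strategy, the sketch below (our simplified illustration, not the exact iHarmony4 pipeline) perturbs the color of a segmented foreground region in a real image to obtain a (synthetic composite, ground-truth) training pair.

```python
import numpy as np

def make_synthetic_pair(real_image, fg_mask, rng=None):
    """real_image: (H, W, 3) uint8; fg_mask: (H, W) bool; returns (composite, ground_truth)."""
    rng = rng or np.random.default_rng()
    composite = real_image.astype(np.float32).copy()
    # A simple per-channel linear color perturbation applied only inside the foreground mask;
    # real datasets use more sophisticated color transfer between images.
    gain = rng.uniform(0.6, 1.4, size=3)
    bias = rng.uniform(-25, 25, size=3)
    composite[fg_mask] = composite[fg_mask] * gain + bias
    composite = np.clip(composite, 0, 255).astype(np.uint8)
    return composite, real_image  # the unmodified real image serves as the harmonization target

# Toy usage with random data.
real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=bool)
mask[64:192, 64:192] = True
composite, gt = make_synthetic_pair(real, mask)
```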

Rendered dataset: Recall that datasets like [76] contain multiple images of the same scene captured in different conditions, and composite images could be generated by simply exchanging the foregrounds. However, due to the high collection cost, such datasets are usually very small. To simulate this process virtually, Cong et al. [26] explore changing the lighting condition of the same scene using the 3D rendering software Unity3D. Specifically, they import 3D scenes and 3D models into Unity3D and shoot 2D scenes. To obtain multiple images of the same scene captured in different conditions, they use the Unistorm plug-in to control the capture conditions, leading to a set of images for each 2D scene. In this way, composite rendered images can be obtained by exchanging foregrounds between two images within the same set. Some examples from the rendered image harmonization dataset are shown in the second row of Fig. 3.

For quantitative evaluation, existing works adopt metrics including Mean Squared Error (MSE), foreground MSE (fMSE), Peak Signal-to-Noise Ratio (PSNR), Structural SIMilarity index (SSIM), foreground SSIM (fSSIM), Learned Perceptual Image Patch Similarity (LPIPS), and foreground L1 norm (fL1) to calculate the distance between the harmonized result and the ground-truth. For qualitative evaluation, they conduct user studies on real composite images by asking engaged users to select the most realistic images and calculate a metric (e.g., the B-T score [15]).
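
The foreground-restricted variants of these metrics simply restrict the error computation to the foreground mask; a minimal sketch (our own illustration) is given below.

```python
import numpy as np

def mse(pred, gt):
    return np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)

def fmse(pred, gt, fg_mask):
    # Foreground MSE: average squared error over foreground pixels only.
    diff = (pred.astype(np.float64) - gt.astype(np.float64)) ** 2
    return diff[fg_mask].mean()

def psnr(pred, gt, max_val=255.0):
    m = mse(pred, gt)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

# Toy usage: pred and gt are (H, W, 3) uint8 images, fg_mask is an (H, W) boolean mask.
pred = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
gt = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
fg_mask = np.zeros((64, 64), dtype=bool); fg_mask[16:48, 16:48] = True
print(mse(pred, gt), fmse(pred, gt, fg_mask), psnr(pred, gt))
```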

B. Traditional Methods

Existing traditional image harmonization methods [132, 158, 119, 77, 130] concentrate on better matching low-level appearance statistics and thus all belong to color statistics based algorithms, which can be further divided into nonlinear transformations [132, 158] (e.g., histogram matching) and linear transformations [119, 77, 178] (e.g., color shifting and scaling).

For nonlinear transformation, Sunkavalli et al. [132] decompose the foreground and background image into pyramids of appearance-related subbands and learn harmonization pyramid subband coefficients to fuse the subbands into a harmonious output using a smooth histogram matching technique. In [158], they find that matching zones of the histogram is more effective than matching the whole histogram and utilize a simple classifier to predict the best zone to match for illumination, color temperature, and saturation.

For linear transformation, Reinhard et al. [119] pioneer a simple yet efficient linear transformation by adjusting the mean and standard deviation of Lαβ channels to match the global color distribution of two images. In [77], they represent both foreground and background by clusters and learn color shift vectors for each cluster. In [178], they define a simple brightness and contrast model, learn the color shifting matrix for three channels, and optimize the model using a discriminative model for realism assessment.
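
A minimal sketch of this kind of statistics matching is shown below (our own illustration; for simplicity it matches per-channel mean and standard deviation directly in RGB rather than the Lαβ space used in [119], and the foreground mask is a hypothetical input).

```python
import numpy as np

def match_statistics(composite, fg_mask, eps=1e-6):
    """Shift and scale the foreground channels so their mean/std match the background's."""
    out = composite.astype(np.float64).copy()
    bg_mask = ~fg_mask
    for c in range(3):
        fg_vals = out[..., c][fg_mask]
        bg_vals = out[..., c][bg_mask]
        fg_mean, fg_std = fg_vals.mean(), fg_vals.std() + eps
        bg_mean, bg_std = bg_vals.mean(), bg_vals.std()
        # Normalize foreground statistics, then re-scale to background statistics.
        out[..., c][fg_mask] = (fg_vals - fg_mean) / fg_std * bg_std + bg_mean
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy usage.
composite = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
fg_mask = np.zeros((128, 128), dtype=bool); fg_mask[32:96, 32:96] = True
harmonized = match_statistics(composite, fg_mask)
```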

C. Deep Learning Methods

Existing deep image harmonization methods [139, 25, 31, 56, 88, 96] can be divided into supervised methods and unsupervised methods depending on whether paired training data are used.

For supervised methods, Tsai et al. [139] propose an end-to-end CNN for image harmonization and leverage an auxiliary scene parsing branch to improve the basic image harmonization network. Cun and Pun [31] design an additional Spatial-Separated Attention Module to learn regional appearance changes in low-level features for harmonization. Cong et al. [25] introduce the concept of domain and propose a domain verification discriminator to pull close the foreground domain and background domain. Similarly, Cong et al. [27] formulate image harmonization as background-guided domain translation and utilize the background domain code extracted by a domain code extractor to directly guide the harmonization process. One byproduct of [27] is predicting the inharmony level of an image by comparing the domain codes of foreground and background, so that we can selectively harmonize those apparently inharmonious composite images. Hao et al. [56] propose to align the standard deviation of the foreground features and background features using an attention-based module. In [129], they incorporate high-level auxiliary semantic features into the encoder-decoder architecture for image harmonization.


Guo et al. [54] realize that inharmony derives from the intrinsic reflectance and illumination difference between foreground and background, and propose an autoencoder to disentangle the composite image into reflectance and illumination for separate harmonization, where reflectance is harmonized through a material-consistency penalty and illumination is harmonized by learning and transferring light from background to foreground. In [88], they reframe image harmonization as a background-to-foreground style transfer problem and propose a region-aware adaptive instance normalization module to explicitly formulate the visual style from the background and adaptively apply it to the foreground.
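
The core operation behind such region-aware normalization can be sketched as follows (our simplified illustration, not the exact module from [88]): foreground features are normalized with their own statistics and then re-scaled with statistics computed over the background region of the feature map.

```python
import torch

def region_adain(features, fg_mask, eps=1e-5):
    """features: (N, C, H, W); fg_mask: (N, 1, H, W) with 1 for foreground, 0 for background."""
    bg_mask = 1.0 - fg_mask

    def masked_stats(mask):
        # Per-channel mean and std computed only over the masked spatial positions.
        area = mask.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (features * mask).sum(dim=(2, 3), keepdim=True) / area
        var = ((features - mean) ** 2 * mask).sum(dim=(2, 3), keepdim=True) / area
        return mean, (var + eps).sqrt()

    fg_mean, fg_std = masked_stats(fg_mask)
    bg_mean, bg_std = masked_stats(bg_mask)
    # Normalize foreground features, re-scale with background statistics, keep background intact.
    normalized_fg = (features - fg_mean) / fg_std * bg_std + bg_mean
    return fg_mask * normalized_fg + bg_mask * features

feat = torch.randn(2, 64, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()
out = region_adain(feat, mask)
```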

For unsupervised methods, Chen and Kae [19] propose the Geometrically and Color Consistent GAN (GCC-GAN) for realistic image composition. GCC-GAN considers geometric, color, and boundary consistency at the same time, making the composite images indistinguishable from real images. In [10], they propose to render convincing composite images by adjusting the cut-and-paste images so that some simple image inferences are consistent with cut-and-paste predictions, i.e., the albedo of the adjusted image should look like the cut-and-paste albedo and the shading should look like a shading field.

Finally, we briefly discuss the relation and difference between image relighting [112, 108, 128, 145] and image harmonization. Image relighting (e.g., face relighting, portrait relighting) can also adjust the foreground appearance according to the background. However, these methods usually achieve this goal by inferring explicit illumination information and 3D geometric information, for which the supervision is difficult and expensive to acquire. Besides, they generally have strong assumptions about the light source, which may not generalize well to complicated real-world scenes.

V. OBJECT SHADOW GENERATION

Shadow generation targets the shadow inconsistency between foreground and background in a composite image, that is, generating a plausible shadow for the foreground object according to background illumination information to make the composite image more realistic.

A. Datasets and Evaluation Metrics

Similar to image harmonization in Section IV, composite images without foreground shadows can be easily obtained. Nonetheless, it is extremely difficult to obtain paired data, i.e., a composite image without the foreground shadow and a ground-truth image with the foreground shadow, which are required by most deep learning methods on shadow generation [172, 91, 127]. Therefore, these works [172, 91, 127] construct rendered datasets with paired data by inserting a virtual object into a 3D scene and generating the shadow for this object with rendering techniques.

ARShadowGAN [91] releases a rendered dataset named Shadow-AR by inserting a foreground object into a real background image and generating its corresponding shadow with rendering techniques. Shadow-AR [91] contains 3,000 quintuples, and each quintuple consists of a synthetic image without the virtual object shadow, its corresponding image containing the virtual object shadow, a mask of the virtual object, a labeled real-world shadow matting, and its corresponding labeled occluder.

Some examples from the Shadow-AR dataset are exhibited in the first row of Fig. 4. Similar to ARShadowGAN [91], ShadowGAN [172] also adopts rendering techniques to construct a rendered dataset, in which 9,265 objects from a subset of commonly seen categories in ShapeNet [18] are selected as foreground objects, and 110 textures chosen from the Internet serve as rendering planes. In SSN [127], 43 human characters sampled from Daz3D and 59 general objects such as airplanes, bags, bottles, and cars from ShapeNet [18] and ModelNet [154] are chosen as foreground objects, and 15 soft shadows are generated for each object. One common drawback of the above datasets is that the number of foreground objects is limited and the background scenes are very simple, which is dramatically different from complicated real-world scenes. To overcome this drawback, Hong et al. [58] construct paired data by manually removing the foreground shadows from real shadowed images in the SOBA dataset [142], leading to the DESOBA dataset, with several examples shown in the second row of Fig. 4. This idea of creating synthetic composite images is similar to constructing the synthetic real dataset for image harmonization (see Section IV).

To evaluate the quality of generated composite images with foreground shadows, existing shadow generation works [172, 91, 163, 127] adopt Frechet Inception Distance (FID) [57], Manipulation Score (MS) [19], Structural SIMilarity index (SSIM) [125], Root Mean Square Error (RMSE), RMSE-s [6], Balanced Error Rate (BER) [110], accuracy, and Zero-Normalized Cross-Correlation (ZNCC) for quantitative evaluation. SSIM and RMSE are adopted to measure the distance between generated shadow images and ground-truth images, both of which are used in [91, 127] with paired data. Besides, SSN [127] leverages the scale-invariant ZNCC and RMSE-s to evaluate the quality of generated shadow images, while accuracy and BER are adopted in ARShadowGAN [91]. In [163], considering that the real shadow dataset has no paired ground-truth images, FID and MS are utilized to measure the realism of generated shadow images. For qualitative evaluation, user study is used as a common evaluation metric by all of the abovementioned methods, which is calculated from the scores of human raters and able to reflect human perception.

B. Traditional Methods

In [71], an illumination inference algorithm is proposed to recover a full lighting model of the scene from a single image, which can perform illumination editing for the inserted foreground object with user interaction in the loop. In [89], without user interaction, structure-from-motion and multi-view stereo are adopted to recover the initial scene and estimate an environment map with spherical harmonics, which can generate shadows for the inserted object. In [2], based on low-cost sensors, a low-cost approach is proposed to detect the light source and illuminate virtual objects according to the real ambient illumination.


Fig. 4. Some examples from the Shadow-AR dataset [91] in the first row and from the DESOBA dataset [58] in the second row. In each example, from left to right are the composite image without the foreground shadow, the foreground mask, and the ground-truth image with the foreground shadow.

C. Deep Learning Methods

Foreground shadow generation can be treated as an image-to-image translation task, so existing generative models like Pix2Pix [66] can be extended for shadow generation by translating the input composite image without the foreground shadow to the target image with the foreground shadow. Mask-ShadowGAN [62] conducts shadow removal and mask-guided shadow generation with unpaired data at the same time, so it can be directly extended to generate foreground shadows using its mask-guided shadow generation module. ShadowGAN [172] constructs a rendered dataset and proposes a new method which combines a global conditional discriminator and a local conditional discriminator to generate shadows for inserted 3D foreground objects. In [163], a car-based real shadow dataset and a pedestrian-based dataset are constructed to evaluate the effectiveness of the proposed adversarial image composition network, in which shadow generation and foreground style transfer are accomplished simultaneously. However, the method in [163] calls for extra indoor illumination datasets [48, 22]. ARShadowGAN [91] releases the rendered dataset Shadow-AR and proposes an attention-guided residual network. This network directly models the mapping from the real-world environment to the virtual object shadow to produce a coarse shadow image, which is sent to a refinement module for further improvement. SSN [127] designs an interactive soft shadow network to generate soft shadows for the foreground object with user control, based on the light configurations captured from an image-based environment light map. The above methods can generate reasonable shadows for foreground objects in simple rendered images, but often fail to generate reasonable shadows for real-world images of complicated scenes.

There are also some rendering based methods which can generate shadows for inserted objects. In particular, they require explicit knowledge of illumination, reflectance, material properties, and scene geometry to generate the shadow for an inserted virtual object using rendering techniques. However, such knowledge is usually unavailable in real-world applications. Some methods [84, 49, 168, 2] endeavor to infer explicit illumination conditions and scene geometry from a single image, but this estimation task is quite tough and inaccurate estimation may lead to terrible results [172].


VI. COMPOSITION-AWARE AESTHETIC EVALUATION

After addressing the appearance inconsistency (see Sections III, IV, V) and geometry inconsistency (see Section II) between foreground and background, we can obtain high-fidelity and natural-looking composite images. Nevertheless, sometimes realism is not enough and aesthetics is in high demand. In this case, we need to further evaluate the aesthetic value of a composite image. Image aesthetic evaluation aims to quantify the aesthetic quality of images automatically. To simulate the aesthetic perception of humans, there are many factors worth considering (e.g., interesting lighting, harmony, vivid color, depth of field, rule of thirds, symmetry). Among these factors, harmony is closely related to image harmonization in Section IV, so image harmonization can also be deemed as improving the aesthetic quality of composite images. The composition-relevant factors (e.g., rule of thirds, symmetry) characterize the placement or arrangement of visual elements in an image, which is closely related to object placement in Section II. In this paper, we focus on composition-aware aesthetic evaluation. Note that the term composition in composition-aware aesthetic evaluation and in image composition refers to different concepts: the former means the visual layout of an image while the latter means the process of combining foreground and background.

A. Datasets and Evaluation Metrics

In recent years, many large-scale aesthetic evaluation datasets [69, 32, 106, 75] have been contributed to the research community for more standardized evaluation of model performance. To the best of our knowledge, Photo.Net [69] and DPChallenge [32] are the earliest publicly available datasets specifically built for this task. The Aesthetic Visual Analysis database (AVA) [106] is the largest and most challenging public aesthetic assessment dataset so far. In these datasets, each image is rated by multiple professional raters and associated with an aesthetic score (e.g., on a scale of 1 to 7) or a binary aesthetic label (i.e., high-quality or low-quality).


Some datasets [106] also provide attribute annotations related to aesthetic evaluation (e.g., long exposure, motion blur, and rule of thirds). Besides these standard aesthetic evaluation datasets, most recently, many aesthetic caption datasets have been contributed to describe the general aesthetic quality of an image through text, like the Photo Critique Captioning Dataset (PCCD) [107], AVA-Comments [176], AVA-Reviews [144], FLICKR-AES [121], and DPC-Captions [68].

Since image aesthetic evaluation is generally considered a classification or regression task, it is natural to adopt classification accuracy or Mean Squared Error (MSE) as the evaluation metrics. The classification accuracy is computed between the predicted and ground-truth binary labels of aesthetic quality, and MSE measures the error between the predicted and ground-truth aesthetic mean scores (mean opinion scores). Then, to measure the ranking correlation and linear association between the estimated and ground-truth aesthetic scores, the Spearman Rank Correlation Coefficient (SRCC) and Linear Correlation Coefficient (LCC) are used in recent aesthetic evaluation approaches [75, 134, 21, 126]. Additionally, for the methods [134, 21, 59, 126] that are capable of predicting the aesthetic score distribution, the Earth Mover's Distance (EMD) [59] is also frequently utilized to measure the closeness between the predicted and ground-truth score distributions.
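
A small sketch (our own illustration, assuming SciPy and NumPy) of SRCC and of the cumulative-distribution form of EMD commonly used for discrete score histograms is given below.

```python
import numpy as np
from scipy.stats import spearmanr

# Predicted and ground-truth mean aesthetic scores for a toy set of images.
pred_scores = np.array([5.1, 3.2, 7.8, 4.4, 6.0])
gt_scores = np.array([4.8, 3.5, 8.1, 4.0, 6.3])
srcc, _ = spearmanr(pred_scores, gt_scores)

def emd(p, q, r=1):
    # EMD between two discrete score distributions (e.g., histograms over scores 1..10),
    # computed from the difference of their cumulative distributions.
    p, q = np.asarray(p, dtype=np.float64), np.asarray(q, dtype=np.float64)
    p, q = p / p.sum(), q / q.sum()
    cdf_diff = np.cumsum(p) - np.cumsum(q)
    return (np.mean(np.abs(cdf_diff) ** r)) ** (1.0 / r)

pred_dist = [0.0, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05, 0.0, 0.0]
gt_dist = [0.0, 0.0, 0.1, 0.25, 0.3, 0.2, 0.1, 0.05, 0.0, 0.0]
print("SRCC:", srcc, "EMD:", emd(pred_dist, gt_dist))
```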

B. Traditional Methods

Traditional aesthetic evaluation methods mainly exploit handcrafted features to encode aesthetics-relevant information. Equipped with machine learning algorithms, these conventional methods can accomplish aesthetic quality classification or regression. Among these traditional methods, there are some composition-aware methods that manually design feature extractors to model composition rules (e.g., symmetry, rule of thirds, vanishing point, and golden triangle) [137, 111, 177] or other composition-relevant factors (e.g., global image layout and salient regions of the images) to integrate image composition into aesthetic quality prediction. According to whether the used features are strongly tied to composition rules, existing composition-aware methods can be roughly divided into two groups: rule-based methods and rule-agnostic methods.

The rule-based methods [111, 177, 35, 99] generally leverage features that are capable of indicating whether the input image conforms to specific composition rules. For example, Luo and Tang [99] design a composition geometry feature to model the rule of thirds. Zhou et al. [177] propose to detect the vanishing point in images. Thommes and Hubner [137] design a feature to measure visual balance. Unlike the above methods modeling a single composition rule, the approaches in [111, 35] can estimate multiple types of composition rules. Obrador et al. [111] propose to approximate multiple photographic composition guidelines such as the rule of thirds, golden mean, and golden triangle rule.
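
As a toy example of such a rule-based feature, the sketch below (our own illustration, not a method from the cited works) scores how closely a salient object's center lies to the nearest rule-of-thirds intersection.

```python
import numpy as np

def rule_of_thirds_score(object_center, image_size):
    """object_center: (x, y) in pixels; image_size: (width, height).
    Returns a score in (0, 1], higher when the object sits near a thirds intersection."""
    w, h = image_size
    cx, cy = object_center
    # The four rule-of-thirds intersection points.
    points = [(w / 3, h / 3), (2 * w / 3, h / 3), (w / 3, 2 * h / 3), (2 * w / 3, 2 * h / 3)]
    # Distance to the nearest intersection, normalized by the image diagonal.
    d = min(np.hypot(cx - px, cy - py) for px, py in points) / np.hypot(w, h)
    return float(np.exp(-10.0 * d))  # the decay rate 10.0 is an arbitrary choice

print(rule_of_thirds_score((213, 160), (640, 480)))  # object near the top-left intersection
```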

The rule-agnostic methods utilize handcrafted features to characterize the image layout [123, 9, 95] and salient objects in images [149], instead of composition rules.

Fig. 5. The game playing between image composition and image manipulation detection. Image manipulation detection attempts to localize the inserted foreground while image composition strives to make the whole image realistic.

To name a few, the methods in [98, 97] fuse the features of the global image layout and salient regions based on image content. Zhang et al. [171] propose a graphlet-based method learning image descriptors that encode both local and global structural aesthetics from multiple visual channels. Rawat and Kankanhalli [118] extract composition features formed by landmark information.

C. Deep Learning Methods

Although abundant deep learning methods [75, 134, 21, 34, 101, 92, 126] have achieved significant progress on image aesthetic evaluation, only a few of them [101, 92, 126] can be viewed as composition-aware methods which explicitly model image composition. These composition-aware methods focus on learning the relations among the visual elements (regions or objects) in the image. Furthermore, these methods typically adopt graph-based architectures to capture the layout information, because graph-based architectures are well suited for modeling the relations among visual elements with irregularly distributed spatial locations. For example, Ma et al. [101] construct an attribute graph over the salient objects to represent the spatial layout of the holistic image. Liu et al. [92] propose to build a graph over the local regions and learn the mutual dependencies of all regions. She et al. [126] take the extracted feature maps as representations of nodes over all spatial locations and construct an aesthetic graph to capture the layout information. Zhang et al. [166] design saliency-augmented multi-pattern pooling to extract more expressive features from the perspectives of multiple patterns.

VII. IMAGE MANIPULATION DETECTION

By virtue of image composition techniques, we can generate realistic composite images which are indistinguishable from real images. However, as a double-edged sword, image composition techniques may also be maliciously used by attackers to create image forgeries and affect our lives in a negative way. Therefore, it is highly desirable to develop image manipulation detection methods. Image composition and image manipulation detection can combat each other (see Fig. 5), like the generator and discriminator in a GAN [53, 105]. Concerning the detection granularity, image manipulation detection can be divided into two groups: classification tasks [55, 117, 122, 36] and localization tasks [29, 124, 150, 65, 152, 180, 153, 74, 5, 63].


Furthermore, image manipulation methods can be categorized by tampering technique, such as copy-move [29, 146, 117, 151, 152], image splicing [30, 150, 65], image inpainting [180], image enhancement [7, 8, 24], and so on. The manipulation type most related to image composition is image splicing, which is the focus of this paper.

A. Datasets and Evaluation Metrics

Image splicing is a long-standing research topic. In the early pioneering works, Columbia gray [109], Columbia color [60], and Carvalho [33] are commonly used datasets, which consist of authentic images and spliced images. The rapid development of deep learning techniques inspires people to rethink the capability of neural networks for image splicing detection. Therefore, to meet the increasing demand for public datasets for image forensics, more splicing datasets have been constructed and released, among which the CASIA dataset [37] is the most widely used in recent studies.

To evaluate the detection performance on image splicing datasets, recent works usually adopt the following metrics: classification accuracy [122, 117, 41, 80, 103], Area Under the ROC Curve (AUC) [122, 150, 175, 153], F-score [122, 124, 150, 65, 175], Matthews Correlation Coefficient (MCC) [124, 65], mean Average Precision (mAP), and per-class Intersection over Union (cIOU) [65]. Among them, the commonly used metrics for the classification task are accuracy and AUC, while the commonly used metrics for the localization task are F-score and AUC.
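
For the localization task, these metrics are typically computed per pixel between the predicted tampering map and the ground-truth mask; a minimal sketch (our own illustration, assuming scikit-learn) is shown below.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

# Toy example: a predicted tampering probability map and a ground-truth binary mask.
pred_prob = np.random.rand(64, 64)           # per-pixel probability of being spliced
gt_mask = np.zeros((64, 64), dtype=int)
gt_mask[20:40, 20:40] = 1                    # ground-truth spliced region

y_true = gt_mask.ravel()
y_prob = pred_prob.ravel()
y_pred = (y_prob > 0.5).astype(int)          # threshold the probability map

print("F-score:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))
```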

B. Traditional Methods

Traditional image manipulation detection methods rely heavily on prior knowledge or assumptions about the datasets. Depending on the characteristics of the forgery, the tampering clues can be divided into three groups: noise patterns [102, 100, 30, 20, 23, 81, 90, 116, 70], Color Filter Array (CFA) interpolation patterns [36, 46], and JPEG-related compression artifacts [87, 11, 12, 43, 82, 1, 13, 61, 143].

The first group of methods assumes that noise patterns vary from image to image, because different camera devices have different capture parameters, and various post-processing techniques make the noise patterns even more specific to each image. In a composite image, the noise patterns of the foreground and background are usually different, which can be used to identify the spliced foreground [100, 30, 90].
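A minimal sketch of this idea, assuming a simple Laplacian high-pass residual and a median-absolute-deviation noise estimator (our own simplification rather than the estimators used in the cited works):

```python
import numpy as np
from scipy.ndimage import laplace

def block_noise_map(gray, block=32):
    """gray: 2D float array in [0, 1]; returns a per-block noise estimate."""
    residual = laplace(gray)                          # simple high-pass residual
    h, w = gray.shape
    rows, cols = h // block, w // block
    noise = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = residual[i*block:(i+1)*block, j*block:(j+1)*block]
            # robust noise estimate via the median absolute deviation
            noise[i, j] = 1.4826 * np.median(np.abs(patch - np.median(patch)))
    return noise

def suspicious_blocks(noise, z=3.0):
    """Flag blocks whose noise level deviates strongly from the image-wide statistics."""
    med = np.median(noise)
    mad = 1.4826 * np.median(np.abs(noise - med)) + 1e-8
    return np.abs(noise - med) / mad > z              # True = candidate spliced block
```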

The second group of methods is built on the Color Filter Array (CFA) consistency hypothesis. Most images are captured by a single image sensor overlaid with a CFA, so each pixel records only one color value, these values are arranged in a predefined pattern, and the missing values are filled in by interpolation. Combining two different images can disrupt the CFA interpolation patterns in multiple ways [36, 46].
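The sketch below illustrates the spirit of such CFA checks under strong simplifying assumptions (a Bayer-like layout and a plain bilinear predictor for the green channel); it is not the exact estimator of [36] or [46]. The intuition is that interpolated positions are predicted almost perfectly by their neighbors, while acquired positions are not, and splicing tends to erase this periodic imbalance.

```python
import numpy as np

def cfa_phase_ratio(green, block=32):
    """green: 2D float array (green channel); returns per-block variance ratios."""
    pred = 0.25 * (np.roll(green, 1, 0) + np.roll(green, -1, 0) +
                   np.roll(green, 1, 1) + np.roll(green, -1, 1))
    err = (green - pred) ** 2
    yy, xx = np.mgrid[0:green.shape[0], 0:green.shape[1]]
    phase = (yy + xx) % 2                             # the two Bayer phases for green
    h, w = green.shape
    ratios = np.zeros((h // block, w // block))
    for i in range(h // block):
        for j in range(w // block):
            e = err[i*block:(i+1)*block, j*block:(j+1)*block]
            p = phase[i*block:(i+1)*block, j*block:(j+1)*block]
            v0, v1 = e[p == 0].mean() + 1e-8, e[p == 1].mean() + 1e-8
            # a ratio close to 1 means no detectable CFA pattern (possible tampering)
            ratios[i, j] = max(v0, v1) / min(v0, v1)
    return ratios
```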

The third group of methods exploits JPEG compression consistencies. Most of these methods [13, 143] utilize JPEG quantization artifacts and the continuity of the JPEG compression grid to determine the spliced area.
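As one concrete example, the "JPEG ghost" idea of [43] recompresses the image at a range of qualities and inspects where the local difference dips, since a region that was originally saved at a lower quality leaves a difference minimum there. A hedged sketch using Pillow follows (the quality range is our own choice):

```python
import io
import numpy as np
from PIL import Image

def jpeg_ghost_maps(img, qualities=range(40, 96, 5)):
    """img: PIL RGB image; returns a dict quality -> per-pixel squared-difference map."""
    x = np.asarray(img, dtype=np.float32)
    maps = {}
    for q in qualities:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=q)       # recompress at candidate quality
        buf.seek(0)
        y = np.asarray(Image.open(buf).convert("RGB"), dtype=np.float32)
        maps[q] = ((x - y) ** 2).mean(axis=2)         # low values hint at a matching quality
    return maps
```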

C. Deep Learning Methods

Since traditional methods are limited to handcrafted features or prior knowledge, recent works have begun to leverage powerful deep learning techniques [5, 8, 159, 74, 83] to tackle image splicing based on inconsistencies in many aspects such as perspective, geometry, noise pattern, frequency characteristics, color, lighting, etc. Among them, Liang et al. [83] focus on localizing the inharmonious region whose color and illumination statistics are incompatible with the other regions. These methods can be roughly divided into three groups: local patch comparison [7, 122, 117, 65, 4, 5], forgery feature extraction [103, 175, 8, 159, 153], and adversarial learning [124, 74].

In the first group, the methods in [7, 122, 117] compare different regions in an image to spot the suspicious regions. Huh et al. [65] assume EXIF consistency and reveal the splicing area by comparing the EXIF attributes of image patches. J-LSTM [4] is an LSTM-based patch comparison method which finds tampered regions by detecting the edges between tampered patches and authentic patches. H-LSTM [5] exploits the observation that manipulations distort the natural statistics at the boundary of tampered regions.

In the second group, forgery features are derived and incorporated into the network. Bayar and Stamm [8] devise a constrained convolution layer capable of jointly suppressing image content and adaptively learning prediction-error features. Zhou et al. [175] use the Steganalysis Rich Model (SRM) [47] kernels to produce noise features. Yang et al. [159] use the constrained convolution kernel [8] to predict a pixel-level tampering probability map. Wu et al. [153] propose ManTra-Net, which exploits both SRM kernels [47] and constrained convolution kernels [8] to extract noise features, and further design a Z-score operation to capture local anomalies from the noise features.
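A minimal sketch of the constrained convolution constraint described in [8]: after every optimizer update, each first-layer filter is projected so that its center weight is -1 and the remaining weights sum to 1, turning the layer into a bank of prediction-error filters. Details such as the numerical epsilon are our own assumptions.

```python
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    """Conv2d whose filters are re-projected onto the Bayar-Stamm constraint."""
    def constrain(self):
        with torch.no_grad():
            w = self.weight                               # (out_ch, in_ch, k, k)
            c = w.shape[-1] // 2
            w[:, :, c, c] = 0.0
            w /= w.sum(dim=(2, 3), keepdim=True) + 1e-12  # off-center weights sum to 1
            w[:, :, c, c] = -1.0                          # center weight fixed to -1

# usage inside a training loop (sketch):
# layer = ConstrainedConv2d(3, 5, kernel_size=5, padding=2, bias=False)
# ...
# optimizer.step()
# layer.constrain()
```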

In the third group, adversarial learning [74] is adopted for the image splicing detection task. The work in [74] is the first to introduce adversarial learning to image manipulation detection. It designs a retoucher and an annotator, in which the retoucher is responsible for generating plausible composite images and the annotator is responsible for spotting the spliced region. The retoucher and the annotator are trained in an adversarial manner. A similar idea has also been explored in [64], in which the generator performs image harmonization while the discriminator strives to localize the inharmonious region.
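The following sketch illustrates such an adversarial loop with hypothetical `retoucher` and `annotator` modules standing in for the models of [74]; the loss choices are our own simplifications rather than the original training objectives.

```python
import torch
import torch.nn.functional as F

def adversarial_step(retoucher, annotator, opt_r, opt_a, composite, mask):
    """One training step: annotator localizes the splice, retoucher tries to hide it."""
    # annotator step: localize the spliced region in the retouched composite
    refined = retoucher(composite).detach()
    loss_a = F.binary_cross_entropy_with_logits(annotator(refined), mask)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # retoucher step: fool the annotator (push its prediction toward "authentic")
    refined = retoucher(composite)
    loss_r = F.binary_cross_entropy_with_logits(annotator(refined),
                                                torch.zeros_like(mask))
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()
    return loss_a.item(), loss_r.item()
```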

VIII. LIMITATIONS AND FUTURE DIRECTIONS

The past few years have witnessed great progress in the realm of image composition. However, there exist several limitations which severely hinder the development of image composition techniques:

1) Most existing image composition methods focus on a single unoccluded foreground, and only a few works [164, 169] discuss multiple foregrounds. However, in real-world applications, we are likely to paste multiple foregrounds on the background, and the foregrounds may be occluded. Therefore, more effort is needed to adapt image composition methods to real-world applications.


2) There are many issues to be solved to obtain a realistic composite image. As shown in Table I, existing methods attempt to solve one issue or multiple issues simultaneously. Since each issue can be very complicated and hard to solve, different research directions (e.g., object placement, image harmonization) diverge from image composition to solve different issues. There is no unified system to address all the issues in image composition, which increases the difficulty of deploying image composition models in real-world applications.

3) For the image composition task, ground-truth images are very difficult to acquire. Thus, for some research directions (e.g., object placement, image blending), there is no benchmark dataset, and subjective evaluation (e.g., user study) is not very convincing, which makes it hard to compare different methods fairly. Therefore, it is imperative to set up benchmark datasets with ground-truth images for more faithful evaluation. For example, after the first large-scale image harmonization dataset iHarmony4 [25] was recently released, some follow-up works [56, 129, 27, 54, 88, 10] sprouted up, which not only improve the harmonization performance but also offer different perspectives on this task.

Finally, we suggest some potential research directions for image composition:

1) As mentioned above, in real-world applications, we are likely to paste multiple foregrounds on the background, and the foregrounds may be occluded. To handle occluded foregrounds, we may need to complete them beforehand using image outpainting or object completion techniques [140, 174, 14, 165, 156, 39]. To handle multiple foregrounds, we need to consider their relative depth and complicated interactions to composite a realistic image.

2) Most existing image composition methods work from image to image, i.e., 2D → 2D. However, one intuitive solution to image composition is to infer the complete 3D geometric information of the foreground/background as well as the illumination information. Ideally, with complete geometric and illumination information, we can harmonize the foreground and generate the foreground shadow accurately. However, it is still very challenging to reconstruct the 3D world and estimate the illumination condition from a single 2D image. During this procedure, the induced noisy or redundant information may adversely affect the final performance. Nevertheless, 2D → 3D → 2D is an intriguing and promising research direction. We can also explore the middle ground between 2D → 2D and 2D → 3D → 2D to integrate the advantages of both.

3) For real composite images, ground-truth images are very difficult to acquire, whereas for rendered composite images, ground-truth images are relatively easy to obtain. Specifically, we can create pairs of rendered composite images and rendered ground-truth images by manipulating the foreground, background, camera setting, and lighting condition in 3D rendering software, which may alleviate the problem of lacking ground-truth images.

IX. CONCLUSION

In this paper, we have conducted a comprehensive survey on image composition. Although the image composition task sounds simple, it actually requires a variety of techniques to produce a realistic and aesthetic composite image. We have divided the research works on image composition into object placement, image blending, image harmonization, and shadow generation, based on which issues they aim to solve. We have discussed how to make the composite image not only realistic but also aesthetic. We have also introduced the approaches to fight against image composition, during which composite image generation and composite image detection can combat each other. In the future, we will extend the survey to more general composition in other fields, including video composition, 3D composition, text composition, audio composition, and so on.

REFERENCES

[1] Irene Amerini, Rudy Becarelli, Roberto Caldelli, and Andrea Del Mastio. Splicing forgeries localization through the use of first digit features. In IEEE International Workshop on Information Forensics and Security, 2014.

[2] Ibrahim Arief, Simon McCallum, and Jon Yngve Hardeberg. Realtime estimation of illumination direction for augmented reality on mobile devices. In Color and Imaging Conference, 2012.

[3] Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. Compositional GAN: Learning image-conditional binary composition. International Journal of Computer Vision, 128(10):2570–2585, 2020.

[4] Jawadul H Bappy, Amit K Roy-Chowdhury, Jason Bunk, Lakshmanan Nataraj, and BS Manjunath. Exploiting spatial structure for localizing manipulated image regions. In ICCV, 2017.

[5] Jawadul H Bappy, Cody Simons, Lakshmanan Nataraj, BS Manjunath, and Amit K Roy-Chowdhury. Hybrid LSTM and encoder–decoder architecture for detection of image forgeries. IEEE Transactions on Image Processing, 28(7):3286–3300, 2019.

[6] Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(8):1670–1687, 2014.

[7] Belhassen Bayar and Matthew C Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In ACM Workshop on Information Hiding and Multimedia Security, 2016.

[8] Belhassen Bayar and Matthew C Stamm. Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security, 13(11):2691–2706, 2018.

[9] Subhabrata Bhattacharya, Rahul Sukthankar, and Mubarak Shah. A framework for photo-quality assessment and enhancement based on visual aesthetics. In ACM MM, 2010.

[10] Anand Bhattad and David A. Forsyth. Cut-and-paste neural rendering. arXiv preprint arXiv:2010.05907, 2020.


[11] Tiziano Bianchi and Alessandro Piva. Detection of nonaligned double JPEG compression based on integer periodicity maps. IEEE Transactions on Information Forensics and Security, 7(2):842–848, 2011.

[12] Tiziano Bianchi and Alessandro Piva. Image forgery localization via block-grained analysis of JPEG artifacts. IEEE Transactions on Information Forensics and Security, 7(3):1003–1017, 2012.

[13] Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. Improved DCT coefficient analysis for forgery localization in JPEG images. In ICASSP, 2011.

[14] Richard Strong Bowen, Huiwen Chang, Charles Herrmann, Piotr Teterwak, Ce Liu, and Ramin Zabih. OCONet: Image extrapolation by object completion. In CVPR, 2021.

[15] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[16] Peter J Burt and Edward H Adelson. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics, 2(4):217–236, 1983.

[17] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Fredo Durand. Learning photographic global tonal adjustment with a database of input / output image pairs. In CVPR, 2011.

[18] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[19] Bor-Chun Chen and Andrew Kae. Toward realistic image compositing with adversarial learning. In CVPR, 2019.

[20] Mo Chen, Jessica Fridrich, Miroslav Goljan, and Jan Lukas. Determining image origin and integrity using sensor noise. IEEE Transactions on Information Forensics and Security, 3(1):74–90, 2008.

[21] Qiuyu Chen, Wei Zhang, Ning Zhou, Peng Lei, Yi Xu, Yu Zheng, and Jianping Fan. Adaptive fractional dilated convolution network for image aesthetics assessment. In CVPR, 2020.

[22] Dachuan Cheng, Jian Shi, Yanyun Chen, Xiaoming Deng, and Xiaopeng Zhang. Learning scene illumination by pairwise photos from rear and front mobile cameras. Computer Graphics Forum, 37(7):213–221, 2018.

[23] Giovanni Chierchia, Giovanni Poggi, Carlo Sansone, and Luisa Verdoliva. A bayesian-MRF approach for PRNU-based image forgery detection. IEEE Transactions on Information Forensics and Security, 9(4):554–567, 2014.

[24] Hak-Yeol Choi, Han-Ul Jang, Dongkyu Kim, Jeongho Son, Seung-Min Mun, Sunghee Choi, and Heung-Kyu Lee. Detecting composite image manipulation based on deep neural networks. In IWSSIP, 2017.

[25] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. DoveNet: Deep image harmonization via domain verification. In CVPR, 2020.

[26] Wenyan Cong, Junyan Cao, Li Niu, Jianfu Zhang, Xuesong Gao, Zhiwei Tang, and Liqing Zhang. Deep image harmonization by bridging the reality gap. arXiv preprint arXiv:2103.17104, 2021.

[27] Wenyan Cong, Li Niu, Jianfu Zhang, Jing Liang, and Liqing Zhang. BargainNet: Background-guided domain translation for image harmonization. In ICME, 2021.

[28] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

[29] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Efficient dense-field copy–move forgery detection. IEEE Transactions on Information Forensics and Security, 10(11):2284–2297, 2015.

[30] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. Splicebuster: A new blind image splicing detector. In IEEE International Workshop on Information Forensics and Security, 2015.

[31] Xiaodong Cun and Chi-Man Pun. Improving the harmony of the composite image by spatial-separated attention module. IEEE Transactions on Image Processing, 29:4759–4771, 2020.

[32] Ritendra Datta, Jia Li, and James Z Wang. Algorithmic inferencing of aesthetics and emotion in natural images: An exposition. In ICIP, 2008.

[33] Tiago Jose De Carvalho, Christian Riess, Elli Angelopoulou, Helio Pedrini, and Anderson de Rezende Rocha. Exposing digital image forgeries by illumination color classification. IEEE Transactions on Information Forensics and Security, 8(7):1182–1194, 2013.

[34] Yubin Deng, Chen Change Loy, and Xiaoou Tang. Image aesthetic assessment: An experimental survey. IEEE Signal Processing Magazine, 34(4):80–106, 2017.

[35] Sagnik Dhar, Vicente Ordonez, and Tamara L Berg. High level describable attributes for predicting aesthetics and interestingness. In CVPR, 2011.

[36] Ahmet Emir Dirik and Nasir Memon. Image tamper detection based on demosaicing artifacts. In ICIP, 2009.

[37] Jing Dong, Wei Wang, and Tieniu Tan. CASIA image tampering detection evaluation database. In IEEE China Summit and International Conference on Signal and Information Processing, 2013.

[38] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. In ECCV, 2018.

[39] Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi. Segan: Segmenting and generating the invisible. In CVPR, 2018.

[40] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[41] Yu Fan, Philippe Carre, and Christine Fernandez-Maloigne. Image splicing detection with local illumination estimation. In ICIP, 2015.

[42] Haoshu Fang, Jianhua Sun, Runzhong Wang, Minghao Gou, Yonglu Li, and Cewu Lu. InstaBoost: Boosting instance segmentation via probability map guided copy-pasting. In ICCV, 2019.

[43] Hany Farid. Exposing digital forgeries from JPEG ghosts. IEEE Transactions on Information Forensics and Security, 4(1):154–160, 2009.

[44] Raanan Fattal, Dani Lischinski, and Michael Werman. Gradient domain high dynamic range compression. In ACM Siggraph Computer Graphics, 2002.

[45] Ulrich Fecker, Marcus Barkowsky, and Andre Kaup. Histogram-based prefiltering for luminance and chrominance compensation of multiview video. IEEE Transactions on Circuits and Systems for Video Technology, 18(9):1258–1267, 2008.

[46] Pasquale Ferrara, Tiziano Bianchi, Alessia De Rosa, and Alessandro Piva. Image forgery localization via fine-grained analysis of CFA artifacts. IEEE Transactions on Information Forensics and Security, 7(5):1566–1577, 2012.

[47] Jessica Fridrich and Jan Kodovsky. Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security, 7(3):868–882, 2012.

[48] Marc-Andre Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagne, and Jean-Francois Lalonde. Learning to predict indoor illumination from a single image. ACM Transactions on Graphics, 36(6):1–14, 2017.

[49] Marc-Andre Gardner, Yannick Hold-Geoffroy, Kalyan Sunkavalli, Christian Gagne, and Jean-Francois Lalonde. Deep parametric indoor lighting estimation. In ICCV, 2019.

[50] Georgios Georgakis, Md. Alimoor Reza, Arsalan Mousavian, Phi-Hung Le, and Jana Kosecka. Multiview RGB-D dataset for object instance detection. In 3DV, 2016.

[51] Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. In RSS XIII, 2017.

[52] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. arXiv preprint arXiv:2012.07177, 2020.

[53] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. NIPS, 2014.

[54] Zonghui Guo, Haiyong Zheng, Yufeng Jiang, Zhaorui Gu, and Bing Zheng. Intrinsic image harmonization. In CVPR, 2021.

[55] Jong Goo Han, Tae Hee Park, Yong Ho Moon, and Il Kyu Eom. Efficient markov feature extraction method for image splicing detection using maximization and threshold expansion. Journal of Electronic Imaging, 25(2):023031, 2016.

[56] Guoqing Hao, Satoshi Iizuka, and Kazuhiro Fukui. Image harmonization with attention-based deep feature modulation. In BMVC, 2020.

[57] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.

[58] Yan Hong, Li Niu, Jianfu Zhang, and Liqing Zhang. Shadow generation for composite image in real-world scenes. arXiv preprint arXiv:2104.10338, 2021.

[59] Le Hou, Chen-Ping Yu, and Dimitris Samaras. Squared earth mover’s distance-based loss for training deep neural networks. arXiv preprint arXiv:1611.05916, 2016.

[60] Yu-Feng Hsu and Shih-Fu Chang. Detecting image splicing using geometry invariants and camera characteristics consistency. In ICME, 2006.

[61] Wu-Chih Hu and Wei-Hao Chen. Effective forgery detection using DCT+SVD-based watermarking for region of interest in key frames of vision-based surveillance. International Journal of Computational Science and Engineering, 8(4):297–305, 2013.

[62] Xiaowei Hu, Yitong Jiang, Chi-Wing Fu, and Pheng-Ann Heng. Mask-ShadowGAN: Learning to remove shadows from unpaired data. In ICCV, 2019.

[63] Xuefeng Hu, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri, Zhenheng Yang, and Ram Nevatia. SPAN: Spatial pyramid attention network for image manipulation localization. In ECCV, 2020.

[64] Hao-Zhi Huang, Sen-Zhe Xu, Jun-Xiong Cai, Wei Liu, and Shi-Min Hu. Temporally coherent video harmonization using adversarial networks. IEEE Transactions on Image Processing, 29:214–224, 2019.

[65] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A Efros. Fighting fake news: Image splice detection via learned self-consistency. In ECCV, 2018.

[66] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

[67] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.

[68] Xin Jin, Le Wu, Geng Zhao, Xiaodong Li, Xiaokun Zhang, Shiming Ge, Dongqing Zou, Bin Zhou, and Xinghui Zhou. Aesthetic attributes assessment of images. In ACM MM, 2019.

[69] Dhiraj Joshi, Ritendra Datta, Elena Fedorovskaya, Quang-Tuan Luong, James Z Wang, Jia Li, and Jiebo Luo. Aesthetics and emotions in images. IEEE Signal Processing Magazine, 28(5):94–115, 2011.

[70] Thibaut Julliand, Vincent Nozick, and Hugues Talbot. Image noise and digital image forensics. In International Workshop on Digital Watermarking, 2015.

[71] Kevin Karsch, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Hailin Jin, Rafael Fonte, Michael Sittig, and David Forsyth. Automatic scene inference for 3d object compositing. ACM Transactions on Graphics, 33(3):1–15, 2014.


[72] Michael Kazhdan and Hugues Hoppe. Streaming multigrid for gradient-domain operations on large images. ACM Transactions on Graphics, 27(3):1–10, 2008.

[73] Kotaro Kikuchi, Kota Yamaguchi, Edgar Simo-Serra, and Tetsunori Kobayashi. Regularized adversarial training for single-shot virtual try-on. In ICCV Workshop, 2019.

[74] Vladimir V Kniaz, Vladimir Knyaz, and Fabio Remondino. The point where reality meets fantasy: Mixed adversarial generators for image splice detection. In NeurIPS, 2019.

[75] Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. Photo aesthetics ranking network with attributes and content adaptation. In ECCV, 2016.

[76] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics, 33(4):1–11, 2014.

[77] Jean-Francois Lalonde and Alexei A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.

[78] Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Context-aware synthesis and placement of object instances. In NeurIPS, 2018.

[79] Anat Levin, Assaf Zomet, Shmuel Peleg, and Yair Weiss. Seamless image stitching in the gradient domain. In ECCV, 2004.

[80] Ce Li, Qiang Ma, Limei Xiao, Ming Li, and Aihua Zhang. Image splicing detection based on markov features in QDCT domain. Neurocomputing, 228:29–36, 2017.

[81] Chang-Tsun Li and Yue Li. Color-decoupled photo response non-uniformity for digital image forensics. IEEE Transactions on Circuits and Systems for Video Technology, 22(2):260–271, 2011.

[82] Weihai Li, Yuan Yuan, and Nenghai Yu. Passive detection of doctored JPEG image via block artifact grid extraction. Signal Processing, 89(9):1821–1829, 2009.

[83] Jing Liang, Li Niu, and Liqing Zhang. Inharmonious region localization. In ICME, 2021.

[84] Bin Liao, Yao Zhu, Chao Liang, Fei Luo, and Chunxia Xiao. Illumination animating and editing in a single picture using scene structure estimation. Computers & Graphics, 82:53–64, 2019.

[85] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. ST-GAN: spatial transformer generative adversarial networks for image compositing. In CVPR, 2018.

[86] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.

[87] Zhouchen Lin, Junfeng He, Xiaoou Tang, and Chi-Keung Tang. Fast, automatic and fine-grained tampered JPEG image detection via DCT coefficient analysis. Pattern Recognition, 42(11):2492–2501, 2009.

[88] Jun Ling, Han Xue, Li Song, Rong Xie, and Xiao Gu. Region-aware adaptive instance normalization for image harmonization. In CVPR, 2021.

[89] Bin Liu, Kun Xu, and Ralph R Martin. Static scene illumination estimation from videos with applications. Journal of Computer Science and Technology, 32(3):430–442, 2017.

[90] Bo Liu and Chi-Man Pun. Splicing forgery exposure in digital image by detecting noise discrepancies. International Journal of Computer and Communication Engineering, 4(1):33, 2015.

[91] Daquan Liu, Chengjiang Long, Hongpan Zhang, Hanning Yu, Xinzhi Dong, and Chunxia Xiao. ARShadowGAN: Shadow generative adversarial network for augmented reality in single light scenes. In CVPR, 2020.

[92] Dong Liu, Rohit Puri, Nagendra Kamath, and Subhabrata Bhattacharya. Composition-aware image aesthetics assessment. In WACV, 2020.

[93] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.

[94] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.

[95] Kuo-Yen Lo, Keng-Hao Liu, and Chu-Song Chen. Assessment of photo aesthetics with efficiency. In ICPR, 2012.

[96] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep painterly harmonization. In Computer Graphics Forum, volume 37, pages 95–106, 2018.

[97] Wei Luo, Xiaogang Wang, and Xiaoou Tang. Content-based photo quality assessment. In ICCV, 2011.

[98] Wei Luo, Xiaogang Wang, and Xiaoou Tang. Content-based photo quality assessment. IEEE Transactions on Image Processing, 15(8):1930–1943, 2013.

[99] Yiwen Luo and Xiaoou Tang. Photo and video quality evaluation: Focusing on the subject. In ECCV, 2008.

[100] Siwei Lyu, Xunyu Pan, and Xing Zhang. Exposing region splicing forgeries with blind local noise estimation. International Journal of Computer Vision, 110(2):202–221, 2014.

[101] Shuang Ma, Jing Liu, and Chang Wen Chen. A-Lamp: Adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In CVPR, 2017.

[102] Babak Mahdian and Stanislav Saic. Using noise inconsistencies for blind image forensics. Image and Vision Computing, 27(10):1497–1503, 2009.

[103] Owen Mayer and Matthew C Stamm. Learned forensic source similarity for unknown camera models. In ICASSP, pages 2012–2016, 2018.

[104] Shervin Minaee, Yuri Y Boykov, Fatih Porikli, Antonio J Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[105] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[106] Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, 2012.

[107] Naila Murray, Luca Marchesotti, and Florent Perronnin. Aesthetic critiques generation for photos. In ICCV, 2017.

[108] Thomas Nestmeyer, Jean-Francois Lalonde, Iain Matthews, and Andreas Lehrmann. Learning physics-guided face relighting under directional light. In CVPR, 2020.

[109] Tian-Tsong Ng, Shih-Fu Chang, and Q Sun. A data set of authentic and spliced image blocks. Columbia University, ADVENT Technical Report, pages 203–2004, 2004.

[110] Vu Nguyen, Tomas F Yago Vicente, Maozheng Zhao, Minh Hoai, and Dimitris Samaras. Shadow detection with conditional generative adversarial networks. In ICCV, 2017.

[111] Pere Obrador, Ludwig Schmidt-Hackenberg, and Nuria Oliver. The role of image composition in image aesthetics. In ICIP, 2010.

[112] Rohit Kumar Pandey, Sergio Orts Escolano, Chloe LeGendre, Christian Haene, Sofien Bouaziz, Christoph Rhemann, Paul Debevec, and Sean Fanello. Total relighting: Learning to relight portraits for background replacement. In SIGGRAPH, 2021.

[113] Patrick Perez, Michel Gangnet, and Andrew Blake. Poisson image editing. In ACM SIGGRAPH, 2003.

[114] Francois Pitie, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.

[115] Thomas Porter and Tom Duff. Compositing digital images. In ACM Siggraph Computer Graphics, 1984.

[116] Chi-Man Pun, Bo Liu, and Xiao-Chen Yuan. Multi-scale noise estimation for image splicing forgery detection. Journal of Visual Communication and Image Representation, 38:195–206, 2016.

[117] Yuan Rao and Jiangqun Ni. A deep learning approach to detection of splicing and copy-move forgeries in images. In IEEE International Workshop on Information Forensics and Security, pages 1–6, 2016.

[118] Yogesh Singh Rawat and Mohan S Kankanhalli. Context-aware photography learning for smart mobile devices. ACM Transactions on Multimedia Computing, Communications, and Applications, 12(1):1–24, 2015.

[119] Erik Reinhard, Michael Ashikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34–41, 2001.

[120] Tal Remez, Jonathan Huang, and Matthew Brown. Learning to segment via cut-and-paste. In ECCV, 2018.

[121] Jian Ren, Xiaohui Shen, Zhe Lin, Radomir Mech, and David J Foran. Personalized image aesthetics. In ICCV, 2017.

[122] Paolo Rota, Enver Sangineto, Valentina Conotter, and Christopher Pramerdorfer. Bad teacher or unruly student: Can deep learning say something in image forensics analysis? In ICPR, 2016.

[123] S. Bhattacharya, R. Sukthankar, and M. Shah. A holistic approach to aesthetic enhancement of photographs. ACM Transactions on Multimedia Computing, Communications, and Applications, 7(1):1–21, 2011.

[124] Ronald Salloum, Yuzhuo Ren, and C-C Jay Kuo. Image splicing localization using a multi-task fully convolutional network (MFCN). Journal of Visual Communication and Image Representation, 51:201–209, 2018.

[125] Tiago A Schieber, Laura Carpi, Albert Díaz-Guilera, Panos M Pardalos, Cristina Masoller, and Martín G Ravetti. Quantification of network structural dissimilarities. Nature Communications, 8(1):1–10, 2017.

[126] Dongyu She, Yu-Kun Lai, Gaoxiong Yi, and Kun Xu. Hierarchical layout-aware graph convolutional network for unified aesthetics assessment. In CVPR, 2021.

[127] Yichen Sheng, Jianming Zhang, and Bedrich Benes. SSN: Soft shadow network for image compositing. In CVPR, 2021.

[128] Zhixin Shu, Sunil Hadap, Eli Shechtman, Kalyan Sunkavalli, Sylvain Paris, and Dimitris Samaras. Portrait lighting transfer using a mass transport approach. ACM Transactions on Graphics, 36(4):1, 2017.

[129] Konstantin Sofiiuk, Polina Popenova, and Anton Konushin. Foreground-aware semantic representations for image harmonization. In WACV, 2021.

[130] Shuangbing Song, Fan Zhong, Xueying Qin, and Changhe Tu. Illumination harmonization with gray mean scale. In Computer Graphics International Conference, 2020.

[131] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas A. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.

[132] Kalyan Sunkavalli, Micah K. Johnson, Wojciech Matusik, and Hanspeter Pfister. Multi-scale image harmonization. ACM Transactions on Graphics, 29(4):125:1–125:10, 2010.

[133] Richard Szeliski, Matthew Uyttendaele, and Drew Steedly. Fast poisson blending using multi-splines. In ICCP, 2011.

[134] Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE Transactions on Image Processing, 27(8):3998–4011, 2018.

[135] Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, and Connelly Barnes. Where and who? automatic semantic-aware person composition. In WACV, 2018.

[136] Xuehan Tan, Panpan Xu, Shihui Guo, and Wencheng Wang. Image composition of partially occluded objects. In Computer Graphics Forum, volume 38, pages 641–650, 2019.

[137] K. Thommes and R. Hubner. Instagram likes for architectural photos can be predicted by quantitative balance measures and curvature. Frontiers in Psychology, 9(1):1050–1067, 2018.

[138] Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M. Rehg, and Visesh Chari. Learning to generate synthetic data via compositing. In CVPR, 2019.

[139] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In CVPR, 2017.

[140] Basile Van Hoorick. Image outpainting and harmonization using generative adversarial networks. arXiv preprint arXiv:1912.10960, 2019.

[141] Hao Wang, Qilong Wang, Fan Yang, Weiqi Zhang, and Wangmeng Zuo. Data augmentation for object detection via progressive and selective instance-switching. arXiv preprint arXiv:1906.00358, 2019.

[142] Tianyu Wang, Xiaowei Hu, Qiong Wang, Pheng-Ann Heng, and Chi-Wing Fu. Instance shadow detection. In CVPR, 2020.

[143] Wei Wang, Jing Dong, and Tieniu Tan. Tampered region localization of digital color images based on JPEG compression noise. In International Workshop on Digital Watermarking, 2010.

[144] Wenshan Wang, Su Yang, Weishan Zhang, and Jiulong Zhang. Neural aesthetic image reviewer. IET Computer Vision, 13(8):749–758, 2019.

[145] Zhibo Wang, Xin Yu, Ming Lu, Quan Wang, Chen Qian, and Feng Xu. Single image portrait relighting via explicit multiple reflectance channel modeling. ACM Transactions on Graphics, 39(6):1–13, 2020.

[146] Bihan Wen, Ye Zhu, Ramanathan Subramanian, Tian-Tsong Ng, Xuanjing Shen, and Stefan Winkler. COVERAGE—a novel database for copy-move forgery detection. In ICIP, 2016.

[147] Shuchen Weng, Wenbo Li, Dawei Li, Hongxia Jin, and Boxin Shi. Misc: Multi-condition injection and spatially-adaptive compositing for conditional person image synthesis. In CVPR, 2020.

[148] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. GP-GAN: Towards realistic high-resolution image blending. In ACM MM, 2019.

[149] Yaowen Wu, Christian Bauckhage, and Christian Thurau. The good, the bad, and the ugly: Predicting aesthetic image labels. In ICPR, 2010.

[150] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. Deep matching and validation network: An end-to-end solution to constrained image splicing localization and detection. In ACM MM, 2017.

[151] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. Busternet: Detecting copy-move image forgery with source/target localization. In ECCV, 2018.

[152] Yue Wu, Wael Abd-Almageed, and Prem Natarajan. Image copy-move forgery detection via an end-to-end deep neural network. In WACV, 2018.

[153] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In CVPR, 2019.

[154] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.

[155] Xuezhong Xiao and Lizhuang Ma. Color transfer in correlated color space. In ACM International Conference on Virtual Reality Continuum and Its Applications, 2006.

[156] Yuting Xiao, Yanyu Xu, Ziming Zhong, Weixin Luo, Jiawei Li, and Shenghua Gao. Amodal segmentation based on visible region segmentation and shape prior. AAAI, 2021.

[157] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In CVPR, 2017.

[158] Su Xue, Aseem Agarwala, Julie Dorsey, and Holly E. Rushmeier. Understanding and improving the realism of image composites. ACM Transactions on Graphics, 31(4):84:1–84:10, 2012.

[159] Chao Yang, Huizhou Li, Fangting Lin, Bin Jiang, and Hao Zhao. Constrained R-CNN: A general image manipulation detection model. In ICME, 2020.

[160] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In CVPR, 2017.

[161] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In ICCV, 2019.

[162] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spatial fusion GAN for image synthesis. In CVPR, 2019.

[163] Fangneng Zhan, Shijian Lu, Changgong Zhang, Feiying Ma, and Xuansong Xie. Adversarial image composition with auxiliary illumination. In ACCV, 2020.

[164] Fangneng Zhan, Jiaxing Huang, and Shijian Lu. Hierarchy composition GAN for high-fidelity image synthesis. IEEE Transactions on Cybernetics, 2021.

[165] Xiaohang Zhan, Xingang Pan, Bo Dai, Ziwei Liu, Dahua Lin, and Chen Change Loy. Self-supervised scene de-occlusion. In CVPR, 2020.

[166] Bo Zhang, Li Niu, and Liqing Zhang. Image composition assessment with saliency-augmented multi-pattern pooling. arXiv preprint arXiv:2104.03133, 2021.

[167] He Zhang, Jianming Zhang, Federico Perazzi, Zhe Lin, and Vishal M Patel. Deep image compositing. In WACV, 2021.

[168] Jinsong Zhang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Sunil Hadap, Jonathan Eisenman, and Jean-Francois Lalonde. All-weather deep outdoor lighting estimation. In CVPR, 2019.

[169] Lingzhi Zhang, Tarmily Wen, Jie Min, Jiancong Wang, David Han, and Jianbo Shi. Learning object placement by inpainting for compositional data augmentation. In ECCV, 2020.

[170] Lingzhi Zhang, Tarmily Wen, and Jianbo Shi. Deep image blending. In WACV, 2020.

[171] Luming Zhang, Yue Gao, Roger Zimmermann, Qi Tian, and Xuelong Li. Fusion of multichannel local and global structural cues for photo aesthetics evaluation. IEEE Transactions on Image Processing, 23(3):1419–1429, 2014.

[172] Shuyang Zhang, Runze Liang, and Miao Wang. ShadowGAN: Shadow synthesis for virtual objects with conditional adversarial networks. Computational Visual Media, 5(1):105–115, 2019.

[173] Song-Hai Zhang, Zhengping Zhou, Bin Liu, Xi Dong, and Peter Hall. What and where: A context-based recommendation system for object insertion. Computational Visual Media, 6(1):79–93, 2020.

[174] Zibo Zhao, Wen Liu, Yanyu Xu, Xianing Chen, Weixin Luo, Lei Jin, Bohui Zhu, Tong Liu, Binqiang Zhao, and Shenghua Gao. Prior based human completion. In CVPR, 2021.

[175] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich features for image manipulation detection. In CVPR, 2018.

[176] Ye Zhou, Xin Lu, Junping Zhang, and James Z Wang. Joint image and text representation for aesthetics analysis. In ACM MM, 2016.

[177] Zihan Zhou, Siqiong He, Jia Li, and James Z Wang. Modeling perspective effects in photographic composition. In ACM MM, 2015.

[178] Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A Efros. Learning a discriminative model for the perception of realism in composite images. In ICCV, 2015.

[179] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

[180] Xinshan Zhu, Yongjun Qian, Xianfeng Zhao, Biao Sun, and Ya Sun. A deep learning approach to patch-based image inpainting forensics. Signal Processing: Image Communication, 67:90–99, 2018.