
arXiv:1912.01839v3 [cs.CV] 22 Jun 2020

Explorable Super Resolution

Yuval Bahat and Tomer Michaeli
Technion - Israel Institute of Technology, Haifa, Israel
{yuval.bahat@campus,tomer.m@ee}.technion.ac.il

Figure 1: Exploring HR explanations to an LR image. Existing SR methods (e.g. ESRGAN [26]) output only one explanation to the input image. In contrast, our explorable SR framework allows producing infinitely many different perceptually satisfying HR images that all identically match a given LR input when down-sampled. Please zoom in to view subtle details. (Panels: low-res input; ESRGAN; other perfectly consistent reconstructions produced with our approach.)

Abstract

Single image super resolution (SR) has seen major performance leaps in recent years. However, existing methods do not allow exploring the infinitely many plausible reconstructions that might have given rise to the observed low-resolution (LR) image. These different explanations to the LR image may dramatically vary in their textures and fine details, and may often encode completely different semantic information. In this paper, we introduce the task of explorable super resolution. We propose a framework comprising a graphical user interface with a neural network backend, allowing editing the SR output so as to explore the abundance of plausible HR explanations to the LR input. At the heart of our method is a novel module that can wrap any existing SR network, analytically guaranteeing that its SR outputs would precisely match the LR input when down-sampled. Besides its importance in our setting, this module is guaranteed to decrease the reconstruction error of any SR network it wraps, and can be used to cope with blur kernels that are different from the one the network was trained for. We illustrate our approach in a variety of use cases, ranging from medical imaging and forensics, to graphics.

1. Introduction

Single image super resolution (SR) is the task of producing a high resolution (HR) image from a single low resolution (LR) image. Recent decades have seen an increasingly growing research interest in this task, peaking with the recent surge of methods based on deep neural networks. These methods demonstrated significant performance boosts, some in terms of achieving low reconstruction errors [4, 10, 14, 24, 12, 29, 11] and some in terms of producing photo-realistic HR images [13, 26, 23], typically via the use of generative adversarial networks (GANs) [5]. However, common to all existing methods is that they do not allow exploring the abundance of plausible HR explanations to the input LR image, and typically produce only a single SR output. This is dissatisfying, as although these HR explanations share the same low frequency content, manifested in their coarser image structures, they may significantly vary in their higher frequency content, such as textures and small details (see e.g., Fig. 1). Apart from affecting the image appearance, these fine details often encode crucial semantic information, as in the cases of text, faces and even textures (e.g., distinguishing a horse from a zebra). Existing SR methods ignore this abundance of valid solutions, and arbitrarily confine their output to a specific appearance with its particular semantic meaning.

In this paper, we initiate the study of explorable super resolution, and propose a framework for achieving it through user editing.



Figure 2: An example user editing process. Our GUI allows exploring the space of plausible SR reconstructions using a variety of tools. Here, local scribble editing is used to encourage the edited region to resemble the user's graphical input. Then the entire shirt area (red) is edited by encouraging its patches to resemble those in the source (yellow) region. At any stage of the process, the output is perfectly consistent with the input (its down-sampled version identically matches the input). (Panels: low-res input; our pre-edited SR; scribble editing; patch distribution editing.)

Figure 3: Visually examining the likelihood of patterns of interest. Given an LR image of a car license plate, we explore the possible valid SR reconstructions by attempting to manipulate the central digit to appear like any of the digits 0-9, using our imprinting tool (see Sec. 4). Though the ground truth HR digit was 1, judging by the ESRGAN [26] result (or by our pre-edited reconstruction) would probably lead to misidentifying it as 0. In contrast, our results when imprinting digits 0, 1 and 8 contain only minor artifacts, thus giving them similar likelihood. (Panels: low-res input; ESRGAN; ours (pre-edited); attempting to imprint optional digits and the resulting corresponding SR reconstructions.)

Our method consists of a neural network utilized by a graphical user interface (GUI), which allows the user to interactively explore the space of perceptually pleasing HR images that could have given rise to a given LR image. An example editing process is shown in Fig. 2. Our approach is applicable in numerous scenarios. For example, it enables manipulating the image so as to fit any prior knowledge the user may have on the captured scene, like changing the type of flora to match the capturing time and location, adjusting shades according to the capturing time of day, or manipulating an animal's appearance according to whether the image was taken in the zebras' or horses' habitat. It can also help determine whether a certain pattern or object could have been present in the scene. This feature is invaluable in many settings, including the forensic and medical contexts, exemplified in Figs. 3 and 4, respectively. Finally, it may be used to correct unpleasing SR outputs, which are common even with high capacity neural network models.

Our framework (depicted in Fig. 5) consists of three key ingredients, which fundamentally differ from common practice in SR.

Figure 4: Visually examining the likelihood of a medical pathology. Given an LR shoulder X-ray image, we evaluate the likelihood of a Supraspinatus tendon tear, typically characterized by a less than 7 mm Acromiohumeral distance (measured between the Humerus bone, marked red, and the Acromion bone above it). To this end, we attempt to imprint down-shifted versions (see Sec. 4) of the Acromion bone. Using the image quality as a proxy for its plausibility, we can infer a low chance of pathology, due to the artifacts emerging when forcing the pathological form (right image). (Panels: low-res input; forcing different Acromiohumeral distances of 8 mm (healthy), 7 mm (borderline) and 6 mm (pathologic).)

(i) We present a novel consistency enforcing module (CEM) that can wrap any SR network, analytically guaranteeing that its outputs identically match the input when down-sampled. Besides its crucial role in our setting, we illustrate the advantages of incorporating this module into any SR method. (ii) We use a neural network with a control input signal, which allows generating diverse HR explanations to the LR image. To achieve this, we rely solely on an adversarial loss to promote perceptual plausibility, without using any reconstruction loss (e.g. L1 or VGG) for promoting proximity between the network's outputs and the ground-truth HR images. (iii) We facilitate the exploration process by creating a GUI comprising a large set of tools. These work by manipulating the network's control signal so as to achieve various desired effects¹. We elaborate on these three ingredients in Secs. 2, 3 and 4.

¹ Our code and GUI are available online.


Figure 5: Our explorable SR framework. Our GUI allows interactively exploring the manifold of possible HR explanations for the LR image y, by manipulating the control signal z of the SR network. Our CEM is utilized to turn the network's output x_inc into a consistent reconstruction x̂, presented to the user. See Secs. 2, 3 and 4 for details. (Components: GUI, SR Net, CEM (Fig. 6).)

1.1. Related Work

GAN based image editing  Many works have employed GANs for image editing tasks. For example, Zhu et al. [30] performed editing by searching for an image that satisfies a user input scribble, while constraining the output image to lie on the natural image manifold, learned by a GAN. Perarnau et al. [21] suggested performing editing in a learned latent space, by combining an encoder with a conditional GAN. Xian et al. [28] facilitated texture editing, by allowing users to place a desired texture patch. Rott Shaham et al. [23] trained their GAN solely on the image to be edited, thus encouraging their edited output to retain the original image characteristics. While our method also allows traversing the natural image manifold, it differs from previous approaches in that it enforces the hard consistency constraint (restricting all outputs to identically match the LR input when down-sampled).

GAN based super-resolution  A large body of work has shown the advantage of using conditional GANs (cGANs) for generating photo-realistic SR reconstructions [13, 26, 22, 25, 27]. Unlike classical GANs, cGANs feed the generator with additional data (e.g. an LR image) together with the stochastic noise input. The generator then learns to synthesize new data (e.g. the corresponding SR image) conditioned on this input. In practice, though, cGAN based SR methods typically feed their generator only with the LR image, without a stochastic noise input. Consequently, they do not produce diverse SR outputs. While we also use a cGAN, we do add a control input signal to our generator's LR input, which allows editing its output to yield diverse results. Several cGAN methods for image translation did target output diversity by keeping the additional stochastic input [31, 3, 9], while utilizing various mechanisms for binding the output to this additional input. In our method, we encourage diversity by simply removing the reconstruction losses that are used by all existing SR methods. This is made possible by our consistency enforcing module.

2. The Consistency Enforcing Module

We would like the outputs of our explorable SR method to be both perceptually plausible and consistent with the LR input. To encourage perceptual plausibility, we adopt the common practice of utilizing an adversarial loss, which penalizes for deviations from the statistics of natural images. To guarantee consistency, we introduce the consistency enforcing module (CEM), an architecture that can wrap any given SR network, making it inherently satisfy the consistency constraint. This is in contrast to existing SR networks, which do not perfectly satisfy this constraint, as they encourage consistency only indirectly, through a reconstruction loss on the SR image. The CEM does not contain any learnable parameters and has many notable advantages over existing SR network architectures, on which we elaborate later in this section. We next derive our module.

Assume we are given a low resolution image y that is related to an unknown high-resolution image x through

y = (h ∗ x) ↓α . (1)

Here, h is a blur kernel associated with the point-spread function of the camera, ∗ denotes convolution, and ↓α signifies sub-sampling by a factor α. With slight abuse of notation, (1) can be written in matrix form as

y = Hx, (2)

where x and y now denote the vectorized versions of the HR and LR images, respectively, and the matrix H corresponds to convolution with h and sub-sampling by α. This system of equations is obviously under-determined, rendering it impossible to uniquely recover x from y without additional knowledge. We refer to any HR image x satisfying this constraint as consistent with the LR image y. We want to construct a module that can project any inconsistent reconstruction x_inc (e.g. the output of a pre-trained SR network) onto the affine subspace defined by (2). Its consistent output x̂ is thus the minimizer of

$$\min_{\hat{x}} \|\hat{x} - x_{inc}\|^2 \quad \text{s.t.} \quad H\hat{x} = y. \quad (3)$$

Intuitively speaking, such a module would guarantee that the low frequency content of x̂ matches that of the ground-truth image x (manifested in y), so that the SR network only needs to take care of plausibly reconstructing the high frequency content (e.g. sharp edges and fine textures).

Problems like (3) frequently arise in sampling theory (see e.g. [17]), and can be conveniently solved using a geometric viewpoint.


Figure 6: CEM architecture. The CEM, given by Eq. (8), can wrap any given SR network. It projects the network's output x_inc onto the space of images that identically match the input y when downsampled, thus producing a super-resolved image x̂ guaranteed to be consistent with y. See Sec. 2 for details. (Components: SR Net; interpolation and high-pass branches; CEM.)

Specifically, let us utilize the fact that $P_{N(H)^\perp} = H^T (H H^T)^{-1} H$ is known to be the orthogonal projection matrix onto $N(H)^\perp$, the subspace that is perpendicular to the nullspace of $H$. Now, multiplying both sides of the constraint in (3) by $H^T (H H^T)^{-1}$ yields

$$P_{N(H)^\perp}\, \hat{x} = H^T (H H^T)^{-1} y. \quad (4)$$

This shows that we should strictly set the component of $\hat{x}$ in $N(H)^\perp$ to equal the right hand side of (4).

We are therefore restricted to minimizing the objective by manipulating only the complementary component, $P_{N(H)}\hat{x}$, which lies in the nullspace of $H$. Decomposing the objective into the two subspaces,

$$\|P_{N(H)}(\hat{x} - x_{inc})\|^2 + \|P_{N(H)^\perp}(\hat{x} - x_{inc})\|^2, \quad (5)$$

we see that $P_{N(H)}\hat{x}$ appears only in the first term, which it minimizes when set to

$$P_{N(H)}\hat{x} = P_{N(H)} x_{inc}. \quad (6)$$

Combining the two components from (4) and (6), and using the fact that $P_{N(H)} = I - H^T (H H^T)^{-1} H$, we get that

$$\hat{x} = P_{N(H)}\hat{x} + P_{N(H)^\perp}\hat{x} = \left(I - H^T (H H^T)^{-1} H\right) x_{inc} + H^T (H H^T)^{-1} y. \quad (7)$$

To transform (7) into a practical module that can wrap any SR architecture with output $x_{inc}$, we need to replace the impractical multiplication operations involving the very large $H$ matrix with their equivalent operations: convolutions, downsampling and upsampling. To this end, we observe that since $H$ corresponds to convolution with $h$ followed by sub-sampling, $H^T$ corresponds to up-sampling followed by convolution with a mirrored version of $h$, which we denote by $\tilde{h}$. The multiplication by $(H H^T)^{-1}$ can then be replaced by convolving with a filter $k$, constructed by computing the inverse of the filter $(h * \tilde{h})\downarrow_\alpha$ in the Fourier domain. We thus have that

$$\hat{x} = x_{inc} - \tilde{h} * \big[\, k * (h * x_{inc})\downarrow_\alpha \big]\uparrow_\alpha + \, \tilde{h} * (k * y)\uparrow_\alpha. \quad (8)$$

              2×       3×       4×
EDSR [14]     35.97    32.27    30.30
EDSR+CEM      36.11    32.36    30.37

Table 1: Wrapping a pre-trained SR network with CEM. Mean PSNR values over the BSD100 dataset [15], super-resolved by factors 2, 3 and 4. Merely wrapping the network with our CEM can only improve the reconstruction error, as manifested by the slight PSNR increase in the 2nd row.

Thus, given the blur kernel² h, we can calculate the filters appearing in (8) and hardcode their non-learned weights into our CEM, which can wrap any SR network, as shown in Fig. 6 (see Supplementary for padding details). Before proceeding to incorporate it in our scheme, we note that the CEM is beneficial for any SR method, in the following two aspects.
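To make the projection in (7) concrete, the following is a minimal NumPy sketch (our own illustrative example, not the authors' implementation) that builds H explicitly for a tiny 1-D signal with a toy blur kernel, and numerically verifies both the consistency of the wrapped output and the error-decrease property discussed next:

```python
import numpy as np

def build_H(n, alpha, h):
    """Matrix applying convolution with kernel h followed by sub-sampling by alpha
    (circular boundary handling, for simplicity of the toy example)."""
    k = len(h)
    rows = []
    for out_idx in range(0, n, alpha):
        row = np.zeros(n)
        for j, w in enumerate(h):
            row[(out_idx + j - k // 2) % n] = w
        rows.append(row)
    return np.stack(rows)

n, alpha = 16, 4
h = np.array([0.25, 0.5, 0.25])              # toy blur kernel (not the bicubic one)
H = build_H(n, alpha, h)

rng = np.random.default_rng(0)
x_gt = rng.standard_normal(n)                # "ground-truth" HR signal
y = H @ x_gt                                 # its consistent LR observation (Eq. (2))
x_inc = rng.standard_normal(n)               # some inconsistent SR estimate

pinv = H.T @ np.linalg.inv(H @ H.T)          # H^T (H H^T)^{-1}
x_hat = (np.eye(n) - pinv @ H) @ x_inc + pinv @ y    # Eq. (7)

print(np.allclose(H @ x_hat, y))                                      # True: perfect consistency
print(np.linalg.norm(x_hat - x_gt) <= np.linalg.norm(x_inc - x_gt))   # True: error never grows (Eq. (9))
```

In practice the paper replaces the explicit matrix by the filtering operations of Eq. (8); the dense-matrix form above is only meant to make the geometry easy to check on a small example.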

Reduced reconstruction error  Merely wrapping a pre-trained SR network with output $x_{inc}$ by our CEM can only decrease its reconstruction error w.r.t. the ground-truth $x$, as

$$\|x_{inc} - x\|^2 \geq \|P_{N(H)}(x_{inc} - x)\|^2 \overset{(a)}{=} \|P_{N(H)}(\hat{x} - x)\|^2 \overset{(b)}{=} \|\hat{x} - x\|^2. \quad (9)$$

Here, (a) is due to (6), and (b) follows from (4), which implies that $\|P_{N(H)^\perp}(\hat{x} - x)\|^2 = 0$ (as $P_{N(H)^\perp}\hat{x} = H^T (H H^T)^{-1} H x = P_{N(H)^\perp} x$). This is demonstrated in Tab. 1, which reports PSNR values attained by the EDSR network [14] with and without our CEM, for several different SR scale factors.

Adapting to a non-default down-sampling kernel  Deep-learning-based SR methods are usually trained on LR images obtained by down-sampling HR images using a specific down-sampling kernel (usually the bicubic kernel). This constitutes a major drawback, as their performance significantly degrades when processing LR images corresponding to different kernels, as is the case with most realistic images [18, 24]. This problem can be alleviated using our CEM, which takes the down-sampling kernel as a parameter upon its assembly at test time. Thus, it can be used to adapt a given network that was pre-trained using some default kernel to the kernel corresponding to the LR image. Figure 7 demonstrates the advantage of this approach on a real LR image, using the kernel estimation method of [2].

3. Editable SR Network

To achieve our principal goal of creating an explorable super resolution network, we develop a new framework that comprises a GUI with a neural network backend (see Fig. 5). Our SR network differs from existing SR methods in two core aspects.

² We default to the bicubic blur kernel when h is not given.


Figure 7: Accounting for a different blur kernel at test time. The CEM can be used to adapt any SR network to a blur kernel other than the one it was trained on. We demonstrate this on a real LR image. Here, ESRGAN [26], which was trained with the bicubic kernel, produces blurry results. However, wrapping it with our CEM, with the kernel estimated by [2], results in a much sharper reconstruction. Corresponding kernels are visualized on each image. (Panels: low-res input; ESRGAN; ESRGAN+CEM.)

First, it is capable of producing a wide variety of plausible and inherently consistent HR reconstructions x for any input LR image y. This is achieved thanks to the CEM, which obviates the need for any reconstruction loss (e.g. L1 or VGG) for driving the outputs close, on average, to the corresponding ground truth training images. Such losses bias the outputs towards the average of all possible explanations to the LR image, and are thus not optimal for the purpose of exploration. Second, our network incorporates a control input signal z, which allows traversing this manifold of plausible images so as to achieve various manipulation effects on the output x. Using our GUI, the user can either tune z manually to achieve simple effects (e.g. control the orientation and magnitude of gradients as in Fig. 8) or use any of our automatic manipulation tools that take as input e.g. user scribbles or some pattern to imprint (see Figs. 2-4, 10, 12).

We use the same network architecture as ESRGAN [26], but train it wrapped by the CEM, and by minimizing a loss function comprising four terms,

$$L_{Adv} + \lambda_{Range} L_{Range} + \lambda_{Struct} L_{Struct} + \lambda_{Map} L_{Map}. \quad (10)$$

Here, L_Adv is an adversarial loss, which encourages the network outputs to follow the statistics of natural images. We specifically use a Wasserstein GAN loss with gradient penalty [6], and avoid using the relativistic discriminator [8] employed in ESRGAN, as it induces a sort of full-reference supervision. The second loss term, L_Range, penalizes for pixel values that exceed the valid range [0, 1], and thus helps prevent model divergence. We use $L_{Range} = \frac{1}{N}\|x - \mathrm{clip}_{[0,1]}\{x\}\|_1$, where N is the number of pixels.

                 Diversity (σ)   Percept. quality (NIQE)   Reconst. error (RMSE)
ESRGAN           0               3.5 ± 0.9                 17.3 ± 7.2
ESRGAN with z    3.6 ± 1.7       3.7 ± 0.8                 17.5 ± 6.9
Ours             7.2 ± 3.4       3.7 ± 0.9                 18.2 ± 7.4

Table 2: Quality and diversity of SR results. We report diversity (standard deviation, higher is better), perceptual quality (NIQE [19], lower is better) and RMSE (lower is better), for 4× SR on the BSD100 test set [15]. Values are measured over 50 different SR outputs per input image, produced by injecting 50 random, spatially uniform z inputs. Note that our model, trained without any full-reference loss terms, shows a significant advantage in terms of diversity, while exhibiting similar perceptual quality. See Supplementary for more details about this experiment.

The last two loss terms, L_Struct and L_Map, are associated with the control signal z. We next elaborate on the control mechanism and these two penalties.

3.1. Incorporating a Control Signal

As mentioned above, to enable editing the output image x, we introduce a control signal z, which we feed to the network in addition to the input image y. We define the control signal as z ∈ R^(w×h×c), where w × h are the dimensions of the output image x and c = 3, to allow intricate editing abilities (see below). To prevent the network from ignoring this additional signal, as reported for similar cases in [7, 16], we follow the practice in [20] and concatenate z to the input of each layer of the network, where layers with smaller spatial dimensions are concatenated with a spatially downscaled version of z. At test time, we use this signal to traverse the space of plausible HR reconstructions. Therefore, at train time, we would like to encourage the network to associate different z inputs with different HR explanations. To achieve this, we inject random z signals during training.
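The following PyTorch snippet is a hedged sketch of this conditioning scheme (our own simplification, not the authors' architecture): z is concatenated channel-wise to each layer's input, resized to that layer's spatial resolution when needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZConditionedBlock(nn.Module):
    """One convolutional block whose input is concatenated with the control signal z."""
    def __init__(self, in_ch, out_ch, z_ch=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + z_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, feat, z):
        # Resize z to the current feature resolution before concatenation.
        if z.shape[-2:] != feat.shape[-2:]:
            z = F.interpolate(z, size=feat.shape[-2:], mode='bilinear', align_corners=False)
        return F.relu(self.conv(torch.cat([feat, z], dim=1)))

# Toy usage: a 64x64 LR patch and a control signal defined at the HR (output) resolution.
y = torch.randn(1, 3, 64, 64)
z = torch.randn(1, 3, 256, 256)   # w x h x 3 control signal for 4x SR
block = ZConditionedBlock(in_ch=3, out_ch=32)
feat = block(y, z)                # z is downscaled to 64x64 internally
print(feat.shape)                 # torch.Size([1, 32, 64, 64])
```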

Incorporating this input signal into the original ESRGAN method already affects output diversity. This can be seen in Tab. 2, which compares the vanilla ESRGAN method (1st row) with its variant augmented with z as described above (2nd row), both trained for an additional 6000 generator steps using the original ESRGAN loss. However, we can obtain an even larger diversity. Specifically, recall that, as opposed to the original ESRGAN, our loss uses no reconstruction (full-reference) penalty that resists diversity. The effect can be seen in the 3rd row of Tab. 2, which corresponds to our model trained for the same number of steps³ using only the L_Adv and L_Range loss terms. Note that all three models are on par in terms of perceptual quality.

³ Weights corresponding to z in the 2nd and 3rd rows' models are initialized to 0, while all other weights are initialized with the pre-trained ESRGAN model from the 1st row.


Figure 8: Manipulating image gradients. A user can explore different perceptually plausible textures (e.g. the Mandrill's cheek fur) by manually adjusting the directions and magnitudes of image gradients using the control signal z. (Panels: low-res input; pre-edited output; edited regions with weak gradients, strong gradients, strong vertical gradients and strong horizontal gradients.)

Now that the output of our network strongly depends on z, we move on to make this dependence easy to manipulate by a user. To allow manual control, we want simple modifications to z to lead to variations that are intuitive to grasp. To allow optimization-based manipulations, we want to ensure that any plausible output image could be generated by some z. These two requirements are encouraged by the last two loss terms in (10), as we describe next.

Controlling image gradients with the L_Struct loss  To allow intuitive manual control, we encourage spatially uniform perturbations to z to affect the spatial derivatives of the output image x. Figure 8 demonstrates how manipulating the gradients' magnitudes and directions can be used to edit textures, like fur. We choose to associate the 3 channels of z with the 3 degrees of freedom of the structure tensor of x, which is the 2 × 2 matrix defined by

$$S_x = \iint \big(\nabla x(\xi, \eta)\big)\big(\nabla x(\xi, \eta)\big)^T \, d\xi\, d\eta.$$

We encode this link into our network through the loss term L_Struct, which penalizes for the difference between desired structure tensors S_d, determined by a randomly sampled, spatially uniform z, and the tensors corresponding to the actual network's outputs, S_x (see Supplementary for more details).

Facilitating optimization-based editing via the L_Map loss  Denoting our network output by x = ψ(y, z), we would like to guarantee that ψ can generate every plausible image x with some choice of z. To this end, we introduce the loss term $L_{Map} = \min_z \|\psi(y, z) - x\|_1$, which penalizes for differences between the real natural image x and its best possible approximation using some signal z. Within each training step, we first solve the internal minimization of L_Map over z for 10 iterations, and then freeze this z for the minimization of all loss terms in (10). Note that in contrast to L_Struct, which only involves spatially uniform z inputs, the minimization over z in L_Map exposes our network to the entire z space during training, encouraging its mapping onto the entire natural image manifold. Figure 9 illustrates how incorporating the L_Map loss term improves our model's ability to generate plausible patterns.
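A minimal sketch of this inner minimization (assuming a generic generator `psi`; the optimizer and learning rate are illustrative choices, not taken from the paper):

```python
import torch

def best_z_approximation(psi, y, x_real, z_shape, n_iters=10, lr=0.1):
    """Approximate min_z ||psi(y, z) - x_real||_1 by a few gradient steps over z."""
    z = torch.zeros(z_shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    loss = None
    for _ in range(n_iters):
        opt.zero_grad()
        loss = (psi(y, z) - x_real).abs().mean()   # L1 distance to the real HR image
        loss.backward()
        opt.step()
    # The returned (frozen) z is then plugged into all loss terms of Eq. (10).
    return z.detach(), None if loss is None else loss.item()
```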


Figure 9: Effect of L_Map. Incorporating the L_Map loss term into our model's training introduces spatially non-uniform z inputs, and improves the model's ability to cover the natural image manifold. We demonstrate this by comparing a model trained with vs. without this term, in the task of imprinting the digit '8' on the license plate of Fig. 3. (Panels: LR input; imprint; outputs with and without L_Map.)

3.2. Training details

We train our network using the 800 DIV2K training images [1]. We use the standard bicubic downsampling kernel, both for producing LR training inputs and as h in our CEM (Eq. (8)), and feed our network with 64 × 64 input patches, randomly cropped from these LR images. We initialize our network weights with the pre-trained ESRGAN [26] model, except for the weights corresponding to the input signal z, which we initialize to zero. We minimize the loss in (10) with λ_Range = 5000, λ_Struct = 1 and λ_Map = 100. We set the Wasserstein GAN gradient penalty weight to λ_gp = 10. We establish the critic's credibility before performing each generator step, by verifying that the critic correctly distinguishes fake images from real ones for 10 consecutive batches. We use a batch size of 48 and train for ∼80K steps.

4. Editing Tools

We incorporate our trained SR network described in Sec. 3 as the backend engine of an editing GUI that allows manipulating the output SR image x by properly shaping the control signal z. Our GUI comprises several tools that can be applied on selected regions of the image, which we achieve by manipulating only the corresponding regions of z (recall that the dimensions of z and x are the same).

The most basic tool consists of three knobs (top middle in Fig. 5) that manually control the triple-channeled z signal in a spatially uniform manner, so as to affect the image structure tensor S_x (Sec. 3.1). We make this tool intuitive for the user by binding the knobs to the eigenvalue decomposition of S_x, thus affecting image gradients.


Figure 10: Editing using tools that optimize over z. Here we use our variance manipulation, periodicity, and patch dictionary tools. Each optimizes over z space to achieve a different objective. See Sec. 4 for more details. (Panels: low-res input; our pre-edited SR; manipulating variance; adjusting periodicity; propagating patch distribution.)

Figure 11: Utilizing scribble editing tools. Our tools allow regaining small details and producing crisper and more realistic images (compared to, e.g., ESRGAN [26]). Here we use the piecewise smoothness tool (TV minimization) to enhance edges between teeth, and restore forehead wrinkles using our brightness reduction pen. We imprint the eye region from an online available image of (brown-eyed) Mark Ruffalo to restore the crisp appearance of the blue eyes of Emmanuel Macron. (Panels: low-res input; ESRGAN; utilizing our editing tools; our edited result.)

The user can control the orientation θ and magnitude λ1 of prominent gradients, and the magnitude λ2 in the perpendicular direction (see Fig. 8).

To allow a more diverse set of intricate editing operations, we also propose tools that optimize specific editing objectives (e.g. increasing local image variance). These tools invoke a gradient descent optimization process over z, whose goal is to minimize the chosen objective. This is analogous to traversing the manifold of perceptually plausible valid SR images (captured by our network), while remaining consistent with the LR image (thanks to the CEM).

Our GUI recomputes x after each edit in nearly real time, ranging from a couple of milliseconds for the basic tools to 2-3 seconds for editing a 100 × 100 region using the z optimization tools (with an NVIDIA GeForce 2080 GPU). We next briefly survey and demonstrate these tools, and leave the detailed description of each objective to the Supplementary.

Variance manipulation  This set of tools searches for a z that decreases or increases the variance of pixel values. This is demonstrated in Fig. 10, yielding a smoother table map and more textured trousers. A variant of this tool allows locally magnifying or attenuating existing patterns.

User scribble  Our GUI incorporates a large set of tools for imposing a graphical input by the user on the SR image.


A user first scribbles on the current SR image and chooses the desired color and line width. Our GUI then optimizes over z, searching for the image x that satisfies the imposed input while lying on the learned manifold of valid, perceptually plausible SR images. Figure 2 (1st editing operation) demonstrates how we use line scribbles to create the desired shirt pattern. Variants of this tool (demonstrated in Fig. 11) enable increasing or decreasing brightness, as well as enforcing smoothness across a specific scribble mark, by minimizing local TV.

Imprinting  This tool enables a user to impose content taken from an external image, or from a different region within the same image. After selecting the desired region, our GUI utilizes the CEM to combine the low frequency content corresponding to y with the high frequency content of the region to be imprinted. This results in a consistent image region, but one which may not necessarily be perceptually plausible. We then invoke an optimization over z, which attempts to drive x close to this consistent region, while remaining on the natural image manifold. Example usages are shown in Figs. 1 (right-most image), 3, 11 and 4, where the tool is used for subtle shifting of image regions.

Desired dictionary of patches  This tool manipulates the edited region to comprise patches stemming from a desired patch collection. A user first marks a source region containing desired patches. This region may be taken either from the edited image or from an external image. Our GUI then optimizes z to encourage the target edited region to comprise the same desired patches, where we use patches of size 6 × 6. Figures 2 (2nd editing operation) and 10 (right) show how we use this tool to propagate a desired local cloth pattern (yellow region) to the entire garment. Variants of this tool allow matching patches while ignoring their mean value or variance, to allow disregarding color and pattern magnitude differences between source and target patches.

Encouraging periodicity  This tool enhances the periodic nature of the image along one or two directions marked by the user. The desired period length can be manually adjusted, as exemplified in Fig. 10 for the purpose of producing different stripe widths. Alternatively, a user can choose to enhance the most prominent existing period length, automatically estimated by our GUI.

Random diverse alternatives  This tool offers an easy way to explore the image manifold. When invoked, it optimizes z to simultaneously produce N different SR image alternatives, by maximizing the L1 distance between them in pixel space. The user can then use each of the alternatives (or sub-regions thereof) as a baseline for further editing. A variant of this tool constrains the different alternatives to remain close to the current SR image.

The wide applicability of our method and editing tools is further demonstrated in Figs. 12 and 13, in two use cases.

Figure 12: More exploration examples. Using our framework to explore possible SR images corresponding to different semantic contents, all consistent with the LR input. (Panels: low-res inputs; consistent SR reconstructions showing a gray horse vs. a young zebra, and a spotless deer vs. a spotted deer.)

Figure 13: Correcting SR outputs. Our method can be used to enhance SR results, relying on the user's prior knowledge (e.g. the appearance of buildings and alphabet characters). Please zoom in to view subtle differences, and refer to the Supplementary for the editing processes. (Panels: low-res inputs; ESRGAN; our edited results.)

Figure 12 demonstrates our method's exploration capabilities, illustrating how utterly different semantic content can be generated for the same LR input. Figure 13 shows how our method's editing capabilities can enhance SR outputs, by relying on the user's external knowledge. Please see more editing process examples and results in the Supplementary.

5. Conclusion

A low resolution image may correspond to many different HR images. However, existing SR methods generate only a single, arbitrarily chosen, reconstruction. In this work, we introduced a framework for explorable super resolution, which allows traversing the set of perceptually plausible HR images that are consistent with a given LR image. We illustrated how our approach is beneficial in many domains, including medicine, surveillance, and graphics.

Acknowledgment  This research was supported by the Israel Science Foundation (grant 852/17) and by the Technion Ollendorff Minerva Center.


Explorable Super Resolution - Supplementary Material

Yuval Bahat and Tomer Michaeli
Technion - Israel Institute of Technology, Haifa, Israel
{yuval.bahat@campus,tomer.m@ee}.technion.ac.il

A. Boundary Effects and Padding

Our consistency enforcing module comprises several filter convolution operations (see Fig. 6), each resulting in a reduced output size. We follow the common practice of zero-padding the filters' outputs to match their input sizes, which causes the consistency guarantee to break as we get closer to the image boundaries, potentially resulting in undesired boundary effects.

To prevent these boundary effects from affecting the training of our editable SR network, we avoid utilizing outer image regions when calculating our loss function (10), by first removing the 10·α outer pixels of our network's output, where α is the SR scale factor. In inference mode, we reduce boundary effects by replicating the LR image boundary pixels 10 times before feeding it to our network, and then removing the 10·α outer pixels from the network's output. This works better than the common zero-padding.
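A minimal sketch (assumed, not the released code) of this inference-time procedure:

```python
import torch
import torch.nn.functional as F

def super_resolve_with_replication(sr_net, y, alpha, pad=10):
    """Replicate-pad the LR input by `pad` pixels on each side, super-resolve,
    then crop pad*alpha pixels from the HR output to discard boundary effects."""
    y_padded = F.pad(y, (pad, pad, pad, pad), mode='replicate')   # (N, C, H+2*pad, W+2*pad)
    x_padded = sr_net(y_padded)
    crop = pad * alpha
    return x_padded[..., crop:-crop, crop:-crop]
```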

B. Diversity Comparison Experiment (Tab. 2)

We performed an experiment to compare the effect of incorporating the control input signal z and omitting reconstruction loss terms on the diversity and quality of SR outputs. The results in Tab. 2 indicate that while incorporating z into the original ESRGAN model already facilitates producing diverse outputs, training using our scheme, which uses no reconstruction loss (thanks to the CEM), yields a significant boost in output diversity. This is while all compared models demonstrate similar perceptual quality. We next present details about the experiment setup and the processing of the results.

B.1. Experimental Setup

We study the case of 4× SR and initialize the weights of all three models in the table using the official pre-trained ESRGAN model [26], where weights corresponding to the z signal (in the 2nd and 3rd rows' models) are initialized to zero, and train each model for an additional 6000 generator steps on the DIV2K dataset [1]. The models in the first two rows are trained using the original ESRGAN loss function, which includes full-reference reconstruction loss terms, namely L1 and VGG. Even the adversarial loss is computed relative to a reference ground truth image, as they employ a relativistic discriminator [8]. Our model in the 3rd row is trained using only the L_Adv and L_Range loss terms, without using any full-reference terms.

We evaluate all three models using the 100 images of the BSD100 dataset [15]. For the 2nd and 3rd models we produce 50 different versions of x, by running the model 50 times for each input image y, each time with a different, randomly sampled, spatially uniform z.

B.2. Metrics Calculation

The first row's model has 0 diversity, as it can only output one SR output per LR input. To evaluate diversity for the other two models, we calculate the per-pixel standard deviation across the different z inputs, and average over all image pixels and color channels, yielding a per-input-image value σi, i = 1 ... 100. The diversity and error margin values presented for each model in Tab. 2 correspond to the average and standard deviation of σi over all 100 dataset images, respectively. However, recall from Sec. 2 that since our method's outputs are consistent with their corresponding input image y, their diversity can only be manifested in the component of the output lying in the nullspace of H, P_N(H) x. We therefore project each output onto this nullspace before performing the diversity calculations described above for both latter models, to yield a more relevant diversity score. To put the presented diversity values in context, note that the average spatial standard deviation of P_N(H) x (the component of the real HR image x lying in H's nullspace) is 15.2.
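A small NumPy sketch (our own illustration) of this per-image diversity score, assuming a `project_nullspace` helper that applies P_N(H) = I − H^T(HH^T)^{-1}H to an image:

```python
import numpy as np

def diversity_score(outputs, project_nullspace):
    """outputs: array of shape (50, H, W, 3) holding the SR results for one LR input."""
    projected = np.stack([project_nullspace(x) for x in outputs])  # keep only the nullspace component
    per_pixel_std = projected.std(axis=0)                          # std across the 50 z samples
    return per_pixel_std.mean()                                    # average over pixels and channels
```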

We use the NIQE score [19] to evaluate perceptual quality. We compute the score for each image, and for the latter two models average over all image versions. The perceptual quality score and error margins for each model are then calculated as the mean and standard deviation over the entire BSD100 dataset, respectively. To put the presented NIQE scores into context, note that the average score for the real HR images is 3.1.

Finally, we use the Root Mean Square Error (RMSE) measure to quantify reconstruction error. For each output image x, we measure the squared differences between x and its corresponding ground truth image, and then average over all image pixels and over the three color channels. Taking the square root yields a per-image RMSE. We then follow the same procedure as with the former two metrics to yield the average score and error margins for each model.

C. Utilizing Loss Term L_Struct

As we explain in Sec. 3.1, we encourage the control signal z to affect the magnitude and direction of the spatial derivatives of the output image x, by associating z with the structure tensor of x, denoted S_x. The structure tensor is a 2 × 2 matrix computed as

$$S_x = \iint \big(\nabla x(\xi, \eta)\big)\big(\nabla x(\xi, \eta)\big)^T \, d\xi\, d\eta,$$

where ∇x(ξ, η) is the local gradient of image x at pixel (ξ, η). We encode this link into our network through the loss term L_Struct, which penalizes for the difference between the structure tensors corresponding to the network's outputs, S_x, and desired structure tensors S_d, determined by a randomly sampled, spatially uniform z, as we explain next.

For each training image in each optimization batch, we perform the following steps:

1. Drawing random desired structure tensors Sd:

(a) We uniformly sample λ1, λ2 ∈ [0, 1] and θ ∈ [0, 2π], corresponding to the desired magnitudes and direction of prominent image edges, respectively. These parameters induce a specific singular value decomposition (SVD) of an image structure tensor, and therefore uniquely determine it.

(b) We then compose the induced structure tensor as

$$S_d = \begin{bmatrix} \lambda_1 \cos^2\theta + \lambda_2 \sin^2\theta & \lambda_1 \lambda_2 \sin\theta \cos\theta \\ \lambda_1 \lambda_2 \sin\theta \cos\theta & \lambda_1 \sin^2\theta + \lambda_2 \cos^2\theta \end{bmatrix}.$$

2. Feeding and running our model:

(a) We set the input control signal z to be spatially constant. Its three channels are set to the values on and above the diagonal of S_d, normalized to lie in a symmetric range around zero.

(b) We feed z along with the image y into our model, and compute the resulting image x.

(c) We calculate the corresponding structure tensor of x, S_x.

3. The plausible range of spatial derivative magnitudes varies according to the local nature of the image; e.g., images containing prominent textures correspond to larger magnitudes than smooth images. It therefore makes sense to grant the user control over relative magnitudes rather than absolute ones, by adjusting the values of S_x and S_d as follows:

(a) We adjust S_x by computing the sum over the magnitudes of the product of the horizontal and vertical spatial image derivatives in the corresponding ground truth image x, and use this value to normalize S_x.

(b) We adjust each value S_d[i, j] in S_d to lie in [P5(S_x[i, j]), P95(S_x[i, j])], the 5th and 95th percentiles of S_x[i, j], respectively, collected over 10K images:

$$S_d[i, j] \triangleq \frac{P_{95}(S_x[i, j]) - P_5(S_x[i, j])}{2}\, S_d[i, j] + \frac{P_{95}(S_x[i, j]) + P_5(S_x[i, j])}{2}.$$

4. Finally, we calculate the per-image loss as the difference between S_x and S_d:

$$L_{Struct} = |S_x[1, 1] - S_d[1, 1]| + |S_x[1, 2] - S_d[1, 2]| + |S_x[2, 2] - S_d[2, 2]|.$$
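The following NumPy sketch (illustrative only; the normalization of step 3 is omitted) composes S_d exactly as in step 1(b), computes S_x for a grayscale image, and evaluates the per-image loss of step 4:

```python
import numpy as np

def compose_Sd(lam1, lam2, theta):
    """Desired structure tensor from the sampled magnitudes and edge direction (step 1(b))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[lam1 * c**2 + lam2 * s**2, lam1 * lam2 * s * c],
                     [lam1 * lam2 * s * c,       lam1 * s**2 + lam2 * c**2]])

def structure_tensor(x):
    """Structure tensor of a grayscale image, integrated (summed) over all pixels."""
    gy, gx = np.gradient(x)
    return np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                     [np.sum(gx * gy), np.sum(gy * gy)]])

def struct_loss(S_x, S_d):
    """L1 difference over the three free entries of the symmetric 2x2 tensors (step 4)."""
    return (abs(S_x[0, 0] - S_d[0, 0]) + abs(S_x[0, 1] - S_d[0, 1]) + abs(S_x[1, 1] - S_d[1, 1]))

rng = np.random.default_rng(0)
S_d = compose_Sd(rng.uniform(0, 1), rng.uniform(0, 1), rng.uniform(0, 2 * np.pi))
S_x = structure_tensor(rng.standard_normal((32, 32)))
print(struct_loss(S_x, S_d))
```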

D. Editing Tools

Our GUI comprises many editing and exploration tools. The most basic tool allows a user to adjust the local spatial image derivatives, as demonstrated in Fig. 8, by tuning three knobs corresponding to the three channels of z. To allow more intricate operations, we propose additional tools that optimize specific editing objectives (e.g. increasing local image variance). These tools invoke a gradient descent optimization process over z, whose goal is to minimize the chosen objective. This is analogous to traversing the manifold of perceptually plausible valid SR images (captured by our network), while always remaining consistent with the LR image (thanks to the CEM). After briefly surveying these tools in Sec. 4, we next elaborate on each of them in detail. Tools operating on image patches (rather than directly on the image itself) use partially overlapping 6 × 6 patches. The degree of overlap varies, and is indicated separately for each tool.

D.1. Variance Manipulation

This set of tools operates by manipulating the local variance of all partially overlapping image patches in the selected region. We optimize over z to increase or decrease the per-patch variance by a pre-determined, user-given value.
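A hedged sketch of the per-patch variance objective such an optimization over z could minimize (6 × 6 patches; the exact stride and target value are user parameters in the GUI, so the values here are illustrative):

```python
import torch
import torch.nn.functional as F

def variance_objective(region, target_variance, patch=6):
    """region: (C, H, W) tensor; pull each overlapping patch's variance toward target_variance."""
    patches = F.unfold(region.unsqueeze(0), kernel_size=patch)   # (1, C*patch*patch, N)
    per_patch_var = patches.var(dim=1)                           # variance within each patch
    return (per_patch_var - target_variance).abs().mean()
```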

Signal magnitude manipulation  A variant of this tool preserves the patches' structure and only manipulates the magnitude of their signal, after removing their mean values. This variant operates on fewer patches, by sub-sampling the overlapping patches using a stride of 4 rows.


D.2. User Scribble

Our GUI comprises a large set of Microsoft-Paint-like tools, allowing a user to impose a graphical input on the output image. These include pen and straight line tools (with adjustable line width), as well as polygon, square and circle drawing tools. The scribbling color can be chosen manually or sampled from any given image (including the edited one). After scribbling, the user initiates the optimization process, traversing the z space in an attempt to find the image x that is closest to the desired user input, while lying on the learned manifold of valid, perceptually plausible images.

Brightness manipulation  A user can use a variant of all scribble tools to increase or decrease the current local image brightness by a given, adjustable factor. To this end, the user chooses the brightness increase/decrease "color" rather than a standard color when drawing the graphical input. Once initiated, the optimization process of this variant attempts to satisfy the desired local relative brightness changes, rather than absolute graphical inputs.

Local TV minimization  Similar to the brightness manipulation variant, this variant operates by separately minimizing the total variation (TV) in each separate scribble input. Choosing the TV minimization "color", the user can use any scribble tool to mark the desired regions. The optimization process then operates on all pixels in each marked region, attempting to minimize the magnitude of the differences between each pixel and its 8 neighboring pixels.
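A hedged sketch of the local TV objective described above: the sum of absolute differences between each pixel in the marked region and its 8 neighbors (`region_mask` is an assumed boolean mask of the scribbled pixels; the roll-based shifting is our own simplification):

```python
import torch

def local_tv(x, region_mask):
    """x: (H, W) image tensor; region_mask: (H, W) boolean tensor marking the scribbled region."""
    tv = 0.0
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for dy, dx in shifts:
        shifted = torch.roll(x, shifts=(dy, dx), dims=(0, 1))
        tv = tv + ((x - shifted).abs() * region_mask).sum()   # differences only inside the region
    return tv
```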

D.3. Imprinting

This tool enables a user to enforce graphical content on the SR output. The user first selects the desired content, either from within the edited image or from an external one. The user then marks a bounding rectangle on the edited image, onto which the desired content is imprinted. The imprinting itself uses the CEM module to keep the original low frequency content from y, while manipulating only the high frequency content using the desired input. This already guarantees that the imprinted region is consistent with the LR image y. The user can then modify the exact location and size of the imprinted content using 4 arrow buttons. Finally, optimizing over the z space searches for the closest image x that is not only consistent, but also lies on the learned manifold of natural images.

Subtle region shifting  Instead of marking the desired imprinting location and size, a user can choose to imprint content taken from the edited image onto itself. The user can then utilize the 4 arrow buttons to move the selected content from its original location and modify its size, thus inducing subtle shifting of selected image regions.

D.4. Desired Dictionary of Patches

This tool manipulates (target) patches in a desired region to resemble the patches comprising a desired (source) region, taken either from an external image or from a different location in the edited image. To allow encouraging desired textures across regions with different colors, we first remove the mean patch value from each patch, in both the source and target patches. To reduce the computational load, we discard some of the overlapping patches, by using strides of 2 and 4 rows in the source and target regions, respectively. Once the source and target regions are selected, optimizing over z traverses the learned manifold of valid plausible images to minimize the distance between each target patch and its closest source patch.
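A minimal sketch (assumed, not the released implementation) of the resulting objective: each mean-removed target patch is pulled toward its nearest mean-removed source patch.

```python
import torch
import torch.nn.functional as F

def patch_dict_loss(target_region, source_region, patch=6, target_stride=4, source_stride=2):
    """target_region, source_region: (C, H, W) tensors; returns the mean nearest-source-patch distance."""
    def extract(img, stride):
        p = F.unfold(img.unsqueeze(0), kernel_size=patch, stride=stride)  # (1, C*patch*patch, N)
        p = p.squeeze(0).t()                                              # (N, C*patch*patch)
        return p - p.mean(dim=1, keepdim=True)                            # remove each patch's mean
    tgt = extract(target_region, target_stride)
    src = extract(source_region, source_stride)
    dists = torch.cdist(tgt, src)                 # pairwise distances between target and source patches
    return dists.min(dim=1).values.mean()         # distance of each target patch to its closest source patch
```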

Ignoring patches' variance  A variant of this tool allows encouraging desired textures without changing the current local variance. To this end, we normalize the patches' variance, in addition to removing their mean. Then, while optimizing over z, we add an additional objective to preserve the original variance of each target patch, while encouraging its (normalized) signal to resemble that of its closest (normalized) source patch.

D.5. Encouraging Periodicity

This tool encourages the periodic nature of an image region, across one or two directions determined by the user. The desired period length (in pixels) for each direction can be manually set by the user, or it can be automatically set to the most prominent period length, found by calculating the local image self-correlation. Periodicity is then encouraged by penalizing for the difference between the image region and its version translated by a single period length, for each desired direction.
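A hedged sketch of this penalty for a single direction (the roll-based translation and the mean absolute difference are our own simplification of the described objective):

```python
import torch

def periodicity_loss(region, period, axis):
    """region: (C, H, W) tensor; period: period length in pixels; axis: 1 (vertical) or 2 (horizontal)."""
    shifted = torch.roll(region, shifts=period, dims=axis)
    return (region - shifted).abs().mean()
```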

D.6. Random Diverse Alternatives

As we describe in Sec. 4, this tool allows exploring the image manifold, producing N different SR outputs by maximizing the L1 distance between them in pixel space. These images (or sub-regions thereof) can then serve as a baseline for further editing and exploration.
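A minimal sketch of the corresponding objective, to be minimized during the joint optimization over the N control signals (illustrative only):

```python
import torch

def diversity_objective(outputs):
    """outputs: list of N image tensors of identical shape; negated sum of pairwise L1 distances."""
    loss = 0.0
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            loss = loss - (outputs[i] - outputs[j]).abs().mean()
    return loss
```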

Constraining distance to current image  This variant adds the requirement that all N images should be close to the current x, in terms of L1 distance in pixel space.

E. Additional Results and Editing Processes

We next present editing processes and some additional results obtained using our framework. Figs. 14 and 15 present editing processes and resulting SR outputs for editing face images of Bruce Willis and Emmanuel Macron, respectively. Unlike the editing process presented in Fig. 11, the process in Fig. 15 makes use of an external image of Macron himself, yielding enhanced imprinting results. Figs. 16 and 17 present the exploration editing processes yielding the different SR outputs presented in Fig. 12, and Figs. 18 and 19 depict the editing processes leading to the enhanced SR images presented in Fig. 13.

References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126-135, 2017.
[2] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-GAN. arXiv preprint arXiv:1909.06581, 2019.
[3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172-2180, 2016.
[4] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184-199. Springer, 2014.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[6] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.
[7] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv, 2016.
[8] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.
[9] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019.
[10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646-1654, 2016.
[11] Idan Kligvasser, Tamar Rott Shaham, and Tomer Michaeli. xUnit: Learning a spatial activation function for efficient image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2433-2442, 2018.
[12] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[13] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681-4690, 2017.
[14] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136-144, 2017.
[15] David Martin, Charless Fowlkes, Doron Tal, Jitendra Malik, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision, 2001.
[16] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[17] Tomer Michaeli and Yonina C. Eldar. Optimization techniques in modern sampling theory. Cambridge, UK: Cambridge Univ. Press, 2010.
[18] Tomer Michaeli and Michal Irani. Nonparametric blind super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 945-952, 2013.
[19] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a completely blind image quality analyzer. IEEE Signal Processing Letters, 20(3):209-212, 2012.
[20] Pablo Navarrete Michelini, Dan Zhu, and Hanwen Liu. Multi-scale recursive and perception-distortion controllable image super-resolution. In Proceedings of the European Conference on Computer Vision Workshops, 2018.
[21] Guim Perarnau, Joost Van De Weijer, Bogdan Raducanu, and Jose M Alvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
[22] Mehdi SM Sajjadi, Bernhard Scholkopf, and Michael Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 4491-4500, 2017.
[23] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. SinGAN: Learning a generative model from a single natural image. arXiv preprint arXiv:1905.01164, 2019.
[24] Assaf Shocher, Nadav Cohen, and Michal Irani. Zero-shot super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3118-3126, 2018.
[25] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 606-615, 2018.
[26] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[27] Yifan Wang, Federico Perazzi, Brian McWilliams, Alexander Sorkine-Hornung, Olga Sorkine-Hornung, and Christopher Schroers. A fully progressive approach to single-image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 864-873, 2018.
[28] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. TextureGAN: Controlling deep image synthesis with texture patches. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[29] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[30] Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597-613. Springer, 2016.
[31] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465-476, 2017.


Figure 14: Face editing using only the scribble tool. Editing a face image of Bruce Willis, utilizing only the scribble tools and their variants: brightness reduction, local TV minimization, and the polygon & ellipse tools. (Panels: low-res input; ESRGAN; utilizing scribble tools; our edited result.)

Figure 15: An alternative to the editing process in Fig. 11. We use our imprinting tool to imprint the high resolution versions of Macron's eyes and teeth. Unlike the case in Fig. 11, having access to another image of the subject at hand can yield enhanced results. Other tools utilized in this example are piecewise TV minimization, local variance reduction and local brightness reduction. (Panels: low-res input; ESRGAN; utilizing our editing tools; our edited result.)

Figure 16: The making of Fig. 12 (top). Exploring possible SR solutions to the LR input, by performing two different editing processes (scribbling spots vs. locally decreasing patch variance), yielding two different deer species. (Panels: low-res input; editing our output; spotted version; spotless version.)



Figure 17: The making of Fig. 12 (bottom). Exploring possible SR solutions to the LR input, by performing two different editing processes (imprinting stripes from an external image vs. locally decreasing patch variance), yielding two different animals. (Panels: low-res input; editing our output; young zebra; minor editing; gray horse.)

Figure 18: The making of Fig. 13 (top). Enhancing SR results by utilizing local TV minimization (green-cyan polygons), encouraging periodicity of building patterns and editing the distribution of patches, propagating desired patches from source (yellow) to target (red) regions. The corresponding ground truth HR image and the result by ESRGAN [26] are presented for comparison. (Panels: low-res input; ESRGAN; editing operations; our edited result; ground truth high-res.)

Figure 19: Imprinting across scales. Using our imprinting tool to produce the enhanced SR output presented in Fig. 13 (bottom). Most of the imprinted patterns in this example are taken from the same image, exploiting the cross-scale recurrence of patterns in natural images, as manifested in this simple case of a Snellen eye-sight chart. (Panels: low-res image; pre-edited image; external image; edited image; ESRGAN.)
