
ImagineNet: Restyling Apps Using Neural Style Transfer

Michael H. Fischer, Richard R. Yang, Monica S. Lam
Computer Science Department
Stanford University
Stanford, CA 94309

{mfischer, richard.yang, lam}@cs.stanford.edu

ABSTRACT
This paper presents ImagineNet, a tool that uses a novel neural style transfer model to enable end-users and app developers to restyle GUIs using an image of their choice. Former neural style transfer techniques are inadequate for this application because they produce GUIs that are illegible and hence non-functional. We propose a neural solution by adding a new loss term to the original formulation, which minimizes the squared error in the uncentered cross-covariance of features from different levels in a CNN between the style and output images. ImagineNet retains the details of GUIs, while transferring the colors and textures of the art. We presented GUIs restyled with ImagineNet as well as other style transfer techniques to 50 evaluators and all preferred those of ImagineNet. We show how ImagineNet can be used to restyle (1) the graphical assets of an app, (2) an app with user-supplied content, and (3) an app with dynamically generated GUIs.

Author Keywords
Style transfer; graphical user interfaces; automated design.

CCS Concepts
• Human-centered computing → User interface design; • Computing methodologies → Neural networks;

INTRODUCTION
While the majority of mobile apps have “one-size-fits-all” graphical user interfaces (GUIs), many users enjoy variety in their GUIs [39, 28]. Variety can be introduced by the app maker: for example, Candy Crush uses different styles at different stages of its game, and Line messenger allows changing the start screen, friends list, and chat screens. Variety can also be controlled by the users. Application launchers for Android allow users to choose from a wide collection of third-party themes that change common apps and icons to match the theme of choice. These stylizing approaches require hardcoding specific elements for a programming framework; they don't generalize and are brittle if the developers change their application.

Our goal is to develop a system that enables content creators or end users to restyle an app interface with the artwork of their choosing, effectively bringing the skill of renowned artists to their fingertips. We imagine a future where users will expect to see beautiful designs in every app they use, and to enjoy variety in design like they would with fashion today.

Gatys et al. [6] proposed neural style transfer as a method to turn images into art of a given style. Leveraging human creativity and artistic ability, the style transfer technique generates results that are richer, more beautiful, and more varied than the predictable outcomes produced with traditional image transforms such as blurring, sharpening, and color transforms. The insight behind the original style transfer algorithm is to leverage the pre-trained representation of images in deep convolutional neural networks such as VGG19 [36]. Using information from the activations of the input image and the style image in the neural network, the transfer algorithm can reconstruct an artistic image that shares the same content as the input but with the texture of the style image.

While the current style transfer model may suffice in creating remixed works of art, the boundaries of objects in the resulting image are no longer well defined, and the message in the original content can be distorted. For example, if it is applied to a GUI, we may no longer recognize the input and output elements. Previous research to regain object definition applies image processing techniques to the content image, such as edge detection [15] or segmentation [24], and has seen limited success. Figure ?? shows the result of applying the techniques of Gatys [6], WCT [17], AdaIN [9], Lapstyle [15], Linear [16], and ours to a simple Pizza Hut app and a TikTok photo. None of the previous techniques retains the boundaries of the buttons in the Pizza Hut app, let alone the labels on the buttons, rendering the display unsuitable as a GUI. In the street scene, the woman's face is severely distorted, and none of the captions can be read.

In the original style transfer, the color/texture is captured by the uncentered covariance of features within the same level of the VGG19 model [36]; to copy the style, the algorithm minimizes the squared error of the covariance between the style and the output images. The covariance of features represents the joint distribution of pairs of features, and it is not positional. We conjecture that since each level is optimized independently, the texture used for the same image position at different levels is chosen independently, and the lack of positional linkage renders the output incoherent. This leads us to think that the original style transfer does not fully model style.

We propose adding a new term to the modeling of style: the uncentered cross-covariance of features across layers, which links up the style elements chosen for each position across levels. We propose adding to style transfer the goal of minimizing the squared error of this term between the style and output images. Note that prior work to improve object definition focused on copying more information from the content image. We show empirically that doing so gives objects well-defined boundaries and better color coherence. Without doing anything special for text, text and other detailed objects such as logos become legible with this technique. We refer to the cross-covariance as structure. Figure ?? shows that our algorithm successfully copies the style while retaining the details: all the buttons in the Pizza Hut app are clearly visible and the labels are legible. Similarly, the details of the woman's face and the caption in the TikTok photo are preserved.

The contributions of this paper include:

1. ImagineNet: a new style transfer model that can copy the style of art onto images while retaining the details. We add to the traditional modeling of style a new structure component, which is computed as the uncentered cross-covariance of features across layers in a VGG neural network. By minimizing the squared loss of the structure component between the style image and the output image, the output has well-defined object boundaries and coherent color schemes, unlike the result of the original style transfer.

2. The improvements in ImagineNet make possible an important application for style transfer: improving the aesthetics of applications by restyling them with artwork. ImagineNet is the first style transfer algorithm that can be applied to GUIs without rendering them unusable. We demonstrate three ways to use style transfer on GUIs: (a) change the entire look-and-feel of an app by restyling its graphical assets; (b) restyle user-supplied content in an app; (c) restyle an app with dynamically generated content.

RELATED WORK
Style transfer is the computer vision task of extracting the style from a reference image and applying it to an input image. Classic style transfer techniques approach this problem by altering the input's image statistics to match those of the reference [32, 34]. Another type of approach, called image analogies, identifies the relationship between two reference images and applies that relationship to a separate input image to synthesize a new image [8, 35].

With the rising popularity of deep learning, Gatys et al. showed that the internal representations learned by a convolutional neural network (CNN) for image classification tasks can be used to encode the content and style characteristics of an image [6]. This influential work significantly advanced the field of style transfer in two ways. First, it showed that the texture and color style of an image can be encapsulated by computing the uncentered covariance of an image's feature maps extracted from a CNN. The feature map representations have even been used with the classic image analogy style transfer approach [20]. Second, it introduced an iterative optimization algorithm, called neural style transfer, to synthesize images that contain the content of one reference image and the style of another.

Figure 1: Diagram of the training process. At test time, only the image transformation network is used.

Both fronts have since been expanded upon to further improve neural style transfer algorithms. On the style representation front, Li and Wand showed that Markov Random Fields can be used as an alternative to the covariance matrix to capture style [14]. Furthermore, Li et al. showed that generating an image to match the covariance matrix of the style image's feature maps is actually equivalent to matching the statistical distribution of the style image's feature maps as computed by the maximum mean discrepancy function [19]. This work led to an alternate view of neural style transfer as a statistical feature transfer problem.

On the algorithmic front, several works expanded upon the original optimization-based neural style transfer. A drawback of the optimization approach is its speed: it cannot run in real time. Johnson et al. trained an image transformation neural network that can perform style transfer with a single forward pass of the input image through the network, allowing real-time stylization but at the cost that the network can only have one fixed style [10]. Subsequent works have created real-time arbitrary style transfer neural networks by using the statistical feature transfer formulation [9, 4, 17, 16]. While these networks are optimized for speed and without the limitation of a fixed style, improving the visual quality of the generated images is still an active area of research.

There have been numerous methods to improve on each of the aforementioned neural style transfer algorithms. Spatial information and semantic masks have been used to constrain the transfer of style to corresponding regions [3, 7]. For photorealistic applications of style transfer, several methods use local image statistics and scene information to prevent distortions in the generated image [24, 26, 40, 18]. Other works attempt to preserve lines and edges in the generated image guided by Laplacian operators [15] and depth maps of the reference and generated images [23]. ImagineNet is the only proposal that augments the original neural formulation with a new loss term with the goal of improving the definition of the output image. Furthermore, unlike other techniques, ours copies more information from the style image rather than the content image.

Page 3: ImagineNet: Restyling Apps Using Neural Style Transfer · Style transfer is the computer vision task of extracting the style from a reference image and applying it to an input image

THE STYLE TRANSFER MODEL
Our style transfer algorithm is based on the feed-forward image transformation network [10], as shown in Figure 1. Our system consists of two neural networks: the image transformation network, which takes an input image and applies a particular style to it, and a fixed loss network, used to extract features from the input image and the generated image for a loss function we describe later. The fixed loss network we use is VGG19 [37] pre-trained on the ImageNet dataset [33].
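
As a concrete illustration of the fixed loss network (a sketch of our own, not the authors' released code), an off-the-shelf VGG19 can be wrapped so that it returns the intermediate activations used by the losses below; in tf.keras the output of each blockN_convM layer is already post-ReLU, so it approximates the reluN_M activations referred to later in the paper.

```python
# Sketch of a fixed VGG19 loss network (assumes TensorFlow 2.x / tf.keras and ImageNet weights).
# Keras layer "blockN_convM" outputs include the ReLU, so they stand in for "reluN_M".
import tensorflow as tf

FEATURE_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1"]

def build_loss_network(layer_names=FEATURE_LAYERS):
    vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
    vgg.trainable = False  # the loss network stays fixed throughout training
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)

# Usage: feats is a list of feature maps, one per requested layer.
loss_net = build_loss_network()
image = tf.random.uniform((1, 256, 256, 3), maxval=255.0)
feats = loss_net(tf.keras.applications.vgg19.preprocess_input(image))
```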

The image transformation network T is a convolutional encoder-decoder network with residual connections [31, 29]. For a specific style reference image s and an input content reference image x, the network produces an output image y in a single forward pass. We train each transformation network with one pre-selected reference style image, using images from the entire COCO dataset [22] as reference content images. The network minimizes the following loss function with respect to the input reference images s, x and the generated image y:

$$L = \lambda_1 L_{\text{CONTENT}}(x,y) + \lambda_2 L_{\text{TEXTURE}}(s,y) + \lambda_3 L_{\text{STRUCTURE}}(s,y) + \lambda_4 L_{\text{TV}}(y)$$

The formula above is identical to the traditional formulation, except for the introduction of LSTRUCTURE, the structure loss term, which we will discuss last. LCONTENT copies the content from the input image, LTEXTURE copies the color and texture of the style image, and LTV is the total variation regularization [25], added to the loss function to ensure smoothness in the generated image [6, 10].
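
As a reference point, the total variation term can be written in a few lines; this is a minimal NumPy sketch of our own, assuming the common squared-difference form (the paper follows [25] for the exact formulation).

```python
# Minimal sketch of total variation regularization on an H x W x C image
# (assumption: the squared-difference variant; see [25] for the formulation used).
import numpy as np

def tv_loss(y):
    dh = y[1:, :, :] - y[:-1, :, :]  # differences between vertically adjacent pixels
    dw = y[:, 1:, :] - y[:, :-1, :]  # differences between horizontally adjacent pixels
    return np.sum(dh ** 2) + np.sum(dw ** 2)

y = np.random.rand(256, 256, 3)
print(tv_loss(y))  # smoother images yield smaller values
```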

Content Component
The feature maps in layer $\ell$ of VGG19 for image x are denoted by $F^{\ell}(x) \in \mathbb{R}^{C_\ell \times N_\ell}$, where $C_\ell$ is the number of channels in layer $\ell$ and $N_\ell$ is the height times the width of the feature map in layer $\ell$. $F^{\ell}_{i,j}(x)$ is the activation of the i-th channel at position j in layer $\ell$ for input image x. To ensure that the generated image y retains the content of the input image x, the content loss function LCONTENT(x,y) minimizes the squared error between the feature maps extracted from the high-level layers of the VGG19 loss network for x and y. That is,

$$L_{\text{CONTENT}}(x,y) = \sum_{i=1}^{C_\ell} \sum_{j=1}^{N_\ell} \left( F^{\ell}_{i,j}(x) - F^{\ell}_{i,j}(y) \right)^2$$
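
The content loss is a direct sum of squared differences; a small NumPy sketch, assuming the feature maps are already flattened to shape $C_\ell \times N_\ell$ as in the notation above:

```python
# Sketch of the content loss over one layer's feature maps, shaped (C, N) as in the text.
import numpy as np

def content_loss(F_x, F_y):
    # squared error between the feature maps of content image x and generated image y
    return np.sum((F_x - F_y) ** 2)

C, N = 512, 32 * 32         # illustrative relu4_1-sized feature map
F_x = np.random.rand(C, N)  # features of the content image
F_y = np.random.rand(C, N)  # features of the generated image
print(content_loss(F_x, F_y))
```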

Texture Component
Gatys et al. propose that the colors and textures of an image can be captured by the uncentered covariance of features in each layer: $G^{\ell}(x) \in \mathbb{R}^{C_\ell \times C_\ell}$, where $C_\ell$ is the number of channels in layer $\ell$:

$$G^{\ell}_{i,j}(x) = \sum_{k=1}^{N_\ell} F^{\ell}_{i,k}(x) \, F^{\ell}_{j,k}(x)$$

$G^{\ell}_{i,j}$ is the joint probability of feature i colocating with feature j in layer $\ell$. Gatys et al. show that layers with similar texture and colors have similar covariance. To copy the style, Gatys et al. proposed to minimize the squared error between the covariance of the style and output images in the low levels:

$$L_{\text{TEXTURE}}(s,y) = \sum_{\ell \in \text{layers}} \beta_\ell \sum_{i=1}^{C_\ell} \sum_{j=1}^{C_\ell} \left( G^{\ell}_{i,j}(s) - G^{\ell}_{i,j}(y) \right)^2$$

As the number of channels increases in higher levels, so do the dimensions of the covariance. The dimensions of the covariance matrix are 64×64 for layers 1 and 2, 128×128 for layers 3 and 4, and 512×512 for layers 4, 5, 6, and 7.
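
The uncentered covariance (the Gram matrix) and the texture loss follow directly from the formulas above; a NumPy sketch with illustrative layer sizes and weights:

```python
# Sketch of the texture loss: uncentered covariance per layer, summed over chosen layers.
import numpy as np

def gram(F):        # F has shape (C, N)
    return F @ F.T  # G[i, j] = sum_k F[i, k] * F[j, k], shape (C, C)

def texture_loss(style_feats, out_feats, betas):
    # style_feats / out_feats: lists of (C_l, N_l) arrays, one per chosen layer
    return sum(b * np.sum((gram(Fs) - gram(Fy)) ** 2)
               for b, Fs, Fy in zip(betas, style_feats, out_feats))

layers = [(64, 64 * 64), (128, 32 * 32)]  # illustrative (C_l, N_l) pairs
style_feats = [np.random.rand(c, n) for c, n in layers]
out_feats = [np.random.rand(c, n) for c, n in layers]
print(texture_loss(style_feats, out_feats, betas=[1.1, 1.3]))
```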

We see from Figure 1 that the style image has well-defined shapes and boundaries, yet the original style transfer generates images that lack definition. The object boundaries are no longer sharp, and the colors and textures are no longer coherent within an object.

Texture and color are captured as the covariance of features in the same level, and this information is not positional. It is possible to generate many output images that have equivalent differences in covariance with the style image. The choices made in each layer to minimize style loss are independent of each other. Content loss provides the only cross-layer constraint in the original formulation. The content representation at the high layers of the convolutional neural network, however, is coarse, since the convolution filters at the upper layers cover a large spatial area. The image transformer has the freedom to choose textures as long as the objects are still recognizable at coarse granularity. Boundaries are lost in this representation because the same object can be recognized whether an edge is straight, blurry, or jagged, or whether a line is drawn with different colors.

Structure Component
We introduce the concept of the uncentered cross-covariance of features across layers, referred to as structure, to model the correlation of texture across layers. Our objective is to further constrain the output image so that it has a similar cross-covariance to the style image.

Lower-level layers have fewer channels but more data points than higher-level layers. To compute the cross-covariance, we give the feature maps of different levels the same shape by upsampling the lower-level feature maps along the channel dimension and the higher-level feature maps along the spatial dimension, duplicating values. The upsampled feature map of layer $\ell_1$ aligned to layer $\ell_2$ for image x is denoted by $F^{\ell_1,\ell_2}(x) \in \mathbb{R}^{\max(C_{\ell_1},C_{\ell_2}) \times \max(N_{\ell_1},N_{\ell_2})}$.

The uncentered cross-covariance of feature maps between layers $\ell_1$ and $\ell_2$, $\ell_1 < \ell_2$, is denoted by $G^{\ell_1,\ell_2}(x) \in \mathbb{R}^{C_{\ell_2} \times C_{\ell_2}}$, where $C_{\ell_2}$ is the number of channels in layer $\ell_2$. The cross-covariance computes the joint probability of a pair of features belonging to different layers:

$$G^{\ell_1,\ell_2}_{i,j}(x) = \sum_{k=1}^{N_{\ell_1}} F^{\ell_1,\ell_2}_{i,k}(x) \, F^{\ell_2,\ell_1}_{j,k}(x)$$

To copy the covariance to the output image, we introduce a structure loss function to minimize the squared error between the cross-covariance of the style and output images:

$$L_{\text{STRUCTURE}}(s,y) = \sum_{\ell_1,\ell_2 \in \text{layers}} \gamma_{\ell_1,\ell_2} \sum_{i=1}^{C_{\ell_2}} \sum_{j=1}^{C_{\ell_2}} \left( G^{\ell_1,\ell_2}_{i,j}(s) - G^{\ell_1,\ell_2}_{i,j}(y) \right)^2$$

where $\ell_1 < \ell_2$ and $\gamma_{\ell_1,\ell_2}$ is the loss weight factor. The structure loss term constrains the choice of textures across layers in the output image to match the correlation between layers in the style image, thus copying the style more faithfully.
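
A NumPy sketch of the structure term, assuming integer upsampling factors and simple value duplication as described above (the lower layer is duplicated along channels, the higher layer along spatial positions, so both reach the max shape):

```python
# Sketch of the structure loss: cross-covariance of features across a pair of layers.
# Assumes integer upsampling factors and value duplication, as described in the text.
import numpy as np

def align(F_lo, F_hi):
    # F_lo: (C1, N1) from the lower layer; F_hi: (C2, N2) from the higher layer, C1 <= C2, N2 <= N1
    C1, N1 = F_lo.shape
    C2, N2 = F_hi.shape
    F_lo_up = np.repeat(F_lo, C2 // C1, axis=0)  # duplicate channels of the lower layer
    F_hi_up = np.repeat(F_hi, N1 // N2, axis=1)  # duplicate spatial positions of the higher layer
    return F_lo_up, F_hi_up                      # both now shaped (C2, N1)

def cross_gram(F_lo, F_hi):
    A, B = align(F_lo, F_hi)
    return A @ B.T                               # (C2, C2) cross-covariance

def structure_loss(style_pairs, out_pairs, gammas):
    # each element of *_pairs is an (F_lo, F_hi) tuple for one (l1, l2) layer pair
    return sum(g * np.sum((cross_gram(*s) - cross_gram(*o)) ** 2)
               for g, s, o in zip(gammas, style_pairs, out_pairs))

# Illustrative relu2_1- and relu3_1-sized feature maps for the style and output images.
F2_s, F3_s = np.random.rand(128, 32 * 32), np.random.rand(256, 16 * 16)
F2_y, F3_y = np.random.rand(128, 32 * 32), np.random.rand(256, 16 * 16)
print(structure_loss([(F2_s, F3_s)], [(F2_y, F3_y)], gammas=[1.5]))
```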

Figure 2: An ablation example on structure loss: a TikTok picture restyled with no structure, half structure, and full structure. The structure and legibility improve as the structure loss term increases.

As an example, Figure 2 shows the effect of adding the cross-covariance loss term to style transfer by applying La Muse to a TikTok photo. Without cross-covariance, the result is similar to that from Johnson et al. [10]: it is hard to pick out the women in the photo. With cross-covariance, the people and even the net overhead are easily visible; the image has interesting colors and textures; the words in the image are legible. When we halve the weight of the cross-covariance error in the loss function, we see an intermediate result where the structure starts to emerge, hence we coined this component “structure”.

Ensuring Target Image Clarity
To understand how edges are learned and represented in a CNN, we visualize the neuron activation maps of VGG19 to identify the layers of neurons salient to text in an image. Neuron activation maps are plots of the activations during the forward propagation of an image. We sample random images from ImageNet classes and overlay objects on them. We then forward-pass each combined image through VGG19 to obtain the activations, and identify the layers with the highest activations with respect to the outlines of objects.

VGG has 5 convolution groups, each of which consists of 2 to 4 convolution-and-ReLU pairs, followed by a max pooling layer. A ReLU layer is numbered with two indices: the first refers to the group, and the second to its position within the group; so relu3_1 refers to the first ReLU in group 3.

From our visualizations and analysis, we observe that neurons in the “relu1_1” and “relu3_1” layers of VGG carry the most salient representations of image outlines. Earlier layers in image classification neural networks tend to be edge and blob detectors, while later layers are texture and part detectors [13]. These results suggest that sufficient weight must be placed on these layers so that the outline information is carried to the output image.

Figure 3: Layers “relu1_1” and “relu3_1” in the VGG19 network are highly responsive to text within images.

LCONTENT: relu4_1 (5.6)
LTEXTURE: relu1_1 (1.1), relu2_1 (1.3), relu3_1 (0.5), relu4_1 (1.0)
LSTRUCTURE: relu1_1 × relu1_2 (1.5), relu1_1 × relu2_1 (1.5), relu2_1 × relu2_2 (1.5), relu2_1 × relu3_1 (1.5)
LTV: 150

Table 1: Weights for the loss functions.

Functionality and Aesthetics Tradeoff
We tuned the hyperparameters of the loss functions through a binary search. By evaluating the text legibility, edge distinctions, and distortions in the output obtained from 20 diverse pairs of content and style images, we derived the parameters shown in Table 1. As shown in the table, the content loss term is calculated on the feature maps in group 4.

When too much weight is placed on the lower VGG convolution layers in the loss function, only color and smaller elements of style are transferred over, and the results look flat. For this reason, we increase the loss weights in convolution groups 3 and 4. However, we then observe that lines begin to blur in the results. Finally, we increase the weight of the structure loss and arrive at a model that transfers global style and maintains text clarity. To determine the total variation loss weight, we iteratively decrease the weight until deconvolution artifacts, such as the checkerboard effect, begin to appear.
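
The weights in Table 1 can be kept in a small configuration and combined into the total loss; a sketch with illustrative helper names (not the released code), where the per-term losses are passed in unweighted:

```python
# Sketch of assembling the total loss from the per-layer weights of Table 1.
LOSS_WEIGHTS = {
    "content":   {"relu4_1": 5.6},
    "texture":   {"relu1_1": 1.1, "relu2_1": 1.3, "relu3_1": 0.5, "relu4_1": 1.0},
    "structure": {("relu1_1", "relu1_2"): 1.5, ("relu1_1", "relu2_1"): 1.5,
                  ("relu2_1", "relu2_2"): 1.5, ("relu2_1", "relu3_1"): 1.5},
    "tv": 150.0,
}

def total_loss(per_layer_losses, weights=LOSS_WEIGHTS):
    # per_layer_losses mirrors LOSS_WEIGHTS but holds unweighted loss values
    total = weights["tv"] * per_layer_losses["tv"]
    for term in ("content", "texture", "structure"):
        for key, w in weights[term].items():
            total += w * per_layer_losses[term][key]
    return total

# Usage with dummy unit losses, just to show the wiring.
dummy = {term: {k: 1.0 for k in LOSS_WEIGHTS[term]} for term in ("content", "texture", "structure")}
dummy["tv"] = 1.0
print(total_loss(dummy))
```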

HOW TO RESTYLE APPS
In this section, we show how style transfer can be applied to three kinds of mobile apps: apps that use a static set of graphical assets, apps that accept user graphical contributions, and finally apps that generate interfaces dynamically.

Table 2: Example of how ImagineNet can be used to style assets in the Memory Game app to create a fully functional app.

Graphical Asset Restyling
The look-and-feel of many applications is defined by a set of graphical assets stored in separate files. For such apps, we can generate a new visual theme for the app by simply replacing the set of assets with those restyled according to a given style.

Apps for mobile devices typically separate out their graphics into a folder called “assets”. The assets folder contains sets of similar images for different resolutions, themes, or geographic localities. Text, such as the current time or the number of points a user scored, is dynamically generated and overlaid on top of the images. Thus, we can create a fully functional app with a different style by simply replacing the files in the “assets” folder with restyled images. Since assets are dynamically loaded as the app runs, this process does not require any functionality changes in the app source code.

To restyle an app, we first apply ImagineNet to the assets for a chosen style image. Next, we clean up the borders in a post-processing step. Unless a graphical asset is rectangular, it carries a transparency mask to indicate its border. We copy the same transparency mask over to the restyled image. This is necessary because the convolution-based neural network has padding and stride parameters that generate additional pixels outside the original pixels.
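
A sketch of this border clean-up with Pillow; stylize() is a placeholder for a trained ImagineNet transformer and the file paths are illustrative:

```python
# Sketch of restyling one graphical asset and copying back its transparency mask.
# stylize() is a hypothetical stand-in for a trained image transformation network.
from PIL import Image

def stylize(rgb_image):
    return rgb_image  # placeholder: a trained transformer would restyle the image here

def restyle_asset(in_path, out_path):
    original = Image.open(in_path).convert("RGBA")
    alpha = original.getchannel("A")           # keep the original transparency mask
    styled = stylize(original.convert("RGB"))  # restyle only the color channels
    styled = styled.convert("RGBA")
    styled.putalpha(alpha)                     # reapply the mask to trim padded borders
    styled.save(out_path)

# restyle_asset("assets/card_back.png", "assets_restyled/card_back.png")  # illustrative paths
```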

To demonstrate this process, we use an open-source Android app called Memory Game [11]. We unpack the APK, extract the assets, apply style transfer to all assets, and then re-sign the app. Screenshots of this style-transferred app are shown in Table 2. Here we show the results of two automatically generated, fully functional versions of Memory Game with different look-and-feel. ImagineNet reproduces the styles of the artwork and yet retains the full details of the images, making the game perfectly playable.

User-Supplied Content
Consumers have shown a lot of interest in apps that let them personalize their content; examples include applying filters in Instagram to enhance their photos and using artistic style transfer in Prisma and DeepArt. As demonstrated in Figure 2, our style transfer can faithfully transfer the style of the art while preserving the message of the original content. Here we illustrate how we can take advantage of this property to provide personalization, using launchers as an example.


Figure 4: Style transfer applied to the iOS home screen. We show, from left to right, a given style image, a user-supplied photo, a re-styled photo given the style, the original home screen, and the new theme with re-styled icons and the re-styled photo as a background. Best viewed digitally in full resolution.

A launcher is a mobile app that changes the app icons and the background to a different theme. Launchers are popular among mobile phone users; over a million different themes are available, many for purchase. Style transfer offers an automatic technique to generate new themes based on masterpiece artwork. As an automatic procedure, it can also support user-provided content. In particular, mobile users often want to use personal photos as the background of their phones; we restyle their background in the same artistic style as the icon theme to create a stylistically consistent home screen experience.

The algorithm consists of two parts. The first statically restyles the icons, using the same algorithm as described above. The second restyles a user-supplied image for the background. The time and cost of restyling an image are negligible, as discussed in the Speed of ImagineNet section. The installer can then accept these assets and draw the home screen accordingly.

We show two examples of this process in Figure 4. We show the results of applying a given style to the background, then the result of the icons on the background. The results show that the restyled backgrounds retain the message of the original images. Moreover, the restyled icons match the backgrounds. We reduce the opacity of the background pictures to make the icons and the names of the apps more visible.

Dynamically Restyling an App
Some apps generate graphical displays dynamically. For example, a map app needs to present different views according to the user's input on the location as well as the zoom factor. Such dynamic content needs to be restyled as it is generated. Here we use Brassau, a graphical virtual assistant [5], as an extreme case study, since it generates its GUI dynamically and automatically.

Brassau accepts user natural language input, drawn from an extensible library of virtual assistant commands, and automatically generates a widget. For example, a command such as “adjust the volume of my speakers” would generate a widget that lets users control the volume with a slider. Brassau has a database of GUI templates that it matches to commands users issue to a virtual assistant. However, the templates in the database come from a variety of designers, and the styles across templates are inconsistent. As a result, the widgets generated, as shown in Figure 5, are inconsistent in style. A developer of Brassau, not an author of this paper, used ImagineNet to give its dynamically generated widgets a uniform style according to the user's preference.

To support interactivity, the developer separates Brassau's GUI into three layers. The bottom layer contains the original display, including the buttons and background templates. The middle layer, hiding the bottom layer, is the stylized image, obtained by rasterizing the bottom layer with RasterizeHTML and stylizing it via the ImagineNet server. The top layer has user inputs and outputs, providing real-time GUI interaction. Button clicks are handled by the bottom layer, as the top two layers are made transparent to clicks. It took the developer 8 hours to add ImagineNet to Brassau, by specifying the “absolute positioning” and “z-index” CSS properties of the user interface elements.

In a pilot study with 8 participants, 4 male and 4 female, 6 preferred the stylized application over the original, 1 was indifferent, and 1 liked it less. Users who liked the styled app said that they liked the consistency in style across apps, with one commenting, “now it looks like it was all designed by one person instead of a group”. They all cited legibility as the highest priority in their choice of styles.

COMPARISON OF STYLE TRANSFER TECHNIQUES
In this section, we discuss how ImagineNet compares with previous style transfer algorithms on photorealistic images and on GUIs.

Photorealistic Style Transfer
From our literature review, we did not find any other work attempting style transfer on GUIs. The closest prior work is style transfer on photorealistic images, where the generated images are expected to look like natural, realistic images. This work is the most relevant to ours because of the expectation to maintain lines and edges without distortions in the final image.

Table 3 shows the results of the original style transfer algorithm [6], semantic-based style transfer [6], Laplacian-based style transfer [15], and ours on three pairs of photos and style images.


Table 3: Photorealistic style transfer using, from left to right, the original style transfer algorithm, semantic-based style transfer, Laplacian-based style transfer, and ours.

Figure 5: ImagineNet applied to Brassau, which generates widgets from natural language commands (column 1); ImagineNet restyles them, in this case using Lichtenstein's Bicentennial Print (column 2).

We chose these works for comparison since the authors specifically target the photorealistic domain.

The first is a typical example used in photorealistic style transfer, where the style of Van Gogh's Starry Night is copied to a photo of the Golden Gate Bridge. ImagineNet produces the only output image that shows the bridge cables clearly, demonstrating that a purely neural algorithm is more robust than using the Laplacian for edge detection. Additionally, ImagineNet is not prone to segmentation failures, which are observed in the semantic-based result. However, ImagineNet copies mainly the color scheme, as Van Gogh's signature style is in conflict with the priority of preserving details.

In the second example, we choose a real-world image with reflections as a challenging content image and pick as the style Lichtenstein's Bicentennial Print, which has distinct lines, in contrast to Van Gogh's Starry Night. A good style transfer will copy the clean lines from the style to this photo. It can be observed that the original and Laplacian-based algorithms fail to contain the style within the boundaries of the countertop. While the semantic-based model is able to maintain the boundaries, it only transfers a small amount of style to a corner of the image, due to a failed segmentation of the content image. Ours, on the other hand, has clean straight lines for the kitchen table and counters, and also copies the style to the tabletop well.

In the third example, we use source images from Li et al. [15] for comparison. ImagineNet is the only technique that does not disfigure the girl's face or hair, while faithfully changing her hair to match that of the style image. This example illustrates that our algorithm adapts to the structure of images and does not produce overly styled images when the inputs are simple.

Comparison with Style Transfers on GUIs
We applied former style transfer techniques to all the input pairs shown in Figure 4, and none succeeded in preserving the details of a GUI. Due to space constraints, we show only two representative results in Figure ??.


Figure 6: Applying different style transfers to the Memory Game; only ImagineNet can make the words legible.

La Muse [30], Pablo Picasso, 1935: Cubism, Surrealism. Bold, colorful, whimsical.
La danse, Bacchante [27], Jean Metzinger, 1906: Divisionism. Distinct mosaic pattern, pastel colors with blue borders around the main figure.
Les régates [12], Charles Lapicque, 1898: Fauvism, Tachisme. Colorful, with small contiguous areas with white borders.
Spiral Composition [2], Alexander Calder, 1970: Abstract. A strong yellow area, complemented with blue, red, and white concentric circles with black radials.
Bicentennial Print [21], Roy Lichtenstein, 1975: Abstract Expressionism. Strong geometric shapes with yellow highlights.
A Lover [38], Toyen, unknown: Expressionism. Simple design and nearly monochromatic.

Table 4: Description of styles used in Table 5.

When we apply “La danse, Bacchante” to the Pizza Hut app, we see that ImagineNet is the only technique that keeps the “Pizza Hut” logo recognizable; it transfers the mosaic texture properly to the background; the buttons are well defined; and the letters on the buttons are legible. When we apply “La Muse” to a TikTok picture, the woman's face is distorted by all the other style transfers; none of the writing can be read; and none of the people in the background are recognizable. ImagineNet carries all the details, while adding texture wherever possible without ruining the message, as shown by the lines on the ground and the man's pants.

We also applied the former techniques to the Memory Game to understand whether they can be used for coloring assets. We use the Bicentennial Print as the input style, as its clean style makes it the easiest to transfer. Figure 6 shows that none of the games styled by the other techniques are playable. The best result is obtained by Linear [16], but the words are not legible.

Figure 7: Examples of illegible GUIs generated by ImagineNet. The style images, shown below the GUIs, lack the structural elements needed to make letters legible.

EXPERIMENTATION WITH IMAGINENET

Results of ImagineNet on Apps
Here we show the results of applying ImagineNet to a wide range of apps. As we do not have access to their sources, we simply show the effect of applying ImagineNet to images captured on the mobile devices. In Table 5, the selected apps are shown in the left column; the styles are shown along the top. ImagineNet shows detailed information clearly, while still showing the style of the original artwork. ImagineNet handles the high information density across different kinds of GUI elements: text, icons, country boundaries, and gaming assets. Note how the music keys of the last GUI are all restyled the same way, providing coherence across the whole app.

Quality of Results Study
We conduct a user study to assess the quality of ImagineNet results. We present GUIs generated by the style transfer algorithms of Gatys et al., Johnson et al., Luan et al., and Li et al., along with ones generated by ImagineNet, to 31 female and 19 male Mechanical Turk evaluators in the U.S. All evaluators selected the GUIs generated by ImagineNet as their preferred GUI. To assess whether ImagineNet maintains text legibility, we present evaluators with 6 styled GUIs and ask them to transcribe all text in the GUI. All evaluators retyped the text with 100% accuracy. In addition, to show how faithfully the style is maintained in the new image, we show evaluators 6 style images and 6 stylized GUIs, and have them match the two sets. Evaluators matched the style to the transferred GUI with 79% accuracy.

Style Image Selection Study
To understand whether artwork can be used on apps in general, we conduct a user study on the results of applying ImagineNet to a random set of artwork and app images. We train ImagineNet on 96 randomly selected images, drawn from different style classifications in the Painter by Numbers dataset [1]. We apply these styles to randomly sampled app images to create 4800 styled images. We show each of 48 Mechanical Turk evaluators 180 randomly sampled styled GUIs, and ask them to rate the design on a Likert scale from 1 (bad) to 5 (good). The average rating of the 96 styles ranges from 1.6 to 4.5, with only 14 rated below 3; the remaining 82 paintings are distributed uniformly between 3 and 4.5. The standard deviation of the ratings received for each style ranges from 0.5 to 1.4, with an average of 1.1, suggesting that there is general agreement on whether a style is good, and that this depends less on the content to which it is applied.


Table 5: Results of transferring the style of the paintings from Table 4 to GUIs. The details can best be viewed on a computer.


Figure 8: While training, a user can cancel training within one minute if the generated result does not look promising. After 5 minutes, 1 hour, and 10 hours, the style becomes clear.

We found that styles classified as Abstract Expressionism, High Renaissance, Modernismo, and Pop Art transfer well due to the clear edges and vibrant colors generally present in these paintings. On the other hand, Action Painting, Lettrism, Kinetic Art, and Purism did not transfer well.

Not all styles are suitable for restyling apps. Since ImagineNet transfers the structure component of style, if the style image has no clear object borders anywhere, it will not be able to produce a crisp output. Styles with faded colors, with fewer than three colors, or containing little texture were found not to work well. Figure 7 shows representative styles that do not transfer well to GUIs: the style transfers well, but the text is illegible because the style image lacks the structural elements needed for lettering. This result suggests that our structure component should indeed be included as a component of style.

Speed of ImagineNet
All speeds reported are measured on a computer with a 6-core 3.50 GHz processor and 48 GB of RAM, with the following GPUs: an Nvidia Titan V 12GB, an Nvidia Titan X 12GB, and an Nvidia GTX 980Ti 6GB. The code is written in Python 3.5 and TensorFlow 1.5 using CUDA V9.0 and cuDNN V6.0.

For the best qualitative results, we train with a batch size of 1 for 10 hours, but a preliminary approximation can be obtained in 1 minute, as shown in Figure 8. Once the image transformer is trained for a style, the image transformation time depends only on the size of the input image. We test the run time for different sizes of color GUIs. On average over five runs, a 512×288 image takes 3.02 ms, while a 1920×1080 image takes 104.02 ms. On average, the transformation took 0.026 µs/pixel.

LIMITATIONS
By prioritizing legibility, ImagineNet cannot generate images as artistic as those produced by prior techniques. It is possible to adjust the weight of the structure loss term to strike a different balance between aesthetics and functionality.

Many algorithms have been proposed recently to speed up style transfer [17, 16]. Since the structure loss term is computed across layers, ImagineNet is not immediately amenable to such optimization techniques. The algorithm as is cannot be run in real time on mobile phones. More work is required to optimize ImagineNet in the future.

CONCLUSION
We propose ImagineNet, a neural network that transfers the style of an image to a content image while retaining the details of the content. While other attempts to retain details in the content add processing steps to the content, our model copies more information from the style. We show that the original formulation of style is missing a structure component, computed as the uncentered cross-covariance between features across different layers of a CNN. By minimizing the squared error in the structure between the style and output images, ImagineNet retains structure while generating results with texture in the background, shadow and contrast in the borders, consistency of design across edges, and an overall cohesiveness in the design.

We show how ImagineNet can be used to restyle a variety of apps, from those using a static set of assets, to those incorporating user content, to those generating the graphical user interface on the fly. By transferring well-chosen artwork to mobile apps, we can obtain beautiful results that look very different from what is commercially available today.

ACKNOWLEDGMENTS
This work is supported in part by the National Science Foundation under Grant No. 1900638 and the Stanford MobiSocial Laboratory, sponsored by AVG, Google, HTC, Hitachi, ING Direct, Nokia, Samsung, Sony Ericsson, and UST Global.

REFERENCES
[1] Painter by Numbers Dataset. https://www.kaggle.com/c/painter-by-numbers.

[2] Alexander Calder. 1970. Spiral Composition. https://www.wikiart.org/en/alexander-calder/spiral-composition-1970.

[3] Alex J Champandard. 2016. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768 (2016).

[4] Tian Qi Chen and Mark Schmidt. 2016. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337 (2016).

[5] Michael Fischer, Giovanni Campagna, Silei Xu, and Monica S. Lam. 2018. Brassau: Automatically Generating Graphical User Interfaces for Virtual Assistants. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI 2018). DOI: http://dx.doi.org/10.1145/3229434.3229481


[6] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A Neural Algorithm of Artistic Style. (2015).

[7] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. 2017. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3985–3993.

[8] Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. 2001. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 327–340.

[9] Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision. 1501–1510.

[10] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. (2016).

[11] Roman Kushnarenko. 2019. Memory Game. https://github.com/sromku/memory-game.

[12] Charles Lapicque. 1898. Les régates. https://www.wikiart.org/en/charles-lapicque/les-r-gates.

[13] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. 2009. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, 609–616. DOI: http://dx.doi.org/10.1145/1553374.1553453

[14] Chuan Li and Michael Wand. 2016. Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2479–2486.

[15] Shaohua Li, Xinxing Xu, Liqiang Nie, and Tat-Seng Chua. 2017b. Laplacian-steered neural style transfer. In Proceedings of the 25th ACM International Conference on Multimedia. ACM, 1716–1724.

[16] Xueting Li, Sifei Liu, Jan Kautz, and Ming-Hsuan Yang. 2018a. Learning linear transformations for fast arbitrary style transfer. arXiv preprint arXiv:1808.04537 (2018).

[17] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2017. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems. 386–396.

[18] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. 2018b. A closed-form solution to photorealistic image stylization. In Proceedings of the European Conference on Computer Vision (ECCV). 453–468.

[19] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. 2017a. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036 (2017).

[20] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. 2017. Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088 (2017).

[21] Roy Lichtenstein. 1975. Bicentennial Print. https://artsandculture.google.com/asset/bicentennial-print/TAHrv5B7P6vh1A.

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, 740–755.

[23] Xiao-Chang Liu, Ming-Ming Cheng, Yu-Kun Lai, and Paul L Rosin. 2017. Depth-aware neural style transfer. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering. ACM, 4.

[24] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. 2017. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4990–4998.

[25] Aravindh Mahendran and Andrea Vedaldi. 2015. Understanding deep image representations by inverting them. (2015).

[26] Roey Mechrez, Eli Shechtman, and Lihi Zelnik-Manor. 2017. Photorealistic style transfer with screened poisson equation. arXiv preprint arXiv:1709.09828 (2017).

[27] Jean Metzinger. 1906. Bacchante. https://www.wikiart.org/en/jean-metzinger/bacchante-1906.

[28] Michael I Norton, Daniel Mochon, and Dan Ariely. 2012. The IKEA effect: When labor leads to love. Journal of Consumer Psychology 22, 3 (2012), 453–460.

[29] Augustus Odena, Vincent Dumoulin, and Chris Olah. 2016. Deconvolution and Checkerboard Artifacts. Distill (2016). DOI: http://dx.doi.org/10.23915/distill.00003

[30] Pablo Picasso. 1935. A Muse. https://www.wikiart.org/en/pablo-picasso/a-muse-1935.

[31] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

[32] Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. 2001. Color transfer between images. IEEE Computer Graphics and Applications 21, 5 (2001), 34–41.

[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.


[34] YiChang Shih, Sylvain Paris, Connelly Barnes, William T Freeman, and Frédo Durand. 2014. Style transfer for headshot portraits. ACM Transactions on Graphics (TOG) 33, 4 (2014), 148.

[35] Yichang Shih, Sylvain Paris, Frédo Durand, and William T Freeman. 2013. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG) 32, 6 (2013), 200.

[36] Karen Simonyan and Andrew Zisserman. 2014a. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[37] Karen Simonyan and Andrew Zisserman. 2014b. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556

[38] Toyen. Unknown. A Lover. https://www.wikiart.org/en/toyen/a-lover.

[39] Daniel Weld, Corin Anderson, Pedro Domingos, Oren Etzioni, Krzysztof Z Gajos, Tessa Lau, and Steve Wolfman. 2003. Automatically personalizing user interfaces. In IJCAI'03 Proceedings of the 18th International Joint Conference on Artificial Intelligence. ACM.

[40] Richard R Yang. 2019. Multi-Stage Optimization for Photorealistic Neural Style Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.