
Salient deconvolutional networks - University of Oxford (vedaldi/assets/pubs/mahendran16salient.pdf)

Salient deconvolutional networks

Aravindh Mahendran and Andrea Vedaldi

University of Oxford

Abstract. Deconvolutional networks are a popular architecture. However, when used to reverse an existing model for visualization purposes, the process involves a few heuristic choices that cannot be immediately interpreted. In this paper, we introduce a family of reversed networks which includes deconvolution, backpropagation and network saliency as special but important cases. We use this construction to thoroughly investigate the different architectures and compare them in terms of quality of the produced images, of what information is reflected by those images, and of what architectural choices are important in determining these properties. We also show an application to foreground object segmentation, obtaining near state-of-the-art results despite using weakly-supervised learning.

Keywords: DeConvNets, Deep Convolutional Neural Networks, Saliency, Segmentation

1 Introduction

Despite the success of modern Convolutional Neural Networks (CNNs), there is a limited understanding of how these complex black-box models achieve their performance. Methods such as deconvolutional networks (DeConvNets) have been proposed to visualize image patterns that strongly activate any given neuron in a CNN [24] and therefore shed some light on the CNN structure. However, the DeConvNet construction is partially heuristic and so are the corresponding visualizations. Simonyan et al. [15] noted similarities with their network saliency method, which partially explains DeConvNets, but this interpretation remains incomplete.

This paper carries out a novel and systematic analysis of DeConvNets and closely related visualization methods such as network saliency. Our first contribution is to extend DeConvNet to a general method for architecture reversal and visualization. In this construction, the reversed layers use selected information extracted by the forward network, which we call bottleneck information (Sect. 2). We show that backpropagation is a special case of this construction which yields a reversed architecture, SaliNet, equivalent to the network saliency method of Simonyan et al. (Sect. 2.1). We also show that the only difference between DeConvNet and SaliNet is a seemingly innocuous change in the reversal of Rectified Linear Units (ReLU; Sect. 2.2). However, this change has a very significant effect on the results: the SaliNet response is well localized but lacks structure, whereas the DeConvNet response accurately reproduces the image boundaries and object shapes, but is less localized (Fig. 1). We also show that the two methods can be combined in order to simultaneously obtain structure and localization (DeSaliNet). DeSaliNet is also similar to results recently obtained by [16].

We then move to the important question of whether deconvolutional architectures


Fig. 1: From top row to bottom: original image, DeConvNet, SaliNet and our DeSaliNet visualizations from the fc8 layer in AlexNet (just before the softmax operation). The maximally active neuron is visualized in each case. DeSaliNet results in crisper visualizations: they suppress the background while preserving edge information. Best viewed on screen.

are useful for visualizing neurons. Our answer is partially negative, as we find that the output of reversed architectures is mainly determined by the bottleneck information rather than by which neuron is selected for visualization (Sect. 3.3). In the case of SaliNet and DeSaliNet, we confirm that the output is selective of any recognizable foreground object in the image, but the class of the selected object cannot be specified by manipulating class-specific neurons.

Having established the dominance of bottleneck information, we draw an analogy between that and phase information in the Fourier transform (Sect. 3.4) and show the importance of polarity information in reversed architectures.

Finally, we quantitatively test the ability of SaliNet and DeSaliNet to identify generic foreground objects in images (Sect. 3.5). Combined with GrabCut, we achieve near state-of-the-art segmentation results on the ImageNet segmentation task of [3], while using off-the-shelf CNNs pretrained on a largely disjoint subset of ImageNet and with only image-level supervision.

Related work. DeConvNets were originally proposed as a method for unsupervised feature learning [25,26] and later applied to visualization [24]. There are several CNN visualization alternatives to DeConvNets. Some recent ones such as [23] build on the idea of natural (regularized) pre-images introduced in [12], which in turn are based on prior contributions that applied pre-images to representations such as HOG [20], SIFT [21], BoVW [2,6], as well as early neural networks [22,9,8,11,5,19]. A related line of work [1] is to learn a second neural network to act as the inverse of the original



Fig. 2: The top row shows a typical CNN obtained by repeating short chains of convolution (∗), max pooling (MP) and ReLU (RU) operators. The middle row shows a generic "deconvolution" architecture, in which information flows backward to the image using the convolution transpose ∗^BP operator. Different variants are obtained by: (i) choosing a different input (U-architecture, feature selector, or noise), (ii) choosing a variant of backward ReLU RU† (1, ReLU, ReLU backpropagation, or hybrid), and (iii) choosing a variant of backward max pooling MP† (unpool to centre or MP backpropagation). This schema generalizes DeConvNets [24].

one. Several authors characterize properties of CNNs and other models by generating images that confuse them [18,17,13]. Our DeSaliNet architecture is also similar to the work of [16].

Recently, DeConvNets have also been proposed as a tool for semantic image segmentation; for example, [14,4] interpolate and refine the output of a fully-convolutional network [10] using a deconvolutional architecture. In this paper, inspired by [15], we apply reversed architectures to foreground object segmentation, although as a by-product of visualization and in a weakly-supervised transfer-learning setting rather than as a specialized segmentation method.

2 A family of deconvolutional architectures

Given an image x ∈ X, a deep CNN extracts a feature vector or representation

φ : x ↦ φ_L ∘ ··· ∘ φ_1(x)    (1)

using a sequence of L linear and non-linear layers φ_i (Fig. 2, top). Typical layers include convolution, ReLU, max pooling, and local contrast normalization.

The goal is to associate to φ a corresponding architecture φ† that reverses in some sense the computations and produces an image as output. While such reversed architectures have several uses, here we focus on the problem of visualizing deep networks: by looking at the images produced by φ†, we hope to gain some insights about the forward network φ. This method was popularized by the work of Zeiler and Fergus in [24], where a particular construction called DeConvNet was shown to produce surprisingly crisp renderings of neural activations. However, given the heuristic nature of some choices in DeConvNets, it is difficult to precisely characterize the meaning of these results.

In order to explore this point, we consider here a generalization of the DeConvNet


construction. To this end, each layer φ_i is associated with a corresponding layer φ_i† that reverses input x and output y (Fig. 2, middle row). We also allow the reverse layer to be influenced by auxiliary information r computed by the forward layer. For instance, in DeConvNet, the reverse max pooling layer requires the "setting of the pooling switches" computed in the forward pass. Thus a layer φ_i and its reverse φ_i† are maps:

forward φ_i : x ↦ (y, r),    reversed φ_i† : (ŷ, r) ↦ x̂.    (2)

The ˆ (hat) symbol emphasizes that, in the backward direction, the tensors x̂ and ŷ have the same shape as x and y in the forward pass, but different values.

Since the auxiliary information r is a function of the input, r = π(x), one can always let r = x without loss of generality; however, the interesting case is when the auxiliary information is limited and r is an information bottleneck. For example, the pooling switches r in DeConvNet contain much less information than the input data x. In Fig. 2 these bottlenecks are denoted by dotted arrows.

The question then is: how can we build reverse layers φ_i†? Next, we show that backpropagation provides a general construction for reverse layers, which only in some cases corresponds to the choice in DeConvNet.
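The forward/reverse layer protocol of eq. (2) can be sketched in a few lines of Python (a minimal illustration of the construction, not the authors' code): each forward layer returns its output together with its bottleneck r, and the reversed network consumes the bottlenecks in reverse order. Here we use ReLU layers, whose bottleneck is the sign mask (eq. 7).

```python
import numpy as np

def relu_fwd(x):
    # Forward ReLU; the bottleneck r is the mask [x > 0].
    return np.maximum(x, 0), (x > 0)

def relu_bp_rev(y_hat, r):
    # BP-reversed ReLU: multiply the backward signal by the stored mask.
    return y_hat * r

def net_forward(x, layers):
    # Run the forward network, collecting bottleneck information r_i.
    bottlenecks = []
    for fwd, _ in layers:
        x, r = fwd(x)
        bottlenecks.append(r)
    return x, bottlenecks

def net_reverse(y_hat, layers, bottlenecks):
    # Run the reversed architecture, consuming bottlenecks in reverse order.
    for (_, rev), r in zip(reversed(layers), reversed(bottlenecks)):
        y_hat = rev(y_hat, r)
    return y_hat

layers = [(relu_fwd, relu_bp_rev), (relu_fwd, relu_bp_rev)]
x0 = np.array([-1.0, 2.0, -3.0, 4.0])
y, rs = net_forward(x0, layers)
x_hat = net_reverse(np.ones_like(y), layers, rs)  # probe with an all-ones signal
```

With only ReLU layers, the reversed output reduces to the forward mask itself, which already illustrates how strongly the result depends on the bottleneck information rather than on the probe.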

2.1 SaliNet: network saliency as a deconvolutional architecture

The network saliency method of Simonyan et al. [15] characterizes which pixels of an image are most responsible for the output of a CNN. Given an image x_0 and a network φ(x_0), saliency is obtained as the derivative of the (projected) CNN output S(φ, x_0, p) with respect to the image:

S(φ, x_0, p) = ∂⟨p, φ(x)⟩/∂x |_{x = x_0}.    (3)

Since the CNN output φ(x) is in general a vector or tensor, the latter is transformed into a scalar by linear projection onto a constant tensor p before the derivative is computed. In practice, p is usually a one-hot tensor that selects an individual neuron in the output. In this case, the value of a pixel in the saliency map S(φ, x_0, p) answers the question: "how much would the neuron response ⟨p, φ(x_0)⟩ change by slightly perturbing the value of that pixel in the image x_0?"

Saliency is computed from (1) and (3) using the chain rule:

vec S(φ, x_0, p) = (vec p)ᵀ × [∂ vec φ_L / ∂ (vec x_{L−1})ᵀ] × ··· × [∂ vec φ_1 / ∂ (vec x_0)ᵀ].    (4)

Here the vec operator stacks tensors into vectors and allows us to use a simple matrix notation for the derivatives.
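As a toy check of eqs. (3) and (4), the following sketch (our own illustration, not the paper's code) computes the saliency of a two-layer network φ(x) = W2 ReLU(W1 x) by accumulating the chain-rule factors from the output side, and verifies the "perturb one pixel" interpretation with a finite difference. The matrices W1, W2 are arbitrary random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
x0 = rng.standard_normal(4)
p = np.eye(3)[0]                       # one-hot tensor selecting output neuron 0

def phi(x):
    return W2 @ np.maximum(W1 @ x, 0)

# Saliency (eq. 3) via the chain rule (eq. 4), products taken left to right:
mask = (W1 @ x0 > 0).astype(float)     # ReLU Jacobian is diag(mask)
S = W1.T @ (mask * (W2.T @ p))         # d<p, phi(x)>/dx at x = x0

# Finite-difference check of the "perturb one pixel" reading of eq. (3).
eps = 1e-6
S_num = np.array([(p @ phi(x0 + eps * np.eye(4)[i]) - p @ phi(x0)) / eps
                  for i in range(4)])
```

Since φ is piecewise linear, the finite difference agrees with the analytic saliency up to floating-point error.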

The Back Propagation (BP) algorithm is the same as computing the products (4) from left to right; this reduces to a chain of derivatives in the form of (3), one for each layer, where p is replaced with the derivative ŷ obtained from the layer above. In this manner, BP provides a general way to define a reverse of any layer φ_i:

φ_i : x ↦ y    BP-reversed becomes    φ_i^BP : (x, ŷ) ↦ ∂⟨ŷ, φ_i(x)⟩/∂x.    (5)

We denote the BP-reverse of a layer with the symbol φ_i^BP. Note that any CNN toolbox


can compute the BP-reverse of a layer because it contains code for back-propagation. Note also that the BP-reversed layer is a linear map in the argument ŷ, even if the forward layer is not linear in x.

The BP-reversed layer φ_i^BP takes as input both x and ŷ, whereas from our discussion above we would like to replace x with a bottleneck r. Formally, using the definition (2), we rewrite the BP-reversed layer φ_i^BP(x, ŷ) as:

φ_i†(ŷ, r),    where r = π(x)

and π : x ↦ r projects the data x onto the smallest possible bottleneck. While this rewrite does not change the result of the BP-reversed layer, it characterizes how much auxiliary information it requires. In practice, while the smallest bottleneck π is easy to define in an abstract sense,¹ it is much more instructive to derive it for concrete layer types, which we do later.

Note that we have obtained the following interesting result: we can compute back-propagation, and therefore the saliency map S(φ, x_0, p) of [15], by using a "deconvolutional" architecture of the type of Fig. 2, where layers are reversed using the BP equation (5). We call the resulting deconvolutional architecture SaliNet. Next, we discuss the construction in detail for common layer types.

Affine layers. A fully connected layer φ_fc simply multiplies the data x by a matrix A and adds a bias b. Given that the data x ∈ R^{H×W×C} is a 3D tensor of height H and width W and C feature channels, we use the vec operator to write this in terms of matrices² as φ_fc : vec y = A vec x + b. Linear convolution φ_∗ can conveniently be defined in the same way, by replacing matrix A with a matrix ρ(F) constructed by "sliding" a bank of filters F, giving φ_∗ : vec y = ρ(F) vec x + b. Using (5), the BP-reversed layers are obtained by transposing the respective matrices:

φ_fc^BP : vec x̂ = Aᵀ vec ŷ,    φ_∗^BP : vec x̂ = ρ(F)ᵀ vec ŷ.    (6)

The layer φ_∗^BP is often called deconvolution and gives DeConvNets their name.

Note that the computation of these layers does not require any information from the forward pass, so the bottleneck r is empty. This is due to linearity and explains why in Fig. 2 there are no dashed arrows connecting convolutional layers.
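A toy 1-D illustration of eq. (6) (our own sketch, not from the paper): build the "sliding" matrix ρ(F) for a single filter applied in valid mode, and check that the BP-reverse, i.e. the gradient of ⟨ŷ, ρ(F)x⟩ with respect to x, is exactly multiplication by ρ(F)ᵀ. As is customary in CNNs, "convolution" here slides the filter without flipping it, i.e. correlation.

```python
import numpy as np

def sliding_matrix(f, n):
    # rho(F): each row places the filter f at one offset over a length-n
    # signal ("valid" mode, so there are n - len(f) + 1 output elements).
    k = len(f)
    rows = [np.concatenate([np.zeros(i), f, np.zeros(n - k - i)])
            for i in range(n - k + 1)]
    return np.stack(rows)

f = np.array([1.0, -2.0, 1.0])   # a small 1-D filter
rho = sliding_matrix(f, n=6)     # shape (4, 6)

x = np.arange(6, dtype=float)
y = rho @ x                      # forward convolution phi_*
y_hat = np.ones_like(y)
x_hat = rho.T @ y_hat            # BP-reverse ("deconvolution"), eq. (6)
```

The bottleneck is empty: computing x̂ needs only ŷ and the (fixed) filter weights, never the forward data x.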

Rectified Linear Unit (ReLU or RU). ReLU and its BP-reversed layer are given by

φ_RU(x) = max{x, 0},    φ_RU^BP(x, ŷ) = φ_RU†(ŷ, r) = ŷ ⊙ r,    r = [x > 0],    (7)

where max is computed element-wise, ⊙ is the element-wise product, and [x > 0] is a mask (binary tensor) with a 1 for every positive element of x and 0 otherwise. Hence the bottleneck information for ReLU is the mask. Note that φ_RU^BP(x, ŷ) is not the reversal

¹ Let x_1 ∼ x_2 be equivalent whenever the functions φ_i^BP(x_1, ·) = φ_i^BP(x_2, ·) are the same. It is easy to check that this defines an equivalence relation. Then the smallest possible bottleneck π : x ↦ r ∈ X/∼ projects x onto its equivalence class.

² This is slightly more general than usual, as it specifies a different bias for each output dimension instead of for each output feature channel.


used by DeConvNets [24,15] and this choice changes the output significantly.

Max Pooling (MP). Let x_{uc} be the element of tensor x at spatial location u ∈ Ω and feature channel c. MP is obtained by computing the maximum of x_{vc} over a small spatial neighbourhood v ∈ N(u) ⊂ Ω corresponding to u:

[φ_MP(x)]_{uc} = max_{v ∈ N(u)} x_{vc} = x_{s(u|c,x),c}    where    s(u|c,x) = argmax_{v ∈ N(u)} x_{vc}.    (8)

Here v = s(u|c,x) tells which element x_{vc} of the input is associated by max to each element y_{uc} of the output, and is informally called a setting of the pooling switches. A short derivation from (5) shows that the BP-reverse is given by

[φ_MP^BP(x, ŷ)]_{vc} = [φ_MP†(ŷ, r)]_{vc} = Σ_{u ∈ s⁻¹(v|c,x)} ŷ_{uc},    r = s(·|·, x).    (9)

Hence the bottleneck information for MP is the setting of the pooling switches. Note that, when max pooling downsamples the data, the output of this reverse layer looks like a "bed of nails", as any element which is not a local maximum in x is set to zero.
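A minimal 1-D sketch of eqs. (8) and (9) (illustrative code, not the authors'): non-overlapping pooling windows, with the argmax switches recorded in the forward pass and used to scatter the backward signal into a "bed of nails". The switch-free variant of Sect. 2.2, unpooling to the window centre, is included for comparison.

```python
import numpy as np

def max_pool_fwd(x, k):
    # Forward MP over non-overlapping windows of size k (eq. 8).
    w = x.reshape(-1, k)
    switches = w.argmax(axis=1)          # pooling switches s(u)
    return w.max(axis=1), switches

def max_pool_rev(y_hat, switches, k):
    # BP-reverse (eq. 9): route each y_hat[u] to its switch location;
    # everything that was not a window maximum stays zero.
    x_hat = np.zeros(len(y_hat) * k)
    x_hat[np.arange(len(y_hat)) * k + switches] = y_hat
    return x_hat

def unpool_to_centre(y_hat, k):
    # Variant with empty bottleneck: ignore switches, use window centres.
    x_hat = np.zeros(len(y_hat) * k)
    x_hat[np.arange(len(y_hat)) * k + k // 2] = y_hat
    return x_hat

x = np.array([1.0, 3.0, 2.0, 0.0, 5.0, 4.0])
y, s = max_pool_fwd(x, k=2)              # y = maxima, s = argmax per window
bed_of_nails = max_pool_rev(y, s, k=2)   # nonzero only at the argmax sites
```

Comparing `bed_of_nails` with `unpool_to_centre(y, 2)` shows exactly what the switch bottleneck buys: the nails land at the true activation locations instead of at fixed positions.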

2.2 Deconvolutional architectures

BP-reversal is only one way of defining reversed layers in a deconvolutional architecture. Here, we consider three variations. The first variation is whether the reversed max-pooling layers are the BP-reversed ones φ_MP^BP, with pooling switches as bottleneck, or whether they simply unpool to the center of each neighborhood, with empty bottleneck. The second variation is whether the reversed ReLU units are the BP-reversed ones φ_RU^BP, with the ReLU mask as bottleneck, or whether they are simply replaced with the identity function 1, with empty bottleneck. The third variation is whether the reversed ReLU units are, as in DeConvNets, also composed with a second ReLU. We will see that, while this choice seems arbitrary, it has a very strong impact on the results, as it preserves the polarity of neural activations. Overall, we obtain eight combinations, summarized in Fig. 3, including three notable architectures: DeConvNet, SaliNet, and the hybrid DeSaliNet. Note that only SaliNet has an immediate interpretation, which is computing the derivative of the forward network.
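The four backward-ReLU choices can be summarized in a few lines (our own illustrative sketch, assuming a forward mask r = [x > 0] has been stored): SaliNet uses the BP mask, DeConvNet applies a fresh ReLU to the backward signal, DeSaliNet composes both, and the fourth option is the identity.

```python
import numpy as np

def relu_backward(y_hat, r, variant):
    # r = [x > 0] is the ReLU mask stored during the forward pass.
    if variant == "identity":            # 1: empty bottleneck
        return y_hat
    if variant == "bp":                  # SaliNet: RU^BP, uses the mask
        return y_hat * r
    if variant == "deconvnet":           # DeConvNet: RU, a fresh ReLU
        return np.maximum(y_hat, 0)
    if variant == "hybrid":              # DeSaliNet: RU composed with RU^BP
        return np.maximum(y_hat, 0) * r
    raise ValueError(variant)

x = np.array([-1.0, 2.0, 3.0, -4.0])     # forward input to the ReLU
r = x > 0                                # bottleneck mask
y_hat = np.array([5.0, -6.0, 7.0, 8.0])  # signal arriving from the layer above
```

Note that "deconvnet" and "hybrid" both force the backward signal to be non-negative (the polarity property of eq. 10), while only "bp" and "hybrid" consult the forward mask.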

Affine layers, max pooling, and ReLU cover all the layer types needed to reverse architectures such as VGG-VD, GoogLeNet, Inception and ResNet.³ AlexNet includes local response normalization (LRN) layers, which in DeConvNet are reversed as the identity. As discussed in the supplementary material, this has little effect on the results.

3 Experiments

Our experiments thoroughly investigate the family of deconvolutional architectures identified in Sect. 2. Sect. 3.1 tests eight possible network architectures and identifies DeConvNet, SaliNet, and DeSaliNet as interesting cases for further exploration. Sect. 3.2 compares the architectures in terms of clarity of the generated images. Sect. 3.3 investigates whether visualizations provide useful information about neurons, and Sect. 3.4 looks at the effect of the bottleneck information. Finally, Sect. 3.5 evaluates these tech-

³ In all cases we deal with the network only up to the layer before the softmax.


[Fig. 3 grid: columns give the RU† variant (RU ∘ RU^BP = DeSaliNet, RU = DeConvNet, RU^BP = SaliNet, and 1); rows give the MP† variant (MP^BP switches, top; unpool to centre, bottom).]

Fig. 3: Visualizations of VGG-16 using the trilobite image of Fig. 1 under eight deconvolutional architectures. The architectures are used to visualize the maximally-firing neuron in the pool5_3 layer and the full output image is shown (localization is mostly due to the finite support of the neuron). From top to bottom we change the MP† reverse and from left to right the RU† reverse. Here all methods use the identity as the reverse LRN†.

niques on a practical application: segmentation of foreground objects.

Several experiments are shown here for a few representative images, but many more examples are provided in the supplementary material.⁴

3.1 Overview of deconvolutional architectures

The first experiment compares the eight deconvolutional architectures of Fig. 3. This is done by "reversing" the computations obtained when a network φ is evaluated on an image x_0 (in the example, the "trilobite" image of Fig. 1). Here the forward network y = φ(x_0) is AlexNet [7] truncated at the last max-pooling layer (pool5). The input p to the reversed network φ†(p) is the one-hot tensor selecting the pool5 neuron y_{uc} that is maximally excited by x_0.

We can make the following observations. First, as in [15], SaliNet computes a saliency map that looks like a fuzzy cloud. Likewise, matching the results of [24], the result of DeConvNet has structure, in the sense that object edges are recovered.

Second, we compare the left four deconvolutional architectures to the right ones, which differ by the use of the ReLU units in the backward direction. We note that adding these units is necessary in order to recover the image edges. In particular, by modifying SaliNet in this manner, DeSaliNet produces an image with structure.

Third, using pooling switches (top row) slightly improves the clarity of the results

⁴ To improve the clarity of the visualization images x̂ = φ†(p) in print, their real-valued ranges are remapped using the expression σ(x̂ (−log(99)/a)), where a is the 0.5% quantile in vec I.


compared to unpooling to center (bottom row). Even so, we note that the image structure can still be clearly recognized in the bottom-left image, using unpooling to center complemented by the hybrid RU ∘ RU^BP as reverse ReLU. In fact, this image is arguably crisper than the DeConvNet result. This suggests that, perhaps unexpectedly, the ReLU polarity (captured by RU in the backward direction) is more important than the MP switches. It also shows that the ReLU masks (captured by RU^BP) significantly improve the sharpness of the results.

So far the LRN layers in AlexNet have been reversed using the identity, as in DeConvNet; however, the original saliency method of [15] uses the BP-reversed LRN^BP. In the supplementary material we show that this has a minor impact on the result, with slightly sharper results for the DeConvNet solution. Therefore, in the rest of the manuscript, DeConvNet and DeSaliNet use the identity, while SaliNet, in keeping with the original saliency method of [15], uses LRN^BP.

3.2 Generated image quality

A first striking property of DeSaliNet is the clarity of the resulting visualizations compared to the other architectures (e.g. Figs. 1, 3, 4 and 6). While sharper visualizations than SaliNet are expected given the results in [15], the gap with DeConvNet is somewhat unexpected and particularly strong for deep layers (e.g. Fig. 1) and deeper architectures (e.g. Fig. 6). DeConvNet results appear to be less sharp than the ones shown in [24], which is likely due to the fact that they used a custom and refined version of AlexNet, whereas we visualize off-the-shelf versions of AlexNet and VGG-VD. Unfortunately, we could not obtain a copy of their custom AlexNet to verify this hypothesis.

3.3 Meaning and selectivity of the deconvolutional response

Visualizations obtained using reversed architectures such as DeConvNets are meant to characterize the selectivity of neurons by finding which visual patterns cause a neuron to fire strongly. However, we will see here that this interpretation is fragile.

Consider the i-th neuron [φ(x)]_i = ⟨e_i, φ(x)⟩ in the output layer of a (truncated) CNN architecture, where e_i is an indicator vector. In order to characterize this neuron, Zeiler and Fergus [24] search a large collection of images to find an image x* that causes φ_i(x*) to respond strongly. Thus, even before the deconvolutional network φ†(e_i, r) is applied, the image x* is already representative of the neuron. The application of φ† then refines this information by highlighting which regions in x* are most responsible for this activation.

While this sounds simple, there is a subtle complication. Note in fact that the deconvolutional architecture φ†(e_i, r) is a function both of the neuron indicator e_i and of the bottleneck information r extracted from the forward pass of x* through φ(x). In the deconvolution process, e_i is a direct specification of the neuron to be visualized. The other parameter, r, can also be considered a specification of the same neuron, although a fairly indirect one, because it is extracted from an image x* that happens to excite the neuron strongly. The question then is whether the deconvolutional response can be


[Fig. 4 grid: rows Max neuron / Rnd. neuron / Rnd. noise; columns DeConvNet, DeSaliNet, SaliNet, with pool5 and fc8 visualizations for each.]

Fig. 4: Lack of neuron selectivity. The bottleneck information r is fixed to the one computed during the forward pass φ(x) through AlexNet, and the output of φ†(e, r) is computed by choosing e as: the most active neuron (top row), a second neuron at random (middle), or a positive random mixture of all neurons (bottom row). Results barely differ, particularly for the deeper layers. See Fig. 1 for the original house input image x. Best viewed on screen.

interpreted as a direct characterization of a neuron or not. This is answered next.

Neuron/channel selectivity (or lack thereof). If the output of φ†(e_i, r) is a direct characterization of the i-th neuron, we would expect the generated image to meaningfully change as the input e_i to the deconvolutional network changes.

In Fig. 4, DeConvNet, DeSaliNet, and SaliNet are used to visualize the responses of different neurons at the center of the image. The reversed function φ†(e, r) is evaluated by keeping r fixed (as obtained from the forward pass φ(x_0)) and by replacing e with either: the indicator vector e* of the neuron that has the maximal response, a second random neuron e' that still generates a non-zero image, or a random non-negative vector e. It can be noted that, particularly in deeper layers, the response changes very little with different choices of e, and what changes there are do not have an obvious meaning.

A clear difference between images from different depths (e.g. pool5 vs fc8 in Figs. 4 and 6) is the extent of the response, which however corresponds to the neuron support and depends on the architecture and not on the learned network weights or data. This is further confirmed in Fig. 5 by considering a network with random weights. There, it is also shown that renormalizing the image intensities reveals the full neuron support, which is only partially suppressed in the visualization, and in a manner which is architecture-dependent rather than weight- or data-dependent.

We conclude that the reversed architectures φ†(e, r) are mainly dependent on the bottleneck information r rather than the neuron selector e. Hence, they provide poor direct characterizations of neurons, particularly of deep ones.


Fig. 5: Effect of finite neuron support. From left to right: visualization from VGG-16 pool5_3 using DeSaliNet; the same result after x^0.1 renormalization; visualization without any bottleneck information, as in Fig. 3 bottom-right; the same visualization without bottleneck information but with randomized filter weights for the ∗^BP operators. The renormalization reveals that the true receptive field of pool5 is much larger and that the sides are not discarded but simply made weak in the deconvolution process.

Foreground object selectivity. SaliNet is an equivalent implementation of the network saliency technique of Simonyan et al. [15]. Thus, based on their results, we would expect the response of SaliNet to be selective for the object recognized in the image. In Simonyan et al., such a response was obtained by first identifying the maximally-responding neuron e* in the last fully-connected layer of a model such as AlexNet (which performs classification), and then taking the derivative of this output with respect to the input image. This is equivalent to evaluating the SaliNet response φ†(e*, r). However, in the previous section we have shown the apparently contradictory result that this response depends weakly on e*, which specifies which class has been recognized.

To clarify this point, in Fig. 1 and Fig. 6 we observe that SaliNet and DeSaliNet are indeed selective for the foreground object in the image; however, the information comes mainly from the bottleneck r and not from the neuron selector e*. In other words, the output φ†(e, r) of SaliNet and DeSaliNet emphasizes whatever foreground object is detected by the network in the forward pass, regardless of which neuron is specified as input to the reversed architecture. The main difference between SaliNet and DeSaliNet, as observed before, is that the latter produces a much more localized and crisp response.

Compared to SaliNet and DeSaliNet, DeConvNet fails to produce a clearly selective signal from these very deep neurons, generating a rather uniform response. We conclude that saliency, in the sense of foreground object selectivity, requires not only the max pooling switches (available in all three architectures), but also the ReLU masks (used only by SaliNet and DeSaliNet).

3.4 Dominance of “phase” information

If a message emerges from the previous sections, it is that the response of all reversed architectures φ†(e, r) is dominated by the bottleneck information r. As seen in Sect. 2, the bottleneck information comprises 1) the setting of the pooling switches in the max pooling units and 2) the setting of the masks in the ReLU units.

Interestingly, this information does not code for the intensity of the neural activations, but rather for their spatial location and polarity. We argue that this information is somewhat similar to phase information in the Fourier transform and related representations. To explain this analogy, consider the Fourier transform X(ω_x, ω_y) = F[x](ω_x, ω_y) ∈ C of image x; a well-known result is that if one replaces the modulus


[Figure: columns show Image, then DeConvNet, SaliNet, DeSaliNet for VGG-VD Pool5_3, then DeConvNet, SaliNet, DeSaliNet for VGG-VD FC8.]

Fig. 6: Foreground object selectivity. This figure compares the response of DeConvNet, SaliNet, and DeSaliNet by visualizing the most active neuron in Pool5_3 and FC8 of VGG-VD. SaliNet and DeSaliNet tend to emphasize foreground objects more (see e.g. the faces of people), whereas DeConvNet's response is nearly uniform. Note that the apparent spatial selectivity of Pool5_3 is due to the finite support of the neuron and is content independent. Best viewed on screen.

of the Fourier transform with that of a random signal but preserves the phase, then the reconstructed image $x' = \mathcal{F}^{-1}\!\left[\,|Y(\omega_x, \omega_y)|\, e^{i \angle X(\omega_x, \omega_y)}\right]$, where $Y = \mathcal{F}[y]$ is the transform of a random signal $y$, still contains the structure (edges) of $x$, and very little of $y$ is recognizable. In fact the resulting image, an example of which is shown in Fig. 7, is not dissimilar from the output of DeConvNet and DeSaliNet.
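This phase-swap experiment is easy to reproduce; in the sketch below random arrays stand in for actual images, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((64, 64))  # stand-in for the original image
y = rng.random((64, 64))  # unrelated signal supplying the modulus

X = np.fft.fft2(x)
Y = np.fft.fft2(y)

# Replace the modulus of X with that of Y, but keep the phase of X.
hybrid = np.abs(Y) * np.exp(1j * np.angle(X))

# Since x and y are real, |Y| is even and the phase of X is odd, so the
# hybrid spectrum is conjugate-symmetric and its inverse transform is
# real up to rounding error.
x_rec = np.real(np.fft.ifft2(hybrid))
```

On natural images, `x_rec` preserves the edges of `x` even though the amplitude of its spectrum comes entirely from `y`, which is the point of the analogy.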

In the Fourier transform, changing the phase of a spectral component $A e^{j(\omega x + \theta)}$ by $\Delta\theta$ amounts to shifting it by $-\Delta\theta/\omega$. Furthermore, negating the signal is equivalent to a phase shift of $\pi$. In the deconvolutional architectures, the max pooling switches record the location of filter activations, whereas the ReLUs applied in the backward direction contribute to reconstructing the polarity. More precisely, in the forward pass the ReLU block computes $y = \max\{0, x\}$. In the backward direction, the signal $y$ is propagated towards the input as follows:

$x = \max\{y, 0\} \;\text{(in DeConvNet)}, \qquad x = \max\{y, 0\} \odot [x > 0] \;\text{(in DeSaliNet)}.$ (10)

We see that both constructions guarantee that the polarity of the backward signal $x$ is the same as the polarity of the forward signal $y$, which is non-negative. In fact, DeConvNet


Fig. 7: Analogy with phase information in the Fourier transform. From left to right: the original image; the inverse Fourier transform of the Fourier transform of the original image; the inverse Fourier transform after randomizing the amplitude of the spectrum; DeSaliNet $\phi^\dagger(p, r)$ with positive random input $p$; and DeConvNet with positive random input $p$. Best viewed on screen.

Table 1: Mean Intersection over Union (IoU) and mean Per-Pixel (PP) accuracy for different segmentation methods on the dataset of [3].

Method      CNN      PP     IoU    | CNN     PP     IoU    | Method                 PP     IoU
SaliNet     AlexNet  82.82  57.07  | VGG-16  82.45  56.33  | Baseline               78.97  46.27
DeSaliNet   AlexNet  82.31  55.57  | VGG-16  83.29  56.25  | Guillaumin et al. [3]  84.4   57.3
DeConvNet   AlexNet  75.85  48.26  | VGG-16  76.52  48.16  | Baseline of [3]        73.4   24.0

guarantees that $yx \geq 0$, and DeSaliNet adds the guarantee that $y = 0 \Rightarrow x = 0$. The first condition is stronger in terms of preserving the polarity, and as seen in Fig. 3 it is necessary to obtain a clear reconstruction of the image edges.
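The three backward ReLU rules compared throughout the paper differ only in which of the two factors they keep; a minimal NumPy sketch (the function and mode names are ours):

```python
import numpy as np

def relu_backward(y_back, x_fwd, mode):
    """Reversed ReLU.  y_back is the signal arriving from the layer above
    in the backward pass; x_fwd is the ReLU input recorded in the forward
    pass, whose sign pattern is the bottleneck mask."""
    if mode == "salinet":      # standard backpropagation: forward mask only
        return y_back * (x_fwd > 0)
    if mode == "deconvnet":    # rectify the backward signal, ignore the mask
        return np.maximum(y_back, 0)
    if mode == "desalinet":    # rectify AND apply the forward mask
        return np.maximum(y_back, 0) * (x_fwd > 0)
    raise ValueError(mode)
```

Only the last two modes guarantee a non-negative output, matching Eq. (10); only the first and last use the bottleneck ReLU mask, which Sect. 3.3 identified as necessary for foreground selectivity.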

3.5 Objectness for free: weakly-supervised salient object segmentation

In this section we demonstrate that pre-trained CNNs reversed using SaliNet and DeSaliNet can accurately segment generic foreground objects. To this end, we consider the benchmark dataset of [3], consisting of 4276 ImageNet images annotated with the binary mask of the foreground objects. Notably, the object categories in this benchmark are partially disjoint from the ones in the ImageNet ILSVRC data used to pre-train the CNNs: of the 445 synsets present in the segmentation benchmark data, only 215 overlap with the 1000 ILSVRC classes.

In order to perform segmentation, we improve the setup of [15]. Given an image $x$, the CNN $\phi(x)$ is evaluated up to the last layer before the softmax (FC8 in AlexNet⁵

and VGG-VD), recording the bottleneck information $r$. Rather than resizing the image to the standard network input size, the CNN is applied in a fully convolutional manner [10]. The feature channel containing the maximally active neuron is determined, and all active neurons in that feature channel are set to one in the tensor $e^*$. Following this, the reversed architecture $x = \phi^\dagger(e^*, r)$ is evaluated and its output is converted into a saliency map by computing the maximum absolute channel activation at each pixel. The saliency map is then used in GrabCut segmentation to obtain the object mask as

⁵ For DeSaliNet, the LRN layers in AlexNet are reversed using the BP-reversal $\mathrm{LRN}_{BP}$ instead of the identity, which was found to be slightly superior in terms of IoU performance.


in [15]. Besides the ILSVRC data used to pre-train the CNN and 98 segmented images used to validate the design choices above, there is no further training. For segmentation, this is a weakly-supervised setting, as no object bounding boxes or segmentations are ever used for training.
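Two steps of this pipeline, building the neuron-selector tensor $e^*$ and collapsing the reversed output into a saliency map, can be sketched in NumPy (a sketch under our reading of the text; the CNN evaluation and GrabCut are omitted, and the function names are ours):

```python
import numpy as np

def neuron_selector(features):
    """Build e*: find the channel holding the single maximally active
    neuron in a (C, H, W) feature tensor and set all of that channel's
    active entries to one."""
    c_star = np.unravel_index(np.argmax(features), features.shape)[0]
    e_star = np.zeros_like(features)
    e_star[c_star] = (features[c_star] > 0).astype(features.dtype)
    return e_star

def saliency_map(reversed_output):
    """Collapse the reversed architecture's (C, H, W) output into an
    (H, W) saliency map: maximum absolute channel activation per pixel."""
    return np.abs(reversed_output).max(axis=0)
```

The resulting map would then seed a GrabCut-style segmentation of the foreground object, as described above.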

Table 1 and Fig. 8 compare the three reversed architectures with the dedicated method of [3], which uses a combination of segment transfer from VOC2010 data, label propagation, bounding box annotations for 60k training images, and class label annotations for all images, including the evaluation ones. They also include a simple baseline obtained by using as saliency map a fixed Gaussian blob (Fig. 8); by choosing a sufficiently smooth mask, we were able to obtain a significantly better baseline performance than the analogous baseline of [3].

We found that DeSaliNet and SaliNet perform about equally well, much better than the baseline, and nearly identically to the method of [3], despite the fact that the networks were not trained with any segmentation or bounding box information, and on an independent (although larger) training set of mostly different classes. This convincingly shows the ability of these networks to learn the appearance of generic objects, and of the SaliNet and DeSaliNet architectures to extract this information efficiently. In contrast, given that DeConvNet is not selective for foreground information (Sect. 3.3), it did not perform well in this task, being comparable to the Gaussian baseline.
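For reference, the two metrics reported in Table 1 can be computed per image as follows (a sketch; the table reports means over the dataset, and details such as the handling of empty masks may differ from the evaluation protocol of [3]):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Per-pixel accuracy (PP) and foreground intersection-over-union
    (IoU) for a pair of binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pp = (pred == gt).mean()          # fraction of correctly labeled pixels
    union = (pred | gt).sum()
    iou = (pred & gt).sum() / union if union else 1.0
    return pp, iou
```

Note that PP rewards large correctly-labeled background regions, whereas IoU only credits the foreground overlap, which is why the Gaussian baseline scores much closer to the learned methods in PP than in IoU.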

Somewhat surprisingly, VGG-VD did not perform better than AlexNet for this task, nor did DeSaliNet perform better than SaliNet, despite achieving in general much sharper saliency maps. A qualitative analysis of the results indicates that VGG-VD and DeSaliNet tend to focus much more on small salient details of the objects and on their boundaries, occasionally creating holes in the GrabCut output.

4 Discussion

In this paper we have made numerous contributions towards a better understanding of deconvolutional architectures. We have derived a general construction for reversed architectures, shown that BP is an instance of such a construction, and used this to precisely contrast DeConvNet and network saliency. We also proposed a hybrid model, DeSaliNet, that produces convincingly sharper images than network saliency while being much more selective to foreground objects than DeConvNet.

We showed that the sharpness of the generated images depends mainly on the polarization enforced by the reversed ReLU units, followed by the ReLU unit masks, and with a secondary contribution from the max pooling switches. The importance of the ReLU masks suggests modifications to applications of deconvolution, such as the architecture of [14] for semantic segmentation, to be investigated in the future. We also showed that the bottleneck information (pooling switches and ReLU masks) is dominant compared to the direct input of the reversed architecture. This questions the utility of deconvolutional architectures in the study of individual neurons.

Finally, we showed that SaliNet and DeSaliNet are selective for foreground objects and confirmed this quantitatively on a large dataset of segmented images, obtaining near state-of-the-art results essentially for free, as a by-product of training the CNN without supervision for object localization.


[Figure: columns show Image, DeSaliNet, SaliNet, DeConvNet, Baseline.]

Fig. 8: Segmentation results (random selection). For each image, the top row shows the GrabCut segmentation and the bottom row shows the output of the corresponding deconvolutional network derived from AlexNet.


References

1. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
2. E. d'Angelo, A. Alahi, and P. Vandergheynst. Beyond bits: Reconstructing images from local binary descriptors. In Proc. ICPR, pages 935–938, 2012.
3. M. Guillaumin, D. Kuttel, and V. Ferrari. ImageNet auto-annotation with segmentation propagation. IJCV, 2014.
4. S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-supervised semantic segmentation. In Proc. NIPS, pages 1495–1503, 2015.
5. C. A. Jensen, R. D. Reed, R. J. Marks, M. El-Sharkawi, J.-B. Jung, R. Miyamoto, G. Anderson, and C. Eggen. Inversion of feedforward neural networks: algorithms and applications. Proc. of the IEEE, 87(9), 1999.
6. H. Kato and T. Harada. Image reconstruction from bag-of-visual-words. In Proc. CVPR, 2014.
7. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
8. S. Lee and R. M. Kil. Inverse mapping of continuous functions using local and global information. IEEE Trans. on Neural Networks, 5(3), 1994.
9. A. Linden and J. Kindermann. Inversion of multilayer nets. In Proc. Int. Conf. on Neural Networks, 1989.
10. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, pages 3431–3440, 2015.
11. B.-L. Lu, H. Kita, and Y. Nishikawa. Inverting feedforward neural networks using linear and nonlinear programming. IEEE Trans. on Neural Networks, 10(6), 1999.
12. A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proc. CVPR, 2015.
13. A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. CVPR, 2015.
14. H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proc. ICCV, pages 1520–1528, 2015.
15. K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proc. ICLR, 2014.
16. J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In Proc. ICLR, 2015.
17. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proc. ICLR, 2014.
18. A. Tatu, F. Lauze, M. Nielsen, and B. Kimia. Exploring the representation capabilities of the HOG descriptor. In ICCV Workshop, 2011.
19. A. R. Varkonyi-Koczy and A. Rovid. Observer based iterative neural network model inversion. In IEEE Int. Conf. on Fuzzy Systems, 2005.
20. C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In Proc. ICCV, 2013.
21. P. Weinzaepfel, H. Jegou, and P. Perez. Reconstructing an image from its local descriptors. In Proc. CVPR, 2011.
22. R. J. Williams. Inverting a connectionist network mapping by back-propagation of error. In Proc. CogSci, 1986.
23. J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In ICML Deep Learning Workshop, 2015.
24. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, 2014.
25. M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Proc. CVPR, 2010.
26. M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Proc. ICCV, 2011.