

Defending against Adversarial Attacks through Resilient Feature Regeneration

Tejas Borkar, Arizona State University

[email protected]

Felix Heide, Princeton University

[email protected]

Lina Karam, Arizona State University

[email protected]

Abstract

Deep neural network (DNN) predictions have been shown to be vulnerable to carefully crafted adversarial perturbations. Specifically, so-called universal adversarial perturbations are image-agnostic perturbations that can be added to any image and can fool a target network into making erroneous predictions. Departing from existing adversarial defense strategies, which work in the image domain, we present a novel defense which operates in the DNN feature domain and effectively defends against such universal adversarial attacks. Our approach identifies pre-trained convolutional features that are most vulnerable to adversarial noise and deploys defender units which transform (regenerate) these DNN filter activations into noise-resilient features, guarding against unseen adversarial perturbations. The proposed defender units are trained using a target loss on synthetic adversarial perturbations, which we generate with a novel efficient synthesis method. We validate the proposed method for different DNN architectures, and demonstrate that it outperforms existing defense strategies across network architectures by more than 10% in restored accuracy. Moreover, we demonstrate that the approach also improves resilience of DNNs to other unseen adversarial attacks.

1. Introduction

DNNs have shown great improvement in state-of-the-art performance for challenging computer vision tasks such as image classification [22, 43, 46, 17], object recognition [41, 40] and semantic segmentation [42, 50]. Despite their continued success and widespread use in practical computer vision applications, these networks have also been shown to make arbitrary high-confidence predictions for images that contain random noise and no actual human-recognizable object or class (i.e., clutter samples) [35]. Adversarial noise is a small-magnitude, carefully crafted image perturbation, almost visually imperceptible to humans, which when added to an image causes DNNs to misclassify the image even when the original perturbation-free version of such an image is correctly classified with high confidence [47, 15, 5, 21, 33, 30, 36].

[Figure 1: GoogLeNet predictions on clean and perturbed image pairs, with panel annotations for an ℓ2 norm = 3200 and an ℓ∞ norm = 10 universal perturbation; each panel lists the Baseline, PRN [1], and Ours predictions with confidence scores.]

Figure 1. DNN label predictions and confidence scores for different images perturbed with the same universal adversarial perturbation. A DNN with no defense (Baseline) misclassifies perturbed images (right) with high confidence (red text) even when it classifies clean images (left) correctly (green text). The proposed defense (Ours) correctly classifies both perturbed and clean images with high confidence.

Although the original adversarial attack formulation proposed by Szegedy et al. [47] is computationally expensive to compute, stronger and computationally inexpensive attacks have since been proposed [15, 33, 23]. Most existing adversarial attacks use target network gradients (white-box setting) to construct an adversarial example [47, 15, 23, 33, 37, 5], making them largely image- and network-specific with limited transferability to other networks [47, 25, 38]. However, some attacks do not rely on accurate gradient information, and instead construct adversarial samples by accessing only the network predictions (black-box attacks) [34, 45] or by utilizing gradients from another similar network [36].

Unlike the previously mentioned adversarial attacks, universal adversarial attacks [30] construct a single image-agnostic perturbation that, when added to any image, fools DNNs into making erroneous predictions with very high confidence, as shown in Fig. 1. These universal perturbations are not unique and many adversarial directions may exist in a DNN's feature space, further highlighting the need to guard against vulnerable directions of a DNN's feature space [31, 12, 11].


Interestingly, universal perturbations can not only be easily computed for different network architectures (including AlexNet, VGG, GoogLeNet and ResNet), but the perturbation generated for any one DNN also fools other DNNs (black-box setting), making them doubly universal [30]. Such a pre-computed perturbation that is both image-agnostic and also generalizes well to other unseen networks can easily be used to fool DNN models in real time on unseen images, circumventing the need for image-specific computations or even the network information used in [13, 27, 1]. Universal adversarial perturbations have also been extended to semantic segmentation [18].

This work proposes a novel defense against unseen universal adversarial perturbations [30] through the following contributions:

• We show the existence of a set of vulnerable convolutional filters that are largely responsible for erroneous predictions made by a DNN in an adversarial setting. The ℓ1 norm of the convolutional filter weights can be used to identify such filters, and masking the noise in their activations is able to prevent erroneous predictions.

• A novel defense that operates in the DNN feature space, unlike all existing image-domain defenses. Our proposed defense deploys learnable defender units, which operate on the activations of the vulnerable convolutional filters and transform them into noise-resilient features.

• A fast method to generate strong synthetic adversarial perturbations for training, computed from a small set of universal adversarial perturbations for a target DNN.

• Extensive evaluations on a variety of DNN architectures, including CaffeNet [22], VGG-F [6], VGG-16 [43], GoogLeNet [46] and ResNet-50 [17], validate that our proposed defense outperforms all other existing defenses. We will make our entire code publicly available for reproducibility and comparison.

• The proposed defender units are also able to improve the robustness of a DNN against other unseen attacks without any additional attack-specific training.

2. Related Work

2.1. Adversarial Training

Szegedy et al. [47] and Goodfellow et al. [15] show that augmenting adversarial examples in the training stage makes networks more robust to the same adversarial attack, but this requires computation of an adversarial sample for each training iteration. Tramèr et al. improve robustness through adversarial training with an ensemble of attack examples generated for different attack types and networks [48]. Papernot et al. [36] suggest a distillation approach that involves training a shallower network to match the softmax predictions of deeper networks. This has the effect of producing smoother and regularized network decision boundaries that are more robust to adversarial perturbations. However, Carlini and Wagner [5] subsequently proposed a modification to existing gradient-based attacks that results in much stronger adversarial perturbations which are able to fool Papernot et al.'s distillation approach. Buckman et al. [4] show robustness to adversarial examples on MNIST and CIFAR-10 through non-linear discretization of real-valued DNN inputs.

2.2. Image Transformation

Image-domain defenses utilize non-differentiable transformations of the input to mitigate the impact of adversarial perturbations [16, 39, 10, 32]. However, such approaches introduce unnecessary artifacts in clean images, resulting in accuracy loss [1, 39]. Using JPEG compression to eliminate adversarial perturbations has been explored in [10], but compression alone tends to be insufficient as a defense. Song et al. [44] propose filtering the high-frequency coefficients of an image's Saak transform to remove adversarial perturbations. Das et al. [7] train an ensemble of networks on varying degrees of JPEG compression to defend against certain attacks on Inception-v4 and ResNet-50 models. Liu et al. [26] use a DNN-oriented image compression scheme to defend against adversarial attacks.

Guo et al. [16] use image quilting to deconstruct an adversarial image into small patches that are reconstructed using total variance minimization. Similarly, dividing the adversarial image into small patches and denoising these patches have been explored in [32]. Prakash et al. [39] observe that most adversarial attacks are agnostic to object location and leverage this to locally corrupt the image through pixel redistribution, followed by a wavelet denoising stage.

2.3. Detecting Adversarial Examples

Another popular line of defenses explores the idea of first detecting an adversarially perturbed input and then either abstaining from making a prediction or further pre-processing the adversarial input for reliable predictions [24, 27, 49, 28, 29]. Li and Li [24] use statistics based on the activations of convolutional filters to first detect adversarial examples, followed by a 3×3 averaging operation on the input image to improve robustness. Similarly, Meng and Chen [28] use two additional networks, one to detect adversarially perturbed images and the other to then pre-process the detected adversarial input images to resemble natural images, before being passed on to the target DNN for predictions. Lu et al. [27] proposed an adversarial image detector based on discrete codes generated using activations from the late stages of a DNN.


Xu et al. [49] use numerous quantization operations on the DNN input to both detect adversarial images and to also squash out adversarial perturbations in the input.

All of the aforementioned defenses in Sections 2.1-2.3 are geared towards image-specific gradient-based attacks, and none of them has yet been shown to defend against image-agnostic attacks like [30] and [20]. Recently, Akhtar et al. [1] proposed a defense against image-agnostic adversarial perturbations, using a learnable Perturbation Rectifying Network (PRN) that denoises the input image. Although adversarial examples are mitigated by the PRN, clean images tend to be over-corrected, leading to a loss in accuracy. This over-correction of original images is reduced by adding an adversarial sample detector that detects and passes only adversarially perturbed input images through the PRN. In contrast, our proposed approach works solely in the DNN feature space and outperforms existing defenses. As a result, our approach does not suffer from over-correction of the input and is able to maintain high accuracy on both adversarial and clean images, without the need to detect whether the input has been adversarially perturbed. An in-depth survey of existing defenses and attacks can be found in [2].

3. Problem Formulation

Let µ_c represent the distribution of clean (unperturbed) images in R^d and let F(·) be a classifier that predicts a class label F(x) for an image x ∈ R^d. The universal adversarial perturbation attack seeks a perturbation vector v ∈ R^d under the following constraints [30]:

$$P_{x \sim \mu_c}\big(F(x+v) \neq F(x)\big) \geq (1-\delta) \quad \text{s.t.} \quad \|v\|_p \leq \xi \qquad (1)$$

where P(·) denotes probability, ‖·‖_p is the ℓ_p norm with p ∈ [1, ∞), (1 − δ) is the target fooling ratio with δ ∈ [0, 1) (i.e., the fraction of samples in µ_c that change labels when perturbed by an adversary), and ξ controls the magnitude of the adversarial perturbation. Moosavi-Dezfooli et al. [30] computed perturbations that achieved a fooling ratio ≥ 80% with ξ = 10 for the ℓ_∞-norm and ξ = 2000 for the ℓ_2-norm.
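For concreteness, the constraint in Equation 1 can be checked empirically: given a candidate perturbation and a set of images, one measures the fraction of images whose predicted label changes after adding the perturbation, subject to the ℓ_p-norm budget. The following is a minimal NumPy sketch, not the authors' code; `predict_fn` and the data arrays are hypothetical placeholders.

```python
import numpy as np

def fooling_ratio(predict_fn, images, v, p=np.inf, xi=10.0):
    """Empirically check Eq. 1: fraction of images whose label flips
    when the universal perturbation v is added, given ||v||_p <= xi."""
    # Norm budget of the universal perturbation (constraint in Eq. 1).
    norm_v = np.linalg.norm(v.ravel(), ord=p)
    assert norm_v <= xi + 1e-6, "perturbation violates the norm budget"

    clean_labels = np.array([predict_fn(x) for x in images])
    adv_labels = np.array([predict_fn(np.clip(x + v, 0, 255)) for x in images])
    return float(np.mean(clean_labels != adv_labels))

# A perturbation is a successful universal attack at level delta if
# fooling_ratio(...) >= 1 - delta (e.g., >= 0.8 as targeted in [30]).
```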

4. Feature-Domain Adversarial Defense

This section describes the proposed defense framework, which operates in the feature domain. Specifically, we propose a defense framework that first identifies convolutional filters whose activations are significantly disrupted by adversarial perturbations (Section 4.1) and then deploys defender units that transform (regenerate) disrupted features into noise-resilient features that restore the network accuracy (Section 4.2). The defender units are trained using a target loss on synthetic adversarial perturbations (Section 4.3), leaving the parameters of the underlying baseline DNN unchanged.

4.1. Stability of Convolutional Filters

Szegedy et al. [47] showed that convolutional layer stability is affected by the operator norm of the layer weights. In this work, we assess the vulnerability of individual filters and show that, for each layer, certain filter activations are significantly more disrupted than others, especially in the early layers of a DNN.

For a given layer, let φ_m(u) be the output (activation map) of the m-th convolutional filter with kernel weights W_m for an input u. Let e_m = φ_m(u + r) − φ_m(u) be the additive noise (perturbation) that is caused in the output activation map φ_m(u) as a result of applying an additive perturbation r to the input u. It can be shown (see the supplementary material, Section A) that e_m is bounded as follows:

$$\|e_m\|_\infty \leq \|W_m\|_1 \, \|r\|_p \qquad (2)$$

where, as before, ‖·‖_p is the ℓ_p norm with p ∈ [1, ∞). Equation 2 shows that the ℓ_1 norm of the filter weights can be used to identify and rank convolutional filter activations in terms of their ability to restrict the perturbation in their activation maps. For example, filters with a small weight ℓ_1 norm would incur only insignificant perturbations in their output when their input is perturbed, and are thus considered to be less vulnerable to perturbations in the input. Figure 2 shows the observed norms in the first convolutional layer of CaffeNet [22] and GoogLeNet [46] for an ℓ_∞-norm universal adversarial input. We can see that our ‖W‖_1-based ranking correlates well with the degree of perturbation (max level) that is induced in the filter outputs. Similar observations can be made for other convolutional layers in the network (see the supplementary material, Section B, Figure 8).
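As an illustration of the ranking implied by Equation 2, the ℓ_1 norms of the filter kernels can be computed directly from a convolutional layer's weight tensor. Below is a minimal NumPy sketch; the weight layout (filters × input channels × height × width) is an assumption and should be adapted to the framework at hand.

```python
import numpy as np

def rank_filters_by_l1(weights):
    """Rank convolutional filters from most to least vulnerable using the
    l1 norm of their kernel weights (the upper bound in Eq. 2)."""
    # weights: array of shape (num_filters, in_channels, k, k)
    l1_norms = np.abs(weights).sum(axis=(1, 2, 3))
    ranking = np.argsort(-l1_norms)   # descending: largest l1 norm first
    return ranking, l1_norms

# Filters with a large ||W_m||_1 admit the largest worst-case activation
# perturbation, since ||e_m||_inf <= ||W_m||_1 * ||r||_p.
```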

We evaluate the impact of masking the adversarial noise in such ranked filters on the overall top-1 accuracy of CaffeNet [22], VGG-16 [43] and GoogLeNet [46]. Specifically, we randomly choose a subset of 1000 images (1 image per class) from the ILSVRC2012 [8] training set and add an ℓ_∞-norm universal adversarial perturbation (generated using [30]) to each image to generate adversarially perturbed images. The top-1 accuracies for perturbation-free images of our subset are 0.58, 0.70 and 0.69 for CaffeNet, GoogLeNet and VGG-16, respectively. Similarly, the top-1 accuracies for adversarially perturbed images of the same subset are 0.10, 0.25 and 0.25 for CaffeNet, GoogLeNet and VGG-16, respectively. From Figure 3, we see that just masking the adversarial perturbations in 50% of the most vulnerable filter activations is able to recover most of the lost performance for each DNN, resulting in top-1 accuracies of 0.56, 0.68 and 0.67 for CaffeNet, GoogLeNet and VGG-16, respectively.
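The masking experiment can be sketched as follows: for the top fraction of filters under the ℓ_1-norm ranking, the perturbed activation is replaced by the corresponding clean activation (equivalently, the activation noise e_m is zeroed out). This NumPy sketch assumes clean and perturbed activation maps are available side by side; it is illustrative and not the paper's evaluation code.

```python
import numpy as np

def mask_vulnerable_activations(clean_acts, perturbed_acts, ranking, frac=0.5):
    """Zero out adversarial noise in the top-ranked (most vulnerable)
    filter activations by restoring their clean activations.

    clean_acts, perturbed_acts: arrays of shape (num_filters, H, W)
    ranking: filter indices ordered from most to least vulnerable
    """
    masked = perturbed_acts.copy()
    num_masked = int(frac * len(ranking))
    top = ranking[:num_masked]
    masked[top] = clean_acts[top]      # e_m = 0 for the masked filters
    return masked
```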


[Figure 2 plots: ‖W‖_1‖r‖_∞ (top) and observed ℓ_∞ norm (bottom) versus ranked filters, for CaffeNet Conv-1 and GoogLeNet Conv-1.]

Figure 2. Observed ℓ_∞ norm of the universal adversarial noise in the ranked convolutional filters (ordered from most vulnerable to least) of the first layer of CaffeNet [22] and GoogLeNet [46]. The ℓ_∞-norm attack is used with ξ ≤ 10, i.e., ‖r‖_∞ ≤ 10.

[Figure 3 plot: top-1 accuracy under noise masking for the three DNNs.]

Figure 3. Effect of masking ℓ_∞-norm universal adversarial noise in the ranked convolutional filter activations of the first layer in CaffeNet [22], GoogLeNet [46] and VGG-16 [43], evaluated on a 1000-image subset of the ILSVRC2012 [8] training set. Top-1 accuracies for perturbation-free images are 0.58, 0.70 and 0.69 for CaffeNet, GoogLeNet and VGG-16, respectively. Similarly, top-1 accuracies for adversarially perturbed images with no noise masking are 0.1, 0.25 and 0.25 for CaffeNet, GoogLeNet and VGG-16, respectively. Masking the noise in just 50% of the ranked filter activations restores most of the lost accuracy for all three DNNs.

4.2. Resilient Feature Regeneration Defense

Our proposed defense is illustrated in Figure 4. We learn a task-driven feature restoration transform that acts as a defender for convolutional filter activations severely disrupted by adversarial input. This defender unit does not modify the remaining activations of the baseline DNN. A similar approach of learning corrective transforms has been explored in a different context in [3], with the goal of making networks more resilient to image blur and additive white Gaussian noise.

Let S^l represent the set consisting of the indices of the convolutional filters in the l-th layer of a DNN. Furthermore, let S^l_def be the set of indices of the filters we wish to defend (Section 4.1) and let S^l_adv be the set of indices of the filters whose activations are not regenerated (i.e., S^l = S^l_def ∪ S^l_adv). If Φ_l represents the set of convolutional filters in the l-th layer, then our defender units perform a feature regeneration transform D_l(·) under the following conditions:

$$D_l\big(\Phi_{S^l_{def}}(u + r)\big) \approx \Phi_{S^l_{def}}(u) \qquad (3)$$

and

$$D_l\big(\Phi_{S^l_{def}}(u)\big) \approx \Phi_{S^l_{def}}(u) \qquad (4)$$

where u is the unperturbed input to the l-th layer of convolutional filters and r is an additive perturbation that acts on u. In Equations 3 and 4, ≈ denotes similarity based on classification accuracy, in the sense that features are restored to regain the classification accuracy of the original perturbation-free activation map. Equation 3 forces D_l(·) to pursue task-driven feature denoising that restores the lost accuracy of the DNN, while Equation 4 ensures that the prediction accuracy on unperturbed activations is not decreased, without any additional adversarial perturbation detector. Similar to Borkar and Karam [3], we implement D_l(·) (i.e., a defender unit) as a shallow residual block [17], consisting of two stacked 3×3 convolutional layers sandwiched between a pair of 1×1 convolutional layers, with a single skip connection. D_l(·) is then estimated using a target loss from the baseline network through backpropagation, as shown in Figure 4, but with far fewer parameters.
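A defender unit as described above can be sketched as a small residual block: a 1×1 convolution, two stacked 3×3 convolutions, another 1×1 convolution, and a skip connection around them. The PyTorch-style sketch below is an interpretation of that description; the paper's implementation is in Caffe, and the hidden widths, ReLU placement and padding choices here are assumptions.

```python
import torch
import torch.nn as nn

class DefenderUnit(nn.Module):
    """Residual block that regenerates the activations of the defended
    filter subset; all other activations bypass it unchanged."""
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, feats):
        # feats: activations of the defended filters S_def, shape (N, C, H, W)
        return feats + self.body(feats)    # single skip connection


def apply_defense(all_feats, defender, defended_idx):
    """Regenerate only the defended filter activations (Eqs. 3 and 4);
    the remaining filter activations are left untouched."""
    out = all_feats.clone()
    out[:, defended_idx] = defender(all_feats[:, defended_idx])
    return out
```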

Given an L-layered DNN Φ, pre-trained for an image classification task, Φ can be represented as a function that maps a network input x to an N-dimensional output label vector Φ(x) as follows:

$$\Phi = \Phi_L \circ \Phi_{L-1} \circ \ldots \circ \Phi_2 \circ \Phi_1 \qquad (5)$$

where Φ_l is a mapping function (set of convolutional filters) representing the l-th DNN layer and N is the output dimensionality of the DNN (i.e., the number of classes). Without any loss of generality, the resultant DNN after deploying a defender unit that operates on the set of filters represented by S^l_def in layer l is given by:

$$\Phi_{def} = \Phi_L \circ \Phi_{L-1} \circ \ldots \circ \Phi_l^{def} \circ \ldots \circ \Phi_2 \circ \Phi_1 \qquad (6)$$

where Φ_l^def represents the new mapping function for layer l, such that D_l(·) defends only the activations of the filter subset Φ_{S^l_def}, while all the remaining filter activations (i.e., Φ_{S^l_adv}) are left unchanged. If D_l(·) is parameterized by θ_l, then the defender unit can be trained by minimizing:

$$J(\theta_l) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}\big(y_k, \Phi_{def}(x_k)\big) \qquad (7)$$

where L is the same target loss function as the baseline DNN (e.g., the cross-entropy classification loss), y_k is the target output label for the k-th input image x_k, and K represents the total number of images in the training set, consisting of both clean and perturbed images.


[Figure 4 diagram: the baseline DNN feature-extraction and classification pipeline (Conv-1 through Conv-4 features, dense layers, predicted labels), with defender units inserted at the outputs of high-noise features, frozen baseline parameters, regenerated features, and trainable defender parameters updated through backpropagation of a target-oriented loss against the ground-truth label.]

Figure 4. Resilient Feature Regeneration Defense: Convolutional filter activations in the baseline DNN (top) are first sorted in order of vulnerability to adversarial noise using their respective filter weight norms (Section 4.1). We deploy defender units, consisting of a residual block with a single skip connection (4 layers), at the output of vulnerable activations (high-noise features) to regenerate input features into noise-resilient features (D(·)) that restore the lost accuracy of the baseline DNN, while leaving the remaining filter activations (low-noise features) unchanged. We train these units on both clean and perturbed images in every mini-batch using the same target loss as the baseline DNN, such that all parameters of the baseline DNN are left unchanged during training.

As we use both clean and perturbed images during training, x_k in Equation 7 represents either a clean or an adversarially perturbed image. Equations 6 and 7 show defender unit estimation for a single layer; however, it is possible to add such units in more than one DNN layer, as shown in Figure 4. In all our experiments (Section 5), the defender units only regenerate 50% of the convolutional filter activations in a layer.
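A single training step over such a mixed mini-batch, with the baseline DNN frozen and only defender-unit parameters updated, might look as follows in PyTorch-style pseudocode; `model_with_defense`, `perturbations` and the data pipeline are hypothetical stand-ins, as the paper's training is done in Caffe.

```python
import random
import torch
import torch.nn.functional as F

def training_step(model_with_defense, optimizer, images, labels, perturbations):
    """One SGD step of Eq. 7: the mini-batch mixes clean and perturbed
    images, and only defender-unit parameters receive gradients."""
    # Perturb half of the batch with a randomly chosen (synthetic) perturbation.
    half = images.size(0) // 2
    v = random.choice(perturbations)            # a pre-computed v_syn
    perturbed = torch.clamp(images[:half] + v, 0.0, 255.0)
    batch = torch.cat([perturbed, images[half:]], dim=0)

    logits = model_with_defense(batch)          # baseline weights are frozen
    loss = F.cross_entropy(logits, labels)      # same target loss as baseline

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # updates defender params only
    return loss.item()
```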

4.3. Generating Synthetic Perturbations

The defender units in Figure 4 are trained using a combination of clean and adversarially perturbed images in each training mini-batch. However, in order to avoid overfitting, generating a reasonable variety of adversarial perturbations (≥ 100) using the algorithm of Moosavi-Dezfooli et al. [30] can be computationally prohibitive, as noted by Akhtar et al. [1]. We propose a single efficient method (Algorithm 1) to construct synthetic adversarial perturbations for both ℓ_∞-norm and ℓ_2-norm attacks from a small set of adversarial perturbations V ⊆ R^d computed using [30]. Starting with the synthetic perturbation v_syn set to zero, we iteratively select a random perturbation v_new ∈ V and a random scale factor α ∈ [0, 1] and update v_syn as:

$$v_{syn} = \alpha v_{new} + (1 - \alpha) v_{syn} \qquad (8)$$

until the ℓ_2-norm of v_syn exceeds a threshold η. We set the threshold η to be the minimum ℓ_2-norm of the perturbations in the set V.

Algorithm 1 Generating a synthetic adversarial perturbation
Input: Set of pre-computed perturbations V ⊆ R^d, where v_i ∈ V is the i-th perturbation; threshold η
Output: Synthetic perturbation v_syn ∈ R^d
1: v_syn = 0
2: while ‖v_syn‖_2 ≤ η do
3:   α ∼ uniform(0, 1)
4:   v_new ← a perturbation sampled uniformly at random from V
5:   v_syn = α v_new + (1 − α) v_syn
6: end while
7: return v_syn

Unlike the approach of Akhtar et al. [1], which uses an iterative random walk along pre-computed adversarial directions, the proposed algorithm has two distinct advantages: 1) the same algorithm can be used for ℓ_2-norm and ℓ_∞-norm attacks without any modification, and 2) instead of having to check and ensure that the ℓ_∞-norm of the perturbation does not violate the constraint of the attack (i.e., ℓ_∞-norm ≤ ξ), Equation 8 (step 5 in Algorithm 1) automatically ensures this.
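A direct Python/NumPy transcription of Algorithm 1 is given below as a sketch; V is the small set of pre-computed universal perturbations and eta the ℓ_2-norm threshold described above.

```python
import numpy as np

def synthesize_perturbation(V, eta, rng=None):
    """Algorithm 1: build a synthetic universal perturbation as a running
    convex combination of randomly selected pre-computed perturbations."""
    rng = rng or np.random.default_rng()
    v_syn = np.zeros_like(V[0])
    while np.linalg.norm(v_syn.ravel(), ord=2) <= eta:
        alpha = rng.uniform(0.0, 1.0)             # random scale factor
        v_new = V[rng.integers(len(V))]           # random perturbation from V
        v_syn = alpha * v_new + (1.0 - alpha) * v_syn   # Eq. 8 update
    return v_syn

# eta is the minimum l2 norm over the perturbations in V:
# eta = min(np.linalg.norm(v.ravel()) for v in V)
```

Because each update in Equation 8 is a convex combination of perturbations that individually satisfy the ℓ_∞-norm budget, v_syn satisfies it as well, which is the property noted above.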

Synthetic samples generated using Algorithm 1 achieve a fooling rate equivalent to that achieved by sampling from V (e.g., 0.8 for a set V fooling rate of 0.8), while Akhtar et al. [1] achieve a significantly lower fooling rate (e.g., 0.67 for a set V fooling rate of 0.8). The synthetic perturbations are not identical to their closest match in the set V, yet they achieve similar fooling rates as the original perturbations in V and on average achieve a linear correlation ≥ 0.9 with their closest match, compared to 0.83 achieved by Akhtar et al. [1]. See the supplementary material for visualizations (Figure 10).

5. Assessment

We use the ImageNet validation set (ILSVRC 2012) [8], consisting of 50000 images, with a single-crop evaluation (unless specified otherwise) in our experiments. The baseline DNNs and our proposed approach are implemented using Caffe [19], and we use the publicly provided code of Moosavi-Dezfooli et al. [30] to generate the universal adversarial perturbations for testing.


For effective comparison with prior works, we follow the same evaluation setup as Akhtar et al. [1] and report our results in terms of top-1 accuracy and the restoration accuracy proposed by Akhtar et al. [1]. Given a set I_c containing clean images and a set I_{p/c} containing clean and perturbed images in equal numbers:

$$\text{Restoration accuracy} = \frac{acc(I_{p/c})}{acc(I_c)} \qquad (9)$$

where acc(·) is the top-1 accuracy. We use all 50000 images during evaluation for easy reproducibility of results. We compute 5 independent universal adversarial test perturbations per network using a set of 10000 held-out images randomly chosen from the training set, with the fooling ratio for each perturbation lower-bounded to 0.8 (except for ResNet-50¹) on the held-out images, and with the maximum normalized inner product between any two perturbations for the same DNN upper-bounded to 0.15.
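Restoration accuracy (Equation 9) is simply a ratio of top-1 accuracies; a minimal sketch, assuming prediction and label arrays are already available:

```python
import numpy as np

def restoration_accuracy(pred_clean, labels_clean, pred_mixed, labels_mixed):
    """Eq. 9: top-1 accuracy on the 50/50 clean+perturbed set I_{p/c}
    divided by top-1 accuracy on the clean set I_c."""
    acc_c = np.mean(pred_clean == labels_clean)     # acc(I_c)
    acc_pc = np.mean(pred_mixed == labels_mixed)    # acc(I_{p/c})
    return acc_pc / acc_c
```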

5.1. Defense Training Methodology

In our proposed defense (Figure 4), only the parameters of the defender units have to be trained, and these parameters are updated to minimize the cost function given by Equation 7. Although a defender unit can be deployed for each convolutional layer in a DNN, we limit the number of deployed defender units to min(#DNN layers, 6)². The filter depth of the convolutional layers in our defender units is upper-bounded by the number of regenerated feature maps (i.e., 50% of the filters in a DNN layer). Using Algorithm 1, we generate 2000 synthetic perturbations from a set V of 25 original perturbations [30] and train the defender units on a single Nvidia Titan-X using a standard SGD optimizer, a momentum of 0.9 and a weight decay of 0.0005, for 4 epochs of the ImageNet training set [8]. The learning rate is dropped by a factor of 10 after each epoch, with an initial learning rate of 0.1.
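In PyTorch-style pseudocode, the training configuration described above (freeze the baseline DNN, train only the defender-unit parameters with SGD, momentum 0.9, weight decay 0.0005, and an initial learning rate of 0.1 dropped 10× per epoch) could be set up roughly as follows; this mirrors the stated hyperparameters but is not the paper's Caffe solver definition.

```python
import torch

def build_optimizer(baseline_dnn, defender_units):
    # Freeze all baseline DNN parameters; only defender units are trainable.
    for p in baseline_dnn.parameters():
        p.requires_grad = False

    defender_params = [p for unit in defender_units for p in unit.parameters()]
    optimizer = torch.optim.SGD(defender_params, lr=0.1,
                                momentum=0.9, weight_decay=0.0005)
    # Drop the learning rate by 10x after every epoch (4 epochs total).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    return optimizer, scheduler
```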

5.2. Validation for Varying DNN Architectures

The top-1 accuracy of adversarially perturbed test images for various DNNs (no defense) and for our proposed defense on the respective DNNs is reported in Table 1, under both white-box (the same network is used to generate and test the attack) and black-box (the tested network is different from the network used to generate the attack) settings. As mentioned earlier, universal adversarial perturbations can be doubly universal, and under a black-box setting we evaluate a target DNN defense (the defense is trained for attacks on the target DNN) against a perturbation generated for a different network.

¹For ResNet-50, the provided code in [30] was able to achieve a maximum fooling ratio of only 0.6.

²See the supplementary material for results supporting this choice (Figure 9).

Table 1. Cross-DNN evaluation: Top-1 accuracy of different classifiers against an ℓ_∞-norm universal adversarial attack with ξ = 10 under both white-box and black-box settings. The target fooling ratio is 0.6 for ResNet-50 and 0.8 for all other DNNs. DNNs in column one are tested with attacks generated for the DNNs in row one.

Attack generated for:      CaffeNet   VGG-F   GoogLeNet   VGG-16   Res50

CaffeNet [22], original accuracy 56.4%
  CaffeNet (no defense)    0.109      0.298   0.456       0.431    0.479
  Ours                     0.542      0.524   0.510       0.457    0.513

VGG-F [6], original accuracy 58.4%
  VGG-F (no defense)       0.299      0.150   0.461       0.417    0.495
  Ours                     0.556      0.550   0.548       0.492    0.545

GoogLeNet [46], original accuracy 68.6%
  GoogLeNet (no defense)   0.519      0.539   0.260       0.472    0.549
  Ours                     0.651      0.653   0.653       0.637    0.660

VGG-16 [43], original accuracy 68.4%
  VGG-16 (no defense)      0.549      0.559   0.519       0.240    0.534
  Ours                     0.615      0.622   0.646       0.655    0.652

Res50 [17], original accuracy 72.3%
  Res50 (no defense)       0.593      0.605   0.545       0.448    0.390
  Ours                     0.652      0.654   0.677       0.644    0.677

Table 2. Same-norm evaluation: Restoration accuracy of DNNs and associated defenses against an ℓ_∞-norm universal adversarial attack with ξ = 10.

Methods               CaffeNet   VGG-F   GoogLeNet   VGG-16   Res50
(ℓ_∞-norm attack, ξ = 10)
Baseline              0.596      0.628   0.691       0.681    0.769
PRN [1]               0.936      0.903   0.956       -        -
PRN+det [1]           0.952      0.922   0.964       -        -
PD+WD [39]            0.873      0.813   0.884       0.818    0.900
JPEG comp. [10]       0.554      0.697   0.830       0.693    0.849
Feat. Distill. [26]   0.671      0.689   0.851       0.717    0.867
Ours                  0.976      0.967   0.970       0.963    0.946

The top-1 accuracy of the baseline DNNs is severely affected by both white-box and black-box attacks, whereas our proposed defense is not only able to effectively thwart the white-box attacks but is also able to generalize to attacks constructed for other networks without further training (Table 1). Since different DNNs can share common adversarial directions in their feature space, our defender units get trained on regularizing such directions against unseen data and are able to defend against black-box attacks. Figure 5 shows a visual example of how defender units remove adversarial perturbations from perturbed activation maps whilst preserving the discriminative features of the original perturbation-free activation maps. From Table 1, it can be seen that black-box attacks generated using VGG-16 [43] are the most challenging to defend against, as compared to the other considered networks.

5.3. Analysis and Comparisons

We use the restoration accuracy metric given by Equation 9 to provide a comparison of our defense with existing defenses (Tables 2 and 3), using both clean and perturbed images, since a defense should not only recover the DNN accuracy against adversarial images but must also maintain a high accuracy on clean images.


Table 3. Cross-norm evaluation: Restoration accuracy against an ℓ_2-norm attack using defenses (PRN [1] and Ours) trained on ℓ_∞-norm attack examples with ξ = 10.

Methods        CaffeNet   VGG-F   GoogLeNet   VGG-16
(ℓ_2-norm attack, ξ = 2000)
Baseline       0.677      0.671   0.682       0.697
PRN [1]        0.922      0.880   0.971       -
PRN+det [1]    0.936      0.900   0.975       -
PD+WD [39]     0.782      0.784   0.857       0.809
Ours           0.964      0.961   0.912       0.876
(Stronger ℓ_2-norm attack, ξ = 3500)
Baseline       0.548      0.548   0.613       0.617
PD+WD [39]     0.782      0.640   0.774       0.756
Ours           0.916      0.921   0.880       0.809

Even though Akhtar et al. [1] are the only ones who have reported defense results on universal adversarial attacks, we also compare results with the defense proposed by Prakash et al. [39], as it has one of the best reported defense results for all other attacks. Additionally, we compare against defenses that pre-process the input, such as JPEG compression [10, 26], using a quality factor of 50 for compression as suggested by the respective authors.

In Table 2, we report results for an ℓ_∞-norm attack against various DNNs and show that our defender units outperform all the other defenses for all networks, achieving a 97% restoration rate for GoogLeNet [46]. Even without a perturbation detector, our defense outperforms the existing defense with a perturbation detector (PRN+det) of Akhtar et al. [1] for all networks, and by almost 4% for VGG-F. Without a perturbation detector, PRN [1] is consistently outperformed by our approach by a significant margin. Although Prakash et al. [39] (PD+WD) report an effective defense against many attacks, their approach does not defend well against a universal adversarial attack.

In Table 3, we also evaluate how well a defense trained on an ℓ_∞-norm attack (Akhtar et al. [1] and ours) defends against an ℓ_2-norm attack on the same DNN. Our defender units are able to effectively generalize to even cross-norm attacks and outperform the other two defenses on a majority of DNNs. Akhtar et al. [1] do not report results for stronger ℓ_2-norm attacks (i.e., ξ > 2000). From Table 3, it can be seen that our defender units are still able to effectively recover 91% accuracy even against much stronger ℓ_2-norm attacks.

5.4. Robustness to Secondary Attacks

Although in practical situations an attacker may not have full or even partial knowledge of a defense, for completeness we also evaluate our proposed defense against a secondary attack (type II attack [27]) that has full access to the gradient information of our defender units.

[Figure 5 image grid: activation maps ordered from the most disruptive (left) to the least disruptive (right) filters, with rows labeled Clean, Perturbed, Regenerated, and Difference.]

Figure 5. Filter activations for conv1_1 in VGG-16: Row 1 shows activations for a clean image. Row 2 shows activations affected by universal adversarial perturbations. Row 3 shows the adversarial-attack-resilient feature activations generated by our defender unit. Row 4 visualizes the difference map between the perturbed and regenerated activations.

[Figure 6 plot: fooling rate (y-axis, 0.3 to 1.0) vs. attack epochs (x-axis, 0 to 20) for the type II attack, with curves for the target fooling rate, the baseline DNN, and our defense.]

Figure 6. Robustness to secondary attacks: Fooling rate vs. attack epochs for an ℓ_∞-norm attack (ξ = 10) against the CaffeNet architecture, where the attacker has full knowledge of the baseline DNN and also of the defense. The defender units are trained using adversarial samples that achieve a target fooling rate of 0.85.

Figure 6 shows the robustness of our defender units (trained on samples with a fooling rate of 0.85) to a secondary attack seeking to achieve a target fooling rate of 0.85 whilst having full access to the defense gradients. While the attack easily converges against a baseline DNN in less than 2 epochs, eventually achieving a final fooling rate of 0.9, it struggles to achieve a fooling rate of even 0.8 against our defense after 20 epochs, plateauing at 0.78. Although it is possible to design a converging attack for a much smaller fooling rate, it is fairly straightforward to defend against this by simply augmenting the defender unit training with such samples.

5.5. Resilience to Unseen Attacks

Although our proposed defender units effectively defend against universal adversarial perturbations, here we evaluate the resilience of such a defense to other unseen attacks and sources of noise without any additional attack-specific training.


Table 4. Resilience across various attacks on CaffeNet: Restoration accuracy against various other attacks using defender units trained on just ℓ_∞-norm universal adversarial examples. Attacks are constructed for the baseline DNN and are assumed to have no knowledge of the defense.

Defense      Univ. adv [30]  AWGN          Local adv. [45]  FGSM [15]    DeepFool [33]  IGSM [23]    Singular-UAP [20]  Sensor noise [9]  Average
             (ℓ_2 = 3000)    (ℓ_2 = 7700)  (ℓ_2 = 6600)     (ℓ_2 = 194)  (ℓ_2 = 68)     (ℓ_2 = 800)  (ℓ_2 = 3930)       (ℓ_2 = 6920)
Baseline     0.596           0.921         0.741            0.715        0.611          0.553        0.852              0.836             0.728
PD+WD [39]   0.873           0.899         0.761            0.819        0.894          0.657        0.870              0.841             0.827
Ours         0.975           0.936         0.816            0.765        0.918          0.584        0.914              0.870             0.847

Figure 7. Visualization of various attacks on CaffeNet: Adversarially perturbed images for the respective attacks are shown in row 1, with an enlarged view of a chosen image region (red square). Row 2 shows the 2D-FFT magnitude plots of the adversarial perturbations; each FFT magnitude is normalized to the range [0, 1]. Row 3 shows the predicted label and confidence (green = correct, red = incorrect) of the respective images for a baseline DNN (no defense) and for defender units trained only on universal adversarial perturbations.

For this evaluation, we choose seven additional attacks that not only cover both white-box and black-box attacks but also cover a large range of ℓ_2-norm values (i.e., 68 ≤ ℓ_2-norm ≤ 7700). Please see the supplementary material (Section D) for a detailed description of the individual attacks. Our defender units are only trained on ℓ_∞-norm universal adversarial perturbation attack samples with ξ = 10. Results are reported in Table 4. From Table 4, it can be seen that our approach is resilient to unseen attacks and achieves the best restoration accuracy on most of the attacks. Furthermore, in Figure 7 we visualize perturbed images and the respective noise perturbations in the frequency domain (2D-FFT magnitude plots). Figure 7 also shows the prediction results of a DNN with our defense as compared to the baseline DNN with no defense under various types of attacks.

6. Conclusion

We propose a novel approach that operates solely in the DNN feature domain and effectively defends against universal adversarial perturbations [30], unlike existing adversarial defenses which work mainly in the image domain or use a modified loss function.

The proposed defense first identifies the convolutional features that are most disrupted by adversarial noise and deploys trainable defender units which transform these filter activations into noise-resilient features. The defender units can be efficiently trained using a set of synthetic adversarial perturbations, avoiding the need for a large set of original adversarial perturbations that are computationally expensive to generate. The proposed defense not only effectively generalizes across different DNNs as well as attack norms, but also makes secondary white-box attacks more difficult. Additionally, defender units trained solely on universal adversarial perturbations are shown to improve the resilience of baseline DNNs to other types of unseen (adversarial) noise. Our DNN feature-domain defense is complementary to existing work such as image-domain defenses [39, 16] and adversarial training [47], and can be easily combined with such defenses. Although we show results for the task of image classification, we expect the proposed defense to be immediately applicable to DNNs for other semantically-related tasks such as object detection [14] and semantic segmentation [42].

References

[1] N. Akhtar, J. Liu, and A. Mian. Defense against universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3389–3398, June 2018.
[2] N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. CoRR, abs/1801.00553, 2018.
[3] T. S. Borkar and L. J. Karam. DeepCorrect: Correcting DNN models against image distortions. CoRR, abs/1705.02406, 2017.
[4] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018.
[5] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57, 2017.
[6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[7] N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, S. Li, L. Chen, M. E. Kounavis, and D. H. Chau. SHIELD: Fast, practical defense and vaccination for deep learning using JPEG compression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 196–204, 2018.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[9] S. Diamond, V. Sitzmann, S. P. Boyd, G. Wetzstein, and F. Heide. Dirty pixels: Optimizing image classification architectures for raw sensor data. CoRR, abs/1701.06487, 2017.
[10] G. K. Dziugaite, Z. Ghahramani, and D. M. Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.
[11] A. Fawzi, S. Moosavi-Dezfooli, and P. Frossard. The robustness of deep networks: A geometrical perspective. IEEE Signal Process. Mag., 34(6):50–62, 2017.
[12] A. Fawzi, S. Moosavi-Dezfooli, P. Frossard, and S. Soatto. Classification regions of deep neural networks. CoRR, abs/1705.09552, 2017.
[13] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[15] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[16] C. Guo, M. Rana, M. Cisse, and L. van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] J. Hendrik Metzen, M. Chaithanya Kumar, T. Brox, and V. Fischer. Universal adversarial perturbations against semantic image segmentation. In IEEE International Conference on Computer Vision, pages 2774–2783, Oct 2017.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[20] V. Khrulkov and I. Oseledets. Art of singular vectors and universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8562–8570, June 2018.
[21] J. Kos, I. Fischer, and D. Song. Adversarial examples for generative models. CoRR, abs/1702.06832, 2017.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[23] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
[24] X. Li and F. Li. Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, pages 5775–5783, Oct 2017.
[25] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial samples and black-box attacks. CoRR, abs/1611.02770, 2016.
[26] Z. Liu, Q. Liu, T. Liu, Y. Wang, and W. Wen. Feature Distillation: DNN-oriented JPEG compression against adversarial examples. In International Joint Conference on Artificial Intelligence, 2018.
[27] J. Lu, T. Issaranon, and D. Forsyth. SafetyNet: Detecting and rejecting adversarial examples robustly. In IEEE International Conference on Computer Vision, pages 446–454, Oct 2017.
[28] D. Meng and H. Chen. MagNet: A two-pronged defense against adversarial examples. In ACM SIGSAC Conference on Computer and Communications Security.
[29] J. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017.
[30] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 86–94, 2017.
[31] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto. Analysis of universal adversarial perturbations. CoRR, abs/1705.09554, 2017.
[32] S. Moosavi-Dezfooli, A. Shrivastava, and O. Tuzel. Divide, denoise, and defend against adversarial attacks. CoRR, abs/1802.06806, 2018.
[33] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, June 2016.
[34] N. Narodytska and S. P. Kasiviswanathan. Simple black-box adversarial perturbations for deep networks. CoRR, abs/1612.06299, 2016.
[35] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, June 2015.
[36] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, pages 506–519, 2017.
[37] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, C. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, pages 372–387, 2016.
[38] N. Papernot, P. D. McDaniel, and I. J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1607.02533, 2016.
[39] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer. Deflecting adversarial attacks with pixel deflection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8571–8580, June 2018.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[42] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, Apr. 2017.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[44] S. Song, Y. Chen, N. Cheung, and C. J. Kuo. Defense against adversarial attacks with Saak transform. CoRR, abs/1808.01785, 2018.
[45] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.
[46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[47] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[48] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
[49] W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In Network and Distributed System Security Symposium, 2018.
[50] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.


Supplementary Material

A. Maximum adversarial perturbation

We show in Section 4.1 of the paper that the maximum possible adversarial perturbation in a convolutional filter activation map is proportional to the ℓ_1-norm of its corresponding filter kernel weights. Here, we provide a detailed proof of Equation 2 in Section 4.1 of the paper. For a given convolutional layer, let W_m ∈ R^{d×d×C} be the m-th convolutional filter kernel, where C is the number of input feature maps to the layer and d is the spatial kernel size. Let Z ∈ R^{X×Y×C} be the adversarial perturbation in the layer input, where X and Y represent the height and width of the input feature maps, respectively. Then, the adversarial perturbation in the m-th filter activation map is given by:

$$Z * W_m = \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} Z_c(x', y') \, W_m^c(x - x', y - y') \qquad (10)$$

where the subscript c is the index of the input feature map. The maximum adversarial perturbation in the m-th output activation map is given by:

$$\max |Z * W_m| = \max \left| \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} Z_c(x', y') \, W_m^c(x - x', y - y') \right| \qquad (11)$$

$$\leq \max \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} \big| Z_c(x', y') \big| \, \big| W_m^c(x - x', y - y') \big| \qquad (12)$$

$$\leq z_m \, \max \sum_{c=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} \mathbf{1}_c(x', y') \, \big| W_m^c(x - x', y - y') \big| \qquad (13)$$

where z_m = max |Z| and 1 is a 3D tensor of ones with the same size as Z. From Equation 13, we can easily show:

$$\max |Z * W_m| \leq z_m \sum_{c=1}^{C} \sum_{i=1}^{d} \sum_{j=1}^{d} \big| W_m^c(i, j) \big| \qquad (14)$$

If vec(·) represents a vectorization operation, Equation 14 can be rewritten as:

$$\|\mathrm{vec}(Z * W_m)\|_\infty \leq \|\mathrm{vec}(W_m)\|_1 \, \|\mathrm{vec}(Z)\|_\infty \qquad (15)$$

Since the ℓ_∞-norm is upper-bounded by the ℓ_p-norm for p ∈ [1, ∞), we have:

$$\|\mathrm{vec}(Z * W_m)\|_\infty \leq \|\mathrm{vec}(W_m)\|_1 \, \|\mathrm{vec}(Z)\|_p \qquad (16)$$
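The bound in Equation 16 can also be sanity-checked numerically with random tensors; the sketch below convolves a random multi-channel perturbation with a random kernel and verifies that the peak output perturbation never exceeds ‖vec(W_m)‖_1 · ‖vec(Z)‖_∞. This is an illustrative check, not part of the paper's code.

```python
import numpy as np
from scipy.signal import convolve2d

def check_bound(num_trials=100, C=3, d=3, X=16, Y=16, seed=0):
    """Numerically verify ||vec(Z * W_m)||_inf <= ||vec(W_m)||_1 ||vec(Z)||_inf."""
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        W = rng.normal(size=(C, d, d))              # one filter with C input channels
        Z = rng.uniform(-10, 10, size=(C, X, Y))    # input-side perturbation
        # Multi-channel 2-D convolution: sum of per-channel convolutions (Eq. 10).
        out = sum(convolve2d(Z[c], W[c], mode='full') for c in range(C))
        lhs = np.abs(out).max()
        rhs = np.abs(W).sum() * np.abs(Z).max()
        assert lhs <= rhs + 1e-9, (lhs, rhs)
    return True
```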

B. Masking perturbations in other layers

In Section 4.1 of the paper (Figure 3 of the paper), we evaluate the effect of masking ℓ_∞-norm adversarial perturbations in a ranked subset (using the ℓ_1-norm ranking) of the convolutional filter activation maps of the first convolutional layer of a DNN. Here, in Figure 8, we evaluate the effect of masking ℓ_∞-norm adversarial perturbations in the ranked filter activation maps of convolutional layers 2, 3, 4 and 5 of CaffeNet [22] and VGG-16 [43]. We use the same evaluation setup as in Section 4.1 of the paper (i.e., a 1000-image random subset of ILSVRC2012 [8]). The top-1 accuracies for perturbation-free images of the subset are 0.58 and 0.69 for CaffeNet and VGG-16, respectively. Similarly, the top-1 accuracies for adversarially perturbed images of our subset are 0.10 and 0.25 for CaffeNet and VGG-16, respectively. Similar to our observations in Section 4.1 of the paper, for most DNN layers, masking the adversarial perturbations in just the top 50% most susceptible filter activation maps (identified using the ℓ_1-norm ranking measure, Section 4.1 of the paper) is able to recover most of the accuracy lost by the baseline DNN (Figure 8). Specifically, masking the adversarial perturbations in the top 50% ranked filters of VGG-16 is able to restore at least 84% of the baseline accuracy on perturbation-free images.

C. Defender units: An ablation study

In general, defender units can be added at the output of each convolutional layer in a DNN. However, this may come at the cost of increased computation, more trainable parameters and also over-fitting. As mentioned in Section 5.1 of the paper, we constrain the number of defender units added to the DNN in order to avoid drastically increasing the training and inference cost for larger DNNs (i.e., VGG-16, GoogLeNet and ResNet-50). Here, we perform an ablation study to identify the smallest number of defender units needed to achieve at least a 95% restoration accuracy across most DNNs. Specifically, we use VGG-16 [43] and GoogLeNet [46] for this analysis. We evaluate the restoration accuracy on the ILSVRC2012 [8] validation set (50000 images) while adding an increasing number of defender units, starting from a minimum of 2 up to a maximum of 10 in steps of 2. In Figure 9, we report the results of this ablation study and observe that, for GoogLeNet, adding just two defender units achieves a restoration accuracy of 97%, and adding any more defender units does not have any significant impact on the restoration accuracy. However, for VGG-16, adding only 2 defender units achieves a restoration accuracy of only 91%. For VGG-16, adding more defender units improves performance, with the best restoration accuracy of 96.2% achieved with 6 defender units. Adding more than 6 defender units resulted in a minor drop in restoration accuracy, which may be due to over-fitting. As a result, we restrict the number of defender units deployed for any DNN to min(#DNN layers, 6).

D. Adversarial attack details

Here, we provide a brief overview of the different adversarial attacks and noise sources used for the evaluation presented in Section 5.5 of the paper.


[Figure 8 plots: top-1 accuracy versus the fraction of ranked filter activations masked, for convolutional layers 2-5 of CaffeNet and VGG-16.]

Figure 8. Effect of masking ℓ_∞-norm universal adversarial noise in the ranked convolutional filter activations of CaffeNet [22] and VGG-16 [43], evaluated on a 1000-image subset of the ILSVRC2012 [8] training set. Top-1 accuracies for perturbation-free images are 0.58 and 0.69 for CaffeNet and VGG-16, respectively. Similarly, top-1 accuracies for adversarially perturbed images with no noise masking are 0.1 and 0.25 for CaffeNet and VGG-16, respectively. For VGG-16, masking the noise in just 50% of the ranked filter activations restores at least ≈80% of the baseline accuracy on perturbation-free images.

[Figure 9 plot: restoration accuracy (y-axis) versus the number of defender units (x-axis).]

Figure 9. Effect of adding defender units on the restoration accuracy of our proposed defense. Adding just two defender units in GoogLeNet [46] achieves a restoration accuracy of 97%, and adding more defender units to the DNN does not improve results any further. For VGG-16 [43], adding 6 defender units provides the best results.

Fast Gradient Sign Method (FGSM) [15] is a single-iteration white-box adversarial attack that uses just the sign of the gradient of the loss function with respect to the input image to construct an adversarial perturbation. Typically, the signed-gradient perturbation is scaled by a factor ε that can be used to control the strength of the adversarial perturbation. For our experiments, we set ε = 0.5, and the image input to the DNN has a pixel value range of [0, 255].
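For reference, a minimal PyTorch-style FGSM sketch is shown below; the model, loss and input scaling are hypothetical placeholders, with ε = 0.5 and inputs in [0, 255] as stated above.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.5):
    """Single-step FGSM: perturb the input along the sign of the loss gradient."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Signed-gradient step, clipped back to the valid pixel range [0, 255].
    adv = image + epsilon * image.grad.sign()
    return torch.clamp(adv.detach(), 0.0, 255.0)
```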

Iterative Gradient Sign Method (IGSM) [23] is an iterative white-box adversarial attack based on FGSM [15], where at each iteration an adversarial perturbation is constructed using the sign of the gradient of the loss function with respect to the image and added to the image, followed by a clipping operation that restricts the per-pixel perturbation to be within an [-ε, ε] neighborhood of the original image. The iterative process is continued until the target DNN misclassifies the input image or the number of iterations exceeds a maximum limit. We use ε = 2 in our experiments and restrict the maximum number of iterations to 10.

DeepFool [33] is an untargeted white-box adversarial attack that seeks the smallest possible perturbation which, when added to an input image, causes the target DNN to misclassify the perturbed image, even though the perturbation-free image is classified correctly.


This attack approximates the DNN's decision function with a linear decision boundary and minimizes the ℓ_2-norm of the perturbation with respect to the original image. We use the code and default parameters provided by the original authors at https://github.com/LTS4/DeepFool.

Local adversarial attack (single-pixel attack) [45] is a black-box attack that restricts the number of pixels (the perturbation budget) that can be adversarially perturbed in the original image. Unlike most other gradient-based attacks, this attack corrupts only a small fraction of the pixels in the original image and still manages to fool a DNN into making erroneous predictions. The attack uses differential evolution and a greedy search to construct adversarial perturbations from only the DNN's output predictions, without using any gradient information. The perturbation budget (i.e., the number of pixels perturbed) is constrained to just 500 pixels for our experiments with 224×224-pixel RGB images (i.e., 0.9% of the total pixels).

Singular-UAP [20] is a recently proposed universal adversarial attack that uses the singular vectors of the Jacobian matrices of the hidden layers of a DNN to construct universal adversarial perturbations. Unlike [30], this attack uses far fewer images to construct an attack. Using the code and parameters provided at https://github.com/KhrulkovV/singular-fool, we construct attacks using the singular vectors of the Jacobian matrices for the "conv3" (third convolutional) and "fc6" (first fully-connected) layers of CaffeNet [22] and report the average restoration accuracy.

Sensor noise [9] models realistic noise and blur induced during the image formation process in a real camera. Specifically, a Poisson-Gaussian model parameterized by (α_p, σ_b) is used to model the noise and blur observed in the captured image. The Poisson component (parameterized by α_p) represents the noise introduced by the image sensor when an image is captured in low-light settings, while the Gaussian component (parameterized by σ_b) models the blur induced during image formation. We evaluated the Poisson-Gaussian image formation model under three different ISO light settings (ISO = 100, 200 and 400), representing increasingly noisy imaging conditions, with α_p = 0.00038 and a fixed blur standard deviation σ_b = 0.0025. We report results averaged over all three ISO settings.
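One common way to simulate such a Poisson-Gaussian pipeline, and the way we sketch it here (the exact formulation in [9] may differ), is to first apply Gaussian blur with standard deviation σ_b and then apply scaled Poisson shot noise with gain α_p; the function name and the assumption of a single-channel image in [0, 1] are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_sensor_noise(image, alpha_p=0.00038, sigma_b=0.0025, rng=None):
    """Illustrative Poisson-Gaussian sensor model: Gaussian blur followed by
    Poisson shot noise with gain alpha_p (single-channel image in [0, 1])."""
    rng = rng or np.random.default_rng()
    blurred = gaussian_filter(image, sigma=sigma_b)           # blur component
    noisy = alpha_p * rng.poisson(np.clip(blurred, 0, None) / alpha_p)  # shot noise
    return np.clip(noisy, 0.0, 1.0)
```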

Additive White Gaussian Noise (AWGN) with three different levels of noise standard deviation σ_n was used in our evaluation. We use σ_n = {5, 10, 15} for our experiments and report results averaged over all three noise levels.

E. Examples of synthetic perturbations

Sample visualizations of synthetic adversarial perturbations generated using the algorithm proposed in Section 4.3 (Algorithm 1) of the paper are provided in Figure 10.


[Figure 10 image grid (ℓ_∞ norm = 10): for each of CaffeNet (fooling ratios 0.8 / 0.78), GoogLeNet (0.75 / 0.72), ResNet-50 (0.60 / 0.58) and VGG-16 (0.75 / 0.72), the closest matching original perturbation, the synthetic perturbation, and their difference map.]

Figure 10. Visualization of synthetic perturbations (center) computed for CaffeNet [22], GoogLeNet [46], ResNet-50 [17] and VGG-16 [43], along with their closest match in the set of original perturbations (left) and a per-pixel difference map between the two (right).