
RAB: Provable Robustness Against Backdoor Attacks

Maurice Weber†∗ Xiaojun Xu‡∗ Bojan Karlas† Ce Zhang† Bo Li‡

† ETH Zurich, Switzerland {webermau, karlasb, ce.zhang}@inf.ethz.ch

‡ University of Illinois at Urbana-Champaign, USA {xiaojun3, lbo}@illinois.edu

Abstract

Recent studies have shown that deep neural networks (DNNs) are vulnerable to various attacks, including evasion attacks and poisoning attacks. On the defense side, there has been intense interest in provable robustness against evasion attacks.

In this paper, we focus on improving model robustness against more diverse threat models. Specifically, we provide the first unified framework using smoothing functionals to certify model robustness against general adversarial attacks. In particular, we propose the first robust training process, RAB, to certify against backdoor attacks. We theoretically prove the robustness bound for machine learning models based on the RAB training process, analyze the tightness of the robustness bound, and propose different smoothing noise distributions such as Gaussian and uniform distributions. Moreover, we evaluate the certified robustness of a family of "smoothed" DNNs which are trained in a differentially private fashion. In addition, we theoretically show that for simpler models such as K-nearest neighbor (KNN) models, it is possible to train the robust smoothed models efficiently. For K = 1, we propose an exact algorithm to smooth the training process, eliminating the need to sample from a noise distribution.

Empirically, we conduct comprehensive experiments for different machine learning models such as DNNs, differentially private DNNs, and KNN models on the MNIST, CIFAR-10 and ImageNet datasets to provide the first benchmark for certified robustness against backdoor attacks. In particular, we also evaluate KNN models on the tabular Spambase dataset to demonstrate their advantages. Both the theoretical analysis of certified model robustness against arbitrary backdoors and the comprehensive benchmark on diverse ML models and datasets shed light on further robust learning strategies against training-time, or even general, adversarial attacks on ML models.

1 Introduction

Building machine learning algorithms that are robust to adversarial attacks has been an emerging topic over the last decade. There are mainly two types of adversarial attacks: (1) evasion attacks, in which the attacker manipulates test examples against a trained machine learning model, and (2) data poisoning attacks, in which the attacker is allowed to perturb the training set. Both types of attacks have attracted intense interest from academia as well as industry [1-4].

Several solutions to these threats have been proposed [5-8]. For instance, adversarial training retrains ML models on generated adversarial examples [9], and quantization has been applied to either inputs or neural network weights to defend against potential adversarial instances [6]. However, recent studies have shown that these defenses are not resilient against intelligent adversaries responding dynamically to the deployed defenses [5].

As a result, one recent, exciting line of research aims to develop provably robust algorithms against evasion attacks. To provide provable robustness for ML models, one usually needs to solve a min-max problem min_θ max_π l_f(x + π, y), where θ denotes the model parameters, π the perturbation and l_f(·) the loss function for the training instance (x, y). Given the complexity of models such as deep neural networks, exactly solving this min-max problem could be NP-complete.

∗The first two authors contribute equally to this work.


Figure 1: In this paper, we define a robust training process RAB for classifiers. Given a poisoned dataset D′, produced by the attacker adding backdoor patterns ∆ to some instances in the clean dataset D, this process guarantees that, for all test examples x, AD′(x) = AD(x) with high probability whenever the magnitude of ∆ is within the certification radius.

Therefore, in primal optimization, various methods have been developed to encode the nonlinear activation functions as linear constraints; for instance, NSVerify [10], MIPVerify [11], and ILP [12] have been proposed. The constraints can also be simplified through dual optimization; representative dual-optimization methods are Duality [13], ConvDual [14], and Certify [15].

Despite these recent developments on provable robustness against evasion attacks, only empirical studies have been conducted to defend against poisoning attacks [16,17], and the question of how to improve and certify the robustness bound of a given machine learning model against advanced poisoning attacks remains largely unanswered. In particular, to the best of our knowledge, there are no provably robust strategies to deal with poisoning attacks. Naturally, we wonder: Can we develop provably robust algorithms against poisoning attacks?

Poisoning attacks are a popular family of attacks in which an attacker adds small patterns to the training set such that the trained model is biased towards test inputs carrying the same pattern. Such attacks arise in various real-world scenarios such as online recommendation systems [18]. In this paper, we present the first certification process for provable robustness against backdoor attacks, the most popular poisoning attack against DNNs. Specifically, we focus on the following setting. Let A_{xi,yi}^n : X → {0, 1} be the classifier trained on an n-example training set {(x1, y1), ..., (xn, yn)}. For a test example x, we focus on a certification process that guarantees
\[
\sum_{i=1}^{n} \|\pi_i\|_2 < R \;\Rightarrow\; \mathcal{A}_{\{x_i, y_i\}^n}(x) = \mathcal{A}_{\{x_i + \pi_i, y_i\}^n}(x).
\]
Intuitively, this guarantees that the prediction of the classifier stays the same no matter what patterns the attacker adds to the training set, as long as those patterns are bounded by a certain radius.

In this paper we first present a general provably robust framework that significantly generalizes the recent results on randomized smoothing.

In our framework, we provide a certification process for (1) any deterministic function f with output domain in [0, 1], with a base classifier as the special case; (2) any type of perturbation, parametrized by a parameter δ, on the input of f; and (3) we provide sufficient conditions under which our general result yields a tight robustness bound. This framework allows us to naturally develop the first certification process against poisoning attacks.

Given its generality, we propose in particular the RAB robust training process to improve prediction robustness against backdoor attacks, the most popular poisoning attack against DNNs. In addition to the certified robustness bound for standard DNNs, we also evaluate models trained with a differentially private smoothing training process and show that such "smoothed" models enjoy higher certified robustness. This provides some guidance on improving certified robustness against poisoning attacks, and we hope it can inspire other robust learning work in the future.


Besides DNN models, we also observe an interesting connection between RAB and a recently developed result from the database community [19], and we adapt that technique to show that for simple models such as KNNs it is possible to perform an efficient, exact smoothing of the training process instead of randomly sampling noise from a smoothing distribution. We propose an exact smoothing algorithm for 1-NN models to certify their robustness against backdoor attacks.

We evaluate our algorithms on multiple machine learning models including DNNs, differentially private DNNs, and 1-NN models, and provide the first collection of certified robustness bounds on a diverse range of datasets such as MNIST, CIFAR-10, ImageNet, and the tabular Spambase data as benchmarks. We hope that these experiments and benchmarks can provide future directions for improving model robustness.

As the first result on provable robustness against a poisoning attack, we have no doubt that these results will be improved by follow-up work in the near future. We make the code and evaluation protocol publicly available in the hope of facilitating future research by the community.

Our Contributions. In this paper, we make the following technical contributions.

• We propose a unified framework to certify model robustness against both evasion and poisoning attacks by generalizing the randomized smoothing strategy. We also prove the tightness of different smoothing strategies.

• We propose an exact, efficient smoothing algorithm for 1-NN models that does not require sampling random noise during training.

• We provide the first certifiable robustness bound with theoretical guarantees against backdoor poisoning attacks on general machine learning models, with smoothing noise sampled from different distributions.

• We analyze sufficient conditions for model robustness against poisoning attacks. We show that smoother models (e.g., differentially private models or models with a certain dropout ratio) can achieve higher certified robustness.

• We conduct extensive, reproducible, large-scale experiments and provide a benchmark on certified robustness against backdoor poisoning attacks for different machine learning models such as DNNs, differentially private DNNs, and 1-NN on diverse datasets including ImageNet. We implement three different poisoning attacks during evaluation. We make our models and code publicly available at https://github.com/AI-secure/Robustness-Against-Backdoor-Attacks.

2 Background

We provide an overview of state-of-the-art poisoning attacks, especially backdoor (or Trojan) attacks, including different criteria used to categorize them. We then introduce the randomized smoothing strategy, which has been used to improve model robustness against evasion attacks.

2.1 Backdoor (poisoning) attacks

A backdoor attack on an ML model such as a neural network aims to inject certain "backdoor" patterns during training and associate such patterns with a specific adversarial target (label). As a result, during testing time, any test instance with such a pattern would be misrecognized as the pre-selected adversarial target [20,21]. Models with injected backdoors are called backdoored or Trojan models; they are usually able to achieve similar performance as benign models on normal test data, making it very challenging to identify such maliciously trained backdoored models.

Different backdoor attacks have been developed to contaminate the machine learning training process. For instance, Gu et al. [20] demonstrate a "sticker"-based backdoor attack against traffic sign classifiers. They show that as long as the specific sticker pattern is added onto a stop sign, the backdoored model will always misrecognize it as a speed limit sign, while maintaining similar performance on other road signs.


There are several ways to categorize backdoor attacks. First, based on the adversarial target design, attacks can be characterized as single target attacks and all-to-all attacks. In a single target attack, the backdoor pattern will cause the poisoned classifier to always return a designated target label, such as classifying any road sign with the backdoor pattern as a speed limit sign. An all-to-all attack leverages the backdoor pattern to permute the classifier results. For instance, Gu et al. demonstrate an attack where a backdoor pattern can cause the poisoned model to change the label of digit i to (i + 1) (mod 10) [20].

Based on differences in the backdoor patterns, there are region based and blending backdoor attacks. In a region based attack, a specific region of the training instance is manipulated in a way subtle enough not to be noticed by humans [3, 20]. In particular, it has been shown that such a backdoor pattern can be as small as one or four pixels [22]. On the other hand, Chen et al. [21] show that by blending the whole instance with a certain pattern, such as a meaningful background or a fixed random pattern, it is also possible to generate effective backdoor instances to poison ML models. As a result, during test time, the model can be made to produce targeted incorrect predictions on test data carrying the blending backdoor pattern.

Given different targeted machine learning models, backdoor attacks can either focus on a single model such as a neural network [20,21] or on a distributed ML system such as federated learning [23].

In addition, based on whether the attacker manipulates the training data or directly the weights of the model, backdoor attacks can also be classified into data backdoor attacks [3,20,21] and model parameter backdoor attacks [24-26]. The data backdoor attack embeds specific backdoor patterns into the training data, while the model parameter backdoor attack directly manipulates the model parameters. The model parameter backdoor attack consists of three steps. First, the adversary generates a backdoor pattern using a gradient-based approach, which is the easiest way to backdoor a model; next, the adversary reverse-engineers some inputs from the model as training data; finally, the adversary adds the generated backdoor pattern to the synthesized input data and retrains a small part of the model. After retraining, the model will output the desired label for test data carrying the designed backdoor pattern. In this attack the adversary is only able to choose the backdoor mask, which controls the shape and location of the backdoor, while the exact backdoor pattern generated by the gradient-based approach is not under the adversary's control.

In this work, we mainly focus on certifying the robustness of a single model against general backdoor attacks, where the attacker is able to add either specific or uncontrollable random backdoor patterns. In particular, we will analyze both region based and blending attacks.

2.2 Improving learning robustness via randomized smoothing

The randomized smoothing technique has been studied by several works to improve learning robustness. Some provide heuristic approaches to defend against adversarial examples [27, 28], while others provide theoretical guarantees against Lp-bounded adversarial perturbations. In particular, Cohen et al. [29] have proposed a tight robustness guarantee in L2 norm with Gaussian noise smoothing.

At a high level, the randomized smoothing strategy [29] provides a way to certify the robustness of a smoothed classifier against adversarial examples at test time. First, a smoothed classifier is obtained by adding Gaussian noise ε ∼ N(0, σ²I) to each test instance. Then, a lower bound pA on the confidence of the top-1 class and an upper bound pB on the confidence of the top-2 class are obtained. The smoothed classifier is guaranteed to provide consistent predictions within a perturbation radius, which is a function of the smoothing noise level σ, pA and pB, for each test instance.
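To make this concrete, the following minimal sketch estimates a Gaussian-smoothed classifier by Monte Carlo sampling and computes the certified L2 radius σ/2 (Φ⁻¹(pA) − Φ⁻¹(pB)). It assumes a generic base_classifier(x) returning an integer label, and uses a simple Hoeffding bound in place of the exact binomial test of [29]; all parameter values are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def certify_smoothed(base_classifier, x, sigma=0.5, n_samples=1000,
                     alpha=0.001, num_classes=10):
    # Monte Carlo estimate of the smoothed classifier's class probabilities.
    counts = np.zeros(num_classes)
    for _ in range(n_samples):
        noise = np.random.normal(0.0, sigma, size=x.shape)
        counts[base_classifier(x + noise)] += 1
    probs = counts / n_samples

    c_A = int(np.argmax(probs))
    p_A_hat = probs[c_A]
    p_B_hat = np.max(np.delete(probs, c_A))

    # Hoeffding confidence bounds with error rate alpha.
    slack = np.sqrt(np.log(1.0 / alpha) / (2.0 * n_samples))
    p_A, p_B = p_A_hat - slack, p_B_hat + slack
    if p_A <= p_B:
        return "ABSTAIN", 0.0
    # Certified L2 radius of the smoothed prediction.
    radius = sigma / 2.0 * (norm.ppf(p_A) - norm.ppf(p_B))
    return c_A, radius
```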

However, all these approaches focus on robustness against evasion attacks only. In this work we aim to provide a general smoothing functional to certify robustness against both evasion and poisoning attacks. In particular, the current randomized smoothing strategy adds noise to smooth at the level of test instances, while our unified framework generalizes it to smooth at the level of classifiers. We then describe the additional challenges of certifying robustness against backdoor poisoning attacks, and provide theoretical robustness guarantees for different machine learning models and randomized smoothing noise distributions, as well as tightness results for the robustness bounds.


3 Threat Model

We first provide a detailed threat-model analysis for poisoning attacks, especially backdoor attacks, against general machine learning models including deep neural networks [18,20], and then specify the threat model considered in this paper and emphasize its generality. Concretely, the threat model consists of the definition of the adversary's goal, knowledge of the attacked system, and capability of manipulating or poisoning the training data, which together categorize the potential backdoor poisoning attack strategies.

3.1 Adversary’s goal

The goal of an adversary is to inject "backdoors" during the machine learning model training phase, so that predictions on new data will be modified in the testing phase. There are potentially two types of poisoning attacks based on the adversary's goal: availability attacks and integrity attacks. If the adversary aims to affect predictions indiscriminately, i.e., to cause a denial of service, the attack is called an availability attack [18]. On the other hand, if the adversary's goal is to cause specific mis-predictions during the test phase while preserving the predictions on other test instances, it is referred to as an integrity attack [20,25].

3.2 Adversary’s knowledge

Given the information that an adversary can access, poisoning attacks include white-box and black-box attacks. In the white-box scenario, the attacker is assumed to have knowledge of the training data Dtrain and the learning algorithm L, as well as the model parameters θ. This makes it possible for the attacker to synthesize optimal poisoning instances that poison different machine learning models with a low poisoning ratio [30, 31]. In the black-box scenario, the attacker has no knowledge of the training data or ML models, but is able to collect and contribute additional training data [20]. In this case, either a transferable poisoning attack or a model-agnostic poisoning attack can be performed [32,33].

3.3 Adversary’s capability

To conduct poisoning attacks, the adversary injects poisoning instances into the training set before the model is trained. The attacker's capability is usually limited by the number of poisoning instances that can be injected and the manipulation added to each poisoning instance. A lower poisoning ratio helps preserve the model performance on benign test data, and a smaller manipulation of each poisoning instance implies a lower chance of being detected. In addition, the adversary can generate either fixed or dynamic backdoor patterns [34] to make detection harder.

General backdoor based poisoning attack. In this paper, based on Kerckhoffs's principle [35], we aim to certify learning robustness against the strongest attacker, who performs an integrity attack with white-box access to the learning model. In particular, we focus on the general backdoor based poisoning attack, where the attacker is able to inject a desired "backdoor" pattern into certain training instances. We allow the attacker to design different backdoor patterns with their chosen poisoning ratio, backdoor magnitude, location, and position, and we also allow the attacker to generate different backdoors for different instances dynamically, ensuring robustness against dynamic backdoor attacks [34]. To the best of our knowledge, this is the first work to provide (1) certifiable robustness against backdoor attacks, and (2) certifiable robustness against general backdoors including dynamic [34] and uncontrollable backdoors (model parameter attacks [24,25]).

We specifically analyze a range of backdoor patterns, from small ones such as one-pixel and four-pixel patterns to large ones that cover the whole instance, such as the "blending attack" [24,25].


3.4 Formal definition

We now provide the general notational setup and formally define the threat model analysed in this work.

Notation. We refer to X as the input feature space¹, on which predictions of labels Y ⊆ N can be made. Furthermore, we write Dn := ∏_{i=1}^n (X × Y) for the space of training sets D = {(x1, y1), . . . , (xn, yn)} comprised of n feature vectors x ∈ X and labels y ∈ Y generated according to some unknown distribution PX,Y. For a random variable X, we write PX to denote the probability measure induced by X, and µX for the probability mass function in the discrete case or the probability density function, if X admits one, in the continuous case. For a set S we denote its probability by PX(S). Finally, we denote by SC the C-dimensional probability simplex SC = {p ∈ [0, 1]^C | ‖p‖1 = 1} and define classifiers to be general deterministic functions h : X × Dn → SC that output a vector of class probabilities, given a test instance x ∈ X and a training set D ∈ Dn. We obtain the prediction by taking the label y ∈ Y which has the highest probability. The set of classifiers is written as H(X × Dn, SC).

Backdoor attack and defense. Given a training set of labelled training instances D, the learner's task is to learn a classifier h : X → Y to label test instances. The task of the adversary is to poison the training set by replacing r training instances (xi, yi) with poisoned instances (xi + πi, y_i^A), where y_i^A is the designed adversarial target. Given the poisoned training set, the defender will train a classifier hA. Here we allow the set of r backdoors πj to be comprised of distinct patterns or just a single one. The goal of the adversary is that, during test time, for any test instance xtest, hA(xtest + πj) = y_j^A, where y_j^A represents the adversarial target, which could correspond to a single target or an all-to-all attack as mentioned above. The defender's goal is to certify the learning robustness of a smoothed classifier gh given a base classifier h, such that during test time the smoothed classifier always provides the same prediction for a test instance no matter what pattern is added to the training set: gh(xtest + π | D) = gh(xtest + π | D + π). We will show that the only condition the patterns πj need to satisfy is that their Lp-norm is bounded, and this bound is a function of the classifier smoothing noise and the confidence gap between the top two classes of the smoothed classifier. We prove that our robustness bound is tight, and we also show the different robustness performance of different types of classifiers, including neural networks, differentially private neural networks which are additionally smoothed with privacy noise, and KNN classifiers.

4 Unified Framework for Certified Robustness

In this section we present the proposed unified theoretical framework for certified robustness of general machine learning models against either evasion or poisoning attacks, based on the randomized smoothing strategy, which has been leveraged to certify robustness against evasion attacks [29]. We first define the basic notions of smoothing functionals, smoothed classifiers, and the confidence of a classifier at a test instance and training set. Given these definitions, we present our main theoretical result and conditions under which this result provides a tight robustness guarantee.

4.1 Preliminaries

We view base classifiers as the entire prediction process that consists of learning the model given a dataset and making a prediction at a test instance. The following definitions formalize the idea of smoothing such a base classifier by introducing noise into this prediction process. Specifically, we divide the smoothing process into two concepts: (1) the smoothing functional, which serves as the strategy governing the defense and allows modeling different attack scenarios, and (2) the ε-smoothed classifier, which is given by the expectation over the noise distribution ε used to smooth the classifier.

¹In practice we typically have X ⊆ R^d, but other sets, e.g. Z^d or {0, 1}^d, are also possible. We write X to preserve generality.


Definition 1 (Smoothing Functional). We define a smoothing functional to be a general function
\[
\mathcal{F} : \mathcal{H}(\mathcal{X} \times \mathcal{D}_n, S_C) \to \mathcal{H}(\mathcal{X} \times \mathcal{D}_n \times \mathcal{Z}, S_C) \tag{1}
\]
mapping base classifiers defined on X × Dn to classifiers with input space X × Dn × Z.

The above definition is of a rather general nature and only requires that F maps classifiers to classifiers. In essence, a smoothing functional acts on classifiers by introducing noise sampled from Z into the prediction process. We note that the resulting classifier F(h) does not yet have an explicit connection to the base classifier h. However, as we will see later, instantiations of this framework will establish a clear relation. The next definition integrates our notion of smoothing functionals with randomized smoothing classifiers.

Definition 2 (ε-Smoothed Classifier). Let F : H(X × Dn, SC) → H(X × Dn × Z, SC) be a smoothing functional, h : X × Dn → SC a base classifier and ε ∼ Pε a random variable taking values in Z. We define the associated ε-smoothed classifier as
\[
g^{\varepsilon}_{h}(x \mid D) := \mathbb{E}_{\varepsilon}\big(\mathcal{F}(h)(x, D, \varepsilon)\big). \tag{2}
\]

The smoothing functional above provides a general process for smoothing at the classifier level and, given a classifier, it can be used to produce the ε-smoothed classifier based on general noise distributions. The next definitions establish a more explicit relation between the ε-smoothed classifier gεh and the base classifier h for the cases where the smoothing functional transforms test instances (evasion attacks) and training datasets (poisoning attacks).
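Before specializing to those cases, the following minimal sketch shows how an ε-smoothed classifier (Definition 2) can be estimated by Monte Carlo sampling. Here smoothing_functional and sample_noise are hypothetical placeholders for F and a sampler of Pε; they are not part of the released implementation.

```python
import numpy as np

def smoothed_classifier(smoothing_functional, h, x, D, sample_noise, n_samples=1000):
    """Monte Carlo estimate of g_h^eps(x|D) = E_eps[ F(h)(x, D, eps) ]."""
    F_h = smoothing_functional(h)         # F maps the base classifier h to F(h)
    samples = [F_h(x, D, sample_noise())  # evaluate F(h)(x, D, eps) for each noise draw
               for _ in range(n_samples)]
    return np.mean(samples, axis=0)       # a class-probability vector in S_C
```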

Definition 3 (Test Time Smoothing). We define smoothing functionals for evasion attacks FE : H(X × Dn, SC) → H(X × Dn × Z, SC) to be of the form
\[
\mathcal{F}_{E}(h)(x, D, z) = h(\phi(x, z) \mid D), \tag{3}
\]
where φ : X × Z → X is a deterministic function that acts on test instances x ∈ X with a transformation parameter z ∈ Z.

Definition 4 (Training Time Smoothing). We define smoothing functionals for dataset poisoning attacks FP : H(X × Dn, SC) → H(X × Dn × Z, SC) to be of the form
\[
\mathcal{F}_{P}(h)(x, D, z) = h(x \mid \psi(D, z)), \tag{4}
\]
where ψ : Dn × Z → Dn is a function that acts on datasets D ∈ Dn with a transformation parameter z ∈ Z.

We see that these two types of smoothing strategies have different dynamics, allowing us to develop potential defenses against different types of threat models. The next type of smoothing functional unifies them and views smoothing as a simultaneous process during testing and training.

Definition 5 (Test and Training Smoothing). We define smoothing functionals for smoothing during test and training time FT : H(X × Dn, SC) → H(X × Dn × Z, SC) to be of the form
\[
\mathcal{F}_{T}(h)(x, D, z) = h(\phi(x, z) \mid \psi(D, z)), \tag{5}
\]
where φ : X × Z → X and ψ : Dn × Z → Dn are deterministic functions.

We see that the choice ψ(D, z) = D amounts to test time smoothing, while φ(x, z) = x results in training time smoothing. The following definition of (pA, pB)-confidence at a tuple of test instance and training set (x, D) extends the notion of (pA, pB)-confidence at a test instance defined in [36] to our more general setting.

Definition 6 ((pA, pB)-Confidence at (x, D)). Let x ∈ X and D ∈ Dn. Given a smoothing functional F, a base classifier h, cA ∈ Y and pA, pB ∈ [0, 1], we say that the ε-smoothed classifier gεh is (pA, pB)-confident at (x, D) if
\[
g^{\varepsilon}_{h}(x \mid D)_{c_A} \;\ge\; p_A \;\ge\; p_B \;\ge\; \max_{c \ne c_A} g^{\varepsilon}_{h}(x \mid D)_{c}. \tag{6}
\]


Definition 7 (Lower Level Sets). Let ε ∼ Pε, δ ∼ Pδ be Z-valued random variables. For t ≥ 0, we define the strict lower and lower level sets as
\[
\underline{S}_t := \left\{ z \in \mathcal{Z} \;\middle|\; \frac{\mu_{\delta}(z)}{\mu_{\varepsilon}(z)} < t \right\}
\quad\text{and}\quad
S_t := \left\{ z \in \mathcal{Z} \;\middle|\; \frac{\mu_{\delta}(z)}{\mu_{\varepsilon}(z)} \le t \right\}. \tag{7}
\]

This framework generalizes what is presented in [29] and [36] from input transformations to the more general notion of functionals that act on classifiers. This allows us to analyze and derive provable robustness against more general attack models.

For instance, consider the setting where an attacker adds a carefully chosen adversarial perturbation π0 to a test instance xtest with the goal of manipulating the prediction. A viable defense strategy would be to choose the input transform φ(x, z) = x + z and use Gaussian noise in combination with the smoothing functional FE, resulting in the ε-smoothed classifier gεh(x|D) = Eε(h(x + ε|D)). With this instantiation of the proposed unified framework, we can recover the results presented in previous work [29]. In the next section, we provide a general robustness condition which allows us to certify model robustness within this unified framework, and in Section 5 we show how one can use it in order to obtain provable robustness against backdoor attacks.

4.2 A General Condition for Provable Robustness

We will now present our result for obtaining provable robustness. Our main Theorem 1 is a more general version of the robustness condition derived in [36], and it allows us to analyze a substantially larger class of threat models. In particular, we extend the guarantee from test-time smoothing to smoothing functionals that act on classifiers, which allows us to obtain provable robustness against attacks on the entire prediction process (including learning and inference) rather than only on a test sample.

Our theorem is based on the following intuition. Suppose that the ε-smoothed classifier gεh predicts a test instance xtest, given a training set Dtrain, to be of class cA with probability at least pA, and the second most likely class with probability smaller than pB. Theorem 1 tells us that under this assumption, if we were to use a different noise distribution δ for smoothing, then the δ-smoothed classifier gδh is guaranteed to also predict xtest to be of class cA given the same training set Dtrain, as long as δ and ε satisfy a condition that depends only on the confidence levels pA and pB. We emphasize that no explicit relation between ε and δ is needed for the theorem to hold.

Theorem 1. Let ε and δ be Z-valued random variables, xtest ∈ X a test instance, Dtrain ∈ Dn a dataset, F : H(R^d × Dn, SC) → H(R^d × Dn × Z, SC) a smoothing functional and h ∈ H(X × Dn, SC) a base classifier. Suppose that the ε-smoothed classifier is (pA, pB)-confident at (xtest, Dtrain) for some cA ∈ Y, that is,
\[
g^{\varepsilon}_{h}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c_A} \ge p_A \ge p_B \ge \max_{c \ne c_A} g^{\varepsilon}_{h}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c}. \tag{8}
\]
Let ζ : R≥0 → [0, 1] be the function defined by
\[
\zeta(t) := P_{\varepsilon}(S_t) \tag{9}
\]
and let ζ⁻¹(p) := inf{t : ζ(t) ≥ p} be its generalized inverse. For t ≥ 0 and p ∈ [0, 1], let
\[
\mathcal{S}_{t,p} := \{ S \subseteq \mathcal{Z} \mid \underline{S}_t \subseteq S \subseteq S_t \;\wedge\; P_{\varepsilon}(S) \le p \} \tag{10}
\]
and define the function ξ : R≥0 × [0, 1] → [0, 1] by
\[
\xi(t, p) := \sup\{ P_{\delta}(S) \mid S \in \mathcal{S}_{t,p} \}. \tag{11}
\]
If δ satisfies
\[
1 - \xi\big(\zeta^{-1}(1 - p_B),\, 1 - p_B\big) < \xi\big(\zeta^{-1}(p_A),\, p_A\big), \tag{12}
\]
then
\[
g^{\delta}_{h}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c_A} > \max_{c \ne c_A} g^{\delta}_{h}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c}. \tag{13}
\]


While the generality of this statement allows us to model a wide range of threat models, it bears the challenge of how one should instantiate the theorem so that it is applicable for defending against a specific adversarial attack. In addition to the flexibility with regard to the underlying threat model, we also have flexibility with regard to the smoothing distributions, resulting in different robustness guarantees. This again raises the question of which smoothing distributions result in useful robustness bounds. In the following, we show how this theorem can be used to obtain provable robustness against dataset poisoning and in particular backdoor attacks. We refer the reader to Appendix C for a detailed proof of this result. The next proposition shows that ε and δ having non-disjoint support is a necessary condition for δ to satisfy (12).

Proposition 1. If the support of δ is disjoint from the support of ε, then δ can not satisfy (12).

The next theorem shows that our result is tight when the function ζ satisfies certain regularity conditions. It tells us that, whenever the distribution δ violates (12) and all we know about the smoothed classifier are its class probabilities, there will always be a base classifier for which the corresponding δ-smoothed classifier makes the wrong prediction.

Theorem 2. Let 1 ≥ pA ≥ pB ≥ 0 be such that pA + pB ≤ 1. Let ε and δ be Z-valued random variables with non-disjoint support and such that ζ(0) = 0, ζ is strictly increasing and continuous, and
\[
1 - \xi\big(\zeta^{-1}(1 - p_B),\, 1 - p_B\big) > \xi\big(\zeta^{-1}(p_A),\, p_A\big). \tag{14}
\]
Consider the smoothing functional FT defined by
\[
\mathcal{F}_{T}(h)(x, D, z) = h(\phi(x, z) \mid \psi(D, z)), \tag{15}
\]
where φ : X × Z → X and ψ : Dn × Z → Dn are deterministic functions. Let xtest ∈ X and Dtrain ∈ Dn. Then there exists a base classifier h∗ such that the ε-smoothed classifier
\[
g^{\varepsilon}_{h^*}(x \mid D) = \mathbb{E}_{\varepsilon}\big( h^*(\phi(x, \varepsilon) \mid \psi(D, \varepsilon)) \big) \tag{16}
\]
is (pA, pB)-confident at (xtest, Dtrain) and
\[
g^{\delta}_{h^*}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c_A} < \max_{c \ne c_A} g^{\delta}_{h^*}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c}. \tag{17}
\]

The following proposition shows that Gaussian smoothing provides a tight robustness bound.

Proposition 2. Suppose Z = ∏_{i=1}^n R^m. For i = 1, . . . , n let εi ∼iid N(mε, Σ) and δi ∼iid N(mδ, Σ) with mδ ≠ mε and covariance matrix Σ. Then condition (12) provides a tight robustness guarantee in the sense of Theorem 2.

As we will see in the next section, Gaussian smoothing results in an L2-robustness radius. In this case, Theorem 2 has the following, more intuitive, interpretation: if we only know the class probabilities associated with a smoothed classifier, it is impossible to certify a larger radius.

5 Certified Robustness Against Backdoor Poisoning Attacks

Theorem 1 is rather abstract, and certifying robustness against backdoor attacks with it is not straightforward. In this section, we aim to answer the question: How can we instantiate this general result to obtain robustness guarantees against backdoor attacks? In addition, due to its generality, we can derive robustness bounds for smoothing with (1) isotropic Gaussian noise and (2) uniform noise. These two distributions exhibit different dynamics and thus lead to distinct robustness bounds.


As a first step, we outline the intuition governing our approach to certifying robustness against backdoor attacks. Suppose that we are given a base classifier which was trained on a poisoned dataset that contains r inputs infected with the Trojan pattern π0. Suppose we know that the smoothed (poisoned) classifier is confident in predicting a malicious input xtest + π0 to be of a given class yA. Our goal is to derive a condition on π0 such that the prediction for xtest + π0 is the same as the prediction that a smoothed classifier would have made had it been trained on a dataset that is not infected with the backdoor pattern π0. In other words, we obtain the guarantee that an attacker cannot systematically lead test instances carrying the backdoor pattern to the adversarial target: they will always obtain the same prediction as long as the added pattern π0 satisfies certain conditions (bounded magnitude). The intention of the attacker, carrying out a backdoor attack, has thus failed. The next corollary justifies this intuition and embeds it into our theoretical framework.

For the sequel, suppose that xtest ∈ X is a test instance, Dtrain ∈ Dn a benign training set, and π ∈ ∏_{i=1}^n R^d a set of n backdoors. Furthermore, let
\[
\overline{D}_{\mathrm{train}} = \{(x_i,\, y_i \oplus_C y_i') \mid (x_i, y_i) \in D_{\mathrm{train}}\} \tag{18}
\]
denote the benign training set that contains flipped labels, where the notation ⊕C denotes addition modulo C.

Corollary 1. Let Z = ∏_{i=1}^n R^d and consider the dataset transform ψ : Dn × Z → Dn defined by
\[
\psi(D, z) = \{(x_i + z_i,\, y_i)\}_{i=1}^{n}. \tag{19}
\]
Let ρ : Z → X be any deterministic function and let FT be the smoothing functional defined by
\[
\mathcal{F}_{T}(h)(x, D, z) = h(x + \rho(z) \mid \psi(D, z)). \tag{20}
\]
Let {Zi}_{i=1}^n be a collection of n iid random variables, Z := (Z1, . . . , Zn), and consider the (π + Z)-smoothed classifier given by
\[
g^{\pi+Z}_{h^*}(x \mid D) = \mathbb{E}\big( h^*(x + \rho(\pi + Z) \mid \psi(D, \pi + Z)) \big). \tag{21}
\]
Let π0 ∈ R^d be a backdoor pattern and suppose that g^{π+Z}_{h∗} is (pA, pB)-confident at (xtest + π0, D̄train), i.e.,
\[
g^{\pi+Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c_A} \ge p_A \ge p_B \ge \max_{c \ne c_A} g^{\pi+Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c} \tag{22}
\]
for some cA ∈ Y. If condition (12) holds for the random variables ε := π + Z and δ := Z, then
\[
g^{Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c_A} > \max_{c \ne c_A} g^{Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c}. \tag{23}
\]

The next two corollaries instantiate Corollary 1 with Gaussian and uniform noise distributions. In both cases, we get a robustness guarantee in Lp-norm telling us that, whenever the patterns π are within a given region, the smoothed classifier trained with the poisoned dataset D̄train + π will make the same prediction as the classifier trained with the benign dataset D̄train that contains flipped labels.

Corollary 2 (Gaussian Smoothing). Consider the setting in Corollary 1 and suppose that Zi ∼iid N(0, σ²1d) for all i. If g^{π+Z}_{h∗} is (pA, pB)-confident at (xtest + π0, D̄train) and π := (π1, . . . , πn) satisfies
\[
\sqrt{\sum_{i=1}^{n} \|\pi_i\|_2^2} \;<\; \frac{\sigma}{2}\big(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\big), \tag{24}
\]
then
\[
g^{Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c_A} > \max_{c \ne c_A} g^{Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c}. \tag{25}
\]


In the special case where an attacker adds the same pattern π0 to r training instances, condition (24) reduces to
\[
\|\pi_0\|_2 \;<\; \frac{\sigma}{2\sqrt{r}}\big(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\big). \tag{26}
\]
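As a quick numerical illustration of condition (26), the following snippet evaluates the certified radius for assumed values of σ, r, pA and pB (chosen for illustration only, not taken from the experiments).

```python
from scipy.stats import norm

sigma, r = 2.0, 100              # smoothing scale, number of poisoned instances
p_A_lower, p_B_upper = 0.9, 0.1  # assumed confidence bounds

radius = sigma / (2 * r ** 0.5) * (norm.ppf(p_A_lower) - norm.ppf(p_B_upper))
print(radius)  # ~0.256: patterns with ||pi_0||_2 below this value are certified
```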

Corollary 3 (Uniform Smoothing). Consider the setting in Corollary 1 and suppose that Zi ∼iid U([a, b]^d) for all i, with finite a < b. If g^{π+Z}_{h∗} is (pA, pB)-confident at (xtest + π0, D̄train) and π = (π1, . . . , πn) satisfies
\[
1 - \frac{p_A - p_B}{2} \;<\; \prod_{i=1}^{n} \prod_{j=1}^{d} \left(1 - \frac{|\pi_{i,j}|}{b - a}\right)_{\!+}, \tag{27}
\]
then
\[
g^{Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c_A} > \max_{c \ne c_A} g^{Z}_{h^*}(x_{\mathrm{test}} + \pi_0 \mid \overline{D}_{\mathrm{train}})_{c}. \tag{28}
\]

Similar to the Gaussian case, if an attacker adds the same pattern π0 to r training instances, condition (27) reduces to
\[
1 - \frac{p_A - p_B}{2} \;<\; \prod_{j=1}^{d} \left(1 - \frac{|\pi_{0,j}|}{b - a}\right)_{\!+}^{\,r}. \tag{29}
\]

The next corollary shows that we can use uniform smoothing in order to obtain provable robustness with respect to the L∞-norm.

Corollary 4 (Uniform Smoothing with L∞-bound). Consider the setting in Corollary 3. Let π0 ∈ R^d be a backdoor pattern and suppose that π contains π0 exactly r times and the 0-vector n − r times. Then condition (27) is satisfied if π0 satisfies
\[
\|\pi_0\|_\infty \;<\; (b - a)\left(1 - \left(1 - \frac{p_A - p_B}{2}\right)^{\frac{1}{d \cdot r}}\right). \tag{30}
\]
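For comparison, the L∞ bound (30) can be evaluated in the same way; the values of the noise range, input dimension d, r, pA and pB below are again assumptions chosen for illustration.

```python
b_minus_a, d, r = 2.0, 784, 100  # assumed noise range, input dimension, poisoned count
p_A_lower, p_B_upper = 0.9, 0.1

linf_radius = b_minus_a * (1.0 - (1.0 - (p_A_lower - p_B_upper) / 2.0) ** (1.0 / (d * r)))
print(linf_radius)  # ~1.3e-5: the radius shrinks quickly as d and r grow
```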

One crucial optimization to achieve better certified accuracy is to introduce the deterministic function ρ in the smoothing functional (20). The intuition is as follows. When we smooth a classifier over the training process by perturbing the training set with small additive noise, the base classifier has only seen perturbed examples. This can be problematic when classifying clean images. For example, Gaussian noise places almost no mass near its mode in high dimensions. This results in training images that come from a distribution with virtually disjoint support from natural images, and thus such a model may fail to learn to classify clean samples. The (deterministic) function ρ thus provides us with a way to perturb test instances with noise and thereby increase model accuracy. One way to instantiate the smoothing functional is to simply define ρ to have a constant value ρ(Z) ≡ ρ0. Another possibility is to introduce noise by choosing an average over Z1, . . . , Zn. Finally, setting ρ(Z) to depend on a hash of the trained model also introduces randomness and can lead to better prediction accuracy.
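A minimal sketch of the hash-based choice of ρ is given below: the test-time noise is Gaussian, but its random seed is derived from the SHA-256 hash of the trained model file, so the whole prediction process remains a deterministic function of the (smoothed) training set. The file path and helper name are assumptions for illustration.

```python
import hashlib
import numpy as np

def hash_seeded_noise(model_path, shape, sigma):
    # Seed the RNG with the SHA-256 digest of the trained model file.
    digest = hashlib.sha256(open(model_path, "rb").read()).digest()
    seed = int.from_bytes(digest[:8], "little")
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=shape)   # u ~ N(0, sigma^2 I)

# Example: u_k = hash_seeded_noise("model_k.pt", x_test.shape, sigma)
# followed by the prediction f_k(x_test + u_k).
```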

6 Instantiating the General Framework with Specific ML Models

Based on the certified robustness against backdoor attacks obtained by smoothing the training process, we show that it is theoretically possible to achieve robust predictions for general machine learning models even when the training data is poisoned. In this section, we analyze three types of machine learning models: deep neural networks, differentially private deep neural networks, and simple models such as K-nearest neighbor classifiers (K=1). First, since backdoor poisoning attacks have been shown to be most successful against deep neural networks, which has attracted a lot of attention, we mainly evaluate and certify the robustness of different deep neural network models on diverse datasets. Second, we aim to understand more about the properties of models that could potentially lead to learning robustness.


Figure 2: An illustration of the RAB robust training process. Given a poisoned training set D and a training process A vulnerable to backdoor attacks, RAB generates N smoothed training sets {Di}_{i∈[N]} and trains N different classifiers A_{Di}. For a binary classification problem, given a test example x, RAB outputs I[1/N Σi A_{Di}(x) ≥ pA] as the output prediction.

Algorithm 1 DNN-RAB
Input: (Poisoned) training dataset D = {(xi, yi)}_{i=1}^n, test instance xtest, smoothing noise scale σ, number of models N, error rate α.
/* Train models on smoothed datasets. */
for k = 1, . . . , N do
    Sample z1, . . . , zn ∼ N(0, σ²I).
    D_k^smooth = {(xi + zi, yi)}_{i=1}^n.
    fk = train_model(D_k^smooth).
    uk = noise_sample(σ, hash(fk)).
end for
/* Calculate the empirical estimates of pA and pB. */
g = (1/N) Σ_{k=1}^N fk(xtest + uk).
pA, cA = max_c g_c, arg max_c g_c.
pB = max_{c ≠ cA} g_c.
/* Calculate the lower bound of pA and the upper bound of pB. */
pA_lower, pB_upper = calculate_bound(pA, pB, N, α).
if pA_lower > pB_upper then
    Return: prediction cA, radius (σ/2)(Φ⁻¹(pA_lower) − Φ⁻¹(pB_upper)).
else
    Return: ABSTAIN.
end if

Therefore, we evaluate the certified robustness of differentially private models, which are already "smoothed" on the given training data in order to minimize the sensitivity of the learned model. Our hypothesis is that such smoothed models achieve higher certified robustness against poisoning attacks under our RAB certifiable training process. In addition, it is of great interest to know the robustness of other machine learning models, such as KNN models, given that they have been widely applied in different applications, either on raw data or on trained embeddings. Specifically, we are inspired by a recent result developed by the database community [19] and apply similar techniques to develop an efficient smoothing algorithm for KNN models (when K=1), such that we do not need to draw a large number of random samples from the smoothing distribution ε for these models.


6.1 Deep Neural Networks

Our pipeline for RAB on deep neural networks is shown in Algorithm 1. Our goal is to compute the prediction of g^{π+Z}_{h∗} as in Corollary 2 and the corresponding certified bound given by the right-hand side of Eqn. (24). In the most intuitive form, we first train N smoothed models h1, . . . , hN by sampling different smoothing noises Z to smooth the training dataset and train the models. Then, given the test instance xtest, the prediction of each model hk(xtest) is an unbiased estimate of g^{π+Z}_{h∗}, so we can use the average g = (1/N) Σ_k hk(xtest) as the empirical estimate of g^{π+Z}_{h∗}, and we calculate the prediction cA and the estimated probabilities pA and pB as shown in the algorithm. Finally, we use Hoeffding's inequality to compute a lower bound on pA and an upper bound on pB with error tolerance α. In particular, by computing
\[
\underline{p_A} = p_A - \sqrt{\frac{\log(1/\alpha)}{2N}} \qquad\text{and}\qquad \overline{p_B} = p_B + \sqrt{\frac{\log(1/\alpha)}{2N}},
\]
we can calculate the certified radius using these bounds.

However, directly making predictions using hk(xtest) does not yield good results in practice. The reason, we find, is that when the classifier hk is trained on a smoothed dataset {(xi + zi, yi)}, it usually performs better on noisy inputs (i.e. hk(x + z)) than directly on clean inputs (i.e. hk(x)), since the model has only seen the data distribution of x + z, which can have different support from the distribution of x. As a result, we propose a simple yet effective heuristic that maps test instances to nearby positions as well: we add noise with the same distribution as the zi's during the test phase of the certification. However, our theoretical analysis requires that the model is fully determined by the smoothed dataset and contains no additional randomness. To achieve this, we change the training process so that our model hk consists of two parts: a trained model fk and a sampled noise uk ∼ N(0, σ²I). In particular, we first train fk on the smoothed dataset and then use the hash value of the trained parameters of fk as the random seed to sample uk ∼ N(0, σ²I); in practice we use the SHA-256 hash [37] of the trained model file. The output of the model is hk(x) = fk(x + uk). This way, hk is fully determined by the training dataset while at the same time achieving better prediction performance in practice.
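A sketch of the aggregation and certification step just described is shown below; it assumes model_outputs is an (N, C) array whose k-th row is the probability vector fk(xtest + uk), and applies the Hoeffding bound and the Gaussian radius from Corollary 2.

```python
import numpy as np
from scipy.stats import norm

def rab_certify(model_outputs, sigma, alpha=0.001):
    N = model_outputs.shape[0]
    g = model_outputs.mean(axis=0)        # empirical estimate of the smoothed classifier
    c_A = int(np.argmax(g))
    p_A_hat = g[c_A]
    p_B_hat = np.max(np.delete(g, c_A))

    # Hoeffding bounds with error rate alpha.
    slack = np.sqrt(np.log(1.0 / alpha) / (2.0 * N))
    p_A, p_B = p_A_hat - slack, p_B_hat + slack
    if p_A <= p_B:
        return "ABSTAIN", 0.0
    radius = sigma / 2.0 * (norm.ppf(p_A) - norm.ppf(p_B))
    return c_A, radius
```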

6.2 Differentially Private Models

Based on the intuition that the "smoothed" model would be more robust against attacks, in this section we would like to provide a way to verify this assumption and further provide guidance on how to improve the certified robustness of machine learning models against poisoning attacks.

In the proposed RAB, we directly smooth the training process by adding a different smoothing noise εi each time and then training a corresponding model, as shown in Figure 2. The j-th trained model can be represented as
\[
f_j = \arg\min_\theta \sum_i \ell\big(f(\theta, x_i + z_i), y_i\big).
\]

To estimate the effect of model smoothness on the certified robustness bound, we augment the model training process with differentially private stochastic gradient descent (DP-SGD) [38]. During each iteration of training the model fj, we clip each gradient to a predefined norm bound and add noise to it, reducing its dependency on specific training data. Intuitively, this process adds another level of "smoothness" to the model training and hopefully yields a larger gap between pA and pB, and therefore higher certified robustness. Note that DP-SGD is just one way to provide a smoothed model training process and thereby improve model robustness; other smoothing strategies can be plugged into the proposed RAB framework to enhance learning robustness against backdoor poisoning attacks.
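A minimal sketch of one DP-SGD update in the spirit of [38] is shown below: per-example gradients are clipped to norm C and Gaussian noise of scale noise_sigma · C is added before averaging. The helper per_example_grad(theta, x, y), returning the gradient of the loss for a single example as a flat array, is a hypothetical placeholder, not the released training code.

```python
import numpy as np

def dp_sgd_step(theta, batch, per_example_grad, lr=0.1, C=5.0, noise_sigma=4.0):
    grads = []
    for x, y in batch:
        g = per_example_grad(theta, x, y)
        g = g * min(1.0, C / (np.linalg.norm(g) + 1e-12))  # clip each gradient to norm C
        grads.append(g)
    noise = np.random.normal(0.0, noise_sigma * C, size=theta.shape)
    return theta - lr * (np.sum(grads, axis=0) + noise) / len(batch)
```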

6.3 1-Nearest Neighbors

If the base classifier h is given by a 1-nearest neighbor classifier, we can evaluate the corresponding smoothed classifier analytically and thus compute the confidence levels pA and pB exactly, as opposed to the approximation needed for DNN classifiers. In this section, we show an example of how this computation can be performed efficiently for 1-NN with Gaussian noise [19].

In order to measure similarity between feature vectors x ∈ X, we use a quantized Euclidean distance. Specifically, consider a partition of R≥0 into L buckets Il := [b_{l−1}, b_l), where b_0 = 0 < b_1 < . . . < b_{L−1} < b_L = ∞. Associated with each bucket l is a level of similarity βl such that β1 > β2 > . . . > βL. We define the similarity associated with these quantization levels and buckets as
\[
s_L(x, x') := \sum_{l=1}^{L} \beta_l \cdot \mathbb{1}\{\|x - x'\|_2^2 \in I_l\}. \tag{31}
\]

Given a test example xtest and a training set Dtrain ∈ Dn, a 1-NN classifier returns the class of the training instance xi which is most similar to xtest in terms of the similarity measure sL. Formally, let i⋆(xtest, Dtrain) denote the index of the training instance in Dtrain most similar to xtest, defined as i⋆(xtest, Dtrain) = arg max_{i=1,...,n} sL(xi, xtest), and let y(i) = yi denote the class of a given training instance xi ∈ Dtrain.² Then
\[
h_1(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c} = \mathbb{1}\{y(i^\star(x_{\mathrm{test}}, D_{\mathrm{train}})) = c\} \tag{32}
\]

for c ∈ Y. The next proposition gives a closed-form solution for a 1-NN classifier smoothed with Gaussian noise.

Proposition 3. Let Zi ∼iid N(0, σ²1) for i = 1, . . . , n and let Z = (Z1, . . . , Zn). Consider the smoothing functional
\[
\mathcal{F}_{P}(h)(x, D, z) = h(x \mid \psi(D, z)) \tag{33}
\]
where ψ(D, z) = {(xi + zi, yi) | (xi, yi) ∈ D}. For backdoor patterns π ∈ ∏_{i=1}^n R^d, we can evaluate the (π + Z)-smoothed classifier g^{π+Z}_{h1} according to
\[
g^{\pi+Z}_{h_1}(x_{\mathrm{test}} \mid D_{\mathrm{train}})_{c} = \sum_{i\,:\,y_i = c} \;\sum_{l=1}^{L-1} p_{il} \cdot \left( \prod_{j=1}^{i-1} \sum_{r=l+1}^{L} p_{jr} \right) \cdot \left( \prod_{j=i+1}^{n} \sum_{r=l}^{L} p_{jr} \right) \tag{34}
\]
where
\[
p_{il} = F_{d,\lambda_i}\!\left(\frac{b_l}{\sigma^2}\right) - F_{d,\lambda_i}\!\left(\frac{b_{l-1}}{\sigma^2}\right) \tag{35}
\]
and F_{d,λi} denotes the CDF of the non-central χ²-distribution with d degrees of freedom and non-centrality parameter λi = ‖xi + πi − xtest‖²₂ / σ².

Algorithm 3 illustrates an efficient algorithm based on this result, which has a complexity of O(n · L).
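The bucket probabilities p_il of Eq. (35), which form the first loop of Algorithm 3, can be computed directly with the non-central χ² CDF available in scipy; the sketch below assumes the bucket boundaries b_1 < . . . < b_{L−1} are given (with b_0 = 0 and b_L = ∞ implicit).

```python
import numpy as np
from scipy.stats import ncx2

def bucket_probs(X_train, x_test, boundaries, sigma):
    """Return the (n, L) matrix of probabilities p_il from Eq. (35)."""
    n, d = X_train.shape
    edges = np.concatenate(([0.0], np.asarray(boundaries, dtype=float), [np.inf]))  # b_0..b_L
    P = np.zeros((n, len(edges) - 1))
    for i in range(n):
        lam = np.sum((X_train[i] - x_test) ** 2) / sigma ** 2   # non-centrality lambda_i
        cdf = ncx2.cdf(edges / sigma ** 2, df=d, nc=lam)        # F_{d,lambda_i}(b_l / sigma^2)
        P[i] = np.diff(cdf)                                     # p_il = F(b_l/s^2) - F(b_{l-1}/s^2)
    return P
```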

7 Experimental Results

In this section, we conduct extensive experiments to provide a benchmark for the certified robustness bounds of different types of machine learning models on diverse datasets. In particular, we evaluate DNNs, differentially private DNNs, and 1-NN on MNIST, CIFAR-10, and ImageNet, and we consider three different types of backdoor patterns: one-pixel, four-pixel, and blending attacks. Furthermore, we also evaluate the 1-NN model on the tabular Spambase dataset to assess the advantages of our efficient 1-NN algorithm.

²Since we use a quantized Euclidean distance, we need to consider the case where multiple instances have equal similarity. In such a case, we break ties by choosing the instance with the lowest index.


Algorithm 2 COMPUTEPROB
Input: Probabilities {p_il}_{i,l}.
Output: Probabilities q_i = P(i⋆(xtest, ψ(Dtrain, Z)) = i) for each i.
/* Compute p^sum_{jl} = Σ_{r=1}^{l} p_{jr}. */
for j = 1, . . . , n do
    p^sum_{j0} ← 0
    for l = 1, . . . , L do
        p^sum_{jl} ← p^sum_{j(l−1)} + p_{jl}
    end for
end for
/* Compute p^prod_{il} = ∏_{j=1}^{i} (p^sum_{jL} − p^sum_{j(l−1)}) = ∏_{j=1}^{i} Σ_{r=l}^{L} p_{jr}. */
for l = 1, . . . , L do
    p^prod_{0l} ← 1
    for i = 1, . . . , n do
        p^prod_{il} ← p^prod_{(i−1)l} · (p^sum_{iL} − p^sum_{i(l−1)})
    end for
end for
/* Process boundary cases of p^prod_{il}. */
p^prod_{0(L+1)} ← 1
p^prod_{i(L+1)} ← 0   ∀ i > 0
/* Accumulate q_i = Σ_{l=1}^{L} p_{il} · p^prod_{(i−1)(l+1)} · p^prod_{nl} / p^prod_{il}. */
q ← zeros(n)
for i = 1, . . . , n do
    for l = 1, . . . , L do
        q[i] ← q[i] + p_{il} · p^prod_{(i−1)(l+1)} · p^prod_{nl} / p^prod_{il}
    end for
end for
return q

7.1 Experiment Setup

We provide the first benchmark of certified robustness bounds against backdoor attacks for different ML models on the MNIST [39], CIFAR-10 [40] and ImageNet [41] datasets. Following the backdoor attack setting in [21,22], we train a DNN model to classify between a (source, target) label pair on each task, and the attacker's goal is to inject a backdoor pattern during training so that a source input with the pattern will be classified as the adversarial target. We choose two pairs for each task: (0, 1) and (8, 6) for MNIST, (airplane, bird) and (automobile, dog) for CIFAR, and (dog, cat) and (dog, fish) for ImageNet. The training set sizes are 12665 and 11769 on MNIST, 10000 and 10000 on CIFAR, and 20000 and 10000 on ImageNet, respectively. We use the architecture in [42] on MNIST and the architecture in [29] on CIFAR-10. For ImageNet, we use the standard ResNet-20 architecture.

We evaluate three representative backdoor patterns: adding a specific one-pixel pattern in the middle of the image, adding a four-pixel pattern, and blending a noise pattern over the entire image. Each backdoor pattern is generated such that its perturbation is bounded in L2 norm by ‖π_i‖2. For each attack, we add a fraction r_p of backdoor training instances whose ground truth is the source class, aiming to mislead the prediction of test instances carrying the pattern towards the target label.
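A rough sketch of how such poisoned training sets could be constructed is given below. The function names, the (H, W, C) image layout, the rescaling to a target L2 norm, and the relabeling of poisoned copies to the target class (the usual targeted-backdoor convention, which may differ from the paper's exact labeling) are all our own assumptions.

```python
import numpy as np

def make_pattern(shape, kind, l2_norm, rng):
    """Build a backdoor perturbation pi with a prescribed L2 norm.

    kind: 'one_pixel' (single pixel in the center), 'four_pixel' (a 2x2 block
          in the center), or 'blending' (dense random noise over the image).
    """
    H, W, C = shape
    pattern = np.zeros(shape, dtype=np.float32)
    if kind == 'one_pixel':
        pattern[H // 2, W // 2, :] = 1.0
    elif kind == 'four_pixel':
        pattern[H // 2:H // 2 + 2, W // 2:W // 2 + 2, :] = 1.0
    else:  # 'blending'
        pattern = rng.standard_normal(shape).astype(np.float32)
    return pattern * (l2_norm / np.linalg.norm(pattern))  # enforce ||pi||_2

def poison(X, y, source, target, pattern, r_p, rng):
    """Append a fraction r_p of trigger-stamped source-class images, relabeled to target."""
    src_idx = np.where(y == source)[0]
    k = min(int(r_p * len(y)), len(src_idx))
    idx = rng.choice(src_idx, size=k, replace=False)
    X_bd = X[idx] + pattern            # stamp the trigger
    y_bd = np.full(k, target)          # targeted-backdoor labeling (our assumption)
    return np.concatenate([X, X_bd]), np.concatenate([y, y_bd])
```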

In order to estimate the bounds of pA and pB , we train N = 1000 smoothed models on MNIST and CIFAR,


Algorithm 3 1NN-RAB
Input: (poisoned) training dataset D_train = {(x_i, y_i)}_{i=1}^n, test instance x_test, noise parameter σ.
  /* Compute p_il. */
  P ← zeros(n × L)
  for i = 1, . . . , n do
    λ_i ← ‖x_i − x_test‖²₂ / σ²
    for l = 1, . . . , L do
      P_{il} ← F_{d,λ_i}(b_l / σ²) − F_{d,λ_i}(b_{l−1} / σ²)
    end for
  end for
  /* Compute p_A and p_B. */
  q ← COMPUTEPROB(P)
  class_probs_c ← Σ_{i : y_i=c} q_i for each class c
  c_A ← arg max_c class_probs_c
  p_A, p_B ← top-2 values of class_probs
  if p_A > p_B then
    return prediction c_A and radius (σ/2) · (Φ⁻¹(p_A) − Φ⁻¹(p_B))
  else
    return ABSTAIN
  end if

and N = 200 smoothed models on ImageNet according to Algorithm 1. We then use Hoeffding's inequality to approximate the bounds with error rate α = 0.001. We use smoothing parameters σ = 1.0, 2.0 on MNIST and σ = 0.5, 1.0 on CIFAR-10 and ImageNet.
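As an illustration of this estimation step, the sketch below turns the N per-model votes into (p_A, p_B) bounds via a Hoeffding-style correction and then applies the Gaussian radius of Corollary 2. The helper and the exact form of the confidence correction are our own assumptions and may differ in constants from the procedure used in the experiments.

```python
import numpy as np
from scipy.stats import norm

def certify_from_votes(votes, num_classes, sigma, alpha=0.001):
    """Estimate p_A, p_B from N smoothed-model votes and compute a certified radius.

    votes: (N,) array of predicted labels from the N trained models.
    Uses a Hoeffding-style one-sided correction eps = sqrt(log(1/alpha) / (2N)).
    """
    N = len(votes)
    counts = np.bincount(votes, minlength=num_classes)
    order = np.argsort(counts)[::-1]
    c_A, c_B = order[0], order[1]
    eps = np.sqrt(np.log(1.0 / alpha) / (2.0 * N))
    p_A = np.clip(counts[c_A] / N - eps, 1e-6, 1 - 1e-6)  # lower bound on the top class
    p_B = np.clip(counts[c_B] / N + eps, 1e-6, 1 - 1e-6)  # upper bound on the runner-up
    if p_A <= p_B:
        return None, 0.0                                  # ABSTAIN
    radius = 0.5 * sigma * (norm.ppf(p_A) - norm.ppf(p_B))
    return c_A, radius
```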

Besides the vanilla DNNs, we also use the differentially private training algorithm (DP-SGD) proposed in [38] to further smooth the trained model on the instance level. We set the gradient clipping norm to C = 5.0, 100.0, 1000.0 and add Gaussian noise with scale σ = 4.0, 0.1, 0.01 for MNIST, CIFAR-10 and ImageNet, respectively. For the KNN approach, we use N = 200 buckets.

As for the KNN models, we additionally evaluate them on the tabular spambase dataset to demonstrate their effectiveness. In particular, we use the UCI Spambase dataset [43], which provides a bag-of-words feature vector for each e-mail together with a label indicating whether the e-mail is spam. The dataset contains 4601 instances with 57-dimensional inputs. We use 80% of the data as the training set to fit the KNN model and the remaining 20% for evaluation. For the backdoor attack, we randomly add a backdoor pattern to one (or four) input dimensions, or add a random pattern to the entire input vector, similar to the blending attack.
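A minimal end-to-end sketch of this tabular setup is shown below, assuming the file spambase.data has been downloaded from the UCI repository and reusing the hypothetical interval_probs and smoothed_1nn_probs helpers sketched earlier; the 80/20 split, the bin edges, and the poisoning details are our own choices rather than the paper's exact configuration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.loadtxt("spambase.data", delimiter=",")        # 4601 rows: 57 features + label
X, y = data[:, :-1], data[:, -1].astype(int)
perm = rng.permutation(len(X))
split = int(0.8 * len(X))
X_tr, y_tr = X[perm[:split]].copy(), y[perm[:split]]
X_te, y_te = X[perm[split:]], y[perm[split:]]

# One-dimension backdoor with ||pi||_2 = 1.0, injected into 10% of the training set
pi = np.zeros(X.shape[1]); pi[rng.integers(X.shape[1])] = 1.0
idx = rng.choice(np.where(y_tr == 1)[0], size=int(0.1 * len(y_tr)), replace=False)
X_tr[idx] += pi

# Smoothed 1-NN certification for a single test point (Eq. (34) / Algorithm 3)
sigma, L = 1.0, 200
d2 = ((X_tr - X_te[0]) ** 2).sum(axis=1)
boundaries = np.linspace(0.0, np.quantile(d2, 0.99), L + 1)   # our own bin-edge choice
P = interval_probs(X_tr, np.zeros_like(X_tr), X_te[0], sigma, boundaries)
p_sorted = np.clip(np.sort(smoothed_1nn_probs(P, y_tr, num_classes=2))[::-1], 1e-6, 1 - 1e-6)
radius = 0.5 * sigma * (norm.ppf(p_sorted[0]) - norm.ppf(p_sorted[1])) if p_sorted[0] > p_sorted[1] else 0.0
```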

We report our results with three metrics: prediction accuracy, certified rate, and certified accuracy. Given the i-th test case with ground truth label y_i, the output prediction is either c_i with radius r_i, or c_i = ABSTAIN at r_i = 0. Given a test set of size m, the three evaluation metrics are calculated according to

    Prediction Acc = (1/m) Σ_{i=1}^{m} 1{c_i = y_i}

    Certified Rate at r = (1/m) Σ_{i=1}^{m} 1{r_i > r}

    Certified Acc at r = (1/m) Σ_{i=1}^{m} 1{c_i = y_i and r_i > r}.

The prediction accuracy indicates how well the smoothed (backdoored) classifier performs in classifying new, possibly backdoored, instances without taking their robustness radius into account. The certified rate at r is the fraction of test instances that can be certified at radius r_i > r, indicating how consistent the attacked classifier is with a clean one. Finally, the certified accuracy at r combines the first two metrics: it reports the fraction of the test set which is classified correctly (without abstaining) and is certified as robust with a radius r_i > r.
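These three metrics translate directly into code; the short helper below is our own (with ABSTAIN encoded as −1, a convention we introduce here):

```python
import numpy as np

def evaluation_metrics(pred, radius, y_true, r):
    """Compute prediction accuracy, certified rate and certified accuracy at radius r.

    pred:   (m,) predicted labels, with -1 standing in for ABSTAIN
    radius: (m,) certified radii, 0 for abstained predictions
    """
    correct = (pred == y_true)
    certified = (radius > r)
    return {
        "prediction_acc": correct.mean(),
        "certified_rate": certified.mean(),
        "certified_acc": (correct & certified).mean(),
    }
```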


7.2 Certified robustness of DNNs against backdoor attacks

Table 1: Experiment results on MNIST.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.2 / 0.5 / 1.0 / 2.0 | Certified Rate at r = 0.2 / 0.5 / 1.0 / 2.0 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 1.000 | 0.996 / 0.796 / 0.000 / 0.000 | 0.996 / 0.796 / 0.000 / 0.000 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.996 | 0.996 / 0.991 / 0.744 / 0.000 | 0.996 / 0.991 / 0.744 / 0.000 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 1.000 | 1.000 / 1.000 / 0.987 / 0.000 | 1.000 / 1.000 / 0.987 / 0.000 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 1.000 | 0.999 / 0.998 / 0.994 / 0.708 | 0.999 / 0.998 / 0.994 / 0.708 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 1.000 | 0.997 / 0.530 / 0.000 / 0.000 | 0.997 / 0.530 / 0.000 / 0.000 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.996 | 0.995 / 0.988 / 0.641 / 0.000 | 0.995 / 0.988 / 0.641 / 0.000 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 1.000 | 1.000 / 1.000 / 0.990 / 0.000 | 1.000 / 1.000 / 0.990 / 0.000 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 1.000 | 0.999 / 0.998 / 0.994 / 0.726 | 0.999 / 0.998 / 0.994 / 0.726 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 1.000 | 1.000 / 0.985 / 0.000 / 0.000 | 1.000 / 0.985 / 0.000 / 0.000 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.997 | 0.996 / 0.991 / 0.744 / 0.000 | 0.996 / 0.991 / 0.744 / 0.000 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 1.000 | 1.000 / 1.000 / 0.992 / 0.000 | 1.000 / 1.000 / 0.992 / 0.000 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 1.000 | 0.999 / 0.997 / 0.994 / 0.729 | 0.999 / 0.997 / 0.994 / 0.729 |

Table 2: Experiment results on CIFAR.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.05 / 0.1 / 0.2 / 0.5 | Certified Rate at r = 0.05 / 0.1 / 0.2 / 0.5 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.738 | 0.684 / 0.610 / 0.390 / 0.000 | 0.810 / 0.707 / 0.454 / 0.004 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.669 | 0.629 / 0.597 / 0.493 / 0.088 | 0.793 / 0.744 / 0.604 / 0.136 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.835 | 0.806 / 0.769 / 0.691 / 0.297 | 0.889 / 0.845 / 0.739 / 0.300 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.797 | 0.774 / 0.756 / 0.717 / 0.559 | 0.882 / 0.850 / 0.796 / 0.595 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.740 | 0.683 / 0.620 / 0.425 / 0.000 | 0.798 / 0.712 / 0.487 / 0.005 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.659 | 0.625 / 0.588 / 0.495 / 0.093 | 0.791 / 0.733 / 0.609 / 0.148 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.837 | 0.811 / 0.776 / 0.703 / 0.295 | 0.893 / 0.848 / 0.747 / 0.298 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.797 | 0.774 / 0.748 / 0.707 / 0.552 | 0.882 / 0.842 / 0.788 / 0.590 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.739 | 0.673 / 0.604 / 0.381 / 0.000 | 0.798 / 0.703 / 0.450 / 0.006 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.668 | 0.625 / 0.597 / 0.504 / 0.131 | 0.789 / 0.748 / 0.621 / 0.185 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.836 | 0.810 / 0.776 / 0.700 / 0.280 | 0.890 / 0.845 / 0.742 / 0.283 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.803 | 0.782 / 0.759 / 0.716 / 0.560 | 0.888 / 0.855 / 0.795 / 0.598 |

Table 3: Experiment results on ImageNet.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.05 / 0.1 / 0.2 / 0.5 | Certified Rate at r = 0.05 / 0.1 / 0.2 / 0.5 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.782 | 0.728 / 0.655 / 0.426 / 0.000 | 0.768 / 0.684 / 0.442 / 0.000 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.602 | 0.566 / 0.522 / 0.429 / 0.105 | 0.637 / 0.580 / 0.469 / 0.120 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.909 | 0.887 / 0.864 / 0.811 / 0.228 | 0.906 / 0.879 / 0.819 / 0.228 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.813 | 0.801 / 0.780 / 0.742 / 0.585 | 0.837 / 0.810 / 0.763 / 0.591 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.786 | 0.739 / 0.662 / 0.397 / 0.000 | 0.777 / 0.691 / 0.414 / 0.000 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.628 | 0.595 / 0.558 / 0.464 / 0.118 | 0.654 / 0.608 / 0.500 / 0.129 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.907 | 0.894 / 0.869 / 0.808 / 0.196 | 0.914 / 0.886 / 0.818 / 0.196 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.804 | 0.789 / 0.768 / 0.723 / 0.569 | 0.826 / 0.800 / 0.747 / 0.576 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.776 | 0.718 / 0.643 / 0.388 / 0.000 | 0.758 / 0.672 / 0.403 / 0.000 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.600 | 0.564 / 0.528 / 0.441 / 0.117 | 0.632 / 0.584 / 0.482 / 0.131 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.909 | 0.888 / 0.867 / 0.808 / 0.276 | 0.904 / 0.882 / 0.817 / 0.276 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.808 | 0.792 / 0.772 / 0.724 / 0.564 | 0.828 / 0.804 / 0.750 / 0.572 |

We evaluate the certified robustness bound of DNNs against the different backdoor patterns and report the three evaluation metrics. Tables 1, 2 and 3 list the benchmark results on MNIST, CIFAR-10 and ImageNet, respectively. From the results we can see that on MNIST it is possible to obtain a certified accuracy of around 99.6% as long as the L2 norm of the backdoor pattern is within 0.5, dropping to around 70% at a larger radius such as L2 = 2. It is also clear that larger smoothing noise during RAB training helps to achieve higher certified accuracy and certified rate. Similar observations hold for CIFAR-10 and ImageNet, although the certified robustness on these datasets is lower than on MNIST.


7.3 Certified robustness of differentially private DNNs against backdoor attacks

In addition to vanilla DNNs, we provide certified robustness benchmarks for differentially privately trained DNNs. Tables 4, 5 and 6 show the results on MNIST, CIFAR-10 and ImageNet. We can see that on MNIST the certified robustness of such smoothly trained models is much higher than that of vanilla DNNs, which provides further guidance for improving the robustness of ML models against backdoor attacks. On the other hand, the results for the smoothed models on CIFAR-10 and ImageNet are somewhat lower, since DP-SGD is not able to train an effective model on these datasets. Further improvements to DP-SGD would help to improve the certified robustness.

Table 4: Experiment results of Smoothed Model on MNIST.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.2 / 0.5 / 1.0 / 2.0 | Certified Rate at r = 0.2 / 0.5 / 1.0 / 2.0 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.997 | 0.995 / 0.991 / 0.904 / 0.000 | 0.997 / 0.991 / 0.904 / 0.000 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.996 | 0.995 / 0.992 / 0.948 / 0.265 | 0.996 / 0.992 / 0.948 / 0.265 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.998 | 0.998 / 0.997 / 0.981 / 0.000 | 0.998 / 0.997 / 0.981 / 0.000 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.998 | 0.998 / 0.997 / 0.994 / 0.833 | 0.998 / 0.997 / 0.994 / 0.833 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.997 | 0.994 / 0.985 / 0.848 / 0.000 | 0.996 / 0.985 / 0.848 / 0.000 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.995 | 0.994 / 0.988 / 0.936 / 0.176 | 0.996 / 0.988 / 0.936 / 0.176 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.998 | 0.998 / 0.997 / 0.982 / 0.000 | 0.998 / 0.997 / 0.982 / 0.000 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.998 | 0.998 / 0.997 / 0.993 / 0.793 | 0.998 / 0.997 / 0.993 / 0.793 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.998 | 0.997 / 0.994 / 0.925 / 0.000 | 0.997 / 0.994 / 0.925 / 0.000 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.997 | 0.995 / 0.993 / 0.957 / 0.317 | 0.995 / 0.993 / 0.957 / 0.317 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.998 | 0.998 / 0.997 / 0.979 / 0.000 | 0.998 / 0.997 / 0.979 / 0.000 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.998 | 0.998 / 0.997 / 0.994 / 0.820 | 0.998 / 0.997 / 0.994 / 0.820 |

Table 5: Experiment results of Smoothed Model on CIFAR.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.05 / 0.1 / 0.2 / 0.5 | Certified Rate at r = 0.05 / 0.1 / 0.2 / 0.5 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.479 | 0.365 / 0.251 / 0.082 / 0.000 | 0.578 / 0.410 / 0.174 / 0.005 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.305 | 0.239 / 0.192 / 0.105 / 0.004 | 0.512 / 0.432 / 0.267 / 0.044 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.728 | 0.675 / 0.612 / 0.440 / 0.003 | 0.799 / 0.719 / 0.497 / 0.004 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.649 | 0.609 / 0.575 / 0.484 / 0.204 | 0.754 / 0.700 / 0.580 / 0.228 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.466 | 0.344 / 0.222 / 0.057 / 0.000 | 0.555 / 0.375 / 0.141 / 0.003 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.292 | 0.225 / 0.170 / 0.088 / 0.002 | 0.491 / 0.405 / 0.244 / 0.037 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.725 | 0.673 / 0.612 / 0.457 / 0.006 | 0.795 / 0.719 / 0.514 / 0.007 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.643 | 0.602 / 0.570 / 0.471 / 0.189 | 0.753 / 0.698 / 0.569 / 0.213 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.471 | 0.361 / 0.239 / 0.072 / 0.000 | 0.580 / 0.400 / 0.163 / 0.005 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.321 | 0.262 / 0.210 / 0.129 / 0.007 | 0.536 / 0.452 / 0.297 / 0.055 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.724 | 0.673 / 0.613 / 0.444 / 0.002 | 0.795 / 0.718 / 0.501 / 0.003 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.647 | 0.607 / 0.570 / 0.475 / 0.188 | 0.750 / 0.693 / 0.569 / 0.209 |

Table 6: Experiment results of Smoothed Model on ImageNet.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.05 / 0.1 / 0.2 / 0.5 | Certified Rate at r = 0.05 / 0.1 / 0.2 / 0.5 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.405 | 0.317 / 0.237 / 0.116 / 0.000 | 0.403 / 0.292 / 0.134 / 0.000 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.101 | 0.072 / 0.054 / 0.022 / 0.000 | 0.218 / 0.156 / 0.073 / 0.002 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.698 | 0.641 / 0.580 / 0.445 / 0.028 | 0.685 / 0.609 / 0.459 / 0.028 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.376 | 0.333 / 0.276 / 0.195 / 0.037 | 0.392 / 0.319 / 0.218 / 0.038 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.408 | 0.318 / 0.246 / 0.115 / 0.000 | 0.402 / 0.300 / 0.133 / 0.000 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.103 | 0.071 / 0.053 / 0.020 / 0.000 | 0.211 / 0.157 / 0.068 / 0.002 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.697 | 0.638 / 0.579 / 0.444 / 0.028 | 0.680 / 0.608 / 0.458 / 0.028 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.377 | 0.324 / 0.280 / 0.197 / 0.041 | 0.384 / 0.325 / 0.220 / 0.042 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.397 | 0.318 / 0.232 / 0.113 / 0.000 | 0.403 / 0.286 / 0.131 / 0.000 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.097 | 0.070 / 0.052 / 0.021 / 0.000 | 0.215 / 0.160 / 0.071 / 0.002 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.701 | 0.644 / 0.580 / 0.444 / 0.029 | 0.687 / 0.609 / 0.457 / 0.029 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.379 | 0.332 / 0.281 / 0.197 / 0.042 | 0.392 / 0.324 / 0.220 / 0.042 |


7.4 Certified robustness of KNN models against backdoor attacks

In this section, we present the benchmarks based on our proposed efficient algorithm for 1-NN models. Tables 7 and 8 present results on MNIST and CIFAR-10, and Table 9 shows additional results on tabular data (spam classification). From Table 9 we can see that the 1-NN model achieves high certified accuracy and certified rate on tabular data, which indicates its effectiveness in specific domains.

Table 7: Experiment results of 1NN on MNIST.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.2 / 0.5 / 1.0 / 2.0 | Certified Rate at r = 0.2 / 0.5 / 1.0 / 2.0 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.997 | 0.995 / 0.991 / 0.904 / 0.000 | 0.997 / 0.991 / 0.904 / 0.000 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.996 | 0.995 / 0.992 / 0.948 / 0.265 | 0.996 / 0.992 / 0.948 / 0.265 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.998 | 0.998 / 0.997 / 0.981 / 0.000 | 0.998 / 0.997 / 0.981 / 0.000 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.998 | 0.998 / 0.997 / 0.994 / 0.833 | 0.998 / 0.997 / 0.994 / 0.833 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.997 | 0.994 / 0.985 / 0.848 / 0.000 | 0.996 / 0.985 / 0.848 / 0.000 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.995 | 0.994 / 0.988 / 0.936 / 0.176 | 0.996 / 0.988 / 0.936 / 0.176 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.998 | 0.998 / 0.997 / 0.982 / 0.000 | 0.998 / 0.997 / 0.982 / 0.000 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.998 | 0.998 / 0.997 / 0.993 / 0.793 | 0.998 / 0.997 / 0.993 / 0.793 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.998 | 0.997 / 0.994 / 0.925 / 0.000 | 0.997 / 0.994 / 0.925 / 0.000 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.997 | 0.995 / 0.993 / 0.957 / 0.317 | 0.995 / 0.993 / 0.957 / 0.317 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.998 | 0.998 / 0.997 / 0.979 / 0.000 | 0.998 / 0.997 / 0.979 / 0.000 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.998 | 0.998 / 0.997 / 0.994 / 0.820 | 0.998 / 0.997 / 0.994 / 0.820 |

Table 8: Experiment results of 1NN on CIFAR.

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.05 / 0.1 / 0.2 / 0.5 | Certified Rate at r = 0.05 / 0.1 / 0.2 / 0.5 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.479 | 0.365 / 0.251 / 0.082 / 0.000 | 0.578 / 0.410 / 0.174 / 0.005 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.305 | 0.239 / 0.192 / 0.105 / 0.004 | 0.512 / 0.432 / 0.267 / 0.044 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.728 | 0.675 / 0.612 / 0.440 / 0.003 | 0.799 / 0.719 / 0.497 / 0.004 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.649 | 0.609 / 0.575 / 0.484 / 0.204 | 0.754 / 0.700 / 0.580 / 0.228 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.466 | 0.344 / 0.222 / 0.057 / 0.000 | 0.555 / 0.375 / 0.141 / 0.003 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.292 | 0.225 / 0.170 / 0.088 / 0.002 | 0.491 / 0.405 / 0.244 / 0.037 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.725 | 0.673 / 0.612 / 0.457 / 0.006 | 0.795 / 0.719 / 0.514 / 0.007 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.643 | 0.602 / 0.570 / 0.471 / 0.189 | 0.753 / 0.698 / 0.569 / 0.213 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 0.5 | 0.471 | 0.361 / 0.239 / 0.072 / 0.000 | 0.580 / 0.400 / 0.163 / 0.005 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.321 | 0.262 / 0.210 / 0.129 / 0.007 | 0.536 / 0.452 / 0.297 / 0.055 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 0.5 | 0.724 | 0.673 / 0.613 / 0.444 / 0.002 | 0.795 / 0.718 / 0.501 / 0.003 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.647 | 0.607 / 0.570 / 0.475 / 0.188 | 0.750 / 0.693 / 0.569 / 0.209 |

Table 9: Experiment results of 1NN on tabular data (spam classification).

| Attack Approach | Attack Setting | σ | Prediction Acc | Certified Acc at r = 0.05 / 0.1 / 0.2 / 0.5 / 1.0 | Certified Rate at r = 0.05 / 0.1 / 0.2 / 0.5 / 1.0 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.725 | 0.689 / 0.643 / 0.565 / 0.120 | 0.942 / 0.875 / 0.761 / 0.183 |
| One-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.579 | 0.529 / 0.473 / 0.285 / 0.033 | 0.892 / 0.775 / 0.522 / 0.098 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.869 | 0.847 / 0.830 / 0.782 / 0.657 | 0.959 / 0.928 / 0.855 / 0.661 |
| One-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.924 | 0.906 / 0.894 / 0.836 / 0.646 | 0.965 / 0.922 / 0.840 / 0.646 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.710 | 0.676 / 0.625 / 0.525 / 0.093 | 0.937 / 0.862 / 0.710 / 0.156 |
| Four-pixel | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.580 | 0.539 / 0.479 / 0.299 / 0.038 | 0.894 / 0.780 / 0.539 / 0.103 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.869 | 0.846 / 0.829 / 0.781 / 0.659 | 0.958 / 0.929 / 0.850 / 0.663 |
| Four-pixel | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.925 | 0.912 / 0.897 / 0.843 / 0.652 | 0.960 / 0.919 / 0.847 / 0.652 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 1.0 | 0.729 | 0.699 / 0.641 / 0.550 / 0.119 | 0.953 / 0.870 / 0.726 / 0.162 |
| Blending | ‖πi‖2 = 1.0, rp = 0.1 | 2.0 | 0.588 | 0.546 / 0.482 / 0.323 / 0.035 | 0.903 / 0.779 / 0.562 / 0.103 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 1.0 | 0.867 | 0.845 / 0.829 / 0.780 / 0.659 | 0.957 / 0.929 / 0.852 / 0.662 |
| Blending | ‖πi‖2 = 0.1, rp = 0.02 | 2.0 | 0.927 | 0.913 / 0.896 / 0.842 / 0.655 | 0.962 / 0.924 / 0.846 / 0.655 |

8 Related Work

In this section, we discuss existing poisoning and backdoor attacks on machine learning models, as well as existing defenses against such attacks.


8.1 Poisoning attacks

There have been several works developing optimal poisoning attacks against machine learning models such as SVMs and logistic regression [18, 31]. Furthermore, [44] proposes a similar optimization-based poisoning attack against neural networks, though it can only be applied to shallow MLP models. In addition to these optimization-based poisoning attacks, backdoor attacks have been shown to be very effective against deep neural networks [21, 42]. The backdoor patterns can be either static or generated dynamically [4]. A static backdoor pattern can be as small as one pixel or as large as an entire image [21]. The backdoor instances can be generated either with manipulated labels or with clean labels [45]. In general, the poisoning ratio of these backdoor attacks does not need to be high to mislead DNNs, which makes such attacks very concerning for widely deployed DNN models.

8.2 Potential defenses against poisoning attacks

Given the potentially severe consequences of backdoor attacks, multiple defense approaches have been proposed. Neural Cleanse [16] proposes to detect backdoored models based on the observation that there exists a "short path" that makes an image of one label be predicted as the malicious one. It therefore calculates the minimal amount of perturbation needed to cause all images to be predicted as each label, and uses an anomaly detection approach to find the label whose perturbation is much smaller than the others. [46] improves upon this approach by using model inversion to obtain training data, and then applies GANs to generate the "short path" and applies an anomaly detection algorithm as in Neural Cleanse. Activation Clustering [47] leverages the activation vectors of the backdoored model as features to detect backdoor instances: it performs a two-class clustering over the activation vectors of the training data to separate benign data from backdoor instances. Spectral Signature [22] identifies a "spectral signature" in the activation vectors of backdoored instances; it calculates a spectral signature score for each data point and removes those that possibly contain a backdoor based on a predefined threshold. STRIP [17] observes that for a backdoor instance the model mainly focuses on the backdoor pattern; it therefore identifies backdoor instances by checking whether the model still provides a confident answer when it sees the backdoor pattern. SentiNet [48] leverages computer vision techniques to search for the parts of an image that contribute the most to the model output, which are very likely to be the backdoor pattern; it then copies each such part onto other images and checks whether it consistently changes their outputs in order to identify backdoors. Finally, another interesting application of randomized smoothing is presented in [49]: the authors use randomized smoothing to certify robustness against label-flipping attacks and randomize over the entire training procedure of the classifier by randomly flipping labels in the training set. That work is orthogonal to ours in that we investigate robustness with respect to perturbations of the training inputs rather than the labels.

Recently, a short report also proposed to directly apply the randomized smoothing technique to potentially provide certified robustness against backdoor attacks, without any evaluation or analysis [50]. In addition, as we have shown, directly applying randomized smoothing does not provide high certified robustness bounds. In contrast, in this paper we first provide a unified framework based on smoothing functionals and then propose the RAB robust training process to provide certified robustness against backdoor attacks. In particular, we provide a tightness analysis for the robustness bound, analyze different smoothing distributions, and propose a hash-function-based approach for mapping the test data so as to achieve good certified robustness. In addition, we analyze different machine learning models together with properties such as model smoothness to provide guidance for further improving the certified robustness of machine learning models.

9 Conclusion

In this paper, we propose a unified smoothing framework to certify model robustness against different adversarial attacks, including both evasion and poisoning attacks. In particular, for the popular backdoor poisoning attack, we propose the first robust training process as well as a test data mapping mechanism to


certify the prediction robustness against diverse backdoor attacks, such as backdoors with different static patterns and dynamic backdoor patterns. Based on our understanding of the robustness conditions, we certify robustness bounds for vanilla DNNs, differentially private smoothed DNNs, and KNN models. We also propose an exact algorithm for KNN models that does not require random sampling from the noise distributions. We provide comprehensive benchmarks of certified robustness for different machine learning models on diverse datasets. Different radii of backdoor pattern magnitudes are also evaluated, which we believe provides the first set of robustness bounds against backdoor attacks for future work to compare with.


References

[1] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
[2] C. Xiao, B. Li, J.-Y. Zhu, W. He, M. Liu, and D. Song, “Generating adversarial examples with adversarial networks,” arXiv preprint arXiv:1801.02610, 2018.
[3] C. Liao, H. Zhong, A. Squicciarini, S. Zhu, and D. Miller, “Backdoor embedding in convolutional neural network models via invisible perturbation,” arXiv preprint arXiv:1808.10307, 2018.
[4] C. Yang, Q. Wu, H. Li, and Y. Chen, “Generative poisoning attack method against neural networks,” arXiv preprint arXiv:1703.01340, 2017.
[5] N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 3–14.
[6] W. Xu, D. Evans, and Y. Qi, “Feature squeezing: Detecting adversarial examples in deep neural networks,” arXiv preprint arXiv:1704.01155, 2017.
[7] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey, “Characterizing adversarial subspaces using local intrinsic dimensionality,” arXiv preprint arXiv:1801.02613, 2018.
[8] Z. Yang, B. Li, P.-Y. Chen, and D. Song, “Characterizing audio adversarial examples using temporal dependency,” arXiv preprint arXiv:1809.10875, 2018.
[9] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
[10] A. Lomuscio and L. Maganti, “An approach to reachability analysis for feed-forward ReLU neural networks,” arXiv preprint arXiv:1706.07351, 2017.
[11] V. Tjeng, K. Xiao, and R. Tedrake, “Evaluating robustness of neural networks with mixed integer programming,” arXiv preprint arXiv:1711.07356, 2017.
[12] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, and A. Criminisi, “Measuring neural net robustness with constraints,” in Advances in Neural Information Processing Systems, 2016, pp. 2613–2621.
[13] K. Dvijotham, R. Stanforth, S. Gowal, T. A. Mann, and P. Kohli, “A dual approach to scalable verification of deep networks,” in UAI, vol. 1, 2018, p. 2.
[14] E. Wong and J. Z. Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” arXiv preprint arXiv:1711.00851, 2017.
[15] A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” arXiv preprint arXiv:1801.09344, 2018.
[16] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” IEEE, 2019.
[17] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, “STRIP: A defence against trojan attacks on deep neural networks,” arXiv preprint arXiv:1902.06531, 2019.


[18] B. Li, Y. Wang, A. Singh, and Y. Vorobeychik, “Data poisoning attacks on factorization-based collaborative filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 1885–1893.
[19] P. Li, B. Karlas, R. Wu, M. Gurel, X. Chu, W. Wu, and C. Zhang, “Learning and cleaning over known unknowns: From consistent query answering to consistent classification,” Technical Report, 2020.
[20] T. Gu, B. Dolan-Gavitt, and S. Garg, “BadNets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017.
[21] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
[22] B. Tran, J. Li, and A. Madry, “Spectral signatures in backdoor attacks,” in Advances in Neural Information Processing Systems, 2018, pp. 8000–8010.
[23] C. Xie, K. Huang, P.-Y. Chen, and B. Li, “DBA: Distributed backdoor attacks against federated learning,” in International Conference on Learning Representations, 2019.
[24] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang, “Trojaning attack on neural networks,” in 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18–21, 2018. The Internet Society, 2018.
[25] J. Dumford and W. Scheirer, “Backdooring convolutional neural networks via targeted weight perturbations,” arXiv preprint arXiv:1812.03128, 2018.
[26] Y. Ji, X. Zhang, and T. Wang, “Backdoor attacks against learning systems,” in 2017 IEEE Conference on Communications and Network Security (CNS). IEEE, 2017, pp. 1–9.
[27] X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh, “Towards robust neural networks via random self-ensemble,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 369–385.
[28] X. Cao and N. Z. Gong, “Mitigating evasion attacks to deep neural networks via region-based classification,” in Proceedings of the 33rd Annual Computer Security Applications Conference, 2017, pp. 278–287.
[29] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter, “Certified adversarial robustness via randomized smoothing,” arXiv preprint arXiv:1902.02918, 2019.
[30] S. Mei and X. Zhu, “Using machine teaching to identify optimal training-set attacks on machine learners,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[31] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against support vector machines,” arXiv preprint arXiv:1206.6389, 2012.
[32] A. Young and M. Yung, “Backdoor attacks on black-box ciphers exploiting low-entropy plaintexts,” in Australasian Conference on Information Security and Privacy. Springer, 2003, pp. 297–311.
[33] S. Wang, S. Nepal, C. Rudolph, M. Grobler, S. Chen, and T. Chen, “Backdoor attacks against transfer learning with pre-trained deep learning models,” arXiv preprint arXiv:2001.03274, 2020.
[34] A. Salem, R. Wen, M. Backes, S. Ma, and Y. Zhang, “Dynamic backdoor attacks against machine learning models,” arXiv preprint arXiv:2003.03675, 2020.
[35] C. E. Shannon, “Communication theory of secrecy systems,” Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, 1949.
[36] L. Li, M. Weber, X. Xu, L. Rimanic, T. Xie, C. Zhang, and B. Li, “Provable robust learning based on transformation-specific smoothing,” arXiv preprint arXiv:2002.12398, 2020.


[37] Wikipedia contributors, “SHA-2 — Wikipedia, the free encyclopedia,” 2020, [Online; accessed 18-March-2020]. Available: https://en.wikipedia.org/w/index.php?title=SHA-2&oldid=944705336
[38] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[40] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[42] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “BadNets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47230–47244, 2019.
[43] D. Dua and C. Graff, “UCI machine learning repository,” 2017. Available: http://archive.ics.uci.edu/ml
[44] L. Munoz-Gonzalez, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli, “Towards poisoning of deep learning algorithms with back-gradient optimization,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 27–38.
[45] A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein, “Poison frogs! Targeted clean-label poisoning attacks on neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 6103–6113.
[46] H. Chen, C. Fu, J. Zhao, and F. Koushanfar, “DeepInspect: A black-box trojan detection and mitigation framework for deep neural networks,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019, pp. 4658–4664.
[47] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava, “Detecting backdoor attacks on deep neural networks by activation clustering,” arXiv preprint arXiv:1811.03728, 2018.
[48] E. Chou, F. Tramer, G. Pellegrino, and D. Boneh, “SentiNet: Detecting physical attacks against deep learning systems,” arXiv preprint arXiv:1812.00292, 2018.
[49] E. Rosenfeld, E. Winston, P. Ravikumar, and J. Z. Kolter, “Certified robustness to label-flipping attacks via randomized smoothing,” arXiv preprint arXiv:2002.03018, 2020.
[50] B. Wang, X. Cao, N. Z. Gong et al., “On certifying robustness against backdoor attacks via randomized smoothing,” arXiv preprint arXiv:2002.11750, 2020.
[51] J. Neyman and E. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Phil. Trans. Roy. Soc. A, vol. 231, pp. 289–337, 1933.


A Properties of ζ and ξ

In this section we show some properties related to the functions ζ and ξ defined in Section 4 of this paper.

Lemma A.1. ζ is right-continuous and non-decreasing.

Proof. Let t ≥ 0 and suppose that {t_n}_{n∈ℕ} is a sequence such that t_n ↓ t. In order to show right-continuity, we need to show that lim_{n→∞} ζ(t_n) = ζ(t). Let A_n := {z : μ_δ(z)/μ_ε(z) ≤ t_n} and note that P_ε(A_n) = ζ(t_n). Since {t_n}_n is decreasing, we have A_{n+1} ⊆ A_n. We claim that ∩_{n=1}^∞ A_n = S̄_t. Suppose z ∈ ∩_{n=1}^∞ A_n. Then ∀n : μ_δ(z)/μ_ε(z) ≤ t_n and thus μ_δ(z)/μ_ε(z) ≤ lim_{n→∞} t_n = t, yielding z ∈ S̄_t. If, on the other hand, z ∈ S̄_t, then μ_δ(z)/μ_ε(z) ≤ t ≤ t_n for all n and thus z ∈ ∩_{n=1}^∞ A_n. Finally, this yields

    ζ(t_n) = P_ε(A_n) → P_ε(∩_{n=1}^∞ A_n) = P_ε(S̄_t) = ζ(t)  as n → ∞,    (36)

concluding the proof.

Lemma A.2. If ∀t ≥ 0 : P_ε(∂_t) = 0, where ∂_t := S̄_t \ S_t, then ζ is continuous.

Proof. Since ζ is right-continuous, continuity follows if we show that ζ is also left-continuous. For that purpose, let t ≥ 0 and suppose that {t_n}_{n∈ℕ} is a sequence such that t_n ↑ t. Let A_n := {z : μ_δ(z)/μ_ε(z) ≤ t_n} and note that P_ε(A_n) = ζ(t_n). Since {t_n}_n is increasing, we have A_n ⊆ A_{n+1}. We claim that ∪_{n=1}^∞ A_n = S_t. Suppose that z ∈ ∪_{n=1}^∞ A_n. Then ∃n such that μ_δ(z)/μ_ε(z) ≤ t_n < t and hence z ∈ S_t. If, on the other hand, z ∈ S_t, then μ_δ(z)/μ_ε(z) < t. Hence ∃n such that μ_δ(z)/μ_ε(z) ≤ t_n and thus z ∈ ∪_{n=1}^∞ A_n. Finally, this yields

    ζ(t_n) = P_ε(A_n) → P_ε(∪_{n=1}^∞ A_n) = P_ε(S_t)  as n → ∞.    (37)

The lemma follows from the assumption, since P_ε(S_t) = P_ε(S̄_t) = ζ(t).

Lemma A.3. ∀p ∈ [0, 1] : P_ε(S̄_{ζ^{-1}(p)}) ≥ p ≥ P_ε(S_{ζ^{-1}(p)}).

Proof. The first inequality follows since ζ is right-continuous and thus P_ε(S̄_{ζ^{-1}(p)}) = ζ(ζ^{-1}(p)) ≥ p. To show the second inequality, let t_p := ζ^{-1}(p) and consider the sets A_n := {z : μ_δ(z)/μ_ε(z) ≤ t_p − 1/n}, and note that A_n ⊆ A_{n+1}. We claim that ∪_{n=1}^∞ A_n = S_{t_p}. Suppose that z ∈ ∪_{n=1}^∞ A_n. Then ∃n such that μ_δ(z)/μ_ε(z) ≤ t_p − 1/n < t_p and hence z ∈ S_{t_p}. If, on the other hand, z ∈ S_{t_p}, then μ_δ(z)/μ_ε(z) < t_p. Hence ∃n such that μ_δ(z)/μ_ε(z) ≤ t_p − 1/n and thus z ∈ ∪_{n=1}^∞ A_n. This yields

    P_ε(S_{t_p}) = P_ε(∪_{n=1}^∞ A_n).    (38)

Note that ∀n : t_p − 1/n < t_p = inf{t : ζ(t) ≥ p} and thus ζ(t_p − 1/n) ≤ p. Finally, together with (38), this yields

    P_ε(S_{t_p}) = lim_{n→∞} P_ε(A_n) ≤ p    (39)

concluding the proof.

Lemma A.4. Suppose that the sets ∂_t := S̄_t \ S_t satisfy P_ε(∂_t) = P_δ(∂_t) = 0 for t ≥ 0. Then, for p ∈ [0, 1],

    ξ(ζ^{-1}(p), p) = P_δ(S_{ζ^{-1}(p)}) = P_δ(S̄_{ζ^{-1}(p)}).    (40)

Proof. Let p ∈ [0, 1] and t_p := ζ^{-1}(p). Note that, since P_ε(∂_t) = 0 for all t ≥ 0, we have P_ε(S_{t_p}) = P_ε(S̄_{t_p}) = p. Thus

    𝒮_{t_p, p} = {S ⊆ Z | S_{t_p} ⊆ S ⊆ S̄_{t_p}}.    (41)

Since, in addition, P_δ(∂_t) = 0 and thus P_δ(S_{t_p}) = P_δ(S̄_{t_p}), we have

    ξ(t_p, p) = sup_{S ∈ 𝒮_{t_p, p}} P_δ(S) = P_δ(S_{ζ^{-1}(p)}) = P_δ(S̄_{ζ^{-1}(p)})    (42)

concluding the proof.


B Extended Neyman-Pearson Lemma

We now show the following lemma, which is connected to the Neyman–Pearson lemma [51] from statistical hypothesis testing. We refer the reader to [36] for a proof of this result.

Lemma B.5 (Extended Neyman–Pearson). Let f : Z → R_{≥0} be a measurable function such that 0 ≤ sup_x f(x) ≤ M < ∞. Then the following implications hold:

1. For any measurable set S ⊆ Z such that S_t ⊆ S ⊆ S̄_t:

    E_ε(f(ε)) ≥ M · P_ε(S)  ⇒  E_δ(f(δ)) ≥ M · P_δ(S).    (43)

2. For any measurable set S ⊆ Z such that (S̄_t)^c ⊆ S ⊆ (S_t)^c:

    E_ε(f(ε)) ≤ M · P_ε(S)  ⇒  E_δ(f(δ)) ≤ M · P_δ(S).    (44)

C Proof of Theorem 1

Theorem 1 (restated). Let ε and δ be Z-valued random variables. Let x_test ∈ X, D_train ∈ D^n a dataset, F : H(R^d × D^n, S_C) → H(R^d × D^n × Z, S_C) a smoothing functional and h ∈ H(X × D^n, S_C) a base classifier. Suppose that the ε-smoothed classifier is (p_A, p_B)-confident at (x_test, D_train) for some c_A ∈ Y, that is,

    g_h^ε(x_test | D_train)_{c_A} ≥ p_A ≥ p_B ≥ max_{c ≠ c_A} g_h^ε(x_test | D_train)_c.    (45)

Let ζ : R_{≥0} → [0, 1] be the function defined by

    ζ(t) := P_ε(S̄_t)    (46)

and let ζ^{-1}(p) := inf{t : ζ(t) ≥ p} be its generalized inverse. For t ≥ 0 and p ∈ [0, 1], let

    𝒮_{t,p} := {S ⊆ Z | S_t ⊆ S ⊆ S̄_t ∧ P_ε(S) ≤ p}    (47)

and define the function ξ : R_{≥0} × [0, 1] → [0, 1] by

    ξ(t, p) := sup{P_δ(S) | S_t ⊆ S ⊆ S̄_t ∧ P_ε(S) ≤ p}.    (48)

If δ satisfies

    1 − ξ(ζ^{-1}(1 − p_B), 1 − p_B) < ξ(ζ^{-1}(p_A), p_A),    (49)

then

    g_h^δ(x_test | D_train)_{c_A} > max_{c ≠ c_A} g_h^δ(x_test | D_train)_c.    (50)

The following proof is similar in nature to the proof for functionals that model evasion attacks, which is provided in [36]. Here, we extend that proof to smoothing over general transforms of classifiers, which allows us to model a broader range of attack scenarios.

Proof. Let t_A := ζ^{-1}(p_A) and t_B := ζ^{-1}(1 − p_B). For ease of notation, let S_A := S_{t_A}, S_B := S_{t_B}, S̄_A := S̄_{t_A} and S̄_B := S̄_{t_B}. Note that by Lemma A.1, ζ is right-continuous and hence for any p ∈ [0, 1] we have ζ(ζ^{-1}(p)) ≥ p. In particular, P_ε(S̄_A) = ζ(ζ^{-1}(p_A)) ≥ p_A and P_ε(S̄_B) = ζ(ζ^{-1}(1 − p_B)) ≥ 1 − p_B. Let

    𝒮_A := 𝒮_{t_A, p_A},   𝒮_B := 𝒮_{t_B, 1−p_B}.    (51)


Note that by Lemma A.3, P_ε(S_A) ≤ p_A and P_ε(S_B) ≤ 1 − p_B. Hence 𝒮_A ≠ ∅ and 𝒮_B ≠ ∅. Let S_A ∈ 𝒮_A and S_B ∈ 𝒮_B be arbitrary. Then, since by assumption g_h^ε is (p_A, p_B)-confident at (x_test, D_train), for any c ≠ c_A we have

    E_ε(F(h)(x_test, D_train, ε)_{c_A}) = g_h^ε(x_test | D_train)_{c_A} ≥ p_A ≥ P_ε(S_A)    (52)

and

    E_ε(F(h)(x_test, D_train, ε)_c) = g_h^ε(x_test | D_train)_c ≤ p_B ≤ 1 − P_ε(S_B) = P_ε(S_B^c).    (53)

Note that S_{t_A} ⊆ S_A ⊆ S̄_{t_A}. We can thus apply part 1 of Lemma B.5 to the function z ↦ F(h)(x_test, D_train, z)_{c_A} with M = 1 and obtain

    g_h^δ(x_test | D_train)_{c_A} = E_δ(F(h)(x_test, D_train, δ)_{c_A}) ≥ P_δ(S_A).    (54)

Similarly, (S̄_{t_B})^c ⊆ S_B^c ⊆ (S_{t_B})^c, and applying part 2 of Lemma B.5 to the function z ↦ F(h)(x_test, D_train, z)_c with M = 1 yields

    g_h^δ(x_test | D_train)_c = E_δ(F(h)(x_test, D_train, δ)_c) ≤ P_δ(S_B^c) = 1 − P_δ(S_B).    (55)

Since the choice of S_A and S_B was arbitrary, we get

    g_h^δ(x_test | D_train)_{c_A} ≥ sup_{S ∈ 𝒮_A} P_δ(S) = ξ(t_A, p_A),    (56)

    g_h^δ(x_test | D_train)_c ≤ inf_{S ∈ 𝒮_B} P_δ(S^c) = 1 − sup_{S ∈ 𝒮_B} P_δ(S) = 1 − ξ(t_B, 1 − p_B).    (57)

Since the above inequalities hold for all c ≠ c_A, we find g_h^δ(x_test | D_train)_{c_A} > max_{c ≠ c_A} g_h^δ(x_test | D_train)_c if

    1 − ξ(ζ^{-1}(1 − p_B), 1 − p_B) < ξ(ζ^{-1}(p_A), p_A),    (58)

which proves the theorem.

Proposition 1 (restated). If the support of δ is disjoint from the support of ε, then δ can not satisfy (12).

Proof. Let S_δ := {z ∈ Z | μ_δ(z) ≠ 0} and S_ε := {z ∈ Z | μ_ε(z) ≠ 0} denote the supports of δ and ε, respectively, and let N_δ := Z \ S_δ and N_ε := Z \ S_ε be their complements in Z. Suppose that S_δ ∩ S_ε = ∅. Note that in this case, for any t ∈ [0, ∞), the condition μ_δ(z)/μ_ε(z) ≤ t is satisfied if and only if z ∈ S_ε ∪ (S_δ ∪ S_ε)^c, and hence S̄_t ≡ S_ε ∪ (S_δ ∪ S_ε)^c = N_δ. Since S_ε ⊆ N_δ, we have ζ(t) = P_ε(S̄_t) ≥ P_ε(S_ε) = 1 and thus ζ ≡ 1. Hence ∀p : ζ^{-1}(p) = inf{t : ζ(t) ≥ p} = 0 and ζ^{-1} ≡ 0. Since S_0 = ∅ and S̄_0 = N_δ, we get that ∀p : ξ(ζ^{-1}(p), p) = sup{P_δ(S) | S ⊆ N_δ ∧ P_ε(S) ≤ p} = 0. Thus, for any p_A and p_B, the strict inequality

    1 = 1 − ξ(ζ^{-1}(1 − p_B), 1 − p_B) < ξ(ζ^{-1}(p_A), p_A) = 0    (59)

can never be satisfied.

D Tightness

Theorem 2 (restated). Let 1 ≥ p_A ≥ p_B ≥ 0 such that p_A + p_B ≤ 1. Let ε and δ be Z-valued random variables with non-disjoint support and such that ζ(0) = 0, ζ is strictly increasing and continuous, and

    1 − ξ(ζ^{-1}(1 − p_B), 1 − p_B) > ξ(ζ^{-1}(p_A), p_A).    (60)

Consider the smoothing functional F_T defined by

    F_T(h)(x, D, z) = h(φ(x, z) | ψ(D, z))    (61)


where φ : X × Z → X and ψ : D^n × Z → D^n are deterministic functions. Let x_test ∈ X and D_train ∈ D^n. Then there exists a base classifier h* such that the ε-smoothed classifier

    g_{h*}^ε(x | D) = E_ε(h*(φ(x, ε) | ψ(D, ε)))    (62)

is (p_A, p_B)-confident at (x_test, D_train) and

    g_{h*}^δ(x_test | D_train)_{c_A} < max_{c ≠ c_A} g_{h*}^δ(x_test | D_train)_c.    (63)

Proof. Fix some arbitrary c_B ≠ c_A and let t_A := ζ^{-1}(p_A) and t_B := ζ^{-1}(1 − p_B). Note that, since ζ is strictly increasing, continuous and ζ(0) = 0, we have ζ(ζ^{-1}(p)) = p for any p and thus P_ε(S̄_{ζ^{-1}(p)}) = ζ(ζ^{-1}(p)) = p. For p ∈ [0, 1] let t_p := ζ^{-1}(p). Then

    ξ(t_p, p) = sup_{S ∈ 𝒮_{t_p, p}} P_δ(S) = P_δ(S̄_{t_p})    (64)

and in particular

    ξ(t_B, 1 − p_B) = P_δ(S̄_{t_B})   and   ξ(t_A, p_A) = P_δ(S̄_{t_A}).    (65)

Let A := S̄_{t_A} and B := S̄_{t_B}. It follows from the assumption that

    1 − P_δ(B) > P_δ(A).    (66)

We will now construct a classifier h* such that g_{h*}^ε is (p_A, p_B)-confident at (x_test, D_train) and

    g_{h*}^δ(x_test | D_train)_{c_B} > g_{h*}^δ(x_test | D_train)_{c_A}.    (67)

For that purpose, let Ψ_0(z) := (φ(x_test, z), ψ(D_train, z)) ∈ X × D^n and consider the equivalence relation

    z ∼ z′  ⟺  Ψ_0(z) = Ψ_0(z′).    (68)

Let τ : Z/∼ → Z be the function which maps each equivalence class to its canonical representative. Denote by Im(Ψ_0) the image of Ψ_0 and consider the map

    Ψ̄_0 : Z/∼ → Im(Ψ_0),   [z] ↦ (Ψ_0 ∘ τ)([z]).    (69)

Note that Ψ̄_0 is bijective and thus has an inverse Ψ̄_0^{-1} : Im(Ψ_0) → Z/∼. Let Π_0 denote a projection from X × D^n onto Im(Ψ_0).³ For x ∈ X, let h* be the function defined by

    h*(x | D)_c =  1{τ((Ψ̄_0^{-1} ∘ Π_0)(x, D)) ∈ A}                      if c = c_A,
                   1{τ((Ψ̄_0^{-1} ∘ Π_0)(x, D)) ∈ B^c}                     if c = c_B,
                   (1/(C − 2)) · 1{τ((Ψ̄_0^{-1} ∘ Π_0)(x, D)) ∈ (A ∪ B^c)^c}  otherwise.    (70)

In order to show that h* ∈ H(X × D^n, S_C), we need to ensure that Σ_c h*(x | D)_c = 1. This follows if we show that A ∩ B^c = ∅. Suppose by contradiction that ∃z ∈ A ∩ B^c and let t_z := μ_δ(z)/μ_ε(z). Then, since ζ is non-decreasing and ζ(ζ^{-1}(p)) = p,

    ζ(t_z) ≤ ζ(t_A) = p_A   and   ζ(t_z) > ζ(t_B) = 1 − p_B    (71)

and thus

    0 = ζ(t_z) − ζ(t_z) > 1 − p_B − p_A ≥ 0,    (72)

³The projection Π_0 is a map Π_0 : X × D^n → Im(Ψ_0) such that Π_0 ∘ Π_0 = Π_0.


which is a contradiction. Hence A ∩ B^c = ∅ and h* is indeed a classifier. Note that for any z ∈ Z we have (Ψ̄_0^{-1} ∘ Ψ_0)(z) = [z] and thus, for any S ⊆ Z and z ∈ Z,

    (Ψ̄_0^{-1} ∘ Π_0)(Ψ_0(z)) = (Ψ̄_0^{-1} ∘ Ψ_0)(z) ∈ S/∼  ⟺  [z] ∈ S/∼  ⟺  z ∈ S.    (73)

Finally, this yields

    g_{h*}^ε(x_test | D_train)_{c_A} = E(h*(φ(x_test, ε) | ψ(D_train, ε))_{c_A}) = E(1{τ((Ψ̄_0^{-1} ∘ Π_0)(Ψ_0(ε))) ∈ A})    (74)
        = P_ε(A) = p_A    (75)

and

    g_{h*}^ε(x_test | D_train)_{c_B} = E(h*(φ(x_test, ε) | ψ(D_train, ε))_{c_B}) = E(1{τ((Ψ̄_0^{-1} ∘ Π_0)(Ψ_0(ε))) ∈ B^c})    (76)
        = P_ε(B^c) = 1 − (1 − p_B) = p_B.    (77)

Similarly, we find

    g_{h*}^δ(x_test | D_train)_{c_A} = P_δ(A)    (78)

and

    g_{h*}^δ(x_test | D_train)_{c_B} = P_δ(B^c) = 1 − P_δ(B).    (79)

Finally, since c_B was arbitrary, the ε-smoothed classifier g_{h*}^ε is (p_A, p_B)-confident at (x_test, D_train) and

    g_{h*}^δ(x_test | D_train)_{c_A} = P_δ(A) < 1 − P_δ(B) = g_{h*}^δ(x_test | D_train)_{c_B}.    (80)

Proposition 2 (restated). Suppose Z = ∏_{i=1}^n R^m. For i = 1, . . . , n let ε_i ~iid N(m_ε, Σ) and δ_i ~iid N(m_δ, Σ) with m_δ ≠ m_ε and covariance matrix Σ. Then condition (12) provides a tight robustness guarantee in the sense of Theorem 2.

Proof. Let A := Σ^{-1} and consider the bilinear form ⟨z, z′⟩_A := z^T A z′. Then, for any z ∈ ∏_{i=1}^n R^m,

    μ_δ(z)/μ_ε(z) = exp( −(1/2) Σ_{i=1}^n ⟨z_i − m_δ, z_i − m_δ⟩_A + (1/2) Σ_{i=1}^n ⟨z_i − m_ε, z_i − m_ε⟩_A )    (81)
        = exp( Σ_{i=1}^n ⟨z_i, m_δ − m_ε⟩_A − (n/2)(⟨m_δ, m_δ⟩_A − ⟨m_ε, m_ε⟩_A) ).    (82)

Note that E(⟨ε_i, m_δ − m_ε⟩_A) = ⟨m_ε, m_δ − m_ε⟩_A and let

    σ²(m_δ, m_ε) = E(⟨ε_i − m_ε, m_δ − m_ε⟩_A²)    (83)

denote the variance of ⟨ε_i, m_δ − m_ε⟩_A. Then, since the ε_i are iid, we get

    ( Σ_{i=1}^n ⟨ε_i − m_ε, m_δ − m_ε⟩_A ) / sqrt(n · σ²(m_δ, m_ε)) ~ N(0, 1)    (84)

and thus

    ζ(t) = P_ε( μ_δ(ε)/μ_ε(ε) ≤ t ) = P_ε( Σ_{i=1}^n ⟨ε_i, m_δ − m_ε⟩_A ≤ log(t) + (n/2)(⟨m_δ, m_δ⟩_A − ⟨m_ε, m_ε⟩_A) )    (85)
        = Φ( (log(t) + (n/2)⟨m_δ − m_ε, m_δ − m_ε⟩_A) / sqrt(n · σ²(m_δ, m_ε)) ).    (86)

We can thus write ζ as Φ(a · log(t) + b) for constants a and b. Since this is a composition of continuous functions, ζ itself is also continuous on R_{>0}. In addition, ζ is strictly increasing and lim_{t→0} ζ(t) = 0. The proposition thus follows from Theorem 2.


E Proofs for Certifying Backdoor Attacks

In this section, we provide proofs for the Corollaries needed to certify robustness against backdoor attacks.

Corollary 1 (restated). Let Z = ∏_{i=1}^n R^d and consider the dataset transform ψ : D^n × Z → D^n defined by

    ψ(D, z) = {(x_i + z_i, y_i)}_{i=1}^n.    (87)

Let ρ : Z → X be any deterministic function and let F be the smoothing functional defined by

    F(h)(x, D, z) = h(x + ρ(z) | ψ(D, z)).    (88)

Let {Z_i}_{i=1}^n be a collection of n iid random variables, Z := (Z_1, . . . , Z_n), and consider the (π + Z)-smoothed classifier given by

    g_{h*}^{π+Z}(x | D) = E(h*(x + ρ(π + Z) | ψ(D, π + Z))).    (89)

Let π_0 ∈ R^d and suppose that g_{h*}^{π+Z} is (p_A, p_B)-confident at (x_test + π_0, D_train), i.e.,

    g_{h*}^{π+Z}(x_test + π_0 | D_train)_{c_A} ≥ p_A ≥ p_B ≥ max_{c ≠ c_A} g_{h*}^{π+Z}(x_test + π_0 | D_train)_c    (90)

for some c_A ∈ Y. If condition (12) holds for the random variables ε := π + Z and δ := Z, then

    g_{h*}^Z(x_test + π_0 | D_train)_{c_A} > max_{c ≠ c_A} g_{h*}^Z(x_test + π_0 | D_train)_c.    (91)

Proof. Let ε := π + Z and δ := Z. Then, for any x ∈ X and D ∈ D^n,

    g_{h*}^ε(x | D) = g_{h*}^{π+Z}(x | D)    (92)

and

    g_{h*}^δ(x | D) = g_{h*}^Z(x | D).    (93)

Thus, since by assumption g_{h*}^{π+Z} is (p_A, p_B)-confident at (x_test + π_0, D_train), applying Theorem 1 to the random variables ε and δ and the smoothing functional F : H(X × D^n, S_C) → H(X × D^n × Z, S_C) given by F(h)(x, D, z) := h(x + ρ(z) | ψ(D, z)) yields

    g_{h*}^Z(x | D_train)_{c_A} = g_{h*}^δ(x | D_train)_{c_A} > max_{c ≠ c_A} g_{h*}^δ(x | D_train)_c = max_{c ≠ c_A} g_{h*}^Z(x | D_train)_c    (94)

if condition (12) is satisfied, completing the proof.

Corollary 2 (restated). Consider the setting in Corollary 1. Suppose that ∀i : Z_i ~iid N(0, σ²·1_d). If g_{h*}^{π+Z} is (p_A, p_B)-confident at (x_test + π_0, D_train), and π := (π_1, . . . , π_n) satisfies

    sqrt( Σ_{i=1}^n ‖π_i‖²₂ ) < (σ/2)(Φ^{-1}(p_A) − Φ^{-1}(p_B)),    (95)

then

    g_{h*}^Z(x_test + π_0 | D_train)_{c_A} > max_{c ≠ c_A} g_{h*}^Z(x_test + π_0 | D_train)_c.    (96)

Proof. Let ε := π + Z and δ := Z. Since the classifier g_{h*}^{π+Z} is (p_A, p_B)-confident at (x_test + π_0, D_train), by Corollary 1 the proof is complete once we show that (12) reduces to (95). For that purpose, let A :=


A := diag(σ^{-2}, . . . , σ^{-2}) ∈ R^{d×d} and consider the bilinear form ⟨z_1, z_2⟩_A := z_1^T A z_2 for z_1, z_2 ∈ R^d. Note that for z ∈ Z,

    μ_δ(z)/μ_ε(z) = ∏_{i=1}^n exp(−(1/2)⟨z_i, z_i⟩_A) / exp(−(1/2)⟨z_i − π_i, z_i − π_i⟩_A) = exp( Σ_{i=1}^n (⟨z_i, −π_i⟩_A + (1/2)⟨π_i, π_i⟩_A) ).    (97)

Note that

    ( Σ_{i=1}^n [⟨ε_i, −π_i⟩_A − ⟨π_i, −π_i⟩_A] ) / sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) ~ N(0, 1)   and   ( Σ_{i=1}^n ⟨δ_i, −π_i⟩_A ) / sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) ~ N(0, 1).    (98)

We compute ζ as

    ζ(t) = P_ε(S̄_t) = P( μ_δ(ε)/μ_ε(ε) ≤ t ) = P( Σ_{i=1}^n (⟨ε_i, −π_i⟩_A + (1/2)⟨π_i, π_i⟩_A) ≤ log(t) )    (99)
        = Φ( (log(t) + (1/2) Σ_{i=1}^n ⟨π_i, π_i⟩_A) / sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) ).    (100)

Since ζ is strictly increasing and continuous on R_{≥0}, we compute its inverse for p ∈ [0, 1] as

    ζ^{-1}(p) = exp( Φ^{-1}(p) · sqrt( Σ_{i=1}^n ⟨π_i, π_i⟩_A ) − (1/2) Σ_{i=1}^n ⟨π_i, π_i⟩_A ).    (101)

Note that the sets ∂_t := S̄_t \ S_t have measure 0 under P_ε and P_δ. Thus, by Lemma A.4, for any p ∈ [0, 1], ξ evaluated at (t_p, p) := (ζ^{-1}(p), p) has the form

    ξ(t_p, p) = P_δ(S̄_{t_p}) = P( Σ_{i=1}^n (⟨δ_i, −π_i⟩_A + (1/2)⟨π_i, π_i⟩_A) ≤ log(t_p) )    (102)
        = Φ( (log(t_p) − (1/2) Σ_{i=1}^n ⟨π_i, π_i⟩_A) / sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) ) = Φ( Φ^{-1}(p) − sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) ).    (103)

Thus, computing ξ at (ζ^{-1}(1 − p_B), 1 − p_B) yields

    1 − ξ(ζ^{-1}(1 − p_B), 1 − p_B) = Φ( sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) + Φ^{-1}(p_B) ).    (104)

Similarly, computing ξ at (ζ^{-1}(p_A), p_A) yields

    ξ(ζ^{-1}(p_A), p_A) = Φ( Φ^{-1}(p_A) − sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) ).    (105)

Finally, condition (12) is satisfied if and only if

    sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ) + Φ^{-1}(p_B) < Φ^{-1}(p_A) − sqrt( Σ_{j=1}^n ⟨π_j, π_j⟩_A ).    (106)


Since ∀i : ⟨π_i, π_i⟩_A = σ^{-2}‖π_i‖²₂, this is equivalent to

    sqrt( Σ_{i=1}^n ‖π_i‖²₂ ) < (σ/2)(Φ^{-1}(p_A) − Φ^{-1}(p_B)),    (107)

concluding the proof.
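As a quick numerical sanity check of this bound (hypothetical values, not numbers from the paper's experiments):

```python
from scipy.stats import norm

# Hypothetical: sigma = 1.0, p_A = 0.9, p_B = 0.1 (already confidence-corrected)
sigma, p_A, p_B = 1.0, 0.9, 0.1
radius = 0.5 * sigma * (norm.ppf(p_A) - norm.ppf(p_B))
print(radius)  # ~1.28: any backdoor with sqrt(sum_i ||pi_i||_2^2) below this is certified
```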

Corollary 3 (restated). Consider the setting in Corollary 1. Suppose that ∀i : Z_i ~iid U([a, b]^d) for finite a < b. If g_{h*}^{π+Z} is (p_A, p_B)-confident at (x_test + π_0, D_train), and π = (π_1, . . . , π_n) satisfies

    1 − (p_A − p_B)/2 < ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+,    (108)

then

    g_{h*}^Z(x_test + π_0 | D_train)_{c_A} > max_{c ≠ c_A} g_{h*}^Z(x_test + π_0 | D_train)_c.    (109)

Proof. Let ε := π + Z and δ := Z. Since the classifier g_{h*}^{π+Z} is (p_A, p_B)-confident at (x_test + π_0, D_train), by Corollary 1 the proof is complete once we show that (12) reduces to (108). For that purpose, let I_{ε_i} := ∏_{j=1}^d [a + π_{i,j}, b + π_{i,j}], I_{δ_i} := [a, b]^d and

    I_ε = ∏_{i=1}^n I_{ε_i},   I_δ = ∏_{i=1}^n I_{δ_i}.    (110)

For z ∈ Z, consider

    μ_ε(z) = ∏_{i=1}^n μ_{ε_i}(z_i) = (b − a)^{−d·n} if z ∈ I_ε, and 0 otherwise,    (111)

    μ_δ(z) = ∏_{i=1}^n μ_{δ_i}(z_i) = (b − a)^{−d·n} if z ∈ I_δ, and 0 otherwise.    (112)

Let S_0 := I_ε \ I_δ. Then, for any z ∈ I_ε ∪ I_δ,

    μ_δ(z)/μ_ε(z) = 0 if z ∈ S_0,   1 if z ∈ I_ε ∩ I_δ,   ∞ if z ∈ I_δ \ I_ε.    (113)

Note that

    P_ε(S_0) = 1 − P_ε(I_δ) = 1 − ∏_{i=1}^n ∏_{j=1}^d P(a ≤ π_{i,j} + Z_{i,j} ≤ b)    (114)
        = 1 − ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+ =: p_0,    (115)

where (x)_+ = max{x, 0}. We then compute ζ for t ≥ 0:

    ζ(t) = P_ε( μ_δ(ε)/μ_ε(ε) ≤ t ) = P_ε(S_0) if t < 1, and P_ε(I_ε) if t ≥ 1    (116)
        = p_0 if t < 1, and 1 if t ≥ 1.    (117)


Recall that ζ^{-1}(p) := inf{t | ζ(t) ≥ p} for p ∈ [0, 1], and hence

    ζ^{-1}(p) = 0 if p ≤ p_0, and 1 if p > p_0.    (118)

In order to evaluate ξ, we need to compute the lower and strict lower level sets at t = ζ^{-1}(p). Recall that S_t = {z ∈ Z | μ_δ(z)/μ_ε(z) < t} and S̄_t = {z ∈ Z | μ_δ(z)/μ_ε(z) ≤ t}, and consider

    S_{ζ^{-1}(p)} = ∅ if p ≤ p_0, S_0 if p > p_0;   S̄_{ζ^{-1}(p)} = S_0 if p ≤ p_0, I_ε if p > p_0.    (119)

Suppose p ≤ p_0. Then S_{ζ^{-1}(p)} = ∅ and S̄_{ζ^{-1}(p)} = S_0, and hence p ≤ p_0 implies

    ξ(ζ^{-1}(p), p) = sup{P_δ(S) | S ⊆ S_0 ∧ P_ε(S) ≤ p} = 0.    (120)

Condition (12) can thus only be satisfied if p_A > p_0 and 1 − p_B > p_0. Note that if p > p_0, then S_{ζ^{-1}(p)} = S_0 and S̄_{ζ^{-1}(p)} = I_ε. For p ∈ [0, 1] let 𝒮_p = {S ⊆ Z | S_0 ⊆ S ⊆ I_ε ∧ P_ε(S) ≤ p}. Thus

    p > p_0  ⇒  ξ(ζ^{-1}(p), p) = sup_{S ∈ 𝒮_p} P_δ(S).    (121)

We can write any S ∈ 𝒮_p as the disjoint union S = S_0 ∪ S_1 for some S_1 ⊆ I_ε ∩ I_δ such that P_ε(S_0 ∪ S_1) ≤ p. Note that P_δ(S_0) = 0 and that for any z ∈ S_1 we have μ_ε(z) = μ_δ(z). Hence

    P_δ(S) = P_δ(S_1) = P_ε(S_1) ≤ p − P_ε(S_0) = p − p_0.    (122)

Thus, the supremum of the left-hand side over all S ∈ 𝒮_p equals the supremum of the right-hand side over all S_1 ∈ {S′ ⊆ I_ε ∩ I_δ | P_ε(S′) ≤ p − P_ε(S_0)} =: 𝒮_{0,p}, and hence

    sup_{S ∈ 𝒮_p} P_δ(S) = sup_{S_1 ∈ 𝒮_{0,p}} P_δ(S_1) = p − p_0.    (123)

Hence, computing ξ at (ζ^{-1}(p_A), p_A) yields

    ξ(ζ^{-1}(p_A), p_A) = p_A − p_0 = p_A − 1 + ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+    (124)

and computing ξ at (ζ^{-1}(1 − p_B), 1 − p_B) yields

    1 − ξ(ζ^{-1}(1 − p_B), 1 − p_B) = 1 − (1 − p_B − p_0) = p_B + 1 − ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+.    (125)

Finally, condition (12) is satisfied whenever π satisfies

    2 − p_A + p_B < 2 · ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+,    (126)

which is equivalent to

    1 − (p_A − p_B)/2 < ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+,    (127)

concluding the proof.


Corollary 4 (restated). Consider the setting in Corollary 3. Let π_0 ∈ R^d be a backdoor pattern and suppose that π contains π_0 exactly r times and the 0-vector n − r times. Then condition (27) is satisfied if π_0 satisfies

    ‖π_0‖_∞ < (b − a)(1 − (1 − (p_A − p_B)/2)^{1/(d·r)}).    (128)

Proof. Note that if ‖π_0‖_∞ > b − a, then π + Z and Z have disjoint support and (27) cannot be satisfied, by Proposition 1. Hence, without loss of generality, assume that ‖π_0‖_∞ ≤ b − a. Since π contains π_0 exactly r times and is 0 otherwise, we find

    ∏_{i=1}^n ∏_{j=1}^d (1 − |π_{i,j}|/(b − a))_+ = ∏_{j=1}^d (1 − |π_{0,j}|/(b − a))^r.    (129)

Note that ∀j : |π_{0,j}| ≤ ‖π_0‖_∞ and hence

    ∏_{j=1}^d (1 − |π_{0,j}|/(b − a))^r ≥ (1 − ‖π_0‖_∞/(b − a))^{d·r}.    (130)

Finally, algebra shows that

    (1 − ‖π_0‖_∞/(b − a))^{d·r} > 1 − (p_A − p_B)/2    (131)

is equivalent to

    ‖π_0‖_∞ < (b − a)(1 − (1 − (p_A − p_B)/2)^{1/(d·r)}),    (132)

concluding the proof.
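A small numerical illustration of this bound (hypothetical values, not numbers from the paper's experiments):

```python
# Hypothetical: uniform smoothing noise on [a, b] = [0, 2], d = 784 (MNIST-sized inputs),
# r = 100 poisoned instances, and confidence gap p_A - p_B = 0.4.
a, b, d, r, gap = 0.0, 2.0, 784, 100, 0.4
bound = (b - a) * (1.0 - (1.0 - gap / 2.0) ** (1.0 / (d * r)))
print(bound)  # ~5.7e-06: the certifiable L_inf trigger magnitude shrinks quickly with d*r
```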

F Proofs: Smoothed 1-NN Classifiers

Proposition 3 (restated). Let Z_i ~iid N(0, σ²·1) for i = 1, . . . , n and let Z = (Z_1, . . . , Z_n). Consider the smoothing functional

    F_P(h)(x, D, z) = h(x | ψ(D, z))    (133)

where ψ(D, z) = {(x_i + z_i, y_i) | (x_i, y_i) ∈ D}. For backdoor patterns π ∈ ∏_{i=1}^n R^d, we can evaluate the (π + Z)-smoothed classifier g_{h_1}^{π+Z} according to

    g_{h_1}^{π+Z}(x_test | D_train)_c = Σ_{i : y_i=c} Σ_{l=1}^{L−1} p_{il} · ( ∏_{j=1}^{i−1} Σ_{r=l+1}^{L} p_{jr} ) · ( ∏_{j=i+1}^{n} Σ_{r=l}^{L} p_{jr} )    (134)

where

    p_{il} = F_{d,λ_i}(b_l/σ²) − F_{d,λ_i}(b_{l−1}/σ²)    (135)

and F_{d,λ_i} denotes the CDF of the non-central χ²-distribution with d degrees of freedom and non-centrality parameter λ_i = ‖x_i + π_i − x_test‖²₂.


Proof. Recall that i*(x_test, D_train) denotes the index of the training instance in D_train most similar to x_test, i.e. i*(x_test, D_train) = arg min_{i=1,...,n} s_L(x_i, x_test), where s_L is the quantized Euclidean distance. We write y(i) = y_i to denote the class of a given training instance x_i ∈ D_train. The (π + Z)-smoothed classifier g_{h_1}^{π+Z} smoothed with the functional (133) is then given by

    g_{h_1}^{π+Z}(x_test | D_train)_c = E(h_1(x_test | ψ(D_train, π + Z))_c) = P_Z( y(i*(x_test, ψ(D_train, π + Z))) = c ),    (136)

where ψ(D_train, π + Z) = {(x_i + π_i + Z_i, y_i) | (x_i, y_i) ∈ D_train}. For ease of notation, let s_i := s_L(x_i + π_i + Z_i, x_test) and abbreviate i*(Z) := i*(x_test, ψ(D_train, π + Z)). Note that for any class c,

    P_Z(y(i*(Z)) = c) = Σ_{i : y(i)=c} P_Z(i*(Z) = i) = Σ_{i : y(i)=c} Σ_{l=1}^{L} P_Z(i*(Z) = i | s_i = β_l) · P_Z(s_i = β_l).    (137)

Let p_{il} = P_Z(s_i = β_l) and note that

    ‖x_i + π_i + Z_i − x_test‖²₂ = Σ_{j=1}^d (x_{ij} + π_{ij} − x_{test,j} + Z_{ij})².    (138)

Since the Z_{ij} are iid Gaussian with mean 0 and variance σ², the sum in (138) scaled by 1/σ² follows a non-central χ²-distribution with non-centrality parameter λ_i := ‖x_i + π_i − x_test‖²₂ and d degrees of freedom. Denote its CDF by F_{d,λ_i}. Then

    p_{il} = P_Z( ‖x_i + π_i + Z_i − x_test‖²₂ ∈ I_l ) = F_{d,λ_i}(b_l/σ²) − F_{d,λ_i}(b_{l−1}/σ²).    (139)

Computing the conditional probabilities in (137) yields

    P_Z(i*(Z) = i | s_i = β_l) = ( ∏_{j=1}^{i−1} P_Z(s_j > β_l) ) · ( ∏_{j=i+1}^{n} P_Z(s_j ≥ β_l) )    (140)
        = ( ∏_{j=1}^{i−1} Σ_{r=l+1}^{L} p_{jr} ) · ( ∏_{j=i+1}^{n} Σ_{r=l}^{L} p_{jr} )    (141)

and thus

    g_{h_1}^{π+Z}(x_test | D_train)_c = Σ_{i : y_i=c} Σ_{l=1}^{L−1} p_{il} ( ∏_{j=1}^{i−1} Σ_{r=l+1}^{L} p_{jr} ) ( ∏_{j=i+1}^{n} Σ_{r=l}^{L} p_{jr} )    (142)

concluding the proof.
