Provably robust deep learning
J. Zico Kolter
Carnegie Mellon University and Bosch Center for AI
Wooaah...
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
The AI breakthrough (some recent history)
[Figures: Karras et al., 2018; Radford et al., 2019; Vinyals et al., 2019]
…but the stakes are low
Adversarial attacks
[Figures: Sharif et al., 2016; Evtimov et al., 2017; Athalye et al., 2017; figure from Madry et al.]
… and some recent work
[Lee and Kolter, 2019], https://arxiv.org/abs/1906.11897
Why should we care?
…you probably don't have an adversary changing inputs to your classifier at the pixel level (or if you do, you have bigger problems)
1. Genuine security implications for deep networks (e.g., with physical attacks)
2. Says something fundamental about the representations of deep classifiers, the smoothness of their decision boundaries, their sensitivity to distribution shift (within the threat model), etc.
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
Adversarial attacks as optimization
Instead of evaluating the standard expected loss

$$\mathbf{E}_{x,y}\big[\operatorname{Loss}(f_\theta(x), y)\big],$$

an adversarial attack considers the worst-case loss within an allowed perturbation set Δ:

$$\mathbf{E}_{x,y}\Big[\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\Big]$$
The adversarial optimization problem
How do we solve the "inner" optimization problem

$$\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\,?$$
Key insight: the same process that enabled us to learn the model parameters via gradient descent also allows us to create an adversarial example via gradient descent
$$\frac{\partial}{\partial \delta} \operatorname{Loss}(f_\theta(x + \delta), y)$$
Solving with projected gradient descent
Since we are trying to maximize the loss when creating an adversarial example, we repeatedly move in the direction of the positive gradient.
Because we also need to ensure that δ ∈ Δ, we project back onto this set after each step, a process known as projected gradient descent (PGD):

$$\delta := \operatorname{Proj}_{\Delta}\left(\delta + \alpha \frac{\partial}{\partial \delta} \operatorname{Loss}(f_\theta(x + \delta), y)\right)$$

Example: for Δ = {δ : ‖δ‖∞ ≤ ε} (called the ℓ∞ ball), the projection operator just clips each coordinate to [−ε, ε].
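To make the update concrete, here is a minimal PyTorch sketch of ℓ∞ PGD exactly as described above (plain gradient steps followed by a clipping projection); the classifier `model`, the cross-entropy loss, and all hyperparameters are illustrative assumptions rather than the exact setup from the talk:

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, epsilon=0.1, alpha=0.01, num_steps=40):
    """Approximately maximize Loss(f(x + delta), y) over ||delta||_inf <= epsilon."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        # Gradient *ascent* step on delta, then project back onto the l_inf ball
        # (for the l_inf ball, the projection is just coordinate-wise clipping).
        delta.data = (delta + alpha * delta.grad.detach()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()
```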
The Fast Gradient Sign Method
The Fast Gradient Sign Method (FGSM) takes a single PGD step with step size α → ∞, which, after projecting back onto the ℓ∞ ball, corresponds exactly to taking a step of size ε in the signs of the gradient terms.
This creates weaker attacks than running full PGD, but is substantially faster.
[Figure: a single step of size α ∂/∂δ Loss(f_θ(x + δ), y) taken from δ = 0 and projected onto Δ]
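A corresponding FGSM sketch (one step in the sign of the gradient, scaled to the radius ε; same illustrative assumptions as the PGD sketch above):

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, epsilon=0.1):
    """Single-step attack: epsilon times the sign of the gradient w.r.t. the input."""
    delta = torch.zeros_like(x, requires_grad=True)
    nn.CrossEntropyLoss()(model(x + delta), y).backward()
    return epsilon * delta.grad.detach().sign()
```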
Illustration of adversarial examples
We will demonstrate adversarial attacks on the MNIST data set, using two different architectures.
2-layer fully connected MLP: FC-100, FC-10
6-layer ConvNet: Conv-32x28x28, Conv-32x28x28, Conv-64x14x14, Conv-64x14x14, FC-200, FC-10
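For concreteness, one possible PyTorch rendering of these two architectures (the layer sizes follow the slide; kernel sizes, padding, and strides are assumptions):

```python
import torch.nn as nn

# 2-layer fully connected MLP (28x28 MNIST images flattened to 784 inputs)
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# 6-layer ConvNet: two 32-channel convs at 28x28, two 64-channel convs at 14x14,
# then FC-200 and FC-10
convnet = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),              # 32 x 28 x 28
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),             # 32 x 28 x 28
    nn.Conv2d(32, 64, 3, padding=1, stride=2), nn.ReLU(),   # 64 x 14 x 14
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),             # 64 x 14 x 14
    nn.Flatten(),
    nn.Linear(64 * 14 * 14, 200), nn.ReLU(),
    nn.Linear(200, 10),
)
```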
Illustrations of FGSM/PGD
Test error at ε = 0.1:
          Clean    FGSM     PGD
MLP       2.9%     92.6%    96.4%
ConvNet   1.1%     41.7%    74.3%

[Images: adversarial examples for the ConvNet under FGSM and PGD]
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
Adversarial robustness
$$\min_\theta \; \mathbf{E}_{x,y}\big[\operatorname{Loss}(f_\theta(x), y)\big] \;\;\Longrightarrow\;\; \min_\theta \; \mathbf{E}_{x,y}\Big[\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\Big]$$
1. Adversarial training: Take model SGD steps at (approximate) worst-case perturbations [Goodfellow et al., 2015; Kurakin et al., 2016; Madry et al., 2017]
2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter, 2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
Adversarial training
How do we optimize the objective

$$\min_\theta \; \sum_{(x,y) \in D} \, \max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\,?$$
We would like to solve it with gradient descent, but how do we compute the gradient of the objective with the max term inside?
Danskin's Theorem
A fundamental result in optimization:

$$\frac{\partial}{\partial \theta} \max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y) = \frac{\partial}{\partial \theta} \operatorname{Loss}(f_\theta(x + \delta^\star), y), \quad \text{where } \delta^\star = \operatorname*{argmax}_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)$$
This seems "obvious," but it is a very subtle result; it means we can optimize through the max by just finding its maximizing value.
Note, however, that it only applies when the max is performed exactly.
Adversarial training
Repeat:
1. Select a minibatch B
2. For each (x, y) ∈ B, compute an adversarial example δ*(x)
3. Update the parameters:

$$\theta := \theta - \frac{\alpha}{|B|} \sum_{(x,y) \in B} \frac{\partial}{\partial \theta} \operatorname{Loss}\big(f_\theta(x + \delta^\star(x)), y\big)$$
Common to also mix robust/standard updates (not done in our case)
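A hedged sketch of this loop in PyTorch, reusing the `pgd_linf` attack sketched earlier; the optimizer, data loader, and ε are placeholders:

```python
import torch.nn as nn

def adversarial_training_epoch(model, loader, opt, epsilon=0.1):
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:                                    # 1. select a minibatch
        delta = pgd_linf(model, x, y, epsilon=epsilon)     # 2. (approximate) worst-case perturbations
        opt.zero_grad()                                    # clear gradients accumulated by the attack
        loss = loss_fn(model(x + delta), y)                # 3. loss at the perturbed inputs
        loss.backward()
        opt.step()
```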
Test error at ε = 0.1:
                  Clean    FGSM     PGD
ConvNet           1.1%     41.7%    74.4%
Robust ConvNet    0.9%     2.6%     2.8%
Evaluating robust models
Our model looks good, but we should be careful about declaring success.
We need to evaluate against different attacks: PGD attacks run for longer, with random restarts, etc.
Note: it is not particularly informative to evaluate against a different type of attack, e.g., evaluating an ℓ∞-robust model against ℓ1 or ℓ2 attacks.
Adversarial robustness
$$\min_\theta \; \mathbf{E}_{x,y}\big[\operatorname{Loss}(f_\theta(x), y)\big] \;\;\Longrightarrow\;\; \min_\theta \; \mathbf{E}_{x,y}\Big[\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\Big]$$
1. Adversarial training: Take model SGD steps at (approximate) worst-case perturbations [Goodfellow et al., 2015; Kurakin et al., 2016; Madry et al., 2017]
2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter, 2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
Provable defenses
$$\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y) \;\leq\; \max_{\delta \in \Delta} \operatorname{Loss}\big(f^{\text{rel}}_\theta(x + \delta), y\big) \;\leq\; \operatorname{Loss}\big(f^{\text{dual}}_\theta(x, \Delta), y\big)$$
[Figure: convex relaxations of the ReLU between lower and upper bounds ℓ ≤ z ≤ u on its input]
Dual from [Wong and Kolter, 2018], also independently derived via hybrid zonotope [Mirman et al., 2018] and forward Lipschitz arguments [Weng et al., 2018]
Maximization problem is now a convex linear program [Wong and Kolter, 2018]
[Wong and Kolter, 2018], https://arxiv.org/abs/1711.00851
Robust optimization: putting it all together
In the end, instead of minimizing the traditional loss…

$$\operatorname*{minimize}_\theta \; \sum_{i=1}^{m} \ell\big(h_\theta(x_i), y_i\big)$$
…we just minimize our computed bound on loss, implemented in an auto-differentiation framework (PyTorch), and we get a guaranteed bound on worst-case loss (or error) for any norm-bounded adversarial attack
$$\operatorname*{minimize}_\theta \; \sum_{i=1}^{m} \ell\big(J_{\epsilon,\theta}(x_i), y_i\big) \;\;\geq\;\; \operatorname*{minimize}_\theta \; \sum_{i=1}^{m} \max_{\delta \in \Delta} \ell\big(h_\theta(x_i + \delta), y_i\big)$$

where J_{ε,θ} denotes the computed (dual) bound on the network output.
Full code available at https://github.com/locuslab/convex_adversarial
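The exact dual bound is somewhat involved; purely as an illustration of the idea of "training on a bound," here is a much cruder interval-propagation bound (in the spirit of [Mirman et al., 2018], not the dual LP of [Wong and Kolter, 2018]) for a Sequential Linear/ReLU network. All function names here are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interval_bounds(model, x, epsilon):
    """Propagate elementwise [lower, upper] bounds for ||delta||_inf <= epsilon."""
    lower, upper = x - epsilon, x + epsilon
    for layer in model:
        if isinstance(layer, nn.Linear):
            mid, rad = (upper + lower) / 2, (upper - lower) / 2
            mid = layer(mid)                       # W @ mid + b
            rad = rad @ layer.weight.abs().t()     # |W| @ rad
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
        elif isinstance(layer, nn.Flatten):
            lower, upper = layer(lower), layer(upper)
        else:
            raise NotImplementedError(f"unsupported layer {type(layer)}")
    return lower, upper

def robust_loss_bound(model, x, y, epsilon):
    """Upper bound on the worst-case cross-entropy: true-class logit at its lower
    bound, all other logits at their upper bounds."""
    lower, upper = interval_bounds(model, x, epsilon)
    true_class = F.one_hot(y, num_classes=upper.shape[1]).bool()
    worst_logits = torch.where(true_class, lower, upper)
    return F.cross_entropy(worst_logits, y)
```

Minimizing `robust_loss_bound` plays the same role as minimizing ℓ(J_{ε,θ}(x_i), y_i) above, just with a much looser bound; the actual dual bound is implemented in the linked repository.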
2D Toy Example
Simple 2D toy problem: a 2-100-100-100-2 MLP network, trained with Adam (learning rate = 0.001, no hyperparameter tuning)
[Figure: decision regions under standard training vs. robust convex training on the 2D toy problem]
Standard and robust errors on MNIST 𝜖 = 0.1
                            Error    Guaranteed robust error bound
Standard CNN                1.10%    100%
Robust linear classifier    17%      44%
Our method (CNN)            1.10%    3.70%
MNIST Attacks
We can also look at how well real attacks perform at ε = 0.1
                     No attack    FGSM    PGD     Robust bound
Standard training    1.1%         50%     82%     100%
Our method           1.1%         2.1%    2.8%    3.7%
What causes adversarial examples?
Adversarial examples are caused (informally) by small regions of the adversarial class "jutting" into an otherwise "nice" decision region (see also, e.g., [Roth et al., 2019])
[Figure legend: data point, correct class, incorrect class]
Randomization as a defense?
We can "smooth" this decision region by adding Gaussian noise to the input and picking the majority class of the classifier over this noise.
This was proposed (in many different ways) as a heuristic defense, but [Lecuyer et al, 2018] and later [Li et al., 2018] demonstrated that it gives certified bounds; we simplify and tighten this analysis in [Cohen et al., 2019]
$$g(x) = \operatorname*{argmax}_{y} \; \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = y\big]$$

where f(x) is the base classifier and g(x) is the smoothed classifier.
Visual intuition of randomized smoothing
To classify panda images, classify a bunch of versions perturbed by random noise and take the majority vote.
Note that this requires that our "base" classifier f be able to classify noisy images well (in practice, this means we also need to train on these noisy images).
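A minimal Monte Carlo sketch of this majority vote (assuming a PyTorch base classifier `f` that maps a batch of images to logits; σ and the sample count are placeholders):

```python
import torch

def smoothed_predict(f, x, sigma=0.25, n_samples=1000):
    """Estimate g(x) by classifying noisy copies of x and taking the majority vote."""
    with torch.no_grad():
        noise = sigma * torch.randn(n_samples, *x.shape)
        logits = f(x.unsqueeze(0) + noise)         # classify n noisy copies of x
        votes = logits.argmax(dim=1)
        counts = votes.bincount(minlength=logits.shape[1])
    return counts.argmax().item()
```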
The randomized smoothing guarantee
Theorem (binary case):
• Given some input x, let y = g(x) be the prediction of the smoothed classifier, and let p > 1/2 be the associated probability of this class under the smoothing distribution,
$$p = \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = y\big]$$
• Then g(x + δ) = y (i.e., the smoothed classifier is robust) for any δ such that
$$\|\delta\|_2 \leq \sigma \Phi^{-1}(p)$$
where Φ^{-1} is the inverse of the standard Gaussian CDF.
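For a concrete sense of scale (illustrative numbers, not from the talk): with σ = 0.25, a vote probability of p = 0.99 certifies a radius R = σΦ^{-1}(p) ≈ 0.25 · 2.33 ≈ 0.58, while p = 0.75 certifies only R ≈ 0.25 · 0.67 ≈ 0.17.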
Proof of certified robustness
Reasonable question: why can performance on random noise tell us anything about performance under adversarial noise?
Proof of theorem (informal):
• Suppose I have two points x and x + δ, and you, an adversary, want to craft a decision boundary for the underlying classifier f(x) such that:
1. x is classified one way by the smoothed classifier g(x)
2. x + δ is classified differently by the smoothed classifier g(x)
Proof of certified robustness (cont)
[Figure sequence: candidate decision boundaries for the base classifier f(x) and the corresponding smoothed classifier g(x), each evaluated at x and x + δ]
For a linear classifier, we can compute the ℓ2 distance to the worst-case boundary exactly:
$$R = \sigma \Phi^{-1}(p)$$
where p is the probability of the majority class; this implies that any perturbation with ‖δ‖₂ ≤ R cannot change the class label. ∎
(The reduction from an arbitrary base classifier to this worst-case linear boundary follows from the Neyman-Pearson lemma in hypothesis testing. See also [Li and Kuelbs, 1998]; thanks to Ludwig Schmidt for pointing out the reference.)
Caveats (a.k.a. the fine print)
The procedure here only guarantees robustness for the smoothed classifier g, not for the underlying classifier f.
The probability p of correct classification under smoothing cannot be computed exactly (the exact convolution of a Gaussian with a neural network is intractable):
• In practice, we need to resort to Monte Carlo estimates to compute a lower bound on p and certify the prediction (we need a lot of samples to compute the certified radius, though far fewer just to compute the prediction)
• Bounds hold with high probability over the (internal) randomness of the sampling
We are certifying a tiny radius compared to the noise distribution.
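A hedged sketch of this Monte Carlo certification (a Clopper-Pearson lower bound on p via scipy.stats.beta is one standard choice; the sample size, abstention rule, and reuse of the same samples for selection and estimation are simplifications relative to the actual procedure in [Cohen et al., 2019]):

```python
import torch
from scipy.stats import beta, norm

def certify(f, x, sigma=0.25, n=10_000, alpha=0.001):
    """Return (predicted class, certified l2 radius), valid with probability >= 1 - alpha
    over the sampling. In practice the n noisy copies would be classified in batches."""
    with torch.no_grad():
        noise = sigma * torch.randn(n, *x.shape)
        preds = f(x.unsqueeze(0) + noise).argmax(dim=1)
    top_class = preds.bincount().argmax().item()
    k = (preds == top_class).sum().item()
    p_lower = beta.ppf(alpha, k, n - k + 1)   # Clopper-Pearson lower bound on p
    if p_lower <= 0.5:
        return None, 0.0                      # abstain: cannot certify
    return top_class, sigma * norm.ppf(p_lower)
```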
Comparison to previous SOTA on CIFAR10
For identical networks, randomized smoothing mostly outperforms the previous SOTA for ℓ2 robustness, but it also scales to much larger networks (where it uniformly outperforms duality-based approaches).
Performance on ImageNet
Example: we can certify that the smoothed classifier has top-1 accuracy of 37% under any perturbation with ‖δ‖₂ ≤ 1 (in normalized pixels, i.e., RGB values in [0, 1]).
Future and ongoing work
Extension to other perturbation norms besides ℓ2?
• Seems extremely challenging (possibly impossible under certain assumptions), e.g., we can't do better than naive d^{1/2} scaling for the ℓ∞ norm
A strange property:
• Previous work on LP bounds was extremely specific to neural networks
• The smoothing work never uses the fact that the base classifier is a neural network
My best guess for a way forward: we need to use model information to extract properties of the base classifier beyond the single probability p, and use these to get better bounds.
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
Robust artificial intelligence
Deep learning is making amazing strides, but we have a long way to go before we can build deep learning systems that achieve even "small" degrees of robustness and adaptability compared to what humans take for granted.
Resources:
• http://zicokolter.com – Web page with all papers
• http://github.com/locuslab – Code associated with all papers
• http://adversarial-ml-tutorial.org – Tutorial/code on adversarial robustness
• http://locuslab.github.io – Group blog