Provably robust deep learning
J. Zico Kolter
Carnegie Mellon University and Bosch Center for AI
Wooaah...
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
The AI breakthrough (some recent history)
[Figures: Karras et al., 2018; Radford et al., 2019; Vinyals et al., 2019]
…but the stakes are low
Adversarial attacks
[Figures: Sharif et al., 2016; Evtimov et al., 2017; Athalye et al., 2017; figure from Madry et al.]
… and some recent work
[Lee and Kolter, 2019], https://arxiv.org/abs/1906.11897
Why should we care?
…you probably don't have an adversary changing inputs to your classifier at the pixel level (or if you do, you have bigger problems)
1. Genuine security implications for deep networks (e.g., with physical attacks)
2. Says something fundamental about the representations of deep classifiers, the smoothness of their decision boundaries, their sensitivity to distribution shift (within the threat model), etc.
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
Adversarial attacks as optimization
Instead of evaluating the standard expected loss

$$\mathbf{E}_{x,y}\big[\operatorname{Loss}(f_\theta(x), y)\big],$$

an adversarial attack considers the worst-case loss within an allowed perturbation set Δ:

$$\mathbf{E}_{x,y}\Big[\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\Big]$$
The adversarial optimization problem
How do we solve the "inner" optimization problem

$$\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\,?$$
Key insight: the same process that enabled us to learn the model parameters via gradient descent also allows us to create an adversarial example via gradient descent
$$\frac{\partial}{\partial \delta} \operatorname{Loss}(f_\theta(x + \delta), y)$$
Solving with projected gradient descent
Since we are trying to maximize the loss when creating an adversarial example, we repeatedly move in the direction of the positive gradient.
Because we also need to ensure that δ ∈ Δ, we project back onto this set after each step, a process known as projected gradient descent (PGD):

$$\delta := \operatorname{Proj}_{\Delta}\left(\delta + \alpha \frac{\partial}{\partial \delta} \operatorname{Loss}(f_\theta(x + \delta), y)\right)$$

Example: for Δ = {δ : ‖δ‖∞ ≤ ε} (called the ℓ∞ ball), the projection operator just clips each coordinate to [−ε, ε].
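To make the update concrete, here is a minimal PyTorch sketch of ℓ∞ PGD exactly as described above (plain gradient steps followed by a clipping projection); the classifier `model`, the cross-entropy loss, and all hyperparameters are illustrative assumptions rather than the exact setup from the talk:

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, epsilon=0.1, alpha=0.01, num_steps=40):
    """Approximately maximize Loss(f(x + delta), y) over ||delta||_inf <= epsilon."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        # Gradient *ascent* step on delta, then project back onto the l_inf ball
        # (for the l_inf ball, the projection is just coordinate-wise clipping).
        delta.data = (delta + alpha * delta.grad.detach()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()
```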
The Fast Gradient Sign Method
The Fast Gradient Sign Method (FGSM) takes a single PGD step with step size α → ∞, which, after projecting back onto the ℓ∞ ball, corresponds exactly to taking a step of size ε in the signs of the gradient terms.
This creates weaker attacks than running full PGD, but is substantially faster.
[Figure: a single step of size α ∂/∂δ Loss(f_θ(x + δ), y) taken from δ = 0 and projected onto Δ]
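A corresponding FGSM sketch (one step in the sign of the gradient, scaled to the radius ε; same illustrative assumptions as the PGD sketch above):

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, epsilon=0.1):
    """Single-step attack: epsilon times the sign of the gradient w.r.t. the input."""
    delta = torch.zeros_like(x, requires_grad=True)
    nn.CrossEntropyLoss()(model(x + delta), y).backward()
    return epsilon * delta.grad.detach().sign()
```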
Illustration of adversarial examples
We will demonstrate adversarial attacks on the MNIST data set, using two different architectures.
2-layer fully connected MLP: FC-100, FC-10
6-layer ConvNet: Conv-32x28x28, Conv-32x28x28, Conv-64x14x14, Conv-64x14x14, FC-200, FC-10
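For concreteness, one possible PyTorch rendering of these two architectures (the layer sizes follow the slide; kernel sizes, padding, and strides are assumptions):

```python
import torch.nn as nn

# 2-layer fully connected MLP (28x28 MNIST images flattened to 784 inputs)
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# 6-layer ConvNet: two 32-channel convs at 28x28, two 64-channel convs at 14x14,
# then FC-200 and FC-10
convnet = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),              # 32 x 28 x 28
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),             # 32 x 28 x 28
    nn.Conv2d(32, 64, 3, padding=1, stride=2), nn.ReLU(),   # 64 x 14 x 14
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),             # 64 x 14 x 14
    nn.Flatten(),
    nn.Linear(64 * 14 * 14, 200), nn.ReLU(),
    nn.Linear(200, 10),
)
```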
Illustrations of FGSM/PGD
Test error at ε = 0.1:
          Clean    FGSM     PGD
MLP       2.9%     92.6%    96.4%
ConvNet   1.1%     41.7%    74.3%

[Images: adversarial examples for the ConvNet under FGSM and PGD]
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
Adversarial robustness
$$\min_\theta \; \mathbf{E}_{x,y}\big[\operatorname{Loss}(f_\theta(x), y)\big] \;\;\Longrightarrow\;\; \min_\theta \; \mathbf{E}_{x,y}\Big[\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\Big]$$
1. Adversarial training: Take model SGD steps at (approximate) worst-case perturbations [Goodfellow et al., 2015; Kurakin et al., 2016; Madry et al., 2017]
2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter, 2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
Adversarial training
How do we optimize the objective

$$\min_\theta \; \sum_{(x,y) \in D} \, \max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\,?$$
We would like to solve it with gradient descent, but how do we compute the gradient of the objective with the max term inside?
Danskin's Theorem
A fundamental result in optimization:

$$\frac{\partial}{\partial \theta} \max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y) = \frac{\partial}{\partial \theta} \operatorname{Loss}(f_\theta(x + \delta^\star), y), \quad \text{where } \delta^\star = \operatorname*{argmax}_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)$$
This seems "obvious," but it is a very subtle result; it means we can optimize through the max by just finding its maximizing value.
Note, however, that it only applies when the max is performed exactly.
Adversarial training
Repeat:
1. Select a minibatch B
2. For each (x, y) ∈ B, compute an adversarial example δ*(x)
3. Update the parameters:

$$\theta := \theta - \frac{\alpha}{|B|} \sum_{(x,y) \in B} \frac{\partial}{\partial \theta} \operatorname{Loss}\big(f_\theta(x + \delta^\star(x)), y\big)$$
Common to also mix robust/standard updates (not done in our case)
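A hedged sketch of this loop in PyTorch, reusing the `pgd_linf` attack sketched earlier; the optimizer, data loader, and ε are placeholders:

```python
import torch.nn as nn

def adversarial_training_epoch(model, loader, opt, epsilon=0.1):
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:                                    # 1. select a minibatch
        delta = pgd_linf(model, x, y, epsilon=epsilon)     # 2. (approximate) worst-case perturbations
        opt.zero_grad()                                    # clear gradients accumulated by the attack
        loss = loss_fn(model(x + delta), y)                # 3. loss at the perturbed inputs
        loss.backward()
        opt.step()
```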
Test error at ε = 0.1:
                  Clean    FGSM     PGD
ConvNet           1.1%     41.7%    74.4%
Robust ConvNet    0.9%     2.6%     2.8%
Evaluating robust models
Our model looks good, but we should be careful about declaring success.
We need to evaluate against different attacks: PGD attacks run for longer, with random restarts, etc.
Note: it is not particularly informative to evaluate against a different type of attack, e.g., evaluating an ℓ∞-robust model against ℓ1 or ℓ2 attacks.
Adversarial robustness
$$\min_\theta \; \mathbf{E}_{x,y}\big[\operatorname{Loss}(f_\theta(x), y)\big] \;\;\Longrightarrow\;\; \min_\theta \; \mathbf{E}_{x,y}\Big[\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y)\Big]$$
1. Adversarial training: Take model SGD steps at (approximate) worst-case perturbations [Goodfellow et al., 2015; Kurakin et al., 2016; Madry et al., 2017]
2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter, 2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
Provable defenses
$$\max_{\delta \in \Delta} \operatorname{Loss}(f_\theta(x + \delta), y) \;\leq\; \max_{\delta \in \Delta} \operatorname{Loss}\big(f^{\text{rel}}_\theta(x + \delta), y\big) \;\leq\; \operatorname{Loss}\big(f^{\text{dual}}_\theta(x, \Delta), y\big)$$
[Figure: convex relaxations of the ReLU between lower and upper bounds ℓ ≤ z ≤ u on its input]
Dual from [Wong and Kolter, 2018], also independently derived via hybrid zonotope [Mirman et al., 2018] and forward Lipschitz arguments [Weng et al., 2018]
Maximization problem is now a convex linear program [Wong and Kolter, 2018]
[Wong and Kolter, 2018], https://arxiv.org/abs/1711.00851
Robust optimization: putting it all together
In the end, instead of minimizing the traditional loss…

$$\operatorname*{minimize}_\theta \; \sum_{i=1}^{m} \ell\big(h_\theta(x_i), y_i\big)$$
…we just minimize our computed bound on loss, implemented in an auto-differentiation framework (PyTorch), and we get a guaranteed bound on worst-case loss (or error) for any norm-bounded adversarial attack
$$\operatorname*{minimize}_\theta \; \sum_{i=1}^{m} \ell\big(J_{\epsilon,\theta}(x_i), y_i\big) \;\;\geq\;\; \operatorname*{minimize}_\theta \; \sum_{i=1}^{m} \max_{\delta \in \Delta} \ell\big(h_\theta(x_i + \delta), y_i\big)$$

where J_{ε,θ} denotes the computed (dual) bound on the network output.
Full code available at https://github.com/locuslab/convex_adversarial
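The exact dual bound is somewhat involved; purely as an illustration of the idea of "training on a bound," here is a much cruder interval-propagation bound (in the spirit of [Mirman et al., 2018], not the dual LP of [Wong and Kolter, 2018]) for a Sequential Linear/ReLU network. All function names here are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interval_bounds(model, x, epsilon):
    """Propagate elementwise [lower, upper] bounds for ||delta||_inf <= epsilon."""
    lower, upper = x - epsilon, x + epsilon
    for layer in model:
        if isinstance(layer, nn.Linear):
            mid, rad = (upper + lower) / 2, (upper - lower) / 2
            mid = layer(mid)                       # W @ mid + b
            rad = rad @ layer.weight.abs().t()     # |W| @ rad
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
        elif isinstance(layer, nn.Flatten):
            lower, upper = layer(lower), layer(upper)
        else:
            raise NotImplementedError(f"unsupported layer {type(layer)}")
    return lower, upper

def robust_loss_bound(model, x, y, epsilon):
    """Upper bound on the worst-case cross-entropy: true-class logit at its lower
    bound, all other logits at their upper bounds."""
    lower, upper = interval_bounds(model, x, epsilon)
    true_class = F.one_hot(y, num_classes=upper.shape[1]).bool()
    worst_logits = torch.where(true_class, lower, upper)
    return F.cross_entropy(worst_logits, y)
```

Minimizing `robust_loss_bound` plays the same role as minimizing ℓ(J_{ε,θ}(x_i), y_i) above, just with a much looser bound; the actual dual bound is implemented in the linked repository.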
2D Toy Example
Simple 2D toy problem: a 2-100-100-100-2 MLP network, trained with Adam (learning rate = 0.001, no hyperparameter tuning)
[Figure: decision regions under standard training vs. robust convex training on the 2D toy problem]
Standard and robust errors on MNIST 𝜖 = 0.1
                            Error    Guaranteed robust error bound
Standard CNN                1.10%    100%
Robust linear classifier    17%      44%
Our method (CNN)            1.10%    3.70%
MNIST Attacks
We can also look at how well real attacks perform at ε = 0.1
                     No attack    FGSM    PGD     Robust bound
Standard training    1.1%         50%     82%     100%
Our method           1.1%         2.1%    2.8%    3.7%
What causes adversarial examples?
Adversarial examples are caused (informally) by small regions of the adversarial class "jutting" into an otherwise "nice" decision region (see also, e.g., [Roth et al., 2019])
[Figure legend: data point, correct class, incorrect class]
Randomization as a defense?
We can "smooth" this decision region by adding Gaussian noise to the input and picking the majority class of the classifier over this noise.
This was proposed (in many different ways) as a heuristic defense, but [Lecuyer et al, 2018] and later [Li et al., 2018] demonstrated that it gives certified bounds; we simplify and tighten this analysis in [Cohen et al., 2019]
$$g(x) = \operatorname*{argmax}_{y} \; \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = y\big]$$

where f(x) is the base classifier and g(x) is the smoothed classifier.
Visual intuition of randomized smoothing
To classify panda images, classify a bunch of versions perturbed by random noise and take the majority vote.
Note that this requires that our "base" classifier f be able to classify noisy images well (in practice, this means we also need to train on these noisy images).
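A minimal Monte Carlo sketch of this majority vote (assuming a PyTorch base classifier `f` that maps a batch of images to logits; σ and the sample count are placeholders):

```python
import torch

def smoothed_predict(f, x, sigma=0.25, n_samples=1000):
    """Estimate g(x) by classifying noisy copies of x and taking the majority vote."""
    with torch.no_grad():
        noise = sigma * torch.randn(n_samples, *x.shape)
        logits = f(x.unsqueeze(0) + noise)         # classify n noisy copies of x
        votes = logits.argmax(dim=1)
        counts = votes.bincount(minlength=logits.shape[1])
    return counts.argmax().item()
```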
The randomized smoothing guarantee
Theorem (binary case):
• Given some input x, let y = g(x) be the prediction of the smoothed classifier, and let p > 1/2 be the associated probability of this class under the smoothing distribution,
$$p = \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = y\big]$$
• Then g(x + δ) = y (i.e., the smoothed classifier is robust) for any δ such that
$$\|\delta\|_2 \leq \sigma \Phi^{-1}(p)$$
where Φ^{-1} is the inverse of the standard Gaussian CDF.
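For a concrete sense of scale (illustrative numbers, not from the talk): with σ = 0.25, a vote probability of p = 0.99 certifies a radius R = σΦ^{-1}(p) ≈ 0.25 · 2.33 ≈ 0.58, while p = 0.75 certifies only R ≈ 0.25 · 0.67 ≈ 0.17.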
Proof of certified robustness
Reasonable question: why can performance on random noise tell us anything about performance under adversarial noise?
Proof of theorem (informal):
• Suppose I have two points x and x + δ, and you, an adversary, want to craft a decision boundary for the underlying classifier f(x) such that:
1. x is classified one way by the smoothed classifier g(x)
2. x + δ is classified differently by the smoothed classifier g(x)
Proof of certified robustness (cont)
[Figure sequence: candidate decision boundaries for the base classifier f(x) and the corresponding smoothed classifier g(x), each evaluated at x and x + δ]
For a linear classifier, we can compute the ℓ2 distance to the worst-case boundary exactly:
$$R = \sigma \Phi^{-1}(p)$$
where p is the probability of the majority class; this implies that any perturbation with ‖δ‖₂ ≤ R cannot change the class label. ∎
(The reduction from an arbitrary base classifier to this worst-case linear boundary follows from the Neyman-Pearson lemma in hypothesis testing. See also [Li and Kuelbs, 1998]; thanks to Ludwig Schmidt for pointing out the reference.)
Caveats (a.k.a. the fine print)
The procedure here only guarantees robustness for the smoothed classifier g, not for the underlying classifier f.
The probability p of correct classification under smoothing cannot be computed exactly (the exact convolution of a Gaussian with a neural network is intractable):
• In practice, we need to resort to Monte Carlo estimates to compute a lower bound on p and certify the prediction (we need a lot of samples to compute the certified radius, though far fewer just to compute the prediction)
• Bounds hold with high probability over the (internal) randomness of the sampling
We are certifying a tiny radius compared to the noise distribution.
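A hedged sketch of this Monte Carlo certification (a Clopper-Pearson lower bound on p via scipy.stats.beta is one standard choice; the sample size, abstention rule, and reuse of the same samples for selection and estimation are simplifications relative to the actual procedure in [Cohen et al., 2019]):

```python
import torch
from scipy.stats import beta, norm

def certify(f, x, sigma=0.25, n=10_000, alpha=0.001):
    """Return (predicted class, certified l2 radius), valid with probability >= 1 - alpha
    over the sampling. In practice the n noisy copies would be classified in batches."""
    with torch.no_grad():
        noise = sigma * torch.randn(n, *x.shape)
        preds = f(x.unsqueeze(0) + noise).argmax(dim=1)
    top_class = preds.bincount().argmax().item()
    k = (preds == top_class).sum().item()
    p_lower = beta.ppf(alpha, k, n - k + 1)   # Clopper-Pearson lower bound on p
    if p_lower <= 0.5:
        return None, 0.0                      # abstain: cannot certify
    return top_class, sigma * norm.ppf(p_lower)
```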
Comparison to previous SOTA on CIFAR10
For identical networks, randomized smoothing mostly outperforms the previous SOTA for ℓ2 robustness, but it also scales to much larger networks (where it uniformly outperforms duality-based approaches).
Performance on ImageNet
Example: we can certify that the smoothed classifier has top-1 accuracy of 37% under any perturbation with ‖δ‖₂ ≤ 1 (in normalized pixels, i.e., RGB values in [0, 1]).
Future and ongoing work
Extension to other perturbation norms besides ℓ2?
• Seems extremely challenging (possibly impossible under certain assumptions), e.g., we can't do better than naive d^{1/2} scaling for the ℓ∞ norm
A strange property:
• Previous work on LP bounds was extremely specific to neural networks
• The smoothing work never uses the fact that the base classifier is a neural network
My best guess for a way forward: we need to use model information to extract properties of the base classifier beyond the single probability p, and use these to get better bounds.
Outline
• Introduction
• Attacking machine learning algorithms
• Defending against adversarial attacks
• Final thoughts
Robust artificial intelligence
Deep learning is making amazing strides, but we have a long way to go before we can build deep learning systems that achieve even "small" degrees of robustness and adaptability compared to what humans take for granted.
Resources:
• http://zicokolter.com – Web page with all papers
• http://github.com/locuslab – Code associated with all papers
• http://adversarial-ml-tutorial.org – Tutorial/code on adversarial robustness
• http://locuslab.github.io – Group blog