TRANSCRIPT
Towards Principled Methodologies and Efficient Algorithms for
Minimax Machine Learning
Tuo Zhao
Georgia Tech, Jun. 26, 2019
Joint work with Haoming Jiang, Minshuo Chen (Georgia Tech), Bo Dai (Google Brain), Zhaoran Wang (Northwestern U), and others.
Background
VALSE Webinar, Jun. 26 2019
Minimax Machine Learning
Conventional Empirical Risk Minimization: Given training data $z_1, \ldots, z_n$, we minimize an empirical risk function,

$$\min_\theta \frac{1}{n}\sum_{i=1}^n f(z_i; \theta).$$

Minimax Formulation: We solve a minimax problem,

$$\min_\theta \max_\phi \frac{1}{n}\sum_{i=1}^n f(z_i; \theta, \phi).$$

More flexible.
Tuo Zhao — Towards Principled Methodologies and Efficient Algorithms for Minimax Machine Learning 2/38
Motivating Application: Robust Deep Learning
Neural networks are vulnerable to adversarial examples (Goodfellow et al. 2014, Madry et al. 2017).

[Figure: clean sample + perturbation → adversarial example]
Adversarial Perturbation:

$$\max_{\delta_i \in \mathcal{B}} \ell(f(x_i + \delta_i; \theta), y_i),$$

Adversarial Training:

$$\min_\theta \frac{1}{n}\sum_{i=1}^n \max_{\delta_i \in \mathcal{B}} \ell(f(x_i + \delta_i; \theta), y_i),$$
where δi ∈ B denotes the imperceptible perturbation.
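The inner maximization above is typically approximated by projected gradient ascent. Below is a minimal NumPy sketch of that idea, using a toy quadratic loss as a stand-in for the network loss $\ell(f(x+\delta;\theta),y)$; the function names, step sizes, and the loss itself are illustrative assumptions, not the slides' setup.

```python
import numpy as np

def pgd_attack(x, grad_loss, eps=0.1, step=0.02, iters=20):
    """Approximately maximize the loss over perturbations delta in the
    l_inf ball B = {delta : ||delta||_inf <= eps} by projected gradient
    ascent with sign-gradient steps (PGD-style)."""
    delta = np.zeros_like(x)
    for _ in range(iters):
        delta += step * np.sign(grad_loss(x + delta))  # ascent step
        delta = np.clip(delta, -eps, eps)              # project back onto B
    return delta

# Toy stand-in loss l(z) = ||z||^2 / 2, so grad_loss(z) = z; the attack
# should push the input away from the loss minimum at the origin.
x = np.array([0.5, -0.3])
delta = pgd_attack(x, grad_loss=lambda z: z)
loss = lambda z: 0.5 * np.sum(z ** 2)
```

The projection step is what keeps the perturbation "imperceptible": no matter how many ascent steps are taken, delta never leaves the ball B.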
Motivating Application: Image Generation
Brock et al. (2019)
All are fake!
Motivating Application: Unsupervised Learning
Generative Adversarial Network: Goodfellow et al. (2014), Arjovsky et al. (2017), Miyato et al. (2018), Brock et al. (2019)
$$\min_\theta \max_W \frac{1}{n}\sum_{i=1}^n \phi\big(\mathcal{A}(D_W(x_i))\big) + \mathbb{E}_{x \sim \mathcal{D}_{G_\theta}}\big[\phi\big(1 - \mathcal{A}(D_W(x))\big)\big].$$

$D_W$: discriminator; $G_\theta$: generator; $\phi$: $\log(\cdot)$; $\mathcal{A}$: softmax.
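As a concrete reading of the objective, here is a NumPy sketch that evaluates the empirical GAN loss with a linear score as a toy "discriminator", $\mathcal{A}$ as the sigmoid, and $\phi = \log$; all components are illustrative stand-ins, not the slides' networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_objective(W, real, fake):
    """(1/n) sum_i log A(D_W(x_i)) + mean over generator samples of
    log(1 - A(D_W(x))), with D_W(x) = W x a linear score, A = sigmoid."""
    real_term = np.mean(np.log(sigmoid(real @ W)))
    fake_term = np.mean(np.log(1.0 - sigmoid(fake @ W)))
    return real_term + fake_term

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(100, 1))   # "data" samples
fake = rng.normal(loc=-2.0, size=(100, 1))  # "generator" samples

# A discriminator that separates the two distributions scores higher
# (closer to 0) than the uninformative W = 0, which gives 2*log(1/2).
v_sep = gan_objective(np.array([1.0]), real, fake)
v_zero = gan_objective(np.array([0.0]), real, fake)
```

The inner max pushes the score toward this separating direction; the outer min over the generator then moves the fake samples toward the data.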
Motivating Application: Reinforcement Learning
Minimax Formulation: Given an MDP $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, we solve

$$\min_{\pi, V} \max_\nu \; L(\pi, V; \nu) = 2\,\mathbb{E}_{s,a,s'}\big[\nu(s,a)\big(R(s,a) + \gamma V(s') - \lambda \log \pi(a|s)\big)\big] - \mathbb{E}_{s,a,s'}\big[\nu^2(s,a)\big],$$

where $s$ denotes the state, $a$ denotes the action, and

Policy: $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$,

Value: $V : \mathcal{S} \to \mathbb{R}$,

Reward: $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$,

Auxiliary dual: $\nu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.

The policy $\pi$ is parameterized as a neural network, whereas $\nu$ is parameterized as a reproducing kernel function (Dai et al. 2018).
Successes of Minimax Machine Learning
Adversarial Robust Learning
Unsupervised Learning
Learning with Constraints
Reinforcement Learning
Domain Adaptation
Generative Adversarial Imitation Learning
. . .
=⇒ Identify the fundamental hardness of minimax machine learning and make optimization easier.
Challenges
Minimax Optimization
General Formula:

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y),$$

where $\mathcal{X} \subset \mathbb{R}^d$, $\mathcal{Y} \subset \mathbb{R}^p$, and $f$ is some continuous function.
Two-Stage Optimization:

Stage 1: $g(x) = \max_{y \in \mathcal{Y}} f(x, y)$;

Stage 2: $\min_{x \in \mathcal{X}} g(x)$;

Solve Stage 2 using gradient descent.
Limitation: A global maximum of $\max_{y \in \mathcal{Y}} f(x, y)$ needs to be obtained to evaluate $\nabla g(x)$ (Envelope Theorem, Afriat et al. (1971)).
Existing Literature
Bilinear Saddle Point Problem:

$$\min_{x \in \mathcal{X}} \Big\{ p(x) + \max_{y \in \mathcal{Y}} \langle Ax, y \rangle - q(y) \Big\}.$$

$\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}^p$: closed convex domains; $A \in \mathbb{R}^{p \times d}$; $p(\cdot)$ and $q(\cdot)$: convex functions satisfying certain assumptions.
Nice Structure: Convex in $x$ and concave in $y$; bilinear interaction (can be slightly relaxed).
Algorithms with Theoretical Guarantees:
Primal-Dual Algorithm, Mirror-Prox Algorithm, · · · (Nemirovski 2005, Chen et al. 2014, Dang et al. 2015).
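A minimal NumPy sketch of the extragradient (mirror-prox) idea on the unconstrained bilinear special case $p = q = 0$, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}^p$: each update takes a gradient step evaluated at an extrapolated point. The matrix $A$ and step size below are illustrative choices, not from the slides.

```python
import numpy as np

# Toy bilinear saddle problem min_x max_y <Ax, y>; its unique saddle
# point is (0, 0).
A = np.array([[0.9, 0.0],
              [0.0, 0.7]])

def extragradient(x, y, eta=0.2, iters=3000):
    for _ in range(iters):
        # extrapolation (half) step
        xh = x - eta * (A.T @ y)
        yh = y + eta * (A @ x)
        # full step using gradients at the extrapolated point
        x = x - eta * (A.T @ yh)
        y = y + eta * (A @ xh)
    return x, y

x, y = extragradient(np.ones(2), np.ones(2))
```

On this bilinear problem the extragradient iterates contract linearly toward the saddle, whereas plain simultaneous gradient descent-ascent spirals away; the extrapolation step is the entire difference.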
Challenges: Nonconcavity of Inner Maximization
Recall Stage 2:

$$\min_{x \in \mathcal{X}} \Big\{ g(x) := \max_{y \in \mathcal{Y}} f(x, y) \Big\}.$$

Why does it fail to converge? Using $y \neq \arg\max_y f(x, y)$ may even lead to

$$\Big\langle \frac{\partial g(x)}{\partial x}, \frac{\partial f(x, y)}{\partial x} \Big\rangle < 0.$$
[Figure: noisy gradients in minimization vs. minimax over (θ, φ); minimax dynamics can fall into limit cycles]
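The limit-cycle behavior can be seen numerically: on the toy bilinear saddle $f(x, y) = xy$ (saddle at the origin), simultaneous gradient descent-ascent spirals outward instead of converging. This is a minimal hypothetical example, not one from the slides.

```python
import numpy as np

def gda_step(x, y, eta):
    # descend in x, ascend in y, for f(x, y) = x * y
    return x - eta * y, y + eta * x

x, y = 1.0, 0.0
radii = [np.hypot(x, y)]
for _ in range(200):
    x, y = gda_step(x, y, eta=0.1)
    radii.append(np.hypot(x, y))
# Each step is a rotation combined with a scaling by sqrt(1 + eta^2),
# so the distance to the saddle grows monotonically.
```

Shrinking the step size only slows the spiral; it does not remove it, which is why minimax needs algorithmic ideas beyond plain gradient descent-ascent.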
Our Proposed Solutions
State of the Art:
Convex-concave: Well studied.
Nonconvex-concave: Studied to a limited extent. Reinforcement learning: Dai et al. (2018); constrained optimization: Chen et al. (2019); · · ·

Beyond: No algorithm works well.
Our Solutions:
Improving Landscape and Learning to Optimize
Generative Adversarial Networks
Generative Adversarial Networks
Highly Nonconvex-Nonconcave Minimax Problem:
$$\min_\theta \max_W \frac{1}{n}\sum_{i=1}^n \phi\big(\mathcal{A}(D_W(x_i))\big) + \mathbb{E}_{x \sim \mathcal{D}_{G_\theta}}\big[\phi\big(1 - \mathcal{A}(D_W(x))\big)\big].$$

$D_W$: discriminator; $G_\theta$: generator; $\phi, \mathcal{A}$: properly chosen functions (e.g., $\log(\cdot)$ and softmax).
Generative Adversarial Networks
Instability Issue: Mode Collapse
Stabilizing GAN Training
Better Algorithm:
Two Time-Scale Update
Functional Gradient
Progressive Learning
. . .
Better Landscape:
Gradient Penalty
Weight Clipping
Orthogonal Regularization
Spectral Normalization
. . .
Algorithm works only if the landscape is good enough.
Better Optimization Landscape
Lipschitz Continuous Discriminator:
An $L$-layer discriminator can be formulated as

$$D_W(x) = W_L \sigma_{L-1}(W_{L-1} \cdots \sigma_1(W_1 x) \cdots),$$

where the $W_i$'s are weight matrices and the $\sigma_i$'s are activations.

1-Lipschitz condition:

$$|D_W(x) - D_W(x')| \le \|x - x'\|.$$
Inspired by Wasserstein GAN (Arjovsky et al., 2017)
Empirically works well, but why?
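Since each ReLU is 1-Lipschitz, the product of the layers' spectral norms gives a simple (generally loose) upper bound on the discriminator's Lipschitz constant. A NumPy sanity check with random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))

def D(x):
    # two-layer ReLU discriminator D_W(x) = W2 sigma(W1 x)
    return (W2 @ np.maximum(W1 @ x, 0.0)).item()

# ||W||_2 is the largest singular value; the product of the layers'
# spectral norms bounds the Lipschitz constant of D because the ReLU
# nonlinearity is 1-Lipschitz.
lip_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)

x, xp = rng.normal(size=4), rng.normal(size=4)
gap = abs(D(x) - D(xp))
```

Constraining each $\|W_i\|_2$ (e.g., to 1) therefore directly enforces the 1-Lipschitz condition above, which is the link to the spectral-control methods in the following slides.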
Control Weight Matrix Scaling
Scaling Issue: Consider a simple 2-layer discriminator with ReLU activation ($\sigma(\cdot) = \max(\cdot, 0)$):

$$D_W(x) = W_2 \sigma(W_1 x).$$

Since the ReLU activation is homogeneous, we can rescale the weight matrices by a factor $\lambda > 0$ as

$$W_1 \Rightarrow \lambda W_1, \qquad W_2 \Rightarrow W_2 / \lambda.$$
Although the neural network model remains the same, the optimization landscape becomes worse.
Orthogonal Regularization:
$$\min_{W_1, W_2} L(W_1, W_2) + \lambda \Big( \big\|W_1^\top W_1 - I\big\|_F^2 + \big\|W_2^\top W_2 - I\big\|_F^2 \Big).$$
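The rescaling invariance of the network, and the fact that the orthogonality penalty breaks it, can be checked directly; the weights below are random illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(1, 6))

def forward(x, W1, W2):
    # D_W(x) = W2 sigma(W1 x) with ReLU sigma
    return W2 @ np.maximum(W1 @ x, 0.0)

def orth_penalty(W):
    # ||W^T W - I||_F^2
    I = np.eye(W.shape[1])
    return float(np.linalg.norm(W.T @ W - I, 'fro') ** 2)

x = rng.normal(size=4)
lam = 10.0
# ReLU is positively homogeneous, so rescaling leaves the output unchanged...
same_output = np.allclose(forward(x, W1, W2), forward(x, lam * W1, W2 / lam))
# ...but the penalty is not scale-invariant: it charges the badly scaled
# pair much more, steering optimization toward well-conditioned weights.
pen_orig = orth_penalty(W1) + orth_penalty(W2)
pen_scaled = orth_penalty(lam * W1) + orth_penalty(W2 / lam)
```

This is exactly the mechanism the slide describes: the regularizer removes the flat rescaling directions from the loss surface without changing which functions the network can represent.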
Illustrations of Landscape
$$\min_{x,y} F(x, y) = (1 - xy)^2, \qquad \min_{x,y} F_\lambda(x, y) = (1 - xy)^2 + \lambda (x^2 - y^2)^2.$$

[Figure: surface plots of $F(x, y)$ and $F_\lambda(x, y)$]
Also Improves Generalization
Theorem (Informal, Jiang et al. 2019)

Under some technical assumptions, suppose

$$\|W_i\|_2 \le B_{W_i} \;\text{for}\; i \in [L] \quad \text{and} \quad \|x_k\|_2 \le B_x \;\text{for}\; k \in [n],$$

and that the generator and discriminator are well trained, i.e.,

$$d_{\mathcal{F},\phi}(\mu_n, \nu_n) - \inf_{\nu \in \mathcal{D}_G} d_{\mathcal{F},\phi}(\mu_n, \nu) \le \epsilon,$$

where $d_{\mathcal{F},\phi}(\cdot, \cdot)$ is the neural distance. Then with probability at least $1 - \delta$, we have

$$d_{\mathcal{F},\phi}(\mu, \nu_n) - \inf_{\nu \in \mathcal{D}_G} d_{\mathcal{F},\phi}(\mu, \nu) \le O\bigg( \frac{B_x \prod_{i=1}^L B_{W_i} \sqrt{d^2 L}}{\sqrt{n}} \bigg).$$
From Lipschitz Continuity to Generalization
Importance of Spectrum Control:

$$d_{\mathcal{F},\phi}(\mu, \nu_n) - \inf_{\nu \in \mathcal{D}_G} d_{\mathcal{F},\phi}(\mu, \nu) \le O\bigg( \frac{B_x \prod_{i=1}^L B_{W_i} \sqrt{d^2 L}}{\sqrt{n}} \bigg).$$

1-Lipschitz $\Longrightarrow$ polynomial bound $O\Big(\sqrt{\frac{d^2 L}{n}}\Big)$.
Controlling the product of spectral norms avoids a bad landscape and benefits the generalization of GANs.
Better than Orthogonal Regularization
Spectral Normalization (SN, Miyato et al. 2018):
[Figure: Inception Score on STL-10 over training iterations for SN (Miyato et al. 2018), SN (alternative), and orthogonal regularization]
Miyato et al. (2018) > Orth. Reg. > SN (Alternative)
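Spectral normalization divides each weight matrix by an estimate of its top singular value, typically obtained by power iteration; here is a NumPy sketch of that mechanism (the iteration count and weights are illustrative).

```python
import numpy as np

def spectral_normalize(W, iters=500):
    """Estimate the top singular value of W by power iteration and
    divide it out, so the returned matrix has spectral norm ~= 1."""
    u = np.ones(W.shape[0])
    v = np.ones(W.shape[1])
    for _ in range(iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # Rayleigh-quotient estimate of the top singular value
    return W / sigma

rng = np.random.default_rng(3)
W = rng.normal(size=(5, 3))
W_sn = spectral_normalize(W)
```

In practice (e.g., in Miyato et al. 2018) only one power-iteration step is run per training update, reusing the vectors from the previous step, since the weights change slowly.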
Better than Spectral Normalization
Singular Value Decay: Decay patterns of sorted singular valuesof weight matrices.
[Figure: sorted singular values of the weight matrices (layers 0–6) under each scheme]

| Method | Singular Value Decay | Inception Score |
| --- | --- | --- |
| Orthogonal Reg. | No decay | 8.77 |
| SN (Miyato et al. 2018) | Slow decay | 8.83 |
| SN (Alt.) | Fast decay | 8.69 |
| Jiang et al. (2019) | Slower decay | 9.25 |
Observation: Slow singular value decay is better than bothno decay and fast decay.
Experiments (CIFAR10 and STL-10)
[Figure: FID and Inception Score curves on CIFAR-10 and STL-10]
Experiments (ImageNet)
[Figure: generated ImageNet samples: Valley, Jellyfish, Pizza; Anemone, Shoji, Brain Coral; Cardoon, Altar, Jack-o'-lantern]
Adversarial Robust Learning
Adversarial Training
[Figure: clean sample + perturbation → adversarial example]
Highly Nonconvex-Nonconcave Minimax Problem:
$$\min_\theta \frac{1}{n}\sum_{i=1}^n \max_{\delta_i \in \mathcal{B}} \ell(f(x_i + \delta_i; \theta), y_i).$$

$x_i$: feature; $y_i$: label; $\delta_i$: perturbation; $f(\cdot; \theta)$: neural network; $\ell$: loss function; $\mathcal{B}$: constraint set.
Adversarial Training
$$\min_\theta \frac{1}{n}\sum_{i=1}^n \max_{\delta_i \in \mathcal{B}} \ell(f(x_i + \delta_i; \theta), y_i).$$
Two-Stage Optimization:
Inner Maximization Problem (Attack)
Outer Minimization Problem (Defense)
Popular Approaches for Attack:
Fast Gradient Sign Method (Goodfellow et al. 2014)

Projected Gradient Method (Kurakin et al. 2016)

Carlini-Wagner Attack (Carlini and Wagner 2017)
Learn to Learn/Optimize (L2L)
High Level Idea:
Cast the optimizer as a learning model;
Allow the model to learn to exploit structure automatically.
Implementation: Parameterize the optimizer as a neural network, and learn its parameters (Andrychowicz et al. 2016).
[Diagram: a hand-designed optimization algorithm (e.g., gradient descent) maps an initial solution x_0, via gradient queries ∇f(x_t), to an output solution x_T, vs. a learned optimizer doing the same.]
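The contrast above can be sketched in a few lines of NumPy (an illustration, not the speaker's code): a hand-designed rule updates with a fixed step size, while a learned optimizer produces its step from a parameterized model. The weight matrix W below is hypothetical; in practice it would be meta-trained over many objectives.

```python
import numpy as np

def f(x):
    # toy quadratic objective f(x) = 0.5 * ||x||^2
    return 0.5 * np.sum(x ** 2)

def grad_f(x):
    return x

def gradient_descent(x0, lr=0.1, steps=50):
    # hand-designed update rule: x <- x - lr * grad
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

def learned_optimizer(x0, W, steps=50):
    # learned update rule: the step is produced by a parameterized
    # model (here a single linear map W acting on the gradient);
    # W stands in for a trained optimizer network
    x = x0.copy()
    for _ in range(steps):
        x = x - grad_f(x) @ W
    return x

x0 = np.array([1.0, -2.0, 3.0])
x_gd = gradient_descent(x0)
W = 0.1 * np.eye(3)   # hypothetical "learned" parameters
x_l2l = learned_optimizer(x0, W)
```

With W = 0.1·I the learned rule reduces to gradient descent; a trained W can exploit problem structure that a fixed step size cannot.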
Learn to Learn/Optimize (L2L)
Advantages:
The attacker network has strong representation power
⇒ yields strong and flexible perturbations.
A shared attacker model
⇒ learns common structures across all perturbations.
Learning through overparameterization
⇒ eases the training process.
A reduced search space
⇒ computational efficiency.
Learn to Learn/Optimize (L2L)
New Formulation:

min_θ max_φ (1/n) ∑_{i=1}^n ℓ( f(x_i + g(A(x_i, y_i, θ); φ); θ), y_i ),
Notations:
f(·; θ): Classifier;
g(·;φ): Attacker/Optimizer;
A(x_i, y_i, θ): input of the optimizer g (A lets g interact with f).
![Page 49: Towards Principled Methodologies and E cient Algorithms ...valser.org/webinar/slide/slides/20190626/presentation.pdf · 6/26/2019 · Towards Principled Methodologies and E cient](https://reader036.vdocuments.net/reader036/viewer/2022081611/5f038e407e708231d409a211/html5/thumbnails/49.jpg)
VALSE Webinar, Jun. 26 2019
Learn to Attack:
Grad L2L: Motivated by gradient ascent with
A(x_i, y_i, θ) = [x_i, ∇_x ℓ(f(x_i; θ), y_i)].
[Diagram: the classifier computes the clean loss and the gradient w.r.t. the original input; the attacker g takes the concatenated input and gradient, plus noise, and outputs perturbed inputs; the clean and adversarial losses are summed and backpropagated.]
Multi-Step Grad L2L: Recursively apply Grad L2L (i.e., unroll it as an RNN).
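A minimal NumPy sketch of a Grad-L2L-style attacker (an illustration under simplifying assumptions, not the authors' implementation): a toy linear softmax classifier supplies the input gradient, and a hypothetical one-layer attacker with weights phi maps the concatenated [input, gradient] features to a perturbation bounded by ε via tanh.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(x, y, theta):
    # toy linear classifier f(x; theta) with cross-entropy loss;
    # returns the loss and its gradient w.r.t. the input x
    logits = theta @ x
    p = softmax(logits)
    loss = -np.log(p[y])
    grad_x = theta.T @ (p - np.eye(len(p))[y])
    return loss, grad_x

def grad_l2l_attack(x, y, theta, phi, eps=0.1):
    # attacker g(A(x, y, theta); phi): a one-layer map on the
    # concatenated [x, gradient] features, squashed to [-eps, eps]
    _, g = loss_and_grad(x, y, theta)
    a = np.concatenate([x, g])        # A(x, y, theta) = [x, grad_x loss]
    delta = eps * np.tanh(phi @ a)    # bounded perturbation
    return x + delta

rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 4))  # hypothetical classifier: 3 classes, 4 features
phi = rng.standard_normal((4, 8))    # attacker weights, trained by the outer max
x, y = rng.standard_normal(4), 1
x_adv = grad_l2l_attack(x, y, theta, phi)
```

In adversarial training, phi would be optimized to maximize the loss on x_adv while theta is optimized to minimize it; the multi-step variant feeds x_adv back through the same attacker.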
Experiments
[Table: accuracy on clean samples and under PGM adversaries.]
[Table: per-iteration computational cost.]
Reinforcement Learning
Smoothed Bellman Error Minimization
Minimax Formulation: Given an MDP M = (S, A, P, R, γ), we solve

min_{π,V} max_ν L(π, V; ν) = 2 E_{s,a,s′}[ν(s, a)(R(s, a) + γV(s′) − λ log π(a|s) − V(s))] − E_{s,a}[ν²(s, a)],
where s denotes the state, a denotes the action, and
Policy: π : S → P(A),
Value: V : S → R,
Reward: R : S ×A → R,
Auxiliary Dual: ν : S × A → R.
The policy π and the dual ν are parameterized as a neural network and a reproducing kernel function, respectively (Dai et al. 2018).
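The role of the auxiliary dual ν becomes clear by solving the inner maximization in closed form. Following the SBEED derivation in Dai et al. (2018), write the smoothed Bellman residual as δ(s, a, s′) = R(s, a) + γV(s′) − λ log π(a|s) − V(s); then, pointwise in (s, a):

```latex
% The objective is a concave quadratic in \nu(s,a):
\max_{\nu(s,a)} \; 2\,\nu(s,a)\,\mathbb{E}_{s'\mid s,a}\big[\delta(s,a,s')\big] - \nu^2(s,a)
  \;=\; \Big(\mathbb{E}_{s'\mid s,a}\big[\delta(s,a,s')\big]\Big)^2,
\qquad \nu^{*}(s,a) \;=\; \mathbb{E}_{s'\mid s,a}\big[\delta(s,a,s')\big].
```

So the outer minimization over (π, V) drives the expected squared (smoothed) Bellman residual to zero.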
Parameterization of V , π and ν
State Approximation: There exists a feature vector ψ(s)associated with every state s ∈ S.
Neural Networks for π and V :
π(a_j | s) = f_j(ψ(s); Θ) and V(s) = h(ψ(s); ∆),
where f_j is a neural network with parameter Θ and ∑_{a_j ∈ A} π(a_j | s) = 1.
Reproducing Kernel Functions for ν:
ν(s, a_j) = g_j(ψ(s); Ω),
where g_j is a reproducing kernel function with parameter Ω.
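As a concrete (hypothetical) instance of the kernel parameterization, ν can be written as a finite RBF-kernel expansion over anchor states; the centers and weights below are illustrative, not from the slides.

```python
import numpy as np

def rbf_kernel(u, v, bandwidth=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 * bandwidth^2))
    d = u - v
    return np.exp(-(d @ d) / (2.0 * bandwidth ** 2))

def nu(psi_s, j, centers, Omega, bandwidth=1.0):
    # finite kernel expansion: nu_j(s) = sum_k Omega[k, j] * k(psi(s), c_k)
    return sum(Omega[k, j] * rbf_kernel(psi_s, centers[k], bandwidth)
               for k in range(len(centers)))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])   # hypothetical anchor-state features
Omega = np.array([[0.5, -0.2], [0.3, 0.7]])    # kernel weights, one column per action
val = nu(np.array([0.5, 0.5]), 0, centers, Omega)
```

Because ν is linear in Ω for fixed kernel centers, the inner maximization stays concave, which is what the convergence analysis below exploits.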
Benefit of Reproducing Kernel Parameterization
Alternative Minimax Formulation:
min_{∆,Θ} max_{Ω ∈ C} L(∆, Θ, Ω) − R(Ω),

where R(Ω) is a regularizer that makes the objective strongly concave in Ω.

Stochastic Alternating Gradient Algorithm:

Ω^{(t+1)} = Π_C(Ω^{(t)} + η_Ω ∇_Ω L̂(∆^{(t)}, Θ^{(t)}, Ω^{(t)})),
∆^{(t+1)} = ∆^{(t)} − η_∆ ∇_∆ L̂′(∆^{(t)}, Θ^{(t)}, Ω^{(t+1)})),
Θ^{(t+1)} = Θ^{(t)} − η_Θ ∇_Θ L̂′(∆^{(t)}, Θ^{(t)}, Ω^{(t+1)})),

where η_∆, η_Θ, and η_Ω are properly chosen step sizes, and L̂ and L̂′
are unbiased, independent stochastic approximations of L.
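The update scheme can be illustrated on a toy problem (a sketch, not the paper's algorithm): minimize over x and maximize over a box-constrained dual w for L(x, w) = x·w − w²/2, which is strongly concave in w, using a projected ascent step on w followed by a descent step on x with the updated dual.

```python
import numpy as np

def proj(w, lo=-1.0, hi=1.0):
    # projection onto the box constraint C = [lo, hi]
    return np.clip(w, lo, hi)

# toy objective: L(x, w) = x*w - 0.5*w**2 (strongly concave in w)
def grad_x(x, w):
    return w

def grad_w(x, w):
    return x - w

x, w = 2.0, 0.5
eta = 0.1
for t in range(500):
    w = proj(w + eta * grad_w(x, w))   # ascent step on the dual, then project
    x = x - eta * grad_x(x, w)         # descent step using the updated dual
```

Running the loop drives (x, w) to the saddle point (0, 0); in the actual algorithm, exact gradients are replaced by large-batch stochastic estimates.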
Sublinear Convergence
Theorem (Informal, Chen et al. 2019)
Given a pre-specified error ε > 0, we assume that L(∆, Θ, Ω) is sufficiently smooth in ∆, Θ, Ω ∈ C, and strongly concave in Ω. Given properly chosen step sizes and a batch size of O(1/ε), we need at most

T = O(1/ε)

iterations such that

min_{1≤t≤T} E‖∇_∆ L(∆^{(t)}, Θ^{(t)}, Ω^{(t+1)})‖²₂ + E‖∇_Θ L(∆^{(t)}, Θ^{(t)}, Ω^{(t+1)})‖²₂ + E‖Ω^{(t)} − Π_C(Ω^{(t)} + ∇_Ω L(∆^{(t)}, Θ^{(t)}, Ω^{(t)}))‖²₂ ≤ ε.
Experiments
Reproducing Kernel v.s. Neural Networks for ν.
[Figures: scaled return of the learned policy vs. training iterations for GMMIL (red) and GAIL (blue) on Reacher, HalfCheetah, Hopper, Walker, Ant, and Humanoid; one set of panels uses a single TRPO gradient update per iteration, the other 5 TRPO gradient updates per iteration via importance sampling.]
The reproducing kernel parameterization leads to an easier optimization problem. However, it might not be advantageous on more complicated problems.
Take Home Messages
Summary
Minimax optimization is very difficult in general;
Heuristics leverage specific structures in machine learning problems;
Normalization techniques improve the optimization landscape and stabilize the training of GANs;
Learning-to-optimize techniques have the potential to outperform hand-designed algorithms;
The "large-batch" stochastic alternating gradient descent attains sublinear convergence to a stationary solution for nonconvex-concave stochastic minimax optimization problems;
Lots of new problems, open to everyone!
References
[1] Jiang et al., "On Computation and Generalization of Generative Adversarial Networks under Spectrum Control". International Conference on Learning Representations (ICLR), 2019.

[2] Jiang et al., "Learning to Defense by Learning to Attack". ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019.

[3] Chen et al., "On Computation and Generalization of Generative Adversarial Imitation Learning". Submitted.

[4] Chen et al., "On Landscape of Lagrangian Functions and Stochastic Search for Constrained Nonconvex Optimization". International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[5] Liu et al., "Deep Hyperspherical Learning". Annual Conference on Neural Information Processing Systems (NIPS), 2017.

[6] Li et al., "Symmetry, Saddle Points and Global Optimization Landscape of Nonconvex Matrix Factorization". IEEE Transactions on Information Theory (TIT), 2019.
Questions?