A Tutorial on Zero-Order Optimization
Yujie Tang
A Review of Gradient Descent
• Iteration: x_{k+1} = x_k − η ∇f(x_k)
• If f is L-smooth and η = 1/L, then min_{0≤k<K} ‖∇f(x_k)‖² ≤ 2L (f(x_0) − f*) / K
• If f is L-smooth and convex, and η = 1/L, then f(x_K) − f* ≤ L ‖x_0 − x*‖² / (2K)
• If f is L-smooth and µ-strongly convex, and η = 1/L, then ‖x_K − x*‖² ≤ (1 − µ/L)^K ‖x_0 − x*‖²
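As a minimal sketch of the iteration above (toy quadratic with L = 10 and µ = 1; NumPy assumed; not part of the original slides), gradient descent with η = 1/L contracts toward the minimizer:

```python
import numpy as np

# Toy L-smooth, mu-strongly convex quadratic f(x) = 0.5 * x^T A x,
# with A = diag(1, 10), so mu = 1 and L = 10; minimizer is x* = 0.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x          # exact first-order oracle

x = np.array([1.0, 1.0])
eta = 1.0 / 10.0                # step size eta = 1/L
for _ in range(200):
    x = x - eta * grad(x)       # x_{k+1} = x_k - eta * grad f(x_k)

print(np.linalg.norm(x))        # contracts geometrically toward x* = 0
```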
Optimization without First-Order Information
• Case 1: the function values f(x_1), …, f(x_K) can be evaluated accurately, for K ≥ 2 query points per round
• Case 2: the evaluations are corrupted by noise, and the noises are independent across queries, for any K
zero-order oracle
One approach:
• construct gradient estimators from zero-order information
• replace the true gradients in first-order methods with their zero-order estimators
Case 1: 2-Point & Multi-Point Estimators
• A naïve approach: coordinate-wise finite differences,
ĝ_i(x) = (f(x + δe_i) − f(x)) / δ, i = 1, …, n
• When f is L-smooth, we have |ĝ_i(x) − ∂_i f(x)| ≤ Lδ/2, hence ‖ĝ(x) − ∇f(x)‖ ≤ √n Lδ/2
• Works well for low-dimensional problems
• Not favorable for high-dimensional problems: each gradient estimate costs n + 1 function evaluations
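The naïve coordinate-wise scheme can be sketched as follows (a minimal illustration on a toy smooth function, assuming NumPy; `fd_gradient` is a name introduced here, not from the slides):

```python
import numpy as np

# Coordinate-wise finite differences: g_i = (f(x + d*e_i) - f(x)) / d.
# One gradient estimate costs n + 1 function evaluations.
def fd_gradient(f, x, d=1e-5):
    n = x.size
    fx = f(x)
    g = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = d
        g[i] = (f(x + e) - fx) / d
    return g

f = lambda x: np.sum(x ** 2)     # toy smooth function, grad f(x) = 2x
x = np.array([1.0, -2.0, 3.0])
print(fd_gradient(f, x))         # close to [2, -4, 6]
```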
Case 1: 2-Point & Multi-Point Estimators
• 2-point gradient estimator:
g_u(x) = [f(x + uλ) − f(x − uλ)] / (2u) · λ,
where λ is drawn from a spherically symmetric distribution with E[λλᵀ] = I
• u: smoothing radius
• Under quite general conditions, we have E[g_u(x)] = ∇f_u(x),
where f_u is a smoothed version of f
• If λ has a standard Gaussian density, then f_u(x) = E_{ζ∼N(0,I)}[f(x + uζ)] (Gaussian smoothing)
• If λ is uniform on the sphere of radius √n, then f_u(x) is the average of f over the ball of radius u√n centered at x
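A minimal numerical check of the 2-point estimator (Gaussian directions, so E[λλᵀ] = I; toy quadratic and fixed seed are illustrative choices, not from the slides). Averaging many draws recovers ∇f_u(x), which is close to ∇f(x) for small u:

```python
import numpy as np

# 2-point estimator: g_u(x) = (f(x + u*lam) - f(x - u*lam)) / (2u) * lam,
# with lam ~ N(0, I) so that E[lam lam^T] = I.
rng = np.random.default_rng(0)

def two_point_estimate(f, x, u, lam):
    return (f(x + u * lam) - f(x - u * lam)) / (2 * u) * lam

f = lambda x: np.sum(x ** 2)     # grad f(x) = 2x
x = np.array([1.0, -1.0])
u = 1e-3                         # small smoothing radius
samples = [two_point_estimate(f, x, u, rng.standard_normal(2))
           for _ in range(20000)]
print(np.mean(samples, axis=0))  # close to grad f(x) = [2, -2]
```

Note the single-sample variance is substantial; it is only the average over many draws that tracks the gradient.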
Case 1: 2-Point & Multi-Point Estimators
• 2-point gradient estimator:
g_u(x) = [f(x + uλ) − f(x − uλ)] / (2u) · λ,
where λ is spherically symmetric with E[λλᵀ] = I
• Some facts for an L-smooth / convex / µ-strongly convex function f:
• f_u is L-smooth / convex / µ-strongly convex
• |f_u(x) − f(x)| ≤ L u² n / 2, so the smoothing bias vanishes as u → 0
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator: x_{k+1} = x_k − η g_u(x_k)
• Roughly follows the trajectory of true gradient descent, with fluctuations
• Somewhat like stochastic gradient descent: g_u(x_k) is an unbiased estimate of ∇f_u(x_k)
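The iteration x_{k+1} = x_k − η g_u(x_k) can be simulated end-to-end on a toy quadratic (step size, smoothing radius, and seed are illustrative choices, not the tuned values from the theory):

```python
import numpy as np

# Zero-order gradient descent driven by the 2-point estimator,
# on f(x) = 0.5 ||x||^2 (L = mu = 1) in dimension n = 5.
rng = np.random.default_rng(1)
f = lambda x: 0.5 * np.sum(x ** 2)

n, u, eta = 5, 1e-4, 0.05        # eta on the order of 1/(n*L)
x = np.ones(n)
for _ in range(3000):
    lam = rng.standard_normal(n)                       # random direction
    g = (f(x + u * lam) - f(x - u * lam)) / (2 * u) * lam
    x = x - eta * g                                    # x_{k+1} = x_k - eta * g_u(x_k)

print(f(x))                      # fluctuates near the optimum f* = 0
```

The trajectory is noisier than exact gradient descent, but contracts toward the minimizer in expectation.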
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator: x_{k+1} = x_k − η g_u(x_k)
• If f is L-smooth, with η on the order of 1/(nL), then min_{0≤k<K} E‖∇f(x_k)‖² ≲ nL(f(x_0) − f*)/K, up to an additive bias term that vanishes as u → 0
• If f is L-smooth and convex, with η on the order of 1/(nL), then E[f(x_K)] − f* ≲ nL‖x_0 − x*‖²/K, up to an additive O(u²) bias term
• Compared with exact gradient descent, this is roughly a factor-of-n slowdown in iteration count
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator: x_{k+1} = x_k − η g_u(x_k)
• If f is L-smooth and µ-strongly convex, with η on the order of 1/(nL), then
E‖x_K − x*‖² ≲ (1 − c µ/(nL))^K ‖x_0 − x*‖² + δ_u,
where c is an absolute constant and δ_u = O(u²) is an error floor induced by the smoothing bias; the floor can be made arbitrarily small by shrinking u
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator, with convergence measured against N, the total number of function evaluations: each iteration costs 2 evaluations, so N = 2K and the rates above apply with K = N/2
Case 1: 2-Point & Multi-Point Estimators
• Add more function evaluations by averaging m independent directions:
g_u(x) = (1/m) Σ_{i=1}^m [f(x + uλ_i) − f(x − uλ_i)] / (2u) · λ_i,
where the λ_i are i.i.d. spherically symmetric with E[λλᵀ] = I
• bias: the same; variance: reduced by a factor of 1/m
• # of function evaluations per iteration: multiplied by m
• More evaluations do not necessarily improve convergence (in terms of N);
in particular, taking m very large should be avoided
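The bias/variance trade-off above can be checked numerically (toy quadratic, Gaussian directions, fixed seed; `avg_estimate` is a name introduced here). Averaging m = 10 two-point estimates cuts the variance by roughly 1/10:

```python
import numpy as np

# Multi-point estimator: average m independent 2-point estimates.
# Same bias as a single estimate, variance reduced by 1/m,
# function evaluations multiplied by m.
rng = np.random.default_rng(2)
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -1.0, 0.5])
u = 1e-3

def avg_estimate(m):
    lams = rng.standard_normal((m, x.size))
    vals = [(f(x + u * l) - f(x - u * l)) / (2 * u) * l for l in lams]
    return np.mean(vals, axis=0)

# Compare the variance of the first gradient coordinate for m = 1 vs m = 10.
var1 = np.var([avg_estimate(1)[0] for _ in range(5000)])
var10 = np.var([avg_estimate(10)[0] for _ in range(5000)])
print(var1 / var10)              # roughly 10
```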
Case 2: Single-Point Estimator
• Single-point estimator:
g_u(x) = [f(x + uλ) + ξ] / u · λ,
where λ is spherically symmetric with E[λλᵀ] = I and ξ is the zero-mean evaluation noise
• We still have E[g_u(x)] = ∇f_u(x),
and |f_u(x) − f(x)| ≤ L u² n / 2
• Variance is much worse: the zero-mean term [f(x) + ξ]/u · λ contributes on the order of n (f(x)² + σ²) / u² to E‖g_u(x)‖², which blows up as u → 0 (σ²: variance of the noise ξ)
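The 1/u² variance blow-up can be seen directly in simulation (toy quadratic, Gaussian directions, small Gaussian noise, fixed seed — all illustrative choices introduced here):

```python
import numpy as np

# Single-point estimator: g_u(x) = (f(x + u*lam) + noise) / u * lam.
# Unbiased for grad f_u(x), but the f(x)/u term makes the variance
# grow roughly like 1/u^2 as the smoothing radius u shrinks.
rng = np.random.default_rng(3)
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -1.0])

def single_point(u):
    lam = rng.standard_normal(2)
    noise = 0.01 * rng.standard_normal()   # independent evaluation noise
    return (f(x + u * lam) + noise) / u * lam

vs = []
for u in (1.0, 0.1, 0.01):
    g = np.array([single_point(u) for _ in range(20000)])
    vs.append(np.var(g[:, 0]))
    print(u, vs[-1])             # variance grows as u decreases
```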
Case 2: Single-Point Estimator
• Single-point estimator:
g_u(x) = [f(x + uλ) + ξ] / u · λ,
where λ is spherically symmetric with E[λλᵀ] = I
• Does averaging multiple single-point estimators work? Decompose
g_u(x) = [f(x) + ξ]/u · λ + [f(x + uλ) − f(x)]/u · λ
• The latter part behaves like the 2-point estimator in Case 1
• Still one remaining zero-mean term, [f(x) + ξ]/u · λ, whose variance scales as 1/u²; averaging m samples only reduces it by 1/m, so the blow-up as u → 0 remains
Case 2: Single-Point Estimator
• GD with single-point estimator: x_{k+1} = x_k − η g_u(x_k); the much larger variance forces smaller step sizes and yields much slower convergence than in Case 1
Case 2: Single-Point Estimator
• Best known lower bound for smooth and strongly convex functions: after N noisy single-point evaluations, the optimization error is Ω(1/√N)
• No lower bounds are known for other classes of functions.
• In fact, convergence of order poly(n)/√N can be achieved for convex functions
by other types of zero-order methods that do not use gradient estimators.
References
2-point and multi-point estimators
• [Nesterov2017] Y. Nesterov and V. Spokoiny. “Random gradient-free minimization of convex functions,” 2017. — deterministic, 2-point
• [Duchi2015] J. C. Duchi et al. “Optimal rates for zero-order convex optimization: The power of two function evaluations,” 2015. — stochastic, 2-point and multi-point, minimax lower bound
• [Shamir2017] O. Shamir. “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” 2017. — online bandit, 2-point, optimal regret for nonsmooth cases
References
Single-point estimators
• [Flaxman2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. “Online convex optimization in the bandit setting: gradient descent without a gradient,” 2005. — single-point gradient estimator
• [Bach2016] F. Bach and V. Perchet. “Highly-smooth zero-th order online optimization,” 2016. — single-point gradient estimator
• [Agarwal2013] A. Agarwal et al. “Stochastic convex optimization with bandit feedback,” 2013. — single-point evaluation, no gradient estimation, convergence
• [Belloni2015] A. Belloni et al. “Escaping the local minima via simulated annealing: Optimization of approximately convex functions,” 2015. — single-point evaluation, no gradient estimation, convergence
• [Bubeck2017] S. Bubeck, Y. T. Lee, and R. Eldan. “Kernel-based methods for bandit convex optimization,” 2017. — single-point evaluation, no gradient estimation, convergence
References
Single-point estimators — minimax lower bounds
• [Dani2008] V. Dani, S. M. Kakade, and T. P. Hayes. “The price of bandit information for online optimization,” 2008.
• [Jamieson2012] K. G. Jamieson, R. Nowak, and B. Recht. “Query complexity of derivative-free optimization,” 2012.
• [Shamir2013] O. Shamir. “On the complexity of bandit and derivative-free stochastic convex optimization,” 2013.