
Page 1: A Tutorial on Zero-Order Optimization

A Tutorial on Zero-Order Optimization

Yujie Tang

Page 2: A Tutorial on Zero-Order Optimization

A Review of Gradient Descent

• Gradient descent: x_{k+1} = x_k − η∇f(x_k)

• If f is L-smooth and the step size satisfies η ≤ 1/L, then min_{0≤k<K} ||∇f(x_k)||² ≤ 2(f(x_0) − f*)/(ηK), i.e., the squared gradient norm decays at rate O(1/K)

• If f is L-smooth and convex, and η = 1/L, then f(x_K) − f* ≤ L||x_0 − x*||²/(2K), i.e., the optimality gap decays at rate O(1/K)

• If f is L-smooth and µ-strongly convex, and η = 1/L, then ||x_K − x*||² ≤ (1 − µ/L)^K ||x_0 − x*||², i.e., linear convergence

Page 3: A Tutorial on Zero-Order Optimization

Optimization without First-Order Information

• Case 1: the function values can be evaluated accurately for K = 2 query points

• Case 2: the evaluation noise at different query points is independent, for any K

Zero-order oracle: given a query point x, it returns (a possibly noisy) function value f(x), but no gradient information.

One approach (sketched below):

• construct gradient estimators based on zero-order information

• replace the true gradients in first-order methods with their zero-order estimators
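As a rough illustration of this template (not the exact algorithm from the slides), the sketch below runs gradient descent with a pluggable zero-order gradient estimator; the names zero_order_gd and grad_estimator, and the default step size, are assumptions made for the example.

import numpy as np

def zero_order_gd(f, grad_estimator, x0, eta=0.01, num_iters=1000):
    """Gradient descent where the true gradient is replaced by a
    zero-order estimate built only from function evaluations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_estimator(f, x)   # uses only zero-order (function value) queries
        x = x - eta * g            # standard first-order update, gradient replaced by g
    return x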

Page 4: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• A naïve approach: estimate each partial derivative by a finite difference along the corresponding coordinate direction, e.g. [f(x + δe_i) − f(x)]/δ for i = 1, …, d

• When f is L-smooth, the error of each such partial-derivative estimate is at most Lδ/2

• Works well for low-dimensional problems

• Not favorable for high-dimensional problems: each gradient estimate requires d + 1 function evaluations (see the sketch below)
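A minimal sketch of this naïve coordinate-wise estimator, assuming the forward-difference form described above; coord_fd_estimator and the default delta are illustrative names, not notation from the slides.

import numpy as np

def coord_fd_estimator(f, x, delta=1e-4):
    """Naive zero-order gradient estimate: one forward difference per
    coordinate, hence d + 1 function evaluations in dimension d."""
    d = x.shape[0]
    fx = f(x)
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0                              # i-th coordinate direction
        g[i] = (f(x + delta * e) - fx) / delta  # forward difference
    return g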

Page 5: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• 2-point gradient estimator (a standard instance is sketched below): formed from two function evaluations along a random direction drawn from a spherically symmetric distribution

• u: smoothing radius

• Under quite general conditions, the estimator is unbiased for the gradient of a smoothed function f_u: its expectation equals ∇f_u(x), where f_u is a smooth version of f

• If the random direction has a Gaussian density, then f_u is the Gaussian-smoothed version of f (a convolution of f with a Gaussian of scale u)

• If the random direction is uniform on a sphere, then f_u(x) is the average of f over a ball of radius u centered at x
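For concreteness, here is one standard instance of a 2-point estimator, assuming the forward-difference convention with the direction drawn uniformly from the unit sphere and the usual d/u scaling; the name two_point_estimator and the default smoothing radius are assumptions, not necessarily the exact form used in the slides.

import numpy as np

def two_point_estimator(f, x, u=1e-2, rng=None):
    """2-point zero-order gradient estimate along one random direction.
    Its expectation is the gradient of the ball-smoothed function f_u."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)          # direction uniform on the unit sphere
    return (d / u) * (f(x + u * z) - f(x)) * z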

Page 6: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• 2-point gradient estimator: as on the previous slide, with the random direction drawn from a spherically symmetric distribution

• Some facts for an L-smooth / convex / µ-strongly convex function f:

• the smoothed function f_u is also L-smooth / convex / µ-strongly convex

Page 8: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• Gradient descent with 2-point estimator: run the usual update with the true gradient replaced by the 2-point estimate (a usage sketch follows below)

• Roughly follows the trajectory of true gradient descent, with fluctuations

• Somewhat like stochastic gradient descent
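Reusing the illustrative sketches above (the quadratic test function and all parameter values here are assumptions, not from the slides), plugging the 2-point estimator into the zero-order gradient descent template looks like this:

import numpy as np

# Hypothetical test problem: a smooth, strongly convex quadratic.
f = lambda x: 0.5 * np.sum(x ** 2)

x_last = zero_order_gd(
    f,
    lambda func, x: two_point_estimator(func, x, u=1e-2),
    x0=np.ones(10),
    eta=0.05,
    num_iters=2000,
)
print(np.linalg.norm(x_last))   # close to 0, up to small fluctuations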

Page 9: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• Gradient descent with 2-point estimator:

• If f is L-smooth, then with an appropriately chosen step size the expected squared gradient norm decays at rate O(d/K), a factor of the dimension d worse than true gradient descent

• If f is L-smooth and convex, then with an appropriately chosen step size the expected optimality gap decays at rate O(d/K)

Page 10: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• Gradient descent with 2-point estimator:

• If f is L-smooth and µ-strongly convex, then with an appropriately chosen step size the iterates converge linearly in expectation, where the contraction factor degrades by roughly a factor of the dimension d compared with true gradient descent

Page 11: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• Gradient descent with 2-point estimator:

Page 12: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• Gradient descent with 2-point estimator:

N: number of function evaluations

Page 13: A Tutorial on Zero-Order Optimization

Case 1: 2-Point & Multi-Point Estimators

• Add more function evaluations: average m independent 2-point estimators, each along its own random direction drawn from a spherically symmetric distribution (a sketch follows below)

• bias: the same; variance: reduced by a factor of 1/m

• number of function evaluations per iteration: multiplied by m

• Adding more evaluations does not necessarily improve convergence in terms of N, the total number of function evaluations

• Taking m unnecessarily large should therefore be avoided
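A sketch of this averaged (multi-point) estimator, reusing the illustrative two_point_estimator above; the name multi_point_estimator and the default m are assumptions for the example.

import numpy as np

def multi_point_estimator(f, x, m=10, u=1e-2, rng=None):
    """Average of m independent 2-point estimates: same bias as a single
    2-point estimate, variance reduced by 1/m, cost multiplied by m."""
    rng = np.random.default_rng() if rng is None else rng
    samples = [two_point_estimator(f, x, u=u, rng=rng) for _ in range(m)]
    return np.mean(samples, axis=0)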

Page 14: A Tutorial on Zero-Order Optimization

Case 2: Single-Point Estimator

• Single-point estimator (a sketch follows below): uses a single noisy function evaluation along a random direction drawn from a spherically symmetric distribution

• We still have that the estimator is unbiased for the gradient of a smoothed function f_u, and f_u retains the properties discussed in Case 1

• The variance, however, is much worse: it scales with the magnitude of the (noisy) function values themselves, including the evaluation noise, rather than with function differences
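One common form of the single-point estimator, sketched under the same assumed conventions as before (direction uniform on the unit sphere, d/u scaling); f_noisy stands for the noisy zero-order oracle and, like the other names and defaults, is an assumption of the example.

import numpy as np

def single_point_estimator(f_noisy, x, u=0.1, rng=None):
    """Single-point zero-order gradient estimate from one noisy evaluation.
    Its expectation is still the gradient of a smoothed function, but the
    variance scales with the function value and the evaluation noise."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)          # direction uniform on the unit sphere
    return (d / u) * f_noisy(x + u * z) * z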

Page 15: A Tutorial on Zero-Order Optimization

Case 2: Single-Point Estimator

• Single-point estimator: as on the previous slide, with the random direction drawn from a spherically symmetric distribution

• Does averaging multiple single-point estimators work?

• The last two terms of the decomposition coincide with the 2-point estimator in Case 1

• Still one remaining term, coming from the evaluation noise

Page 16: A Tutorial on Zero-Order Optimization

Case 2: Single-Point Estimator

• GD with single-point estimator: plug the single-point estimate into the same gradient descent template (a usage sketch follows below)
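Reusing the illustrative helpers above, with a hypothetical noisy oracle built from the same quadratic test function (all names and parameter values are assumptions), gradient descent with the single-point estimator might look like this; note the much smaller step size and larger iteration count forced by the higher variance.

import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 0.5 * np.sum(x ** 2)                       # true objective
f_noisy = lambda x: f(x) + 0.01 * rng.standard_normal()  # noisy zero-order oracle

x_last = zero_order_gd(
    f_noisy,
    lambda func, x: single_point_estimator(func, x, u=0.3),
    x0=np.ones(10),
    eta=1e-4,           # much smaller step size than in the 2-point case
    num_iters=50000,
)
print(f(x_last))        # decreases toward 0, but far more slowly and noisily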

Page 17: A Tutorial on Zero-Order Optimization

Case 2: Single-Point Estimator

• Best known lower bound for smooth and strongly convex functions:

• No lower bounds are known for other classes of functions.

• In fact, convergence can be achieved for convex functions by other types of zero-order methods that do not use gradient estimators.

Page 18: A Tutorial on Zero-Order Optimization

References

2-point and multi-point estimators

• [Nesterov2017] Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” 2017. (deterministic, 2-point)

• [Duchi2015] J. C. Duchi et al., “Optimal rates for zero-order convex optimization: The power of two function evaluations,” 2015. (stochastic, 2-point and multi-point, minimax lower bound)

• [Shamir2017] O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” 2017. (online bandit, 2-point, optimal regret for nonsmooth cases)

Page 19: A Tutorial on Zero-Order Optimization

References

Single-point estimators

• [Flaxman2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: gradient descent without a gradient,” 2005.

• [Bach2016] F. Bach and V. Perchet, “Highly-smooth zero-th order online optimization,” 2016.

(single-point gradient estimator)

• [Agarwal2013] A. Agarwal et al., “Stochastic convex optimization with bandit feedback,” 2013.

• [Belloni2015] A. Belloni et al., “Escaping the local minima via simulated annealing: Optimization of approximately convex functions,” 2015.

• [Bubeck2017] S. Bubeck, Y. T. Lee, and R. Eldan, “Kernel-based methods for bandit convex optimization,” 2017.

(single-point evaluation, no gradient estimation, convergence)

Page 20: A Tutorial on Zero-Order Optimization

References

Single-point estimators

• [Dani2008] V. Dani, S. M. Kakade, and T. P. Hayes, “The price of bandit information for online optimization,” 2008.

• [Jamieson2012] K. G. Jamieson, R. Nowak, and B. Recht, “Query complexity of derivative-free optimization,” 2012.

• [Shamir2013] O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” 2013.

(minimax lower bound)