
Page 1: The Marginal Value of Adaptive Gradient Methods in Machine Learning

IDS Lab

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Does deep learning really do any generalization? (part 2)
presented by Jamie Seol

Page 2: Preface

• Toy problem: smooth, quadratic, strongly convex optimization
• Let the objective f be as follows, where WLOG we suppose A is symmetric and nonsingular
  • why WLOG? Symmetric because it appears in a quadratic form, and a singular curvature (the curvature of a quadratic objective is A) is reducible in the quadratic case
• Moreover, strong convexity = positive definite curvature, meaning that all eigenvalues of A are positive
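
The objective itself appeared only as an image on the slide; a standard quadratic form consistent with the description (symmetric positive definite A) would be:

```latex
\[
f(w) = \tfrac{1}{2}\, w^{\top} A w - b^{\top} w,
\qquad A = A^{\top} \succ 0,
\qquad \nabla f(w) = A w - b.
\]
```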

Page 3: Preface

• Note that A is a real symmetric matrix, so by the spectral theorem, A has an eigendecomposition with an orthonormal basis
• For this simple objective function, we can explicitly compute the optimum
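
In symbols (again filling in the untranscribed formulas with the standard statements for the quadratic above):

```latex
\[
A = Q \Lambda Q^{\top}, \quad Q^{\top} Q = I, \quad
\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n),\ \lambda_i > 0,
\qquad
w^{\star} = A^{-1} b \ \ (\text{from } \nabla f(w^{\star}) = 0).
\]
```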

Page 4: Preface

• We'll apply gradient descent! Let the superscript denote the iteration:
• Will it converge to the optimum? Let's check!
• We use a trick: a change of basis
• The new sequence x^(k) should converge to 0
  • but when?
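
The update and the change of basis were images on the slide; with step size α they are presumably the standard ones:

```latex
\[
w^{(k+1)} = w^{(k)} - \alpha \nabla f(w^{(k)}) = w^{(k)} - \alpha (A w^{(k)} - b),
\qquad
x^{(k)} := Q^{\top}(w^{(k)} - w^{\star})
\ \Rightarrow\
x^{(k+1)} = (I - \alpha \Lambda)\, x^{(k)}.
\]
```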

Page 5: Preface

• This holds [homework: prove it]
• Rewriting in element-wise notation:
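
The element-wise formula (an image on the slide) follows directly from the diagonal recursion above:

```latex
\[
x_i^{(k+1)} = (1 - \alpha \lambda_i)\, x_i^{(k)}
\quad\Longrightarrow\quad
x_i^{(k)} = (1 - \alpha \lambda_i)^{k}\, x_i^{(0)}.
\]
```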

Page 6: Preface

• So gradient descent converges only if
• (…) for all i
• In summary, it converges when (…)
• and the optimal step size is (…)
  • where σ(A) denotes the spectral radius of A, i.e. the maximal absolute value among the eigenvalues [homework: the n = 1 case]
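
Filling in the gaps with the standard results for this quadratic (see also Goh, 2017, in the references):

```latex
\[
|1 - \alpha \lambda_i| < 1 \ \ \forall i
\;\Longleftrightarrow\;
0 < \alpha < \frac{2}{\sigma(A)},
\qquad
\alpha^{\star} = \frac{2}{\lambda_{\min} + \lambda_{\max}},
\]
```

which for n = 1 reduces to α = 1/λ = 1/σ(A), reaching the optimum in a single step.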

Page 7: Preface (appendix)

• Actually, this result is rather intuitive
• Note that A is the curvature of the objective, and the spectral radius (the largest eigenvalue in absolute value) measures how much A "stretches" along its principal axes
  • curvature ← see differential geometry
  • principal axes ← see linear algebra
• So it is unsurprising that the learning rate must stay in a safe range with respect to this "stretching", which can be achieved with simple normalization

Page 8: Preface

• Similarly, the optimal momentum decay can also be derived, using the condition number κ (formulas sketched below)
  • the condition number of a matrix is the ratio between its maximal and minimal (absolute) eigenvalues
• Therefore, if we can bound the spectral radius of the objective's curvature, then we can approximate the optimal parameters for gradient descent
  • this is the main idea of the YellowFin optimizer
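
The slide's formulas were not transcribed; the standard heavy-ball tuning for this quadratic model (as derived in Goh, 2017) is:

```latex
\[
\alpha^{\star} = \left( \frac{2}{\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}}} \right)^{2},
\qquad
\beta^{\star} = \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2},
\qquad
\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}.
\]
```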

Page 9: Preface

• So what?
• We understand the behavior of gradient descent quite well
  • if the objective is smooth, quadratic, and strongly convex…
  • but the objectives of deep learning are not that nice!
• We just don't know much about the characteristics of deep learning objective functions yet
  • requires more research

Page 10: Preface 2

• Here's a typical linear regression problem
• If the number of features d is bigger than the number of samples m, then the system is underdetermined
• So it has (possibly infinitely) many solutions
• Let's use stochastic gradient descent (SGD)
  • which solution will SGD find?

Page 11: Preface 2

• Actually, we already discussed this in a previous seminar
• Anyway, even if the system is underdetermined, SGD always converges to a unique solution, which lies in the span of the data X (see the numerical sketch below)
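
A minimal numerical sketch of this claim (my own illustration, not part of the slides; the dimensions and step size are arbitrary): run plain SGD on an underdetermined least-squares problem from a zero initialization and compare the result with the minimum-norm solution X⁺y.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 100                          # fewer samples than features: underdetermined
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

w = np.zeros(d)                         # start at 0, i.e. inside the span of the rows of X
lr = 0.005
for epoch in range(2000):
    for i in rng.permutation(m):        # one sample at a time (plain SGD)
        w -= lr * (X[i] @ w - y[i]) * X[i]

w_min_norm = np.linalg.pinv(X) @ y      # minimum-norm interpolating solution
print(np.abs(X @ w - y).max())          # ~0: SGD fits the data exactly
print(np.linalg.norm(w - w_min_norm))   # ~0: and it is the minimum-norm solution
```

The reason is simple: every SGD update is a multiple of some row x_i, so an iterate started at 0 can never leave the row span of X, and the only interpolating solution inside that span is the minimum-norm one.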

Page 12: Preface 2

• Moreover, experiments show that SGD's solution has a small norm
• We know that l2-regularization helps generalization
  • l2-regularization: keeping the parameters' norm small
• So we can say that SGD has an implicit regularization effect
  • but there is also evidence that l2-regularization does not help at all…
  • see the previous seminar presented by me
• It works, but actually it doesn't quite work; it is still fairly good, but not all that good…

Page 13: Introduction

• In summary,
  • adaptive gradient descent methods
  • might be poor
  • at generalization

Page 14: Preliminaries

• Famous non-adaptive gradient descent methods (update rules sketched below):
  • Stochastic Gradient Descent [SGD]
  • Heavy-Ball [HB] (Polyak, 1964)
  • Nesterov's Accelerated Gradient [NAG] (Nesterov, 1983)
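
The slide showed the update rules as images; the standard forms, with stochastic gradient g_k, step size α, and momentum β, are roughly:

```latex
\[
\begin{aligned}
\text{SGD:}\quad & w_{k+1} = w_k - \alpha\, g_k \\
\text{HB:}\quad  & w_{k+1} = w_k - \alpha\, g_k + \beta\,(w_k - w_{k-1}) \\
\text{NAG:}\quad & w_{k+1} = w_k - \alpha\, \nabla f\big(w_k + \beta\,(w_k - w_{k-1})\big) + \beta\,(w_k - w_{k-1})
\end{aligned}
\]
```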

Page 15: Preliminaries

• Adaptive methods can be summarized as:
  • AdaGrad (Duchi, 2011)
  • RMSProp (Tieleman and Hinton, 2012, from the Coursera lecture!)
  • Adam (Kingma and Ba, 2015)
• In short, these methods adaptively change the learning rate and momentum decay per coordinate (see the sketch below)
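
A minimal sketch of the per-coordinate scaling these methods share, using AdaGrad's accumulator as the simplest example (my own illustration; the function name and constants are arbitrary, and RMSProp/Adam replace the running sum with exponential moving averages, with Adam also adding momentum):

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: every coordinate gets its own effective step size."""
    accum += grad ** 2                       # running sum of squared gradients
    w -= lr * grad / (np.sqrt(accum) + eps)  # coordinates with large past gradients take smaller steps
    return w, accum

# usage on the quadratic toy objective f(w) = 0.5 w^T A w - b^T w
A = np.diag([10.0, 1.0])
b = np.array([1.0, 1.0])
w, accum = np.zeros(2), np.zeros(2)
for _ in range(500):
    w, accum = adagrad_step(w, A @ w - b, accum)
print(w, np.linalg.solve(A, b))              # compare with the exact optimum A^{-1} b
```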

Page 16: Preliminaries

• All together

Page 17: Synopsis

• For a system with multiple solutions, which solution does an algorithm find, and how well does it generalize to unseen data?
• Claim: there exists a constructed problem (dataset) in which
  • non-adaptive methods work well
    • they find a solution with good generalization power
  • adaptive methods work poorly
    • they find a solution with poor generalization power
    • we can even make this arbitrarily poor, while the non-adaptive solution keeps working

Page 18: Problem settings

• Think of a simple binary least-squares classification problem
• When d > n, if there is an optimum with loss 0, then there are infinitely many optima
• But as shown in Preface 2, SGD converges to a unique solution
  • known to be the minimum-norm solution
  • which generalizes well
    • why? Because here it is also the largest-margin solution
• All other non-adaptive methods also converge to the same solution

Page 19: Lemma

• Let sign(x) denote the function that maps each component of x to its sign
  • e.g. sign([2, -3]) = [1, -1]
• If there exists a solution proportional to sign(X^T y), then it is precisely the unique solution to which all adaptive methods converge
  • quite an interesting lemma!
  • proof: by induction over the iterations
• Note that this solution is essentially the mean of the positively labeled vectors minus the mean of the negatively labeled vectors
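
Paraphrasing the lemma from Wilson et al. (2017) in symbols (my wording, not the slide's): suppose there is a scalar c such that X sign(X^T y) = c y; then, when initialized at w = 0, AdaGrad, RMSProp, and Adam all converge to the unique solution

```latex
\[
w_{\text{ada}} \;\propto\; \operatorname{sign}(X^{\top} y).
\]
```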

Page 20

Page 21: Funny dataset

• Let's fool the adaptive methods
• First, assign y_i = 1 with probability p > 1/2 (and y_i = -1 otherwise)
• when y = [-1, -1, -1, -1]: (…)
• when y = [1, 1, 1, 1]: (…)

Page 22: Funny dataset

• Note that for such a dataset, the only discriminative feature is the first one!
  • e.g. if y = [1, -1, -1, 1, -1], then X becomes: (…)

Page 23: Funny dataset

• Let (…) and assume b > 0 (which holds since p > 1/2)
• Suppose (…); then (…)

Page 24: Funny dataset

• So, (…) holds!
• Take a closer look
  • the first three components are 1, and every remaining feature is 0 for new data
  • this solution is bad!
  • it will classify every new data point as positive!!!
  • what a horrible generalization!
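
Spelling this out (my reconstruction of the argument, following the construction in Wilson et al., 2017): the adaptive solution is proportional to sign(X^T y), whose first three coordinates are +1, while a fresh test point x has x_1 = y (its true label), x_2 = x_3 = 1, and zeros in all the features that were unique to individual training examples, so

```latex
\[
\langle w_{\text{ada}},\, x \rangle \;\propto\; y \cdot 1 + 1 + 1 \;=\; y + 2 \;>\; 0,
\]
```

i.e. the prediction is +1 no matter what the true label y is.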

Page 25: Funny dataset

• How about the non-adaptive methods?
• So, when (…), their solution makes no errors
  • wow

Page 26: Funny dataset

• Think this is too extreme?
• Well, even in real datasets, the following are rather common:
  • a few frequent features (j = 2, 3 here)
  • some features that are good indicators but hard to identify (j = 1 here)
  • many other sparse features (the rest)

Page 27: Experiments

• (the authors said that they downloaded the models from the internet…)
• Results in summary:
  • adaptive methods generalize poorly
    • even when they reach a lower training loss than the non-adaptive ones!!!
  • adaptive methods look fast, but that's it
  • adaptive methods promise "no more tuning", but tuning the initial values was still significant
    • and it takes as much time as tuning the non-adaptive methods…

Page 28: Experiments

• CIFAR-10
  • use non-adaptive methods

Page 29: Experiments

• lower training loss, yet more test error (Adam vs. HB)

Page 30: Experiments

• Character-level language model
  • AdaGrad looks very fast, but in the end it is not good
  • surprisingly, RMSProp closely trails SGD on the test set

Page 31: Experiments

• Parsing
  • well, it is true that the non-adaptive methods are slower

Page 32: Conclusion

• Adaptive methods are not clearly advantageous: they may optimize quickly, but they generalize poorly
• Then why is Adam so popular?
  • because it's popular…?
  • in particular, it is known to be popular for GANs and Q-learning
    • these are not exactly (pure) optimization problems
    • we don't yet understand the nature of the objectives in those two settings

Page 33: References

• Wilson, Ashia C., et al. "The Marginal Value of Adaptive Gradient Methods in Machine Learning." arXiv preprint arXiv:1705.08292 (2017).
• Zhang, Jian, Ioannis Mitliagkas, and Christopher Ré. "YellowFin and the Art of Momentum Tuning." arXiv preprint arXiv:1706.03471 (2017).
• Zhang, Chiyuan, et al. "Understanding Deep Learning Requires Rethinking Generalization." arXiv preprint arXiv:1611.03530 (2016).
• Polyak, Boris T. "Some Methods of Speeding up the Convergence of Iteration Methods." USSR Computational Mathematics and Mathematical Physics 4.5 (1964): 1-17.
• Goh, Gabriel. "Why Momentum Really Works." Distill, 2017. http://doi.org/10.23915/distill.00006