
Newton Methods for Neural Networks: Introduction

Chih-Jen Lin, National Taiwan University

Last updated: May 25, 2020


Outline

1 Introduction

2 Newton method

3 Hessian and Gauss-Newton Matrices

Introduction

Optimization Methods Other than Stochastic Gradient

We have explained why stochastic gradient is popular for deep learning

The same reasons may explain why other methods are not suitable for deep learning

But we also note that many modifications were made to go from the simplest SG to what people actually use

Can we extend other optimization methods to be suitable for deep learning?


Newton Method

Consider an optimization problem

$$\min_{\theta} \; f(\theta)$$

The Newton method solves the second-order approximation to get a direction d:

$$\min_{d} \; \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta) d \qquad (1)$$

If f(θ) is not strictly convex, (1) may not have a unique solution


Newton Method (Cont’d)

We may use a positive-definite G to approximate ∇²f(θ)

Then, setting the gradient of (1) with respect to d to zero, the direction is obtained by solving the linear system

Gd = −∇f(θ)

The resulting direction is a descent one:

∇f(θ)^T d = −∇f(θ)^T G⁻¹ ∇f(θ) < 0
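To see these two steps concretely, here is a minimal sketch in Python (the gradient vector and the positive-definite G below are hypothetical values chosen only for illustration):

```python
import numpy as np

# Toy values for illustration only: grad stands in for the gradient
# of f at the current theta, and G is a positive-definite Hessian
# approximation (eigenvalues of this G are positive).
grad = np.array([1.0, -2.0])
G = np.array([[4.0, 1.0],
              [1.0, 3.0]])

d = np.linalg.solve(G, -grad)   # solve G d = -grad f(theta)

# Descent check: grad f(theta)^T d = -grad^T G^{-1} grad < 0
print(grad @ d)                 # prints a negative number
```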


Newton Method (Cont’d)

The procedure:

while stopping condition not satisfied do
    Let G be ∇²f(θ) or its approximation
    Exactly or approximately solve
        Gd = −∇f(θ)
    Find a suitable step size α
    Update
        θ ← θ + αd
end while


Step Size I

Selection of the step size α: usually two types of approaches

- Line search
- Trust region (or its predecessor, the Levenberg-Marquardt algorithm)

If using line search, the details are similar to what we had for gradient descent

We gradually reduce α such that

f(θ + αd) < f(θ) + ν∇f(θ)^T (αd),

where ν ∈ (0, 1) is a given constant
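Putting the procedure and the line search together, a minimal runnable sketch (the Rosenbrock test function, the eigenvalue shift used to keep G positive definite, and the constant ν = 10⁻⁴ are my own choices, not from the slides):

```python
import numpy as np

# Rosenbrock function as a small test problem (my own choice)
def f(t):
    x, y = t
    return (1 - x)**2 + 100 * (y - x**2)**2

def grad(t):
    x, y = t
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def hess(t):
    x, y = t
    return np.array([[2 - 400 * (y - x**2) + 800 * x**2, -400 * x],
                     [-400 * x, 200.0]])

theta = np.array([-1.2, 1.0])
nu = 1e-4                                    # sufficient-decrease constant
for it in range(100):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-8:             # stopping condition
        break
    H = hess(theta)
    lam = np.linalg.eigvalsh(H)[0]           # let G be the Hessian, shifted
    G = H if lam > 1e-8 else H + (1e-8 - lam) * np.eye(2)   # so G is positive definite
    d = np.linalg.solve(G, -g)               # exactly solve G d = -grad f(theta)
    alpha = 1.0                              # gradually reduce alpha until
    while f(theta + alpha * d) >= f(theta) + nu * alpha * g.dot(d) and alpha > 1e-10:
        alpha /= 2                           # f(theta+alpha d) < f(theta) + nu grad^T(alpha d)
    theta = theta + alpha * d                # update

print(it, theta)                             # converges to the optimum (1, 1)
```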


Newton versus Gradient Descent I

We know they use second-order and first-order information, respectively

What are their special properties?

It is known that using higher-order information leads to faster final local convergence


Newton versus Gradient Descent II

An illustration (modified from Tsai et al. (2014)) presented earlier:

[Figure: two plots of distance to optimum versus time, captioned "Slow final convergence" and "Fast final convergence"]


Newton versus Gradient Descent III

But the question is: for machine learning, do we need fast final local convergence?

The answer is no

However, higher-order methods tend to be more robust

Their behavior may be more consistent across easy and difficult problems

It is known that stochastic gradient is sometimes sensitive to its parameters

Thus what we hope to investigate here is whether we can have a more robust optimization method


Difficulties of Newton for NN I

The Newton linear system

Gd = −∇f(θ)    (2)

can be large: G ∈ R^{n×n},

where n is the total number of variables

Thus G is often too large to be stored. For example, with n = 10⁶ variables, storing G in double precision already takes 8 × 10¹² bytes (8 TB)


Difficulties of Newton for NN II

Even if we can store G, calculating

d = −G⁻¹∇f(θ)

is usually very expensive

Thus a direct use of Newton for deep learning is hopeless


Existing Works Trying to Make Newton Practical I

Many works tried to address this issue

Their approaches vary significantly

I roughly categorize them into two groups:

- Hessian-free (Martens, 2010; Martens and Sutskever, 2012; Wang et al., 2020; Henriques et al., 2018)
- Hessian approximation (Martens and Grosse, 2015; Botev et al., 2017; Zhang et al., 2017); in particular, diagonal approximation
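For context, the central trick behind Hessian-free approaches is that iterative solvers such as conjugate gradient need only Hessian-vector products, never the matrix itself. Below is a minimal sketch of such a product via finite differences of the gradient (the toy function is my own illustration; works such as Martens (2010) compute exact products instead):

```python
import numpy as np

def grad(theta):
    # gradient of a hypothetical test function f(theta) = sum(theta^4) / 4
    return theta**3

def hess_vec(grad, theta, v, eps=1e-6):
    # Hessian-vector product without ever forming the Hessian:
    # H v  ~=  (grad(theta + eps v) - grad(theta - eps v)) / (2 eps)
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

theta = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, -1.0])
print(hess_vec(grad, theta, v))  # exact Hessian is diag(3 theta^2), so H v = [3, 0, -27]
```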


Existing Works Trying to Make Newton Practical II

There are many others that I did not put into the above two groups, for various reasons (Osawa et al., 2019; Wang et al., 2018; Chen et al., 2019; Wilamowski et al., 2007)

There are also comparisons (Chen and Hsieh, 2018)

With so many possibilities, it is difficult to reach conclusions

We decide to first check the robustness of standard Newton methods on small-scale data


Existing Works Trying to Make Newton Practical III

Thus, in our discussion, we try not to make approximations

Hessian and Gauss-Newton Matrices

Introduction

We will check techniques to address the difficulty of storing or inverting the Hessian

But before that, let us derive its mathematical form


Hessian Matrix I

For CNN, the gradient of f(θ) is

$$\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T \nabla_{z^{L+1,i}} \xi(z^{L+1,i}; y^i, Z^{1,i}), \qquad (3)$$

where

$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix}_{n_{L+1}\times n}, \quad i = 1, \ldots, l, \qquad (4)$$


Hessian Matrix II

is the Jacobian of z^{L+1,i}(θ).

The Hessian matrix of f(θ) is

$$\nabla^2 f(\theta) = \frac{1}{C}I + \frac{1}{l}\sum_{i=1}^{l}(J^i)^T B^i J^i + \frac{1}{l}\sum_{i=1}^{l}\sum_{j=1}^{n_{L+1}} \frac{\partial \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z^{L+1,i}_j} \begin{bmatrix} \frac{\partial^2 z^{L+1,i}_j}{\partial \theta_1 \partial \theta_1} & \cdots & \frac{\partial^2 z^{L+1,i}_j}{\partial \theta_1 \partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 z^{L+1,i}_j}{\partial \theta_n \partial \theta_1} & \cdots & \frac{\partial^2 z^{L+1,i}_j}{\partial \theta_n \partial \theta_n} \end{bmatrix},$$


Hessian Matrix III

where I is the identity matrix and B^i is the Hessian of ξ(·) with respect to z^{L+1,i}:

$$B^i = \nabla^2_{z^{L+1,i}} \xi(z^{L+1,i}; y^i, Z^{1,i})$$

More precisely,

$$B^i_{ts} = \frac{\partial^2 \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z^{L+1,i}_t \, \partial z^{L+1,i}_s}, \quad \forall t, s = 1, \ldots, n_{L+1}. \qquad (5)$$

Usually B^i is very simple.


Hessian Matrix IV

For example, if the squared loss is used,

$$\xi(z^{L+1,i}; y^i) = \|z^{L+1,i} - y^i\|^2,$$

then

$$B^i = \begin{bmatrix} 2 & & \\ & \ddots & \\ & & 2 \end{bmatrix} = 2I$$

Usually we consider a loss function ξ(z^{L+1,i}; y^i) that is convex with respect to z^{L+1,i}
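As a sanity check, the snippet below (my own illustration, not from the slides) estimates B^i for the squared loss by finite differences and recovers 2I:

```python
import numpy as np

def xi(z, y):                    # squared loss ||z - y||^2
    return np.sum((z - y)**2)

def numerical_hessian(f, z, eps=1e-5):
    # central finite-difference estimate of the Hessian of f at z
    n = z.size
    H = np.zeros((n, n))
    E = np.eye(n)
    for t in range(n):
        for s in range(n):
            H[t, s] = (f(z + eps*E[t] + eps*E[s]) - f(z + eps*E[t] - eps*E[s])
                       - f(z - eps*E[t] + eps*E[s]) + f(z - eps*E[t] - eps*E[s])) / (4 * eps**2)
    return H

rng = np.random.default_rng(0)
z, y = rng.normal(size=3), rng.normal(size=3)
print(np.round(numerical_hessian(lambda v: xi(v, y), z), 3))  # approximately 2 * I
```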


Hessian Matrix V

Thus B^i is positive semi-definite

However, the last term of ∇²f(θ) may not be positive semi-definite

Note that for a twice-differentiable function f(θ),

f(θ) is convex if and only if ∇²f(θ) is positive semi-definite for all θ


Jacobian Matrix

The Jacobian matrix of z^{L+1,i}(θ) ∈ R^{n_{L+1}} is

$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^{n_{L+1}\times n}, \quad i = 1, \ldots, l.$$

n_{L+1}: number of neurons in the output layer

n: total number of variables

n_{L+1} × n can be large, and one such matrix is needed for every instance i


Gauss-Newton Matrix I

The Hessian matrix ∇²f(θ) is in general not positive definite.

We may need a positive-definite approximation

This is a deep research issue

Many existing Newton methods for NN have considered the Gauss-Newton matrix (Schraudolph, 2002)

$$G = \frac{1}{C}I + \frac{1}{l}\sum_{i=1}^{l}(J^i)^T B^i J^i,$$

obtained by removing the last term of ∇²f(θ)


Gauss-Newton Matrix II

The Gauss-Newton matrix is positive definite if B^i is positive semi-definite: each (J^i)^T B^i J^i is then positive semi-definite, and the (1/C)I term makes the sum positive definite

This can be achieved if we use a loss function that is convex in terms of z^{L+1,i}(θ)

We then solve

Gd = −∇f(θ)
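To make the construction concrete, here is a minimal sketch that assembles G for a toy linear model z(θ) = Wx with the squared loss, so B^i = 2I (the model, the data, and C = 1 are my own illustrative choices; the objective is assumed to be (1/(2C))‖θ‖² plus the average loss, consistent with (3)):

```python
import numpy as np

# Toy setup (hypothetical): z^i(theta) = W x^i with theta = vec(W)
rng = np.random.default_rng(0)
l, n_in, n_out = 5, 4, 3            # instances, input dim, output neurons
n = n_out * n_in                    # total number of variables
C = 1.0                             # regularization parameter (my choice)
X = rng.normal(size=(l, n_in))
Y = rng.normal(size=(l, n_out))
theta = rng.normal(size=n)

def model(theta, x):                # z(theta) = W x
    return theta.reshape(n_out, n_in) @ x

def jacobian(x):                    # for this linear model, J^i = kron(I, x^T)
    return np.kron(np.eye(n_out), x)

# Gradient as in (3) and Gauss-Newton matrix G = (1/C) I + (1/l) sum_i (J^i)^T B^i J^i
B = 2 * np.eye(n_out)               # B^i = 2I for the squared loss
g = theta / C
G = np.eye(n) / C
for i in range(l):
    J = jacobian(X[i])
    r = model(theta, X[i]) - Y[i]   # gradient of ||z - y||^2 w.r.t. z is 2(z - y)
    g += J.T @ (2 * r) / l
    G += J.T @ B @ J / l

d = np.linalg.solve(G, -g)          # Newton direction from G d = -grad f
print(bool(g @ d < 0))              # True: d is a descent direction
```

For this linear toy model the second derivatives of z(θ) vanish, so G here coincides with the full Hessian; for an actual network, G is only an approximation.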

References

A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, pages 557–565, 2017.

P. H. Chen and C.-J. Hsieh. A comparison of second-order methods for deep convolutional neural networks, 2018. URL https://openreview.net/forum?id=HJYoqzbC-.

S.-W. Chen, C.-N. Chou, and E. Y. Chang. An approximate second-order method for training fully-connected neural networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

J. F. Henriques, S. Ehrhardt, S. Albanie, and A. Vedaldi. Small steps and giant leaps: Minimal Newton solvers for deep learning, 2018. arXiv preprint arXiv:1805.08095.

J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, pages 2408–2417, 2015.

J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.


K. Osawa, Y. Tsuji, Y. Ueno, A. Naruse, R. Yokota, and S. Matsuoka. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.

C.-H. Tsai, C.-Y. Lin, and C.-J. Lin. Incremental and decremental training for linear classification. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014. URL http://www.csie.ntu.edu.tw/~cjlin/papers/ws/inc-dec.pdf.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30(6):1673–1724, 2018. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.

C.-C. Wang, K. L. Tan, and C.-J. Lin. Newton methods for convolutional neural networks. ACM Transactions on Intelligent Systems and Technology, 2020. URL https://www.csie.ntu.edu.tw/~cjlin/papers/cnn/newton-CNN.pdf. To appear.

B. M. Wilamowski, N. Cotton, J. Hewlett, and O. Kaynak. Neural network trainer with second order learning algorithms. In Proceedings of the 11th International Conference on Intelligent Engineering Systems, 2007.


H. Zhang, C. Xiong, J. Bradbury, and R. Socher. Block-diagonal Hessian-free optimization for training neural networks, 2017. arXiv preprint arXiv:1712.07296.
