
Limited-Memory Quasi-Newton and Hessian-Free Newton Methods for Non-Smooth Optimization

Mark Schmidt

Department of Computer Science, University of British Columbia

December 10, 2010


Outline

1 Motivation and Overview
    Structure Learning with ℓ1-Regularization
    Structure Learning with Group ℓ1-Regularization
    Structure Learning with Structured Sparsity

2 L-BFGS and Hessian-Free Newton

3 Two-Metric (Sub-)Gradient Projection

4 Inexact Projected/Proximal Newton

5 Discussion


Motivating Problem: Structure Learning in Discrete MRFs

[Figure: a pairwise Markov random field on nodes X1, ..., X9 with unknown edge structure; the slide builds overlay estimated edge weights, with ℓ1-regularization driving several of them exactly to zero.]

We want to fit a Markov random field to discrete data, but don't know the graph structure.

We can learn a sparse structure by using ℓ1-regularization of the edge parameters [Lee et al., 2006; Wainwright et al., 2006].


Optimization with ℓ1-Regularization

This requires solving an optimization problem of the form

\min_x \; f(x) + \sum_{i=1}^{n} \lambda_i \, |x_i|

Solving this optimization has 3 complicating factors:
1 the number of parameters is large
2 evaluating the objective is expensive
3 the objective is non-smooth

If the objective were smooth, we might consider Hessian-free Newton methods or limited-memory quasi-Newton methods.

The non-smooth term is separable.
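Concretely, separability means the proximal operator of the ℓ1 term decouples into independent scalar soft-thresholds. A minimal sketch (not from the slides; grad_f, the step size t, and the proximal-gradient usage in the comment are illustrative assumptions):

```python
import numpy as np

def soft_threshold(x, t, lam):
    """Prox of t * sum_i lam_i |x_i|: by separability it acts elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

# One proximal-gradient step on f(x) + sum_i lam_i |x_i| would then be:
#   x = soft_threshold(x - t * grad_f(x), t, lam)
```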


Structure Learning with Group ℓ1-Regularization

[Figure: the same MRF where each edge carries a block of weights; the slide builds zero out entire blocks at once.]

In some cases, we want sparsity in groups of parameters:
1 Multi-parameter edges [Lee et al., 2006].
2 Blockwise-sparsity [Duchi et al., 2008].
3 Conditional random fields [Schmidt et al., 2008].


In these cases we might consider group ℓ1-regularization:

\min_x \; f(x) + \sum_{g} \lambda_g \, \|x_g\|_p

Typically, we use the ℓ2-norm, ℓ∞-norm, or nuclear norm.

Now, the non-smooth term is not even separable.

However, the non-smooth term is still simple.
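"Simple" here means the proximal operator is still cheap to evaluate. For the ℓ2 group norm it is a closed-form block soft-threshold; a sketch assuming disjoint groups given as index arrays (groups, lams, and the step size t are hypothetical names):

```python
import numpy as np

def group_soft_threshold(x, t, groups, lams):
    """Prox of t * sum_g lam_g ||x_g||_2 for disjoint groups."""
    z = x.copy()
    for g, lam in zip(groups, lams):
        norm = np.linalg.norm(x[g])
        # Shrink the whole block; blocks with small norm become exactly zero.
        z[g] = 0.0 if norm <= t * lam else (1.0 - t * lam / norm) * x[g]
    return z
```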

[Figure: learned edge weight blocks; entire blocks are zeroed.]

Group ℓ1-regularization with the ℓ2 group norm.

Encourages group sparsity.

[Figure: learned edge weight blocks; weights within a block are tied to similar magnitudes.]

Group ℓ1-regularization with the ℓ∞ group norm.

Encourages group sparsity and parameter tying.

[Figure: learned edge weight blocks; surviving blocks are low-rank.]

Group ℓ1-regularization with the nuclear group norm.

Encourages group sparsity and low rank.


Structure Learning with Structured Sparsity

Do we have to use pairwise models?

We can use structured sparsity [Bach, 2008; Zhao et al., 2009] to learn sparse hierarchical models.

In this case we use overlapping group ℓ1-regularization:

\min_x \; f(x) + \sum_{A \subseteq \{1,\dots,p\}} \lambda_A \Bigl( \sum_{\{B \mid A \subseteq B\}} \|x_B\|_2^2 \Bigr)^{1/2}

Now, the non-smooth term is not even simple.

However, it is the sum of simple functions.
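As a concrete reading of the formula, here is a direct evaluation of the overlapping penalty; the encoding (blocks mapping each subset A, stored as a frozenset, to the indices of x_A, and lam holding the weights λ_A) is hypothetical:

```python
import numpy as np

def overlapping_group_penalty(x, blocks, lam):
    """Evaluate sum_A lam_A * sqrt(sum over {B | A subset of B} of ||x_B||_2^2)."""
    total = 0.0
    for A in blocks:
        # A <= B is the subset test for frozenset keys.
        inner = sum(np.sum(x[blocks[B]] ** 2) for B in blocks if A <= B)
        total += lam[A] * np.sqrt(inner)
    return total
```

Each term is an ordinary ℓ2 norm over the concatenation of the blocks containing A, so each term is simple; it is only the sum over overlapping terms whose prox has no closed form.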


Outline

1 Motivation and Overview

2 L-BFGS and Hessian-Free Newton
    Hessian-Free Newton Methods
    Limited-Memory Quasi-Newton Methods
    Scaling L-BFGS, Barzilai-Borwein Method, Hybrid Methods

3 Two-Metric (Sub-)Gradient Projection

4 Inexact Projected/Proximal Newton

5 Discussion


A Basic Newton-like Method

We first consider minimizing a twice-differentiable f(x).

Newton-like methods use a quadratic approximation of f(x):

Q_k(x, \alpha) \triangleq f(x^k) + (x - x^k)^T \nabla f(x^k) + \frac{1}{2\alpha} (x - x^k)^T H_k (x - x^k)

H_k is a positive-definite approximation of the Hessian.

The new iterate is set to the minimizer of the approximation

x^{k+1} \leftarrow x^k - \alpha d^k,

where d^k is the solution to

H_k d^k = \nabla f(x^k)

Guarantees descent for small enough α.
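In code, one iteration is just a linear solve plus a damped update. A dense-matrix sketch (grad and H stand in for the user's gradient vector and positive-definite Hessian approximation; the names are illustrative):

```python
import numpy as np

def newton_like_step(x, grad, H, alpha):
    """x <- x - alpha * d, where d solves H d = grad (the minimizer of Q)."""
    d = np.linalg.solve(H, grad)
    return x - alpha * d
```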


Gradient Method and Newton’s Method

[Figure: a one-dimensional f(x); the slide builds show the iterate x^k, the gradient step x^k − α∇f(x^k), the quadratic model Q(x, α), and its minimizer x^k − αd^k.]

Armijo Inexact Line-Search

We first try an initial step size α and then decrease it until we satisfy the Armijo sufficient decrease condition:

f(x^{k+1}) \le f(x^k) + \eta \, \nabla f(x^k)^T (x^{k+1} - x^k).

With Hermite interpolation and control of the spectrum of H_k, we typically accept the initial α or backtrack only once.
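A bare-bones version of the backtracking loop, with simple halving standing in for Hermite interpolation (eta and shrink are conventional defaults, not values prescribed by the slides):

```python
def armijo_backtrack(f, x, fx, g, d, alpha=1.0, eta=1e-4, shrink=0.5):
    """Shrink alpha until f(x - alpha*d) <= f(x) - eta * alpha * g.dot(d).

    With x_{k+1} = x_k - alpha*d, the condition above becomes this test;
    assumes d is a descent direction for the step -d (g.dot(d) > 0).
    """
    while f(x - alpha * d) > fx - eta * alpha * g.dot(d):
        alpha *= shrink
    return alpha
```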


Rate of Convergence

Under suitable smoothness and convexity assumptions, the method achieves a quadratic convergence rate:

It requires O(log log 1/ε) iterations for ε-accuracy.

Any algorithm of the form

x^{k+1} \leftarrow x^k - B_k \nabla f(x^k)

has a superlinear local convergence rate around a strict minimizer if and only if B_k eventually behaves like the inverse Hessian [Dennis & Moré, 1974].


Disadvantages of Pure Newton Method

For many problems, we can't afford to compute the Hessian, or even store an n-by-n matrix.

There are several limited-memory alternatives available:
1 Non-linear conjugate gradient.
2 Nesterov's optimal gradient method.
3 Diagonally-scaled steepest descent.
4 Non-monotonic Barzilai-Borwein method.
5 Hessian-free Newton methods.
6 Limited-memory quasi-Newton methods.

This talk mainly focuses on the last two.


HFN: Hessian-Free Newton Methods

We want to implement the step:

x^{k+1} \leftarrow x^k - \alpha d^k,

where d^k is the solution of the linear system

H_k d = \nabla f(x^k).

Hessian-free Newton (HFN): find d^k with an iterative solver.

We typically use conjugate gradient (but others are possible).

Conjugate gradient only requires Hessian-vector products H_k y.
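A textbook CG loop makes this explicit: the Hessian enters only through a product routine hvp(v) = H_k v, never as a matrix. A sketch without preconditioning or negative-curvature safeguards:

```python
import numpy as np

def cg_solve(hvp, g, tol, max_iter=100):
    """Approximately solve H d = g using only Hessian-vector products."""
    d = np.zeros_like(g)
    r = g.copy()              # residual of H d = g at d = 0
    p = r.copy()
    rs = r.dot(r)
    for _ in range(max_iter):
        Hp = hvp(p)
        step = rs / p.dot(Hp)
        d += step * p
        r -= step * Hp
        rs_new = r.dot(r)
        if np.sqrt(rs_new) <= tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

The tolerance argument is deliberately exposed; the inexact-Newton discussion below chooses it via a forcing sequence.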


Compute Hessian-vector products

We can compute Hessian-vector products without explicitly forming the Hessian.

Sometimes this is due to the structure of the Hessian (∇²f(x) = A^T D A for logistic regression).

Alternately, we can approximate the product numerically for one gradient evaluation:

H_k y \approx \frac{\nabla f(x^k + \mu y) - \nabla f(x^k)}{\mu}.

Under weak assumptions about f, we can approximate the Hessian-vector product without cancellation error using a very small complex µ [Squire and Trapp, 1998]:

\nabla f(x^k + i \mu y) = \nabla f(x^k) + i \mu H_k y + O(\mu^2).
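Taking imaginary parts in the identity above gives H_k y ≈ Im(∇f(x^k + iµy))/µ, with no subtraction and hence no cancellation. A sketch, under the assumption that the gradient code extends analytically to complex inputs (true for most numpy-style implementations):

```python
import numpy as np

def hvp_complex_step(grad_f, x, y, mu=1e-20):
    """Hessian-vector product H y = Im(grad_f(x + i*mu*y)) / mu."""
    return np.imag(grad_f(x + 1j * mu * y)) / mu
```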


Rate of Convergence of Inexact Newton Methods

There is no need to solve the system to full accuracy, leading to a residual:

H_k d = \nabla f(x^k) + r^k.

Dembo, Eisenstat, and Steihaug [1982] show fast convergence rates when the residuals are smaller than the gradient:

1 Linear convergence: ||r^k|| ≤ η_k ||∇f(x^k)|| with η_k < η < 1.
2 Superlinear convergence: lim_{k→∞} η_k = 0.
3 Quadratic convergence: η_k = O(||∇f(x^k)||).

For superlinear convergence, a typical forcing sequence is

\eta_k = \min\{0.5, \sqrt{\|\nabla f(x^k)\|}\}
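Putting the pieces together, a sketch of the truncated-Newton outer loop with this forcing sequence, reusing the cg_solve and armijo_backtrack sketches above (hvp_at is an assumed helper returning a Hessian-vector-product routine at the current iterate):

```python
import numpy as np

def inexact_newton(f, grad_f, hvp_at, x, max_iter=50, grad_tol=1e-8):
    """Inexact Newton: CG accuracy tied to eta_k = min(0.5, sqrt(||g||))."""
    for _ in range(max_iter):
        g = grad_f(x)
        gnorm = np.linalg.norm(g)
        if gnorm <= grad_tol:
            break
        eta = min(0.5, np.sqrt(gnorm))                # forcing sequence
        d = cg_solve(hvp_at(x), g, tol=eta * gnorm)   # enforces ||r_k|| <= eta_k ||g||
        alpha = armijo_backtrack(f, x, f(x), g, d)
        x = x - alpha * d
    return x
```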


Discussion of HFN Methods

Preconditioning often drastically reduces the number of Hessian-vector products.

The conjugate gradient algorithm can be modified to give a descent direction even if the Hessian is not positive-definite.

If the Hessian contains negative eigenvalues, the method may be able to find a direction of negative curvature.

For details, see Nocedal and Wright [2006, §7.1].


Hessian-Free Newton Methods vs. Quasi-Newton Methods

The main difference between HFN and quasi-Newton methods:

Hessian-free methods approximately invert the Hessian.

Quasi-Newton methods invert an approximate Hessian.


Broyden-Fletcher-Goldfarb-Shanno Quasi-Newton Method

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

They start with an initial approximation H_0 ≜ σI, and choose H_{k+1} to interpolate the gradient difference:

H_{k+1} s_k = y_k.

Since H_{k+1} is not unique, the BFGS method chooses the symmetric matrix whose difference with H_k is minimal:

H_{k+1} = H_k − (H_k s_k s_k^T H_k)/(s_k^T H_k s_k) + (y_k y_k^T)/(y_k^T s_k).
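A minimal NumPy sketch of this update (practical codes update the inverse or a factorization instead, but the formula is the same):

    import numpy as np

    def bfgs_update(H, s, y):
        # BFGS: H - (H s s^T H)/(s^T H s) + (y y^T)/(y^T s).
        # Assumes the curvature condition y^T s > 0 (e.g. enforced by a Wolfe line search).
        Hs = H @ s
        return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)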


Convergence and Limited-Memory BFGS (L-BFGS)

Update skipping/damping or a more sophisticated line search (Wolfe conditions) can keep H_{k+1} positive-definite.

The BFGS method has a superlinear convergence rate.

But it still stores and works with a dense H_k.

Instead of storing H_k, the limited-memory BFGS (L-BFGS) method stores the previous m differences s_k and y_k.

We can solve a linear system involving these updates applied to a diagonal H_0 in O(mn) [Nocedal, 1980].


L-BFGS Recursive Formula

Solving H_k d_k = g given H_0 and the m stored pairs {s_i, y_i} [Nocedal, 1980]:
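This is the standard two-loop recursion; a minimal NumPy sketch, which applies the inverse of the L-BFGS approximation to g in O(mn), with the initial inverse taken as H_0^{-1} = γI:

    import numpy as np

    def lbfgs_two_loop(g, s_list, y_list, gamma):
        # Compute d = H_k^{-1} g from the m stored (s_i, y_i) pairs, stored oldest first.
        q = g.copy()
        rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
        alphas = []
        for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):  # newest to oldest
            a = rho * (s @ q)
            alphas.append(a)
            q = q - a * y
        d = gamma * q                        # H_0^{-1} = gamma * I
        for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
            b = rho * (y @ d)
            d = d + (a - b) * s              # oldest to newest
        return d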


Scaling L-BFGS

The choice of H_0 on each iteration is crucial to the performance of L-BFGS methods.

A common choice is H_0 = α_bb^{-1} I [Shanno & Phua, 1978]:

α_bb = argmin_α ||s_k − α I y_k|| = (s_k^T y_k)/(y_k^T y_k).

The convergence theory is not as nice for L-BFGS, but in practice it often outperforms HFN and other competing approaches.
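In code this scaling is a one-line computation (NumPy sketch; s and y are the most recent difference vectors):

    alpha_bb = (s @ y) / (y @ y)   # so H_0 = (1 / alpha_bb) * I, i.e. H_0^{-1} = alpha_bb * I

This alpha_bb plays the role of gamma in the two-loop recursion sketched above.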


Barzilai-Borwein Method

We can also consider the approximate quasi-Newton method:

x_{k+1} = x_k − α_bb ∇f(x_k).

This is the Barzilai & Borwein [1988] method.

The step size is typically used with a non-monotone Armijo condition [Grippo et al., 1986, Raydan et al., 1997]:

f(x_{k+1}) ≤ max_{j∈{k−m,…,k}} {f(x_j)} + η ∇f(x_k)^T (x_{k+1} − x_k).

This simple method performs surprisingly well on a variety of problems [Raydan et al., 1997, Birgin et al., 2000, Dai & Fletcher, 2005, Figueiredo et al., 2007, van den Berg and Friedlander, 2008, Wright et al., 2010].
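A minimal sketch of the method (NumPy; f and f_grad are assumed callables, and the backtracking constants are illustrative, not canonical):

    import numpy as np

    def barzilai_borwein(f, f_grad, x, iters=100, m=10, eta=1e-4):
        # Gradient descent with the BB step size and a non-monotone Armijo test.
        g = f_grad(x)
        alpha, history = 1.0, [f(x)]
        x_old = g_old = None
        for _ in range(iters):
            if x_old is not None:
                s, y = x - x_old, g - g_old
                if s @ y > 0:                    # safeguard for non-convex f
                    alpha = (s @ y) / (y @ y)    # BB step size
            t = alpha
            f_ref = max(history[-m:])            # non-monotone reference value
            while f(x - t * g) > f_ref - eta * t * (g @ g):
                t *= 0.5                         # backtrack
            x_old, g_old = x, g
            x = x - t * g
            g = f_grad(x)
            history.append(f(x))
        return x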


Hybrid L-BFGS and Hessian-Free Methods

L-BFGS and HFN methods can be combined:

Use an L-BFGS approximation to precondition the conjugate gradient iterations [Morales & Nocedal, 2000].

Use conjugate gradient iterations to improve H_0 in the L-BFGS approximation [Morales & Nocedal, 2002].


Outline

1 Motivation and Overview

2 L-BFGS and Hessian-Free Newton

3 Two-Metric (Sub-)Gradient Projection
  Bound-Constrained Formulation
  Spectral Projected Gradient and Two-Metric Projection
  Two-Metric Sub-Gradient Projection

4 Inexact Projected/Proximal Newton

5 Discussion


Optimization with ℓ1-Regularization

We want to optimize a smooth function with ℓ1-regularization:

min_x f(x) = ℓ(x) + Σ_{i=1}^n λ_i |x_i|

The non-smooth regularizer breaks quasi-Newton and Hessian-free Newton methods.

But the regularizer is separable.

We consider two methods that take advantage of this:
1 Two-metric projection on an equivalent problem.
2 Two-metric sub-gradient projection applied directly.


Converting to a Bound-Constrained Problem

We can re-write the non-smooth objective

min_x ℓ(x) + Σ_{i=1}^n λ_i |x_i|,

as a smooth objective with non-negative constraints:

min_{x⁺, x⁻} ℓ(x⁺ − x⁻) + Σ_{i=1}^n λ_i (x⁺_i + x⁻_i), subject to x⁺ ≥ 0, x⁻ ≥ 0.

We can now use methods for bound-constrained optimization of smooth objectives.
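A minimal sketch of the splitting (NumPy; ell and ell_grad are hypothetical callables for the smooth loss and its gradient, and lam is the vector of per-coordinate weights λ_i):

    import numpy as np

    def split_objective(z, ell, lam):
        # Objective over z = [x_plus, x_minus] with z >= 0.
        n = z.size // 2
        xp, xm = z[:n], z[n:]
        return ell(xp - xm) + lam @ (xp + xm)

    def split_gradient(z, ell_grad, lam):
        # Gradient blocks: d/dx_plus = grad ell + lam, d/dx_minus = -grad ell + lam.
        n = z.size // 2
        g = ell_grad(z[:n] - z[n:])
        return np.concatenate([g + lam, -g + lam])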


Gradient Projection

A classic algorithm for bound-constrained problems is gradient projection:

x_{k+1} ← [x_k − α ∇f(x_k)]⁺,

where [·]⁺ denotes projection onto the feasible set (for non-negativity constraints, zeroing the negative entries).

The Armijo condition guarantees sufficient decrease and global convergence.

However, the convergence rate may be very slow.
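A minimal sketch of one step with Armijo backtracking along the projection arc (NumPy; f and f_grad assumed):

    import numpy as np

    def projected_gradient_step(x, f, f_grad, alpha0=1.0, eta=1e-4):
        # One gradient-projection step onto {x >= 0} with Armijo backtracking.
        g = f_grad(x)
        alpha = alpha0
        while True:
            x_new = np.maximum(x - alpha * g, 0.0)          # the [.]^+ projection
            if f(x_new) <= f(x) + eta * (g @ (x_new - x)):  # sufficient decrease
                return x_new
            alpha *= 0.5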


Gradient Projection for Non-Negative Constraints

[Figure: progressive illustration over (x1, x2) of gradient projection on f(x) with a non-negative feasible set: from the iterate x_k, the gradient step x_k − α∇f(x_k) leaves the feasible set, and projecting it gives the next iterate [x_k − α∇f(x_k)]⁺.]


Naive Two-Metric Projection

To speed convergence, we might consider projecting a Newton-like step:

x_{k+1} ← [x_k − α d_k]⁺,

where d_k is the solution of

H_k d_k = ∇f(x_k).

This is known as a two-metric projection algorithm (the step and the projection are measured in different norms).

This naive method does not work: it may not be possible to guarantee descent, even if H_k is positive-definite.


[Figure: the same illustration with the contours of the quadratic model Q(x, α): the Newton-like step x_k − α d_k follows the quadratic model, but its projection [x_k − α d_k]⁺ need not decrease f, unlike the projected gradient step [x_k − α∇f(x_k)]⁺.]


Diagonal-Scaling and Spectral Projected Gradient (SPG)

We can guarantee descent with further restrictions on H_k.

For example, we can make H_k diagonal:

x_{k+1} ← [x_k − α D_k ∇f(x_k)]⁺

In the spectral projected gradient (SPG) method, we use the Barzilai-Borwein step and the non-monotone Armijo condition:

x_{k+1} ← [x_k − α_bb ∇f(x_k)]⁺

[Birgin et al., 2000, Figueiredo et al., 2007]
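A minimal sketch of one SPG step for non-negativity constraints (NumPy; f assumed, and f_ref is the maximum of the last m objective values, as in the Barzilai-Borwein sketch above):

    import numpy as np

    def spg_step(x, g, alpha_bb, f, f_ref, eta=1e-4):
        # Project the BB step onto {x >= 0}, backtracking against the
        # non-monotone reference value f_ref.
        t = 1.0
        while True:
            x_new = np.maximum(x - t * alpha_bb * g, 0.0)
            if f(x_new) <= f_ref + eta * (g @ (x_new - x)):
                return x_new
            t *= 0.5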


Two-Metric Projection

Do we need H_k to be diagonal?

To guarantee descent, it is sufficient that H_k is diagonal with respect to a subset of the variables [Gafni & Bertsekas, 1984]:

A ≜ {i : x_i^k ≤ ε and ∇_i f(x_k) > 0}

Re-arranging variables, this leads to a scaling of the form

H_k = [ D_k   0
        0     H̄_k ]

We want H̄_k to approximate the sub-Hessian ∇²_F f(x_k) over the free variables F (the complement of A).


Two-Metric Projection

We can thus write the two-metric projection step as:

x_A^{k+1} ← [x_A^k − α D_k ∇_A f(x_k)]⁺

x_F^{k+1} ← [x_F^k − α d_k]⁺

where d_k is the solution of

H̄_k d_k = ∇_F f(x_k).

We can implement an HFN method by using conjugate gradient to solve this system (very fast if the solution is sparse).

We can implement an L-BFGS method by setting H̄_k to the L-BFGS approximation.
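A minimal sketch of one such step for the non-negative problem (NumPy; the dense solve stands in for the conjugate-gradient or L-BFGS solve, and eps, alpha, and the identity scaling of the active block are illustrative choices):

    import numpy as np

    def two_metric_step(x, g, H, alpha=1.0, eps=1e-8):
        # Active set A: variables at (or near) the bound whose gradient pushes them out.
        active = (x <= eps) & (g > 0)
        free = ~active
        x_new = x.copy()
        # Diagonal (here: identity) scaling on the active set.
        x_new[active] = np.maximum(x[active] - alpha * g[active], 0.0)
        # Newton-like step on the free variables using the sub-Hessian.
        d = np.linalg.solve(H[np.ix_(free, free)], g[free])
        x_new[free] = np.maximum(x[free] - alpha * d, 0.0)
        return x_new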


Discussion of Two-Metric Projection

If the algorithm identifies the optimal manifold, it is equivalent to the unconstrained method on the non-zero variables.

But should we convert to a bound-constrained problem in the first place?

The number of variables is doubled.
The transformed problem might be harder (the transformed problem is never strongly convex).

Can we apply tricks from bound-constrained optimization to directly solve ℓ1-regularization problems?


Non-Smooth Steepest Descent

Recall the motivating problem:

min_x f(x) = ℓ(x) + Σ_{i=1}^n λ_i |x_i|

If f(x) is convex, the objective has sub-gradients and directional derivatives everywhere.

We use z_k to denote the minimum-norm sub-gradient:

z_k = argmin_{z ∈ ∂f(x_k)} ||z||

The direction that minimizes the directional derivative is −z_k.

This is the steepest descent direction for non-smooth convex optimization problems.


Non-Smooth Steepest Descent

For our problem:

min_x f(x) = ℓ(x) + Σ_{i=1}^n λ_i |x_i|,

we can compute the minimum-norm sub-gradient coordinate-wise because the ℓ1-norm is separable:

z_i^k ≜ { ∇_i ℓ(x) + λ_i sign(x_i),          if |x_i| > 0
        { ∇_i ℓ(x) − λ_i sign(∇_i ℓ(x)),     if x_i = 0 and |∇_i ℓ(x)| > λ_i
        { 0,                                  if x_i = 0 and |∇_i ℓ(x)| ≤ λ_i

This gives the steepest descent direction, −z_k, for ℓ1-regularization problems.
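A minimal sketch of this coordinate-wise computation (NumPy; ell_grad is an assumed callable for ∇ℓ, and lam is the vector of per-coordinate weights λ_i):

    import numpy as np

    def min_norm_subgradient(x, ell_grad, lam):
        # Minimum-norm sub-gradient of ell(x) + sum_i lam_i |x_i|.
        g = ell_grad(x)
        z = g + lam * np.sign(x)             # case |x_i| > 0 (np.sign(0) = 0)
        zero = (x == 0.0)
        shrunk = g - lam * np.sign(g)        # case x_i = 0, |grad_i| > lam_i
        z[zero] = np.where(np.abs(g[zero]) > lam[zero], shrunk[zero], 0.0)
        return z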


Scaled Non-Smooth Steepest Descent

We can consider a non-smooth steepest descent step:

$$x^{k+1} \leftarrow x^k - \alpha z^k.$$

We can use $z^k$ in the Armijo condition to guarantee a sufficient decrease.

We can even try a Newton-like version:

$$x^{k+1} \leftarrow x^k - \alpha d^k,$$

where $d^k$ solves $H^k d^k = z^k$.

However, there are two problems with this step:
1 The iterates are unlikely to be sparse.
2 It doesn't guarantee descent, even if $H^k$ is positive-definite.

To get sparse iterates, many authors use orthant projection:

$$x^{k+1} \leftarrow P_O[x^k - \alpha d^k, \, x^k],$$

where [Osborne et al., 2000, Andrew & Gao, 2007]

$$P_O(y, x)_i \triangleq \begin{cases} 0 & \text{if } x_i y_i < 0 \\ y_i & \text{otherwise} \end{cases}$$

This sets variables that change sign to exactly zero:
1 It is effective at sparsifying the solution.
2 It restricts the quadratic approximation to a valid region.
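
The projection itself is a one-liner; a minimal vectorized sketch (our own naming):

```python
import numpy as np

def orthant_project(y, x):
    """P_O(y, x): zero out any coordinate of y whose sign differs from x."""
    return np.where(x * y < 0, 0.0, y)
```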

[Figure: a 2-D objective $f(x)$ over $(x_1, x_2)$: starting from $x^k$, the quadratic model $Q(x, \alpha)$ is minimized to give the step $x^k - \alpha d^k$, which leaves the orthant of $x^k$; the orthant projection $P_O[x^k - \alpha d^k, x^k]$ maps it back onto that orthant's boundary.]

Two-Metric Sub-Gradient Projection

There are several ways to guarantee descent.

We could use a diagonal scaling $D^k$:

$$x^{k+1} \leftarrow P_O[x^k - \alpha D^k z^k, \, x^k].$$

We could use the Barzilai-Borwein step with non-monotonic line searches:

$$x^{k+1} \leftarrow P_O[x^k - \alpha_{bb} z^k, \, x^k].$$

We could make the scaling diagonal with respect to an appropriate subset of the variables:

$$\mathcal{A} \triangleq \{ i \, : \, |x_i^k| \le \epsilon \}$$

The latter leads to a two-metric sub-gradient projection method:

$$x^{k+1}_{\mathcal{A}} \leftarrow P_O[x^k_{\mathcal{A}} - \alpha D^k z^k_{\mathcal{A}}, \, x^k_{\mathcal{A}}],$$
$$x^{k+1}_{\mathcal{F}} \leftarrow P_O[x^k_{\mathcal{F}} - \alpha d^k, \, x^k_{\mathcal{F}}],$$

where $d^k$ solves $H^k d^k = \nabla_{\mathcal{F}} f(x^k)$.

We can derive HFN and L-BFGS methods as before.

One choice of $D^k$ might be the Barzilai-Borwein step $\alpha_{bb} I$.
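
Putting the pieces together, here is a minimal sketch of one two-metric step, reusing min_norm_subgradient and orthant_project from the sketches above; the dense Hessian solve is our own simplification standing in for the HFN/L-BFGS machinery:

```python
import numpy as np

def two_metric_step(x, grad_l, lam, H, alpha, alpha_bb, eps=1e-4):
    """One two-metric sub-gradient projection step for l(x) + sum_i lam_i*|x_i|."""
    z = min_norm_subgradient(x, grad_l, lam)
    active = np.abs(x) <= eps      # variables at (or near) zero
    free = ~active
    x_new = x.copy()
    # Diagonal (here: Barzilai-Borwein) scaling on the active set.
    x_new[active] = x[active] - alpha_bb * z[active]
    # Newton-like step on the free set, where the objective is smooth.
    d = np.linalg.solve(H[np.ix_(free, free)], z[free])
    x_new[free] = x[free] - alpha * d
    # Orthant projection sets sign-changing variables to exactly zero.
    return orthant_project(x_new, x)
```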

Advantages

The method has several appealing properties:

No variable doubling.

No loss of strong convexity.

Gives sparse iterates.

Allows warm-starting.

Many variables can be set to zero at once.

Many variables can move away from zero at once.

If it identifies the optimal sparsity pattern, it is equivalent to Newton's method on the non-zero variables.

Practical Issues

For very large-scale problems with very sparse solutions, practical methods seem to need to consider two more issues:

Continuation: Start with a large value of $\lambda$ and progressively decrease it (a minimal sketch follows below).

Sub-Optimization: Ignore variables that are unlikely to move away from zero (temporarily or permanently).
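
A minimal sketch of the continuation idea; the solver interface `solve(x, lam)` is hypothetical and stands in for any of the methods above:

```python
import numpy as np

def continuation_solve(x0, lam_max, lam_target, solve, n_stages=5):
    """Solve a sequence of problems with decreasing regularization,
    warm-starting each stage from the previous solution."""
    x = x0
    for lam in np.geomspace(lam_max, lam_target, n_stages):
        x = solve(x, lam)  # warm start from the previous solution
    return x
```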

Extensions

The algorithm extends to problems of the form

$$\min_{l \le x \le u} \; \ell(x) + r(x),$$

where $r(x)$ is separable and differentiable almost everywhere.

Two-metric projection algorithms also exist for other constraints [Gafni & Bertsekas, 1984].

Discussion

There are several closely related methods for using curvature in $\ell_1$-regularization algorithms, including:

Perkins et al. [2003].
Andrew & Gao [2007].
Shi et al. [2007].
Kim & Park [2010].

While descent is guaranteed, convergence theory is not fully developed:

Global convergence.
Active set identification.

Outline

1 Motivation and Overview

2 L-BFGS and Hessian-Free Newton

3 Two-Metric (Sub-)Gradient Projection

4 Inexact Projected/Proximal Newton
    Group $\ell_1$-Regularization
    Inexact Projected Newton
    Inexact Proximal Newton

5 Discussion

Problems with non-separable case

We now turn to the group $\ell_1$-regularization problem:

$$\min_x \; f(x) = \ell(x) + \sum_g \lambda_g \|x_g\|_2$$

The regularizer is not separable.

But the regularizer is simple.

We consider two methods that take advantage of this:

Inexact projected Newton on an equivalent problem.
Inexact proximal Newton applied directly.

Converting to a Constrained Problem

We can introduce extra variables $t$ to convert the problem into a smooth optimization with cone constraints:

$$\min_{x,t} \; \ell(x) + \sum_g \lambda_g t_g, \quad \text{subject to } \|x_g\|_2 \le t_g, \ \forall g.$$

Alternately, we can optimize over the norm ball:

$$\min_x \; \ell(x), \quad \text{subject to } \sum_g \lambda_g \|x_g\|_2 \le \tau$$

In both cases the constraints are simple; we can compute the projection in linear time [Boyd & Vandenberghe, 2004, Exercise 8.3(c), van den Berg et al., 2008].
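
To make the norm-ball case concrete, here is a sketch in the spirit of van den Berg et al. [2008], restricted to unit weights ($\lambda_g = 1$): project the vector of group norms onto the $\ell_1$ ball, then rescale each group. The code and function names are our own simplifications, not the paper's implementation:

```python
import numpy as np

def project_l1_ball(v, tau):
    """Euclidean projection of v onto {w : ||w||_1 <= tau} (Duchi et al. style)."""
    if np.abs(v).sum() <= tau:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]           # sorted magnitudes, descending
    cumsum = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > cumsum - tau)[0][-1]
    theta = (cumsum[rho] - tau) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def project_group_ball(x, groups, tau):
    """Projection onto {x : sum_g ||x_g||_2 <= tau}, unit weights.

    groups: list of index arrays partitioning the coordinates.
    """
    norms = np.array([np.linalg.norm(x[g]) for g in groups])
    new_norms = project_l1_ball(norms, tau)  # shrink the group norms
    y = np.zeros_like(x)
    for g, n, m in zip(groups, norms, new_norms):
        if n > 0:
            y[g] = x[g] * (m / n)            # keep each group's direction
    return y
```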

Projected Gradient

Recall the basic projected gradient step:

$$x^{k+1} \leftarrow \operatorname*{argmin}_{x \in \mathcal{C}} \; \|x - (x^k - \alpha \nabla f(x^k))\|_2^2$$
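
In code, one projected gradient step is simply a projection composed with a gradient step; a sketch, where `project` is any Euclidean projection onto $\mathcal{C}$ (such as the group-norm-ball projection above):

```python
def projected_gradient_step(x, grad_f, project, alpha):
    """x_{k+1} = P_C[x_k - alpha * grad f(x_k)]."""
    return project(x - alpha * grad_f(x))
```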

[Figure: projected gradient on a 2-D objective $f(x)$: the gradient step $x^k - \alpha \nabla f(x^k)$ leaves the feasible set and is mapped back onto it by the projection $P_{\mathcal{C}}[x^k - \alpha \nabla f(x^k)]$.]
Projected Gradient and Projected Newton

To speed up convergence, we might consider a scaled step:

$$x^{k+1} \leftarrow \operatorname*{argmin}_{x \in \mathcal{C}} \; \|x - (x^k - \alpha d^k)\|_2^2,$$

where $d^k$ solves $H^k d^k = \nabla f(x^k)$.

As we saw, in general this does not work.

To guarantee descent, projected Newton methods project under a quadratic norm:

$$x^{k+1} \leftarrow \operatorname*{argmin}_{x \in \mathcal{C}} \; \|x - (x^k - \alpha d^k)\|_{H^k}^2,$$

where $\|y\|_{H^k} = \sqrt{y^T H^k y}$ [Levitin & Polyak, 1966].

Projected Newton

Projecting under the $H^k$ norm is equivalent to minimizing the quadratic approximation over the convex set:

$$x^{k+1} \leftarrow \operatorname*{argmin}_{x \in \mathcal{C}} \; Q^k(x, \alpha)$$

where

$$Q^k(x, \alpha) = f(x^k) + (x - x^k)^T \nabla f(x^k) + \frac{1}{2\alpha} (x - x^k)^T H^k (x - x^k)$$

In general, this projection will be expensive even if the constraints are simple.

It may be inexpensive if $H^k$ is diagonal.

Inexact Projected Newton

If we want to use a non-diagonal $H^k$, we can consider an inexact projected Newton method.

Analogous to the unconstrained HFN methods, we compute the step using a constrained iterative solver.

For example, we can minimize $Q^k(y, \alpha)$ using SPG iterations (indexed by $j$ to distinguish them from the outer iterations):

$$y^{j+1} \leftarrow \operatorname*{argmin}_{y \in \mathcal{C}} \; \|y - (y^j - \alpha_{bb} \nabla_y Q^k(y^j, \alpha))\|_2^2.$$

These iterations are dominated by the cost of Euclidean projections and Hessian-vector products.

[Figure: inexact projected Newton on a 2-D objective $f(x)$: the quadratic model $Q(x, \alpha)$ around $x^k$ is approximately minimized over the feasible set by SPG iterations, starting from $P_{\mathcal{C}}[x^k - \alpha \nabla f(x^k)]$ and yielding the direction $d^k$ toward $\min_{x \in \mathcal{C}} Q(x, \alpha)$.]

Inexact Projected Newton

Can we terminate this early?

If we set $y^0 = x^k$, then $f(y^j) < f(x^k)$ for $j \ge 1$ (for $\alpha$ small).

Alternately, we can set $\alpha = 1$ and show that $d^k = y^j - x^k$ is a feasible descent direction for $j \ge 1$.

In Schmidt et al. [2009], we used an L-BFGS approximation in an inexact projected Newton method:

The (approximate) Hessian-vector products take $O(mn)$.

The SPG iterations use projections but not the objective.

This is efficient for optimizing costly functions with simple constraints (a sketch of the inner loop follows below).
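
A minimal sketch of the inner loop that produces $d^k$; `hess_vec` supplies (approximate) Hessian-vector products, and the fixed inner step stands in for the Barzilai-Borwein step and non-monotonic line search a real implementation would use:

```python
def inexact_projected_newton_direction(x, grad, hess_vec, project,
                                       step=1.0, inner_iters=10):
    """Approximately minimize the quadratic model over C with SPG-style
    iterations, using only projections and Hessian-vector products."""
    y = project(x - step * grad)           # first inner iterate y^1
    for _ in range(inner_iters - 1):
        grad_Q = grad + hess_vec(y - x)    # gradient of the quadratic model
        y = project(y - step * grad_Q)
    return y - x                           # feasible descent direction d^k
```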

Proximal Operators

Can we avoid introducing constraints and directly solve the original non-smooth group $\ell_1$-regularization problem?

We can generalize projections to proximal operators:

$$\operatorname{prox}_r(x^k) = \operatorname*{argmin}_x \; \frac{1}{2} \|x - x^k\|_2^2 + r(x).$$

The group $\ell_1$-regularizer is simple; we can efficiently compute the proximal operator in linear time [Wright et al., 2009]:

$$\operatorname{prox}_{\ell_{1,2}}(x^k)_g = \operatorname*{argmin}_{x_g} \; \frac{1}{2} \|x_g - x^k_g\|_2^2 + \lambda_g \|x_g\|_2 = \operatorname{sgn}(x^k_g) \max\{0, \|x^k_g\|_2 - \lambda_g\}$$
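
A sketch of this group soft-thresholding operator, reading $\operatorname{sgn}(x_g)$ for a vector block as $x_g / \|x_g\|_2$ (function name and interface are our own):

```python
import numpy as np

def prox_group_l1(x, groups, lams):
    """Proximal operator of sum_g lam_g * ||x_g||_2 (block soft-threshold)."""
    y = x.copy()
    for g, lam in zip(groups, lams):
        norm = np.linalg.norm(x[g])
        # Shrink the group's norm by lam; zero the group if it falls below.
        y[g] = 0.0 if norm <= lam else x[g] * (1.0 - lam / norm)
    return y
```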

Proximal Gradient and Proximal Newton

The basic proximal gradient step:

$$x^{k+1} \leftarrow \operatorname*{argmin}_x \; \frac{1}{2} \|x - (x^k - \alpha \nabla f(x^k))\|_2^2 + r(x)$$

To speed up convergence, we might consider a scaled step:

$$x^{k+1} \leftarrow \operatorname*{argmin}_x \; \frac{1}{2} \|x - (x^k - \alpha d^k)\|_2^2 + r(x),$$

where $d^k$ solves $H^k d^k = \nabla f(x^k)$.

But to ensure descent, we need to match the norms:

$$x^{k+1} \leftarrow \operatorname*{argmin}_x \; \frac{1}{2} \|x - (x^k - \alpha d^k)\|_{H^k}^2 + r(x)$$

Inexact Proximal Newton

The proximal step under the $H^k$ norm is equivalent to minimizing the quadratic approximation plus $\alpha r(x)$:

$$x^{k+1} \leftarrow \operatorname*{argmin}_x \; Q^k(x, \alpha) + \alpha r(x)$$

This problem will be expensive even if $r(x)$ is simple.

We could use a diagonal scaling or Barzilai-Borwein steps [Hofling & Tibshirani, 2009, Wright et al., 2009].

To use a non-diagonal scaling, we can use an iterative solver (sketched below):

It uses Euclidean proximal operators and Hessian-vector products.
It guarantees descent after the first iteration.
With an L-BFGS approximation, it is suitable for optimizing costly objectives with simple regularizers.
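
A minimal sketch of such an inner solver: proximal-gradient iterations on the quadratic model. The `prox(v, t)` interface (computing $\operatorname{prox}_{t\,r}(v)$) and the fixed inner step are our own simplifications:

```python
def inexact_proximal_newton_direction(x, grad, hess_vec, prox,
                                      t=1.0, inner_iters=10):
    """Approximately minimize Q(y) + r(y) with proximal-gradient iterations,
    using only Hessian-vector products and a Euclidean proximal operator."""
    y = prox(x - t * grad, t)              # first inner iterate
    for _ in range(inner_iters - 1):
        grad_Q = grad + hess_vec(y - x)    # gradient of the quadratic model
        y = prox(y - t * grad_Q, t)
    return y - x                           # the direction d^k
```

For group $\ell_1$, `prox` could wrap the prox_group_l1 sketch above with the weights $\lambda_g$ scaled by $t$.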

Discussion

Easily extends to other group norms:

$\ell_\infty$ norm: the projection/proximal operators can be computed in $O(n \log n)$ / $O(n)$ [Duchi et al., 2008, Quattoni et al., 2009, Wright et al., 2009].

Nuclear norm: the projection/proximal operators can be computed in $O(n^{3/2})$ [Cai et al., 2010].

Convergence theory of inexact projected/proximal Newton is not fully developed:

Proof of global convergence.
Accuracy needed for convergence rates.

Outline

1 Motivation and Overview

2 L-BFGS and Hessian-Free Newton

3 Two-Metric (Sub-)Gradient Projection

4 Inexact Projected/Proximal Newton

5 Discussion
    Sums of Simple Regularizers
    Stochastic Objective
    Summary

Sums of Simple Regularizers

Recall the overlapping group $\ell_1$-regularization problem:

$$\min_x \; f(x) + \sum_{A \subseteq \{1,\dots,p\}} \lambda_A \Big( \sum_{\{B \,\mid\, A \subseteq B\}} \|x_B\|_2^2 \Big)^{1/2}$$

This regularizer is not simple.

But, it is the sum of simple regularizers.

After converting to a constrained problem, we can use Dykstra's [1983] algorithm to compute the projection.

Bauschke & Combettes [2008] generalize Dykstra's algorithm to compute proximal operators for sums of simple functions.
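
A two-function sketch of that generalization, with a fixed iteration count and no convergence checks (names are our own):

```python
import numpy as np

def dykstra_prox(z, prox1, prox2, iters=50):
    """Dykstra-like splitting for prox_{r1+r2}(z), given prox_{r1} and prox_{r2}."""
    x = z.copy()
    p = np.zeros_like(z)   # correction term for prox1
    q = np.zeros_like(z)   # correction term for prox2
    for _ in range(iters):
        y = prox1(x + p)
        p = x + p - y
        x = prox2(y + q)
        q = y + q - x
    return x
```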



Other Non-Smooth Newton-like Methods

There is also recent work on other methods for computing the proximal operator for overlapping group `1-regularization [Jenatton et al., 2010, Kim & Xing, 2010, Liu & Ye, 2010, Mairal et al., 2010].

Several other Newton-like methods for general non-smooth optimization are available:

Incorporating the sub-differential into the quadratic approximation [Yu et al., 2008].
Applying the basic method to a smoothed version of the problem [Chen et al., 2010]; a sketch of this smoothing idea follows the list.
Augmented Lagrangian and dual augmented Lagrangian methods [Tomioka et al., 2011].
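To make the smoothing approach concrete, here is a minimal sketch (my illustration under stated assumptions, not the specific method of Chen et al.): the `1 penalty is replaced by a pseudo-Huber surrogate so that an off-the-shelf smooth solver such as L-BFGS applies directly. The toy data, the lasso-style objective, and the parameter mu are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_l1(x, mu):
    """Pseudo-Huber surrogate for ||x||_1; smooth, tends to ||x||_1 as mu -> 0."""
    s = np.sqrt(x ** 2 + mu ** 2)
    return np.sum(s - mu), x / s               # value, gradient

def objective(x, A, b, lam, mu):
    """Smoothed lasso objective 0.5*||Ax - b||^2 + lam * smoothed_l1(x)."""
    r = A @ x - b
    f1, g1 = smoothed_l1(x, mu)
    return 0.5 * r @ r + lam * f1, A.T @ r + lam * g1

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
res = minimize(objective, np.zeros(5), args=(A, b, 0.1, 1e-4),
               jac=True, method="L-BFGS-B")    # fun returns (value, gradient)
```

Driving mu toward zero recovers the original non-smooth problem, at the cost of increasingly ill-conditioned smooth subproblems.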


Inexact Objective Information

In some scenarios we may have a stochastic objective:

We may be using a Monte Carlo approximation.

We may be using a mini-batch.

There is work on stochastic variants:

Limited-memory quasi-Newton [Sunehag et al., 2009].

Hessian-free Newton [Martens, 2010].

Optimal Barzilai-Borwein [Swersky, unpublished].

However, they don't share the fast convergence rates of deterministic variants.
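For concreteness, a minimal sketch of the mini-batch setting mentioned above (an illustration with an assumed logistic model, not code from the slides): each call returns an unbiased but noisy estimate of the full gradient, and it is this noise that keeps the stochastic variants from inheriting the deterministic convergence rates.

```python
import numpy as np

def minibatch_grad(w, X, y, batch_size, rng):
    """Unbiased mini-batch estimate of the mean logistic-loss gradient."""
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ w))     # predicted probabilities
    return Xb.T @ (p - yb) / batch_size   # gradient averaged over the batch

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = (rng.random(1000) < 0.5).astype(float)
g = minibatch_grad(np.zeros(10), X, y, batch_size=64, rng=rng)
```

Averaging over a larger batch reduces the variance of the estimate, trading per-iteration cost for gradient accuracy.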


Co-Authors

Thanks to my great co-authors:

Ewout van den Berg

Michael Friedlander

Glenn Fung

Dongmin Kim

Kevin Murphy

Romer Rosales

Suvrit Sra


Summary

Hessian-Free Newton and limited-memory BFGS methods are two workhorses of unconstrained optimization.

Two-metric (sub-)gradient projection methods let us apply these methods to problems with bound constraints or non-differentiable but separable regularizers.

Inexact projected/proximal Newton methods are an appealing approach to optimizing costly objective functions with simple constraints or simple non-differentiable regularizers.
