

Gradient Based Optimization Algorithms

Marc Teboulle

School of Mathematical Sciences, Tel Aviv University

Based on joint works with

Amir Beck (Technion), Jérôme Bolte (Toulouse I), Ronny Luss (IBM-New York), Shoham Sabach (Göttingen), Ron Shefi (Tel Aviv)

EPFL - Lausanne, February 4, 2014

Opening Remark and Credit

More than 380 years ago... In 1629, Fermat suggested the following:

Given $f$, solve for $x$:

$$\left.\frac{f(x+d) - f(x)}{d}\right|_{d=0} = 0$$

...We can hardly expect to find a more general method to get the maximum or minimum points on a curve...

Pierre de Fermat

First-Order/Gradient Methods

First-order methods are iterative methods that only exploit information on the objective function and its gradient (subgradient).

A main drawback: they can be very slow at producing high-accuracy solutions...

But... they also share many advantages:
- Require minimal information, e.g., $(f, f')$.
- Often lead to very simple and "cheap" iterative schemes.
- Suitable for large-scale problems when high accuracy is not crucial. [In many large-scale applications, the data is anyway corrupted or known only roughly.]

Polynomial versus Gradient Methods

Convex problems are polynomially solvable within $\varepsilon$ accuracy:

Running time: Poly(problem's size, # of accuracy digits).

Theoretically: this means that large-scale problems can be solved to high accuracy with polynomial methods, such as IPM.
Practically: the running time is dimension-dependent and grows nonlinearly with the problem's dimension. For IPM, which are Newton-type methods: $\sim O(n^3)$.

Thus, a "single iteration" of IPM can last forever!

Gradient-Based Algorithms

Widely used in applications....

- Clustering Analysis: the k-means algorithm
- Neuro-computing: the backpropagation algorithm
- Statistical Estimation: the EM (Expectation-Maximization) algorithm
- Machine Learning: SVM, regularized regression, etc.
- Signal and Image Processing: sparse recovery, denoising and deblurring schemes, total variation minimization...
- Matrix minimization problems
- ...and much more...

Goals and Outline

Building simple and efficient First-Order Methods (FOM), exploiting problem structure/data information, and analyzing their complexity and rate of convergence.

Outline

- Problem Formulations: convex and nonconvex models
- Algorithms: derivation and analysis
- Applications: linear inverse problems, image processing, dimension reduction [PCA, NMF]

Algorithms - Historical Development

Old schemes for general problems "resuscitated" and "improved", by exploiting special structures and data info.

- Fixed point methods [Babylonian time!/Heron for square root, Picard, Banach, Weiszfeld '34]
- Coordinate descent/Alternating Minimization [Gauss-Seidel 1798]
- Gradient/subgradient methods [Cauchy 1846, Shor '61, Rosen '63, Frank-Wolfe '56, Polyak '62]
- Stochastic Gradients [Robbins and Monro '51]
- Proximal Algorithms [Martinet '70, Rockafellar '76, Passty '79]
- Penalty/Barrier methods [Courant '49, Fiacco-McCormick '66]
- Augmented Lagrangians and Splitting [Arrow-Hurwicz '58, Hestenes-Powell '69, Goldstein-Treyakov '72, Rockafellar '74, Mercier-Lions '79, Fortin-Glowinski '76, Bertsekas '82]
- Extragradient methods for VI [Korpelevich '76, Konnov '80]
- Optimal/Fast Gradient Schemes [Nemirovski-Yudin '81, Nesterov '83]

Gradient-Based Methods for Convex Composite Models: Smooth + Nonsmooth

A Useful Convex Optimization Model

$$(M)\qquad \min\ \{F(x) = f(x) + g(x) : x \in \mathbb{E}\}$$

- $\mathbb{E}$ is a finite-dimensional Euclidean space.
- $f : \mathbb{E} \to \mathbb{R}$ is convex, smooth of type $C^{1,1}$, i.e.,
  $$\exists\, L(f) > 0 : \quad \|\nabla f(x) - \nabla f(y)\| \le L(f)\,\|x - y\|, \quad \forall x, y.$$
- $g : \mathbb{E} \to (-\infty, \infty]$ is convex, nonsmooth, extended valued (allowing constraints).
- We assume that (M) is solvable, i.e.,
  $$X^* := \operatorname{argmin} F \neq \emptyset, \quad \text{and for } x^* \in X^*, \text{ set } F^* := F(x^*).$$

The model (M) already carries structural information. It is rich enough to recover various classes of smooth/nonsmooth convex minimization problems.

Special Cases of the General Model

- $g = 0$: smooth unconstrained convex minimization,
  $$\min_{x}\ f(x).$$
- $g = \delta(\cdot\,|\,C)$: constrained smooth convex minimization,
  $$\min_{x}\ \{f(x) : x \in C\}.$$
- $f = 0$: general convex optimization,
  $$\min_{x}\ g(x).$$

Anecdote: The World’s Simplest Impossible problem

[From C. Moler (1990)]

Problem: The average of two numbers is 3. What are the numbers?

Typical answers: (2,4), (1,5), (-3,9)... These already ask for "structure": equal distance from the average... integer numbers...
- Why not (2.71828, 3.28172)!?
- A nice one: (3,3)... it has "minimal norm" and it is unique!
- Simplest: (6,0) or (0,6)?... A sparse one! ...but here uniqueness is lost.

This simple problem captures the essence of many ill-posed/underdetermined problems in applications.

Additional requirements/constraints have to be specified to make it a reasonable mathematical/computational task, and often lead to convex optimization models.

Linear Inverse Problems

Problem: Find $x \in C \subset \mathbb{E}$ which "best" solves $A(x) \approx b$, $A : \mathbb{E} \to \mathbb{F}$, where $b$ (observable output) and $A$ are known.

Approach: via Regularization Models
- $g(x)$ is a "regularizer" (one function, or a sum of functions)
- $d(b, A(x))$ is some "proximity" measure from $b$ to $A(x)$

$$\min\ \{g(x) : A(x) = b,\ x \in C\}$$
$$\min\ \{g(x) : d(b, A(x)) \le \epsilon,\ x \in C\}$$
$$\min\ \{d(b, A(x)) : g(x) \le \delta,\ x \in C\}$$
$$\min\ \{d(b, A(x)) + \lambda g(x) : x \in C\} \quad (\lambda > 0)$$

- Intensive research activity over the last 50 years... and now even more, with sparse optimization problems.

- The choices of $g(\cdot)$ and $d(\cdot,\cdot)$ depend on the application at hand. Nonsmooth regularizers are particularly useful.

Image Processing

- $g = \lambda\|L(\cdot)\|_1$: $l_1$-regularized convex problem,
  $$\min_x\ \{f(x) + \lambda\|Lx\|_1\}; \qquad f(x) := d(b, Ax),$$
  with $L$ the identity, a differential operator, or a wavelet transform.

- $g = \mathrm{TV}(\cdot)$: TV-based regularization. Grid points $(i, j)$ should have the same intensity as their neighbors [ROF 1992]:
  $$\min_x\ \{f(x) + \lambda\,\mathrm{TV}(x)\}.$$
  1-dim: $\mathrm{TV}(x) = \sum_i |x_i - x_{i+1}|$
  2-dim, isotropic: $\mathrm{TV}(x) = \sum_i \sum_j \sqrt{(x_{i,j} - x_{i+1,j})^2 + (x_{i,j} - x_{i,j+1})^2}$
  2-dim, anisotropic: $\mathrm{TV}(x) = \sum_i \sum_j \big(|x_{i,j} - x_{i+1,j}| + |x_{i,j} - x_{i,j+1}|\big)$
  More on this soon...
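The discrete TV formulas above are easy to compute directly; the following is a minimal NumPy sketch (written for this note, not taken from the talk), where the convention of simply dropping the last forward differences at the boundary is an assumption.

```python
import numpy as np

def tv_anisotropic(x):
    """Anisotropic discrete TV: sum of absolute forward differences (boundary terms dropped)."""
    dv = np.abs(x[:-1, :] - x[1:, :])   # |x[i, j] - x[i+1, j]|
    dh = np.abs(x[:, :-1] - x[:, 1:])   # |x[i, j] - x[i, j+1]|
    return dv.sum() + dh.sum()

def tv_isotropic(x):
    """Isotropic discrete TV: sum of Euclidean norms of the discrete gradient."""
    dv = x[:-1, :-1] - x[1:, :-1]       # x[i, j] - x[i+1, j]
    dh = x[:-1, :-1] - x[:-1, 1:]       # x[i, j] - x[i, j+1]
    return np.sqrt(dv**2 + dh**2).sum()

img = np.random.default_rng(0).random((16, 16))
print(tv_anisotropic(img), tv_isotropic(img))
```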

Building Gradient-Based Schemes: Basic Old Idea

Pick an adequate approximate model:

1. Linearize + regularize: given some $y$, approximate $f(x) + g(x)$ via
   $$q(x, y) = f(y) + \langle x - y, \nabla f(y)\rangle + \frac{1}{2t}\|x - y\|^2 + g(x), \quad (t > 0),$$
   that is, leaving the nonsmooth part $g(\cdot)$ untouched.

2. Linearize only + use info on $C$: e.g., $C$ compact, $g := \delta_C$,
   $$q(x, y) = f(y) + \langle x - y, \nabla f(y)\rangle.$$

Solve, "somehow", the resulting approximate model:
$$x^{k+1} = \operatorname{argmin}_x\ q(x, x^k), \quad k = 0, 1, \ldots$$

Similarly for subgradient schemes, e.g., using some $g'(y) \in \partial g(y)$.

Example: The Proximal Gradient Method

The derivation of the proximal gradient method is similar to that of the (sub)gradient projection method.

For any $L \ge L_f$ and a given iterate $x^k$:
$$Q_L(x, x^k) := f(x^k) + \langle x - x^k, \nabla f(x^k)\rangle + \frac{L}{2}\|x - x^k\|^2 + \underbrace{g(x)}_{\text{untouched}}$$

Algorithm:
$$x^{k+1} := \operatorname{argmin}_x\ Q_L(x, x^k) = \operatorname{argmin}_x \left\{ g(x) + \frac{L}{2}\left\|x - \Big(x^k - \frac{1}{L}\nabla f(x^k)\Big)\right\|^2 \right\} = \operatorname{prox}_{\frac{1}{L}g}\Big(x^k - \frac{1}{L}\nabla f(x^k)\Big),$$

where the prox operator is
$$\operatorname{prox}_h(z) := \operatorname{argmin}_u \left\{ h(u) + \frac{1}{2}\|u - z\|^2 \right\}.$$
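To make the scheme concrete, here is a minimal Python sketch of the proximal gradient iteration above. The specific instance used to exercise it (a least-squares $f$ with an $l_1$-norm $g$, i.e. a soft-threshold prox) is an illustrative assumption, not part of the slides.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, L, x0, iters=200):
    """x^{k+1} = prox_{g/L}( x^k - (1/L) * grad f(x^k) )."""
    x = x0.copy()
    for _ in range(iters):
        x = prox_g(x - grad_f(x) / L, 1.0 / L)
    return x

# Illustrative instance (an assumption): f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
lam = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)  # soft threshold
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
x_hat = proximal_gradient(grad_f, prox_g, L, np.zeros(100))
```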

The Proximal Gradient Method for (M)

The proximal gradient method with a constant stepsize rule.

Proximal Gradient Method (with Constant Stepsize)
Input: $L = L(f)$, a Lipschitz constant of $\nabla f$.
Step 0. Take $x^0 \in \mathbb{E}$.
Step k. ($k \ge 1$) Compute the prox of $g$ at a gradient step for $f$:
$$x^k = \operatorname{prox}_{\frac{1}{L}g}\Big(x^{k-1} - \frac{1}{L}\nabla f(x^{k-1})\Big).$$

The Lipschitz constant $L(f)$ is not always known or easily computable; this issue is resolved with an easy backtracking procedure.

1. Can we compute $\operatorname{prox}_{g/L}(\cdot)$ efficiently?
2. What is the global rate of convergence for PGM?

First, we illustrate some special cases of ProxGrad.

Special Cases of Proximal Gradient

The Prox-Grad method: $x^{k+1} = \operatorname{prox}_{\frac{1}{L}g}\big(x^k - \frac{1}{L}\nabla f(x^k)\big)$.

- $g \equiv 0$: the gradient method,
  $$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k).$$
- $g = \delta(\cdot\,|\,C)$: the gradient projection method,
  $$x^{k+1} = P_C\Big(x^k - \frac{1}{L}\nabla f(x^k)\Big).$$
- $f = 0$: the proximal minimization method,
  $$x^{k+1} = \operatorname{prox}_{\frac{1}{L}g}(x^k).$$

Some Calculus Rules for Computing $\operatorname{prox}_{tg}$

$$\operatorname{prox}_{tg}(x) = \operatorname{argmin}_u \left\{ g(u) + \frac{1}{2t}\|u - x\|^2 \right\}.$$

- $g(u) = \delta_C(u)$ (indicator): $\operatorname{prox}_{tg}(x) = \Pi_C(x)$.
- $g(u) = \sigma_C(u)$ (support function): $\operatorname{prox}_{tg}(x) = x - \Pi_C(x)$.
- $g(u) = d_C(u)$ (distance function): $\operatorname{prox}_{tg}(x) = x + \dfrac{t\,(\Pi_C(x) - x)}{d_C(x)}$ if $d_C(x) > t$, and $\Pi_C(x)$ otherwise.
- $g(u) = \|Au - b\|^2/2$, $A \in \mathbb{R}^{m\times n}$: $\operatorname{prox}_{tg}(x) = (I + tA^TA)^{-1}(x + tA^Tb)$.
- $g(u) = \|u\|_1$ (shrinkage): $\operatorname{prox}_{tg}(x)_j = \operatorname{sgn}(x_j)\max\{|x_j| - t, 0\}$.
- $g(u) = \|u\|$ (Euclidean norm): $\operatorname{prox}_{tg}(x) = \big(1 - t/\|x\|\big)_+\, x$.
- $g(U) = \|U\|_*$ (nuclear norm), $U \in \mathbb{R}^{m\times n}$, $m \ge n$: $\operatorname{prox}_{tg}(U) = P\operatorname{diag}(s)Q^T$, where $U = P\operatorname{diag}(\sigma)Q^T$ is a singular value decomposition, $\sigma_1(U) \ge \sigma_2(U) \ge \ldots$ are the singular values of $U$, $\|U\|_* = \sum_j \sigma_j(U)$, and the shrinkage is $s_j = \operatorname{sgn}(\sigma_j)\max\{|\sigma_j| - t, 0\}$.
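A minimal NumPy sketch of three of these rules (the $l_1$ soft threshold, the Euclidean norm, and the nuclear norm), following the convention $\operatorname{prox}_{tg}(x) = \operatorname{argmin}_u\{g(u) + \frac{1}{2t}\|u - x\|^2\}$ used above; the function names are mine.

```python
import numpy as np

def prox_l1(x, t):
    """prox of t*||.||_1: componentwise soft threshold at level t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_euclidean_norm(x, t):
    """prox of t*||.||_2: shrink the Euclidean norm by t (block soft threshold)."""
    nx = np.linalg.norm(x)
    return np.zeros_like(x) if nx <= t else (1.0 - t / nx) * x

def prox_nuclear_norm(U, t):
    """prox of t*||.||_*: soft threshold the singular values."""
    P, s, Qt = np.linalg.svd(U, full_matrices=False)
    return P @ np.diag(np.maximum(s - t, 0.0)) @ Qt

x = np.array([3.0, -0.5, 1.2])
print(prox_l1(x, 1.0))   # entries shrunk toward 0 by 1; small ones become 0
```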

Rate of Convergence of Prox-Grad for Convex (M)

Theorem [Rate of Convergence of Prox-Grad]. Let $\{x^k\}$ be the sequence generated by the proximal gradient method. Then
$$F(x^k) - F(x^*) \le \frac{L\|x^0 - x^*\|^2}{2k}$$
for any optimal solution $x^*$.

Thus, to solve (M), the proximal gradient method converges at a sublinear rate in function values: the number of iterations needed for $F(x^k) - F(x^*) \le \varepsilon$ is $O(1/\varepsilon)$.

No need to know the Lipschitz constant (backtracking).

Toward a Faster Scheme

- For Prox-Grad and gradient methods: a rate of $O(1/k)$... rather slow.
- For subgradient methods: a rate of $O(1/\sqrt{k})$... even worse.

Can we find a faster method? The answer is YES!

Idea: from an old algorithm of Nesterov (1983), designed for minimizing a smooth convex function and proven to be an "optimal" first-order method (Yudin-Nemirovsky (80)).

But here our problem (M) is nonsmooth. Yet, we can derive an algorithm faster than PG for the general problem.

A Fast Prox-Grad Algorithm - [Beck-Teboulle (2009)]

An algorithm as simple as Prox-Grad; here with a constant stepsize.

Input: $L = L(f)$, a Lipschitz constant of $\nabla f$.
Step 0. Take $y^1 = x^0 \in \mathbb{E}$, $t_1 = 1$.
Step k. ($k \ge 1$) Compute
$$x^k = \operatorname{argmin}_{x\in\mathbb{E}} \left\{ g(x) + \frac{L}{2}\Big\|x - \Big(y^k - \frac{1}{L}\nabla f(y^k)\Big)\Big\|^2 \right\} \equiv \operatorname{prox}_{\frac{1}{L}g}\Big(y^k - \frac{1}{L}\nabla f(y^k)\Big) \quad \text{(main computation, as in Prox-Grad)}$$
$$(\bullet)\quad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2},$$
$$(\bullet\bullet)\quad y^{k+1} = x^k + \frac{t_k - 1}{t_{k+1}}\,(x^k - x^{k-1}).$$

- The additional computation in $(\bullet)$ and $(\bullet\bullet)$ is clearly marginal.
- Knowledge of $L(f)$ is not necessary; one can use a backtracking line search.
- With $g = 0$, this is the smooth "optimal" gradient method of Nesterov (83); with $t_k \equiv 1$ we recover PGM.
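Here is a minimal Python sketch of the fast prox-gradient iteration above (constant stepsize, no backtracking); `grad_f` and `prox_g` are assumed to be supplied by the user, exactly as in the Prox-Grad sketch earlier.

```python
import numpy as np

def fast_prox_grad(grad_f, prox_g, L, x0, iters=200):
    """Fast proximal gradient (FISTA-type): prox step taken at the extrapolated point y^k."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        x = prox_g(y - grad_f(y) / L, 1.0 / L)        # main computation, as in Prox-Grad
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # extrapolation (momentum) step
        x_prev, t = x, t_next
    return x_prev
```

Forcing $t_k \equiv 1$ zeroes the extrapolation coefficient and recovers the plain Prox-Grad sketch.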

An $O(1/k^2)$ Global Rate of Convergence for (M)

Theorem. Let $\{x^k\}$ be generated by FPG. Then for any $k \ge 1$,
$$F(x^k) - F(x^*) \le \frac{2L(f)\|x^0 - x^*\|^2}{(k+1)^2},$$
so the number of iterations to reach $F(\tilde{x}) - F^* \le \varepsilon$ is $\sim O(1/\sqrt{\varepsilon})$.

This improves Prox-Grad by a square-root factor. On the practical side, this theoretical rate is achieved.

Many computational studies have confirmed the efficiency of FPG for solving various optimization models in applications, in signal/image recovery and in machine learning: e.g., image denoising/deblurring, nuclear matrix norm regularization, matrix completion problems, multi-task learning, matrix classification, etc.

Some examples....

Back to Application: Linear Inverse Problems

Problem: Find $x \in C \subset \mathbb{E}$ which "best" solves $A(x) \approx b$, $A : \mathbb{E} \to \mathbb{F}$, where $b$ (observable output) and $A$ (blurring matrix) are known.

Approach: via Regularization Models
- $g(x)$ is a "regularizer" (one function, or a sum of functions)
- $d(b, A(x))$ is some "proximity" measure from $b$ to $A(x)$

$$\min\ \{g(x) : A(x) = b,\ x \in C\}$$
$$\min\ \{g(x) : d(b, A(x)) \le \epsilon,\ x \in C\}$$
$$\min\ \{d(b, A(x)) : g(x) \le \delta,\ x \in C\}$$
$$\min\ \{d(b, A(x)) + \lambda g(x) : x \in C\} \quad (\lambda > 0)$$

Nonsmooth regularizers are particularly useful:
- The $l_1$ norm is a "starring one", as it promotes sparsity.
- The squared $l_2$ convex proximity $\|b - A(x)\|^2$ will be the smooth part.

An Example: $l_1$ Regularization

$$\min_x\ \{\|Ax - b\|^2 + \lambda\|x\|_1\} \equiv \min_x\ \{f(x) + g(x)\}$$

The proximal map of $g(x) = \lambda\|x\|_1$ is simply
$$\operatorname{prox}_t(g)(y) = \operatorname{argmin}_u \left\{ \frac{1}{2t}\|u - y\|^2 + \lambda\|u\|_1 \right\} = \mathcal{T}_{\lambda t}(y),$$
where $\mathcal{T}_\alpha : \mathbb{R}^n \to \mathbb{R}^n$ is the shrinkage or soft-threshold operator:
$$\mathcal{T}_\alpha(x)_i = (|x_i| - \alpha)_+ \operatorname{sgn}(x_i). \tag{1}$$

Prox-Grad method = ISTA: Iterative Shrinkage/Thresholding Algorithm.

Other names in the signal processing literature include, for example: thresholded Landweber method, iterative denoising, deconvolution algorithms... [Chambolle (98); Figueiredo-Nowak (03, 05); Daubechies et al. (04), Elad et al. (06), Hale et al. (07)...]

Fast Prox-Grad = FISTA.
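A small, self-contained comparison of ISTA and FISTA on a random instance of the $l_1$-regularized least-squares problem above; the problem sizes, $\lambda$, and iteration counts are arbitrary choices for illustration.

```python
import numpy as np

def soft_threshold(z, a):
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 200))
b = rng.standard_normal(60)
lam = 0.5
Lf = 2.0 * np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of grad f, f = ||Ax-b||^2
F = lambda x: np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
grad = lambda x: 2.0 * A.T @ (A @ x - b)

def ista(iters):
    x = np.zeros(200)
    for _ in range(iters):
        x = soft_threshold(x - grad(x) / Lf, lam / Lf)    # prox-gradient step = shrinkage
    return x

def fista(iters):
    x_prev, y, t = np.zeros(200), np.zeros(200), 1.0
    for _ in range(iters):
        x = soft_threshold(y - grad(y) / Lf, lam / Lf)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

print("F after 200 ISTA steps :", F(ista(200)))
print("F after 200 FISTA steps:", F(fista(200)))
```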

A Numerical Example: l1-Image Deblurring

$$\min_x\ \{\|Ax - b\|^2 + \lambda\|x\|_1\}$$

Comparing ISTA versus FISTA on problems with:
- dimension $d$ such as $d = 256 \times 256 = 65{,}536$ and/or $512 \times 512 = 262{,}144$;
- a dense $d \times d$ matrix $A$ (Gaussian blurring times the inverse of a two-stage Haar wavelet transform);
- all problems solved with fixed $\lambda$ and Gaussian noise.

Example: $l_1$ Deblurring of the Cameraman

[Figure: original image vs. blurred and noisy image.]

1000 Iterations of ISTA versus 200 of FISTA

[Figure: ISTA after 1000 iterations vs. FISTA after 200 iterations.]

Original Versus Deblurring via FISTA

[Figure: original image vs. FISTA reconstruction after 1000 iterations.]

Function Value Errors $F(x^k) - F(x^*)$

[Figure: semilog plot of $F(x^k) - F(x^*)$ versus iteration count (0 to 10000) for ISTA and FISTA; the error axis ranges from $10^{-8}$ to $10^{1}$.]

Example 2: $l_1$ versus TV Regularization

Main difference between $l_1$ and TV regularization:

- prox of $l_1$: simple and explicit (shrinkage/soft threshold).
- prox of TV: the TV-denoising problem requires an iterative method. With $g = \mathrm{TV}$,
  $$x^{k+1} = \mathcal{D}\Big(x^k - \frac{2}{L}A^T(Ax^k - b),\ \frac{2\lambda}{L}\Big), \quad \text{where } \mathcal{D}(w, t) = \operatorname{argmin}_x \left\{ \|x - w\|^2 + 2t\,\mathrm{TV}(x) \right\}.$$

Here the prox operation is a TV-based denoising step:
- There is no analytic expression in this case. Still, it can be solved very efficiently through a smooth dual formulation by a fast gradient method [Beck-Teboulle, 2009], which can be seen as an acceleration of [Chambolle 04, 05].

Important to note:
- The fast proximal gradient method is implementable if the prox operation can be computed efficiently.
- It is not a method for solving general nonsmooth convex problems. More on this soon...
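To illustrate why the prox of TV needs an inner solver, here is a minimal sketch for the 1-D case, $\min_x \frac{1}{2}\|x - w\|^2 + \lambda\,\mathrm{TV}(x)$, solved by plain projected gradient on its smooth dual (a non-accelerated stand-in for the fast dual scheme mentioned above); the iteration count is an arbitrary choice.

```python
import numpy as np

def tv_denoise_1d(w, lam, iters=500):
    """min_x 0.5*||x - w||^2 + lam * sum_i |x_i - x_{i+1}|  (the prox of lam*TV at w),
    via projected gradient on the dual variable p, constrained to |p_i| <= 1."""
    D = lambda x: x[:-1] - x[1:]                                        # forward differences
    DT = lambda p: np.concatenate(([p[0]], p[1:] - p[:-1], [-p[-1]]))   # adjoint of D
    p = np.zeros(len(w) - 1)
    for _ in range(iters):
        x = w - lam * DT(p)                                  # primal point induced by p
        p = np.clip(p + D(x) / (4.0 * lam), -1.0, 1.0)       # gradient step + projection
    return w - lam * DT(p)

rng = np.random.default_rng(0)
noisy = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * rng.standard_normal(100)
clean = tv_denoise_1d(noisy, lam=0.5)
```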

MFISTA: Monotone FISTA

FISTA is not a monotone method. This is problematic when the prox is not exactly computed.

Input: $L \ge L(f)$, an upper bound on the Lipschitz constant of $\nabla f$.
Step 0. Take $y^1 = x^0 \in \mathbb{E}$, $t_1 = 1$.
Step k. ($k \ge 1$) Compute
$$z^k = \operatorname{prox}_{\frac{1}{L}g}\Big(y^k - \frac{1}{L}\nabla f(y^k)\Big),$$
$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2},$$
$$x^k = \operatorname{argmin}\ \{F(x) : x \in \{z^k, x^{k-1}\}\},$$
$$y^{k+1} = x^k + \frac{t_k}{t_{k+1}}(z^k - x^k) + \frac{t_k - 1}{t_{k+1}}(x^k - x^{k-1}).$$

With the same rate of convergence as FPG!
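A minimal Python sketch of the monotone variant above; the objective $F$ is passed in so the monotone step can pick the better of $z^k$ and $x^{k-1}$. As before, `grad_f` and `prox_g` are user-supplied.

```python
import numpy as np

def mfista(F, grad_f, prox_g, L, x0, iters=200):
    """Monotone FISTA: keep the best (in F) of the prox point z^k and the previous x^{k-1}."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        z = prox_g(y - grad_f(y) / L, 1.0 / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        x = z if F(z) <= F(x_prev) else x_prev               # F(x^k) never increases
        y = x + (t / t_next) * (z - x) + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev
```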

Lena and 3 Reconstructions – N=100 Iterations

[Figure: blurred and noisy image; ISTA reconstruction ($F_{100} = 0.606$); MFISTA reconstruction ($F_{100} = 0.466$).]

Extension: Gradient Schemes with Non-Euclidean Distances

- All previous schemes were based on using the squared Euclidean distance to measure the proximity of two points in $\mathbb{E}$.
- It is useful to exploit the geometry of the constraint set $X$.
- This is done by selecting a "distance-like" function.

Typical example: Bregman-type distances, based on a kernel $\psi$:
$$D_\psi(x, y) = \psi(x) - \psi(y) - \langle x - y, \nabla\psi(y)\rangle, \qquad \psi \ \text{strongly convex}.$$

Advantage: using a non-Euclidean distance that adequately exploits the constraints allows one to
1. simplify the prox computation for the given constraint set;
2. often improve the constant in the complexity bound.

Mirror descent algorithms, extragradient-like methods, Lagrangians, smoothing... Nemirovsky-Yudin (80), Teboulle (92), Beck-Teboulle (03), Nemirovsky (04), Nesterov (05), Auslender-Teboulle (05)...
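As a small illustration of the non-Euclidean idea (not taken from the slides): over the unit simplex, choosing the entropy kernel $\psi(x) = \sum_i x_i \log x_i$ makes the Bregman prox of a linearized objective explicit, giving the multiplicative mirror-descent update below. The quadratic objective used to exercise it is an arbitrary example.

```python
import numpy as np

def entropic_mirror_step(x, grad, t):
    """argmin_{u in simplex} <grad, u> + (1/t) * D_psi(u, x), with the entropy kernel psi:
    a closed-form multiplicative update that stays inside the simplex."""
    z = x * np.exp(-t * grad)
    return z / z.sum()

# Arbitrary example: minimize f(x) = 0.5 * x^T Q x over the unit simplex.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Q = M @ M.T
x = np.full(5, 0.2)                  # start at the center of the simplex
for _ in range(200):
    x = entropic_mirror_step(x, Q @ x, t=0.1)
print(x, x.sum())                    # x remains a probability vector
```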

Convex Nonsmooth Composite: Lagrangian-Based Methods

Nonsmooth Convex with Separable Objective

$$(P)\qquad p^* = \inf\{\varphi(x) \equiv f(x) + g(Ax) : x \in \mathbb{R}^n\},$$
where $f, g$ are both nonsmooth and $A : \mathbb{R}^n \to \mathbb{R}^m$ is a given linear map.

Problem (P) is equivalent, via the standard splitting-variables trick, to
$$(P)\qquad p^* = \inf\{f(x) + g(z) : Ax = z,\ x \in \mathbb{R}^n,\ z \in \mathbb{R}^m\}.$$

Rockafellar ('76) has shown that the Proximal Point Algorithm can be applied to the dual and primal-dual formulations of (P) to produce:
- the Multiplier Method (augmented Lagrangian method);
- the Proximal Method of Multipliers (PMM).

Largely ignored over the last 20 years... Recent strong revival in sparse optimization: compressive sensing, image processing, etc. A very nice recent survey with many machine learning applications: [Boyd et al. 2011].

The PMM – Rockafellar (76)

PMM – The Proximal Method of Multipliers. Generate $(x^k, z^k)$ and dual multiplier $y^k$ via
$$(x^{k+1}, z^{k+1}) \in \operatorname{argmin}_{x,z}\ \Big\{ f(x) + g(z) + \langle y^k, Ax - z\rangle + \frac{c}{2}\|Ax - z\|^2 + q_k(x, z) \Big\},$$
$$y^{k+1} = y^k + c\,(Ax^{k+1} - z^{k+1}).$$

The Augmented Lagrangian:
$$L_c(x, z, y) := \underbrace{f(x) + g(z) + \langle y, Ax - z\rangle}_{\text{Lagrangian}} + \frac{c}{2}\|Ax - z\|^2, \qquad (c > 0).$$

$$q_k(x, z) := \frac{1}{2}\Big( \|x - x^k\|^2_{M_1} + \|z - z^k\|^2_{M_2} \Big)$$
is the additional primal proximal term. The choice of $M_1 \in \mathbb{S}^n_+,\ M_2 \in \mathbb{S}^m_+$ is used to conveniently describe/analyze several variants of the PMM. With $M_1 = M_2 \equiv 0$, one recovers the Multiplier Method (PPA on the dual).

Proximal Method of Multipliers – Key Difficulty

The main computational step in PMM is to minimize, with respect to $(x, z)$, the proximal Augmented Lagrangian:
$$f(x) + g(z) + \langle y^k, Ax - z\rangle + \frac{c}{2}\|Ax - z\|^2 + q_k(x, z).$$

- The quadratic coupling term $\|Ax - z\|^2$ destroys the separability between $x$ and $z$, preventing separate minimization in $(x, z)$.
- In many applications, separate minimization is often much easier...

Strategies to overcome this difficulty:
- Approximate minimization: linearize the quadratic term $\|Ax - z\|^2$ with respect to $(x, z)$.
- Alternating minimization: à la "Gauss-Seidel" in $(x, z)$.
- A mixture of the above: partial linearization with respect to one variable, combined with alternating minimization over the other variable.

These result in various interesting schemes; e.g., the last one can be shown to recover the recent efficient primal-dual method of [Chambolle-Pock, 2010].

A Prototype: Alternating Direction of Proximal Multipliers

Eliminate the coupling in $(x, z)$ via alternating minimization steps.

Glowinski-Marocco (75), Gabay-Mercier (76), Fortin-Glowinski (83), Eckstein-Bertsekas (91)... the so-called Alternating Direction Method of Multipliers (ADM), based on the Multiplier Method, i.e., $M_1 = M_2 \equiv 0$.

(AD-PMM) Alternating Direction Proximal Method of Multipliers
1. Start with any $(x^0, z^0, y^0) \in \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^m$ and $c > 0$.
2. For $k = 0, 1, \ldots$ generate the sequence $\{x^k, z^k, y^k\}$ as follows:
$$x^{k+1} \in \operatorname{argmin}_x \left\{ f(x) + \frac{c}{2}\|Ax - z^k + c^{-1}y^k\|^2 + \frac{1}{2}\|x - x^k\|^2_{M_1} \right\},$$
$$z^{k+1} = \operatorname{argmin}_z \left\{ g(z) + \frac{c}{2}\|Ax^{k+1} - z + c^{-1}y^k\|^2 + \frac{1}{2}\|z - z^k\|^2_{M_2} \right\},$$
$$y^{k+1} = y^k + c\,(Ax^{k+1} - z^{k+1}).$$
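A minimal sketch of the scheme above with $M_1 = M_2 \equiv 0$ (i.e., plain ADM), on the $l_1$-regularized least-squares problem $f(x) = \frac{1}{2}\|Bx - d\|^2$, $g(z) = \lambda\|z\|_1$, with the splitting constraint $x = z$ (so the linear map is the identity here). The instance and the parameter $c$ are illustrative assumptions.

```python
import numpy as np

def admm_lasso(B, d, lam, c=1.0, iters=200):
    """ADM (AD-PMM with M1 = M2 = 0) for min 0.5*||Bx-d||^2 + lam*||x||_1,
    split as f(x) = 0.5*||Bx-d||^2, g(z) = lam*||z||_1, constraint x = z."""
    n = B.shape[1]
    x = np.zeros(n); z = np.zeros(n); y = np.zeros(n)
    Q = B.T @ B + c * np.eye(n)          # x-update solves a linear system with this matrix
    Btd = B.T @ d
    for _ in range(iters):
        x = np.linalg.solve(Q, Btd + c * z - y)                                # argmin_x of the aug. Lagrangian
        z = np.sign(x + y / c) * np.maximum(np.abs(x + y / c) - lam / c, 0.0)  # prox of (lam/c)*||.||_1
        y = y + c * (x - z)                                                    # multiplier update
    return z

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 120))
d = rng.standard_normal(50)
x_hat = admm_lasso(B, d, lam=0.5)
```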

Global Rate of Convergence Results

For AD-PMM and many other variants [Shefi-Teboulle (2013)].

Let $(x^*, z^*, y^*)$ be a saddle point for the Lagrangian $l$ associated with (P). Then for all $N \ge 1$:
$$l(x_N, z_N, y) - p^* \le \frac{C^2(x^*, z^*, y)}{N}, \qquad \forall y \in \mathbb{R}^m.$$

In particular:
- $f(x_N) + g(z_N) + r\|Ax_N - z_N\| - p^* \le \dfrac{c_1}{N}$;
- residual norm: $\|Ax_N - z_N\| \le \dfrac{c_2}{\sqrt{N}}$;
- original primal objective (when $g$ is Lipschitz continuous): $\varphi(x_N) - \varphi(x^*) \le \dfrac{c_3}{N}$.

The $c_i$ are positive constants (as usual, in terms of distances to an optimal solution), and for any sequence $\{x^k, z^k, y^k\}$ and any $N \ge 1$, the ergodic sequences $\{x_N, z_N, y_N\}$ are
$$x_N := \frac{1}{N}\sum_{k=0}^{N-1} x^{k+1}, \qquad z_N := \frac{1}{N}\sum_{k=0}^{N-1} z^{k+1}, \qquad y_N := \frac{1}{N}\sum_{k=0}^{N-1} y^{k+1}.$$

Non-Convex Smooth Models

Sparse PCA

Principal Component Analysis solves
$$\max\{x^T A x : \|x\|_2 = 1,\ x \in \mathbb{R}^n\}, \qquad (A \succeq 0),$$
while Sparse Principal Component Analysis solves
$$\max\{x^T A x : \|x\|_2 = 1,\ \|x\|_0 \le k,\ x \in \mathbb{R}^n\}, \qquad k \in (1, n] \ \text{(sparsity)},$$
where $\|x\|_0$ counts the number of nonzero entries of $x$.

Issues:
1. Maximizing a convex objective.
2. The hard nonconvex constraint $\|x\|_0 \le k$.

Possible approaches:
1. SDP convex relaxations [D'Aspremont et al. 2008]
2. Approximation/modified formulations: many proposed approaches

Sparse PCA via Penalization/Relaxation/Approx.

The problem of interest is the difficult sparse PCA problem as is:
$$\max\{x^T A x : \|x\|_2 = 1,\ \|x\|_0 \le k,\ x \in \mathbb{R}^n\}.$$

The literature has focused on solving various modifications:
- $l_0$-penalized PCA: $\max\{x^T A x - s\|x\|_0 : \|x\|_2 = 1\}$, $s > 0$
- Relaxed $l_1$-constrained PCA: $\max\{x^T A x : \|x\|_2 = 1,\ \|x\|_1 \le \sqrt{k}\}$
- Relaxed $l_1$-penalized PCA: $\max\{x^T A x - s\|x\|_1 : \|x\|_2 = 1\}$
- Approx-penalized: $\max\{x^T A x - s\,g_p(|x|) : \|x\|_2 = 1\}$, with $g_p(x) \simeq \|x\|_0$
- SDP convex relaxations: $\max\{\operatorname{tr}(AX) : \operatorname{tr}(X) = 1,\ X \succeq 0,\ \|X\|_1 \le k\}$

However:
- SDP relaxations are often too computationally expensive for large problems.
- No algorithm gives bounds on the optimal solution of the original problem.
- Even when "simple", these algorithms are for modifications: they do not solve the original problem of interest, and they require an unknown penalty parameter $s$ to be tuned.

Quick Highlight of Simple Algorithms for "Modified Problems"

Each scheme is listed with its update, per-iteration complexity and reference.

l1-constrained, Witten et al. (2009), O(n²) or O(mn):
    normalized soft-thresholding (same template as the next entry), applied to a shifted matrix-vector product with an iteration-dependent threshold.

l1-constrained, Sigg-Buhmann (2008), O(n²) or O(mn):
    x_i^{j+1} = sgn((Ax^j)_i)(|(Ax^j)_i| − s_j)₊ / √(Σ_h (|(Ax^j)_h| − s_j)₊²),
    where s_j is the (k+1)-largest entry of the vector |Ax^j|.

l0-penalized, Shen-Huang (2008), Journée et al. (2010), O(mn):
    z^{j+1} = Σ_i [sgn((b_iᵀz^j)² − s)]₊ (b_iᵀz^j) b_i / ‖Σ_i [sgn((b_iᵀz^j)² − s)]₊ (b_iᵀz^j) b_i‖₂.

l0-penalized, Sriperumbudur et al. (2010), O(n²):
    x_i^{j+1} = sgn(2(Ax^j)_i)(|2(Ax^j)_i| − s·φ'_p(|x_i^j|))₊ / √(Σ_h (|2(Ax^j)_h| − s·φ'_p(|x_h^j|))₊²).

l1-penalized, Zou et al. (2006):
    y^{j+1} = argmin_y { Σ_i ‖b_i − x^j yᵀb_i‖₂² + λ‖y‖₂² + s‖y‖₁ },
    x^{j+1} = (Σ_i b_i b_iᵀ) y^{j+1} / ‖(Σ_i b_i b_iᵀ) y^{j+1}‖₂.

l1-penalized, Shen-Huang (2008), Journée et al. (2010), O(mn):
    z^{j+1} = Σ_i (|b_iᵀz^j| − s)₊ sgn(b_iᵀz^j) b_i / ‖Σ_i (|b_iᵀz^j| − s)₊ sgn(b_iᵀz^j) b_i‖₂.


A Plethora of Models/Algorithms Revisited [Luss-Teboulle (2013)]

All the previously listed algorithms have been derived from various disparate approaches/motivations to solve modifications of SPCA: Expectation-Maximization; Majorization-Minimization techniques; DC programming; alternating minimization; etc.

1. Are all these algorithms different? Is there any connection?
2. Is it possible to tackle the difficult sparse PCA problem "as is"?

Very recently we have shown that:

All the previously listed algorithms are a particular realization of a "father algorithm": ConGradU (based on the well-known Conditional Gradient Algorithm).
ConGradU CAN be applied directly to the original problem!


Maximizing a Convex Function over a Compact Nonconvex Set

Classic Conditional Gradient Algorithm [Frank-Wolfe '56, Polyak '63, Dunn '79, ...]
solves max{ F(x) : x ∈ C }, with F ∈ C¹ and C convex compact:

x⁰ ∈ C,   p^j = argmax{ ⟨x − x^j, ∇F(x^j)⟩ : x ∈ C },

x^{j+1} = x^j + α_j(p^j − x^j),   α_j ∈ (0, 1] a stepsize.

Here: F is convex, possibly nonsmooth; C is compact but nonconvex.

The idea goes back to Mangasarian (1996), developed for C a polyhedral set.

ConGradU – Conditional Gradient with Unit Step Size

x⁰ ∈ C,   x^{j+1} ∈ argmax{ ⟨x − x^j, F′(x^j)⟩ : x ∈ C }

Notes:
1. F is not assumed to be differentiable; F′(x) denotes a subgradient of F at x.
2. Useful when max{ ⟨x − x^j, F′(x^j)⟩ : x ∈ C } is easy to solve.
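As a quick illustration, here is a minimal Python sketch of the ConGradU template; the subgradient oracle and the linear-maximization oracle over C are hypothetical user-supplied callables, not something specified in the talk.

```python
import numpy as np

def congradu(x0, subgrad, linmax_over_C, max_iter=100, tol=1e-8):
    """ConGradU template: x^{j+1} in argmax{ <x - x^j, F'(x^j)> : x in C }.

    subgrad(x)       -- returns a subgradient F'(x) of the convex objective F
    linmax_over_C(g) -- returns a maximizer of <x, g> over the (possibly nonconvex) set C
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = subgrad(x)
        # maximizing <x - x^j, g> over C is the same as maximizing <x, g> over C
        x_new = np.asarray(linmax_over_C(g), dtype=float)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x
```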


Solving Original l0-constrained PCA via ConGradU

Applying ConGradU directly to max{ xᵀAx : ‖x‖₂ = 1, ‖x‖₀ ≤ k, x ∈ ℝⁿ } results in

x^{j+1} = argmax{ (x^j)ᵀAx : ‖x‖₂ = 1, ‖x‖₀ ≤ k } = T_k(Ax^j) / ‖T_k(Ax^j)‖₂,

where T_k(a) := argmin_x { ‖x − a‖₂² : ‖x‖₀ ≤ k }.

Despite the hard constraint, T_k is easy to compute: (T_k(a))_i = a_i for the k largest entries (in absolute value) of a, and (T_k(a))_i = 0 otherwise.

Convergence: every limit point of {x^j} is a stationary point. Complexity: O(kn) or O(mn).

Thus, the original problem can be solved using ConGradU with the same complexity as when applied to the modifications!
Penalized/modified problems require tuning an unknown tradeoff penalty parameter; this can be very computationally expensive, and it is not needed here.
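A minimal numpy sketch of this update follows (the function names are mine, for illustration); it assumes A ⪰ 0 and a starting point for which Ax⁰ ≠ 0.

```python
import numpy as np

def hard_threshold_k(a, k):
    """T_k(a): keep the k largest-magnitude entries of a, zero out the rest."""
    x = np.zeros_like(a)
    idx = np.argsort(np.abs(a))[-k:]          # indices of the k largest |a_i|
    x[idx] = a[idx]
    return x

def l0_pca_congradu(A, k, x0, max_iter=200, tol=1e-10):
    """ConGradU for max{ x^T A x : ||x||_2 = 1, ||x||_0 <= k }: x <- T_k(Ax)/||T_k(Ax)||_2."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    for _ in range(max_iter):
        v = hard_threshold_k(A @ x, k)
        nv = np.linalg.norm(v)
        if nv == 0.0:                          # degenerate case Ax = 0: stay put
            return x
        x_new = v / nv
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x
```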


ConGradU for a General Class of Problems

(G)   max_x { f(x) + g(|x|) : x ∈ C }

f : ℝⁿ → ℝ is convex; C ⊆ ℝⁿ is a compact set; g : ℝⁿ₊ → ℝ is convex, differentiable and monotone decreasing.

Particularly useful for handling approximate l0-penalized problems.

ConGradU applied to (G) produces the following simple

Weighted l1-norm maximization problem:

x⁰ ∈ C,   x^{j+1} ∈ argmax{ ⟨a^j, x⟩ − Σ_i w_i^j |x_i| : x ∈ C },   j = 0, 1, ...,

where w^j := −g′(|x^j|) > 0 and a^j := f′(x^j) ∈ ℝⁿ.

For penalized/approximately penalized SPCA, C is the unit ball, and the above admits a closed-form solution:

x^{j+1} = S_{w^j}(f′(x^j)) / ‖S_{w^j}(f′(x^j))‖,   j = 0, 1, ...;   S_w(a) := (|a| − w)₊ sgn(a)   (soft threshold).
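A minimal Python sketch of this weighted-l1 update for the unit-ball case follows; f_grad and g_grad are hypothetical callables returning f′(x) and g′(·) for whatever concrete penalty approximation one chooses.

```python
import numpy as np

def soft_threshold(a, w):
    """S_w(a) = (|a| - w)_+ * sgn(a), applied componentwise."""
    return np.sign(a) * np.maximum(np.abs(a) - w, 0.0)

def congradu_weighted_l1(x0, f_grad, g_grad, max_iter=200, tol=1e-10):
    """ConGradU for (G) when C is the Euclidean unit ball:
    x^{j+1} = S_{w^j}(f'(x^j)) / ||S_{w^j}(f'(x^j))||, with w^j = -g'(|x^j|)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        a = f_grad(x)                          # a^j = f'(x^j)
        w = -np.asarray(g_grad(np.abs(x)))     # w^j = -g'(|x^j|) > 0, since g is decreasing
        v = soft_threshold(a, w)
        if not np.any(v):                      # all entries thresholded to zero: stop
            return x
        x_new = v / np.linalg.norm(v)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x
```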


Non-Convex and NonSmooth


A Nonsmooth Nonconvex Optimization Model

(M)   minimize_{x,y}  Ψ(x, y) := f(x) + g(y) + H(x, y)

Assumption

(i) f : ℝⁿ → (−∞, +∞] and g : ℝᵐ → (−∞, +∞] are proper and lsc functions.
(ii) H : ℝⁿ × ℝᵐ → ℝ is a C¹ function.
(iii) The partial gradients of H are Lipschitz continuous: H(·, y) ∈ C^{1,1}_{L₁(y)} and H(x, ·) ∈ C^{1,1}_{L₂(x)}.

NO convexity is assumed in the objective or the constraints (built in through the extended-valued f and g).
The choice of two blocks of variables is only for the sake of simplicity of exposition.
The optimization model (M) covers many applications: signal/image processing, machine learning, etc. Vast literature.


Goal

Derive a simple scheme.
Prove that the whole sequence {z^k}_{k∈ℕ} := {(x^k, y^k)} converges to a critical point of Ψ.
Exploit partial smoothness: blend alternating minimization with proximal-gradient steps.
Exploit further data information: build a general recipe and algorithmic framework to prove convergence for a broad class of nonconvex nonsmooth problems.

J. Bolte, S. Sabach, M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, Series A. Just published online.


The Algorithm: Proximal Alternating Linearized Minimization (PALM)

1. Initialization: start with any (x⁰, y⁰) ∈ ℝⁿ × ℝᵐ.
2. For each k = 0, 1, ... generate a sequence {(x^k, y^k)}_{k∈ℕ}:

2.1. Take γ₁ > 1, set c_k = γ₁ L₁(y^k) and compute
     x^{k+1} ∈ prox^f_{c_k} ( x^k − (1/c_k) ∇_x H(x^k, y^k) ).

2.2. Take γ₂ > 1, set d_k = γ₂ L₂(x^{k+1}) and compute
     y^{k+1} ∈ prox^g_{d_k} ( y^k − (1/d_k) ∇_y H(x^{k+1}, y^k) ).

Main computational step: the prox of a "nonconvex" function.

An interesting example will be given later. (More in our paper...)
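A minimal sketch of the PALM loop in Python follows; the partial gradients, the moduli L₁, L₂ and the prox operators are hypothetical user-supplied callables, and the γ values below are illustrative choices.

```python
import numpy as np

def palm(x0, y0, grad_x, grad_y, L1, L2, prox_f, prox_g,
         gamma1=1.1, gamma2=1.1, max_iter=500):
    """Generic PALM for min_{x,y} f(x) + g(y) + H(x, y).

    grad_x(x, y), grad_y(x, y) -- partial gradients of H
    L1(y), L2(x)               -- Lipschitz moduli of the partial gradients
    prox_f(v, t), prox_g(v, t) -- proximal maps of f and g with parameter t
    """
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        ck = gamma1 * L1(y)
        x = prox_f(x - grad_x(x, y) / ck, ck)      # step 2.1
        dk = gamma2 * L2(x)
        y = prox_g(y - grad_y(x, y) / dk, dk)      # step 2.2 (uses the fresh x^{k+1})
        # (a stopping test on ||z^{k+1} - z^k|| could be added here)
    return x, y
```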


Convergence of PALM: If the Data [f, g, H] is Semi-Algebraic

Theorem (Bolte-Sabach-Teboulle (2013))
Let {z^k}_{k∈ℕ} be a sequence generated by PALM. The following assertions hold.
(i) The sequence {z^k}_{k∈ℕ} has finite length, that is,

    Σ_{k=1}^{∞} ‖z^{k+1} − z^k‖ < ∞.

(ii) The sequence {z^k}_{k∈ℕ} converges to a critical point z* = (x*, y*) of Ψ.

Are there many semi-algebraic functions? How do we prove this convergence result?


An Informal General Convergence Proof Recipe

Let Ψ : ℝᴺ → (−∞, +∞] be a proper, lsc function, bounded from below.

(P)   inf { Ψ(z) : z ∈ ℝᴺ }.

Suppose A is a generic algorithm which generates a sequence {z^k}_{k∈ℕ} via:

z⁰ ∈ ℝᴺ,   z^{k+1} ∈ A(z^k),   k = 0, 1, ....

Goal: prove that the whole sequence {z^k}_{k∈ℕ} converges to a critical point of Ψ.


The Recipe

Basically, the "Recipe" consists of three main steps.

(i) Sufficient decrease property: find a positive constant ρ₁ such that

    ρ₁‖z^{k+1} − z^k‖² ≤ Ψ(z^k) − Ψ(z^{k+1}),   ∀k = 0, 1, ....

(ii) A subgradient lower bound for the iterates gap: assume that {z^k}_{k∈ℕ} is bounded. Find another positive constant ρ₂ such that

    ‖w^k‖ ≤ ρ₂‖z^k − z^{k−1}‖,   w^k ∈ ∂Ψ(z^k),   ∀k = 1, 2, ....

These two steps are typical of descent-type algorithms, but they lead ONLY to convergence of limit points.

Does the whole sequence {z^k}_{k∈ℕ} converge to a critical point of Problem (M)?


The Third Main Step of our Recipe

(iii) The Kurdyka-Łojasiewicz property: assume that Ψ is a KL function. Use this property to prove that the generated sequence {z^k}_{k∈ℕ} is a Cauchy sequence, and thus converges!

We stress that this general recipe:
- singles out the 3 main ingredients at play in analyzing many other optimization algorithms in the nonconvex and nonsmooth setting;
- shows that KL is the key for proving that the sequence generated by algorithm A is Cauchy... a rare event in optimization schemes!

The remaining questions:
What is a KL function? [Łojasiewicz (68), Kurdyka (98), Bolte et al. (06, 07, 10)]
Are there many KL functions?
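To give an idea of how the three ingredients interact, here is a hedged sketch of the standard finite-length argument (written in the usual form of such proofs, not quoted from the talk): φ denotes a KL desingularizing function and r_k := Ψ(z^k) − Ψ(z̄) for a suitable limiting value Ψ(z̄).

```latex
% Sketch (standard argument): \varphi is a KL desingularizer, r_k := \Psi(z^k) - \Psi(\bar z).
% Concavity of \varphi, the KL inequality \varphi'(r_k)\,\operatorname{dist}(0,\partial\Psi(z^k)) \ge 1,
% the sufficient decrease (i), and the subgradient bound (ii) combine into
\[
  \varphi(r_k)-\varphi(r_{k+1})
  \;\ge\; \varphi'(r_k)\,(r_k-r_{k+1})
  \;\ge\; \frac{\rho_1\,\|z^{k+1}-z^k\|^2}{\operatorname{dist}\bigl(0,\partial\Psi(z^k)\bigr)}
  \;\ge\; \frac{\rho_1}{\rho_2}\cdot\frac{\|z^{k+1}-z^k\|^2}{\|z^k-z^{k-1}\|}.
\]
% Rearranging and using 2\sqrt{ab} \le a+b with a = \|z^k-z^{k-1}\| gives
\[
  2\,\|z^{k+1}-z^k\| \;\le\; \|z^k-z^{k-1}\|
  + \frac{\rho_2}{\rho_1}\bigl(\varphi(r_k)-\varphi(r_{k+1})\bigr),
\]
% and summing over k telescopes, so \sum_k \|z^{k+1}-z^k\| < \infty: finite length, hence Cauchy.
```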


Are there Many Functions Satisfying KL?

YES! Semi-algebraic functions.

Theorem (Bolte-Daniilidis-Lewis (2006))
Let σ : ℝᵈ → (−∞, +∞] be a proper and lsc function. If σ is semi-algebraic, then it satisfies the KL property at any point of dom σ.

Recall: semi-algebraic sets and functions.
(i) A semi-algebraic subset of ℝᵈ is a finite union of sets

    { x ∈ ℝᵈ : p_i(x) = 0, q_j(x) < 0, i ∈ I, j ∈ J },

where p_i, q_j : ℝᵈ → ℝ are real polynomial functions and I, J are finite.
(ii) A function σ is semi-algebraic if its graph is a semi-algebraic set.


There is a Wealth of Semi-Algebraic Functions!

Operations preserving the semi-algebraic property:
- Finite sums and products of semi-algebraic functions.
- Composition of semi-algebraic functions.
- Sup/inf-type functions, e.g., sup{ g(u, v) : v ∈ C } is semi-algebraic when g is a semi-algebraic function and C a semi-algebraic set.

Some semi-algebraic sets/functions "starring" in optimization:
- Real polynomial functions.
- Indicator functions of semi-algebraic sets.
- In matrix theory: the cone of PSD matrices, constant-rank matrices, Stiefel manifolds, ...
- The function x → dist(x, S)² is semi-algebraic whenever S is a nonempty semi-algebraic subset of ℝⁿ.
- ‖·‖₀ is semi-algebraic.
- ‖·‖_p is semi-algebraic whenever p > 0 is rational.


Application to a Broad Class of Matrix Factorization Problems

Given A ∈ ℝ^{m×n} and r ≪ min{m, n}, find X ∈ ℝ^{m×r} and Y ∈ ℝ^{r×n} such that

A ≈ XY,   X ∈ K_{m,r} ∩ F,   Y ∈ K_{r,n} ∩ G,

where

K_{p,q} = { M ∈ ℝ^{p×q} : M ≥ 0 },
F = { X ∈ ℝ^{m×r} : R₁(X) ≤ α },
G = { Y ∈ ℝ^{r×n} : R₂(Y) ≤ β }.

Here R₁ and R₂ are lsc functions and α, β ∈ ℝ₊ are given parameters. R₁ (resp. R₂) is often used to describe some additional features of X (resp. Y).

(MF) covers a very large number of problems in applications...


The Optimization Approach

We adopt the constrained nonconvex nonsmooth formulation

(MF)   min { d(A, XY) : X ∈ K_{m,r} ∩ F, Y ∈ K_{r,n} ∩ G },

where d : ℝ^{m×n} × ℝ^{m×n} → ℝ₊ is a proximity function measuring the quality of the approximation, with d(U, V) = 0 if and only if U = V.

Another way: penalized version. The "hard" constraints are candidates to be penalized, with penalty parameters µ₁, µ₂ > 0:

(P-MF)   min { µ₁R₁(X) + µ₂R₂(Y) + d(A, XY) : X ∈ K_{m,r}, Y ∈ K_{r,n} }.

Note: the penalty approach requires tuning the unknown penalty parameters, which can be a difficult issue.

Both formulations fit our general nonsmooth nonconvex model (M) with obvious identifications for H, f, g. We now illustrate this on two important models with semi-algebraic data.


Model I – Nonnegative Matrix Factorization Problems

Let the proximity measure be defined via the Frobenius norm,

d(A, XY) := H(X, Y) = (1/2)‖A − XY‖²_F,   and   F ≡ ℝ^{m×r}, G ≡ ℝ^{r×n}.

Problem (MF) then reduces to the so-called Nonnegative Matrix Factorization (NMF) problem:

min { (1/2)‖A − XY‖²_F : X ≥ 0, Y ≥ 0 }.

H is a real polynomial function, hence semi-algebraic.
X → H(X, Y) (for fixed Y) and Y → H(X, Y) (for fixed X) are C^{1,1}, with L₁(Y) ≡ ‖YYᵀ‖_F and L₂(X) ≡ ‖XᵀX‖_F.
H is C² on bounded subsets.

Thus we can PALM it! The two computational steps reduce to the projection onto the nonnegative cone of matrices, which is trivial:

P₊(U) := argmin{ ‖U − V‖²_F : V ∈ ℝ^{m×n}, V ≥ 0 } = max{0, U}.
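A minimal numpy sketch of PALM specialized to NMF with the step sizes above follows; the value of γ and the random initialization are illustrative choices, not prescriptions from the talk.

```python
import numpy as np

def palm_nmf(A, r, gamma=1.1, max_iter=500, seed=0):
    """PALM for min (1/2)||A - XY||_F^2 subject to X >= 0, Y >= 0.
    Each prox step is the projection P_+ (entrywise max with 0)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X, Y = rng.random((m, r)), rng.random((r, n))
    for _ in range(max_iter):
        ck = gamma * np.linalg.norm(Y @ Y.T, 'fro')        # c_k = gamma_1 * ||Y^k (Y^k)^T||_F
        X = np.maximum(X - ((X @ Y - A) @ Y.T) / ck, 0.0)  # gradient step in X, then P_+
        dk = gamma * np.linalg.norm(X.T @ X, 'fro')        # d_k = gamma_2 * ||(X^{k+1})^T X^{k+1}||_F
        Y = np.maximum(Y - (X.T @ (X @ Y - A)) / dk, 0.0)  # gradient step in Y, then P_+
    return X, Y
```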


Model II - Sparse Constraints in NMF Problems

Now consider, in NMF, the overall sparsity measure of a matrix defined by

R₁(X) = ‖X‖₀ := Σ_i ‖x_i‖₀   (x_i the column vectors of X);   R₂(Y) = ‖Y‖₀.

To apply PALM, all we need is to compute the prox of f := δ_{X ≥ 0} + δ_{‖X‖₀ ≤ s}. It turns out that this can be done simply!

Proposition (Proximal map formula for f)
Let U ∈ ℝ^{m×n}. Then

prox^f_1(U) = argmin{ (1/2)‖X − U‖²_F : X ≥ 0, ‖X‖₀ ≤ s } = T_s(P₊(U)),

where

T_s(U) := argmin_{V ∈ ℝ^{m×n}} { ‖U − V‖²_F : ‖V‖₀ ≤ s }.

Computing T_s simply requires determining the s-th largest of the mn entries, which can be done in O(mn) time, and then zeroing out the proper entries in one more pass over the mn numbers.
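A minimal numpy sketch of this proximal map (T_s ∘ P₊) follows; np.argpartition provides the linear-time selection of the s largest entries.

```python
import numpy as np

def prox_nonneg_sparse(U, s):
    """Prox of delta_{X >= 0} + delta_{||X||_0 <= s}: project onto the nonnegative
    orthant, then keep only the s largest remaining entries (T_s o P_+)."""
    V = np.maximum(U, 0.0)                    # P_+(U)
    if s <= 0:
        return np.zeros_like(V)
    if np.count_nonzero(V) <= s:
        return V
    flat = V.ravel()
    keep = np.argpartition(flat, -s)[-s:]     # indices of the s largest entries
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(V.shape)
```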


PALM for Sparse NMF

1. Initialization: select random nonnegative X⁰ ∈ ℝ^{m×r} and Y⁰ ∈ ℝ^{r×n}.
2. For each k = 0, 1, ... generate a sequence {(X^k, Y^k)}_{k∈ℕ}:

2.1. Take γ₁ > 1, set c_k = γ₁‖Y^k (Y^k)ᵀ‖_F and compute
     U^k = X^k − (1/c_k)(X^k Y^k − A)(Y^k)ᵀ;   X^{k+1} ∈ prox^{R₁}_{c_k}(U^k) = T_α(P₊(U^k)).

2.2. Take γ₂ > 1, set d_k = γ₂‖(X^{k+1})ᵀ X^{k+1}‖_F and compute
     V^k = Y^k − (1/d_k)(X^{k+1})ᵀ(X^{k+1} Y^k − A);   Y^{k+1} ∈ prox^{R₂}_{d_k}(V^k) = T_β(P₊(V^k)).

Applying our main Theorem, we get the global convergence result:

Let {(X^k, Y^k)}_{k∈ℕ} be a sequence generated by PALM-Sparse NMF. If inf_{k∈ℕ} ‖X^k‖_F > 0 and inf_{k∈ℕ} ‖Y^k‖_F > 0, then {(X^k, Y^k)}_{k∈ℕ} converges to a critical point (X*, Y*) of the Sparse NMF problem.


Concluding Remarks on FOM

- Motivates new model formulations and novel/refined algorithms which exploit special structures.
- Powerful for constructing cheap iterations.
- Efficient algorithms in many applied optimization models with structure.
- Further research is needed on simple and efficient schemes that can cope with the curse of dimensionality and with nonconvex/nonsmooth settings.

THANK YOU FOR YOUR ATTENTION!
