TRANSCRIPT
Gradient Based Optimization Algorithms
Marc Teboulle
School of Mathematical Sciences, Tel Aviv University
Based on joint works with
Amir Beck (Technion), Jérôme Bolte (Toulouse I), Ronny Luss (IBM, New York), Shoham Sabach (Göttingen), Ron Shefi (Tel Aviv)
EPFL - Lausanne, February 4, 2014
Marc Teboulle (Tel Aviv University) Gradient Based Optimization Algorithms
Opening Remark and Credit

More than 380 years ago, in 1629, Fermat suggested the following:

Given f, solve for x:

[f(x + d) − f(x)] / d |_{d=0} = 0

"...We can hardly expect to find a more general method to get the maximum or minimum points on a curve..."

Pierre de Fermat
First-Order/Gradient Methods

First-order methods are iterative methods that exploit only information on the objective function and its gradient (subgradient).

A main drawback: they can be very slow at producing high-accuracy solutions.
But they also share many advantages:
- They require minimal information, e.g., (f, f').
- They often lead to very simple and "cheap" iterative schemes.
- They are suitable for large-scale problems when high accuracy is not crucial. [In many large-scale applications the data is anyway corrupted or known only roughly.]
Polynomial versus Gradient Methods

Convex problems are polynomially solvable within ε accuracy:

Running time ~ Poly(problem size, # of accuracy digits).

Theoretically: this means that large-scale problems can be solved to high accuracy with polynomial methods, such as interior-point methods (IPM).
Practically: the running time is dimension-dependent and grows nonlinearly with the problem's dimension. For IPM, which are Newton-type methods: ~ O(n³).

Thus, a "single iteration" of IPM can last forever!
Gradient-Based Algorithms

Widely used in applications:
- Clustering analysis: the k-means algorithm
- Neuro-computing: the backpropagation algorithm
- Statistical estimation: the EM (Expectation-Maximization) algorithm
- Machine learning: SVM, regularized regression, etc.
- Signal and image processing: sparse recovery, denoising and deblurring schemes, total variation minimization
- Matrix minimization problems
...and much more.
Goals and Outline

Goal: build simple and efficient first-order methods (FOM) that exploit problem structure and data information, and analyze their complexity and rate of convergence.

Outline
- Problem formulations: convex and nonconvex models
- Algorithms: derivation and analysis
- Applications: linear inverse problems, image processing, dimension reduction [PCA, NMF]
Algorithms - Historical Development

Old schemes for general problems "resuscitated" and "improved" by exploiting special structures and data info:
- Fixed-point methods [Babylonian times!/Heron for square roots, Picard, Banach, Weiszfeld '34]
- Coordinate descent/alternating minimization [Gauss-Seidel 1798]
- Gradient/subgradient methods [Cauchy 1846, Shor '61, Rosen '63, Frank-Wolfe '56, Polyak '62]
- Stochastic gradients [Robbins and Monro '51]
- Proximal algorithms [Martinet '70, Rockafellar '76, Passty '79]
- Penalty/barrier methods [Courant '49, Fiacco-McCormick '66]
- Augmented Lagrangians and splitting [Arrow-Hurwicz '58, Hestenes-Powell '69, Goldstein-Treyakov '72, Rockafellar '74, Lions-Mercier '79, Fortin-Glowinski '76, Bertsekas '82]
- Extragradient methods for VI [Korpelevich '76, Konnov '80]
- Optimal/fast gradient schemes [Nemirovski-Yudin '81, Nesterov '83]
Gradient-Based Methods for Convex Composite Models: Smooth + Nonsmooth
A Useful Convex Optimization Model

(M)   min { F(x) = f(x) + g(x) : x ∈ E }

- E is a finite-dimensional Euclidean space.
- f : E → R is convex and smooth of type C^{1,1}, i.e.,

  ∃ L(f) > 0 : ‖∇f(x) − ∇f(y)‖ ≤ L(f)‖x − y‖, ∀ x, y.

- g : E → (−∞, ∞] is convex, nonsmooth, and extended-valued (allowing constraints).
- We assume that (M) is solvable, i.e.,

  X* := argmin F ≠ ∅, and for x* ∈ X*, set F* := F(x*).

The model (M) already carries structural information. It is rich enough to recover various classes of smooth/nonsmooth convex minimization problems.
Special Cases of the General Model

- g = 0: smooth unconstrained convex minimization,  min_x f(x)
- g = δ(·|C): constrained smooth convex minimization,  min_x { f(x) : x ∈ C }
- f = 0: general convex optimization,  min_x g(x)
Anecdote: The World's Simplest Impossible Problem
[From C. Moler (1990)]

Problem: The average of two numbers is 3. What are the numbers?

Typical answers: (2,4), (1,5), (-3,9)... These already ask for "structure": equal distance from the average, integer values... Why not (2.71828, 3.28172)?!
A nice one: (3,3), the answer with minimal norm, and it is unique!
Simplest: (6,0) or (0,6)? A sparse one! But here uniqueness is lost.

This simple problem captures the essence of many ill-posed/underdetermined problems in applications.

Additional requirements/constraints have to be specified to make it a reasonable mathematical/computational task, and these often lead to convex optimization models.
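A quick numerical aside (my own, not on the slide): "the average is 3" is one linear equation in two unknowns, and a least-squares solver applied to this underdetermined system mechanically returns exactly the minimal-norm answer (3,3).

```python
import numpy as np

# "The average of two numbers is 3" as an underdetermined linear system:
# 0.5*x1 + 0.5*x2 = 3, i.e. A @ x = b with A of shape (1, 2).
A = np.array([[0.5, 0.5]])
b = np.array([3.0])

# lstsq returns the minimum-norm solution of an underdetermined system.
x_min_norm, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_min_norm)  # -> [3. 3.], the unique minimal-norm answer

# A sparse answer such as (6, 0) also satisfies the constraint, but has a
# larger Euclidean norm -- and is not unique ((0, 6) works too).
assert np.allclose(A @ np.array([6.0, 0.0]), b)
```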
Linear Inverse Problems

Problem: Find x ∈ C ⊂ E which "best" solves A(x) ≈ b, A : E → F, where b (the observable output) and A are known.

Approach: via regularization models.
- g(x) is a "regularizer" (one function, or a sum of functions)
- d(b, A(x)) is some "proximity" measure from b to A(x)

min { g(x) : A(x) = b, x ∈ C }
min { g(x) : d(b, A(x)) ≤ ε, x ∈ C }
min { d(b, A(x)) : g(x) ≤ δ, x ∈ C }
min { d(b, A(x)) + λ g(x) : x ∈ C }   (λ > 0)

- Intensive research activity over the last 50 years... and now even more, with sparse optimization problems.
- The choices of g(·) and d(·,·) depend on the application at hand. Nonsmooth regularizers are particularly useful.
Image Processing

g = λ‖L(·)‖₁: the l1-regularized convex problem

min_x { f(x) + λ‖Lx‖₁ },   f(x) := d(b, Ax)

L: identity, a differential operator, or a wavelet transform.

g = TV(·): TV-based regularization, which encourages grid points (i, j) to share the intensity of their neighbors [ROF 1992]:

min_x { f(x) + λ TV(x) }

1-dim:  TV(x) = Σ_i |x_i − x_{i+1}|
2-dim:
  isotropic:   TV(x) = Σ_i Σ_j √((x_{i,j} − x_{i+1,j})² + (x_{i,j} − x_{i,j+1})²)
  anisotropic: TV(x) = Σ_i Σ_j (|x_{i,j} − x_{i+1,j}| + |x_{i,j} − x_{i,j+1}|)

More on this soon...
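The discrete TV variants above take only a few lines of NumPy (a sketch of my own, using the same forward differences as the formulas; boundary terms are simply dropped):

```python
import numpy as np

def tv_1d(x):
    """1-D total variation: sum of |x_i - x_{i+1}|."""
    return np.sum(np.abs(np.diff(x)))

def tv_aniso(X):
    """Anisotropic 2-D TV: sum of |X[i,j]-X[i+1,j]| + |X[i,j]-X[i,j+1]|."""
    return np.sum(np.abs(np.diff(X, axis=0))) + np.sum(np.abs(np.diff(X, axis=1)))

def tv_iso(X):
    """Isotropic 2-D TV: sum of sqrt(dx^2 + dy^2) over interior grid points."""
    dx = X[:-1, :-1] - X[1:, :-1]   # X[i,j] - X[i+1,j]
    dy = X[:-1, :-1] - X[:-1, 1:]   # X[i,j] - X[i,j+1]
    return np.sum(np.sqrt(dx**2 + dy**2))

# A constant image has zero TV; a single vertical step edge has
# TV = (edge length) * (jump height), so TV penalizes oscillation, not edges.
flat = np.ones((8, 8))
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0
print(tv_aniso(flat), tv_aniso(edge))  # -> 0.0 8.0
```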
Building Gradient-Based Schemes: Basic Old Idea

Pick an adequate approximate model.

1. Linearize + regularize: Given some y, approximate f(x) + g(x) via

   q(x, y) = f(y) + ⟨x − y, ∇f(y)⟩ + (1/2t)‖x − y‖² + g(x),   (t > 0)

   that is, leaving the nonsmooth part g(·) untouched.

2. Linearize only + use info on C: e.g., C compact, g := δ_C,

   q(x, y) = f(y) + ⟨x − y, ∇f(y)⟩

Solve, "somehow", the resulting approximate model:

   x_{k+1} = argmin_x q(x, x_k),   k = 0, 1, ...

Similarly for subgradient schemes, e.g., using some ξ(y) ∈ ∂g(y).
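For scheme 1 with g ≡ 0, the approximate model is minimized in closed form: ∇_x q(x, y) = ∇f(y) + (x − y)/t = 0 gives x = y − t∇f(y), a plain gradient step. A minimal numeric sketch of the resulting iteration (my own illustration, on a strongly convex quadratic):

```python
import numpy as np

# f(x) = 0.5*x^T Q x - b^T x, a strongly convex quadratic.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: Q @ x - b

L = np.linalg.eigvalsh(Q).max()   # Lipschitz constant of grad_f
t = 1.0 / L

# With g = 0, argmin_x q(x, y) is available in closed form:
# grad_x q = grad_f(y) + (x - y)/t = 0  =>  x = y - t*grad_f(y).
x = np.zeros(2)
for _ in range(200):
    x = x - t * grad_f(x)         # x_{k+1} = argmin_x q(x, x_k)

print(x, np.linalg.solve(Q, b))   # iterates approach the minimizer Q^{-1} b
```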
Example: The Proximal Gradient Method

The derivation of the proximal gradient method is similar to that of the (sub)gradient projection method.

For any L ≥ L(f) and a given iterate x_k:

Q_L(x, x_k) := f(x_k) + ⟨x − x_k, ∇f(x_k)⟩ + (L/2)‖x − x_k‖² + g(x)   [g left untouched]

Algorithm:

x_{k+1} := argmin_x Q_L(x, x_k)
        = argmin_x { g(x) + (L/2)‖x − (x_k − (1/L)∇f(x_k))‖² }
        = prox_{(1/L)g}( x_k − (1/L)∇f(x_k) ),

where the prox operator is

prox_h(z) := argmin_u { h(u) + (1/2)‖u − z‖² }.
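The update above is just a gradient step on f followed by a prox step on g/L; a generic sketch (my own, with the prox map passed in as a function, demonstrated on a box constraint so the prox is a projection):

```python
import numpy as np

def prox_gradient(grad_f, prox_g, x0, L, iters=500):
    """Iterate x_{k+1} = prox_{g/L}(x_k - (1/L) * grad_f(x_k))."""
    x = x0
    for _ in range(iters):
        x = prox_g(x - grad_f(x) / L, 1.0 / L)
    return x

# Demo: min 0.5*||x - c||^2 over the box [0,1]^3, i.e. g is the box
# indicator and prox_g is just the projection (independent of the step t).
c = np.array([-0.5, 0.3, 1.7])
grad_f = lambda x: x - c                       # f(x) = 0.5*||x - c||^2, L = 1
prox_box = lambda z, t: np.clip(z, 0.0, 1.0)   # projection onto [0,1]^3
x = prox_gradient(grad_f, prox_box, np.zeros(3), L=1.0)
print(x)  # -> [0.  0.3 1. ], the projection of c onto the box
```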
The Proximal Gradient Method for (M)

The proximal gradient method with a constant stepsize rule.

Proximal Gradient Method (with Constant Stepsize)
Input: L = L(f), a Lipschitz constant of ∇f.
Step 0. Take x_0 ∈ E.
Step k. (k ≥ 1) Compute the prox of g at a gradient step for f:

x_k = prox_{(1/L)g}( x_{k−1} − (1/L)∇f(x_{k−1}) )

The Lipschitz constant L(f) is not always known or easily computable; this issue is resolved with an easy backtracking procedure.

1. Can we compute prox_{g/L}(·) efficiently?
2. What is the global rate of convergence of PGM?

First, we illustrate some special cases of Prox-Grad.
Special Cases of Proximal Gradient

The Prox-Grad method: x_{k+1} = prox_{(1/L)g}( x_k − (1/L)∇f(x_k) ).

- g ≡ 0 ⇒ the gradient method:  x_{k+1} = x_k − (1/L)∇f(x_k)
- g = δ(·|C) ⇒ the gradient projection method:  x_{k+1} = P_C( x_k − (1/L)∇f(x_k) )
- f = 0 ⇒ the proximal minimization method:  x_{k+1} = prox_{(1/L)g}(x_k)
Some Calculus Rules for Computing prox_{tg}

prox_{tg}(x) = argmin_u { g(u) + (1/2t)‖u − x‖² }.

g(u)                              | prox_{tg}(x)
----------------------------------|------------------------------------------
δ_C(u)                            | Π_C(x)
σ_C(u)  (support function)        | x − Π_C(x)
d_C(u)  (distance to C)           | x + (t/d_C(x))(Π_C(x) − x) if d_C(x) > t;  Π_C(x) otherwise
‖Au − b‖²/2,  A ∈ R^{m×n}         | (I + tAᵀA)^{−1}(x + tAᵀb)
‖u‖₁  (shrinkage)                 | sgn(x_j) max{|x_j| − t, 0}
‖u‖  (Euclidean norm)             | (1 − t/‖x‖)x if ‖x‖ > t;  0 otherwise
‖U‖_*,  U ∈ R^{m×n} (m ≥ n)       | P diag(s)Qᵀ

For the nuclear norm: σ_1(U) ≥ σ_2(U) ≥ ... are the singular values of U, and ‖U‖_* = Σ_j σ_j(U). Given the singular value decomposition U = P diag(σ)Qᵀ, the prox shrinks the singular values: s_j = sgn(σ_j) max{|σ_j| − t, 0}.
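Two rows of the table are easy to verify numerically (a sketch; the loop at the end checks the defining minimization property of the prox directly, against random nearby candidates):

```python
import numpy as np

def prox_l1(x, t):
    """Row ||u||_1: componentwise shrinkage sgn(x_j)*max(|x_j| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l2(x, t):
    """Row ||u||: shrink the Euclidean norm by t; 0 once ||x|| <= t."""
    nx = np.linalg.norm(x)
    return np.zeros_like(x) if nx <= t else (1.0 - t / nx) * x

# Generic sanity check: p = prox_{tg}(x) minimizes g(u) + ||u - x||^2/(2t),
# so it should beat any nearby candidate point.
rng = np.random.default_rng(0)
x, t = rng.normal(size=5), 0.3
obj = lambda u: np.sum(np.abs(u)) + np.sum((u - x) ** 2) / (2 * t)
p = prox_l1(x, t)
for _ in range(100):
    assert obj(p) <= obj(p + 0.01 * rng.normal(size=5)) + 1e-12
print(p)  # small components of x are thresholded exactly to zero
```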
Rate of Convergence of Prox-Grad for Convex (M)

Theorem [Rate of Convergence of Prox-Grad]. Let {x_k} be the sequence generated by the proximal gradient method. Then

F(x_k) − F(x*) ≤ L‖x_0 − x*‖² / (2k)

for any optimal solution x*.

Thus, to solve (M), the proximal gradient method converges at a sublinear rate in function values: the number of iterations needed for F(x_k) − F(x*) ≤ ε is O(1/ε).

No need to know the Lipschitz constant (backtracking).
Toward a Faster Scheme

For Prox-Grad and gradient methods: a rate of O(1/k)... rather slow.
For subgradient methods: a rate of O(1/√k)... even worse.

Can we find a faster method? The answer is YES!

Idea: start from an old algorithm of Nesterov (1983), designed for minimizing a smooth convex function and proven to be an "optimal" first-order method (Yudin-Nemirovsky (80)). But here our problem (M) is nonsmooth. Yet we can derive an algorithm faster than PG for the general problem.
A Fast Prox-Grad Algorithm [Beck-Teboulle (2009)]

An algorithm just as simple as prox-grad. Here with constant stepsize:

Input: L = L(f), a Lipschitz constant of ∇f.
Step 0. Take y_1 = x_0 ∈ E, t_1 = 1.
Step k. (k ≥ 1) Compute

x_k = argmin_{x ∈ E} { g(x) + (L/2)‖x − (y_k − (1/L)∇f(y_k))‖² }
    ≡ prox_{(1/L)g}( y_k − (1/L)∇f(y_k) )   (the main computation, as in Prox-Grad)

(•)   t_{k+1} = (1 + √(1 + 4t_k²)) / 2,
(••)  y_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}).

The additional computation in (•) and (••) is clearly marginal.
Knowledge of L(f) is not necessary; one can use a backtracking line search.
With g = 0 this is the smooth "optimal" gradient method of Nesterov (83); with t_k ≡ 1 we recover PGM.
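The scheme above in code, as a sketch on a small randomly generated l1-regularized least-squares instance (as noted on the slide, freezing t_k ≡ 1 in the same loop recovers plain prox-grad/ISTA):

```python
import numpy as np

def soft(x, a):
    """Componentwise shrinkage: the prox of a*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def fista(A, b, lam, iters=300):
    """min ||Ax - b||^2 + lam*||x||_1 by the fast prox-grad scheme above."""
    L = 2 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = 2 * A.T @ (A @ y - b)           # gradient at the point y
        x_new = soft(y - grad / L, lam / L)    # prox step: the main computation
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2            # step (*)
        y = x_new + ((t - 1) / t_new) * (x_new - x)         # step (**)
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(1)
A, b = rng.normal(size=(20, 50)), rng.normal(size=20)
x = fista(A, b, lam=5.0)
print(np.count_nonzero(x), "nonzero entries")  # l1 typically yields sparsity
```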
An O(1/k²) Global Rate of Convergence for (M)

Theorem. Let {x_k} be generated by FPG. Then for any k ≥ 1,

F(x_k) − F(x*) ≤ 2L(f)‖x_0 − x*‖² / (k + 1)².

The number of iterations to reach F(x̃) − F* ≤ ε is ~ O(1/√ε).

This improves Prox-Grad by a square-root factor, and in practice this theoretical rate is actually achieved.

Many computational studies have confirmed the efficiency of FPG for solving various optimization models in signal/image recovery and in machine learning: image denoising/deblurring, nuclear matrix norm regularization, matrix completion, multi-task learning, matrix classification, etc.

Some examples...
Back to the Application: Linear Inverse Problems

Problem: Find x ∈ C ⊂ E which "best" solves A(x) ≈ b, A : E → F, where b (the observable output) and A (a blurring matrix) are known.

Approach: via regularization models.
- g(x) is a "regularizer" (one function, or a sum of functions)
- d(b, A(x)) is some "proximity" measure from b to A(x)

min { g(x) : A(x) = b, x ∈ C }
min { g(x) : d(b, A(x)) ≤ ε, x ∈ C }
min { d(b, A(x)) : g(x) ≤ δ, x ∈ C }
min { d(b, A(x)) + λ g(x) : x ∈ C }   (λ > 0)

Nonsmooth regularizers are particularly useful, the l1 norm being a "starring" one, as it promotes sparsity. The squared l2 proximity ‖b − A(x)‖² will be the smooth part.
An Example: l1 Regularization

min_x { ‖Ax − b‖² + λ‖x‖₁ } ≡ min_x { f(x) + g(x) }

The proximal map of g(x) = λ‖x‖₁ is simply

prox_{tg}(y) = argmin_u { (1/2t)‖u − y‖² + λ‖u‖₁ } = T_{λt}(y),

where T_α : Rⁿ → Rⁿ is the shrinkage or soft-threshold operator:

T_α(x)_i = (|x_i| − α)₊ sgn(x_i).   (1)

Prox-Grad method = ISTA, the Iterative Shrinkage/Thresholding Algorithm.
Other names in the signal-processing literature include, for example: the thresholded Landweber method, iterative denoising, deconvolution algorithms... [Chambolle (98); Figueiredo-Nowak (03, 05); Daubechies et al. (04), Elad et al. (06), Hale et al. (07)...]

Fast Prox-Grad = FISTA.
A Numerical Example: l1 Image Deblurring

min_x { ‖Ax − b‖² + λ‖x‖₁ }

Comparing ISTA versus FISTA on problems where:
- the dimension d is, e.g., d = 256 × 256 = 65,536 and/or 512 × 512 = 262,144;
- the d × d matrix A is dense (Gaussian blurring times the inverse of a two-stage Haar wavelet transform);
- all problems are solved with fixed λ and Gaussian noise.
Example l1 Deblurring of the Cameraman
[Figures: original | blurred and noisy]
1000 Iterations of ISTA versus 200 of FISTA
[Figures: ISTA, 1000 iterations | FISTA, 200 iterations]
Original Versus Deblurring via FISTA
[Figures: original | FISTA, 1000 iterations]
Function Value Errors F(x_k) − F(x*)

[Semilog plot: F(x_k) − F(x*) versus iteration count (0 to 10,000), y-axis from 10⁻⁸ to 10¹, comparing ISTA and FISTA]
Example 2: l1 versus TV Regularization

The main difference between l1 and TV regularization:
- prox of l1: simple and explicit (shrinkage/soft threshold).
- prox of TV: the TV-denoising subproblem requires an iterative method. With g = TV:

x_{k+1} = D( x_k − (2/L)Aᵀ(Ax_k − b), 2λ/L ),

where D(w, t) = argmin_x { ‖x − w‖² + 2t TV(x) }.

Here the prox operation is TV-based denoising. It has no analytic expression, yet it can still be solved very efficiently through a smooth dual formulation attacked by a fast gradient method [Beck, Teboulle, 2009], which can be seen as an acceleration of [Chambolle 04, 05].

Important to note: the fast proximal gradient method is implementable if the prox operation can be computed efficiently. It is not a method for solving general nonsmooth convex problems. More on this soon...
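In 1-D the dual approach is simple enough to sketch in full (my own illustration, in the spirit of, but not identical to, the cited dual methods): for D(w, t) = argmin_x ‖x − w‖² + 2t TV(x), equivalently argmin_x ½‖x − w‖² + t Σ|x_i − x_{i+1}|, the dual is a smooth quadratic over a box, solvable by plain projected gradient.

```python
import numpy as np

def tv_denoise_1d(w, lam, iters=2000):
    """Compute D(w, lam) = argmin_x 0.5*||x - w||^2 + lam*sum_i |x_i - x_{i+1}|
    by projected gradient on the dual: x = w - D^T p, p in [-lam, lam]^{n-1}."""
    D = lambda x: x[:-1] - x[1:]                  # forward difference operator
    Dt = lambda p: np.concatenate([[p[0]], p[1:] - p[:-1], [-p[-1]]])
    p = np.zeros(len(w) - 1)
    step = 0.25                                   # 1/||D D^T||: ||D D^T|| <= 4
    for _ in range(iters):
        p = np.clip(p + step * D(w - Dt(p)), -lam, lam)   # gradient + projection
    return w - Dt(p)

# A noisy step signal: TV denoising flattens the noise but keeps the jump.
rng = np.random.default_rng(2)
w = np.concatenate([np.zeros(30), np.ones(30)]) + 0.05 * rng.normal(size=60)
x = tv_denoise_1d(w, lam=0.5)
print(x[:3], x[-3:])  # two nearly constant plateaus
```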
MFISTA: Monotone FISTA

FISTA is not a monotone method. This is problematic when the prox is not computed exactly.

Input: L ≥ L(f), an upper bound on the Lipschitz constant of ∇f.
Step 0. Take y_1 = x_0 ∈ E, t_1 = 1.
Step k. (k ≥ 1) Compute

z_k = prox_{(1/L)g}( y_k − (1/L)∇f(y_k) ),
t_{k+1} = (1 + √(1 + 4t_k²)) / 2,
x_k = argmin { F(x) : x ∈ {z_k, x_{k−1}} },
y_{k+1} = x_k + (t_k/t_{k+1})(z_k − x_k) + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}).

With the same rate of convergence as FPG!
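The MFISTA steps above can be sketched in NumPy; the test objective F(x) = ½||Ax − b||² + λ||x||₁ and its soft-threshold prox are illustrative assumptions, not from the talk:

```python
import numpy as np

def mfista(A, b, lam, n_iter=100):
    """MFISTA sketch for min F(x) = 0.5||Ax - b||^2 + lam*||x||_1 (assumed
    test problem). The monotone step keeps whichever of z^k, x^{k-1} has
    the smaller objective value F."""
    L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of grad f
    prox = lambda w, t: np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
    x = np.zeros(A.shape[1]); y = x.copy(); t = 1.0
    for _ in range(n_iter):
        z = prox(y - A.T @ (A @ y - b) / L, lam / L)   # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        x_prev = x
        x = z if F(z) <= F(x_prev) else x_prev          # monotone choice of x^k
        y = x + (t / t_next) * (z - x) + ((t - 1.0) / t_next) * (x - x_prev)
        t = t_next
    return x
```

Dropping the monotone choice (always taking x = z) recovers plain FISTA.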
Lena and 3 Reconstructions – N = 100 Iterations

[Images: Blurred and Noisy; ISTA (F_100 = 0.606); MFISTA (F_100 = 0.466)]
Extension: Gradient Schemes with Non-Euclidean Distances

- All previous schemes were based on the squared Euclidean distance for measuring proximity of two points in E.
- It is useful to exploit the geometry of the constraint set X.
- This is done by selecting a "distance-like" function.

Typical example: Bregman-type distances, based on a kernel ψ:

D_ψ(x, y) = ψ(x) − ψ(y) − ⟨x − y, ∇ψ(y)⟩,  ψ strongly convex.

The advantage of a non-Euclidean distance that adequately exploits the constraints is that it allows one to:
1. Simplify the prox computation for the given constraint set.
2. Often improve the constant in the complexity bound.

Mirror descent algorithms, extragradient-like, Lagrangians, smoothing... Nemirovsky-Yudin (80), Teboulle (92), Beck-Teboulle (03), Nemirovsky (04), Nesterov (05), Auslender-Teboulle (05)...
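As a concrete instance of the above (a standard example, assumed here rather than taken from the slides): with the entropy kernel ψ(x) = Σ_i x_i log x_i on the unit simplex, the Bregman prox-gradient (mirror descent) step has a closed form, a multiplicative update followed by normalization, so the simplex constraint needs no explicit projection:

```python
import numpy as np

def entropy_mirror_step(x, grad, t):
    """One mirror-descent step on the unit simplex with the entropy kernel
    psi(x) = sum x_i log x_i: argmin_u { t<grad, u> + D_psi(u, x) } over the
    simplex reduces to a multiplicative update plus normalization."""
    z = x * np.exp(-t * grad)
    return z / z.sum()
```

This is exactly the kind of prox simplification item 1 above refers to: the Euclidean projection onto the simplex would require a small algorithm of its own.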
Convex Nonsmooth Composite: Lagrangian-Based Methods
Nonsmooth Convex with Separable Objective

(P)  p* = inf{ φ(x) ≡ f(x) + g(Ax) : x ∈ R^n },

where f, g are both nonsmooth and A : R^n → R^m is a given linear map.

Problem (P) is equivalent (via the standard variable-splitting trick) to

(P)  p* = inf{ f(x) + g(z) : Ax = z, x ∈ R^n, z ∈ R^m }.

Rockafellar ('76) showed that the Proximal Point Algorithm can be applied to the dual and primal-dual formulations of (P) to produce:
- The Method of Multipliers (Augmented Lagrangian Method).
- The Proximal Method of Multipliers (PMM).
- Largely ignored over the last 20 years... Recent strong revival in sparse optimization: compressive sensing, image processing, etc.
- A very nice recent survey with many machine learning applications: [Boyd et al. 2011]
The PMM – Rockafellar (76)

PMM – The Proximal Method of Multipliers. Generate (x^k, z^k) and dual multiplier y^k via

(x^{k+1}, z^{k+1}) ∈ argmin_{x,z} { f(x) + g(z) + ⟨y^k, Ax − z⟩ + (c/2)||Ax − z||² + q_k(x, z) },
y^{k+1} = y^k + c(Ax^{k+1} − z^{k+1}).

The Augmented Lagrangian:

L_c(x, z, y) := f(x) + g(z) + ⟨y, Ax − z⟩ [the Lagrangian] + (c/2)||Ax − z||²,  (c > 0).

q_k(x, z) := (1/2)( ||x − x^k||²_{M₁} + ||z − z^k||²_{M₂} ) is the additional primal proximal term.
The choice of M₁ ∈ S^n_+, M₂ ∈ S^m_+ is used to conveniently describe/analyze several variants of the PMM.
M₁ = M₂ ≡ 0 recovers the Method of Multipliers (PPA on the dual).
Proximal Method of Multipliers – Key Difficulty

Main computational step in PMM: minimize w.r.t. (x, z) the proximal Augmented Lagrangian

f(x) + g(z) + ⟨y^k, Ax − z⟩ + (c/2)||Ax − z||² + q_k(x, z).

- The quadratic coupling term ||Ax − z||² destroys the separability between x and z, preventing separate minimization in (x, z).
- In many applications, separate minimization is often much easier...

Strategies to overcome this difficulty:
- Approximate Minimization – linearize the quadratic term ||Ax − z||² w.r.t. (x, z).
- Alternating Minimization – à la "Gauss-Seidel" in (x, z).
- A mixture of the above – partial linearization with respect to one variable, combined with alternating minimization over the other.
These result in various interesting schemes; e.g., the last one can be shown to recover the recent efficient primal-dual method of [Chambolle-Pock, 2010].
A Prototype: Alternating Direction Proximal Method of Multipliers

Eliminate the (x, z) coupling via alternating minimization steps.

Glowinski-Marocco (75), Gabay-Mercier (76), Fortin-Glowinski (83), Eckstein-Bertsekas (91)... the so-called Alternating Direction Method of Multipliers (ADM), based on the Method of Multipliers, i.e., M₁ = M₂ ≡ 0.

(AD-PMM) Alternating Direction Proximal Method of Multipliers
1. Start with any (x^0, z^0, y^0) ∈ R^n × R^m × R^m and c > 0.
2. For k = 0, 1, ... generate the sequence {x^k, z^k, y^k} as follows:

x^{k+1} ∈ argmin_x { f(x) + (c/2)||Ax − z^k + c^{−1}y^k||² + (1/2)||x − x^k||²_{M₁} },

z^{k+1} = argmin_z { g(z) + (c/2)||Ax^{k+1} − z + c^{−1}y^k||² + (1/2)||z − z^k||²_{M₂} },

y^{k+1} = y^k + c(Ax^{k+1} − z^{k+1}).
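The AD-PMM iteration with M₁ = M₂ ≡ 0 (plain ADM) can be sketched on an assumed lasso instance, f(x) = ½||Bx − b||², g = λ||·||₁, A = I; the problem choice and parameters are illustrative:

```python
import numpy as np

def admm_lasso(B, b, lam, c=1.0, n_iter=200):
    """AD-PMM with M1 = M2 = 0 (plain ADM) for
    min 0.5||Bx - b||^2 + lam*||z||_1  s.t.  x = z  (A = I above)."""
    n = B.shape[1]
    x = np.zeros(n); z = np.zeros(n); y = np.zeros(n)
    M = B.T @ B + c * np.eye(n)       # x-step is a strongly convex quadratic
    Btb = B.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(M, Btb + c * z - y)                 # x-minimization
        w = x + y / c
        z = np.sign(w) * np.maximum(np.abs(w) - lam / c, 0.0)   # z-step: soft threshold
        y = y + c * (x - z)                                     # multiplier update
    return z
```

Note that the two minimizations decouple exactly as intended: the x-step sees only f plus a quadratic, and the z-step is a plain soft threshold.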
Global Rate of Convergence Results

For AD-PMM and many other variants [Shefi-Teboulle (2013)].

Let (x*, z*, y*) be a saddle point for the Lagrangian l associated with (P). Then for all N ≥ 1,

l(x_N, z_N, y) − p* ≤ C²(x*, z*, y)/N,  ∀y ∈ R^m.

In particular:

f(x_N) + g(z_N) + r||Ax_N − z_N|| − p* ≤ c₁/N,

Residual norm: ||Ax_N − z_N|| ≤ c₂/√N,

Original primal objective (for g Lipschitz continuous): φ(x_N) − φ(x*) ≤ c₃/N,

where the c_i are positive constants (as usual, in terms of distances to an optimal solution).

Here, for any sequence {x^k, z^k, y^k} and any N ≥ 1, the ergodic (averaged) sequences {x_N, z_N, y_N} are

x_N := (1/N) Σ_{k=0}^{N−1} x^{k+1},  z_N := (1/N) Σ_{k=0}^{N−1} z^{k+1},  y_N := (1/N) Σ_{k=0}^{N−1} y^{k+1}.
Non-Convex Smooth Models
Sparse PCA

Principal Component Analysis solves

max{ xᵀAx : ||x||₂ = 1, x ∈ R^n },  (A ⪰ 0)

while Sparse Principal Component Analysis solves

max{ xᵀAx : ||x||₂ = 1, ||x||₀ ≤ k, x ∈ R^n },  k ∈ (1, n] the sparsity level.

||x||₀ counts the number of nonzero entries of x.

Issues:
1. Maximizing a convex objective.
2. Hard nonconvex constraint ||x||₀ ≤ k.

Possible approaches:
1. SDP convex relaxations [D'Aspremont et al. 2008]
2. Approximation/modified formulations: many proposed approaches
Sparse PCA via Penalization/Relaxation/Approximation

The problem of interest is the difficult sparse PCA problem as is:

max{ xᵀAx : ||x||₂ = 1, ||x||₀ ≤ k, x ∈ R^n }.

The literature has focused on solving various modifications:
- l0-penalized PCA:  max{ xᵀAx − s||x||₀ : ||x||₂ = 1 },  s > 0
- Relaxed l1-constrained PCA:  max{ xᵀAx : ||x||₂ = 1, ||x||₁ ≤ √k }
- Relaxed l1-penalized PCA:  max{ xᵀAx − s||x||₁ : ||x||₂ = 1 }
- Approximation-penalized:  max{ xᵀAx − s g_p(|x|) : ||x||₂ = 1 },  g_p(x) ≈ ||x||₀
- SDP convex relaxations:  max{ tr(AX) : tr(X) = 1, X ⪰ 0, ||X||₁ ≤ k }

- SDP relaxations are often too computationally expensive for large problems.
- No algorithm gives bounds on the optimal solution of the original problem.
- Even when "simple", these algorithms are for modifications:
  | they do not solve the original problem of interest;
  | they require an unknown penalty parameter s to be tuned.
Quick Highlight of Simple Algorithms for "Modified Problems"

(Below, S_w(a) := (|a| − w)_+ sgn(a) denotes entrywise soft thresholding, applied with the indicated weights and followed by l2-normalization.)

Type: l1-constrained. Iteration: x^{j+1} = S_{λ_j}(Ax^j) / ||S_{λ_j}(Ax^j)||₂. Per-iteration complexity: O(n²), O(mn). Reference: Witten et al. (2009).

Type: l1-constrained. Iteration: x^{j+1} = S_{s_j}(Ax^j) / ||S_{s_j}(Ax^j)||₂, where s_j is the (k+1)-largest entry of |Ax^j|. Per-iteration complexity: O(n²), O(mn). Reference: Sigg-Buhmann (2008).

Type: l0-penalized. Iteration: z^{j+1} = v / ||v||₂ with v = Σ_i [sgn((b_iᵀz^j)² − s)]_+ (b_iᵀz^j) b_i. Per-iteration complexity: O(mn). References: Shen-Huang (2008), Journée et al. (2010).

Type: l0-penalized. Iteration: x^{j+1} = S_{w^j}(2Ax^j) / ||S_{w^j}(2Ax^j)||₂ with weights w^j_i = s g′_p(|x^j_i|). Per-iteration complexity: O(n²). Reference: Sriperumbudur et al. (2010).

Type: l1-penalized. Iteration: y^{j+1} = argmin_y { Σ_i ||b_i − x^j yᵀb_i||₂² + λ||y||₂² + s||y||₁ }, then x^{j+1} = (Σ_i b_i b_iᵀ) y^{j+1} / ||(Σ_i b_i b_iᵀ) y^{j+1}||₂. Reference: Zou et al. (2006).

Type: l1-penalized. Iteration: z^{j+1} = u / ||u||₂ with u = Σ_i S_s(b_iᵀz^j) b_i. Per-iteration complexity: O(mn). References: Shen-Huang (2008), Journée et al. (2010).
A Plethora of Models/Algorithms Revisited – [Luss-Teboulle (2013)]

All the previously listed algorithms were derived from disparate approaches/motivations to solve modifications of SPCA: Expectation-Maximization; Majorization-Minimization techniques; DC programming; alternating minimization, etc.

1. Are all these algorithms different? Any connection?
2. Is it possible to tackle the difficult sparse PCA problem "as is"?

Very recently we have shown that:
- All the previously listed algorithms are particular realizations of a "father algorithm": ConGradU (based on the well-known Conditional Gradient Algorithm).
- ConGradU CAN be applied directly to the original problem!
Maximizing a Convex Function over a Compact Nonconvex Set

The classic Conditional Gradient Algorithm [Frank-Wolfe '56, Polyak '63, Dunn '79...] solves max{F(x) : x ∈ C}, with F ∈ C¹ and C convex compact:

x^0 ∈ C,  p^j = argmax{ ⟨x − x^j, ∇F(x^j)⟩ : x ∈ C },
x^{j+1} = x^j + α_j(p^j − x^j),  α_j ∈ (0, 1] a stepsize.

Here: F is convex, possibly nonsmooth; C is compact but nonconvex.
The idea goes back to Mangasarian (96), developed for C a polyhedral set.

ConGradU – Conditional Gradient with Unit Step Size

x^0 ∈ C,  x^{j+1} ∈ argmax{ ⟨x − x^j, F′(x^j)⟩ : x ∈ C }.

Notes:
1. F is not assumed to be differentiable; F′(x) is a subgradient of F at x.
2. Useful when max{ ⟨x − x^j, F′(x^j)⟩ : x ∈ C } is easy to solve.
Solving the Original l0-constrained PCA via ConGradU

Applying ConGradU directly to max{ xᵀAx : ||x||₂ = 1, ||x||₀ ≤ k, x ∈ R^n } results in

x^{j+1} = argmax{ (x^j)ᵀAx : ||x||₂ = 1, ||x||₀ ≤ k } = T_k(Ax^j) / ||T_k(Ax^j)||₂,

where T_k(a) := argmin_x { ||x − a||₂² : ||x||₀ ≤ k }.

Despite the hard constraint, T_k is easy to compute: (T_k(a))_i = a_i for the k largest entries of a (in absolute value), and (T_k(a))_i = 0 otherwise.

Convergence: every limit point of {x^j} is a stationary point.
Complexity: O(kn) or O(mn).

Thus the original problem can be solved using ConGradU with the same complexity as when applied to the modifications!
Penalized/modified problems require tuning an unknown tradeoff penalty parameter; this can be very computationally expensive and is not needed here.
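The T_k-based iteration above is only a few lines of NumPy; the initialization used below is an illustrative assumption, not from the talk:

```python
import numpy as np

def hard_threshold(a, k):
    """T_k: keep the k largest-magnitude entries of a, zero out the rest."""
    out = np.zeros_like(a)
    idx = np.argsort(np.abs(a))[-k:]
    out[idx] = a[idx]
    return out

def congradu_spca(A, k, n_iter=100):
    """ConGradU applied directly to l0-constrained PCA, per the slide:
    x^{j+1} = T_k(A x^j) / ||T_k(A x^j)||_2.
    (The simple deterministic initialization is an assumption.)"""
    v = hard_threshold(A @ np.ones(A.shape[0]), k)
    x = v / np.linalg.norm(v)
    for _ in range(n_iter):
        v = hard_threshold(A @ x, k)
        x = v / np.linalg.norm(v)
    return x
```

As stated above, this guarantees stationarity of limit points only, so in practice one may restart from several initial points.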
ConGradU for a General Class of Problems

(G)  max_x { f(x) + g(|x|) : x ∈ C }

- f : R^n → R is convex; C ⊆ R^n is a compact set.
- g : R^n_+ → R is convex, differentiable, and monotone decreasing.

Particularly useful for handling approximate l0-penalized problems.

ConGradU applied to (G) produces the following simple weighted l1-norm maximization problem:

x^0 ∈ C,  x^{j+1} = argmax{ ⟨a^j, x⟩ − Σ_i w^j_i |x_i| : x ∈ C },  j = 0, 1, ...,

where w^j := −g′(|x^j|) > 0 and a^j := f′(x^j) ∈ R^n.

For penalized/approximately penalized SPCA, C is the unit ball, and the above admits a closed-form solution:

x^{j+1} = S_{w^j}(f′(x^j)) / ||S_{w^j}(f′(x^j))||,  j = 0, 1, ...;  S_w(a) := (|a| − w)_+ sgn(a)  (soft threshold).
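The closed-form step above (weighted soft threshold, then normalize) can be sketched as follows; the helper name is hypothetical:

```python
import numpy as np

def congradu_weighted_l1_step(fprime_x, w):
    """One ConGradU step for (G) with C the Euclidean unit ball:
    x^{j+1} = S_w(f'(x^j)) / ||S_w(f'(x^j))||_2, with S_w the entrywise
    soft threshold. fprime_x = f'(x^j), w = -g'(|x^j|) > 0."""
    s = np.sign(fprime_x) * np.maximum(np.abs(fprime_x) - w, 0.0)
    nrm = np.linalg.norm(s)
    return s / nrm if nrm > 0 else s
```

For l0-penalized SPCA one would plug in fprime_x = 2Ax^j and the weights coming from the chosen approximation g_p.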
Non-Convex and Nonsmooth
A Nonsmooth Nonconvex Optimization Model

(M)  minimize_{x,y}  Ψ(x, y) := f(x) + g(y) + H(x, y)

Assumption
(i) f : R^n → (−∞, +∞] and g : R^m → (−∞, +∞] are proper and lsc functions.
(ii) H : R^n × R^m → R is a C¹ function.
(iii) The partial gradients of H are Lipschitz continuous: H(·, y) ∈ C^{1,1}_{L₁(y)} and H(x, ·) ∈ C^{1,1}_{L₂(x)}.

- NO convexity is assumed in the objective or the constraints (built in through f and g, which are extended-valued).
- The choice of two blocks of variables is only for simplicity of exposition.
- The optimization model (M) covers many applications: signal/image processing, machine learning, etc. Vast literature.
Goal

- Derive a simple scheme.
- Prove that the whole sequence {z^k}_{k∈N} := {(x^k, y^k)} converges to a critical point of Ψ.
- Exploit partial smoothness: blend alternating minimization with proximal gradient steps.
- Exploit further data information: build a general recipe and algorithmic framework to prove convergence for a broad class of nonconvex nonsmooth problems.

J. Bolte, S. Sabach, M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, Series A. Just published online.
The Algorithm: Proximal Alternating Linearized Minimization (PALM)

1. Initialization: start with any (x^0, y^0) ∈ R^n × R^m.
2. For each k = 0, 1, ... generate a sequence {(x^k, y^k)}_{k∈N}:

2.1. Take γ₁ > 1, set c_k = γ₁L₁(y^k), and compute

x^{k+1} ∈ prox^f_{c_k}( x^k − (1/c_k)∇_x H(x^k, y^k) ).

2.2. Take γ₂ > 1, set d_k = γ₂L₂(x^{k+1}), and compute

y^{k+1} ∈ prox^g_{d_k}( y^k − (1/d_k)∇_y H(x^{k+1}, y^k) ).

Main computational step: the prox of a "nonconvex" function.
An interesting example will be given later. (More in our paper...)
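An example of an explicitly computable nonconvex prox of the kind PALM needs (a standard one, assumed here rather than taken from the talk) is the prox of t||·||₀, which reduces to an entrywise hard threshold:

```python
import numpy as np

def prox_l0(w, t):
    """Prox of t*||.||_0: minimize t*||x||_0 + 0.5*||x - w||^2 entrywise.
    Keeping w_i costs t, zeroing it costs 0.5*w_i^2, so keep w_i exactly
    when |w_i| > sqrt(2t) (a nonconvex but closed-form prox)."""
    return np.where(np.abs(w) > np.sqrt(2.0 * t), w, 0.0)
```

Despite the nonconvexity of ||·||₀, each coordinate's subproblem is a two-way comparison, which is what makes PALM practical for sparsity-constrained models.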
Convergence of PALM: If the Data [f, g, H] is Semi-Algebraic

Theorem (Bolte-Sabach-Teboulle (2013)). Let {z^k}_{k∈N} be a sequence generated by PALM. The following assertions hold.
(i) The sequence {z^k}_{k∈N} has finite length, that is,

Σ_{k=1}^∞ ||z^{k+1} − z^k|| < ∞.

(ii) The sequence {z^k}_{k∈N} converges to a critical point z* = (x*, y*) of Ψ.

- Are there many semi-algebraic functions?
- How do we prove this convergence result?
An Informal General Convergence Proof Recipe

Let Ψ : R^N → (−∞, +∞] be a proper, lsc function, bounded from below.

(P)  inf{ Ψ(z) : z ∈ R^N }.

Suppose A is a generic algorithm which generates a sequence {z^k}_{k∈N} via

z^0 ∈ R^N,  z^{k+1} ∈ A(z^k),  k = 0, 1, ....

Goal: prove that the whole sequence {z^k}_{k∈N} converges to a critical point of Ψ.
The Recipe

Basically, the "recipe" consists of three main steps.

(i) Sufficient decrease property: find a positive constant ρ₁ such that

ρ₁||z^{k+1} − z^k||² ≤ Ψ(z^k) − Ψ(z^{k+1}),  ∀k = 0, 1, ....

(ii) A subgradient lower bound for the iterates gap: assume that {z^k}_{k∈N} is bounded; find another positive constant ρ₂ such that

||w^k|| ≤ ρ₂||z^k − z^{k−1}||,  w^k ∈ ∂Ψ(z^k),  ∀k = 0, 1, ....

These two steps are typical for any descent-type algorithm, but lead ONLY to convergence of limit points.

Does the whole sequence {z^k}_{k∈N} converge to a critical point of Problem (M)?
The Third Main Step of our Recipe

(iii) The Kurdyka-Łojasiewicz property: assume that Ψ is a KL function. Use this property to prove that the generated sequence {z^k}_{k∈N} is a Cauchy sequence, and thus converges!

We stress that this general recipe:
- Singles out the 3 main ingredients at play to analyze many other optimization algorithms in the nonconvex and nonsmooth setting.
- KL is the key to proving that the sequence generated by algorithm A is Cauchy... a rare event for optimization schemes!

The remaining questions:
- What is a KL function? [Łojasiewicz (68), Kurdyka (98), Bolte et al. (06, 07, 10)]
- Are there many KL functions?
Are there Many Functions Satisfying KL?

YES! Semi-Algebraic Functions

Theorem (Bolte-Daniilidis-Lewis (2006)). Let σ : R^d → (−∞, +∞] be a proper and lsc function. If σ is semi-algebraic, then it satisfies the KL property at any point of dom σ.

Recall: semi-algebraic sets and functions.
(i) A semi-algebraic subset of R^d is a finite union of sets

{ x ∈ R^d : p_i(x) = 0, q_j(x) < 0, i ∈ I, j ∈ J },

where p_i, q_j : R^d → R are real polynomial functions and I, J are finite.
(ii) A function σ is semi-algebraic if its graph is a semi-algebraic set.
There is a Wealth of Semi-Algebraic Functions!

Operations preserving the semi-algebraic property:
- Finite sums and products of semi-algebraic functions.
- Composition of semi-algebraic functions.
- Sup/inf-type functions, e.g., sup{ g(u, v) : v ∈ C } is semi-algebraic when g is a semi-algebraic function and C a semi-algebraic set.

Some semi-algebraic sets/functions "starring" in optimization:
- Real polynomial functions.
- Indicator functions of semi-algebraic sets.
- In matrix theory: the cone of PSD matrices, constant-rank matrices, Stiefel manifolds...
- The function x → dist(x, S)² is semi-algebraic whenever S is a nonempty semi-algebraic subset of R^n.
- ||·||₀ is semi-algebraic.
- ||·||_p is semi-algebraic whenever p > 0 is rational.
Application to a Broad Class of Matrix Factorization Problems

Given A ∈ R^{m×n} and r ≪ min{m, n}, find X ∈ R^{m×r} and Y ∈ R^{r×n} such that

A ≈ XY,  X ∈ K_{m,r} ∩ F,  Y ∈ K_{r,n} ∩ G,

where

K_{p,q} = { M ∈ R^{p×q} : M ≥ 0 },
F = { X ∈ R^{m×r} : R₁(X) ≤ α },
G = { Y ∈ R^{r×n} : R₂(Y) ≤ β }.

Here R₁ and R₂ are lsc functions and α, β ∈ R₊ are given parameters. R₁ (R₂) is often used to describe additional features of X (Y).

(MF) covers a very large number of problems in applications...
The Optimization Approach

We adopt the constrained nonconvex nonsmooth formulation:

(MF)  min{ d(A, XY) : X ∈ K_{m,r} ∩ F, Y ∈ K_{r,n} ∩ G },

where d : R^{m×n} × R^{m×n} → R₊ is a proximity function measuring the quality of the approximation, with d(U, V) = 0 if and only if U = V.

Another way: the penalized version. The "hard" constraints are candidates to be penalized, with μ₁, μ₂ > 0 penalty parameters:

(P-MF)  min{ μ₁R₁(X) + μ₂R₂(Y) + d(A, XY) : X ∈ K_{m,r}, Y ∈ K_{r,n} }.

Note: the penalty approach requires tuning the unknown penalty parameters, which might be a difficult issue.

Both formulations fit our general nonsmooth nonconvex model (M) with obvious identifications for H, f, g. We now illustrate on two important models with semi-algebraic data.
Model I – Nonnegative Matrix Factorization Problems

Let the proximity measure be defined via the Frobenius norm:

  d(A, XY) := H(X, Y) = ½‖A − XY‖²_F,  with  F ≡ R^{m×r}, G ≡ R^{r×n}.

The problem (MF) then reduces to the so-called Nonnegative Matrix Factorization (NMF) problem

  min { ½‖A − XY‖²_F : X ≥ 0, Y ≥ 0 }.

- H is a real polynomial function, hence semi-algebraic.
- X → H(X, Y) (for fixed Y) and Y → H(X, Y) (for fixed X) are C^{1,1}, with L₁(Y) ≡ ‖YY^T‖_F and L₂(X) ≡ ‖X^T X‖_F.
- H is C² on bounded subsets.

Thus we can PALM it! The two computational steps reduce to projection onto the nonnegative cone of matrices (trivial!):

  P₊(U) := argmin { ‖U − V‖²_F : V ∈ R^{m×n}, V ≥ 0 } = max{0, U}.
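For plain NMF, the two PALM steps reduce to alternating projected-gradient updates. The following Python sketch is illustrative only (the function name `palm_nmf`, the default parameters, and the small constant guarding against division by zero are assumptions of this illustration, not part of the talk):

```python
import numpy as np

def palm_nmf(A, r, iters=500, gamma=1.1, seed=0):
    """Illustrative PALM iteration for plain NMF: alternating gradient
    steps on H(X, Y) = 0.5*||A - X Y||_F^2 with step sizes taken from
    the Lipschitz moduli ||Y Y^T||_F and ||X^T X||_F, each step
    followed by the nonnegative projection P_+(U) = max(0, U)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.random((m, r))
    Y = rng.random((r, n))
    for _ in range(iters):
        c = gamma * np.linalg.norm(Y @ Y.T, 'fro') + 1e-12  # c_k = gamma*||Y Y^T||_F
        X = np.maximum(0.0, X - (X @ Y - A) @ Y.T / c)      # gradient step, then P_+
        d = gamma * np.linalg.norm(X.T @ X, 'fro') + 1e-12  # d_k = gamma*||X^T X||_F
        Y = np.maximum(0.0, Y - X.T @ (X @ Y - A) / d)
    return X, Y
```

Each iteration costs only a few matrix products plus the entrywise `max(0, ·)`, which is exactly why the scheme is "cheap".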
Model II - Sparse Constraints in NMF Problems

Now consider in NMF the overall sparsity measure of a matrix defined by

  R₁(X) = ‖X‖₀ := Σᵢ ‖xᵢ‖₀  (xᵢ the columns of X);  R₂(Y) = ‖Y‖₀.

To apply PALM, all we need is to compute the prox of f := δ_{X≥0} + δ_{‖X‖₀ ≤ s}. It turns out that this can be done simply!

Proposition (Proximal map formula for f). Let U ∈ R^{m×n}. Then

  prox^f_1(U) = argmin { ½‖X − U‖²_F : X ≥ 0, ‖X‖₀ ≤ s } = T_s(P₊(U)),

where

  T_s(U) := argmin_{V ∈ R^{m×n}} { ‖U − V‖²_F : ‖V‖₀ ≤ s }.

Computing T_s simply requires determining the s-th largest of the mn entries, which can be done in O(mn) time, and then zeroing out the proper entries in one more pass over the mn numbers.
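The linear-time computation of T_s just described can be sketched in Python. This is a minimal illustration (the helper name `hard_threshold` and the use of `np.argpartition` for the linear-time selection are assumptions of this sketch; any selection algorithm would do):

```python
import numpy as np

def hard_threshold(U, s):
    """T_s(U): keep the s entries of U with largest absolute value and
    zero out all others (ties are broken arbitrarily, as in any argmin).
    np.argpartition performs the selection in O(mn) time, matching the
    cost stated above."""
    flat = U.ravel()
    s = min(s, flat.size)
    V = np.zeros_like(U)
    if s > 0:
        # indices of the s largest-magnitude entries, found in linear time
        idx = np.argpartition(np.abs(flat), flat.size - s)[flat.size - s:]
        V.ravel()[idx] = flat[idx]
    return V
```

In the PALM steps below, T_s is applied after P₊, so its argument is already nonnegative and the absolute values are just the entries themselves.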
PALM for Sparse NMF

1. Initialization: Select random nonnegative X⁰ ∈ R^{m×r} and Y⁰ ∈ R^{r×n}.
2. For each k = 0, 1, ... generate a sequence {(Xᵏ, Yᵏ)}_{k∈N}:

  2.1. Take γ₁ > 1, set cₖ = γ₁‖Yᵏ(Yᵏ)^T‖_F and compute

    Uᵏ = Xᵏ − (1/cₖ)(XᵏYᵏ − A)(Yᵏ)^T;  Xᵏ⁺¹ ∈ prox^{R₁}_{cₖ}(Uᵏ) = T_α(P₊(Uᵏ)).

  2.2. Take γ₂ > 1, set dₖ = γ₂‖(Xᵏ⁺¹)^T Xᵏ⁺¹‖_F and compute

    Vᵏ = Yᵏ − (1/dₖ)(Xᵏ⁺¹)^T(Xᵏ⁺¹Yᵏ − A);  Yᵏ⁺¹ ∈ prox^{R₂}_{dₖ}(Vᵏ) = T_β(P₊(Vᵏ)).

Applying our main Theorem we get the global convergence result:

Let {(Xᵏ, Yᵏ)}_{k∈N} be a sequence generated by PALM-Sparse NMF. If inf_{k∈N} { ‖Xᵏ‖_F, ‖Yᵏ‖_F } > 0, then {(Xᵏ, Yᵏ)}_{k∈N} converges to a critical point (X*, Y*) of the Sparse NMF problem.
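The iteration above can be transcribed almost line by line into Python. The sketch below is illustrative only (the function names, default parameters, integer sparsity arguments standing in for α and β, and the small constant guarding against division by zero are assumptions of this illustration):

```python
import numpy as np

def t_s(U, s):
    """T_s: keep the s largest-magnitude entries of U, zero the rest."""
    flat = U.ravel()
    V = np.zeros_like(U)
    idx = np.argpartition(np.abs(flat), flat.size - s)[flat.size - s:]
    V.ravel()[idx] = flat[idx]
    return V

def palm_sparse_nmf(A, r, s_x, s_y, gamma1=1.1, gamma2=1.1, iters=300, seed=0):
    """Illustrative transcription of steps 2.1-2.2: a gradient step on
    each block with c_k = gamma1*||Y^k (Y^k)^T||_F (resp. d_k =
    gamma2*||(X^{k+1})^T X^{k+1}||_F), followed by the prox
    T_s(P_+(.)): nonnegative projection, then hard thresholding."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.random((m, r))
    Y = rng.random((r, n))
    for _ in range(iters):
        c = gamma1 * np.linalg.norm(Y @ Y.T, 'fro') + 1e-12
        U = X - (X @ Y - A) @ Y.T / c
        X = t_s(np.maximum(0.0, U), s_x)   # X^{k+1} = T_alpha(P_+(U^k))
        d = gamma2 * np.linalg.norm(X.T @ X, 'fro') + 1e-12
        V = Y - X.T @ (X @ Y - A) / d
        Y = t_s(np.maximum(0.0, V), s_y)   # Y^{k+1} = T_beta(P_+(V^k))
    return X, Y
```

Note how the nonconvex, nonsmooth constraints enter only through the cheap prox T_s ∘ P₊; the per-iteration cost is the same handful of matrix products as in plain NMF.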
Concluding Remarks on FOM

- Motivates new model formulations and novel/refined algorithms which exploit special structures.
- Powerful for constructing cheap iterations.
- Efficient algorithms in many applied optimization models with structure.
- Further research is needed for simple and efficient schemes that can cope with the curse of dimensionality and nonconvex/nonsmooth settings.
THANK YOU FOR YOUR ATTENTION!