Chapter 2: Optimization
Gradients, convexity, and ALS
Max Planck Institute for Informatics
DMM, summer 2017 Pauli Miettinen
Contents
• Background
• Gradient descent
• Stochastic gradient descent
• Newton’s method
• Alternating least squares
• KKT conditions
Motivation
• We can solve basic least-squares linear systems using the SVD
• But what if we have
• missing values in the data
• extra constraints for feasible solutions
• more complex optimization problems (e.g. regularizers)
• etc
Gradients, Hessians, and convexity
Derivatives and local optima
• The derivative of a function f: ℝ → ℝ, denoted f′, describes its rate of change
• If it exists
• The second derivative f′′ is the rate of change of the rate of change
f′(x) = lim_{h→0⁺} ( f(x + h) − f(x) ) / h
Derivatives and local optima
• A stationary point of differentiable f is x s.t. f’(x) = 0
• f achieves its extremes at stationary points, at points where the derivative doesn’t exist, or at infinity (Fermat’s theorem)
• Whether a stationary point is a (local) maximum or minimum can be seen from the second derivative (if it exists)
Partial derivative
• If f is multivariate (e.g. f: ℝ³ → ℝ), we can consider it as a family of functions
• E.g. f(x, y) = x² + y gives the families f_x(y) = x² + y (x held fixed) and f_y(x) = x² + y (y held fixed)
• Partial derivative w.r.t. one variable keeps other variables constant
∂f/∂x (x, y) = f_y′(x) = 2x
Gradient
• The gradient is the derivative for multivariate functions f: ℝⁿ → ℝ
• Here (and later), we assume that the derivatives exist
• The gradient is a function ∇f: ℝⁿ → ℝⁿ
• ∇f(x) points “up” in the function at point x
∇f = ( ∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ )
Gradient
[Figure: gradient illustration]
Hessian
• The Hessian is the square matrix of all second-order partial derivatives of a function f: ℝⁿ → ℝ
• As usual, we assume the derivatives exist
        ⎛ ∂²f/∂x₁²     ∂²f/∂x₁∂x₂   ⋯   ∂²f/∂x₁∂xₙ ⎞
H(f) =  ⎜ ∂²f/∂x₂∂x₁   ∂²f/∂x₂²     ⋯   ∂²f/∂x₂∂xₙ ⎟
        ⎜      ⋮             ⋮       ⋱        ⋮     ⎟
        ⎝ ∂²f/∂xₙ∂x₁   ∂²f/∂xₙ∂x₂   ⋯   ∂²f/∂xₙ²   ⎠
Jacobian matrix
• If f: ℝᵐ → ℝⁿ, then its Jacobian (matrix) is an n×m matrix of partial derivatives in the form
• The Jacobian gives the best linear approximation of f
• H(f(x)) = J(∇f(x))ᵀ
     ⎛ ∂f₁/∂x₁   ∂f₁/∂x₂   ⋯   ∂f₁/∂xₘ ⎞
J =  ⎜ ∂f₂/∂x₁   ∂f₂/∂x₂   ⋯   ∂f₂/∂xₘ ⎟
     ⎜    ⮯⋮         ⋮       ⋱      ⋮    ⎟
     ⎝ ∂fₙ/∂x₁   ∂fₙ/∂x₂   ⋯   ∂fₙ/∂xₘ ⎠
Examples

Function:             f(x, y) = x² + 2xy + y
Partial derivatives:  ∂f/∂x (x, y) = 2x + 2y,  ∂f/∂y (x, y) = 2x + 1
Gradient:             ∇f = ( 2x + 2y, 2x + 1 )
Hessian:              H(f) = ⎛ 2  2 ⎞
                             ⎝ 2  0 ⎠

Function:             f(x, y) = ( x²y, 5x + sin y )
Jacobian:             J(f) = ⎛ 2xy   x²    ⎞
                             ⎝ 5     cos y ⎠
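The gradient in the first example can be sanity-checked numerically. A small sketch (my own code, not from the slides) using central finite differences:

```python
import numpy as np

def f(x, y):
    # f(x, y) = x^2 + 2xy + y, the first example function
    return x**2 + 2*x*y + y

def grad_f(x, y):
    # gradient from the slide: (2x + 2y, 2x + 1)
    return np.array([2*x + 2*y, 2*x + 1])

# Check the gradient with central finite differences at a test point
x0, y0, h = 0.7, -0.3, 1e-5
fd = np.array([
    (f(x0 + h, y0) - f(x0 - h, y0)) / (2*h),
    (f(x0, y0 + h) - f(x0, y0 - h)) / (2*h),
])
print(np.allclose(fd, grad_f(x0, y0), atol=1e-6))  # True
```

The same check works for any of the later hand-derived gradients.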
![Page 13: Chapter 2 Optimization - Max Planck Institute for Informatics · Optimization Gradients, convexity, and ALS. DMM, summer 2017 Pauli Miettinen Contents • Background • Gradient](https://reader034.vdocuments.net/reader034/viewer/2022050718/5e16d493c457b14f4409f933/html5/thumbnails/13.jpg)
DMM, summer 2017 Pauli Miettinen
Gradient’s properties
• Linearity: ∇(αf + βg)(x) = α∇f(x) + β∇g(x)
• Product rule: ∇(fg)(x) = f(x)∇g(x) + g(x)∇f(x)
• Chain rule:
• If f: ℝⁿ → ℝ and g: ℝᵐ → ℝⁿ, then ∇(f∘g)(x) = J(g(x))ᵀ ∇f(y), where y = g(x)
• If f is as above and h: ℝ → ℝ, then ∇(h∘f)(x) = h’(f(x))∇f(x)
IMPORTANT!
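The scalar chain rule ∇(h∘f)(x) = h′(f(x))∇f(x) can be verified numerically on an illustrative pair of functions (my own choice of f and h, not from the slides):

```python
import numpy as np

# Scalar chain rule: grad(h∘f)(x) = h'(f(x)) * grad f(x).
# Illustrated with f(x) = ||x||^2 and h(t) = t^2, so (h∘f)(x) = ||x||^4.
def grad_hf(x):
    f = x @ x            # f(x) = ||x||^2, with grad f(x) = 2x
    return 2*f * (2*x)   # h'(t) = 2t evaluated at t = f(x), times grad f

# Finite-difference check of the composite gradient
x0 = np.array([0.5, -1.0, 2.0])
h = 1e-5
fd = np.array([((x0 + h*e) @ (x0 + h*e))**2 - ((x0 - h*e) @ (x0 - h*e))**2
               for e in np.eye(3)]) / (2*h)
print(np.allclose(fd, grad_hf(x0), rtol=1e-6))
```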
Convexity
• A function is convex if any line segment between two points on its graph lies above or on the graph
• For univariate f: if f′′(x) ≥ 0 for all x
• For multivariate f: if its Hessian is positive semidefinite
• I.e. zᵀHz ≥ 0 for any z
• Convex function’s local minimum is its global minimum
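The positive-semidefiniteness test can be sketched in code: a symmetric matrix satisfies zᵀHz ≥ 0 for all z iff its eigenvalues are all nonnegative (a standard equivalent test; the example reuses the Hessian from the Examples slide):

```python
import numpy as np

# A symmetric Hessian is positive semidefinite iff all its eigenvalues are >= 0;
# checking eigenvalues is one standard way to test z^T H z >= 0 for all z.
def is_psd(H, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of f(x, y) = x^2 + 2xy + y from the earlier example
H = np.array([[2.0, 2.0],
              [2.0, 0.0]])
print(is_psd(H))   # one eigenvalue is negative, so that f is not convex
```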
Preserving the convexity
• If f is convex and λ > 0, then λf is convex
• If f and g are convex, then f + g is convex
• If f is convex and g is affine (i.e. g(x) = Ax + b), then f∘g is convex (N.B. (f∘g)(x) = f(Ax + b))
• Let f(x) = (h∘g)(x) with g: ℝn → ℝ and h: ℝ → ℝ; f is convex if
• g is convex and h is nondecreasing and convex
• g is concave and h is non-increasing and convex
Gradient descent
Idea
• If f is convex, we should find its minimum by following its negative gradient
• But the negative gradient at x points towards the minimum only locally at x
• Hence, we need to descend slowly along the gradient
Example

[Figure: contour plots of an objective function showing successive gradient-descent steps, together with plots of the step-size function q(t)]
Gradient descent
• Start from a random point x₀
• At step n, update xₙ ← xₙ₋₁ − γ∇f(xₙ₋₁)
• γ is some small step size
• Often, γ depends on the iteration: xₙ ← xₙ₋₁ − γₙ∇f(xₙ₋₁)
• With suitable f and step size, this will converge to a local minimum
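The update rule above can be sketched in a few lines (the test function, step size, and iteration count below are illustrative choices, not from the slides):

```python
import numpy as np

# Minimal gradient descent following x_n <- x_{n-1} - gamma * grad f(x_{n-1})
def gradient_descent(grad_f, x0, gamma=0.1, steps=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - gamma * grad_f(x)
    return x

# Convex test function f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimum at (1, -2)
grad = lambda x: np.array([2*(x[0] - 1), 2*(x[1] + 2)])
print(gradient_descent(grad, [0.0, 0.0]))   # close to [1, -2]
```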
Example: least squares
• Given A ∈ ℝⁿˣᵐ and b ∈ ℝⁿ, find x ∈ ℝᵐ s.t. ‖Ax − b‖²/2 is minimized
• Can be solved using SVD…
• Calculate the gradient of fA,b(x) = ‖Ax − b‖²/2
• Employ the gradient descent approach
• In this case, the step size can be calculated analytically
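The analytic step size comes from exact line search: minimizing φ(γ) = f(x − γg) for the quadratic f and g = ∇f(x) = Aᵀ(Ax − b) gives γ = (gᵀg)/(gᵀAᵀAg) (a standard derivation; the data below is made up):

```python
import numpy as np

# For f(x) = ||Ax - b||^2 / 2, exact line search along -g with g = A^T(Ax - b):
# setting phi'(gamma) = 0 gives gamma = (g^T g) / (g^T A^T A g).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = np.zeros(5)
for _ in range(200):
    g = A.T @ (A @ x - b)
    Ag = A @ g
    if Ag @ Ag < 1e-30:       # gradient (numerically) zero: done
        break
    x = x - (g @ g) / (Ag @ Ag) * g

# Compare against the closed-form least-squares solution
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-6))
```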
Example: the gradient

Let’s write it out:

(1/2)‖Ax − b‖² = (1/2) Σᵢ₌₁ⁿ ( (Ax)ᵢ − bᵢ )²
              = (1/2) Σᵢ₌₁ⁿ ( Σⱼ₌₁ᵐ aᵢⱼ xⱼ − bᵢ )²
              = (1/2) Σᵢ₌₁ⁿ [ ( Σⱼ₌₁ᵐ aᵢⱼ xⱼ )² − 2bᵢ Σⱼ₌₁ᵐ aᵢⱼ xⱼ + bᵢ² ]
              = (1/2) Σᵢ₌₁ⁿ ( Σⱼ₌₁ᵐ aᵢⱼ xⱼ )² − Σᵢ₌₁ⁿ bᵢ Σⱼ₌₁ᵐ aᵢⱼ xⱼ + (1/2) Σᵢ₌₁ⁿ bᵢ²
Example: the gradient

The partial derivative w.r.t. xⱼ:

∂/∂xⱼ [ (1/2)‖Ax − b‖² ]
  = ∂/∂xⱼ [ (1/2) Σᵢ₌₁ⁿ ( Σₖ₌₁ᵐ aᵢₖ xₖ )² − Σᵢ₌₁ⁿ bᵢ Σₖ₌₁ᵐ aᵢₖ xₖ + (1/2) Σᵢ₌₁ⁿ bᵢ² ]
  = (1/2) Σᵢ₌₁ⁿ ∂/∂xⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ )² − Σᵢ₌₁ⁿ bᵢ ∂/∂xⱼ Σₖ₌₁ᵐ aᵢₖ xₖ + ∂/∂xⱼ (1/2) Σᵢ₌₁ⁿ bᵢ²    (linearity; the last term = 0)
  = (1/2) Σᵢ₌₁ⁿ ∂/∂xⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ )² − Σᵢ₌₁ⁿ bᵢ aᵢⱼ    (∂(aᵢₖxₖ)/∂xⱼ = 0 if k ≠ j)
  = Σᵢ₌₁ⁿ aᵢⱼ Σₖ₌₁ᵐ aᵢₖ xₖ − Σᵢ₌₁ⁿ bᵢ aᵢⱼ    (chain rule)
  = Σᵢ₌₁ⁿ aᵢⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ − bᵢ )
Example: the gradient

Collecting terms:

∂/∂xⱼ [ (1/2)‖Ax − b‖² ] = Σᵢ₌₁ⁿ aᵢⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ − bᵢ )
                         = Σᵢ₌₁ⁿ aᵢⱼ ( (Ax)ᵢ − bᵢ )    (matrix product)
                         = ( Aᵀ(Ax − b) )ⱼ             (another matrix product)

Hence we have: ∇( (1/2)‖Ax − b‖² ) = Aᵀ(Ax − b)
Example: the gradient

The other way: use the chain rule with y = Ax − b:

∇( (1/2)‖Ax − b‖² ) = J(Ax − b)ᵀ ∇( (1/2)‖y‖² ) = Aᵀ(Ax − b)
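The result ∇(½‖Ax − b‖²) = Aᵀ(Ax − b) can be checked against finite differences on random data (my own sketch):

```python
import numpy as np

# Numerical check that the gradient of ||Ax - b||^2 / 2 is A^T (Ax - b)
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
x = rng.standard_normal(4)

f = lambda x: 0.5 * np.sum((A @ x - b)**2)
analytic = A.T @ (A @ x - b)

h = 1e-6
fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(4)])
print(np.allclose(fd, analytic, atol=1e-5))   # True
```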
Gradient descent & matrices
• How about “Given A, find small B and C s.t. ‖A − BC‖_F is minimized”?
• Not convex in B and C jointly
• Fix some B and solve for C
• C = argmin_X ‖A − BX‖_F
• Use the found C and solve for B, and repeat until convergence
How to solve for C?
• C = argmin_X ‖A − BX‖_F still needs some work
• Write the squared norm as a sum of column-wise errors: ‖A − BX‖_F² = Σⱼ ‖aⱼ − Bxⱼ‖₂²
• Now the problem is a series of standard least-squares problems
• Each can be solved independently
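The column-wise reduction can be sketched with NumPy’s least-squares solver (random illustrative data; not code from the slides):

```python
import numpy as np

# Solving C = argmin_X ||A - BX||_F column by column: each column c_j is an
# ordinary least-squares problem min ||a_j - B c_j||_2.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
B = rng.standard_normal((8, 3))

C = np.column_stack([np.linalg.lstsq(B, A[:, j], rcond=None)[0]
                     for j in range(A.shape[1])])

# The same solution in one call, since lstsq accepts a matrix right-hand side
C_all = np.linalg.lstsq(B, A, rcond=None)[0]
print(np.allclose(C, C_all))   # True
```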
How to select the step size?
• Recall: xₙ ← xₙ₋₁ − γₙ∇f(xₙ₋₁)
• Selecting correct γn for each n is crucial
• Methods for optimal step size are often slow (e.g. line search)
• Wrong step size can lead to non-convergence
Stochastic gradient descent
Basic idea
• With gradient descent, we need to calculate the gradient for c ↦ ‖a − Bc‖ many times for different a in each iteration
• Instead, we can fix one element aᵢⱼ and update the ith row of B and the jth column of C accordingly
• When we choose aij randomly, this is stochastic gradient descent (SGD)
Local gradient
• With a fixed aᵢⱼ, the residual is aᵢⱼ − (BC)ᵢⱼ = aᵢⱼ − Σₖ bᵢₖcₖⱼ
• The local gradient of the squared residual w.r.t. bᵢₖ is −2cₖⱼ(aᵢⱼ − (BC)ᵢⱼ)
• Similarly for ckj
• This allows us to update the factors by only computing one gradient
• Gradient needs to be sufficiently scaled
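A single SGD update built from these local gradients might look like this (the step size and data below are illustrative choices, not from the slides):

```python
import numpy as np

# One SGD step for ||A - BC||_F^2 at a single entry (i, j): with residual
# e = a_ij - (BC)_ij, the local gradients are -2 * e * C[:, j] for B[i, :]
# and -2 * e * B[i, :] for C[:, j].
def sgd_step(A, B, C, i, j, gamma=0.01):
    e = A[i, j] - B[i, :] @ C[:, j]
    B_i = B[i, :].copy()                  # use the old row when updating C
    B[i, :] += gamma * 2 * e * C[:, j]
    C[:, j] += gamma * 2 * e * B_i
    return B, C

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((2, 4))
for _ in range(2000):
    i, j = rng.integers(5), rng.integers(4)
    B, C = sgd_step(A, B, C, i, j)
print(np.linalg.norm(A - B @ C))   # error after the SGD sweeps
```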
SGD process
• Initialize with random B and C
• repeat
• Pick a random element (i, j)
• Update a row of B and a column of C using the local gradients w.r.t. aij
[Figure: contour plots showing the noisy path of SGD iterates, with the corresponding step-size plot q(t)]
SGD pros and cons
• Each iteration is faster to compute
• But can increase the error
• Does not need to know all elements of the input data
• Scalability
• Partially observed matrices (e.g. collaborative filtering)
• The step size still needs to be chosen carefully
Newton’s method
Basic idea
• Iterative update rule: xₙ₊₁ ← xₙ − [H(f(xₙ))]⁻¹∇f(xₙ)
• Assuming Hessian exists and is invertible…
• Takes curvature information into account
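A one-step sketch of the update rule on a quadratic, where Newton’s method reaches the minimizer exactly (the example function is my own):

```python
import numpy as np

# Newton iteration x_{n+1} = x_n - H(f(x_n))^{-1} grad f(x_n), sketched on
# f(x) = x^T Q x / 2 - c^T x: the Hessian is the constant Q, so a single
# Newton step from any start lands on the minimizer Q^{-1} c.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # symmetric positive definite
c = np.array([1.0, -1.0])

grad = lambda x: Q @ x - c
hess = lambda x: Q

x = np.array([10.0, -7.0])                   # arbitrary start
x = x - np.linalg.solve(hess(x), grad(x))    # one Newton step
print(np.allclose(x, np.linalg.solve(Q, c)))   # True
```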
Pros and cons
• Much faster convergence
• But the Hessian is slow to compute and takes lots of memory
• Quasi-Newton methods (e.g. L-BFGS) approximate the Hessian instead of computing it directly
• Often still needs some step size other than 1
Alternating least squares
Basic idea
• Given A and B, we can find the C that minimizes ‖A − BC‖_F
• In gradient descent, we move only slightly towards this C
• In alternating least squares (ALS), we replace C with the new one
Basic ALS algorithm
• Given A, sample a random B
• repeat until convergence
• C ← argmin_X ‖A − BX‖_F
• B ← argmin_X ‖A − XC‖_F
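The algorithm translates almost line-by-line into NumPy (the rank, iteration count, and test data are illustrative choices):

```python
import numpy as np

# Minimal ALS sketch for min ||A - BC||_F with rank k: alternate the two
# least-squares updates from the algorithm above.
def als(A, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((A.shape[0], k))
    for _ in range(iters):
        C = np.linalg.lstsq(B, A, rcond=None)[0]          # C <- argmin ||A - BX||
        B = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T    # B <- argmin ||A - XC||
    return B, C

rng = np.random.default_rng(42)
A = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 8))  # exactly rank 3
B, C = als(A, 3)
print(np.linalg.norm(A - B @ C))   # near 0: the rank-3 matrix is recovered
```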
ALS pros and cons
• Can have faster convergence than gradient descent (or SGD)
• The update is slower to compute than in SGD
• About as fast as in gradient descent
• Requires fully-observed matrices
Adding constraints
The problem setting
• So far, we have done unconstrained optimization
• What if we have constraints on the optimal solution?
• E.g. all matrices must be nonnegative
• In general, the above approaches won’t admit these constraints
General case
• Minimize f(x)
• Subject to gᵢ(x) ≤ 0, i = 1, …, m, and hⱼ(x) = 0, j = 1, …, k
• Assuming certain regularity conditions, there exist multipliers μᵢ (i = 1, …, m) and λⱼ (j = 1, …, k) that satisfy the Karush–Kuhn–Tucker (KKT) conditions
KKT conditions• Let x* be the optimal solution
• Stationarity:
• –∇f(x*) = ∑i μi∇gi(x*) + ∑j λj∇hj(x*)
• Primal feasibility:
• gi(x*) ≤ 0 for all i = 1, …, m
• hj(x*) = 0 for all j = 1, …, k
• Dual feasibility:
• μi ≥ 0 for all i = 1, …, m
• Complementary slackness:
• μigi(x*) = 0 for all i = 1, …, m
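The four conditions can be checked on a tiny worked example (my own illustration, not from the slides): minimize f(x) = x² subject to x ≥ 1, i.e. g(x) = 1 − x ≤ 0; the optimum is x* = 1 with multiplier μ = 2:

```python
# KKT conditions for: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Stationarity -f'(x*) = mu * g'(x*) gives -2x* = -mu, so mu = 2 at x* = 1.
x_star, mu = 1.0, 2.0

f_grad = lambda x: 2*x        # f'(x)
g = lambda x: 1 - x           # constraint g(x) <= 0
g_grad = lambda x: -1.0       # g'(x)

assert -f_grad(x_star) == mu * g_grad(x_star)   # stationarity
assert g(x_star) <= 0                           # primal feasibility
assert mu >= 0                                  # dual feasibility
assert mu * g(x_star) == 0                      # complementary slackness
print("KKT conditions hold at x* = 1 with mu = 2")
```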
When do KKT conditions hold
• KKT conditions hold under certain regularity conditions
• E.g. gi and hj are affine
• Or f is convex and there exists x s.t. hⱼ(x) = 0 for all j and gᵢ(x) < 0 for all i (Slater’s condition)
• Nonnegativity is an example of linear (hence, affine) constraint
What to do with the KKT conditions?
• μ and λ are new unknown variables
• Must be optimized together with x
• The conditions appear in the optimization
• E.g. in the gradient
• The KKT conditions are rarely solved directly
Summary
• There are many methods for optimization
• We only scratched the surface
• Methods are often based on gradients
• Can lead to ugly equations
• Next week: applying these techniques to finding nonnegative factorizations… Stay tuned!