Chapter 2: Optimization
Gradients, convexity, and ALS
Max Planck Institute for Informatics
DMM, summer 2017 Pauli Miettinen
Contents
• Background
• Gradient descent
• Stochastic gradient descent
• Newton’s method
• Alternating least squares
• KKT conditions
Motivation
• We can solve basic least-squares linear systems using the SVD
• But what if we have
• missing values in the data
• extra constraints for feasible solutions
• more complex optimization problems (e.g. regularizers)
• etc
Gradients, Hessians, and convexity
Derivatives and local optima
• The derivative of a function f: ℝ → ℝ, denoted f′, describes its rate of change
• If it exists
• The second derivative f′′ is the rate of change of the rate of change
f′(x) = lim_{h→0⁺} ( f(x + h) − f(x) ) / h
Derivatives and local optima
• A stationary point of differentiable f is x s.t. f’(x) = 0
• f achieves its extremes at stationary points, at points where the derivative doesn’t exist, or at infinity (Fermat’s theorem)
• Whether a stationary point is a (local) maximum or minimum can be seen from the second derivative (if it exists)
Partial derivative
• If f is multivariate (e.g. f: ℝ³ → ℝ), we can consider it as a family of functions
• E.g. f(x, y) = x² + y gives the families f_x(y) = x² + y (x held fixed) and f_y(x) = x² + y (y held fixed)
• Partial derivative w.r.t. one variable keeps other variables constant
∂f/∂x (x, y) = f_y′(x) = 2x
Gradient
• The gradient is the derivative for multivariate functions f: ℝⁿ → ℝ
• Here (and later), we assume that the derivatives exist
• The gradient is a function ∇f: ℝⁿ → ℝⁿ
• ∇f(x) points “up” in the function at point x
∇f = ( ∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ )
Gradient
[Figure: gradient illustration]
Hessian
• The Hessian is the square matrix of all second-order partial derivatives of a function f: ℝⁿ → ℝ
• As usual, we assume the derivatives exist
        ⎛ ∂²f/∂x₁²     ∂²f/∂x₁∂x₂   ⋯   ∂²f/∂x₁∂xₙ ⎞
H(f) =  ⎜ ∂²f/∂x₂∂x₁   ∂²f/∂x₂²     ⋯   ∂²f/∂x₂∂xₙ ⎟
        ⎜      ⋮             ⋮       ⋱        ⋮     ⎟
        ⎝ ∂²f/∂xₙ∂x₁   ∂²f/∂xₙ∂x₂   ⋯   ∂²f/∂xₙ²   ⎠
Jacobian matrix
• If f: ℝᵐ → ℝⁿ, then its Jacobian (matrix) is an n×m matrix of partial derivatives in the form
• The Jacobian gives the best linear approximation of f
• H(f(x)) = J(∇f(x))ᵀ
     ⎛ ∂f₁/∂x₁   ∂f₁/∂x₂   ⋯   ∂f₁/∂xₘ ⎞
J =  ⎜ ∂f₂/∂x₁   ∂f₂/∂x₂   ⋯   ∂f₂/∂xₘ ⎟
     ⎜    ⮯⋮         ⋮       ⋱      ⋮    ⎟
     ⎝ ∂fₙ/∂x₁   ∂fₙ/∂x₂   ⋯   ∂fₙ/∂xₘ ⎠
Examples

Function:             f(x, y) = x² + 2xy + y
Partial derivatives:  ∂f/∂x (x, y) = 2x + 2y,  ∂f/∂y (x, y) = 2x + 1
Gradient:             ∇f = ( 2x + 2y, 2x + 1 )
Hessian:              H(f) = ⎛ 2  2 ⎞
                             ⎝ 2  0 ⎠

Function:             f(x, y) = ( x²y, 5x + sin y )
Jacobian:             J(f) = ⎛ 2xy   x²    ⎞
                             ⎝ 5     cos y ⎠
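The gradient in the first example can be sanity-checked numerically. A small sketch (my own code, not from the slides) using central finite differences:

```python
import numpy as np

def f(x, y):
    # f(x, y) = x^2 + 2xy + y, the first example function
    return x**2 + 2*x*y + y

def grad_f(x, y):
    # gradient from the slide: (2x + 2y, 2x + 1)
    return np.array([2*x + 2*y, 2*x + 1])

# Check the gradient with central finite differences at a test point
x0, y0, h = 0.7, -0.3, 1e-5
fd = np.array([
    (f(x0 + h, y0) - f(x0 - h, y0)) / (2*h),
    (f(x0, y0 + h) - f(x0, y0 - h)) / (2*h),
])
print(np.allclose(fd, grad_f(x0, y0), atol=1e-6))  # True
```

The same check works for any of the later hand-derived gradients.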
![Page 13: Chapter 2 Optimization - Max Planck Institute for Informatics · Optimization Gradients, convexity, and ALS. DMM, summer 2017 Pauli Miettinen Contents • Background • Gradient](https://reader034.vdocuments.net/reader034/viewer/2022050718/5e16d493c457b14f4409f933/html5/thumbnails/13.jpg)
DMM, summer 2017 Pauli Miettinen
Gradient’s properties
• Linearity: ∇(αf + βg)(x) = α∇f(x) + β∇g(x)
• Product rule: ∇(fg)(x) = f(x)∇g(x) + g(x)∇f(x)
• Chain rule:
• If f: ℝⁿ → ℝ and g: ℝᵐ → ℝⁿ, then ∇(f∘g)(x) = J(g(x))ᵀ ∇f(y), where y = g(x)
• If f is as above and h: ℝ → ℝ, then ∇(h∘f)(x) = h’(f(x))∇f(x)
IMPORTANT!
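The scalar chain rule ∇(h∘f)(x) = h′(f(x))∇f(x) can be verified numerically on an illustrative pair of functions (my own choice of f and h, not from the slides):

```python
import numpy as np

# Scalar chain rule: grad(h∘f)(x) = h'(f(x)) * grad f(x).
# Illustrated with f(x) = ||x||^2 and h(t) = t^2, so (h∘f)(x) = ||x||^4.
def grad_hf(x):
    f = x @ x            # f(x) = ||x||^2, with grad f(x) = 2x
    return 2*f * (2*x)   # h'(t) = 2t evaluated at t = f(x), times grad f

# Finite-difference check of the composite gradient
x0 = np.array([0.5, -1.0, 2.0])
h = 1e-5
fd = np.array([((x0 + h*e) @ (x0 + h*e))**2 - ((x0 - h*e) @ (x0 - h*e))**2
               for e in np.eye(3)]) / (2*h)
print(np.allclose(fd, grad_hf(x0), rtol=1e-6))
```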
Convexity
• A function is convex if any line segment between two points on its graph lies above or on the graph
• For univariate f: if f′′(x) ≥ 0 for all x
• For multivariate f: if its Hessian is positive semidefinite
• I.e. zᵀHz ≥ 0 for any z
• Convex function’s local minimum is its global minimum
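The positive-semidefiniteness test can be sketched in code: a symmetric matrix satisfies zᵀHz ≥ 0 for all z iff its eigenvalues are all nonnegative (a standard equivalent test; the example reuses the Hessian from the Examples slide):

```python
import numpy as np

# A symmetric Hessian is positive semidefinite iff all its eigenvalues are >= 0;
# checking eigenvalues is one standard way to test z^T H z >= 0 for all z.
def is_psd(H, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of f(x, y) = x^2 + 2xy + y from the earlier example
H = np.array([[2.0, 2.0],
              [2.0, 0.0]])
print(is_psd(H))   # one eigenvalue is negative, so that f is not convex
```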
Preserving the convexity
• If f is convex and λ > 0, then λf is convex
• If f and g are convex, then f + g is convex
• If f is convex and g is affine (i.e. g(x) = Ax + b), then f∘g is convex (N.B. (f∘g)(x) = f(Ax + b))
• Let f(x) = (h∘g)(x) with g: ℝn → ℝ and h: ℝ → ℝ; f is convex if
• g is convex and h is nondecreasing and convex
• g is concave and h is non-increasing and convex
Gradient descent
Idea
• If f is convex, we should find its minimum by following its negative gradient
• But the negative gradient at x points towards the minimum only locally at x
• Hence, we need to descend slowly along the gradient
Example

[Figure: contour plots of an objective function showing successive gradient-descent steps, together with plots of the step-size function q(t)]
Gradient descent
• Start from a random point x₀
• At step n, update xₙ ← xₙ₋₁ − γ∇f(xₙ₋₁)
• γ is some small step size
• Often, γ depends on the iteration: xₙ ← xₙ₋₁ − γₙ∇f(xₙ₋₁)
• With suitable f and step size, this will converge to a local minimum
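The update rule above can be sketched in a few lines (the test function, step size, and iteration count below are illustrative choices, not from the slides):

```python
import numpy as np

# Minimal gradient descent following x_n <- x_{n-1} - gamma * grad f(x_{n-1})
def gradient_descent(grad_f, x0, gamma=0.1, steps=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - gamma * grad_f(x)
    return x

# Convex test function f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimum at (1, -2)
grad = lambda x: np.array([2*(x[0] - 1), 2*(x[1] + 2)])
print(gradient_descent(grad, [0.0, 0.0]))   # close to [1, -2]
```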
Example: least squares
• Given A ∈ ℝⁿˣᵐ and b ∈ ℝⁿ, find x ∈ ℝᵐ s.t. ‖Ax − b‖²/2 is minimized
• Can be solved using SVD…
• Calculate the gradient of fA,b(x) = ‖Ax − b‖²/2
• Employ the gradient descent approach
• In this case, the step size can be calculated analytically
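The analytic step size comes from exact line search: minimizing φ(γ) = f(x − γg) for the quadratic f and g = ∇f(x) = Aᵀ(Ax − b) gives γ = (gᵀg)/(gᵀAᵀAg) (a standard derivation; the data below is made up):

```python
import numpy as np

# For f(x) = ||Ax - b||^2 / 2, exact line search along -g with g = A^T(Ax - b):
# setting phi'(gamma) = 0 gives gamma = (g^T g) / (g^T A^T A g).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = np.zeros(5)
for _ in range(200):
    g = A.T @ (A @ x - b)
    Ag = A @ g
    if Ag @ Ag < 1e-30:       # gradient (numerically) zero: done
        break
    x = x - (g @ g) / (Ag @ Ag) * g

# Compare against the closed-form least-squares solution
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-6))
```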
Example: the gradient

Let’s write it out:

(1/2)‖Ax − b‖² = (1/2) Σᵢ₌₁ⁿ ( (Ax)ᵢ − bᵢ )²
              = (1/2) Σᵢ₌₁ⁿ ( Σⱼ₌₁ᵐ aᵢⱼ xⱼ − bᵢ )²
              = (1/2) Σᵢ₌₁ⁿ [ ( Σⱼ₌₁ᵐ aᵢⱼ xⱼ )² − 2bᵢ Σⱼ₌₁ᵐ aᵢⱼ xⱼ + bᵢ² ]
              = (1/2) Σᵢ₌₁ⁿ ( Σⱼ₌₁ᵐ aᵢⱼ xⱼ )² − Σᵢ₌₁ⁿ bᵢ Σⱼ₌₁ᵐ aᵢⱼ xⱼ + (1/2) Σᵢ₌₁ⁿ bᵢ²
Example: the gradient

The partial derivative w.r.t. xⱼ:

∂/∂xⱼ [ (1/2)‖Ax − b‖² ]
  = ∂/∂xⱼ [ (1/2) Σᵢ₌₁ⁿ ( Σₖ₌₁ᵐ aᵢₖ xₖ )² − Σᵢ₌₁ⁿ bᵢ Σₖ₌₁ᵐ aᵢₖ xₖ + (1/2) Σᵢ₌₁ⁿ bᵢ² ]
  = (1/2) Σᵢ₌₁ⁿ ∂/∂xⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ )² − Σᵢ₌₁ⁿ bᵢ ∂/∂xⱼ Σₖ₌₁ᵐ aᵢₖ xₖ + ∂/∂xⱼ (1/2) Σᵢ₌₁ⁿ bᵢ²    (linearity; the last term = 0)
  = (1/2) Σᵢ₌₁ⁿ ∂/∂xⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ )² − Σᵢ₌₁ⁿ bᵢ aᵢⱼ    (∂(aᵢₖxₖ)/∂xⱼ = 0 if k ≠ j)
  = Σᵢ₌₁ⁿ aᵢⱼ Σₖ₌₁ᵐ aᵢₖ xₖ − Σᵢ₌₁ⁿ bᵢ aᵢⱼ    (chain rule)
  = Σᵢ₌₁ⁿ aᵢⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ − bᵢ )
Example: the gradient

Collecting terms:

∂/∂xⱼ [ (1/2)‖Ax − b‖² ] = Σᵢ₌₁ⁿ aᵢⱼ ( Σₖ₌₁ᵐ aᵢₖ xₖ − bᵢ )
                         = Σᵢ₌₁ⁿ aᵢⱼ ( (Ax)ᵢ − bᵢ )    (matrix product)
                         = ( Aᵀ(Ax − b) )ⱼ             (another matrix product)

Hence we have: ∇( (1/2)‖Ax − b‖² ) = Aᵀ(Ax − b)
Example: the gradient

The other way: use the chain rule with y = Ax − b:

∇( (1/2)‖Ax − b‖² ) = J(Ax − b)ᵀ ∇( (1/2)‖y‖² ) = Aᵀ(Ax − b)
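The result ∇(½‖Ax − b‖²) = Aᵀ(Ax − b) can be checked against finite differences on random data (my own sketch):

```python
import numpy as np

# Numerical check that the gradient of ||Ax - b||^2 / 2 is A^T (Ax - b)
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
x = rng.standard_normal(4)

f = lambda x: 0.5 * np.sum((A @ x - b)**2)
analytic = A.T @ (A @ x - b)

h = 1e-6
fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(4)])
print(np.allclose(fd, analytic, atol=1e-5))   # True
```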
Gradient descent & matrices
• How about “Given A, find small B and C s.t. ‖A − BC‖_F is minimized”?
• Not convex in B and C jointly
• Fix some B and solve for C
• C = argmin_X ‖A − BX‖_F
• Use the found C and solve for B, and repeat until convergence
How to solve for C?
• C = argmin_X ‖A − BX‖_F still needs some work
• Write the squared norm as a sum of column-wise errors: ‖A − BX‖_F² = Σⱼ ‖aⱼ − Bxⱼ‖₂²
• Now the problem is a series of standard least-squares problems
• Each can be solved independently
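The column-wise reduction can be sketched with NumPy’s least-squares solver (random illustrative data; not code from the slides):

```python
import numpy as np

# Solving C = argmin_X ||A - BX||_F column by column: each column c_j is an
# ordinary least-squares problem min ||a_j - B c_j||_2.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
B = rng.standard_normal((8, 3))

C = np.column_stack([np.linalg.lstsq(B, A[:, j], rcond=None)[0]
                     for j in range(A.shape[1])])

# The same solution in one call, since lstsq accepts a matrix right-hand side
C_all = np.linalg.lstsq(B, A, rcond=None)[0]
print(np.allclose(C, C_all))   # True
```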
How to select the step size?
• Recall: xₙ ← xₙ₋₁ − γₙ∇f(xₙ₋₁)
• Selecting correct γn for each n is crucial
• Methods for optimal step size are often slow (e.g. line search)
• Wrong step size can lead to non-convergence
Stochastic gradient descent
Basic idea
• With gradient descent, we need to calculate the gradient for c ↦ ‖a − Bc‖ many times for different a in each iteration
• Instead, we can fix one element aᵢⱼ and update the ith row of B and the jth column of C accordingly
• When we choose aij randomly, this is stochastic gradient descent (SGD)
Local gradient
• With a fixed aᵢⱼ, the residual is aᵢⱼ − (BC)ᵢⱼ = aᵢⱼ − Σₖ bᵢₖcₖⱼ
• The local gradient of the squared residual w.r.t. bᵢₖ is −2cₖⱼ(aᵢⱼ − (BC)ᵢⱼ)
• Similarly for ckj
• This allows us to update the factors by only computing one gradient
• Gradient needs to be sufficiently scaled
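A single SGD update built from these local gradients might look like this (the step size and data below are illustrative choices, not from the slides):

```python
import numpy as np

# One SGD step for ||A - BC||_F^2 at a single entry (i, j): with residual
# e = a_ij - (BC)_ij, the local gradients are -2 * e * C[:, j] for B[i, :]
# and -2 * e * B[i, :] for C[:, j].
def sgd_step(A, B, C, i, j, gamma=0.01):
    e = A[i, j] - B[i, :] @ C[:, j]
    B_i = B[i, :].copy()                  # use the old row when updating C
    B[i, :] += gamma * 2 * e * C[:, j]
    C[:, j] += gamma * 2 * e * B_i
    return B, C

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((2, 4))
for _ in range(2000):
    i, j = rng.integers(5), rng.integers(4)
    B, C = sgd_step(A, B, C, i, j)
print(np.linalg.norm(A - B @ C))   # error after the SGD sweeps
```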
SGD process
• Initialize with random B and C
• repeat
• Pick a random element (i, j)
• Update a row of B and a column of C using the local gradients w.r.t. aij
[Figure: contour plots showing the noisy path of SGD iterates, with the corresponding step-size plot q(t)]
SGD pros and cons
• Each iteration is faster to compute
• But can increase the error
• Does not need to know all elements of the input data
• Scalability
• Partially observed matrices (e.g. collaborative filtering)
• The step size still needs to be chosen carefully
Newton’s method
Basic idea
• Iterative update rule: xₙ₊₁ ← xₙ − [H(f(xₙ))]⁻¹∇f(xₙ)
• Assuming Hessian exists and is invertible…
• Takes curvature information into account
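A one-step sketch of the update rule on a quadratic, where Newton’s method reaches the minimizer exactly (the example function is my own):

```python
import numpy as np

# Newton iteration x_{n+1} = x_n - H(f(x_n))^{-1} grad f(x_n), sketched on
# f(x) = x^T Q x / 2 - c^T x: the Hessian is the constant Q, so a single
# Newton step from any start lands on the minimizer Q^{-1} c.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # symmetric positive definite
c = np.array([1.0, -1.0])

grad = lambda x: Q @ x - c
hess = lambda x: Q

x = np.array([10.0, -7.0])                   # arbitrary start
x = x - np.linalg.solve(hess(x), grad(x))    # one Newton step
print(np.allclose(x, np.linalg.solve(Q, c)))   # True
```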
Pros and cons
• Much faster convergence
• But the Hessian is slow to compute and takes lots of memory
• Quasi-Newton methods (e.g. L-BFGS) approximate the Hessian instead of computing it directly
• Often still needs some step size other than 1
Alternating least squares
Basic idea
• Given A and B, we can find the C that minimizes ‖A − BC‖_F
• In gradient descent, we move only slightly towards this C
• In alternating least squares (ALS), we replace C with the new one
Basic ALS algorithm
• Given A, sample a random B
• repeat until convergence
• C ← argmin_X ‖A − BX‖_F
• B ← argmin_X ‖A − XC‖_F
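The algorithm translates almost line-by-line into NumPy (the rank, iteration count, and test data are illustrative choices):

```python
import numpy as np

# Minimal ALS sketch for min ||A - BC||_F with rank k: alternate the two
# least-squares updates from the algorithm above.
def als(A, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((A.shape[0], k))
    for _ in range(iters):
        C = np.linalg.lstsq(B, A, rcond=None)[0]          # C <- argmin ||A - BX||
        B = np.linalg.lstsq(C.T, A.T, rcond=None)[0].T    # B <- argmin ||A - XC||
    return B, C

rng = np.random.default_rng(42)
A = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 8))  # exactly rank 3
B, C = als(A, 3)
print(np.linalg.norm(A - B @ C))   # near 0: the rank-3 matrix is recovered
```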
ALS pros and cons
• Can have faster convergence than gradient descent (or SGD)
• The update is slower to compute than in SGD
• About as fast as in gradient descent
• Requires fully-observed matrices
Adding constraints
The problem setting
• So far, we have done unconstrained optimization
• What if we have constraints on the optimal solution?
• E.g. all matrices must be nonnegative
• In general, the above approaches won’t admit these constraints
General case
• Minimize f(x)
• Subject to gᵢ(x) ≤ 0, i = 1, …, m, and hⱼ(x) = 0, j = 1, …, k
• Assuming certain regularity conditions, there exist multipliers μᵢ (i = 1, …, m) and λⱼ (j = 1, …, k) that satisfy the Karush–Kuhn–Tucker (KKT) conditions
KKT conditions• Let x* be the optimal solution
• Stationarity:
• –∇f(x*) = ∑i μi∇gi(x*) + ∑j λj∇hj(x*)
• Primal feasibility:
• gi(x*) ≤ 0 for all i = 1, …, m
• hj(x*) = 0 for all j = 1, …, k
• Dual feasibility:
• μi ≥ 0 for all i = 1, …, m
• Complementary slackness:
• μigi(x*) = 0 for all i = 1, …, m
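The four conditions can be checked on a tiny worked example (my own illustration, not from the slides): minimize f(x) = x² subject to x ≥ 1, i.e. g(x) = 1 − x ≤ 0; the optimum is x* = 1 with multiplier μ = 2:

```python
# KKT conditions for: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Stationarity -f'(x*) = mu * g'(x*) gives -2x* = -mu, so mu = 2 at x* = 1.
x_star, mu = 1.0, 2.0

f_grad = lambda x: 2*x        # f'(x)
g = lambda x: 1 - x           # constraint g(x) <= 0
g_grad = lambda x: -1.0       # g'(x)

assert -f_grad(x_star) == mu * g_grad(x_star)   # stationarity
assert g(x_star) <= 0                           # primal feasibility
assert mu >= 0                                  # dual feasibility
assert mu * g(x_star) == 0                      # complementary slackness
print("KKT conditions hold at x* = 1 with mu = 2")
```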
When do KKT conditions hold
• KKT conditions hold under certain regularity conditions
• E.g. gi and hj are affine
• Or f is convex and there exists x s.t. hⱼ(x) = 0 for all j and gᵢ(x) < 0 for all i (Slater’s condition)
• Nonnegativity is an example of linear (hence, affine) constraint
What to do with the KKT conditions?
• μ and λ are new unknown variables
• Must be optimized together with x
• The conditions appear in the optimization
• E.g. in the gradient
• The KKT conditions are rarely solved directly
Summary
• There are many methods for optimization
• We only scratched the surface
• Methods are often based on gradients
• Can lead to ugly equations
• Next week: applying these techniques to finding nonnegative factorizations… Stay tuned!