
Page 1:

Optimization methods

Optimization-Based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_spring16

Carlos Fernandez-Granda

2/8/2016

Page 2:

Introduction

Aim: Overview of optimization methods that

- Tend to scale well with the problem dimension
- Are widely used in machine learning and signal processing
- Are (reasonably) well understood theoretically

Page 3:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 4:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 5:

Gradient

[Contour plot of a function of two variables, annotated with the direction of maximum variation]

Page 6:

Gradient descent (aka steepest descent)

Method to solve the optimization problem

minimize f(x),

where f is differentiable

Gradient-descent iteration:

x^(0) = arbitrary initialization
x^(k+1) = x^(k) − α_k ∇f(x^(k))

where α_k is the step size
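As a concrete illustration (not part of the original slides), here is a minimal Python/NumPy sketch of this iteration; the callable grad_f, the constant step size alpha, and the stopping rule are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Gradient-descent sketch: x^(k+1) = x^(k) - alpha * grad f(x^(k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                   # gradient at the current iterate
        if np.linalg.norm(g) < tol:     # stop once the gradient is (nearly) zero
            break
        x = x - alpha * g               # constant-step gradient step
    return x

# Example: minimize f(x) = ||x - c||_2^2, whose gradient is 2*(x - c)
c = np.array([1.0, -2.0])
x_hat = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2))
```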

Page 7:

Gradient descent (1D)

Page 8:

Gradient descent (2D)


Page 9:

Small step size


Page 10:

Large step size


Page 11:

Line search

- Exact:

  α_k := argmin_{β ≥ 0} f( x^(k) − β ∇f(x^(k)) )

- Backtracking (Armijo rule):

  Given α_0 ≥ 0 and β ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that

  f(x^(k+1)) ≤ f(x^(k)) − (α_k/2) ||∇f(x^(k))||_2^2
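A minimal sketch of the backtracking rule above, assuming the objective f and its gradient grad_f are available as callables; the defaults alpha0 = 1 and beta = 0.5 are illustrative choices, not from the slides.

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5):
    """Armijo backtracking: shrink alpha0 by beta until
    f(x - alpha*g) <= f(x) - (alpha/2) * ||g||_2^2 holds."""
    g = grad_f(x)
    alpha = alpha0
    while f(x - alpha * g) > f(x) - 0.5 * alpha * np.dot(g, g):
        alpha *= beta                  # try a smaller step size
    return alpha
```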

Page 12:

Backtracking line search


Page 13:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 14:

Lipschitz continuity

A function f : R^n → R^m is Lipschitz continuous with Lipschitz constant L if for any x, y ∈ R^n

||f(y) − f(x)||_2 ≤ L ||y − x||_2

Example:

f(x) := Ax is Lipschitz continuous with L = σ_max(A)
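A quick numerical sanity check of this example (a sketch, not from the slides): for a random A and random points x, y, the bound ||Ay − Ax||_2 ≤ σ_max(A) ||y − x||_2 should hold.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
x = rng.standard_normal(3)
y = rng.standard_normal(3)

L = np.linalg.svd(A, compute_uv=False)[0]     # largest singular value sigma_max(A)
lhs = np.linalg.norm(A @ y - A @ x)
rhs = L * np.linalg.norm(y - x)
assert lhs <= rhs + 1e-12                      # ||Ay - Ax||_2 <= sigma_max(A) ||y - x||_2
```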

Page 15:

Quadratic upper bound

If the gradient of f : R^n → R is Lipschitz continuous with constant L,

||∇f(y) − ∇f(x)||_2 ≤ L ||y − x||_2

then for any x, y ∈ R^n

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2) ||y − x||_2^2

Page 16:

Consequence of quadratic bound

Since x^(k+1) = x^(k) − α_k ∇f(x^(k)),

f(x^(k+1)) ≤ f(x^(k)) − α_k (1 − α_k L / 2) ||∇f(x^(k))||_2^2

If α_k ≤ 1/L the value of the function always decreases:

f(x^(k+1)) ≤ f(x^(k)) − (α_k/2) ||∇f(x^(k))||_2^2

Page 17:

Gradient descent with constant step size

Conditions:
- f is convex
- ∇f is L-Lipschitz continuous
- There exists a solution x* such that f(x*) is finite

If α_k = α ≤ 1/L,

f(x^(k)) − f(x*) ≤ ||x^(0) − x*||_2^2 / (2αk)

We need O(1/ε) iterations to get an ε-optimal solution

Page 18:

Proof

Recall that if α ≤ 1/L,

f(x^(i)) ≤ f(x^(i−1)) − (α/2) ||∇f(x^(i−1))||_2^2

By the first-order characterization of convexity,

f(x^(i−1)) − f(x*) ≤ ∇f(x^(i−1))^T (x^(i−1) − x*)

This implies

f(x^(i)) − f(x*) ≤ ∇f(x^(i−1))^T (x^(i−1) − x*) − (α/2) ||∇f(x^(i−1))||_2^2
                = (1/(2α)) ( ||x^(i−1) − x*||_2^2 − ||x^(i−1) − x* − α ∇f(x^(i−1))||_2^2 )
                = (1/(2α)) ( ||x^(i−1) − x*||_2^2 − ||x^(i) − x*||_2^2 )

Page 19:

Proof

Because the value of f never increases,

f(x^(k)) − f(x*) ≤ (1/k) Σ_{i=1}^{k} [ f(x^(i)) − f(x*) ]
                ≤ (1/(2αk)) ( ||x^(0) − x*||_2^2 − ||x^(k) − x*||_2^2 )
                ≤ ||x^(0) − x*||_2^2 / (2αk)

Page 20:

Backtracking line search

If the gradient of f : R^n → R is Lipschitz continuous with constant L, the step size in the backtracking line search satisfies

α_k ≥ α_min := min { α_0, β/L }

Page 21:

Proof

The line search ends when

f(x^(k+1)) ≤ f(x^(k)) − (α_k/2) ||∇f(x^(k))||_2^2

but we know that if α_k ≤ 1/L,

f(x^(k+1)) ≤ f(x^(k)) − (α_k/2) ||∇f(x^(k))||_2^2

This happens as soon as β/L ≤ α_0 β^i ≤ 1/L

Page 22:

Gradient descent with backtracking

Under the same conditions as before, gradient descent with backtracking line search achieves

f(x^(k)) − f(x*) ≤ ||x^(0) − x*||_2^2 / (2 α_min k)

O(1/ε) iterations to get an ε-optimal solution

Page 23:

Strong convexity

A function f : R^n → R is strongly convex if for any x, y ∈ R^n

f(y) ≥ f(x) + ∇f(x)^T (y − x) + S ||y − x||_2^2

Example:

f(x) := ||Ax − y||_2^2, where A ∈ R^{m×n}, is strongly convex with S = σ_min(A) if m > n

Page 24:

Gradient descent for strongly convex functions

If f is S-strongly convex and ∇f is L-Lipschitz continuous,

f(x^(k)) − f(x*) ≤ c^k L ||x^(0) − x*||_2^2 / 2,   where c := (L/S − 1) / (L/S + 1)

We need O(log(1/ε)) iterations to get an ε-optimal solution

Page 25:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 26:

Lower bounds for convergence rate

There exist convex functions with L-Lipschitz-continuous gradients such that, for any algorithm that selects x^(k) from

x^(0) + span { ∇f(x^(0)), ∇f(x^(1)), . . . , ∇f(x^(k−1)) },

we have

f(x^(k)) − f(x*) ≥ 3 L ||x^(0) − x*||_2^2 / ( 32 (k + 1)^2 )

Page 27:

Nesterov's accelerated gradient method

Achieves the lower bound, i.e. O(1/√ε) convergence

Uses a momentum variable:

y^(k+1) = x^(k) − α_k ∇f(x^(k))
x^(k+1) = β_k y^(k+1) + γ_k y^(k)

Despite the guarantees, why this works is not completely understood
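A sketch of one common instantiation of this scheme (the slides leave β_k and γ_k generic, so the momentum weight k/(k+3) used below, the same weight that appears in the FISTA slide later, is an illustrative assumption rather than the exact update from the lecture).

```python
import numpy as np

def accelerated_gd(grad_f, x0, alpha, max_iter=500):
    """Accelerated-gradient sketch with x^(k+1) = y^(k+1) + (k/(k+3)) * (y^(k+1) - y^(k)),
    i.e. beta_k = 1 + k/(k+3) and gamma_k = -k/(k+3) (assumed choice)."""
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for k in range(max_iter):
        y = x - alpha * grad_f(x)               # gradient step
        x = y + (k / (k + 3.0)) * (y - y_prev)  # momentum combination of y^(k+1), y^(k)
        y_prev = y
    return y
```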

Page 28:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 29:

Projected gradient descent

Optimization problem:

minimize f(x)
subject to x ∈ S,

where f is differentiable and S is convex

Projected-gradient-descent iteration:

x^(0) = arbitrary initialization
x^(k+1) = P_S ( x^(k) − α_k ∇f(x^(k)) )
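A minimal sketch of this iteration, assuming a user-supplied projection operator; the nonnegative orthant used in the example is an assumed choice of S, not from the slides.

```python
import numpy as np

def projected_gradient_descent(grad_f, project, x0, alpha=0.1, max_iter=1000):
    """Projected-gradient sketch: gradient step followed by projection onto S."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x = project(x - alpha * grad_f(x))   # P_S(x - alpha * grad f(x))
    return x

# Example set (assumed, not from the slides): S = nonnegative orthant, so P_S clips at 0
project_nonneg = lambda v: np.maximum(v, 0.0)
c = np.array([1.0, -2.0])
x_hat = projected_gradient_descent(lambda x: 2 * (x - c), project_nonneg, np.zeros(2))
```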

Page 30:

Projected gradient descent


Page 31:

Projected gradient descent


Page 32:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 33:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 34:

Subgradient method

Optimization problem:

minimize f(x)

where f is convex but nondifferentiable

Subgradient-method iteration:

x^(0) = arbitrary initialization
x^(k+1) = x^(k) − α_k q^(k)

where q^(k) is a subgradient of f at x^(k)

Page 35:

Least-squares regression with ℓ1-norm regularization

minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

The sum of subgradients is a subgradient of the sum:

q^(k) = A^T (A x^(k) − y) + λ sign(x^(k))

Subgradient-method iteration:

x^(0) = arbitrary initialization
x^(k+1) = x^(k) − α_k ( A^T (A x^(k) − y) + λ sign(x^(k)) )
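A sketch of this subgradient iteration in Python/NumPy, using the diminishing step size α_0/√k discussed on the next slide; the value of alpha0 and the iteration count are assumed tuning choices.

```python
import numpy as np

def subgradient_lasso(A, y, lam, alpha0=1e-3, n_iter=5000):
    """Subgradient method for (1/2)||Ax - y||_2^2 + lam * ||x||_1 with
    the diminishing step size alpha0 / sqrt(k)."""
    x = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        q = A.T @ (A @ x - y) + lam * np.sign(x)   # subgradient at x^(k)
        x = x - (alpha0 / np.sqrt(k)) * q
    return x
```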

Page 36:

Convergence of subgradient method

It is not a descent method

The convergence rate can be shown to be O(1/ε^2)

Diminishing step sizes are necessary for convergence

Experiment:

minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

A ∈ R^{2000×1000}, y = A x_0 + z, where x_0 is 100-sparse and z is iid Gaussian

Page 37:

Convergence of subgradient method

[Plot: relative error (f(x^(k)) − f(x*)) / f(x*) versus iteration k (0 to 100), for step sizes α_0, α_0/√k, and α_0/k]

Page 38:

Convergence of subgradient method

[Plot: relative error (f(x^(k)) − f(x*)) / f(x*) versus iteration k (0 to 5000), for step sizes α_0, α_0/√n, and α_0/n]

Page 39:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 40:

Composite functions

Interesting class of functions for data analysis:

f(x) + g(x)

f convex and differentiable, g convex but not differentiable

Example: least-squares regression (f) + ℓ1-norm regularization (g)

(1/2) ||Ax − y||_2^2 + λ ||x||_1

Page 41:

Interpretation of gradient descent

Solution of a local first-order approximation:

x^(k+1) := x^(k) − α_k ∇f(x^(k))
         = argmin_x || x − ( x^(k) − α_k ∇f(x^(k)) ) ||_2^2
         = argmin_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2α_k)) ||x − x^(k)||_2^2

Page 42:

Proximal gradient method

Idea: minimize the local first-order approximation plus g

x^(k+1) = argmin_x f(x^(k)) + ∇f(x^(k))^T (x − x^(k)) + (1/(2α_k)) ||x − x^(k)||_2^2 + g(x)
        = argmin_x (1/2) || x − ( x^(k) − α_k ∇f(x^(k)) ) ||_2^2 + α_k g(x)
        = prox_{α_k g} ( x^(k) − α_k ∇f(x^(k)) )

Proximal operator:

prox_g(y) := argmin_x g(x) + (1/2) ||y − x||_2^2

Page 43:

Proximal gradient method

Method to solve the optimization problem

minimize f(x) + g(x),

where f is differentiable and prox_g is tractable

Proximal-gradient iteration:

x^(0) = arbitrary initialization
x^(k+1) = prox_{α_k g} ( x^(k) − α_k ∇f(x^(k)) )
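A minimal sketch of this iteration, assuming grad_f and a proximal operator prox_g(v, t) = prox_{t·g}(v) are supplied by the caller; the constant step size alpha is also an assumed input.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, alpha, max_iter=1000):
    """Proximal-gradient sketch: x^(k+1) = prox_{alpha*g}(x^(k) - alpha * grad f(x^(k)))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x = prox_g(x - alpha * grad_f(x), alpha)   # gradient step, then proximal step
    return x
```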

Page 44:

Interpretation as a fixed-point method

A vector x̂ is a solution to

minimize f(x) + g(x)

if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:

x̂ = prox_{α g} ( x̂ − α ∇f(x̂) )

Page 45:

Projected gradient descent as a proximal method

The proximal operator of the indicator function

I_S(x) := 0 if x ∈ S,  ∞ if x ∉ S

of a convex set S ⊆ R^n is the projection onto S

Proximal-gradient iteration:

x^(k+1) = prox_{α_k I_S} ( x^(k) − α_k ∇f(x^(k)) ) = P_S ( x^(k) − α_k ∇f(x^(k)) )

Page 46:

Proximal operator of the ℓ1 norm

The proximal operator of the ℓ1 norm is the soft-thresholding operator

prox_{β ||·||_1}(y) = S_β(y)

where β > 0 and

S_β(y)_i := y_i − sign(y_i) β   if |y_i| ≥ β
            0                   otherwise
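A direct, vectorized transcription of this operator in NumPy (a sketch, with the function name chosen here for illustration):

```python
import numpy as np

def soft_threshold(y, beta):
    """Soft-thresholding operator S_beta: shrinks each entry toward zero by beta
    and zeroes out entries with |y_i| < beta."""
    return np.sign(y) * np.maximum(np.abs(y) - beta, 0.0)
```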

Page 47:

Iterative Shrinkage-Thresholding Algorithm (ISTA)

The proximal gradient method for the problem

minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

is called ISTA

ISTA iteration:

x^(0) = arbitrary initialization
x^(k+1) = S_{α_k λ} ( x^(k) − α_k A^T (A x^(k) − y) )
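A sketch of ISTA for this problem; the constant step size 1/σ_max(A)^2 (the reciprocal of the Lipschitz constant of the gradient of the quadratic term) is an assumed, standard choice not stated on the slide.

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    """ISTA sketch with constant step size 1/L, L = sigma_max(A)^2 (assumed choice)."""
    soft = lambda v, b: np.sign(v) * np.maximum(np.abs(v) - b, 0.0)  # S_b(v)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / sigma_max(A)^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - alpha * (A.T @ (A @ x - y)), alpha * lam)
    return x
```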

Page 48:

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)

ISTA can be accelerated using Nesterov's accelerated gradient method

FISTA iteration:

x^(0) = arbitrary initialization
z^(0) = x^(0)
x^(k+1) = S_{α_k λ} ( z^(k) − α_k A^T (A z^(k) − y) )
z^(k+1) = x^(k+1) + (k / (k + 3)) ( x^(k+1) − x^(k) )
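A corresponding sketch of FISTA, using the k/(k+3) momentum weight from the iteration above; the step size is the same assumed 1/σ_max(A)^2 as in the ISTA sketch.

```python
import numpy as np

def fista(A, y, lam, n_iter=500):
    """FISTA sketch: the ISTA update applied at the extrapolated point z^(k),
    followed by the momentum step with weight k/(k+3)."""
    soft = lambda v, b: np.sign(v) * np.maximum(np.abs(v) - b, 0.0)  # S_b(v)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2       # assumed step size 1/sigma_max(A)^2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(n_iter):
        x_next = soft(z - alpha * (A.T @ (A @ z - y)), alpha * lam)
        z = x_next + (k / (k + 3.0)) * (x_next - x)   # momentum step
        x = x_next
    return x
```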

Page 49:

Convergence of proximal gradient method

Without acceleration:
- Descent method
- The convergence rate can be shown to be O(1/ε) with a constant step size or backtracking line search

With acceleration:
- Not a descent method
- The convergence rate can be shown to be O(1/√ε) with a constant step size or backtracking line search

Experiment: minimize (1/2) ||Ax − y||_2^2 + λ ||x||_1

A ∈ R^{2000×1000}, y = A x_0 + z, x_0 100-sparse and z iid Gaussian

Page 50:

Convergence of proximal gradient method

[Plot: relative error (f(x^(k)) − f(x*)) / f(x*) versus iteration k (0 to 100), for the subgradient method (α_0/√k), ISTA, and FISTA]

Page 51:

Differentiable functions
- Gradient descent
- Convergence analysis of gradient descent
- Accelerated gradient descent
- Projected gradient descent

Nondifferentiable functions
- Subgradient method
- Proximal gradient method
- Coordinate descent

Page 52:

Coordinate descent

Idea: solve the n-dimensional problem

minimize h(x_1, x_2, . . . , x_n)

by solving a sequence of 1D problems

Coordinate-descent iteration:

x^(0) = arbitrary initialization
x_i^(k+1) = argmin_α h( x_1^(k), . . . , α, . . . , x_n^(k) )   for some 1 ≤ i ≤ n

Page 53:

Coordinate descent

Convergence is guaranteed for functions of the form

f(x) + Σ_{i=1}^{n} g_i(x_i)

where f is convex and differentiable and g_1, . . . , g_n are convex

Page 54:

Least-squares regression with ℓ1-norm regularization

h(x) := (1/2) ||Ax − y||_2^2 + λ ||x||_1

The solution to the subproblem min_{x_i} h(x_1, . . . , x_i, . . . , x_n) is

x̂_i = S_λ(γ_i) / ||A_i||_2^2

where A_i is the ith column of A and

γ_i := Σ_{l=1}^{m} A_{li} ( y_l − Σ_{j≠i} A_{lj} x_j )
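A sketch of cyclic coordinate descent built on this closed-form update; cycling through the coordinates in order and the number of sweeps are assumed choices, not from the slides.

```python
import numpy as np

def coordinate_descent_lasso(A, y, lam, n_sweeps=100):
    """Coordinate-descent sketch for (1/2)||Ax - y||_2^2 + lam * ||x||_1, cycling
    through the coordinates and applying the closed-form update from the slide."""
    n = A.shape[1]
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)                # ||A_i||_2^2 for every column
    soft = lambda v, b: np.sign(v) * max(abs(v) - b, 0.0)   # scalar S_b(v)
    for _ in range(n_sweeps):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]         # residual excluding coordinate i
            gamma_i = A[:, i] @ r                  # gamma_i as defined on the slide
            x[i] = soft(gamma_i, lam) / col_sq[i]  # x_i = S_lambda(gamma_i) / ||A_i||_2^2
    return x
```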

Page 55:

Computational experiments

From Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani and Wainwright