
Page 1:

Optimization for Datascience

Proximal operator and proximal gradient methods

Lecturer: Robert M. Gower & Alexandre Gramfort

Tutorials: Quentin Bertrand, Nidham Gazagnadou

Master 2 Data Science, Institut Polytechnique de Paris (IPP)

Page 2:

References for today's class

Amir Beck and Marc Teboulle (2009), "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems", SIAM J. Imaging Sciences.

Sébastien Bubeck (2015), "Convex Optimization: Algorithms and Complexity", Chapter 1 and Section 5.1.

Page 3:

A Datum Function

Finite Sum Training Problem

Optimization of a Sum of Terms
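The formulas on this slide are presumably the standard finite-sum setup, where f_i is the datum function, i.e. the loss of the model w on the i-th example:

\min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w)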

Page 4:

The Training Problem

Page 7:

Convergence GD I

Theorem: Let f be convex and L-smooth. Then gradient descent with step size 1/L satisfies

f(w^t) - f(w^*) \le \frac{L \|w^0 - w^*\|^2}{2t},

where w^* is a minimizer of f.

Is f always differentiable? Not true for many problems.

Page 8:

Change of notation: keep the loss and the regularizer separate

The Training problem: \min_w F(w) := L(w) + R(w), with loss function L(w) and regularizer R(w).

If L or R is not differentiable, then L + R is (in general) not differentiable.

If L or R is not smooth, then L + R is not smooth.

Page 13:

Non-smooth Example

The problem

Does not fit. Not smooth.

Need more tools.

Page 14:

Assumptions for this class

The Training problem

What does this mean?

Assume this is easy to solve

Page 15:

Examples

SVM with soft margin

Lasso

Low Rank Matrix Recovery

These are not smooth, but the prox is easy.
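The objectives themselves are not shown; their usual formulations are, as a sketch (with regularization parameter \lambda > 0, data (x_i, y_i), design matrix X, and a linear measurement map \mathcal{A}, all our notation):

\text{SVM with soft margin:} \quad \min_w \; \frac{1}{n} \sum_{i=1}^n \max\big(0,\, 1 - y_i \langle x_i, w \rangle\big) + \frac{\lambda}{2} \|w\|_2^2

\text{Lasso:} \quad \min_w \; \frac{1}{2n} \|Xw - y\|_2^2 + \lambda \|w\|_1

\text{Low rank matrix recovery:} \quad \min_W \; \frac{1}{2} \|\mathcal{A}(W) - b\|_2^2 + \lambda \|W\|_* \quad (\text{nuclear norm})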

Page 17:

Convexity: Subgradient

g is a subgradient of R at w if R(z) \ge R(w) + \langle g, z - w \rangle for all z; the set of all subgradients at w is the subdifferential \partial R(w).

g = 0: a zero subgradient certifies a global minimizer, i.e. 0 \in \partial R(w) if and only if w minimizes R.

Page 18:

Examples: L1 norm
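The example is presumably the standard subdifferential of the \ell_1 norm, computed coordinate-wise:

\partial \|w\|_1 = \Big\{ g \in \mathbb{R}^d : g_i = \mathrm{sign}(w_i) \text{ if } w_i \neq 0, \;\; g_i \in [-1, 1] \text{ if } w_i = 0 \Big\}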

Page 19:

Optimality conditions

The Training problem
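The condition here is presumably the standard one for the composite problem \min_w L(w) + R(w), with L differentiable and R convex:

w^* \text{ is a minimizer} \iff 0 \in \nabla L(w^*) + \partial R(w^*)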

Page 20:

Working example: Lasso

Lasso

A difficult inclusion; we solve it iteratively.
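For the Lasso \min_w \frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1 (with this choice of scaling, ours), the inclusion reads:

0 \in \frac{1}{n} X^\top (X w^* - y) + \lambda \, \partial \|w^*\|_1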

Page 24:

Proximal method I

The w that minimizes the upper bound gives gradient descent.

But what about R(w)? Adding R(w) to the upper bound:

Can we minimize the right-hand side?
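The bound in question is presumably the standard smoothness upper bound. Since L is L-smooth,

L(w) \le L(y) + \langle \nabla L(y), w - y \rangle + \frac{L}{2} \|w - y\|^2 \quad \text{for all } w, y.

Minimizing the right-hand side in w gives w = y - \frac{1}{L} \nabla L(y), a gradient step. Adding R(w) to both sides:

L(w) + R(w) \le L(y) + \langle \nabla L(y), w - y \rangle + \frac{L}{2} \|w - y\|^2 + R(w).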

Page 25:

Proximal method: iteratively minimizes an upper bound

Minimizing the right-hand side of the bound above:

Page 27:

Proximal method: an upper-bound minimization viewpoint

Set y = w^t and minimize the right-hand side in w.

This suggests an iterative method.

What is this prox operator?
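Minimizing the right-hand side of the bound (completing the square) yields the update and the definition of the proximal operator, in the standard notation:

w^{t+1} = \mathrm{prox}_{\frac{1}{L} R}\Big( w^t - \tfrac{1}{L} \nabla L(w^t) \Big),
\qquad
\mathrm{prox}_{\gamma R}(y) := \arg\min_w \Big\{ R(w) + \frac{1}{2\gamma} \|w - y\|^2 \Big\}.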

Page 28:

EXE: Let

Show that

Gradient Descent using the proximal map

A gradient step is also a proximal step.
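One standard reading of this exercise (the exact statement is lost): take R \equiv 0. Then

\mathrm{prox}_{\gamma \cdot 0}(y) = \arg\min_w \frac{1}{2\gamma} \|w - y\|^2 = y,

so the proximal gradient update w^{t+1} = \mathrm{prox}_{\gamma R}(w^t - \gamma \nabla L(w^t)) reduces to the gradient step w^{t+1} = w^t - \gamma \nabla L(w^t).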

Page 31:

Proximal Operator: a well-defined inclusion

EXE: Is this proximal operator well defined? Is it even a function?

Rearranging:
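A sketch of the answer: the prox objective w \mapsto R(w) + \frac{1}{2\gamma}\|w - y\|^2 is strongly convex, so it has exactly one minimizer and the prox is a well-defined function. Its optimality condition, rearranged:

0 \in \partial R(w) + \frac{1}{\gamma}(w - y)
\iff y \in w + \gamma \, \partial R(w)
\iff w = (I + \gamma \, \partial R)^{-1}(y) = \mathrm{prox}_{\gamma R}(y).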

Page 37:

Proximal Method: A fixed point viewpoint

The Training problem

Optimal is a fixed point.

Upper-bound viewpoint
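The chain of equivalences behind "optimal is a fixed point", in the standard form:

0 \in \nabla L(w^*) + \partial R(w^*)
\iff w^* - \gamma \nabla L(w^*) \in w^* + \gamma \, \partial R(w^*)
\iff w^* = \mathrm{prox}_{\gamma R}\big( w^* - \gamma \nabla L(w^*) \big),

so the minimizers of L + R are exactly the fixed points of the proximal gradient map.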

Page 38:

Proximal Operator: Properties

Exe:

Page 39:

Proximal Operator: Soft thresholding

Exe:
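The exercise is presumably to compute the prox of the \ell_1 norm; the standard result is soft thresholding, \mathrm{prox}_{\gamma \|\cdot\|_1}(y)_i = \mathrm{sign}(y_i) \max(|y_i| - \gamma, 0). A minimal NumPy sketch (function name is ours):

import numpy as np

def soft_threshold(y, gamma):
    # Prox of gamma * ||.||_1: shrink each coordinate towards zero by gamma.
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)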

Page 40:

Proximal Operator: Singular value thresholding

Similarly, the prox of the nuclear norm for matrices:
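A minimal NumPy sketch (name is ours): soft-threshold the singular values and rebuild the matrix.

import numpy as np

def singular_value_threshold(Y, gamma):
    # Prox of gamma * nuclear norm: SVD, shrink singular values, reassemble.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt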

Page 41:

Proximal method: iteratively minimizes an upper bound

Minimizing the right-hand side of the bound above:

Make an iterative method based on this upper-bound minimization.

Page 42:

The Proximal Gradient Method

w^{t+1} = \mathrm{prox}_{\frac{1}{L} R}\Big( w^t - \tfrac{1}{L} \nabla L(w^t) \Big)

Page 43:

Example of prox gradient: the Iterative Soft Thresholding Algorithm (ISTA)

Lasso

Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems".

ISTA:
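A minimal sketch of ISTA for the Lasso \min_w \frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1, reusing the soft_threshold helper above; the step size is 1/L with L = \|X\|_2^2 / n, the smoothness constant of the quadratic part (scalings and names are ours):

import numpy as np

def ista(X, y, lam, n_iter=500):
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # smoothness constant of the loss
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)   # prox step on the l1 term
    return w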

Page 45:

Convergence of Prox-GD

Theorem (Beck and Teboulle 2009): with step size 1/L,

F(w^t) - F(w^*) \le \frac{L \|w^0 - w^*\|^2}{2t},

where F = L + R and w^* is a minimizer of F.

Can we do better?

Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems".

Page 47:

The FISTA Method

Weird, but it works.
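A minimal sketch of FISTA for the same Lasso objective, reusing soft_threshold (names and scalings are ours): a proximal gradient step taken at an extrapolated point, plus the "weird" momentum update for t.

import numpy as np

def fista(X, y, lam, n_iter=500):
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    w = np.zeros(d)
    z = w.copy()   # extrapolated point
    t = 1.0        # momentum scalar
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n
        w_next = soft_threshold(z - grad / L, lam / L)    # prox-gradient step at z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum extrapolation
        w, t = w_next, t_next
    return w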

Page 49:

Convergence of FISTA

Theorem (Beck and Teboulle 2009): with step size 1/L,

F(w^t) - F(w^*) \le \frac{2 L \|w^0 - w^*\|^2}{(t+1)^2},

where the w^t are given by the FISTA algorithm and F = L + R.

Is this as good as it gets?

Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems".

Page 51:

Lab Session 30.09

Rooms C129 and C130

Bring your laptop. Please install: Python, matplotlib, scipy and numpy.

Page 52:

Introduction to Stochastic Gradient Descent

Page 53:

Recap

Training Problem: two parts, L(w) + R(w).

General methods:

● Gradient Descent
● Proximal gradient (ISTA)
● Fast proximal gradient (FISTA)

Page 54:

A Datum Function

Finite Sum Training Problem

Optimization of a Sum of Terms

Can we use this sum structure?

Page 56:

The Training Problem

Page 60:

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased estimate: let j be a random index sampled uniformly at random from {1, …, n}. Then

\mathbb{E}_j\big[ \nabla f_j(w) \big] = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w) = \nabla f(w).

EXE:

Page 62:

Stochastic Gradient Descent

w^{t+1} = w^t - \alpha \nabla f_j(w^t), \quad j \text{ sampled uniformly from } \{1, \dots, n\}

[Plot] Iterates approaching the optimal point.
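A minimal sketch of the update as code; grad_fi is a hypothetical callback returning \nabla f_i(w) for datum i (all names are ours):

import numpy as np

def sgd(grad_fi, w0, n, alpha=0.1, n_iter=1000, seed=0):
    # At each step, follow the gradient of one uniformly sampled datum function.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_iter):
        j = rng.integers(n)             # sample j uniformly from {0, ..., n-1}
        w = w - alpha * grad_fi(w, j)   # unbiased stochastic gradient step
    return w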

Page 66:

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients
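The two assumptions, in the form they are usually stated (\mu and B are our symbols):

\text{Strong convexity:} \quad f(z) \ge f(w) + \langle \nabla f(w), z - w \rangle + \frac{\mu}{2} \|z - w\|^2 \quad \text{for all } w, z

\text{Expected bounded stochastic gradients:} \quad \mathbb{E}_j\big[ \|\nabla f_j(w)\|^2 \big] \le B^2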

Page 69:

Complexity / Convergence

Theorem: let f be \mu-strongly convex with expected bounded stochastic gradients \mathbb{E}_j \|\nabla f_j(w)\|^2 \le B^2. Then SGD with a constant step size \alpha \le \frac{1}{2\mu} satisfies

\mathbb{E}\big[ \|w^t - w^*\|^2 \big] \le (1 - 2\alpha\mu)^t \, \|w^0 - w^*\|^2 + \frac{\alpha B^2}{2\mu}.

Page 70:

Proof:

Unbiased estimator: taking expectation with respect to j.

Strong convexity.

Bounded stochastic gradients.

Taking total expectation.
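A sketch of the algebra these labels annotate (standard; symbols as above):

\|w^{t+1} - w^*\|^2 = \|w^t - w^*\|^2 - 2\alpha \langle \nabla f_j(w^t), w^t - w^* \rangle + \alpha^2 \|\nabla f_j(w^t)\|^2.

Take \mathbb{E}_j and use unbiasedness, \mathbb{E}_j[\nabla f_j(w^t)] = \nabla f(w^t). Strong convexity gives \langle \nabla f(w^t), w^t - w^* \rangle \ge f(w^t) - f(w^*) + \frac{\mu}{2}\|w^t - w^*\|^2 \ge \mu \|w^t - w^*\|^2, and the bounded stochastic gradient assumption gives \mathbb{E}_j \|\nabla f_j(w^t)\|^2 \le B^2, so

\mathbb{E}_j \|w^{t+1} - w^*\|^2 \le (1 - 2\alpha\mu) \|w^t - w^*\|^2 + \alpha^2 B^2.

Taking the total expectation and unrolling, with \sum_{k=0}^{t-1} (1 - 2\alpha\mu)^k \le \frac{1}{2\alpha\mu}, gives the rate in the theorem.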

Pages 71-74:

[Plots] Stochastic Gradient Descent with step sizes α = 0.01, 0.1, 0.2, 0.5: larger steps contract faster but settle in a larger neighborhood of the optimum, matching the αB²/(2μ) term in the theorem.

Page 77:

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients