
Page 1:

Optimization for Datascience

Proximal operator and proximal gradient methods

Lecturer: Robert M. Gower & Alexandre Gramfort

Tutorials: Quentin Bertrand, Nidham Gazagnadou

Master 2 Data Science, Institut Polytechnique de Paris (IPP)

Page 2:

References for today's class

Amir Beck and Marc Teboulle (2009), "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems", SIAM J. Imaging Sciences.

Sébastien Bubeck (2015), "Convex Optimization: Algorithms and Complexity", Chapter 1 and Section 5.1.

Page 3:

A Datum Function

Finite Sum Training Problem

Optimization of a Sum of Terms
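The formulas on this slide are presumably the standard finite-sum setup, where f_i is the datum function, i.e. the loss of the model w on the i-th example:

\min_{w \in \mathbb{R}^d} \; f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w)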

Page 4:

The Training Problem

Page 7:

Convergence GD I

Theorem: Let f be convex and L-smooth. Then gradient descent with step size 1/L satisfies

f(w^t) - f(w^*) \le \frac{L \|w^0 - w^*\|^2}{2t},

where w^* is a minimizer of f.

Is f always differentiable? Not true for many problems.

Page 8:

Change of notation: keep the loss and the regularizer separate

The Training problem: \min_w F(w) := L(w) + R(w), with loss function L(w) and regularizer R(w).

If L or R is not differentiable, then L + R is (in general) not differentiable.

If L or R is not smooth, then L + R is not smooth.

Page 13:

Non-smooth Example

The problem

Does not fit. Not smooth.

Need more tools.

Page 14:

Assumptions for this class

The Training problem

What does this mean?

Assume this is easy to solve

Page 15:

Examples

SVM with soft margin

Lasso

Low Rank Matrix Recovery

These are not smooth, but the prox is easy.
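The objectives themselves are not shown; their usual formulations are, as a sketch (with regularization parameter \lambda > 0, data (x_i, y_i), design matrix X, and a linear measurement map \mathcal{A}, all our notation):

\text{SVM with soft margin:} \quad \min_w \; \frac{1}{n} \sum_{i=1}^n \max\big(0,\, 1 - y_i \langle x_i, w \rangle\big) + \frac{\lambda}{2} \|w\|_2^2

\text{Lasso:} \quad \min_w \; \frac{1}{2n} \|Xw - y\|_2^2 + \lambda \|w\|_1

\text{Low rank matrix recovery:} \quad \min_W \; \frac{1}{2} \|\mathcal{A}(W) - b\|_2^2 + \lambda \|W\|_* \quad (\text{nuclear norm})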

Page 17:

Convexity: Subgradient

g is a subgradient of R at w if R(z) \ge R(w) + \langle g, z - w \rangle for all z; the set of all subgradients at w is the subdifferential \partial R(w).

g = 0: a zero subgradient certifies a global minimizer, i.e. 0 \in \partial R(w) if and only if w minimizes R.

Page 18:

Examples: L1 norm
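The example is presumably the standard subdifferential of the \ell_1 norm, computed coordinate-wise:

\partial \|w\|_1 = \Big\{ g \in \mathbb{R}^d : g_i = \mathrm{sign}(w_i) \text{ if } w_i \neq 0, \;\; g_i \in [-1, 1] \text{ if } w_i = 0 \Big\}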

Page 19:

Optimality conditions

The Training problem
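The condition here is presumably the standard one for the composite problem \min_w L(w) + R(w), with L differentiable and R convex:

w^* \text{ is a minimizer} \iff 0 \in \nabla L(w^*) + \partial R(w^*)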

Page 20:

Working example: Lasso

Lasso

A difficult inclusion; we solve it iteratively.
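For the Lasso \min_w \frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1 (with this choice of scaling, ours), the inclusion reads:

0 \in \frac{1}{n} X^\top (X w^* - y) + \lambda \, \partial \|w^*\|_1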

Page 24:

Proximal method I

The w that minimizes the upper bound gives gradient descent.

But what about R(w)? Adding R(w) to the upper bound:

Can we minimize the right-hand side?
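The bound in question is presumably the standard smoothness upper bound. Since L is L-smooth,

L(w) \le L(y) + \langle \nabla L(y), w - y \rangle + \frac{L}{2} \|w - y\|^2 \quad \text{for all } w, y.

Minimizing the right-hand side in w gives w = y - \frac{1}{L} \nabla L(y), a gradient step. Adding R(w) to both sides:

L(w) + R(w) \le L(y) + \langle \nabla L(y), w - y \rangle + \frac{L}{2} \|w - y\|^2 + R(w).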

Page 25:

Proximal method: iteratively minimizes an upper bound

Minimizing the right-hand side of the bound above:

Page 27:

Proximal method: an upper-bound minimization viewpoint

Set y = w^t and minimize the right-hand side in w.

This suggests an iterative method.

What is this prox operator?
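Minimizing the right-hand side of the bound (completing the square) yields the update and the definition of the proximal operator, in the standard notation:

w^{t+1} = \mathrm{prox}_{\frac{1}{L} R}\Big( w^t - \tfrac{1}{L} \nabla L(w^t) \Big),
\qquad
\mathrm{prox}_{\gamma R}(y) := \arg\min_w \Big\{ R(w) + \frac{1}{2\gamma} \|w - y\|^2 \Big\}.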

Page 28:

EXE: Let

Show that

Gradient Descent using the proximal map

A gradient step is also a proximal step.
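One standard reading of this exercise (the exact statement is lost): take R \equiv 0. Then

\mathrm{prox}_{\gamma \cdot 0}(y) = \arg\min_w \frac{1}{2\gamma} \|w - y\|^2 = y,

so the proximal gradient update w^{t+1} = \mathrm{prox}_{\gamma R}(w^t - \gamma \nabla L(w^t)) reduces to the gradient step w^{t+1} = w^t - \gamma \nabla L(w^t).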

Page 31:

Proximal Operator: a well-defined inclusion

EXE: Is this proximal operator well defined? Is it even a function?

Rearranging:
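A sketch of the answer: the prox objective w \mapsto R(w) + \frac{1}{2\gamma}\|w - y\|^2 is strongly convex, so it has exactly one minimizer and the prox is a well-defined function. Its optimality condition, rearranged:

0 \in \partial R(w) + \frac{1}{\gamma}(w - y)
\iff y \in w + \gamma \, \partial R(w)
\iff w = (I + \gamma \, \partial R)^{-1}(y) = \mathrm{prox}_{\gamma R}(y).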

Page 37:

Proximal Method: A fixed point viewpoint

The Training problem

Optimal is a fixed point.

Upper-bound viewpoint
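The chain of equivalences behind "optimal is a fixed point", in the standard form:

0 \in \nabla L(w^*) + \partial R(w^*)
\iff w^* - \gamma \nabla L(w^*) \in w^* + \gamma \, \partial R(w^*)
\iff w^* = \mathrm{prox}_{\gamma R}\big( w^* - \gamma \nabla L(w^*) \big),

so the minimizers of L + R are exactly the fixed points of the proximal gradient map.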

Page 38:

Proximal Operator: Properties

Exe:

Page 39:

Proximal Operator: Soft thresholding

Exe:
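The exercise is presumably to compute the prox of the \ell_1 norm; the standard result is soft thresholding, \mathrm{prox}_{\gamma \|\cdot\|_1}(y)_i = \mathrm{sign}(y_i) \max(|y_i| - \gamma, 0). A minimal NumPy sketch (function name is ours):

import numpy as np

def soft_threshold(y, gamma):
    # Prox of gamma * ||.||_1: shrink each coordinate towards zero by gamma.
    return np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)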

Page 40:

Proximal Operator: Singular value thresholding

Similarly, the prox of the nuclear norm for matrices:
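A minimal NumPy sketch (name is ours): soft-threshold the singular values and rebuild the matrix.

import numpy as np

def singular_value_threshold(Y, gamma):
    # Prox of gamma * nuclear norm: SVD, shrink singular values, reassemble.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt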

Page 41:

Proximal method: iteratively minimizes an upper bound

Minimizing the right-hand side of the bound above:

Make an iterative method based on this upper-bound minimization.

Page 42:

The Proximal Gradient Method

w^{t+1} = \mathrm{prox}_{\frac{1}{L} R}\Big( w^t - \tfrac{1}{L} \nabla L(w^t) \Big)

Page 43:

Example of prox gradient: the Iterative Soft Thresholding Algorithm (ISTA)

Lasso

Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems".

ISTA:
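A minimal sketch of ISTA for the Lasso \min_w \frac{1}{2n}\|Xw - y\|_2^2 + \lambda \|w\|_1, reusing the soft_threshold helper above; the step size is 1/L with L = \|X\|_2^2 / n, the smoothness constant of the quadratic part (scalings and names are ours):

import numpy as np

def ista(X, y, lam, n_iter=500):
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # smoothness constant of the loss
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)   # prox step on the l1 term
    return w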

Page 45:

Convergence of Prox-GD

Theorem (Beck and Teboulle 2009): with step size 1/L,

F(w^t) - F(w^*) \le \frac{L \|w^0 - w^*\|^2}{2t},

where F = L + R and w^* is a minimizer of F.

Can we do better?

Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems".

Page 47:

The FISTA Method

Weird, but it works.
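A minimal sketch of FISTA for the same Lasso objective, reusing soft_threshold (names and scalings are ours): a proximal gradient step taken at an extrapolated point, plus the "weird" momentum update for t.

import numpy as np

def fista(X, y, lam, n_iter=500):
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    w = np.zeros(d)
    z = w.copy()   # extrapolated point
    t = 1.0        # momentum scalar
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n
        w_next = soft_threshold(z - grad / L, lam / L)    # prox-gradient step at z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum extrapolation
        w, t = w_next, t_next
    return w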

Page 49:

Convergence of FISTA

Theorem (Beck and Teboulle 2009): with step size 1/L,

F(w^t) - F(w^*) \le \frac{2 L \|w^0 - w^*\|^2}{(t+1)^2},

where the w^t are given by the FISTA algorithm and F = L + R.

Is this as good as it gets?

Amir Beck and Marc Teboulle (2009), SIAM J. Imaging Sciences, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems".

Page 51:

Lab Session 30.09

Rooms C129 and C130

Bring your laptop. Please install: Python, matplotlib, scipy and numpy.

Page 52:

Introduction to Stochastic Gradient Descent

Page 53:

Recap

Training Problem: two parts, L(w) + R(w).

General methods:

● Gradient Descent
● Proximal gradient (ISTA)
● Fast proximal gradient (FISTA)

Page 54:

A Datum Function

Finite Sum Training Problem

Optimization of a Sum of Terms

Can we use this sum structure?

Page 56:

The Training Problem

Page 60:

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased estimate: let j be a random index sampled uniformly at random from {1, …, n}. Then

\mathbb{E}_j\big[ \nabla f_j(w) \big] = \frac{1}{n} \sum_{i=1}^n \nabla f_i(w) = \nabla f(w).

EXE:

Page 62:

Stochastic Gradient Descent

w^{t+1} = w^t - \alpha \nabla f_j(w^t), \quad j \text{ sampled uniformly from } \{1, \dots, n\}

[Plot] Iterates approaching the optimal point.
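A minimal sketch of the update as code; grad_fi is a hypothetical callback returning \nabla f_i(w) for datum i (all names are ours):

import numpy as np

def sgd(grad_fi, w0, n, alpha=0.1, n_iter=1000, seed=0):
    # At each step, follow the gradient of one uniformly sampled datum function.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(n_iter):
        j = rng.integers(n)             # sample j uniformly from {0, ..., n-1}
        w = w - alpha * grad_fi(w, j)   # unbiased stochastic gradient step
    return w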

Page 66:

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients
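The two assumptions, in the form they are usually stated (\mu and B are our symbols):

\text{Strong convexity:} \quad f(z) \ge f(w) + \langle \nabla f(w), z - w \rangle + \frac{\mu}{2} \|z - w\|^2 \quad \text{for all } w, z

\text{Expected bounded stochastic gradients:} \quad \mathbb{E}_j\big[ \|\nabla f_j(w)\|^2 \big] \le B^2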

Page 69:

Complexity / Convergence

Theorem: let f be \mu-strongly convex with expected bounded stochastic gradients \mathbb{E}_j \|\nabla f_j(w)\|^2 \le B^2. Then SGD with a constant step size \alpha \le \frac{1}{2\mu} satisfies

\mathbb{E}\big[ \|w^t - w^*\|^2 \big] \le (1 - 2\alpha\mu)^t \, \|w^0 - w^*\|^2 + \frac{\alpha B^2}{2\mu}.

Page 70:

Proof:

Unbiased estimator: taking expectation with respect to j.

Strong convexity.

Bounded stochastic gradients.

Taking total expectation.
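A sketch of the algebra these labels annotate (standard; symbols as above):

\|w^{t+1} - w^*\|^2 = \|w^t - w^*\|^2 - 2\alpha \langle \nabla f_j(w^t), w^t - w^* \rangle + \alpha^2 \|\nabla f_j(w^t)\|^2.

Take \mathbb{E}_j and use unbiasedness, \mathbb{E}_j[\nabla f_j(w^t)] = \nabla f(w^t). Strong convexity gives \langle \nabla f(w^t), w^t - w^* \rangle \ge f(w^t) - f(w^*) + \frac{\mu}{2}\|w^t - w^*\|^2 \ge \mu \|w^t - w^*\|^2, and the bounded stochastic gradient assumption gives \mathbb{E}_j \|\nabla f_j(w^t)\|^2 \le B^2, so

\mathbb{E}_j \|w^{t+1} - w^*\|^2 \le (1 - 2\alpha\mu) \|w^t - w^*\|^2 + \alpha^2 B^2.

Taking the total expectation and unrolling, with \sum_{k=0}^{t-1} (1 - 2\alpha\mu)^k \le \frac{1}{2\alpha\mu}, gives the rate in the theorem.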

Pages 71-74:

[Plots] Stochastic Gradient Descent with step sizes α = 0.01, 0.1, 0.2, 0.5: larger steps contract faster but settle in a larger neighborhood of the optimum, matching the αB²/(2μ) term in the theorem.

Page 77:

Assumptions for Convergence

Strong Convexity

Expected Bounded Stochastic Gradients