
Page 1: Optimization for Machine Learning - Approximation Theory

Optimization for Machine Learning: Approximation Theory and Machine Learning

Sven Leyffer

Argonne National Laboratory

September 30, 2018

Page 2: Optimization for Machine Learning - Approximation Theory

Outline

1 Data Analysis at DOE Light Sources

2 Optimization for Machine Learning

3 Mixed-Integer Nonlinear Optimization
  Optimal Symbolic Regression
  Deep Neural Nets as MIPs
  Sparse Support-Vector Machines

4 Robust Optimization
  Robust Optimization for SVMs

5 Conclusions and Extensions


Page 3: Optimization for Machine Learning - Approximation Theory

Motivation: Datanami from DOE Lightsource Upgrades

Data size and speed to outpace Moore’s law (source: Ian Foster)


Page 4: Optimization for Machine Learning - Approximation Theory

Challenges at DOE Lightsources

Math, Stats, and CS Challenges from APS Upgrade

10x increase in data rates and size ⇒ HPC & CS

Heterogeneous experiments & requirements ⇒ hotchpotch of math/CS solutions

Multi-modal data analysis, movies, ... ⇒ more complex reconstruction

New experimental design ⇒ less regular data


Page 5: Optimization for Machine Learning - Approximation Theory

Example: Learning Cell Identification from Spectral Data

Identify cell-type from concentration maps of P, Mn, Fe, Zn, ...

Page 6: Optimization for Machine Learning - Approximation Theory

Learning Cell Identification via Nonnegative Matrix Factorization

minimize_{W,H}  ‖A − WH‖²_F   subject to  W ≥ 0, H ≥ 0

where "data" A is a 1,000 × 1,000 image × 2,000 channels

W are weights ≈ additive elemental spectra

H are images ≈ additive elemental maps

Solve using (cheap) gradient steps ... need good initialization of W!
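For concreteness, here is a minimal Python sketch of this factorization using scikit-learn's NMF on synthetic data; the shapes, component count, and the NNDSVD-based initialization (standing in for the "good initialization of W") are illustrative assumptions, not the actual APS pipeline.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# toy stand-in for A: rows = spectral channels, columns = pixels
# (the real data cube is roughly 2,000 channels x 1,000,000 pixels)
A = np.abs(rng.standard_normal((200, 500)))

model = NMF(n_components=8, init="nndsvda", solver="mu",
            beta_loss="frobenius", max_iter=500)
W = model.fit_transform(A)   # W >= 0: columns ~ additive elemental spectra
H = model.components_        # H >= 0: rows ~ additive elemental maps
print(np.linalg.norm(A - W @ H, "fro"))   # factorization residual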

Insight from Data

Repeat analysis hundreds of times to, e.g., classify/identify cancerous cells etc.


Page 7: Optimization for Machine Learning - Approximation Theory

Result: Learning Cell Identification from Spectral Data

Raw data ... identify cell ... classify cells

Traditional Cell Identification at APS

Ask student/postdoc to “mark” potential cell locations by hand & test

Opportunities for Applied Math & CS at Light Sources

ML plus physical/statistical models, large-scale streaming data, ...


Page 8: Optimization for Machine Learning - Approximation Theory

Outline

1 Data Analysis at DOE Light Sources

2 Optimization for Machine Learning

3 Mixed-Integer Nonlinear Optimization
  Optimal Symbolic Regression
  Deep Neural Nets as MIPs
  Sparse Support-Vector Machines

4 Robust Optimization
  Robust Optimization for SVMs

5 Conclusions and Extensions


Page 9: Optimization for Machine Learning - Approximation Theory

Optimization for Machine Learning [Sra, Nowozin, & Wright (eds.)]

Convexity & Sparsity-Inducing Norms

Nonsmooth Optimization: Gradient, Subgradient & Proximal Methods

Newton & Interior-Point Methods for ML

Cutting-Plane Methods in ML

Augmented Lagrangian Methods & ADMM

Uncertainty & Robust optimization in ML

(Inverse) Covariance Selection

Important Argonne Legalese Disclaimer

I made zero contributions to this fantastic book!
Worse: Until yesterday, I had no clue about this!!!



Page 12: Optimization for Machine Learning - Approximation Theory

The Four Lands of Learning [Moritz Hardt, UC Berkeley]

Non-Convex Non-Optimization (2018 INFORMS Optimization Conference)

Convexico Gradientina

https://mrtz.org/gradientina.html#/



Page 15: Optimization for Machine Learning - Approximation Theory

The Four Lands of Learning [Moritz Hardt, UC Berkeley]

Non-Convex Non-Optimization (2018 INFORMS Optimization Conference)

Optopia

Messiland

https://mrtz.org/gradientina.html#/



Page 17: Optimization for Machine Learning - Approximation Theory

Outline

1 Data Analysis at DOE Light Sources

2 Optimization for Machine Learning

3 Mixed-Integer Nonlinear Optimization
  Optimal Symbolic Regression
  Deep Neural Nets as MIPs
  Sparse Support-Vector Machines

4 Robust Optimization
  Robust Optimization for SVMs

5 Conclusions and Extensions


Page 18: Optimization for Machine Learning - Approximation Theory

Mixed-Integer Nonlinear Optimization

Mixed-Integer Nonlinear Program (MINLP)

minimize_x  f(x)
subject to  c(x) ≤ 0
            x ∈ X
            x_i ∈ Z for all i ∈ I

... see the survey [Belotti et al., 2013]

[Figure: branch-and-bound tree with nodes marked feasible, infeasible, integer, and dominated by the UBD; branching on x = 0 / x = 1]

X bounded polyhedral set, e.g., X = {x : l ≤ A^T x ≤ u}

f: R^n → R and c: R^n → R^m twice continuously differentiable (maybe convex)

I ⊂ {1, ..., n} subset of integer variables

MINLPs are NP-hard, see [Kannan and Monma, 1978]

Worse: MINLPs are undecidable, see [Jeroslow, 1973]
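A toy instance of this general form, written as a Pyomo sketch; the problem data are made up, and the call to Bonmin simply assumes some MINLP solver is installed (SCIP, Couenne, etc. would serve equally well).

import pyomo.environ as pyo

m = pyo.ConcreteModel()
m.x = pyo.Var(bounds=(0, 10))                       # continuous variable
m.z = pyo.Var(domain=pyo.Integers, bounds=(0, 5))   # integer variable, i in I

m.obj = pyo.Objective(expr=(m.x - 2.5)**2 + (m.z - 1.3)**2, sense=pyo.minimize)  # f(x)
m.c1 = pyo.Constraint(expr=m.x**2 + m.z**2 <= 9)    # nonlinear constraint c(x) <= 0
m.c2 = pyo.Constraint(expr=m.x + m.z >= 1)          # part of the polyhedral set X

pyo.SolverFactory("bonmin").solve(m)                # any MINLP solver
print(pyo.value(m.x), pyo.value(m.z))               # expect roughly x = 2.5, z = 1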


Page 19: Optimization for Machine Learning - Approximation Theory

Optimal Symbolic Regression

Goal in Optimal Symbolic Regression

Find a symbolic mathematical expression that explains the dependent variable in terms of the independent variables without assuming a functional form!

[Austel et al., 2017] propose MINLP model

Find simplest symbolic mathematical expression ... objective

Constrain the “grammar” of expressions ... constraints

Match data (observations) to expression ... continuous variables

Select “best” possible expression ... binary variables

... model mathematical expressions as a directed acyclic graph (DAG)


Page 20: Optimization for Machine Learning - Approximation Theory

Factorable Functions and Expression Trees

Definition (Factorable Function)

f(x) is factorable iff it can be expressed as a sum of products of unary functions from a finite set O_unary = {sin, cos, exp, log, |·|} whose arguments are variables, constants, or other functions that are themselves factorable.

Combination of functions from the set of operators O = {+, ×, /, ^, sin, cos, exp, log, |·|}

Excludes integrals ∫_{x₀}^{x} h(ξ) dξ and black-box functions

Can be represented as expression trees

Forms the basis for automatic differentiation & global optimization of nonconvex functions
... see, e.g., [Gebremedhin et al., 2005]

Expression Tree

[Figure: expression tree with nodes +, *, log, ^ for the function below]

f(x₁, x₂) = x₁ log(x₂) + x₂³
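To make the expression-tree idea concrete, a small Python sketch that represents and evaluates f(x₁, x₂) = x₁ log(x₂) + x₂³ as a tree of operator nodes; the class names are purely illustrative.

import math

class Var:
    def __init__(self, i): self.i = i
    def eval(self, x): return x[self.i]

class Const:
    def __init__(self, v): self.v = v
    def eval(self, x): return self.v

class Op:
    FUNCS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b,
             "^": lambda a, b: a ** b, "log": lambda a: math.log(a)}
    def __init__(self, name, *kids): self.name, self.kids = name, kids
    def eval(self, x):
        return Op.FUNCS[self.name](*(k.eval(x) for k in self.kids))

# f(x1, x2) = x1*log(x2) + x2^3 as the tree (+ (* x1 (log x2)) (^ x2 3))
x1, x2 = Var(0), Var(1)
f = Op("+", Op("*", x1, Op("log", x2)), Op("^", x2, Const(3)))
print(f.eval([2.0, 3.0]))   # 2*log(3) + 27, about 29.2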


Page 21: Optimization for Machine Learning - Approximation Theory

Optimal Symbolic Regression [Austel et al., 2017]

Build and solve optimal symbolic regression as MINLP

Form “supertree” of all possible expression trees

Use binary variables to switch parts of tree on/off

Compute data mismatch by propagating data values through tree

Minimize complexity (size) of expression tree with bound on data mismatch

⇒ large nonconvex MINLP model ... solved using Baron, SCIP, Couenne

Example: Kepler's law of planetary motion from NASA data with depth 3

Data    2% Noise      10% Noise       30% Noise
Ex1     ∛(cτ²M)       ∛(τ²(M + c))    √(cτ²)
Ex2     ∛(cτ²M)       ∛(τ²c)          √τ
Ex3     ∛(cτ²M)       ∛(τM) + τ       √(cτ) + c

Correct answer: d = ∛(cτ²(M + m)), the major semi-axis of m orbiting M at period τ


Page 22: Optimization for Machine Learning - Approximation Theory

Deep Neural Nets (DNNs) as MIPs [Fischetti and Jo, 2018]

Model DNN as MIP

Model ReLU activation function with binary variables

Model output of DNN as function of inputs (variable!)

Solvable for DNNs of moderate size with MIP solvers

[Image from Arden Dertad]

WARNING: Do not use for training of DNN!

MIP-model is totally unsuitable for training ... cumbersome & expensive to evaluate!

Where can we use MIP models?

Use MIP for building adversarial examples that fool the DNN ... flexible!



Page 25: Optimization for Machine Learning - Approximation Theory

Deep Neural Nets (DNNs) as MIPs [Fischetti and Jo, 2018]

DNN with K + 1 layers: layer 0 = input, ..., layer K = output

n_k nodes/units per layer; UNIT(j,k) has output x^k_j

UNIT(j,k), e.g., ReLU: x^k = max(0, W^{k−1} x^{k−1} + b^{k−1}),
where W^k, b^k are known DNN parameters (from training)

Key Insight (not new): Use Implication Constraints!

Model x = max(0, w^T y + b) using implications, or binary variables:

x = max(0, w^T y + b)  ⇔  { w^T y + b = x − s,  x ≥ 0,  s ≥ 0,
                            z ∈ {0, 1},  with  z = 1 ⇒ x ≤ 0  and  z = 0 ⇒ s ≤ 0 }

... alternative: 0 ≤ s ⊥ x ≥ 0 complementarity constraint

Also model MaxPool x = max(y_1, ..., y_t) using t binary vars & an SOS-1 constraint
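A minimal PuLP sketch of the binary/big-M model of a single ReLU unit x = max(0, w^T y + b); the weights, the fixed input, and the bounds M_x, M_s are made-up illustrative values, and the bundled CBC solver is assumed.

import pulp

w, b = [1.0, -2.0], 0.5
y0 = [3.0, 2.0]                    # fixed input, so w^T y0 + b = -0.5
Mx, Ms = 50.0, 50.0                # assumed valid upper bounds on x and s

prob = pulp.LpProblem("relu_unit", pulp.LpMinimize)
x = pulp.LpVariable("x", lowBound=0)
s = pulp.LpVariable("s", lowBound=0)
z = pulp.LpVariable("z", cat="Binary")

prob += x                          # dummy objective; the constraints force x anyway
prob += x - s == sum(wi * yi for wi, yi in zip(w, y0)) + b   # w^T y + b = x - s
prob += x <= Mx * (1 - z)          # z = 1  =>  x <= 0
prob += s <= Ms * z                # z = 0  =>  s <= 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(x), pulp.value(s), pulp.value(z))   # expect x = 0, s = 0.5, z = 1

For an input with w^T y + b > 0 the same constraints force z = 0, s = 0 and x equal to the pre-activation, which is exactly the ReLU.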


Page 26: Optimization for Machine Learning - Approximation Theory

Deep Neural Nets (DNNs) as MIPs [Fischetti and Jo, 2018]

Gives an MIP model with a flexible objective (DNN outputs x^K, binary vars z):

minimize_{x,s,z}  c^T x + d^T z
subject to  (w^{k−1}_j)^T x^{k−1} + b^{k−1}_j = x^k_j − s^k_j,   x^k_j, s^k_j ≥ 0
            z^k_j ∈ {0, 1},  with  z^k_j = 1 ⇒ x^k_j ≤ 0  and  z^k_j = 0 ⇒ s^k_j ≤ 0
            l^0 ≤ x^0 ≤ u^0

... for a given input x^0, even just computing the output x^K this way is expensive!

Modeling Implication Constraints

z ∈ {0, 1}, with z = 1 ⇒ x ≤ 0 and z = 0 ⇒ s ≤ 0
⇔  z ∈ {0, 1}, with x ≤ M_x (1 − z) and s ≤ M_s z

Use MIP for Building Adversarial Examples

Fix weights W, b from training data

Find smallest perturbation to inputs x^0 that results in misclassification
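Spelled out (a hypothetical instance, with the 20% activation margin borrowed from the digit example on the next slide, x̄⁰ the given input and "target" the desired wrong label), the adversarial-example MIP looks like:

minimize_{x,s,z}  ‖x⁰ − x̄⁰‖₁
subject to  the ReLU MIP constraints above for layers 1, ..., K (W, b fixed)
            x^K_target ≥ 1.2 x^K_j  for all output units j ≠ target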

Page 27: Optimization for Machine Learning - Approximation Theory

Deep Neural Nets (DNNs) as MIPs [Fischetti and Jo, 2018]

Example: DNN for digit classification as MIP

Misclassify all digits: d → (d + 5) mod 10, i.e., 0 → 5, 1 → 6, ...

Require that the activation of the "wrong" digit in the final layer is 20% above the others

Need tight bounds M_x, M_s in the implications: propagate bounds forward through the DNN

Results with CPLEX solver and tight bounds (300 s max CPU)

# Hidden   # Nodes/layer   % Solved   # B&B Nodes   CPU (s)
3          8               100            552         0.6
4          20/8            100         20,309        12.1
5          20/10            67         76,714       171.1
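A minimal numpy sketch (with made-up layer sizes) of the forward bound propagation mentioned above: splitting each weight matrix into positive and negative parts turns input intervals into valid pre-activation intervals, from which tight M_x, M_s can be taken.

import numpy as np

def propagate_bounds(weights, biases, l0, u0):
    l, u = np.asarray(l0, float), np.asarray(u0, float)
    pre_bounds = []
    for W, b in zip(weights, biases):
        Wp, Wm = np.maximum(W, 0.0), np.minimum(W, 0.0)
        pre_l = Wp @ l + Wm @ u + b          # smallest possible pre-activation
        pre_u = Wp @ u + Wm @ l + b          # largest possible pre-activation
        pre_bounds.append((pre_l, pre_u))    # take M_x = max(pre_u, 0), M_s = max(-pre_l, 0)
        l, u = np.maximum(pre_l, 0.0), np.maximum(pre_u, 0.0)   # ReLU output bounds
    return pre_bounds

rng = np.random.default_rng(0)               # toy 4-5-3 network on inputs in [0, 1]^4
Ws = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
bs = [rng.standard_normal(5), rng.standard_normal(3)]
print(propagate_bounds(Ws, bs, np.zeros(4), np.ones(4)))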


Page 28: Optimization for Machine Learning - Approximation Theory

Sparse Support-Vector Machines

Standard SVM Training

Data S = {x_i, y_i}_{i=1}^m: features x_i ∈ R^n, labels y_i ∈ {−1, 1};
ξ ≥ 0 slacks, b bias, c > 0 penalty parameter

minimize_{w,b,ξ}  ½‖w‖₂² + c‖ξ‖₁ = ½‖w‖₂² + c 1^T ξ
subject to  Y(Xw − b1) + ξ ≥ 1,  ξ ≥ 0,

where Y = diag(y) and X = [x_1, ..., x_m]^T

Find MINLP Model for Feature Selection in SVMs

Given labeled training data, find a maximum-margin classifier that minimizes hinge-loss and the cardinality of the weight vector, ‖w‖₀



Page 30: Optimization for Machine Learning - Approximation Theory

Sparse Support-Vector Machines

[Guan et al., 2009] consider an ℓ₀-norm penalty on w as an MINLP:

minimize_{w,b,ξ}  ½‖w‖₂² + a‖w‖₀ + c 1^T ξ
subject to  Y(Xw − b1) + ξ ≥ 1,  ξ ≥ 0

Model ℓ₀ with Perspective & Binary z_j Counter

minimize_{u,w,b,ξ,z}  1^T u + a 1^T z + c 1^T ξ
subject to  Y(Xw − b1) + ξ ≥ 1,  ξ ≥ 0
            w_j² ≤ z_j u_j,  u ≥ 0,  z_j ∈ {0, 1}

... conic MIP, see, e.g., [Gunluk and Linderoth, 2008]

... w_j² ≤ z_j u_j violates CQs ⇒ weaker big-M formulation ...

0 ≤ u_j ≤ M_u z_j,  w_j² ≤ u_j


Page 31: Optimization for Machine Learning - Approximation Theory

Sparse Support-Vector Machines

[Goldberg et al., 2013] rewrite w_j² ≤ z_j u_j as

‖(2w_j, u_j − z_j)‖₂ ≤ u_j + z_j

... a second-order cone constraint ... and relax integrality ... add ∑_j z_j ≤ r

... good classification accuracy & small ‖w‖₀!
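A minimal cvxpy sketch of this relaxation on synthetic data; the data sizes and the parameters a, c, r are made up, and any conic solver cvxpy finds installed will handle the SOC constraints.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, a, c, r = 40, 10, 0.1, 1.0, 3
X = rng.standard_normal((m, n))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(m))   # labels in {-1, +1}

w, bvar = cp.Variable(n), cp.Variable()
xi = cp.Variable(m, nonneg=True)
u = cp.Variable(n, nonneg=True)
z = cp.Variable(n)                                    # binaries relaxed to [0, 1]

cons = [cp.multiply(y, X @ w - bvar) + xi >= 1, z >= 0, z <= 1, cp.sum(z) <= r]
cons += [cp.norm(cp.hstack([2 * w[j], u[j] - z[j]]), 2) <= u[j] + z[j]   # w_j^2 <= z_j u_j
         for j in range(n)]

prob = cp.Problem(cp.Minimize(cp.sum(u) + a * cp.sum(z) + c * cp.sum(xi)), cons)
prob.solve()
print(np.round(w.value, 3))                           # most weights should be (near) zero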


Page 32: Optimization for Machine Learning - Approximation Theory

Sparse Support-Vector Machines [Maldonado et al., 2014]

Mixed-Integer Linear SVM

[Maldonado et al., 2014] formulate an MILP: min ‖ξ‖₁ subject to ‖w‖₀ ≤ B

minimize_{w,b,ξ,z}  1^T ξ                          classification error
subject to  Y(Xw − b1) + ξ ≥ 1                     classifier constraints
            L z_j ≤ w_j ≤ U z_j                    on/off switch for w_j
            ∑_j c_j z_j ≤ B                        budget constraint
            ξ ≥ 0,  z_j ∈ {0, 1}

for bounds L < U and budget B > 0
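A minimal PuLP sketch of this MILP on synthetic data; the bounds L, U, the unit feature costs, and the budget B are illustrative choices, and CBC (bundled with PuLP) is assumed as the MIP solver.

import numpy as np
import pulp

rng = np.random.default_rng(1)
m, n, L, U, B = 30, 6, -10.0, 10.0, 3.0
X = rng.standard_normal((m, n))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(m))
cost = [1.0] * n                                      # feature costs c_j

prob = pulp.LpProblem("milp_svm", pulp.LpMinimize)
w = [pulp.LpVariable(f"w{j}", lowBound=L, upBound=U) for j in range(n)]
b = pulp.LpVariable("b")
xi = [pulp.LpVariable(f"xi{i}", lowBound=0) for i in range(m)]
z = [pulp.LpVariable(f"z{j}", cat="Binary") for j in range(n)]

prob += pulp.lpSum(xi)                                # objective: 1^T xi
for i in range(m):                                    # classifier constraints
    margin = pulp.lpSum(float(X[i, j]) * w[j] for j in range(n)) - b
    prob += float(y[i]) * margin + xi[i] >= 1
for j in range(n):                                    # on/off switches for w_j
    prob += w[j] >= L * z[j]
    prob += w[j] <= U * z[j]
prob += pulp.lpSum(cost[j] * z[j] for j in range(n)) <= B   # budget constraint

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(pulp.value(zj)) for zj in z])              # which features were selected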


Page 33: Optimization for Machine Learning - Approximation Theory

Outline

1 Data Analysis at DOE Light Sources

2 Optimization for Machine Learning

3 Mixed-Integer Nonlinear Optimization
  Optimal Symbolic Regression
  Deep Neural Nets as MIPs
  Sparse Support-Vector Machines

4 Robust Optimization
  Robust Optimization for SVMs

5 Conclusions and Extensions


Page 34: Optimization for Machine Learning - Approximation Theory

Nonlinear Robust Optimization

Nonlinear Robust Optimization

minimize_x  f(x)
subject to  c(x; u) ≥ 0,  ∀ u ∈ U
            x ∈ X

Small Example

minimize_{x≥0}  (x₁ − 4)² + (x₂ − 1)²
subject to  x₁√u − x₂ u ≤ 2,  ∀ u ∈ [1/4, 2]

[Figure: feasible region in the (x₁, x₂) plane with the robust solution marked]

Assumptions (e.g., [Leyffer et al., 2018]) ... w.l.o.g. assume f(x) is deterministic

u ∈ U uncertain parameters; U a closed convex set, independent of x

c(x; u) ≥ 0 ∀ u ∈ U robust constraints ... a semi-infinite optimization problem

X ⊂ R^n standard (certain) constraints; f(x) and c(x; u) smooth functions


Page 35: Optimization for Machine Learning - Approximation Theory

Linear Robust Optimization [Ben-Tal and Nemirovski, 1999]

Robust linear constraints are easy! E.g., a^T x + b ≥ 0, ∀ a ∈ U := {a : B^T a ≥ c}
... rewrite the semi-infinite constraint as a minimum:

⇔   { minimize_a  a^T x + b   subject to  B^T a ≥ c }  ≥ 0

... apply duality: L(a, λ) := a^T x + b − λ^T (B^T a − c)

⇔   { maximize_{a,λ}  L(a, λ) = b + λ^T c   subject to  0 = ∇_a L(a, λ) = x − Bλ,  λ ≥ 0 }  ≥ 0

... only need a feasible point with value ≥ 0 ... becomes a standard polyhedral set:

b + λ^T c ≥ 0,  x = Bλ,  λ ≥ 0
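A small numerical sanity check of this duality trick (synthetic data; scipy's linprog solves the inner LP): pick any λ ≥ 0 and x = Bλ with b + λ^T c ≥ 0, and the worst case of a^T x + b over the uncertainty set indeed stays nonnegative.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4))
a0 = rng.standard_normal(3)
c = B.T @ a0 - np.abs(rng.standard_normal(4))   # guarantees the set {B^T a >= c} is nonempty

lam = np.abs(rng.standard_normal(4))            # a certificate lambda >= 0 ...
x = B @ lam                                     # ... the x it certifies, and a b with
b = 1.0 - lam @ c                               # b + lambda^T c = 1 >= 0

# inner problem: minimize a^T x subject to B^T a >= c  (free variables a)
res = linprog(x, A_ub=-B.T, b_ub=-c, bounds=[(None, None)] * 3, method="highs")
print(res.fun + b)   # worst-case value of a^T x + b: nonnegative (here >= 1) as predicted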


Page 36: Optimization for Machine Learning - Approximation Theory

Duality Trick for Conic and Linear Robust Optimization

Duality trick generalizes to other conic uncertainty sets

(P)   minimize_x  f(x)   subject to  c(x; u) ≥ 0, ∀ u ∈ U,  x ∈ X

... creates classes of tractable extended formulations

Robust Constraint c(x; u) ≥ 0   Uncertainty Set U   Extended Formulation
Linear                          Polyhedral          Linear Program
Linear                          Ellipsoidal         Conic QP
Conic                           Conic               SDP


Page 37: Optimization for Machine Learning - Approximation Theory

Robust Optimization for Support Vector Machines (SVMs)

Standard SVM Training

Data S = {x_i, y_i}_{i=1}^m: features x_i ∈ R^n, labels y_i ∈ {−1, 1};
ξ ≥ 0 slacks, b bias, c > 0 penalty parameter

minimize_{w,b,ξ}  ½‖w‖₂² + c 1^T ξ
subject to  Y(Xw − b1) + ξ ≥ 1,  ξ ≥ 0,

where Y = diag(y) and X = [x_1, ..., x_m]^T

SVMs with Additive Location Errors

See survey article [Caramanis et al., 2012] & use duality trick!

Location errors x_i^true = x_i + u_i & ellipsoidal uncertainty U = {u_i | u_i^T Σ u_i ≤ 1}:

y_i (w^T (x_i + u_i) − b) + ξ_i ≥ 1,  ∀ u_i : u_i^T Σ u_i ≤ 1

⇔  y_i (w^T x_i − b) + ξ_i ≥ 1 + ‖Σ^{−1/2} w‖₂        (an SOC constraint)


Page 38: Optimization for Machine Learning - Approximation Theory

Robust Optimization for Support Vector Machines (SVMs)

General Case of Location Errors: "Worst-Case SVM"

minimize_{w,b} maximize_{u∈U}  ½‖w‖₂² + c ∑_j max{ 1 − y_j (w^T (x_j + u_j) − b), 0 }

for the uncertainty set U = { (u_1, ..., u_m) | ∑_j ‖u_j‖ ≤ d }

equivalent to

minimize_{w,b}  ½‖w‖₂² + d‖w‖_D + c ∑_j max{ 1 − y_j (w^T x_j − b), 0 }

where ‖·‖_D is the dual norm of ‖·‖, e.g., ℓ₂ ↔ ℓ₂ or ℓ∞ ↔ ℓ₁ ... follows from duality

[Caramanis et al., 2012] argue that this derivation shows that:

Regularized classifiers are more robust: they satisfy a worst-case principle

They have a probabilistic interpretation if viewed as chance constraints

Page 39: Optimization for Machine Learning - Approximation Theory

Conclusions and Extensions: Optimization for Machine Learning

Conclusions

Mixed-Integer Optimization for Machine Learning

Optimal symbolic regression, expression trees, nonconvex MIP
MIPs of deep neural nets for building adversarial examples
Support-vector machines & ℓ₀ regularizers & constraints

Robust Optimization for Machine Learning

Best “worst-case” SVM ⇒ equivalent tractable formulation

Extensions and Challenges

Extending the use of integer variables into the design of DNNs

Realistic stochastic interpretation of regularizers in SVM, DNN, ...


Page 40: Optimization for Machine Learning - Approximation Theory

Austel, V., Dash, S., Gunluk, O., Horesh, L., Liberti, L., Nannicini, G., and Schieber, B. (2017). Globally optimal symbolic regression. arXiv preprint arXiv:1710.10720.

Belotti, P., Kirches, C., Leyffer, S., Linderoth, J., Luedtke, J., and Mahajan, A. (2013). Mixed-integer nonlinear optimization. Acta Numerica, 22:1–131.

Ben-Tal, A. and Nemirovski, A. (1999). Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1–13.

Caramanis, C., Mannor, S., and Xu, H. (2012). Robust optimization in machine learning. In Optimization for Machine Learning, chapter 14, page 369.

Fischetti, M. and Jo, J. (2018). Deep neural networks and mixed integer linear optimization. Constraints, pages 1–14.

Gebremedhin, A. H., Manne, F., and Pothen, A. (2005). What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review, 47(4):629–705.

Goldberg, N., Leyffer, S., and Munson, T. (2013). A new perspective on convex relaxations of sparse SVM. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 450–457. SIAM.


Page 41: Optimization for Machine Learning - Approximation Theory

Guan, W., Gray, A., and Leyffer, S. (2009). Mixed-integer support vector machine. In NIPS Workshop on Optimization for Machine Learning.

Gunluk, O. and Linderoth, J. (2008). Perspective relaxation of mixed integer nonlinear programs with indicator variables. In Lodi, A., Panconesi, A., and Rinaldi, G., editors, IPCO 2008: The Thirteenth Conference on Integer Programming and Combinatorial Optimization, volume 5035, pages 1–16.

Jeroslow, R. G. (1973). There cannot be any algorithm for integer programming with quadratic constraints. Operations Research, 21(1):221–224.

Kannan, R. and Monma, C. (1978). On the computational complexity of integer programming problems. In Henn, R., Korte, B., and Oettli, W., editors, Optimization and Operations Research, volume 157 of Lecture Notes in Economics and Mathematical Systems, pages 161–172. Springer.

Leyffer, S., Menickelly, M., Munson, T., Vanaret, C., and Wild, S. M. (2018). Nonlinear robust optimization. Technical report, Argonne National Laboratory, Mathematics and Computer Science Division.

Maldonado, S., Perez, J., Weber, R., and Labbe, M. (2014). Feature selection for support vector machines via mixed integer linear programming. Information Sciences, 279:163–175.
