deep learning as optimal control problems...learning, campus jussieu, sorbonne université, paris...

Post on 06-Aug-2020

8 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Regularisation for Inverse Problems and Machine Learning, Campus Jussieu, Sorbonne Université, Paris 19.11.2019

Deep learning as optimal control problems

Martin Benning, Queen Mary University of London (QMUL)

Models and numerical methods

This is joint work with Elena Celledoni, Matthias J. Ehrhardt, Brynjulf Owren and Carola-Bibiane Schönlieb

m.benning@qmul.ac.uk

This is joint work with

2

MB, Elena Celledoni, Matthias J. Ehrhardt, Brynjulf Owren, and Carola-Bibiane Schönlieb. "Deep learning as optimal control problems: models and numerical methods." arXiv preprint arXiv:1904.05657 (2019). To appear in Journal of Computational Dynamics in December 2019

m.benning@qmul.ac.uk

Deep learning as optimal control problems

• Introduction

• Runge-Kutta (RK) networks

• Numerical results

• Regularisation properties of RK network

3

m.benning@qmul.ac.uk

INTRODUCTION

4

m.benning@qmul.ac.uk

Introduction

5

In this part we consider supervised machine learning problems, e.g. classification

Goal of classification is to find mapping given input and output samples .

g : ℝn → {c0, c1, …, cK−1}{(xi, ci)}m

i=1

Here every class label for indicates that the input belongs to the corresponding class, e.g.

ci ∈ {c0, c1, …, cK−1} i ∈ {1,…, m}xi

for , then belongs to class ci = cl l ∈ {0,1,…, K − 1} xi l

m.benning@qmul.ac.uk 6

Introduction

image source

Training data

xi 2 Rd

Example:

source

m.benning@qmul.ac.uk

Introduction

7

Goal of classification is to find mapping given input and output samples .

g : ℝn → {c0, c1, …, cK−1}{(xi, ci)}m

i=1

g(x) := 𝒞 (Wh(x, u) + μ)• is the hypothesis function mapping from to • is a weight vector • is a bias factor • is a model function parametrised by parameters

𝒞 ℝ {c0, c1, …, cK−1}W ∈ ℝ1×n

μ ∈ ℝh : ℝn × P → ℝn u ∈ P

A potential model for such a classifier function isg

In this part we consider supervised machine learning problems, e.g. classification

m.benning@qmul.ac.uk

Introduction

8

Goal of classification is to find mapping given input and output samples .

g : ℝn → {c0, c1, …, cK−1}{(xi, ci)}m

i=1

g(x) := 𝒞 (Wh(x, u) + μ)A potential model for such a classifier function isg

We can train the model parameters by minimising a cost function of the formm

∑i=1

L (𝒞(W h(xi, u) + μ), ci) + ℛ(u)

with respect to , and u W μ

In this part we consider supervised machine learning problems, e.g. classification

m.benning@qmul.ac.uk

Introduction

9

Goal of classification is to find mapping given input and output samples .

g : ℝn → {c0, c1, …, cK−1}{(xi, ci)}m

i=1

g(x) := 𝒞 (Wh(x, u) + μ)A potential model for such a classifier function isg

We can train the model parameters by minimising a cost function of the form

12

m

∑i=1

𝒞 (Wh(xi, u) + μ) − ci2

+ ℛ(u)

with respect to , and u W μ

In this part we consider supervised machine learning problems, e.g. classification

m.benning@qmul.ac.uk

Introduction

10

Goal of classification is to find mapping given input and output samples .

g : ℝn → {c0, c1, …, cK−1}{(xi, ci)}m

i=1

g(x) := 𝒞 (Wh(x, u) + μ)A potential model for such a classifier function isg

We can train the model parameters by minimising a cost function of the formm

∑i=1

[log (1 + exp(Wh(xi, u) + μ)) − ci (Wh(xi, u) + μ)] + ℛ(u)

with respect to , and u W μ

In this part we consider supervised machine learning problems, e.g. classification

m.benning@qmul.ac.uk

Introduction

11

We can train the model parameters by minimising a cost function of the form

12

m

∑i=1

𝒞 (Wh(xi, u) + μ) − ci2

+ ℛ(u)

with respect to , and u W μHow do we choose the model function ?h

For example as a deep neural network such as the residual network (ResNet), i.e.

y[ j+1]i = y[ j]

i + f (y[ j]i , u[ j]), j = 0,…, N − 1, y[0]

i = xi .

h(xi, u) = y[N ]i , u = (u[0], …, u[N−1]) ,

K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, in IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770–778.

In this part we consider supervised machine learning problems, e.g. classification

m.benning@qmul.ac.uk

Introduction

12

How do we choose the model function ?hFor example as a deep neural network such as the residual network (ResNet), i.e.

y[ j+1]i = y[ j]

i + f (y[ j]i , u[ j]), j = 0,…, N − 1, y[0]

i = xi .

h(xi, u) = y[N ]i , u = (u[0], …, u[N−1]) ,

K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, in IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770–778.

Possible choices for and :u[ j] f

u[ j] := (K[ j], β[ j]), j = 0,…, N − 1,

f (y[ j]i , u[ j]) := σ(K[ j]y[ j]

i + β[ j])

K[ j] ∈ ℝn×n β[ j] ∈ ℝn×1

σ(x) = tanh(x)

m.benning@qmul.ac.uk

Introduction

13

Training the ResNet with (mean) squared error cost function therefore reads as

y[ j+1]i = y[ j]

i + f (y[ j]i , u[ j]), j = 0,…, N − 1, y[0]

i = xi .

miny,u,W,μ

m

∑i=1

𝒞 (Wy[N ]i + μ) − ci

2

+ ℛ(u),

subject to

m.benning@qmul.ac.uk

Introduction

14

y[ j+1]i = y[ j]

i + f (y[ j]i , u[ j]), j = 0,…, N − 1, y[0]

i = xi .

miny,u,W,μ

m

∑i=1

𝒞 (Wy[N ]i + μ) − ci

2

+ ℛ(u),

subject to

Δt

E. Haber and L. Ruthotto, Stable architectures for deep neural networks, Inverse Problems, 34 (2017), 014004.

By introducing a step-size parameter we can view the constraint as a forward Euler discretisation of the differential equation

Δt

·yi = f (yi, u), t ∈ [0,T ], yi(0) = xi .

Training the ResNet with (mean) squared error cost function therefore reads as

m.benning@qmul.ac.uk

Introduction

15

In the continuum we therefore obtain the following optimal control problem

miny,u,W,μ

m

∑i=1

𝒞 (Wyi(T ) + μ) − ci2

+ ℛ(u),

subject to

B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert and E. Holtham, Reversible architectures for arbitrarily deep residual neural networks, in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

·yi = f (yi, u), t ∈ [0,T ], yi(0) = xi .

www.qmul.ac.uk m.benning@qmul.ac.uk

A lot of research in this direction

16

• Yann LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.

• Weinan E, A Proposal on Machine Learning via Dynamical Systems, Comm. in Math. and Stat. 2017. • B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert and E. Holtham, Reversible Architectures for Arbitrarily Deep Residual

Neural Networks, arXiv: 1709.03698v1, AAAI (National Conference on Artificial Intelligence). • E. Haber, L. Ruthotto, Stable Architectures for Deep Neural Networks, arXiv: 1705.03341v2 • Qianxiao Li Shuji Hao, An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks • Qianxiao Li, Long Chen, Cheng Tai, Weinan E, Maximum Principle Based Algorithms for Deep Learning, Journal of Machine

Learning Research 18 (2018). • L. Ruthotto and E. Haber, Deep Neural Networks Motivated by Partial Differential Equations, arXiv:1804.04272 • Lu, Y., Zhong, A., Li, Q., Bin Dong. (2017, October 27). Beyond Finite Layer Neural Networks: Bridging Deep Architectures

and Numerical Differential Equations. arXiv.org. • Chen, T. Q., Rubanova, Y., Bettencourt, J., Duvenaud, D. (2018). Neural Ordinary Differential Equations. Presented at

NeurIPS. • Gholami, A., Keutzer, K., Biros, G. (2019, February 26). ANODE: Unconditionally Accurate Memory-Efficient Gradients for

Neural ODEs. • Zhang, T., Yao, Z., Gholami, A., Keutzer, K., Gonzalez, J., Biros, G., Mahoney, M. (2019, June 9). ANODEV2: A Coupled

Neural ODE Evolution Framework.

www.qmul.ac.uk m.benning@qmul.ac.uk

A lot of research in this direction

17

• Yann LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 connectionist models summer school, volume 1, pages 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.

• Weinan E, A Proposal on Machine Learning via Dynamical Systems, Comm. in Math. and Stat. 2017. • B. Chang, L. Meng, E. Haber, L. Ruthotto, D. Begert and E. Holtham, Reversible Architectures for Arbitrarily Deep Residual

Neural Networks, arXiv: 1709.03698v1, AAAI (National Conference on Artificial Intelligence). • E. Haber, L. Ruthotto, Stable Architectures for Deep Neural Networks, arXiv: 1705.03341v2 • Qianxiao Li Shuji Hao, An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks • Qianxiao Li, Long Chen, Cheng Tai, Weinan E, Maximum Principle Based Algorithms for Deep Learning, Journal of Machine

Learning Research 18 (2018). • L. Ruthotto and E. Haber, Deep Neural Networks Motivated by Partial Differential Equations, arXiv:1804.04272 • Lu, Y., Zhong, A., Li, Q., Bin Dong. (2017, October 27). Beyond Finite Layer Neural Networks: Bridging Deep Architectures

and Numerical Differential Equations. arXiv.org. • Chen, T. Q., Rubanova, Y., Bettencourt, J., Duvenaud, D. (2018). Neural Ordinary Differential Equations. Presented at

NeurIPS. • Gholami, A., Keutzer, K., Biros, G. (2019, February 26). ANODE: Unconditionally Accurate Memory-Efficient Gradients for

Neural ODEs. • Zhang, T., Yao, Z., Gholami, A., Keutzer, K., Gonzalez, J., Biros, G., Mahoney, M. (2019, June 9). ANODEV2: A Coupled

Neural ODE Evolution Framework.

m.benning@qmul.ac.uk

DEEP LEARNING AS OPTIMAL CONTROL PROBLEMS

18

m.benning@qmul.ac.uk

Deep learning as optimal control problems

19

miny,u,W,μ

m

∑i=1

𝒞 (Wyi(T ) + μ) − ci2

+ ℛ(u),

subject to·yi = f (yi, u), t ∈ [0,T ], yi(0) = xi .

Deep learning optimal control problem

m.benning@qmul.ac.uk

Deep learning as optimal control problems

20

subject to·y = f (y, u), t ∈ [0, T ], y(0) = x .

Deep learning optimal control problem

miny, u

𝒥(y(T ))

Variational calculus: consider variations

y := y + εv , u := w + εw .

How does the system behave for ?·y = f (y, u) ε → 0

m.benning@qmul.ac.uk

Deep learning as optimal control problems

21

·y = f (y, u), t ∈ [0, T ], y(0) = x .

Variational calculus: consider variations

y := y + εv , u := w + εw .

How does the system behave for ?·y = f (y, u) ε → 0

Differential equation constraint

ddε ( d

dty(t))

ε=0

= ,

ddε

f (y(t), u(t))ε=0

= .

ddt

v(t)

∂y f (y(t), u(t)) v(t) + ∂u f (y(t), u(t)) w(t)

m.benning@qmul.ac.uk

Deep learning as optimal control problems

22

subject to·y = f (y, u), t ∈ [0, T ], y(0) = x .

Deep learning optimal control problem

miny, u

𝒥(y(T ))

Variational equation

Jacobian of w.r.t. to f uJacobian of w.r.t. to f y

ddt

v(t) ∂y f (y(t), u(t)) v(t) + ∂u f (y(t), u(t)) w(t)=

m.benning@qmul.ac.uk 23

Adjoint equationThe adjoint of the variational equation is a system of differential equations for a variable , obtained assumingp(t)

⟨p(t), v(t)⟩ = ⟨p(0), v(0)⟩, ∀t ∈ [0,T ] .

This implies⟨p(t), ·v(t)⟩ = − ⟨ ·p(t), v(t)⟩,

which yields ddt

p(t) = − (∂y f (y(t), u(t)))⊤

p(t) ,

with constraint (∂u f (y(t), u(t)))⊤ p(t) = 0 .

m.benning@qmul.ac.uk

First-order necessary conditions for optimality

24

·y = f (y, u), y(0) = x ,Then, the boundary value problem system

miny, u

𝒥(y(T )) ,

·p = − (∂y f (y, u))⊤

p , p(T ) = ∂y𝒥(y)y=y(T )

,

0 = (∂u f (y, u))⊤ p , t ∈ [0, T ] ,

expresses the first order necessary conditions for optimality of

Sketch of proof: set up a Lagrange functional and compute the optimality conditions.

m.benning@qmul.ac.uk 25

Associated Hamiltonian systemFor this optimal control problem, there is an associated Hamiltonian system with Hamiltonian

ℋ(y, p, u) := p⊤ f (y, u)

with·y = ∂pℋ, ·p = − ∂yℋ, ∂uℋ = 0.

The constraint implies that this is a differential algebraic equation of index one

0 = (∂u f (y, u))⊤ p

Provided the Hessian is invertible, we can regard this system as an ODE when applying numerical methods.

∂u,uℋ

m.benning@qmul.ac.uk

Numerical discretisation of the optimal control problem

26

How do we discretise the boundary value problem system

·y = f (y, u), y(0) = x ,

·p = − (∂y f (y, u))⊤

p , p(T ) = ∂y𝒥(y)y=y(T )

,

0 = (∂u f (y, u))⊤ p , t ∈ [0, T ] ?

Symplectic Runge-Kutta methods are suited to this problem!

J. M. Sanz-Serna, Symplectic Runge-Kutta schemes for adjoint equations automatic differentiation, optimal control and more, SIAM Rev., 58 (2016), 3–33.

m.benning@qmul.ac.uk 27

y[ j+1] = y[ j] + Δts

∑i=1

bi f [ j]i

f [ j]i = f (y[ j]

i , u[ j]i ), i = 1,…, s,

y[ j]i = y[ j] + Δt

s

∑l=1

ai,l f [ j]l , i = 1,…, s

Symplectic Runge-Kutta methods are suited to this problem!

p[ j+1] = p[ j] + Δts

∑i=1

bi ℓ[ j]i

Numerical discretisation of the optimal control problem

ℓ[ j]i = − ∂y f (y[ j]

i , u[ j]i )T p[ j]

i , i = 1,…, s,

p[ j]i = p[ j] + Δt

s

∑l=1

ai,lℓ[ j]l , i = 1,…, s

Forward pass

Backward pass

y[0] = x

p[N ] := ∂𝒥(y[N ])

(∂u f (y[ j]i , u[ j]

i ))T

p[ j]i = 0

biai, j + bjai, j − bibj = 0bi = bi

This is the symplectic part!

m.benning@qmul.ac.uk 28

y[ j+1] = y[ j] + Δt f (y[ j], u[ j]),

Symplectic Runge-Kutta methods are suited to this problem!

p[ j+1] = p[ j] − Δt (∂y f (y[ j], u[ j]))T

p[ j+1],

Numerical discretisation of the optimal control problem

y[0] = x

p[N ] := ∂y𝒥(y[N ])

0 = (∂u f (y[ j]i , u[ j]

i ))T

p[ j+1]i

Special case: symplectic Euler

Has the advantage that bilinear forms are preserved after discretisation, i.e. will be preserved after discretisation.⟨p(t), v(t)⟩

m.benning@qmul.ac.uk 29

Numerical discretisation of the optimal control problem

Proposition: if the continuous optimal control system is discretised with a symplectic partitioned Runge-Kutta method with , then the first order optimality for the discrete control problem

bi ≠ 0

min{u[ j]

i }N−1j=0 ,

{y[ j]}Nj=1, {y[ j]

i }Nj=1,

𝒥(y[N ]),

subject toy[ j+1] = y[ j] + Δt

s

∑i=1

bi f [ j]i

f [ j]i = f (y[ j]

i , u[ j]i ), i = 1,…, s,

y[ j]i = y[ j] + Δt

s

∑m=1

ai,m f [ j]m , i = 1,…, s,

is satisfied.

m.benning@qmul.ac.uk 30

Numerical discretisation of the optimal control problem

What does this mean in practice?

If we use symplectic partitioned Runge-Kutta methods, then first-optimise-then-discretise and first-discretise-then-optimise

lead to the same optimal control problems.

·y = f (y, u)

y[ j]i = y[ j] + Δt

s

∑m=1

ai,m f [ j]m min

{u [ j]i }N−1

j=0 ,{y[ j]}Nj=1,{y[ j]

i }Nj=1,

𝒥(y[N ])

miny, u

𝒥(y(T ))optimise

optimise

discretise discretise

m.benning@qmul.ac.uk

RUNGE-KUTTA NETWORKS

31

m.benning@qmul.ac.uk

Runge-Kutta networks

32

The discretisation of the optimal control problem with symplectic partitioned Runge-Kutta methods inspires a range of different network architectures.

Activation function: with parameters f (y, u) = σ(Ky + β) u = (K, β)

y[ j+1] = y[ j] + Δts

∑i=1

bi f [ j]i

f [ j]i = σ(K[ j]

i y[ j]i + β[ j]

i ), i = 1,…, s,

y[ j]i = y[ j] + Δt

s

∑l=1

ai,l f [ j]l , i = 1,…, s

Runge-Kutta networks:

m.benning@qmul.ac.uk

Runge-Kutta networks

33

The discretisation of the optimal control problem with symplectic partitioned Runge-Kutta methods inspires a range of different network architectures.

Runge-Kutta networks: special case , the ResNets = 1

y[ j+1] = y[ j] + Δt σ (K[ j]y[ j] + β[ j])Backpropagation: γ[ j] = σ′ (K[ j]y[ j] + β[ j]) ⊙ p[ j+1]

p[ j+1] = p[ j] − Δt (K[ j])⊤γ[ j]

Activation function: with parameters f (y, u) = σ(Ky + β) u = (K, β)

m.benning@qmul.ac.uk

Runge-Kutta networks

34

The discretisation of the optimal control problem with symplectic partitioned Runge-Kutta methods inspires a range of different network architectures.

Runge-Kutta networks: special case , the ResNets = 1

y[ j+1] = y[ j] + Δt σ (K[ j]y[ j] + β[ j])γ[ j] = σ′ (K[ j]y[ j] + β[ j]) ⊙ p[ j+1]

p[ j+1] = p[ j] − Δt (K[ j])⊤γ[ j]

Parameter gradient: ∂K[ j]𝒥(y[N ]) = Δt γ[ j] (y[ j])⊤

∂β[ j]𝒥(y[N ]) = Δt γ[ j]

Activation function: with parameters f (y, u) = σ(Ky + β) u = (K, β)

m.benning@qmul.ac.uk

ODENet

35

The discretisation of the optimal control problem with symplectic partitioned Runge-Kutta methods inspires a range of different network architectures.

Activation function: with parameters f (y, u) = α σ(Ky + β) u = (K, α, β)

Example: ResNet with varying time steps, i.e.

y[ j+1] = y[ j] + Δt α[ j] σ (K[ j]y[ j] + β[ j]) ;

we refer to this model as the ODENet.

Parameter gradient: ∂α[ j]𝒥(y[N ]) = Δt ⟨p[ j+1], σ (K[ j]y[ j] + β[ j])⟩

m.benning@qmul.ac.uk

ODENet

36

The discretisation of the optimal control problem with symplectic partitioned Runge-Kutta methods inspires a range of different network architectures.

Example: ResNet with varying time steps, i.e.

y[ j+1] = y[ j] + Δt α[ j] σ (K[ j]y[ j] + β[ j]) ;

we refer to this model as the ODENet.

In some applications it can make sense to assume that the learned time steps have to lie in the simplex

S := α ∈ ℝN α[ j] ≥ 0,N

∑j=1

α[ j] = 1 .

Activation function: with parameters f (y, u) = α σ(Ky + β) u = (K, α, β)

m.benning@qmul.ac.uk

NUMERICAL RESULTS

37

m.benning@qmul.ac.uk

Setup

38

Binary classification: ci ∈ {0,1}

Hypothesis function: 𝒞(x) =1

1 + exp(−x)

miny,u,W,μ

m

∑i=1

𝒞 (Wyi(T ) + μ) − ci2

+ ℛ(u),

subject to·yi = f (yi, u), t ∈ [0,T ], yi(0) = xi .

Deep learning optimal control problem

Varying number of layers after discretisation

Activation function: σ(x) = tanh(x)

www.qmul.ac.uk /QMUL @QMUL

Datasets2D data sets

donut1d donut2d

spiral squares

www.qmul.ac.uk /QMUL @QMUL

Results for donut1d

www.qmul.ac.uk /QMUL @QMUL

Results for squares

www.qmul.ac.uk /QMUL @QMUL

Transformation of features

www.qmul.ac.uk /QMUL @QMUL

Transformations for fixed classifier

www.qmul.ac.uk /QMUL @QMUL

Robustness to initialisations

www.qmul.ac.uk /QMUL @QMUL

Optimisation

www.qmul.ac.uk /QMUL @QMUL

Adaptive time steps for ODENet

www.qmul.ac.uk /QMUL @QMUL

Comparison of Runge-Kutta NetworksWe now train different Runge-Kutta networks for the same classification tasks

Butcher tableau ResNet/Euler Improved Euler

Kutta(3) Kutta(4)

www.qmul.ac.uk /QMUL @QMUL

Comparison of Runge-Kutta Networks

www.qmul.ac.uk /QMUL @QMUL

Comparison of Runge-Kutta Networks

www.qmul.ac.uk /QMUL @QMUL

Comparison of Runge-Kutta Networks

www.qmul.ac.uk /QMUL @QMUL

ODENet for MNIST

www.qmul.ac.uk /QMUL @QMUL

ODENet for MNIST

www.qmul.ac.uk /QMUL @QMUL

ODENet for MNIST

m.benning@qmul.ac.uk

REGULARISATION PROPERTIES

54

m.benning@qmul.ac.uk

Regularisation properties

55

Let us consider the continuous formulation of the neural network constraint

ddt

y(t) = σ (K(t)y(t) + β(t))

m.benning@qmul.ac.uk

Regularisation properties

56

Let us consider the continuous formulation of the neural network constraint

ddt

y(t) = σ (K(t)y(t) + β(t))

and modify it as suggested in *

−K(t)⊤

Then we can verify the following stability estimate if is chosen to be monotone: σ

∥y(t) − yδ(t)∥ ≤ C ∥y(0) − yδ(0)∥ , for t ≥ 0 .

Here is a constant and and denote solutions of the flow for initial values and .

C > 0 y(t) yδ(t)y(0) yδ(0)

*Lars Ruthotto and Eldad Haber. "Deep neural networks motivated by partial differential equations." arXiv preprint arXiv:1804.04272 (2018).

m.benning@qmul.ac.uk

Regularisation properties

57

Then we can verify the following stability estimate if is chosen to be monotone: σ

∥y(t) − yδ(t)∥ ≤ C ∥y(0) − yδ(0)∥ , for t ≥ 0 .

Proof:ddt ( 1

2∥y(t) − yδ(t)∥)

2

= − ⟨K(t)⊤(σ (K(t)y(t) + β(t)) − σ (K(t)yδ(t) + β(t))), y(t) − yδ(t)⟩= − ⟨σ (K(t)y(t) + β(t)) − σ (K(t)yδ(t) + β(t)), K(t)y(t) + β(t) − (K(t)yδ(t) + β(t))⟩

= ⟨ ddt

y(t) −ddt

yδ(t), y(t) − yδ(t)⟩

m.benning@qmul.ac.uk

Regularisation properties

58

Then we can verify the following stability estimate if is chosen to be monotone: σ

∥y(t) − yδ(t)∥ ≤ C ∥y(0) − yδ(0)∥ , for t ≥ 0 .

Proof:

= − ⟨σ (K(t)y(t) + β(t)) − σ (K(t)yδ(t) + β(t)), K(t)y(t) + β(t) − (K(t)yδ(t) + β(t))⟩= − ⟨σ(z(t)) − σ(zδ(t)), z(t) − zδ(t)⟩

for and .z(t) := K(t)y(t) + β(t) zδ(t) := K(t)yδ(t) + β(t)

ddt ( 1

2∥y(t) − yδ(t)∥)

2

= ⟨ ddt

y(t) −ddt

yδ(t), y(t) − yδ(t)⟩

≤ 0 ,

m.benning@qmul.ac.uk

Regularisation properties

59

Then we can verify the following stability estimate if is chosen to be monotone: σ

∥y(t) − yδ(t)∥ ≤ C ∥y(0) − yδ(0)∥ , for t ≥ 0 .

Proof:ddt ( 1

2∥y(t) − yδ(t)∥)

2

Integrating from 0 to now yieldst12

∥y(t) − yδ(t)∥2 −12

∥y(0) − yδ(0)∥2 ≤ 0 ,

which concludes the proof. ∎

≤ 0 .

m.benning@qmul.ac.uk 60

Regularisation propertiesIs stability a desirable property in neural networks?

Explaining and Harnessing Adversarial Examples, Goodfellow et al, ICLR 2015.

Adversarial examples in image classification:

m.benning@qmul.ac.uk 61

Regularisation propertiesHardening of deep neural networks:

suppose we have inputs and with , then our network produces outputs with

y(0) = x yδ(0) = xδ

∥x − xδ∥ ≤ δ

∥y(t) − yδ(t)∥ ≤ C δ .

limδ→0

∥y(t) − yδ(t)∥ = 0 .Hence, we observe

This continuous dependency is also a feature of regularisation operators;

We should discretise or train such that this stability estimate is preserved.

m.benning@qmul.ac.uk 62

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)

Let and denote solutions of this flow for data and with and identical initial values. For this model we can derive the following stability estimate:

y(t) yδ(t) f f δ ∥f − f δ∥ ≤ δ

ddt ( 1

2∥y(t) − yδ(t)∥2) ≤ − λ ⟨A⊤(Ay(t) − f ) − A⊤(Ayδ(t) − f δ), y(t) − yδ(t)⟩

= − λ ⟨A(y(t) − yδ(t)) − ( f − f δ), A(y(t) − yδ(t))⟩

m.benning@qmul.ac.uk 63

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)

Let and denote solutions of this flow for data and with and identical initial values. For this model we can derive the following stability estimate:

y(t) yδ(t) f f δ ∥f − f δ∥ ≤ δ

ddt ( 1

2∥y(t) − yδ(t)∥2) = − λ ⟨A(y(t) − yδ(t)) − ( f − f δ), A(y(t) − yδ(t))⟩

≤ λ ⟨f − f δ, A(y(t) − yδ(t))⟩≤ λ∥f − f δ∥∥A(y(t) − yδ(t))∥

m.benning@qmul.ac.uk 64

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)

Let and denote solutions of this flow for data and with and identical initial values. For this model we can derive the following stability estimate:

y(t) yδ(t) f f δ ∥f − f δ∥ ≤ δ

ddt ( 1

2∥y(t) − yδ(t)∥2) ≤ λ∥f − f δ∥∥A(y(t) − yδ(t))∥

≤ λδ∥A∥∥y(t) − yδ(t)∥

m.benning@qmul.ac.uk 65

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)Let and denote solutions of this flow for data and with and identical initial values. For this model we can derive the following stability estimate:

y(t) yδ(t) f f δ ∥f − f δ∥ ≤ δ

ddt ( 1

2∥y(t) − yδ(t)∥2) ≤ λδ∥A∥∥y(t) − yδ(t)∥

⟹ ∥y(t) − yδ(t)∥ ≤ ∥y(0) − yδ(0)∥ + λδ∥A∥ t

= λ∥A∥t δ

m.benning@qmul.ac.uk 66

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)Let and denote solutions of this flow for data and with and identical initial values. For this model we can derive the following stability estimate:

y(t) yδ(t) f f δ ∥f − f δ∥ ≤ δ

∥y(t) − yδ(t)∥ ≤ λ∥A∥t δ .

Hence, we need to choose a stopping time to guarantee convergence of this stability estimate, i.e. .

t*limδ↓0

∥y(t) − yδ(t)∥ = 0

m.benning@qmul.ac.uk 67

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)When do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

ddt ( 1

2∥yδ(t) − y†∥2) = ⟨ d

dtyδ(t), yδ(t) − y†⟩

= − ⟨K(t)⊤σ (K(t)yδ(t) + β(t)) + λA⊤ (Ayδ(t) − f δ), y(t) − y†⟩−⟨K(t)⊤σ (K(t)yδ(t) + β(t)), y(t) − y†⟩

first inner product

−λ ⟨A⊤ (Ayδ(t) − f δ), y(t) − y†⟩second inner product

=

m.benning@qmul.ac.uk 68

Regularisation propertiesWhen do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

−⟨K(t)⊤σ (K(t)yδ(t) + β(t)), y(t) − y†⟩first inner product

= −⟨K(t)⊤σ (K(t)yδ(t) + β(t)) − K(t)⊤σ (K(t)y† + β(t)), y(t) − y†⟩+⟨K(t)⊤σ (K(t)y† + β(t)), y(t) − y†⟩

≤ ⟨K(t)⊤σ (K(t)y† + β(t)), y(t) − y†⟩Monotonic increase of σ

m.benning@qmul.ac.uk 69

Regularisation propertiesWhen do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

−⟨K(t)⊤σ (K(t)yδ(t) + β(t)), y(t) − y†⟩first inner product

≤ ⟨K(t)⊤σ (K(t)y† + β(t)), y(t) − y†⟩

K(t)⊤σ (K(t)y† + β(t)) y(t) − y†≤

m.benning@qmul.ac.uk 70

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)When do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

ddt ( 1

2∥yδ(t) − y†∥2) = ⟨ d

dtyδ(t), yδ(t) − y†⟩

= − ⟨K(t)⊤σ (K(t)yδ(t) + β(t)) + λA⊤ (Ayδ(t) − f δ), y(t) − y†⟩−λ ⟨A⊤ (Ayδ(t) − f δ), y(t) − y†⟩

second inner product

K(t)⊤σ (K(t)y† + β(t)) y(t) − y†≤

m.benning@qmul.ac.uk 71

Regularisation propertiesWhen do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

= −λ ⟨Ayδ(t) − f δ, Ay(t) − Ay†⟩

−λ ⟨A⊤ (Ayδ(t) − f δ), y(t) − y†⟩second inner product

= − λ ⟨Ayδ(t) − f δ, Ay(t) − f⟩−λ ⟨Ayδ(t) − f + f − f δ, Ay(t) − f⟩=

= −λ Ayδ(t) − f2

− λ ⟨f − f δ, Ayδ(t) − f⟩≤ λ f − f δ Ayδ(t) − f λδ∥A∥ yδ(t) − y†≤

m.benning@qmul.ac.uk 72

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)When do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

ddt ( 1

2∥yδ(t) − y†∥2) = ⟨ d

dtyδ(t), yδ(t) − y†⟩

= − ⟨K(t)⊤σ (K(t)yδ(t) + β(t)) + λA⊤ (Ayδ(t) − f δ), y(t) − y†⟩K(t)⊤σ (K(t)y† + β(t)) y(t) − y†≤ + λδ∥A∥ yδ(t) − y†

= ( K(t)⊤σ (K(t)y† + β(t)) + λδ∥A∥) yδ(t) − y†

m.benning@qmul.ac.uk 73

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)When do we converge to a solution of the inverse problem, i.e. ?y† Ay† = f

ddt ( 1

2∥yδ(t) − y†∥2) ≤ ( K(t)⊤σ (K(t)y† + β(t)) + λδ∥A∥) yδ(t) − y†

Assumption (source condition): is bounded for all .∥K(t)⊤σ (K(t)y† + β(t)) ∥ t ≥ 0

⟹ddt ( 1

2∥yδ(t) − y†∥2) ≤ (C + λδ∥A∥) yδ(t) − y†

m.benning@qmul.ac.uk 74

Regularisation properties

ddt

y(t) = −K(t)⊤σ (K(t)y(t) + β(t))

Connection to inverse problems for example with additional reaction term:

−λA⊤ ( Ay(t) − f)Assumption (source condition): is bounded for all .∥K(t)⊤σ (K(t)y† + β(t)) ∥ t ≥ 0

ddt ( 1

2∥yδ(t) − y†∥2) ≤ (C + λδ∥A∥) yδ(t) − y†

∥yδ(t) − y†∥ ≤ ∥yδ(0) − y†∥ + (C + λδ∥A∥) t

m.benning@qmul.ac.uk 75

Regularisation propertiesAssumption (source condition): is bounded for all .∥K(t)⊤σ (K(t)y† + β(t)) ∥ t ≥ 0

ddt ( 1

2∥yδ(t) − y†∥2) ≤ (C + λδ∥A∥) yδ(t) − y†

∥yδ(t) − y†∥ ≤ ∥yδ(0) − y†∥ + (C + λδ∥A∥) t

Suppose and , then we could estimateA = Id yδ(0) = f δ

∥yδ(t) − y†∥ ≤ ∥f δ − f ∥ +(C + λδ∥A∥) t

m.benning@qmul.ac.uk 76

Regularisation propertiesAssumption (source condition): is bounded for all .∥K(t)⊤σ (K(t)y† + β(t)) ∥ t ≥ 0

ddt ( 1

2∥yδ(t) − y†∥2) ≤ (C + λδ∥A∥) yδ(t) − y†

∥yδ(t) − y†∥ ≤ ∥yδ(0) − y†∥ + (C + λδ∥A∥) t

Suppose and , then we could estimateA = Id yδ(0) = f δ

∥yδ(t) − y†∥ ≤ δ +(C + λδ∥A∥) t

and conclude for .limδ↓0

∥yδ(t*) − y†∥ = 0 t* ∼ δ

www.qmul.ac.uk /QMUL @QMUL

Thank you for your attention!Acknowledgements:

top related