
Differential Dynamic Programming Neural Optimizer

Guan-Horng Liu 1 Tianrong Chen 1 Evangelos A. Theodorou 1

Abstract

Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in the Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve convergence rate and reduce sensitivity to hyper-parameters over existing methods. We show that the algorithm is competitive against state-of-the-art first and second order methods. Our work opens up new avenues for principled algorithmic design built upon the optimal control theory.

1. Introduction

In this work, we consider the following optimal control problem (OCP) in the discrete-time setting:

\[
\min_{\bar{u}} \; J(\bar{u}; x_0) := \Big[ \phi(x_T) + \sum_{t=0}^{T-1} \ell_t(x_t, u_t) \Big]
\quad \text{s.t.} \;\; x_{t+1} = f_t(x_t, u_t),
\tag{OCP}
\]

where x_t ∈ R^n and u_t ∈ R^m represent the state and control at each time step t. f_t(·,·), ℓ_t(·,·), and φ(·) respectively denote the nonlinear dynamics, intermediate cost, and terminal cost functions. OCP aims to find a control trajectory, ū ≜ {u_t}_{t=0}^{T-1}, such that the accumulated cost J over the finite horizon T is minimized.

¹ Georgia Institute of Technology, Atlanta, GA 30332, USA. Correspondence to: Guan-Horng Liu <[email protected]>.

Preliminary work.

Table 1. Terminology Mapping

  Symbol   Deep Learning                   Optimal Control
  J        Total Loss                      Trajectory Cost
  x_t      Activation Vector at Layer t    State at Time Step t
  u_t      Weight at Layer t               Control at Time Step t
  f        Layer                           Dynamical System
  φ        End-goal Loss                   Terminal Cost
  ℓ        Weight Regularization           Intermediate Cost

Figure 1. Computational graphs in the backward pass (upper) and weight update (lower). DDP differs from Back-propagation in that (1) the value derivatives ∇_x V, ∇_xx V, as opposed to the loss gradient ∇_x J, are computed backward, and (2) the weight, denoted as u_t, is updated using layer-wise feedback policies (shown in red as functions of state) with additional forward propagation.

Problems with the form of OCP appear in multidisciplinary areas since they describe a generic multi-stage decision making problem, and they have gained commensurate interest recently in deep learning.

Central to the research along this line is the interpretation of DNNs as discrete-time nonlinear dynamical systems, in which each layer is viewed as a distinct time step (Weinan, 2017). The dynamical system perspective provides a mathematically sound explanation for the success of certain DNN architectures (Lu et al., 2019). It also enables principled architecture design by bringing in rich analysis from numerical differential equations (Lu et al., 2017; Long et al., 2017) and discrete mechanics (Greydanus et al., 2019; Zhong et al., 2019) when learning problems inherit physical structures.


In the continuum limit of depth, Chen et al. (2018) parametrized an ordinary differential equation (ODE) directly using DNNs, with Liu et al. (2019) later extending the framework to accept stochastic dynamics.

From the optimization viewpoint, iterative algorithms for solving OCP have been shown to admit a control interpretation (Hu & Lessard, 2017). When the network weight is recast as the control variable, OCP describes, without any loss of generality, the training objective composed of layer-wise losses (e.g. weight regularization) and a terminal loss (e.g. cross-entropy). This connection has been mathematically explored in many recent works (Weinan et al., 2018; Seidman et al., 2020; Liu & Theodorou, 2019). Although these works provide theoretical statements on convergence and generalization, the algorithmic development remains relatively limited. Specifically, previous works primarily focus on applying first-order optimality conditions, provided by the Pontryagin principle (Boltyanskii et al., 1960), to architectures restricted to residual blocks or discrete weights (Li & Hao, 2018; Li et al., 2017).

In this work, we take a parallel path from the Approximate Dynamic Programming (ADP, Bertsekas et al. (1995)), a technique particularly designed to solve complex Markovian decision processes. For this class of problems, ADP has been shown to be numerically superior to direct optimization such as Newton's method, since it takes into account the temporal structure inherent in OCP (Liao & Shoemaker, 1992). The resulting update features a locally optimal feedback policy at each stage (see Fig. 1), which is in contrast to the one derived from Pontryagin's principle. In the application to DNNs, we will show through experiments that these policies help improve training convergence and reduce sensitivity to hyper-parameter changes.

Of particular interest among practical ADP algorithms is the Differential Dynamic Programming (DDP, Jacobson & Mayne (1970)). DDP is a second order method that has been used extensively for complex trajectory optimization in robotics (Tassa et al., 2014; Posa et al., 2016), and in this work we further show that existing first and second order methods for training DNNs can be derived from DDP as special cases (see Table 2). Such an intriguing connection can be beneficial to both sides. On one hand, we can leverage recent advances in efficient curvature factorization of the loss Hessian (Martens & Grosse, 2015; George et al., 2018) to relieve the computational burden in DDP; on the other hand, computing feedback policies stands as an independent module, and thus it can be integrated into existing first and second order methods.

Notation: We will always use t as the time step of the dynamics and denote a trajectory as x̄ ≜ {x_t}_{t=0}^{T-1}. Given a function F_t : X ↦ U, we denote its Jacobian and Hessian respectively as ∇_x F_t ∈ R^{X×U} and ∇_xx F_t ∈ R^{X×X×U}, and sometimes abbreviate them by F^t_x and F^t_xx. Given a matrix A, its element at (i, j) is denoted as [A]_{(i,j)}, with its sub-matrix denoted as [A]_{(i:j, k:p)}.

The concept of a feedback mechanism has already shown up in the study of network design, where the terminology typically refers to connections between modules over training (Chung et al., 2015; Wen et al., 2018) or successive prediction for vision applications (Zamir et al., 2017; Li et al., 2019). Conceptually, perhaps Shama et al. (2019) and Huh et al. (2019) are most related to our work, where the authors proposed to reuse the discriminator from a Generative Adversarial Network as a feedback module during training or test time. We note, however, that neither the problem formulation nor the mapping space of the feedback module is the same as ours. Our feedback policy originates from optimal control theory, in which the control update needs to compensate for state disturbances throughout propagation.

The paper is organized as follows. In Section 2 we go over background on the optimality conditions for OCP and review the DDP algorithm. The connection between DNN training and trajectory optimization is solidified in Section 3, with a practical algorithm demonstrated in Section 4. We provide empirical results and future directions in Sections 5 and 6.

2. Preliminaries

2.1. Optimality Conditions for OCP

Development of the optimality conditions for OCP can be dated back to the 1960s, characterized by Pontryagin's Maximum Principle (PMP) and Dynamic Programming (DP). We detail the two different approaches below.

Theorem 1 (Discrete-time PMP (Todorov, 2016)). Let ū^* be the optimal control trajectory for OCP and x̄^* be the corresponding state trajectory. Then, there exists a co-state trajectory p̄^* ≜ {p^*_t}_{t=1}^{T}, such that

\[
\begin{aligned}
x^*_{t+1} &= \nabla_{p} H(x^*_t, p^*_{t+1}, u^*_t), \quad && x^*_0 = x_0, && \text{(1)} \\
p^*_t &= \nabla_{x} H(x^*_t, p^*_{t+1}, u^*_t), \quad && p^*_T = \nabla_x \phi(x^*_T), && \text{(2)} \\
u^*_t &= \arg\min_{v \in \mathbb{R}^m} H(x^*_t, p^*_{t+1}, v). && && \text{(3)}
\end{aligned}
\]

Eq. (2) is called the adjoint equation, and the discrete-time Hamiltonian H : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^m \mapsto \mathbb{R} is given by

\[
H(x_t, p_{t+1}, u_t) \triangleq p_{t+1}^{\mathsf{T}} f_t(x_t, u_t) + \ell_t(x_t, u_t).
\]

The discrete-time PMP theorem can be derived using KKT conditions, in which the co-state p_t is equivalent to the Lagrange multiplier. As we will see in Section 3.1, the adjoint dynamics Eq. (2) has a direct link to Back-propagation. Note that the solution to Eq. (3) is an open-loop process in the sense that it does not depend on state variables. This is in contrast to the Dynamic Programming (DP) principle, in which a feedback policy is considered.


Theorem 2 (DP (Bellman, 1954)). Define a value function V_t : \mathbb{R}^n \mapsto \mathbb{R} at each time step, computed backward in time using the Bellman equation

\[
V_t(x_t) = \min_{u_t \in \Gamma} \{ \ell_t(x_t, u_t) + V_{t+1}(f_t(x_t, u_t)) \}, \tag{4}
\]

with V_T(x_T) = φ(x_T). Γ : \mathbb{R}^n \mapsto \mathbb{R}^m denotes a set of mappings from state to control space. Then, V_0(x_0) = J^*(x_0) is the optimal objective value of OCP. Furthermore, if u^*_t = μ^*_t(x_t) minimizes the RHS of Eq. (4), then the policy π^* = {μ^*_0, ..., μ^*_{T-1}} is optimal.

Hereafter we denote the objective involved at each decision stage in Eq. (4) as the Bellman objective Q_t(x_t, u_t):

\[
Q_t(x_t, u_t) \triangleq \ell_t(x_t, u_t) + V_{t+1}(f_t(x_t, u_t)). \tag{5}
\]

The principle of DP recasts the problem of minimizing over a control sequence as a sequence of minimizations over each control. The value function V_t summarizes the optimal cost-to-go at each stage, provided the subsequent controls are also optimal. The resulting u^*_t(x) is an optimal feedback law, yielding a globally convergent closed-loop system.
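To make the recursion in Eq. (4)-(5) concrete, the following minimal sketch runs the Bellman backward pass on a toy problem with finite state and control sets, so the value function fits exactly in a table; the random dynamics, costs, and problem sizes are illustrative assumptions, not taken from the paper.

```python
# Tabular dynamic programming: backward Bellman recursion (Eq. (4)-(5)),
# then a forward rollout of the resulting feedback policy.
import numpy as np

np.random.seed(0)
S, U, T = 6, 3, 5                                   # #states, #controls, horizon
f = np.random.randint(0, S, size=(T, S, U))         # x_{t+1} = f_t(x_t, u_t)
l = np.random.rand(T, S, U)                         # intermediate cost l_t(x_t, u_t)
phi = np.random.rand(S)                             # terminal cost phi(x_T)

V = [None] * (T + 1)
policy = [None] * T
V[T] = phi                                          # V_T(x_T) = phi(x_T)
for t in reversed(range(T)):
    # Bellman objective Q_t(x, u) = l_t(x, u) + V_{t+1}(f_t(x, u)), Eq. (5)
    Q = l[t] + V[t + 1][f[t]]                       # shape (S, U)
    V[t] = Q.min(axis=1)                            # Eq. (4): minimize over controls
    policy[t] = Q.argmin(axis=1)                    # feedback law u_t = mu_t(x_t)

# Rolling out the feedback policy from x_0 achieves the optimal cost V_0(x_0).
x, cost = 0, 0.0
for t in range(T):
    u = policy[t][x]
    cost += l[t, x, u]
    x = f[t, x, u]
cost += phi[x]
assert np.isclose(cost, V[0][0])
print(f"optimal cost-to-go from x_0 = 0: {V[0][0]:.4f}")
```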

2.2. Differential Dynamic Programming

Trajectory optimization algorithms typically rely on solving the optimality equations from PMP or DP. Unfortunately, solving the Bellman equation in high-dimensional problems is infeasible without any approximation, a fact well known as the Bellman curse of dimensionality. To mitigate the computational burden from the minimization involved at each stage, one can replace the Bellman objective in Eq. (4) with its second order approximation. Such an approximation is central to the Differential Dynamic Programming (DDP), a second-order method that inherits a similar Bellman optimality structure while being computationally efficient.

Given a nominal trajectory (x̄, ū), DDP iteratively optimizes its accumulated cost, where each iteration consists of a backward and a forward pass, detailed below.

Backward pass: At each time step, DDP expands Q_t at (x_t, u_t) up to second order, i.e. Q_t(x_t + δx_t, u_t + δu_t) ≈ Q_t(x_t, u_t) + δQ_t(δx_t, δu_t), where δQ_t is expanded as¹

\[
\delta Q_t = \frac{1}{2}
\begin{bmatrix} 1 \\ \delta x_t \\ \delta u_t \end{bmatrix}^{\mathsf{T}}
\begin{bmatrix}
0 & Q_x^{\mathsf{T}} & Q_u^{\mathsf{T}} \\
Q_x & Q_{xx} & Q_{xu} \\
Q_u & Q_{ux} & Q_{uu}
\end{bmatrix}
\begin{bmatrix} 1 \\ \delta x_t \\ \delta u_t \end{bmatrix},
\]

with each term given by

\[
\begin{aligned}
Q_x &= \ell_x + f_x^{\mathsf{T}} V'_{x'} && \text{(6a)} \\
Q_u &= \ell_u + f_u^{\mathsf{T}} V'_{x'} && \text{(6b)} \\
Q_{xx} &= \ell_{xx} + f_x^{\mathsf{T}} V'_{x'x'} f_x + V'_{x'} \cdot f_{xx} && \text{(6c)} \\
Q_{uu} &= \ell_{uu} + f_u^{\mathsf{T}} V'_{x'x'} f_u + V'_{x'} \cdot f_{uu} && \text{(6d)} \\
Q_{ux} &= \ell_{ux} + f_u^{\mathsf{T}} V'_{x'x'} f_x + V'_{x'} \cdot f_{ux} && \text{(6e)}
\end{aligned}
\]

¹ We drop the time step t in all derivatives of Q for simplicity.

V'_{x'} ≜ ∇_x V_{t+1}(x_{t+1}) denotes the derivative of the value function at the next state. The dot notation in Eq. (6c)-(6e) represents the product of a vector with a 3D tensor. The optimal perturbation to this approximate Bellman objective admits an analytic form given by

\[
\delta u^*_t(\delta x_t) = k_t + K_t \delta x_t, \tag{7}
\]

where δx_t = x^*_t − x_t. k_t ≜ −Q_{uu}^{-1} Q_u and K_t ≜ −Q_{uu}^{-1} Q_{ux} respectively denote the open and feedback gains.

Note that this policy is only optimal locally around the nominal trajectory, where the second order approximation remains valid. The backward pass repeats the computation Eq. (6)-(7) backward in time, with the first and second derivatives of the value function also computed recursively by

\[
\begin{aligned}
V^t_x &= Q_x - Q_{ux}^{\mathsf{T}} Q_{uu}^{-1} Q_u \\
V^t_{xx} &= Q_{xx} - Q_{ux}^{\mathsf{T}} Q_{uu}^{-1} Q_{ux}.
\end{aligned} \tag{8}
\]

Forward pass: In the forward pass, we simply simulate a trajectory by applying the feedback policy computed at each stage. Then, a new Bellman objective Q is expanded along the updated trajectory, and the whole procedure repeats until the termination condition is met. Under mild assumptions, DDP admits quadratic convergence for systems with smooth dynamics (Mayne, 1973) and is numerically superior to Newton's method (Liao & Shoemaker, 1992). We summarize the DDP algorithm in Alg. 1.
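The following minimal sketch implements one backward and forward DDP pass, Eq. (6)-(8) and Alg. 1, on a toy linear-quadratic problem where all derivatives are available in closed form; the dynamics, cost weights, and dimensions are illustrative assumptions rather than the paper's setup.

```python
# One DDP iteration (backward + forward pass) on a toy linear-quadratic problem.
import numpy as np

np.random.seed(0)
n, m, T = 3, 2, 10                        # state dim, control dim, horizon
A, B = np.eye(n), np.random.randn(n, m) * 0.1
Qcost, Rcost = np.eye(n), 0.1 * np.eye(m) # terminal / control cost weights

f = lambda x, u: A @ x + B @ u            # dynamics x_{t+1} = f_t(x_t, u_t)
l = lambda x, u: 0.5 * u @ Rcost @ u      # intermediate cost (weight-decay analogue)
phi = lambda x: 0.5 * x @ Qcost @ x       # terminal cost

def ddp_iteration(xs, us):
    """One backward + forward pass around the nominal trajectory (xs, us)."""
    # Backward pass: propagate V_x, V_xx and collect the gains (k_t, K_t).
    Vx, Vxx = Qcost @ xs[-1], Qcost.copy()        # V^T_x, V^T_xx
    ks, Ks = [], []
    for t in reversed(range(T)):
        fx, fu = A, B                             # Jacobians of the (linear) dynamics
        Qx = fx.T @ Vx                            # l_x = 0 for this toy cost
        Qu = Rcost @ us[t] + fu.T @ Vx
        Qxx = fx.T @ Vxx @ fx
        Quu = Rcost + fu.T @ Vxx @ fu
        Qux = fu.T @ Vxx @ fx
        Quu_inv = np.linalg.inv(Quu)
        k, K = -Quu_inv @ Qu, -Quu_inv @ Qux      # Eq. (7) gains
        Vx = Qx - Qux.T @ Quu_inv @ Qu            # Eq. (8)
        Vxx = Qxx - Qux.T @ Quu_inv @ Qux
        ks.append(k); Ks.append(K)
    ks, Ks = ks[::-1], Ks[::-1]

    # Forward pass: apply the layer-wise feedback policy delta_u = k + K dx.
    xs_new, us_new = [xs[0]], []
    for t in range(T):
        dx = xs_new[t] - xs[t]
        u = us[t] + ks[t] + Ks[t] @ dx
        us_new.append(u)
        xs_new.append(f(xs_new[t], u))
    return xs_new, us_new

# Roll out a nominal trajectory with zero controls, then iterate.
x0 = np.ones(n)
us = [np.zeros(m) for _ in range(T)]
xs = [x0]
for t in range(T):
    xs.append(f(xs[t], us[t]))
for it in range(5):
    xs, us = ddp_iteration(xs, us)
    cost = sum(l(xs[t], us[t]) for t in range(T)) + phi(xs[-1])
    print(f"iteration {it}: cost = {cost:.4f}")
```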

3. Training DNNs as Trajectory Optimization

3.1. Optimal Control Formulation

Recall that DNNs can be interpreted as dynamical systems in which each layer is viewed as a distinct time step. Consider for instance the mapping in feedforward networks,

\[
x_{t+1} = \sigma_t(h_t), \quad h_t = g_t(x_t, u_t) \equiv W_t x_t + b_t. \tag{9}
\]

x_t ∈ R^{n_t} and x_{t+1} ∈ R^{n_{t+1}} represent the activations at layers t and t+1, with h_t ∈ R^{n_{t+1}} being the pre-activation. σ_t and g_t respectively denote the nonlinear activation function and the affine transform parametrized by the weight u_t ≜ [vec(W_t), b_t]^T. Eq. (9) can be seen as a dynamical system propagating the activation vector x_t using u_t.
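As a concrete reading of Eq. (9), the sketch below writes a single fully-connected layer as a dynamical-system step x_{t+1} = f_t(x_t, u_t) with the flattened weight and bias acting as the control; the layer sizes and tanh activation are assumptions for illustration.

```python
# One fully-connected layer as a dynamical-system step (Eq. (9)).
import numpy as np

def layer_step(x, u, n_in, n_out, sigma=np.tanh):
    """Propagate activation x (state) one layer forward using control u."""
    W = u[: n_out * n_in].reshape(n_out, n_in)    # unpack vec(W_t)
    b = u[n_out * n_in:]                          # unpack b_t
    h = W @ x + b                                 # affine transform g_t(x_t, u_t)
    return sigma(h)                               # nonlinear activation sigma_t

x = np.random.randn(4)                            # activation at layer t
u = np.random.randn(3 * 4 + 3)                    # control = flattened weight and bias
print(layer_step(x, u, n_in=4, n_out=3).shape)    # activation at layer t+1: (3,)
```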

It is natural to ask whether the necessary condition in the PMP theorem relates to first-order optimization methods in DNN training. This is indeed the case, as pointed out in Li et al. (2017) and Liu & Theodorou (2019).

Lemma 3 ((Li et al., 2017)). Back-propagation with gradient descent satisfies Eq. (2) and iteratively solves the minimization described by Eq. (3).


Algorithm 1 Differential Dynamic Programming
  Input: nominal trajectory {x̄_nom, ū_nom}
  repeat
    Set V^T_x = ∇_x φ and V^T_xx = ∇_xx φ
    for t = T−1 to 0 do
      Compute Q^t_x, Q^t_u, Q^t_xx, Q^t_ux, Q^t_uu with Eq. (6)
      Compute the feedback policy δu^*_t(δx_t) with Eq. (7)
      Compute V^t_x, V^t_xx with Eq. (8)
    end for
    Set x^*_0 = x_0
    for t = 0 to T−1 do
      u^*_t = u_t + k_t + K_t δx_t, where δx_t = x^*_t − x_t
      x^*_{t+1} = f_t(x^*_t, u^*_t)
    end for
    {x̄_nom, ū_nom} ← {x̄^*, ū^*}
  until the termination condition is met

Lemma 3 follows by expanding the RHS of Eq. (1) and (2):

\[
\begin{aligned}
\nabla_{p_{t+1}} H(x_t, p_{t+1}, u_t) &= f_t(x_t, u_t), \\
\nabla_{x_t} H(x_t, p_{t+1}, u_t) &= \nabla_{x_t} f_t(x_t, u_t)^{\mathsf{T}} p_{t+1} + \nabla_{x_t} \ell_t(x_t, u_t) = \nabla_{x_t} J(\bar{u}; x_0).
\end{aligned}
\]

Thus, the first two equations in PMP correspond to the forward propagation and the chain rule used in Back-propagation. Notice that we now have the subscript t in the gradient since the dimensions of x_t and p_t may change with time. When the Hamiltonian is differentiable wrt u_t, one can attempt to solve Eq. (3) by iteratively taking gradient descent steps. This leads to the familiar update:

\[
\begin{aligned}
u_t^{(k+1)} &= u_t^{(k)} - \eta \nabla_{u_t} H(x_t, p_{t+1}, u_t) \\
&= u_t^{(k)} - \eta \big( \nabla_{u_t} f_t(x_t, u_t)^{\mathsf{T}} p_{t+1} + \nabla_{u_t} \ell_t(x_t, u_t) \big) \\
&= u_t^{(k)} - \eta \nabla_{u_t} J(\bar{u}; x_0),
\end{aligned} \tag{10}
\]

where k and η denote the update iteration and step size.
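The sketch below numerically checks this correspondence on a tiny two-layer network: the adjoint recursion of Eq. (2) produces co-states p_t, and the resulting Hamiltonian gradient ∇_{u_t} H matches a finite-difference gradient of J, as in Eq. (10). The network sizes, tanh activations, and quadratic terminal loss are assumptions, not the paper's experimental setup.

```python
# Verify grad_u H (adjoint / back-propagation) against a finite-difference grad of J.
import numpy as np

np.random.seed(0)
sizes = [4, 3, 2]
Ws = [np.random.randn(sizes[t + 1], sizes[t]) * 0.5 for t in range(2)]
x0, target = np.random.randn(4), np.random.randn(2)

def forward(Ws):
    xs = [x0]
    for W in Ws:
        xs.append(np.tanh(W @ xs[-1]))
    return xs

def loss(Ws):
    return 0.5 * np.sum((forward(Ws)[-1] - target) ** 2)  # terminal loss phi(x_T)

# Backward (adjoint) pass: p_T = grad phi, p_t = grad_x f_t^T p_{t+1} (Eq. (2)).
xs = forward(Ws)
p = xs[-1] - target
grads = []
for t in reversed(range(2)):
    h = Ws[t] @ xs[t]
    dh = (1 - np.tanh(h) ** 2) * p                         # back through tanh
    grads.append(np.outer(dh, xs[t]))                      # grad_u H = grad_u f^T p
    p = Ws[t].T @ dh                                       # adjoint equation
grads = grads[::-1]

# Finite-difference check on one entry of the first layer's weight.
eps, (i, j) = 1e-6, (1, 2)
Ws_pert = [Ws[0].copy(), Ws[1]]
Ws_pert[0][i, j] += eps
fd = (loss(Ws_pert) - loss(Ws)) / eps
print(grads[0][i, j], fd)                                  # should agree closely
```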

We now extend Lemma 3 to the batch setting. The following proposition will become useful as we proceed to the algorithm design in the next section.

Proposition 4. (Informal; see Appendix A for the full version) Let X_t ≜ {x^{(i)}_t}_{i=1}^{B} ∈ R^{Bn_t} denote the augmented state with batch size B. Consider the new objective J̄(ū; X_0) ≜ (1/B) Σ_i J(ū; x^{(i)}_0). Then, iteratively solving the "augmented" PMP equations is equivalent to applying mini-batch gradient descent with Back-propagation. Specifically, the derivative of the augmented Hamiltonian H̄ takes the exact form of the mini-batch gradient update:

\[
\nabla_{u_t} \bar{H}(X_t, P_{t+1}, u_t) = \frac{1}{B} \sum_{i=1}^{B} \nabla_{u_t} J(\bar{u}; x^{(i)}_0), \tag{11}
\]

where P_{t+1} = \frac{1}{B}[p^{(1)}_{t+1}, \cdots, p^{(B)}_{t+1}]^{\mathsf{T}}.

Algorithm 2 Back-propagation with Gradient Descent
  Input: nominal trajectory {x̄_nom, ū_nom}, learning rate η
  Set p_T = ∇_x φ
  for t = T−1 to 0 do
    δu_t = ∇_{u_t} H = ∇_{u_t}{σ_t(g_t(x_t, u_t))}^T p_{t+1} + ∇_{u_t} ℓ_t
    p_t = ∇_{x_t} H = ∇_{x_t}{σ_t(g_t(x_t, u_t))}^T p_{t+1} + ∇_{x_t} ℓ_t
  end for
  u_t ← u_t − η δu_t, ∀t ∈ {0, 1, ..., T−1}

Proposition 4 suggests that in the batch setting, we aim to find a single open-loop control that can drive every initial state in the batch to its designated target. Despite seeming trivial, this is a formulation distinct from OCP, since the optimal policy typically varies with the initial state.

It should be stressed that despite the appealing connection between optimal control and DNN training, the two methodologies are not completely interchangeable. We note that the dimension of the activation typically changes throughout the layers to extract effective latent representations. This unique property poses difficulties when one wishes to adopt analyses from the continuous-time framework. Despite the recent attention on treating networks as a discretization of an ODE (Lu et al., 2017; Weinan et al., 2018), we note that this formulation restricts the applicability of optimal control theory to networks with residual architectures, as opposed to the generic dynamics discussed here.

3.2. Trajectory Optimization Perspective

In this part we draw a new connection between the training procedure of DNNs and trajectory optimization. Let us first revisit Back-propagation with gradient descent from Alg. 2. During the forward propagation, we treat the weight as the nominal control ū that simulates the state trajectory x̄. Then, the loss gradient is propagated backward, implicitly moving in the direction suggested by the Hamiltonian. The control update, in this case the first-order derivative, is computed simultaneously and later applied to the system.

It should be clear at this point that Alg. 2 resembles DDP in several ways. Starting from the nominal trajectory, both procedures carry certain information wrt the objective, either p_t or {V^t_x, V^t_xx}, backward in time in order to compute the control update. Since DDP computes the feedback policy at each time step, additional forward simulation is required in order to apply the update. The computation graphs of the two processes are summarized in Fig. 1. In the following proposition we make this connection formal and provide conditions under which the two algorithms become equivalent.

Proposition 5. Assume Q^t_{ux} = 0 at all stages. Then the dynamics of the value derivative can be described by the adjoint dynamics, i.e. V^t_x = p_t, ∀t. In this case, DDP is equivalent to the stage-wise Newton method by having

\[
Q^t_u = H^t_u = \nabla_{u_t} J, \quad Q^t_{uu} = \nabla_{u_t u_t} J, \quad \forall t. \tag{12}
\]

If furthermore we have (Q^t_{uu})^{-1} = \eta I_{m_t}, then DDP degenerates to Back-propagation with gradient descent.


Table 2. Connection between existing algorithms and DDP

                 1st-order    Adaptive 1st-order    Stage-wise Newton
  Q_uu^{-1}      ηI           η diag(g)             J_uu^{-1}
  Q_ux           0            0                     0


We leave the full proof to Appendix C and provide some intuition for the connection between the first-order derivatives. First, notice that we always have p_T = V^T_x = ∇_x φ at the horizon T without any assumption. It can be shown by induction that when Q^t_{ux} = 0 for all the remaining stages, p_t follows the same update equation as V^t_x,

\[
\begin{aligned}
p_t &= H^t_x = \ell^t_x + f_x^{t\mathsf{T}} p_{t+1} \\
V^t_x &= Q^t_x - Q_{ux}^{t\mathsf{T}} (Q^t_{uu})^{-1} Q^t_u = \ell^t_x + f_x^{t\mathsf{T}} V'_{x'},
\end{aligned}
\]

and Eq. (12) follows immediately by

\[
Q^t_u = \ell^t_u + f_u^{t\mathsf{T}} V'_{x'} = \ell^t_u + f_u^{t\mathsf{T}} p_{t+1} = H^t_u = \nabla_{u_t} J.
\]

The feedback policy under this assumption degenerates to

\[
\delta u^*_t(\delta x_t) = -(Q^t_{uu})^{-1}(Q^t_u + Q^t_{ux}\delta x_t) = -(\nabla_{u_t u_t} J)^{-1} \nabla_{u_t} J,
\]

which is equivalent to the stage-wise Newton method (de O. Pantoja, 1988). It can be readily seen that we recover the gradient descent update by having an isotropic inverse covariance, (Q^t_{uu})^{-1} = \eta I_{m_t}.

Proposition 5 states that the backward pass in DDP collapses to Back-propagation when Q_{ux} vanishes. To better explain the role of Q_{ux} during optimization, consider the illustrative 2D example in Fig. 2. Given an arbitrary objective L expanded at (x̄, ū), standard second-order methods compute the Hessian wrt u and apply the update −L_{uu}^{-1} L_u. DDP differs from them in that it also computes the mixed partial derivative, i.e. L_{ux}. The resulting update law has the same open-loop term but an additional term linear in the state. This feedback term, shown with the red arrow, compensates when the state moves away from x̄ during the update. In Sec. 5.2, we will conduct an ablation study and discuss the effect of this deviation δx on DNN training. The connection between DDP and existing methods is summarized in Table 2.

Figure 2. A toy illustration of the difference between the open-loop and closed-loop policy, denoted in green and red. The contour represents the objective L expanded at (x̄, ū). The feedback policy in this case is a line lying at the valley of the objective.

4. DDP Optimizer for Feedforward Networks

In this section we discuss a practical implementation of DDP for training feedforward networks. Due to space constraints, we leave the complete derivation to Appendix B.

4.1. Batch DDP Formulation

We first derive the differential Bellman objective δQ when feedforward networks are used as the dynamics. Recall the propagation formula in Eq. (9) and substitute this new dynamics into OCP by setting f_t ≡ σ_t ∘ g_t. After some algebra, we can show that²

\[
\begin{aligned}
Q_x &= \ell_x + g_x^{\mathsf{T}} V_h, \qquad Q_u = \ell_u + g_u^{\mathsf{T}} V_h \\
Q_{xx} &= \ell_{xx} + g_x^{\mathsf{T}}(V_{hh} + V'_{x'} \cdot \sigma_{hh}) g_x + V_h \cdot g_{xx} \\
Q_{uu} &= \ell_{uu} + g_u^{\mathsf{T}}(V_{hh} + V'_{x'} \cdot \sigma_{hh}) g_u + V_h \cdot g_{uu} \\
Q_{ux} &= \ell_{ux} + g_u^{\mathsf{T}}(V_{hh} + V'_{x'} \cdot \sigma_{hh}) g_x + V_h \cdot g_{ux}
\end{aligned} \tag{13}
\]

where V_h ≜ σ_h^T V'_{x'} and V_{hh} ≜ σ_h^T V'_{x'x'} σ_h denote the derivatives of the value function at the pre-activation. Computing the layer-wise feedback policy and the value function remains the same as in Eq. (7) and (8).

The computational overhead in Eq. (13) can be mitigated by leveraging the structure of feedforward networks. Since the affine transform is bilinear in x and u, the terms g_{xx} and g_{uu} vanish, and the tensor g_{ux} admits a sparse structure. For fully-connected layers, the computation simplifies to

\[
\begin{aligned}
[g_{ux}]_{(i,j,k)} &= 1 \;\text{ iff }\; j = (k-1)n_{t+1} + i, \\
[V_h \cdot g_{ux}]_{((k-1)n_{t+1}:kn_{t+1},\,k)} &= V_h.
\end{aligned} \tag{14}
\]

For the coordinate-wise nonlinear transform, σ_h and σ_{hh} are a diagonal matrix and a diagonal tensor, respectively. In most learning instances, the stage-wise loss typically involves weight decay alone; thus the terms ℓ_x, ℓ_xx, ℓ_ux also vanish.
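The following sketch verifies the sparse structure in Eq. (14) for one fully-connected layer by comparing the explicit tensor contraction V_h · g_ux against the block-copy construction; the layer dimensions are illustrative assumptions.

```python
# Check the sparse contraction V_h . g_ux of Eq. (14) against a dense einsum.
import numpy as np

n_in, n_out = 4, 3                        # n_t and n_{t+1}
m = n_out * n_in + n_out                  # control dim: vec(W) plus bias
Vh = np.random.randn(n_out)               # value derivative at the pre-activation

# Dense route: build g_ux explicitly (shape n_out x m x n_in) and contract.
g_ux = np.zeros((n_out, m, n_in))
for i in range(n_out):
    for k in range(n_in):
        g_ux[i, k * n_out + i, k] = 1.0   # [g_ux]_(i,j,k) = 1 iff j = k*n_out + i (0-indexed)
dense = np.einsum('i,ijk->jk', Vh, g_ux)

# Sparse route: V_h . g_ux just stacks copies of V_h block-wise, as in Eq. (14).
sparse = np.zeros((m, n_in))
for k in range(n_in):
    sparse[k * n_out:(k + 1) * n_out, k] = Vh
assert np.allclose(dense, sparse)
print(sparse.shape)                       # (m, n_in): one V_h block per input unit
```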

Note that Eq. (13) describes the δQ expansion along a single trajectory with the dynamics given by the feedforward networks.

² We omit the subscript t again for simplicity.


To modify it for the setting of batch trajectory optimization, we augment the activation space to X_t ∈ R^{Bn_t} (recall B is the batch size), in the spirit of Proposition 4. It should be stressed, however, that despite drawing inspiration from the augmented Hamiltonian framework, the resulting representation does not admit a clean form such as averaging over individual updates as in Eq. (11). To see why this is the case, observe that at the horizon T, the derivative of the augmented value function can indeed be expressed by the individual ones as V^T_X = (1/B)[··· V^T_{x^{(i)}} ···]^T. Such an independent structure is not preserved through the backward pass, since at t = T−1, V^{T-1}_X instead takes the form

\[
\mathbf{V}^{T-1}_{X} = \frac{1}{B}
\begin{bmatrix} \vdots \\ Q^{T-1}_{x^{(i)}} \\ \vdots \end{bmatrix}
- \frac{1}{B}
\begin{bmatrix} \vdots \\ Q^{T-1}_{ux^{(i)}} \\ \vdots \end{bmatrix}^{\mathsf{T}}
(\bar{Q}^{T-1}_{uu})^{-1} \bar{Q}^{T-1}_{u},
\]

where \bar{Q}^{T-1}_{uu} and \bar{Q}^{T-1}_{u} are averages over Q^t_{uu}|_{x^{(i)}_t} and Q^t_u|_{x^{(i)}_t}. The intuition is that when optimizing batch trajectories with the same control law, the Bellman objective of each sample couples with the others through the second order expansion of the augmented Q. Consequently, the value function will no longer be independent soon after leaving the terminal horizon. We highlight this trait, which distinguishes batch DDP from its original representation.
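The coupling can be seen numerically in the minimal sketch below: for B = 2 samples sharing one control at the last stage, a block of V^{T-1}_X built from the averaged Q̄_uu and Q̄_u differs from the naive average of per-sample value derivatives. The linear dynamics and quadratic terminal cost are illustrative assumptions.

```python
# Batch coupling of the value derivative at t = T-1.
import numpy as np

np.random.seed(0)
n, m, B = 3, 2, 2
A, Bmat = np.random.randn(n, n), np.random.randn(n, m)
u = np.zeros(m)
xs = [np.random.randn(n) for _ in range(B)]        # batch of states at t = T-1

Qx, Qu, Quu, Qux = [], [], [], []
for x in xs:
    x_next = A @ x + Bmat @ u                      # per-sample dynamics
    Vx, Vxx = x_next, np.eye(n)                    # phi = 0.5 ||x_T||^2
    Qx.append(A.T @ Vx)
    Qu.append(Bmat.T @ Vx)
    Quu.append(Bmat.T @ Vxx @ Bmat + 1e-3 * np.eye(m))
    Qux.append(Bmat.T @ Vxx @ A)

Quu_bar, Qu_bar = sum(Quu) / B, sum(Qu) / B        # averaged (coupled) terms

# Block i of the augmented V_X^{T-1} vs. the naive per-sample average.
i = 0
coupled = (Qx[i] - Qux[i].T @ np.linalg.solve(Quu_bar, Qu_bar)) / B
per_sample = (Qx[i] - Qux[i].T @ np.linalg.solve(Quu[i], Qu[i])) / B
print(np.allclose(coupled, per_sample))            # typically False: blocks couple via Q_bar
```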

4.2. Practical Implementation

Directly applying the batch DDP to optimize DNNs can be impractical. For one, training inherits stochasticity due to mini-batch sampling. This differs from standard trajectory optimization, where the initial state remains unaltered throughout optimization. Secondly, the control dimension is typically orders of magnitude higher than that of the state; thus inverting the Hessian Q_uu soon becomes computationally infeasible. In this part we discuss several practical techniques, adopted from both literatures, that stabilize the optimization process and enable real-time training.

First, we apply Tikhonov regularization on Q_uu and line search, since both play key roles in the convergence of DDP (Liao & Shoemaker, 1991). We note that these regularizations have already shown up in robustifying DNN training, in the form of the l2-norm and the line search in Vaswani et al. (2019). From the perspective of trajectory optimization, we should emphasize that without regularization, Q_uu will lose its positive definiteness whenever f_u^T V'_{x'x'} f_u has low rank. This is indeed the case in feedforward networks. The observation on low-rank matrices also applies to f_x^T V'_{x'x'} f_x when the dimension of the activation reduces during forward propagation. Thus we also apply Tikhonov regularization to V_xx. This can be seen as placing a quadratic state cost and has been shown to improve robustness when optimizing complex humanoids (Tassa et al., 2012). Next, when using DDP for trajectory optimization, one typically has the option of expanding the dynamics up to first or second order. Note that both are still considered second-order methods and generate layer-wise feedback policies, except that the former simplifies the problem in a manner similar to Gauss-Newton. The computational benefit and stability obtained by keeping only the linearized dynamics have been discussed thoroughly in the robotics literature (Todorov & Li, 2005; Tassa et al., 2012). Thus, hereafter we will refer to this version as the DDP optimizer.
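A minimal sketch of the regularized gain computation described above is given below; the damping coefficients and matrix sizes are assumptions, not the paper's tuned values.

```python
# Tikhonov-regularized DDP gains and value Hessian.
import numpy as np

def regularized_gains(Qu, Quu, Qux, lam_u=1e-3):
    """Compute open-loop gain k and feedback gain K from a damped Q_uu."""
    Quu_reg = Quu + lam_u * np.eye(Quu.shape[0])     # Q_uu + lam * I
    k = -np.linalg.solve(Quu_reg, Qu)                # k_t = -Q_uu^{-1} Q_u
    K = -np.linalg.solve(Quu_reg, Qux)               # K_t = -Q_uu^{-1} Q_ux
    return k, K

def regularized_value_hessian(Vxx, lam_x=1e-3):
    """Damp V_xx before it enters Eq. (6); acts like a quadratic state cost."""
    return Vxx + lam_x * np.eye(Vxx.shape[0])

# Example with a rank-deficient Q_uu (as can happen for f_u^T V'_{x'x'} f_u).
Quu = np.outer(np.ones(3), np.ones(3))               # rank-1, not invertible by itself
k, K = regularized_gains(np.ones(3), Quu, np.ones((3, 2)))
print(k.shape, K.shape)                              # (3,) (3, 2)
```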

Tremendous effort has been spent recently on efficient curvature estimation of the loss landscape during training. This is particularly crucial in enabling the applicability of second-order methods, since inverting the full Hessian J_{uu}, even in a layer-wise fashion, can be computationally intractable, and DDP is no exception. Fortunately, DDP can be integrated with these advances. We first notice that many popular curvature factorization methods contain modules that collect certain statistics during training to compute a preconditioned update. Take EKFAC (George et al., 2018) for instance: the update for each layer is approximated by J_{u_t u_t}^{-1} J_{u_t} ≈ CurvApprox(J_{u_t}; {x_t, J_{h_t}}_{k=0}^{K}), where {·}_{k=0}^{K} denotes the history up to the most recent iteration K. The module bypasses the computation of the Hessian and its inverse by directly estimating the matrix-vector product. We can integrate CurvApprox with DDP by

\[
(Q^t_{uu})^{-1} Q^t_u \approx \text{CurvApprox}(Q^t_u; \{x_t, V^t_h\}_{k=0}^{K}), \tag{15}
\]

and compute Q_{uu}^{-1} Q_{ux} by applying Eq. (15) column-wise. Notice that we replace J_{h_t} with V_{h_t} since here we aim to approximate the landscape of the value function. Hereafter we refer to this approximation as DDP-EKFAC.

Integrating DDP with existing optimizers is not restricted to second-order methods. Recall the connection we made in Proposition 5. When Q_{uu}^{-1} is isotropic, DDP inherits the same structure as gradient descent. In a similar vein, adaptive first order methods, such as RMSprop (Hinton et al., 2012), approximate Q_{uu}^{-1} by diag(g), where g adapts to the diagonal of the inverse covariance. We refer to these two variants as DDP-SGD and DDP-RMSprop.
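The sketch below illustrates these degenerate choices of Q_uu^{-1} (cf. Table 2): an isotropic preconditioner yields a DDP-SGD-style update, while a diagonal RMSprop-style preconditioner yields a DDP-RMSprop-style update. The decay rate, epsilon, and step size are illustrative assumptions.

```python
# Swapping approximations of Q_uu^{-1} into the layer-wise DDP update.
import numpy as np

def ddp_update(Qu, Qux, dx, Quu_inv):
    """Layer-wise DDP update delta_u = -Quu_inv (Q_u + Q_ux dx), Eq. (7)."""
    return -Quu_inv @ (Qu + Qux @ dx)

m, n, eta = 4, 3, 0.1
Qu, Qux, dx = np.random.randn(m), np.random.randn(m, n), np.random.randn(n)

# DDP-SGD style: (Q_uu)^{-1} = eta * I, i.e. gradient descent plus a feedback term.
du_sgd = ddp_update(Qu, Qux, dx, eta * np.eye(m))

# DDP-RMSprop style: (Q_uu)^{-1} ~ eta * diag(g), with g tracking a running
# second moment of Q_u (the adaptive first-order column of Table 2).
second_moment = 0.9 * np.ones(m) + 0.1 * Qu**2
du_rms = ddp_update(Qu, Qux, dx,
                    eta * np.diag(1.0 / (np.sqrt(second_moment) + 1e-8)))
print(du_sgd.shape, du_rms.shape)
```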

5. Experiments

5.1. Batch Trajectory Optimization on Synthetic Data

We first demonstrate the effectiveness of our DDP optimizer in batch trajectory optimization. Recall from the derivation in Sec. 2.2 that the feedback policy is optimal only locally around the region where Q is expanded; thus only samples traveling through this region benefit. Since conceptually a DNN classifier can be thought of as a dynamical system guiding each group of trajectories toward the region belonging to its class, we hypothesize that for the DDP optimizer to show its effectiveness on batch training, the feedback policy must act as an ensemble that captures the feedback behavior of each class.


Table 3. Performance Comparison on Train Loss and Test Accuracy (%).

  Data set        SGD      RMSprop  KFAC     EKFAC    KWNG     DDP-SGD  DDP-RMSprop  DDP-EKFAC  DDP
  WINE            0.5654   0.5515   0.5601   0.5610   0.5565   0.5657   0.5515       0.5610     0.5649
                  (94.35)  (98.10)  (97.78)  (94.60)  (98.67)  (94.29)  (98.10)      (94.60)    (98.00)
  IRIS            0.0514   0.0942   0.0588   0.0514   0.0673   0.0505   0.0515       0.0567     0.0552
                  (93.00)  (91.48)  (94.28)  (94.86)  (94.07)  (93.79)  (94.81)      (92.72)    (94.44)
  DIGITS          0.0400   0.0562   0.4474   0.0864   0.1321   0.0364   0.0450       0.0666     0.2197
                  (95.28)  (94.52)  (84.61)  (94.53)  (93.83)  (95.43)  (94.71)      (95.36)    (91.94)
  MNIST           0.2321   0.2742   0.2964   0.2392   0.3689   0.2242   0.2291       0.2365     N/A
                  (92.60)  (91.30)  (92.02)  (92.79)  (88.92)  (92.84)  (92.42)      (92.67)
  FASHION-MNIST   0.4695   0.4202   0.4572   0.4257   0.4518   0.4717   0.4191       0.4204     N/A
                  (82.48)  (84.24)  (84.27)  (84.14)  (84.02)  (82.46)  (84.2)       (84.39)

Figure 3. Spectrum distribution of the last-layer feedback policy on synthetic data, with the legend denoting each training snapshot (panels for 5, 8, 12, and 15 clusters; x-axis: order of largest eigenvalues; y-axis: eigenvalues). The vertical dashed line represents our predicted cutoff number at which the policy degenerates.

To this end, we randomly generate synthetic data points from k ∈ {5, 8, 12, 15} Gaussian clusters in the input space R^30 and train 3-layer feedforward networks (30-20-20-k) from scratch using the DDP optimizer. For all settings, the training error drops to nearly zero. The spectrum of the feedback policy in the prediction layer, sorted in descending order, is shown in Fig. 3. The result shows that the number of nontrivial eigenvalues in Q_uu^{-1} Q_{uX} matches exactly the number of classes in each problem (indicated by the vertical dashed line). As the state distribution concentrates into k clusters through training, the eigenvalues also increase, providing a stronger feedback direction for the weight update. This consolidates our batch DDP formulation in Sec. 4.1.

5.2. Performance on Classification Data Set

Next, we validate the performance of the DDP optimizer, along with the several variants proposed in the previous section, on classification tasks. In addition to baselines such as SGD and RMSprop (Hinton et al., 2012), we compare with state-of-the-art second order optimizers, including KFAC (Martens & Grosse, 2015), EKFAC (George et al., 2018), and KWNG (Arbel & Montufar, 2020).

Figure 4. Performance (training loss and test accuracy vs. iterations) on DIGITS and MNIST. We report the DDP-integrated optimizers and the other top two methods.

We use (fully-connected) feedforward networks for all experiments. The complete experiment setup and additional results (e.g. accuracy curves) are provided in Appendix D. Table 3 summarizes the performance, with the top two results highlighted. Without any factorization, we are not able to obtain results for DDP on (Fashion-)MNIST due to the exploding computation when inverting Q_uu.

On all data sets, DDP achieves better or comparable results against existing methods. Notably, when comparing the vanilla DDP with its approximate variants (see Fig. 4), the latter typically admit smaller variance and sometimes converge to better local minima. The instability of the former can be caused by overestimation of the value Hessian when using only mini-batch data, which is mitigated by the amortized method used in EKFAC, or by the diagonal approximation in first-order methods. This sheds light on the benefit gained by bridging two seemingly disconnected methodologies.

5.3. Effect of Feedback Policies

Sensitivity to Hyper-parameters: For optimizers that are compatible with DDP, we observe minor improvements in the final training loss. To better understand the effect of the feedback mechanism, we conduct an ablation analysis and compare training performance across different learning rate setups, while keeping all other hyper-parameters the same.


Figure 5. Ablation study on existing optimizers (SGD, EKFAC, RMSprop) and their DDP versions over different learning rates (training loss vs. training iterations; panels show learning rates 8e-1 and 5e-1 for SGD, 1e-1 and 5e-2 for EKFAC, 5e-4 and 1e-3 for RMSprop). The feedback mechanism helps stabilize the training when the learning rate is perturbed from the best tuned value (shown in the left column).

The result on DIGITS is shown in Fig. 5, where the left and right columns correspond to the best-tuned and perturbed learning rates. While all optimizers are able to converge to low training error when best-tuned, the ones integrated with DDP tend to stabilize training when the learning rate changes by some fraction; sometimes they also achieve lower training error in the end. The performance improvement induced by the feedback policy typically becomes more obvious as the learning rate increases.

Variance Reduction and Faster Convergence: Table 4 shows the variance differences in training loss and test accuracy after the optimizers are integrated with DDP feedback modules. Specifically, we report the value (Var_DDP-Alg − Var_Alg)/Var_Alg, with Alg being an optimizer compatible with DDP, including SGD, RMSprop, and EKFAC. We keep all hyper-parameters the same for each experiment so that the performance difference comes only from the existence of feedback policies. As shown in Table 4, in most cases having the additional updates from DDP stabilizes the training dynamics by reducing its variance, and a similar reduction can be observed in the variance of test accuracy over optimization, sometimes by a large margin. Empirically, we find that such stabilization may also lead to faster convergence. In Fig. 7, we report the training dynamics of the top three experiment instances in Table 4 that have large variance reduction when using DDP. In all cases, the optimization paths also admit faster convergence.

Figure 6. Overhead in the DDP backward pass compared with the computation spent on the EKFAC curvature approximation (red line). Panels vary batch size (Hid Dim=32, 6 layers), hidden state dimension (batch size 16, 6 layers), and number of layers (Hid Dim=32, batch size 16); y-axis: (T − T_EKFAC module)/T_EKFAC module. We leave out the DDP curve in the middle inset to better visualize the overhead of DDP-EKFAC wrt the hidden dimension.

To give an explanation from the trajectory optimization viewpoint, recall that DDP computes layer-wise policies, each composed of two terms. The open-loop control is state-independent and, in this setup, equivalent to the original update computed by each optimizer. The feedback term is linear in δx_t, which can be expressed by δx_t = f_{0:t}(x_0, u^{(k+1)}_{0:t}) − f_{0:t}(x_0, u^{(k)}_{0:t}), where we abuse notation and denote f_{0:t} as the compositional dynamics propagating x_0 with u_{0:t} ≜ {u_i}_{i=0}^{t}. In other words, δx_t captures the state differential when the new control is applied up to layer t, and we should expect it to increase when a larger step size, and thus a larger difference between controls, is taken at each layer. This implies the control update in the later layers may not be optimal (at least from the trajectory optimization viewpoint) since the gradient is evaluated at x_t instead of x^*_t ≜ x_t + δx_t. The difference between ∇_{u_t} J(x_t, u_t) and ∇_{u_t} J(x^*_t, u_t) cannot be neglected, especially during early training when the objective landscape contains nontrivial curvature everywhere (Alain et al., 2019). The feedback direction is designed to mitigate this gap. In fact, we can show that K_t δx_t approximately solves

\[
\arg\min_{\Gamma(\delta x_t)} \; \| \nabla_{u_t} J(x^*_t, u_t + \delta u_t) - \nabla_{u_t} J(x_t, u_t) \|, \tag{16}
\]

with the derivation provided in Appendix C. Thus, by having this additional term throughout the updates, we stabilize the training dynamics by traveling through the landscape cautiously without going off the cliff.
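A minimal sketch of this mechanism: re-propagate the input with the updated controls, measure the state differential δx_t, and form the feedback correction K_t δx_t. The network sizes and the toy feedback gain are assumptions for illustration only.

```python
# State differential delta_x_t from re-propagation, and its feedback correction.
import numpy as np

def propagate(x0, weights):
    """Compositional dynamics f_{0:t}: returns activations at every layer."""
    xs = [x0]
    for W in weights:
        xs.append(np.tanh(W @ xs[-1]))
    return xs

np.random.seed(0)
x0 = np.random.randn(5)
weights_old = [np.random.randn(5, 5) * 0.5 for _ in range(3)]
# Pretend an optimizer produced new controls (weights) for the earlier layers.
weights_new = [W - 0.1 * np.random.randn(*W.shape) for W in weights_old]

xs_old, xs_new = propagate(x0, weights_old), propagate(x0, weights_new)
t = 2
dx_t = xs_new[t] - xs_old[t]             # delta_x_t = f_{0:t}(x0, u^(k+1)) - f_{0:t}(x0, u^(k))
K_t = np.random.randn(5 * 5, 5) * 0.01   # illustrative feedback gain for layer t
du_feedback = K_t @ dx_t                 # correction added to the open-loop update
print(np.linalg.norm(dx_t), du_feedback.shape)
```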

5.4. Remarks on Computation Overhead

Lastly, Fig. 6 summarizes the computational overhead during the DDP backward pass on DIGITS. Specifically, we compare the wall-clock processing time of DDP-EKFAC and DDP to the computation spent on the curvature approximation module in the DDP-EKFAC optimizer.


Table 4. Variance differences in training loss (test accuracy) after optimizers are integrated with DDP. A negative value indicates a reduction of variance over optimization and vice versa.

  Data set   SGD vs DDP-SGD       RMSprop vs DDP-RMSprop   EKFAC vs DDP-EKFAC
  WINE       -1.312% (-0.732%)    9.263% (-1.813%)         0.179% (0.177%)
  IRIS       -2.474% (-4.021%)    1.038% (-1.854%)         -4.601% (-0.713%)
  DIGITS     -1.514% (-6.743%)    -3.165% (-6.468%)        -9.020% (-39.971%)
  MNIST      0.331% (-5.263%)     -1.352% (-32.503%)       -0.216% (6.344%)
  FMNIST     -1.324% (-4.767%)    3.464% (20.254%)         -2.327% (4.443%)

The value thus suggests the additional overhead required for EKFAC to adopt the DDP framework, which includes the second-order expansion of the value function, computing layer-wise feedback policies, etc. While the vanilla batch DDP scales poorly with the dimension of the hidden state under the current batch formulation, the overhead in DDP-EKFAC increases only by a constant (from 1.1 to 1.6) wrt different architecture setups, such as batch size, hidden dimension, and network depth. We stress that our current implementation is not fully optimized, so there is still room for further acceleration.

6. Conclusion

In this work, we introduce the Differential Dynamic Programming Neural Optimizer (DDP), a new class of algorithms arising from bridging DNN training with optimal control theory and trajectory optimization. This new perspective suggests that existing methods stand as special cases of DDP and can be extended to adopt the framework painlessly. The resulting optimizer features layer-wise feedback policies which help reduce sensitivity to hyper-parameters and sometimes improve convergence. We hope this work provides new algorithmic insights and bridges the deep learning and optimal control communities. There are numerous future directions left to explore. For one, deriving update laws for other architectures (e.g. convolution and batch-norm) and applying factorization to other large matrices are essential to the applicability of the DDP optimizer.

References

Alain, G., Roux, N. L., and Manzagol, P.-A. Negative eigenvalues of the Hessian in deep neural networks. arXiv preprint arXiv:1902.02366, 2019.

Arbel, M., Gretton, A., Li, W., and Montufar, G. Kernelized Wasserstein natural gradient. In International Conference on Learning Representations, 2020.

Bellman, R. The theory of dynamic programming. Technical report, RAND Corporation, Santa Monica, CA, 1954.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.

Boltyanskii, V. G., Gamkrelidze, R. V., and Pontryagin, L. S. The theory of optimal processes. I. The maximum principle. Technical report, TRW Space Technology Labs, Los Angeles, CA, 1960.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6572–6583, 2018.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pp. 2067–2075, 2015.

de O. Pantoja, J. Differential dynamic programming and Newton's method. International Journal of Control, 47(5):1539–1553, 1988.

George, T., Laurent, C., Bouthillier, X., Ballas, N., and Vincent, P. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Advances in Neural Information Processing Systems, pp. 9550–9560, 2018.

Greydanus, S., Dzamba, M., and Yosinski, J. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pp. 15353–15363, 2019.

Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. 2012.

Hu, B. and Lessard, L. Control interpretations for first-order optimization methods. In 2017 American Control Conference (ACC), pp. 3114–3119. IEEE, 2017.

Huh, M., Sun, S.-H., and Zhang, N. Feedback adversarial learning: Spatial feedback for improving generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1476–1485, 2019.

Jacobson, D. H. and Mayne, D. Q. Differential dynamic programming. 1970.


Li, Q. and Hao, S. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv preprint arXiv:1803.01299, 2018.

Li, Q., Chen, L., Tai, C., and Weinan, E. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.

Li, Z., Yang, J., Liu, Z., Yang, X., Jeon, G., and Wu, W. Feedback network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3867–3876, 2019.

Liao, L.-Z. and Shoemaker, C. A. Convergence in unconstrained discrete-time differential dynamic programming. IEEE Transactions on Automatic Control, 36(6):692–706, 1991.

Liao, L.-Z. and Shoemaker, C. A. Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems. Technical report, Cornell University, 1992.

Liu, G.-H. and Theodorou, E. A. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv preprint arXiv:1908.10920, 2019.

Liu, X., Xiao, T., Si, S., Cao, Q., Kumar, S., and Hsieh, C.-J. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019.

Long, Z., Lu, Y., Ma, X., and Dong, B. PDE-Net: Learning PDEs from data. arXiv preprint arXiv:1710.09668, 2017.

Lu, Y., Zhong, A., Li, Q., and Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762, 2019.

Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.

Mayne, D. Q. Differential dynamic programming–a unified approach to the optimization of dynamic systems. In Control and Dynamic Systems, volume 10, pp. 179–254. Elsevier, 1973.

Posa, M., Kuindersma, S., and Tedrake, R. Optimization and stabilization of trajectories for constrained dynamical systems. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1366–1373. IEEE, 2016.

Seidman, J. H., Fazlyab, M., Preciado, V. M., and Pappas, G. J. Robust deep learning as optimal control: Insights and convergence guarantees. Proceedings of Machine Learning Research, vol. xxx, 1:14, 2020.

Shama, F., Mechrez, R., Shoshan, A., and Zelnik-Manor, L. Adversarial feedback loop. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3205–3214, 2019.

Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913. IEEE, 2012.

Tassa, Y., Mansard, N., and Todorov, E. Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1168–1175. IEEE, 2014.

Todorov, E. Optimal control theory. Bayesian Brain: Probabilistic Approaches to Neural Coding, pp. 269–298, 2016.

Todorov, E. and Li, W. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pp. 300–306. IEEE, 2005.

Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., and Lacoste-Julien, S. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, pp. 3727–3740, 2019.

Weinan, E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Weinan, E., Han, J., and Li, Q. A mean-field optimal control formulation of deep learning. arXiv preprint arXiv:1807.01083, 2018.

Wen, H., Han, K., Shi, J., Zhang, Y., Culurciello, E., and Liu, Z. Deep predictive coding network for object recognition. arXiv preprint arXiv:1802.04762, 2018.

Zamir, A. R., Wu, T.-L., Sun, L., Shen, W. B., Shi, B. E., Malik, J., and Savarese, S. Feedback networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1308–1317, 2017.

Zhong, Y. D., Dey, B., and Chakraborty, A. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077, 2019.


Supplementary Material

A. Derivation of Batch PMP

A.1. Problem Formulation and Notation

Recall the original OCP for single trajectory optimization. In its batch setting, we consider the following state-augmented optimal control problem:

\[
\min_{\bar{u}} \; \bar{J}(\bar{u}; X_0) := \Big[ \Phi(X_T) + \sum_{t=0}^{T-1} L_t(X_t, u_t) \Big],
\quad \text{s.t. } X_{t+1} = F_t(X_t, u_t), \tag{17}
\]

where X_t ≜ [··· x^{(i)T}_t ···]^T ∈ R^{Bn_t} denotes the state-augmented vector over each x^{(i)}_t ∈ R^{n_t}, ∀i ∈ {1, 2, ..., B}, and B denotes the batch size. L_t(X_t, u_t) and Φ(X_T) respectively represent the average intermediate and terminal costs over ℓ_t(x^{(i)}_t, u_t) and φ(x^{(i)}_T). F_t consists of B independent mappings from each dynamics f_t(x^{(i)}_t, ·). Consequently, its derivatives can be related to the ones for each sample by

derivatives can be related to the ones for each sample by

∇utFt(Xt,ut) ≡ Ftu ∈ RBnt′×mt , where [Ftu]

(B(i)nt′,:)

= f tu|(i)

∇XtFt(Xt,ut) ≡ FtX ∈ RBnt′×Bnt , where [FtX ](B

(i)nt′,B

(j)nt )

= δijftx|(i) ,

∇utXtFt(Xt,ut) ≡ FtuX ∈ RBnt′×mt×Bnt , where [FtuX ]

(B(i)nt′,:,B

(j)nt )

= δijftux|(i) , (18)

∇XtXtFt(Xt,ut) ≡ FtXX ∈ RBnt′×Bnt×Bnt , where [FtXX ]

(B(i)nt′,B

(j)nt ,B

(k)nt )

= δijkftxx|(i) ,

∇ututFt(Xt,ut) ≡ Ftuu ∈ RBnt′×mt×mt , where [Ftuu](B

(i)nt′,:,:)

= f tuu|(i) .

δ_{ij} and δ_{ijk} represent the Kronecker delta. For simplicity, we denote t' ≜ t+1, B^{(i)}_d ≜ ((i-1)d : id) as the indices of the i-th block, each with dimension d, and f^t_{ux}|^{(i)} ≜ ∇_{u_t x_t} f_t(x^{(i)}_t, u_t) as the derivative of the dynamics of sample i. Notice again that t appears in the subscript of the gradient since we allow the dimension of X_t to change at each layer.

A.2. Formal Statement of Proposition 4

Proposition 6. Let ū^* be the optimal control trajectory for the augmented problem described in Eq. (17) and X̄^* be the corresponding augmented state trajectory. Then, there exists a co-state trajectory P̄^* ≜ {P^*_t}_{t=1}^{T}, such that the following "augmented" PMP equations are satisfied:

\[
\begin{aligned}
X^*_{t+1} &= \nabla_{P_{t+1}} \bar{H}(X^*_t, P^*_{t+1}, u^*_t), \quad && X^*_0 = X_0, && \text{(19a)} \\
P^*_t &= \nabla_{X_t} \bar{H}(X^*_t, P^*_{t+1}, u^*_t), \quad && P^*_T = \nabla_{X_T} \Phi(X^*_T), && \text{(19b)} \\
u^*_t &= \arg\min_{v \in \mathbb{R}^{m_t}} \bar{H}(X^*_t, P^*_{t+1}, v), && && \text{(19c)}
\end{aligned}
\]

where the augmented Hamiltonian \bar{H} : \mathbb{R}^{Bn_t} \times \mathbb{R}^{Bn_{t+1}} \times \mathbb{R}^{m_t} \mapsto \mathbb{R} at each time step is given by

\[
\bar{H}(X_t, P_{t+1}, u_t) = P_{t+1}^{\mathsf{T}} F_t(X_t, u_t) + L_t(X_t, u_t). \tag{20}
\]

Furthermore, solving Eq. (19c) iteratively by u^{(k+1)}_t = u^{(k)}_t − η ∇_{u_t} \bar{H}(X_t, P_{t+1}, u^{(k)}_t) is equivalent to applying mini-batch gradient descent with Back-propagation.

Proof. It is easy to see that Eq. (19a) forward propagates each x^{(i)}_t. To bridge Eq. (19b) to mini-batch Back-propagation, we first notice that the augmented co-state at the terminal time admits a simple form:

\[
P_T = \nabla_{X_T} \Phi(X_T) = \frac{1}{B}
\begin{bmatrix} \nabla_{x_T} \phi(x^{(1)}_T) \\ \vdots \\ \nabla_{x_T} \phi(x^{(B)}_T) \end{bmatrix}
= \frac{1}{B}
\begin{bmatrix} p^{(1)}_T \\ \vdots \\ p^{(B)}_T \end{bmatrix}. \tag{21}
\]


In other words, P_T is a collection over each co-state p^{(i)}_T, normalized by the batch size. We can show by induction that such a structure is preserved throughout the adjoint dynamics in Eq. (19b):

\[
P_t = \nabla_{X_t} \bar{H}(X_t, P_{t+1}, u_t) = \mathbf{L}^t_X + \mathbf{F}^{t\mathsf{T}}_X P_{t+1}
= \frac{1}{B}
\begin{bmatrix}
\nabla_{x_t} \ell_t(x^{(1)}_t, u_t) + \nabla_{x_t} f_t(x^{(1)}_t, u_t)^{\mathsf{T}} p^{(1)}_{t+1} \\
\vdots \\
\nabla_{x_t} \ell_t(x^{(B)}_t, u_t) + \nabla_{x_t} f_t(x^{(B)}_t, u_t)^{\mathsf{T}} p^{(B)}_{t+1}
\end{bmatrix}
= \frac{1}{B}
\begin{bmatrix} p^{(1)}_t \\ \vdots \\ p^{(B)}_t \end{bmatrix}. \tag{22}
\]

Finally, the update direction is given by

\[
\begin{aligned}
\nabla_{u_t} \bar{H}(X_t, P_{t+1}, u_t) &= \mathbf{L}^t_u + \mathbf{F}^{t\mathsf{T}}_u P_{t+1} \\
&= \frac{1}{B}\sum_i \nabla_{u_t} \ell_t(x^{(i)}_t, u_t) + \frac{1}{B}\sum_i [\mathbf{F}^t_u]^{\mathsf{T}}_{(B^{(i)}_{n_{t+1}},\,:)} p^{(i)}_{t+1} \\
&= \frac{1}{B}\sum_i \Big( \nabla_{u_t} \ell_t(x^{(i)}_t, u_t) + \nabla_{u_t} f_t(x^{(i)}_t, u_t)^{\mathsf{T}} p^{(i)}_{t+1} \Big) \\
&= \frac{1}{B}\sum_i \nabla_{u_t} H(x^{(i)}_t, p^{(i)}_{t+1}, u_t) = \frac{1}{B}\sum_i \nabla_{u_t} J(\bar{u}; x^{(i)}_0),
\end{aligned} \tag{23}
\]

which is the exact mini-batch gradient update.

Remarks: It should be stressed that the relation between P_t and p_t by simply taking the average is only true when first-order derivatives are involved. In Appendix B.2, we show that the backward pass for the augmented value function does not admit this property and therefore can be much more complex.

B. Derivation of Batch DDP for Feedforward Networks

In this part we provide the complete derivation for Sec. 4 on optimizing feedforward networks with the batch DDP optimizer.

B.1. Feedforward Networks as Dynamical Systems

We first consider the original DDP formulation when feedforward networks are used as the dynamics. Here we provide derivations for the new δQ expansion (i.e. Eq. (13)) and how it can be computed efficiently (i.e. Eq. (14)).

Derivation of Eq. (13). It is sufficient to derive Q_u and Q_uu only, which are given respectively by

\[
Q_u = \ell_u + f_u^{\mathsf{T}} V'_{x'} = \ell_u + g_u^{\mathsf{T}} \sigma_h^{\mathsf{T}} V'_{x'}, \tag{24}
\]

\[
\begin{aligned}
Q_{uu} &= \ell_{uu} + \frac{\partial}{\partial u}\{ g_u^{\mathsf{T}} \sigma_h^{\mathsf{T}} V'_{x'} \} \\
&= \ell_{uu} + g_u^{\mathsf{T}} \sigma_h^{\mathsf{T}} \frac{\partial}{\partial u}\{V'_{x'}\} + g_u^{\mathsf{T}} \Big(\frac{\partial}{\partial u}\{\sigma_h\}\Big)^{\mathsf{T}} V'_{x'} + \Big(\frac{\partial}{\partial u}\{g_u\}\Big)^{\mathsf{T}} \sigma_h^{\mathsf{T}} V'_{x'} \\
&= \ell_{uu} + g_u^{\mathsf{T}} \sigma_h^{\mathsf{T}} V'_{x'x'} \sigma_h g_u + g_u^{\mathsf{T}} (V'^{\mathsf{T}}_{x'} \sigma_{hh} g_u) + g_{uu}^{\mathsf{T}} \sigma_h^{\mathsf{T}} V'_{x'} \\
&= \ell_{uu} + g_u^{\mathsf{T}} (V_{hh} + V'_{x'} \cdot \sigma_{hh}) g_u + V_h \cdot g_{uu}. \tag{25}
\end{aligned}
\]

The last equation follows by recalling V_h ≜ σ_h^T V'_{x'} and V_{hh} ≜ σ_h^T V'_{x'x'} σ_h.

Derivation of Eq. (14). First, recall the affine transform g_t = W_t x_t + b_t, where W_t ∈ R^{n_{t+1} × n_t} and b_t ∈ R^{n_{t+1}}. The i-th element of the vector g_t can be written as

\[
[g]_i = \sum_{k=1}^{n_t} [W]_{(i,k)} [x]_k + [b]_i = \sum_{k=1}^{n_t} [u]_{(k-1)n_{t+1}+i} [x]_k + [u]_{n_t n_{t+1} + i}, \tag{26}
\]

where we omit the subscript t for simplicity and introduce the vectorized variable u = [vec(W), b]^T. It can be readily verified that

\[
[g_{ux}]_{(i,j,k)} = 1 \;\text{ iff }\; j = (k-1)n_{t+1} + i, \quad \forall k \in \{1, \cdots, n_t\}. \tag{27}
\]


Thus, $g_{ux}$ is a sparse tensor with each block $[g_{ux}]_{(:,\,B^{(k)}_{n_{t+1}},\,k)}$ being an identity matrix $I_{n_{t+1}}$, and its tensor operation can be computed by
$$[V_h \cdot g_{ux}]_{(j,k)} = \sum_{i=1}^{n_{t+1}} [g_{ux}]_{(i,j,k)}[V_h]_i = [V_h]_{j-(k-1)n_{t+1}} \,, \quad \forall k \in \{1,\cdots,n_t\} \,. \quad (28)$$

In other words, Vh · gux consists of nt vectors Vh, with the indices given by Eq. (14).
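The sketch below contrasts the literal tensor contraction of Eq. (27)-(28) with the tiling shortcut described above; the toy layer widths are arbitrary assumptions.

# A small sketch of Eq. (27)-(28): g_ux is a sparse 0/1 tensor, so the contraction
# V_h . g_ux reduces to tiling the vector V_h into the blocks indexed by k.
import numpy as np

n_t, n_tp1 = 3, 2                       # layer input / output widths (assumed)
m = n_t * n_tp1 + n_tp1                 # dim of u = [vec(W), b]
V_h = np.arange(1.0, n_tp1 + 1.0)       # dummy V_h vector

# Dense construction of g_ux from Eq. (27) (0-indexed: j = k*n_tp1 + i).
g_ux = np.zeros((n_tp1, m, n_t))
for i in range(n_tp1):
    for k in range(n_t):
        g_ux[i, k * n_tp1 + i, k] = 1.0

dense = np.einsum('i,ijk->jk', V_h, g_ux)       # literal tensor contraction

# Efficient version: place a copy of V_h in block k of column k.
tiled = np.zeros((m, n_t))
for k in range(n_t):
    tiled[k * n_tp1:(k + 1) * n_tp1, k] = V_h

assert np.allclose(dense, tiled)        # both match Eq. (28)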

B.2. Batch DDP Formulation

Next, we extend the formulation in Appendix B.1 to the batch setting by adopting the same notation as in Appendix A.1. Given the augmented Bellman objective $Q(X_t, u_t) \triangleq L_t(X_t, u_t) + V_{t+1}(F_t(X_t, u_t))$, we modify Eq. (13) by substituting $F_t \equiv \sigma_t \circ G_t$,

$$
\begin{aligned}
Q^t_X &= L^t_X + F^{tT}_X V'_{X'} = L^t_X + G^{tT}_X V^t_H &(29a)\\
Q^t_u &= L^t_u + F^{tT}_u V'_{X'} = L^t_u + G^{tT}_u V^t_H &(29b)\\
Q^t_{XX} &= L^t_{XX} + F^{tT}_X V'_{X'X'} F^t_X + V'_{X'}\cdot F^t_{XX} = L^t_{XX} + G^{tT}_X (V^t_{HH} + V'_{X'}\cdot\sigma^t_{HH}) G^t_X + V^t_H\cdot G^t_{XX} &(29c)\\
Q^t_{uu} &= L^t_{uu} + F^{tT}_u V'_{X'X'} F^t_u + V'_{X'}\cdot F^t_{uu} = L^t_{uu} + G^{tT}_u (V^t_{HH} + V'_{X'}\cdot\sigma^t_{HH}) G^t_u + V^t_H\cdot G^t_{uu} &(29d)\\
Q^t_{uX} &= L^t_{uX} + F^{tT}_u V'_{X'X'} F^t_X + V'_{X'}\cdot F^t_{uX} = L^t_{uX} + G^{tT}_u (V^t_{HH} + V'_{X'}\cdot\sigma^t_{HH}) G^t_X + V^t_H\cdot G^t_{uX} \,, &(29e)
\end{aligned}
$$

where $V^t_H$ and $V^t_{HH}$ are given by
$$V^t_H \triangleq \sigma^{tT}_H V'_{X'} \,, \quad V^t_{HH} \triangleq \sigma^{tT}_H V'_{X'X'}\,\sigma^t_H \,. \quad (30)$$

Once we have all the $Q$ derivatives explicitly written, computing the layer-wise feedback policies and backward-passing the derivatives of the augmented value function follow the same equations as in the original DDP, i.e.

$$\delta u^*_t(\delta X_t) = k_t + K_t \delta X_t = -(Q^t_{uu})^{-1}\big(Q^t_u + Q^t_{uX}\delta X_t\big) \,, \quad (31)$$
$$
\begin{aligned}
V^t_X &= Q^t_X - Q^{tT}_{uX}(Q^t_{uu})^{-1} Q^t_u = Q^t_X + Q^{tT}_{uX} k_t \,, \\
V^t_{XX} &= Q^t_{XX} - Q^{tT}_{uX}(Q^t_{uu})^{-1} Q^t_{uX} = Q^t_{XX} + Q^{tT}_{uX} K_t \,.
\end{aligned}
\quad (32)
$$
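For concreteness, a minimal sketch of one such backward step is given below; it takes the $Q$-derivatives of a single (batch-stacked) layer and returns the gains and value derivatives of Eq. (31)-(32). The Tikhonov term eps mirrors the regularization on $Q_{uu}$ discussed in Appendix D, and all dimensions are illustrative assumptions.

# One backward step of Eq. (31)-(32): compute the open-loop gain k_t, the feedback
# gain K_t, and the value derivatives passed to layer t-1.
import numpy as np

def ddp_backward_step(Q_X, Q_u, Q_XX, Q_uu, Q_uX, eps=1e-3):
    m = Q_uu.shape[0]
    Q_uu_reg = Q_uu + eps * np.eye(m)            # Tikhonov-regularized curvature
    k = -np.linalg.solve(Q_uu_reg, Q_u)          # k_t = -(Q_uu)^{-1} Q_u
    K = -np.linalg.solve(Q_uu_reg, Q_uX)         # K_t = -(Q_uu)^{-1} Q_uX
    V_X = Q_X + Q_uX.T @ k                       # Eq. (32), first line
    V_XX = Q_XX + Q_uX.T @ K                     # Eq. (32), second line
    return k, K, V_X, V_XX

# Toy usage with random symmetric curvature blocks.
rng = np.random.default_rng(2)
n, m = 6, 4                                      # (batch-stacked) state and control dims
A = rng.standard_normal((m, m)); Q_uu = A @ A.T + np.eye(m)
C = rng.standard_normal((n, n)); Q_XX = C @ C.T
k, K, V_X, V_XX = ddp_backward_step(
    rng.standard_normal(n), rng.standard_normal(m), Q_XX, Q_uu,
    rng.standard_normal((m, n)))
print(k.shape, K.shape, V_X.shape, V_XX.shape)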

To see whether the backward pass of the augmented value function inherits any implicit structure similar to the average relation in Eq. (22), let us rewrite Eq. (29) in a block-wise fashion by recalling Eq. (18):

$$
\begin{aligned}
Q^t_u &= \frac{1}{B}\sum_i \ell^t_u|_{(i)} + \sum_i f^t_u|_{(i)}^{T}\, [V'_{X'}]_{B^{(i)}_{n_{t'}}} &(33a)\\
[Q^t_X]_{B^{(i)}_{n_t}} &= \frac{1}{B}\,\ell^t_{x^{(i)}} + f^t_x|_{(i)}^{T}\, [V'_{X'}]_{B^{(i)}_{n_{t'}}} &(33b)\\
[Q^t_{uX}]_{(:,\,B^{(i)}_{n_t})} &= \frac{1}{B}\,\ell^t_{ux^{(i)}} + \sum_j f^t_u|_{(j)}^{T}\, [V'_{X'X'}]_{(B^{(j)}_{n_{t'}},\,B^{(i)}_{n_{t'}})}\, f^t_x|_{(i)} + [V'_{X'}]_{B^{(i)}_{n_{t'}}}\cdot f^t_{ux}|_{(i)} &(33c)\\
[Q^t_{XX}]_{(B^{(i)}_{n_t},\,B^{(j)}_{n_t})} &= \frac{1}{B}\,\delta_{ij}\ell^t_{x^{(i)}x^{(j)}} + f^t_x|_{(i)}^{T}\, [V'_{X'X'}]_{(B^{(i)}_{n_{t'}},\,B^{(j)}_{n_{t'}})}\, f^t_x|_{(j)} + \delta_{ij}\big([V'_{X'}]_{B^{(i)}_{n_{t'}}}\cdot f^t_{xx}|_{(j)}\big) &(33d)\\
Q^t_{uu} &= \frac{1}{B}\sum_i \ell^t_{uu}|_{(i)} + \sum_{ij} f^t_u|_{(i)}^{T}\, [V'_{X'X'}]_{(B^{(i)}_{n_{t'}},\,B^{(j)}_{n_{t'}})}\, f^t_u|_{(j)} + \sum_i [V'_{X'}]_{B^{(i)}_{n_{t'}}}\cdot f^t_{uu}|_{(i)} &(33e)
\end{aligned}
$$

Notice that at t = T , the derivatives of the augmented value function are related to the ones for each sample by:

$$[V^T_X]_{B^{(i)}_{n_T}} = \frac{1}{B}\,V^T_{x^{(i)}} \,, \;\text{ and }\; [V^T_{XX}]_{(B^{(i)}_{n_T},\,B^{(j)}_{n_T})} = \frac{1}{B}\,\delta_{ij}\, V^T_{x^{(i)}x^{(j)}} \,. \quad (34)$$


Substituting them into Eq. (33), one can show that the derivatives of the augmented $Q$ at $t = T-1$ can indeed be expressed by the composition of each $Q^{(i)}$:

$$
\begin{aligned}
Q^{T-1}_u &= \frac{1}{B}\sum_i Q^{T-1}_u|_{(i)} \,, \quad
[Q^{T-1}_X]_{B^{(i)}_{n_t}} = \frac{1}{B}\, Q^{T-1}_{x^{(i)}} \,, \quad
[Q^{T-1}_{uX}]_{(:,\,B^{(i)}_{n_t})} = \frac{1}{B}\, Q^{T-1}_{ux^{(i)}} \,, \\
[Q^{T-1}_{XX}]_{(B^{(i)}_{n_t},\,B^{(j)}_{n_t})} &= \delta_{ij}\frac{1}{B}\, Q^{T-1}_{x^{(i)}x^{(j)}} \,, \quad
Q^{T-1}_{uu} = \frac{1}{B}\sum_i Q^{T-1}_{uu}|_{(i)} \,.
\end{aligned}
\quad (35)
$$

However, $V^{T-1}_X$ and $V^{T-1}_{XX}$ will no longer preserve the structure described in Eq. (34). In fact, we will have

$$[V^{T-1}_X]_{B^{(i)}_{n_{T-1}}} = \frac{1}{B}\Big(Q^{T-1}_{x^{(i)}} - (Q^{T-1}_{ux^{(i)}})^{T} (Q^{T-1}_{uu})^{-1} Q^{T-1}_u\Big) \,, \;\text{and} \quad (36)$$
$$[V^{T-1}_{XX}]_{(B^{(i)}_{n_{T-1}},\,B^{(j)}_{n_{T-1}})} = \frac{1}{B}\Big(\delta_{ij}\, Q^{T-1}_{x^{(i)}x^{(j)}} - (Q^{T-1}_{ux^{(i)}})^{T} (Q^{T-1}_{uu})^{-1} Q^{T-1}_{ux^{(j)}}\Big) \,. \quad (37)$$

Neither $V^{T-1}_X$ nor $V^{T-1}_{XX}$ can be expressed by the composition of $V^{T-1}_{x^{(i)}}$ or $V^{T-1}_{xx^{(i)}}$. Furthermore, $V^{T-1}_{XX}$ is no longer a block-diagonal matrix.
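The following toy computation illustrates this loss of structure: assembling $V^{T-1}_{XX}$ block-wise from Eq. (37) with randomly chosen per-sample blocks produces nonzero off-diagonal (cross-sample) blocks, even though every $Q$ quantity at $T-1$ is itself block-diagonal over samples. All sizes and values below are assumptions for illustration only.

# Numerical illustration of Eq. (36)-(37): the correction (Q_ux^(i))^T (Q_uu)^{-1} Q_ux^(j)
# couples samples i != j, so V^{T-1}_{XX} loses its block-diagonal structure.
import numpy as np

rng = np.random.default_rng(3)
B, n, m = 3, 2, 4                                   # batch size, per-sample state dim, control dim

Q_ux = [rng.standard_normal((m, n)) for _ in range(B)]   # per-sample blocks Q_{ux^(i)}
Q_xx = [np.eye(n) for _ in range(B)]                     # per-sample blocks Q_{x^(i)x^(i)}
A = rng.standard_normal((m, m))
Q_uu = A @ A.T + np.eye(m)                               # shared curvature Q_uu (positive definite)
Q_uu_inv = np.linalg.inv(Q_uu)

# Assemble V^{T-1}_{XX} block-wise according to Eq. (37).
V_XX = np.zeros((B * n, B * n))
for i in range(B):
    for j in range(B):
        block = -(Q_ux[i].T @ Q_uu_inv @ Q_ux[j]) / B
        if i == j:
            block += Q_xx[i] / B
        V_XX[i * n:(i + 1) * n, j * n:(j + 1) * n] = block

off_block = V_XX[0:n, n:2 * n]                      # block coupling samples 1 and 2
print(np.abs(off_block).max() > 0.0)                # True: V^{T-1}_{XX} is not block-diagonal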

C. Other Derivations from the Main Text

C.1. Proof of Proposition 5

First, observe the lemma below, which connects the backward passes of the two frameworks in the degenerate case.

Lemma 7. Assume $Q^t_{ux} = 0$ at all stages; then we have
$$V^t_x = H^t_x = p_t = \nabla_{x_t} J \,, \;\text{ and }\; V^t_{xx} = \nabla_{x_t x_t} J \,, \quad \forall t \,. \quad (38)$$

Proof. It is obvious to see that Eq. (38) holds at $t = T$. Now, assume the relation holds at $t+1$ and observe that at time $t$, the backward passes take the form of
$$
\begin{aligned}
p_t &= H^t_x = \ell^t_x + f^{tT}_x p_{t+1} = \nabla_{x_t} J \,, \\
V^t_x &= Q^t_x - Q^{tT}_{ux}(Q^t_{uu})^{-1} Q^t_u = \ell^t_x + f^{tT}_x V^{t+1}_x \,, \\
V^t_{xx} &= Q^t_{xx} - Q^{tT}_{ux}(Q^t_{uu})^{-1} Q^t_{ux} = \nabla_{x_t}\{\ell^t_x + f^{tT}_x V^{t+1}_x\} = \nabla_{x_t x_t} J \,,
\end{aligned}
$$
which concludes the proof.

Now, Eq. (12) follows by substituting Eq. (38) into the definitions of $Q^t_u$ and $Q^t_{uu}$:
$$
\begin{aligned}
Q^t_u &= \ell^t_u + f^{tT}_u V^{t+1}_x = \ell^t_u + f^{tT}_u \nabla_{x_{t+1}} J = \nabla_{u_t} J \,, \\
Q^t_{uu} &= \ell^t_{uu} + f^{tT}_u V^{t+1}_{xx} f^t_u + V^{t+1}_x \cdot f^t_{uu} \\
&= \ell^t_{uu} + f^{tT}_u (\nabla_{x_{t'}x_{t'}} J) f^t_u + \nabla_{x_{t'}} J \cdot f^t_{uu} \\
&= \nabla_{u_t}\{\ell^t_u + f^{tT}_u \nabla_{x_{t'}} J\} = \nabla_{u_t u_t} J \,,
\end{aligned}
$$
where we denote $t' = t+1$ in the last two equations for simplicity. Consequently, the policy degenerates to a layer-wise Newton update.
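A toy numerical check of this degenerate case is given below: with $Q^t_{ux} = 0$, the open-loop gain $k_t = -(Q^t_{uu})^{-1}Q^t_u$ is exactly a Newton step on $J$ with respect to $u_t$, so for the assumed quadratic objective it lands on the minimizer in one update.

# Degenerate DDP policy = layer-wise Newton step (quadratic toy objective, assumed).
import numpy as np

rng = np.random.default_rng(4)
m = 5
A = rng.standard_normal((m, m)); H = A @ A.T + np.eye(m)    # grad_uu J (positive definite)
u_star = rng.standard_normal(m)                             # minimizer of the quadratic J
u = np.zeros(m)                                             # current weights u_t

grad = H @ (u - u_star)                                     # Q_u = grad_u J for this quadratic
hess = H                                                    # Q_uu = grad_uu J
k = -np.linalg.solve(hess, grad)                            # open-loop DDP gain = Newton step
print(np.allclose(u + k, u_star))                           # True: exact minimizer recovered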

C.2. Derivation of Eq. (16)

Eq. (16) follows by a simple observation that the feedback term $K_t\delta x_t \triangleq -(Q^t_{uu})^{-1} Q^t_{ux}\delta x_t$ stands as the minimizer of the following objective up to first-order approximation:
$$K_t\delta x_t = \arg\min_{\delta u_t \in \Gamma(\delta x_t)} \big\|\nabla_{u_t} Q(x_t + \delta x_t, u_t + \delta u_t) - \nabla_{u_t} Q(x_t, u_t)\big\| \,, \quad (39)$$
with $\|\cdot\|$ being any proper norm in the Euclidean space. Eq. (39) can be shown by Taylor expanding $Q(x_t + \delta x_t, u_t + \delta u_t)$ up to its first order,
$$\nabla_{u_t} Q(x_t + \delta x_t, u_t + \delta u_t) = \nabla_{u_t} Q(x_t, u_t) + Q^t_{ux}\delta x_t + Q^t_{uu}\delta u_t \,.$$


When $Q = J$, we will arrive at Eq. (16). While this is generally not true for nontrivial $Q^t_{xu}$ (recall Proposition 5), we shall interpret the Bellman objective $Q$ as the resulting objective when the feedback policies are applied to all subsequent stages, as indeed implied by the Bellman equation, Eq. (4).
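The interpretation of Eq. (39) can be checked numerically on an assumed quadratic Bellman objective: perturbing the state by $\delta x_t$ shifts $\nabla_{u_t} Q$, and applying the feedback correction $K_t\delta x_t$ removes that shift exactly (for a quadratic $Q$), as the sketch below shows.

# Numerical check of Eq. (39) on an assumed quadratic Q(x, u).
import numpy as np

rng = np.random.default_rng(5)
n, m = 3, 2
Q_xx = np.eye(n)
A = rng.standard_normal((m, m)); Q_uu = A @ A.T + np.eye(m)
Q_ux = rng.standard_normal((m, n))

def grad_u_Q(x, u):
    # grad_u Q for the quadratic Q(x, u) = 0.5 x^T Q_xx x + 0.5 u^T Q_uu u + u^T Q_ux x
    return Q_uu @ u + Q_ux @ x

x, u = rng.standard_normal(n), rng.standard_normal(m)
dx = 0.1 * rng.standard_normal(n)

du_feedback = -np.linalg.solve(Q_uu, Q_ux @ dx)          # K_t dx_t
mismatch_without = np.linalg.norm(grad_u_Q(x + dx, u) - grad_u_Q(x, u))
mismatch_with = np.linalg.norm(grad_u_Q(x + dx, u + du_feedback) - grad_u_Q(x, u))
print(mismatch_without, mismatch_with)                   # the second is (numerically) zero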

D. Experiment Details

D.1. Setup

The networks in the classification problems are composed of 5-6 fully-connected layers. For the intermediate layers, we use ReLU activation on (F)MNIST and Tanh on the other three datasets. We use an identity mapping at the last prediction layer for all datasets except WINE, where we use a sigmoid instead to help distinguish the performance among optimizers. The dimension of the hidden state is set to 10 for small datasets, such as WINE and IRIS, and 20-32 for the other, larger datasets. The batch size is set to 8 for DIGITS, 16 for (F)MNIST, and 32 for WINE and IRIS. Results reported in all plots and figures are averaged over multiple random seeds (3 seeds for (F)MNIST, and 6-10 for the rest). To accelerate training, we only utilize 50% of the samples in MNIST and FMNIST. Regarding the machine information, we conduct our experiments on GTX 1080 TI, RTX TITAN, and four Tesla V100 SXM2 16GB GPUs. For the baseline optimizers, we use the implementation in https://github.com/Thrandis/EKFAC-pytorch for KFAC and EKFAC, and https://github.com/MichaelArbel/KWNG for KWNG.

D.2. Additional Experiments

[Figure 7: panels of training loss and test accuracy versus iterations on DIGITS and MNIST, comparing RMSprop vs. DDP-RMSprop and EKFAC vs. DDP-EKFAC.]

Figure 7. Optimization dynamics of both training loss (upper row) and testing accuracy (bottom row) for the experiment instances in Table 4 with large variance reduction. In all cases, optimizers integrated with DDP also admit slightly faster convergence.

Ablation Analysis on Other Datasets. In Fig. 8, we provide additional results for the ablation analysis on the other datasets. Similar to the discussion of Fig. 5, DDP effectively suppresses the potentially unstable training caused by changes of the hyper-parameters. On the IRIS dataset, notably, optimizers embedded with DDP often reach a much lower training loss in the end.

Remarks on Hyper-parameter Tuning. Recall that, in order to stabilize training, we add Tikhonov regularization to both the $Q_{uu}$ and $V_{xx}$ matrices. In practice, we find that the regularization imposed on $Q_{uu}$ generally has a greater effect on the overall training dynamics; the performance can degrade when $Q_{uu}$ is regularized either insufficiently or too much. On the other hand, the effect of $V_{xx}$ regularization on optimization is typically optimizer-dependent. Empirically, we find that regularizing $V_{xx}$ properly plays a key role in improving the performance of DDP-RMSprop, while the other two optimizers are less sensitive to this hyper-parameter.


[Figure 8: panels of training loss versus training iterations on IRIS and MNIST, comparing SGD vs. DDP-SGD (learning rates 1e-1, 5e-1), EKFAC vs. DDP-EKFAC (learning rates 2e-1, 4.5e-1, 2e-2, 1.5e-1), and RMSprop vs. DDP-RMSprop (learning rates 1e-3, 5e-4).]

Figure 8. Ablation analysis on other datasets. As in Fig. 5, the first and third columns report the training dynamics on the best-tuned configurations, while the second and last columns correspond to situations where the learning rate gets perturbed.