
Stochastic optimal control theory

ICML, Helsinki 2008 tutorial∗

H.J. Kappen,

Radboud University, Nijmegen, the Netherlands

July 4, 2008

Abstract

Control theory is a mathematical description of how to act optimally to gain future rewards. In this paper I give an introduction to deterministic and stochastic control theory; partial observability, learning and the combined problem of inference and control. Subsequently, I discuss a class of non-linear stochastic control problems that can be efficiently solved using a path integral. In this control formalism the central concept of cost-to-go becomes a free energy and methods and concepts from probabilistic graphical models and statistical physics can be readily applied. I illustrate the theory with a number of examples.

∗Jointly with Marc Toussaint. TU Berlin


Contents

1 Introduction
2 Discrete time control
3 Continuous time control
   3.1 The HJB equation
   3.2 Example: Mass on a spring
   3.3 Pontryagin minimum principle
   3.4 Again mass on a spring
   3.5 Comments
4 Stochastic optimal control
   4.1 Stochastic differential equations
   4.2 Stochastic optimal control theory
   4.3 Linear quadratic control
   4.4 Example of LQ control
5 Learning
   5.1 Reinforcement learning
   5.2 Inference and control
   5.3 Certainty equivalence
6 Path integral control
   6.1 Introduction
   6.2 Path integral control
   6.3 The diffusion process
   6.4 The path integral
   6.5 Double slit
      6.5.1 MC sampling
      6.5.2 Importance sampling
   6.6 The delayed choice
   6.7 n joint arm
      6.7.1 The variational approximation
   6.8 Other issues


1 Introduction

Optimizing a sequence of actions to attain some future goal is the general topic of control theory [1, 2, 15]. It views an agent as an automaton that seeks to maximize expected reward (or minimize cost) over some future time period. Two typical examples that illustrate this are motor control and foraging for food. As an example of a motor control task, consider a human throwing a spear to kill an animal. Throwing a spear requires the execution of a motor program such that, at the moment the spear leaves the hand, it has the correct speed and direction to hit the desired target. A motor program is a sequence of actions, and this sequence can be assigned a cost that generally consists of two terms: a path cost, which specifies the energy consumption to contract the muscles in order to execute the motor program; and an end cost, which specifies whether the spear will kill the animal, just hurt it, or miss it altogether. The optimal control solution is a sequence of motor commands that results in killing the animal by throwing the spear with minimal physical effort. If x denotes the state space (the positions and velocities of the muscles), the optimal control solution is a function u(x, t) that depends both on the actual state of the system at each time and explicitly on time.

When an animal forages for food, it explores the environment with the objective to find as much food as possible in a short time window. At each time t, the animal considers the food it expects to encounter in the period [t, t + T]. Unlike the motor control example, the time horizon recedes into the future with the current time and the cost now consists only of a path contribution and no end cost. Therefore, at each time the animal faces the same task, but possibly from a different location in the environment. The optimal control solution u(x) is now time-independent and specifies for each location in the environment x the direction u in which the animal should move.

The general stochastic control problem is intractable to solve and requires an exponential amount of memory and computation time. The reason is that the state space needs to be discretized and thus becomes exponentially large in the number of dimensions. Computing the expectation values means that all states need to be visited and requires the summation of exponentially large sums. The same intractabilities are encountered in reinforcement learning.

In this tutorial, we aim to give a pedagogical introduction to control theory. For simplicity, we will first consider in section 2 the case of discrete time and discuss the dynamic programming solution. This gives us the basic intuition about the Bellman equations in continuous time that are considered later on. In section 3 we consider continuous time control problems. In this case, the optimal control problem can be solved in two ways: using the Hamilton-Jacobi-Bellman (HJB) equation, which is a partial differential equation [6] and is the continuous time analogue of the dynamic programming method, or using the Pontryagin Minimum Principle (PMP) [7], which is a variational argument and results in a pair of ordinary differential equations. We illustrate the methods with the example of a mass on a spring.

In section 4 we generalize the control formulation to stochastic dynamics.


In the presence of noise, the PMP formalism has no obvious generalization (see however [8]). In contrast, the inclusion of noise in the HJB framework is mathematically quite straightforward. However, the numerical solution of either the deterministic or stochastic HJB equation is in general difficult due to the curse of dimensionality.

There are some stochastic control problems that can be solved efficiently. When the system dynamics is linear and the cost is quadratic (LQ control), the solution is given in terms of a number of coupled ordinary differential (Riccati) equations that can be solved efficiently [1]. LQ control is useful to maintain a system, such as for instance a chemical plant, around a desired point in state space and is therefore widely applied in engineering. However, it is a linear theory and too restricted to model the complexities of intelligent behavior encountered in agents or robots.

The standard control theory formulation assumes that all model components (the dynamics, the environment, the costs) are known and that the state is fully observed. Often, this is not the case. Formulated in a Bayesian way, one may only know a probability distribution over the current state, or over the parameters of the dynamics or the costs. This leads us to the problem of partial observability, or the problem of joint inference and control. We emphasize the difference for learning in infinite horizon tasks (such as reinforcement learning) and finite horizon tasks. Whereas in the former case the problems of learning and inference decouple, they do not in the latter case. We illustrate the complexity of joint inference and control with a simple example. We also discuss the concept of certainty equivalence, which states that for certain linear quadratic control problems the inference and control problems disentangle and can be solved separately. We will discuss these issues in section 5.

Recently, we have discovered a class of continuous non-linear stochastic control problems that can be solved more efficiently than the general case [3, 4]. These are control problems with a finite time horizon, where the control acts linearly and additively on the dynamics and the cost of the control is quadratic. Otherwise, the path cost, the end cost and the intrinsic dynamics of the system are arbitrary. These control problems can have both time-dependent and time-independent solutions of the type that we encountered in the examples above. We introduce the path integral control method in section 6.

For this class of problems, the control computation essentially reduces to the computation of a path integral, which can be interpreted as a free energy. The path integral is closely related to non-linear Kalman filters or hidden Markov models. One can therefore consider various well-known methods from the machine learning community to approximate this path integral, such as the Laplace approximation, Monte Carlo sampling [4], variational approximations or belief propagation [5]. In section 6.7 we show an example of an n joint arm for which we compute the optimal control using the variational approximation for large n.

Non-linear stochastic control problems display features not shared by deterministic control problems nor by linear stochastic control. In deterministic control, only one globally optimal solution exists. In stochastic control, the optimal solution can be viewed as a weighted mixture of suboptimal solutions.


The weighting depends in a non-trivial way on the features of the problem, such as the noise level, the horizon time and the cost of the local optima. This multi-modality leads to surprising behavior in stochastic optimal control. For instance, the optimal control can be qualitatively very different for high and low noise levels [3]. In section 6.6 we illustrate this effect for a simple two target task and show that the decision to choose one of the targets is governed by a spontaneous symmetry breaking.

2 Discrete time control

We start by discussing the simplest control case, the finite horizon discrete time deterministic control problem. In this case the optimal control explicitly depends on time. See also [9] for further discussion.

Consider the control of a discrete time dynamical system:

x_{t+1} = x_t + f(t, x_t, u_t), \qquad t = 0, 1, \ldots, T-1 \qquad (1)

x_t is an n-dimensional vector describing the state of the system and u_t is an m-dimensional vector that specifies the control or action at time t. Note that Eq. 1 describes a noiseless dynamics. If we specify x at t = 0 as x_0 and we specify a sequence of controls u_{0:T-1} = u_0, u_1, \ldots, u_{T-1}, we can compute future states of the system x_{1:T} recursively from Eq. 1.

Define a cost function that assigns a cost to each sequence of controls:

C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t) \qquad (2)

R(t, x, u) is the cost that is associated with taking action u at time t in state x, and \phi(x_T) is the cost associated with ending up in state x_T at time T. The problem of optimal control is to find the sequence u_{0:T-1} that minimizes C(x_0, u_{0:T-1}).

The problem has a standard solution, which is known as dynamic programming. Introduce the optimal cost-to-go:

J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right) \qquad (3)

which solves the optimal control problem from an intermediate time t until the fixed end time T, starting at an arbitrary location x_t. The minimum of Eq. 2 is given by J(0, x_0).

One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(T, x) = \phi(x)

J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)

= \min_{u_t} \left( R(t, x_t, u_t) + \min_{u_{t+1:T-1}} \left[ \phi(x_T) + \sum_{s=t+1}^{T-1} R(s, x_s, u_s) \right] \right)

= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_{t+1}) \right)

= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right) \qquad (4)

Note that the minimization over the whole path u_{0:T-1} has reduced to a sequence of minimizations over u_t. This simplification is due to the Markovian nature of the problem: the future depends on the past and vice versa only through the present. Also note that in the last line the minimization is done for each x_t separately and also explicitly depends on time.

The algorithm to compute the optimal control u^*_{0:T-1}, the optimal trajectory x^*_{1:T} and the optimal cost is given by

1. Initialization: J(T, x) = \phi(x)

2. Backwards: For t = T-1, \ldots, 0 and for all x compute

   u^*_t(x) = \arg\min_u \{ R(t, x, u) + J(t+1, x + f(t, x, u)) \}
   J(t, x) = R(t, x, u^*_t(x)) + J(t+1, x + f(t, x, u^*_t(x)))

3. Forwards: For t = 0, \ldots, T-1 compute

   x^*_{t+1} = x^*_t + f(t, x^*_t, u^*_t(x^*_t)), \qquad x^*_0 = x_0

The execution of the dynamic programming algorithm is linear in the horizon time T and linear in the size of the state and action spaces.
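As an illustration of steps 1-3 above, here is a minimal dynamic programming sketch on a discretized one-dimensional toy problem; the grid, the dynamics f, and the costs are illustrative assumptions and not taken from the text.

```python
import numpy as np

# Toy setup (assumed): 1-d state and control grids, dynamics f(t,x,u) = u,
# path cost R = 0.1*u^2, end cost phi = x^2.
T = 20
xs = np.linspace(-2.0, 2.0, 81)          # discretized state space
us = np.linspace(-0.5, 0.5, 21)          # discretized control space
f = lambda t, x, u: u
R = lambda t, x, u: 0.1 * u**2
phi = lambda x: x**2

def nearest(x):
    """Index of the grid point closest to x (crude state discretization)."""
    return np.abs(xs - x).argmin()

# 1. Initialization
J = [None] * (T + 1)
ustar = [None] * T
J[T] = phi(xs)

# 2. Backwards pass: J(t,x) = min_u [ R(t,x,u) + J(t+1, x + f(t,x,u)) ]
for t in range(T - 1, -1, -1):
    Jt, ut = np.empty_like(xs), np.empty_like(xs)
    for i, x in enumerate(xs):
        costs = [R(t, x, u) + J[t + 1][nearest(x + f(t, x, u))] for u in us]
        k = int(np.argmin(costs))
        Jt[i], ut[i] = costs[k], us[k]
    J[t], ustar[t] = Jt, ut

# 3. Forwards pass: roll out the optimal trajectory from x0
x = 1.5
for t in range(T):
    x = x + f(t, x, ustar[t][nearest(x)])
print("final state:", x, "optimal cost-to-go at start:", J[0][nearest(1.5)])
```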

3 Continuous time control

In the absence of noise, the optimal control problem in continuous time can be solved in two ways: using the Pontryagin Minimum Principle (PMP) [7], which is a pair of ordinary differential equations, or the Hamilton-Jacobi-Bellman (HJB) equation, which is a partial differential equation [6]. The latter is very similar to the dynamic programming approach that we have treated above. The HJB approach also allows for a straightforward extension to the noisy case. We will first treat the HJB description and subsequently the PMP description. For further reading see [1, 10].

3.1 The HJB equation

Consider the dynamical system Eq. 1 where we take the time increments to zero, i.e. we replace t+1 by t+dt with dt \to 0:

x_{t+dt} = x_t + f(x_t, u_t, t)\,dt \qquad (5)


In the continuous limit we will write x_t = x(t). The initial state is fixed, x(0) = x_0, and the final state is free. The problem is to find a control signal u(t), 0 < t < T, which we denote as u(0 \to T), such that

C(x_0, u(0 \to T)) = \phi(x_T) + \int_0^T dt\, R(x(t), u(t), t) \qquad (6)

is minimal. C consists of an end cost \phi(x) that gives the cost of ending in a configuration x, and a path cost that is an integral over time that depends on the trajectories x(0 \to T) and u(0 \to T).

Eq. 4 becomes

J(t, x) = \min_u \left( R(t, x, u)dt + J(t+dt, x + f(x, u, t)dt) \right)
\approx \min_u \left( R(t, x, u)dt + J(t, x) + \partial_t J(t, x)dt + \partial_x J(t, x) f(x, u, t)dt \right)

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\partial_x J(t, x) \right) \qquad (7)

where in the second line we have used the Taylor expansion of J(t+dt, x+dx) around x, t to first order in dt and dx together with Eq. 5, and in the third line have taken the limit dt \to 0. Eq. 7 is a partial differential equation, known as the Hamilton-Jacobi-Bellman (HJB) equation, that describes the evolution of J as a function of x and t and must be solved with boundary condition J(x, T) = \phi(x). \partial_t and \partial_x denote partial derivatives with respect to t and x, respectively.

The optimal control at the current x, t is given by

u(x, t) = \arg\min_u \left( R(x, u, t) + \partial_x J(t, x) f(x, u, t) \right) \qquad (8)

Note that in order to compute the optimal control at the current state x(0) at t = 0, one must compute J(x, t) for all values of x and t.

3.2 Example: Mass on a spring

To illustrate the optimal control principle consider a mass on a spring. The spring is at rest at z = 0 and exerts a force proportional to F = -z towards the rest position. Using Newton's law F = ma with a = \ddot{z} the acceleration¹ and m = 1 the mass of the spring, the equation of motion is given by

\ddot{z} = -z + u

with u an unspecified control signal with -1 < u < 1. We want to solve the control problem: given initial position and velocity z_0 and \dot{z}_0 at time t = 0, find the control path u(0 \to T) such that z(T) is maximal.

Introduce x_1 = z, x_2 = \dot{z}; then

\dot{x} = Ax + Bu, \qquad A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 \\ 1 \end{pmatrix}

¹ We use the notation \dot{z} = dz/dt and \ddot{z} = d^2 z/dt^2.


and x = (x_1, x_2)^T. The problem is of the type Eqs. 5 and 6, with \phi(x) = C^T x, C^T = (-1, 0), R(x, u, t) = 0 and f(x, u, t) = Ax + Bu. Eq. 7 takes the form

-J_t = J_x^T A x - |J_x^T B|

where we have performed the minimization with respect to u, which gives u = -\mathrm{sign}(J_x^T B). We try J(t, x) = \psi(t)^T x + \alpha(t). The HJB equation reduces to two ordinary differential equations

\dot{\psi} = -A^T \psi

\dot{\alpha} = |\psi^T B|

These equations must be solved for all t, with final boundary conditions \psi(T) = C and \alpha(T) = 0. Note that the optimal control u only requires J_x(x, t), which in this case is \psi(t), and thus we do not need to solve for \alpha. The solution for \psi is

\psi_1(t) = -\cos(t - T), \qquad \psi_2(t) = \sin(t - T)

for 0 < t < T. The optimal control is

u(x, t) = -\mathrm{sign}(\psi_2(t)) = -\mathrm{sign}(\sin(t - T))

As an example consider x_1(0) = x_2(0) = 0, T = 2\pi. Then the optimal control is

u = -1, \quad 0 < t < \pi; \qquad u = 1, \quad \pi < t < 2\pi

The optimal trajectories are, for 0 < t < \pi,

x_1(t) = \cos(t) - 1, \qquad x_2(t) = -\sin(t)

and for \pi < t < 2\pi

x_1(t) = 3\cos(t) + 1, \qquad x_2(t) = -3\sin(t)

The solution is drawn in fig. 1. We see that in order to excite the spring to its maximal height at T, the optimal control is to first push the spring down for 0 < t < \pi and then to push the spring up between \pi < t < 2\pi, taking maximal advantage of the intrinsic dynamics of the spring.

Note that since there is no cost associated with the control u and u is hard-limited between -1 and 1, the optimal control is always either -1 or 1. This is known as bang-bang control.
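As a quick numerical check, here is a minimal sketch (assuming a simple Euler discretization) that integrates the controlled spring under the bang-bang law derived above and recovers the maximal amplitude x_1(2\pi) = 4:

```python
import numpy as np

# Simulate the spring  x1' = x2, x2' = -x1 + u  under the bang-bang
# control u(t) = -sign(sin(t - T)) obtained from the HJB solution.
T, dt = 2 * np.pi, 1e-4
x1, x2 = 0.0, 0.0
for t in np.arange(0.0, T, dt):
    u = -np.sign(np.sin(t - T))
    x1, x2 = x1 + x2 * dt, x2 + (-x1 + u) * dt
print(x1)   # close to the analytical value x1(2*pi) = 3*cos(2*pi) + 1 = 4
```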

3.3 Pontryagin minimum principle

[Figure 1: Optimal control of the mass on a spring such that at t = 2\pi the amplitude is maximal. x_1 is the position and x_2 the velocity of the spring.]

In the last section, we solved the optimal control problem as a partial differential equation, with a boundary condition at the end time. The numerical solution requires a discretization of space and time and is computationally expensive. The solution is an optimal cost-to-go function J(x, t) for all x and t. From this we compute the optimal control sequence Eq. 8 and the optimal trajectory.

An alternative to the HJB approach is a variational approach that directly finds the optimal trajectory and optimal control and bypasses the expensive computation of the cost-to-go. This approach is known as the Pontryagin Minimum Principle. We can write the optimal control problem as a constrained optimization problem with independent variables u(0 \to T) and x(0 \to T). We wish to minimize

\min_{u(0 \to T),\, x(0 \to T)} \; \phi(x(T)) + \int_0^T dt\, R(x(t), u(t), t)

subject to the constraint that u(0 \to T) and x(0 \to T) are compatible with the dynamics

\dot{x} = f(x, u, t) \qquad (9)

and the boundary condition x(0) = x_0 (here \dot{x} = dx/dt). We can solve the constrained optimization problem by introducing the Lagrange multiplier function \lambda(t) that ensures the constraint Eq. 9 for all t:

C = \phi(x(T)) + \int_0^T dt \left[ R(t, x(t), u(t)) - \lambda(t)\left(f(t, x(t), u(t)) - \dot{x}(t)\right) \right]

= \phi(x(T)) + \int_0^T dt \left[ -H(t, x(t), u(t), \lambda(t)) + \lambda(t)\dot{x}(t) \right]

-H(t, x, u, \lambda) = R(t, x, u) - \lambda f(t, x, u) \qquad (10)

The solution is found by extremizing C. We can find the extremal solution if we compute the change \delta C in C that results from independent variation of x(t),


u(t) and \lambda(t) at all times t between 0 and T. We denote these variations by \delta x(t), \delta u(t) and \delta\lambda(t), respectively. We get:

\delta C = \phi_x(x(T))\delta x(T) + \int_0^T dt \left[ -H_x \delta x(t) - H_u \delta u(t) + (-H_\lambda + \dot{x}(t))\delta\lambda(t) + \lambda(t)\delta\dot{x}(t) \right]

= \left(\phi_x(x(T)) + \lambda(T)\right)\delta x(T) + \int_0^T dt \left[ (-H_x - \dot{\lambda}(t))\delta x(t) - H_u \delta u(t) + (-H_\lambda + \dot{x}(t))\delta\lambda(t) \right]

where the subscripts x, u, \lambda denote partial derivatives; for instance, H_x = \partial H(t, x(t), u(t), \lambda(t)) / \partial x(t). In the second line above we have used partial integration:

\int_0^T dt\, \lambda(t)\delta\dot{x}(t) = \int_0^T dt\, \lambda(t)\frac{d}{dt}\delta x(t) = -\int_0^T dt\, \dot{\lambda}(t)\delta x(t) + \lambda(T)\delta x(T) - \lambda(0)\delta x(0)

and \delta x(0) = 0, because x(0) is given. The stationary solution (\delta C = 0) is obtained when the coefficients of the independent variations \delta x(t), \delta u(t), \delta\lambda(t) and \delta x(T) are zero. Thus,

\dot{\lambda} = -H_x(t, x(t), u(t), \lambda(t))
0 = H_u(t, x(t), u(t), \lambda(t)) \qquad (11)
\dot{x} = H_\lambda(t, x, u, \lambda)
\lambda(T) = -\phi_x(x(T))

We can solve Eq. 11 for u and denote the solution as u^*(t, x, \lambda). This solution is unique if H is convex in u. The remaining equations are

\dot{x} = H^*_\lambda(t, x, \lambda)
\dot{\lambda} = -H^*_x(t, x, \lambda) \qquad (12)

where we have defined H^*(t, x, \lambda) = H(t, x, u^*(t, x, \lambda), \lambda) and the boundary conditions

x(0) = x_0, \qquad \lambda(T) = -\phi_x(x(T)) \qquad (13)

Eqs. 12 are two coupled ordinary differential equations that describe the dynamics of x and \lambda over time, with boundary conditions Eqs. 13 for x at the initial time and for \lambda at the final time. Compared to the HJB equation, the complexity of solving these equations is low since only time discretization and no space discretization is required. However, due to the mixed boundary conditions, finding a solution that satisfies these equations is not trivial and requires sophisticated numerical methods. The most common method for solving the PMP equations is called (multiple) shooting [11].

Eqs. 12 are known as the Hamilton equations of motion that arise in classical mechanics, but there with initial conditions for both x and \lambda [12]. In fact, one can view control theory as a generalization of classical mechanics (see [10] for further explanation).

In classical mechanics H is called the Hamiltonian. Consider the time evolution of H:²

\dot{H} = H_t + H_u\dot{u} + H_x\dot{x} + H_\lambda\dot{\lambda} = H_t \qquad (14)

where we have used the dynamical equations Eqs. 12 and Eq. 11. In particular, when f and R in Eq. 10 do not explicitly depend on time, neither does H, and H_t = 0. In this case we see that H is a constant of motion: the control problem finds a solution such that H(t = 0) = H(t = T).

3.4 Again mass on a spring

We consider again the example of the mass on a spring that we introduced in section 3.2, where we had

f(x, u) = Ax + Bu, \qquad R(x, u, t) = 0, \qquad \phi(x) = -x_1

The Hamiltonian Eq. 10 becomes

H(x, u, \lambda) = \lambda_1 x_2 + \lambda_2(-x_1 + u)

Using Eq. 11 we obtain u^* = -\mathrm{sign}(\lambda_2) and

H^*(x, \lambda) = \lambda_1 x_2 - \lambda_2 x_1 - |\lambda_2|

The Hamilton equations become

\dot{x} = \frac{\partial H^*}{\partial \lambda} \;\Rightarrow\; \dot{x}_1 = x_2, \quad \dot{x}_2 = -x_1 - \mathrm{sign}(\lambda_2)

\dot{\lambda} = -\frac{\partial H^*}{\partial x} \;\Rightarrow\; \dot{\lambda}_1 = -\lambda_2, \quad \dot{\lambda}_2 = \lambda_1

with x(t = 0) = (x_0, \dot{x}_0) and \lambda(t = T) = (1, 0).

In this simple example, the dynamical equation for \lambda does not depend on x and thus we can simply solve it from t = T backwards to t = 0. We can then insert the solution in the dynamical equation for x and solve from t = 0 to t = T. In general, the two equations are interdependent and require a more advanced approach such as a shooting method.
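A minimal sketch of this backward-then-forward procedure for the spring (Euler discretization; the initial state x(0) = (0, 0) from the example in section 3.2 is assumed):

```python
import numpy as np

# Backward-then-forward integration of the PMP equations for the spring.
T, n = 2 * np.pi, 10000
dt = T / n

# Backward pass for the costate: lam1' = -lam2, lam2' = lam1, lam(T) = (1, 0)
lam = np.zeros((n + 1, 2))
lam[n] = [1.0, 0.0]
for k in range(n, 0, -1):
    l1, l2 = lam[k]
    lam[k - 1] = [l1 + dt * l2, l2 - dt * l1]   # Euler step backwards in time

# Forward pass for the state: x1' = x2, x2' = -x1 + u, with u = -sign(lam2)
x = np.array([0.0, 0.0])
for k in range(n):
    u = -np.sign(lam[k, 1])
    x = x + dt * np.array([x[1], -x[0] + u])
print(x[0])   # approaches the maximum x1(T) = 4 found with the HJB solution
```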

3.5 Comments

The HJB method gives a sufficient (and often necessary) condition for optimality. The solution of the PDE is expensive. The PMP method provides a necessary condition for optimal control. This means that it provides candidate solutions for optimality.

² \dot{H} is the total derivative and H_t is the partial derivative with respect to time.


The PMP method is computationally less complicated than the HJB method because it does not require discretization of the state space. The PMP method can be used when dynamic programming fails due to lack of smoothness of the optimal cost-to-go function.

The subject of optimal control theory in continuous space and time has been well studied in the mathematical literature and contains many complications related to the existence, uniqueness and smoothness of the solution, particularly in the absence of noise. See [10] for a clear discussion and further references. In the presence of noise, and in particular in the path integral framework that we will discuss below, it seems that many of these intricacies disappear.

4 Stochastic optimal control

In this section, we consider the extension of the continuous control problem to the case that the dynamics is subject to noise and is given by a stochastic differential equation. First, we give a very brief introduction to stochastic differential equations.

4.1 Stochastic differential equations

Consider the random walk on the line:

x_{t+1} = x_t + \xi_t, \qquad \xi_t = \pm\sqrt{\nu}

with x_0 = 0. The increments \xi_t are iid random variables with mean zero, \langle \xi_t^2 \rangle = \nu, where \nu is a constant. We can write the solution for x_t in closed form as

x_t = \sum_{i=1}^{t} \xi_i

Since x_t is a sum of random variables, x_t becomes Gaussian in the limit of large t. We can compute the evolution of the mean and covariance:

\langle x_t \rangle = \sum_{i=1}^{t} \langle \xi_i \rangle = 0

\langle x_t^2 \rangle = \sum_{i,j=1}^{t} \langle \xi_i \xi_j \rangle = \sum_{i=1}^{t} \langle \xi_i^2 \rangle + \sum_{i,j=1,\, j \neq i}^{t} \langle \xi_i \rangle \langle \xi_j \rangle = \nu t

Note that the fluctuations \sigma_t = \sqrt{\langle x_t^2 \rangle} = \sqrt{\nu t} increase with the square root of t. This is a characteristic property of a diffusion process, such as for instance the diffusion of ink in water or warm air in a room.

In the continuous time limit we define

dx_t = x_{t+dt} - x_t = d\xi \qquad (15)


with d\xi an infinitesimal mean zero Gaussian variable. In order to get the right scaling with dt in the interval from t to t + dt we must choose \langle d\xi^2 \rangle = \nu dt.³

Then in the limit of dt \to 0 we obtain

\frac{d}{dt}\langle x \rangle = \lim_{dt \to 0} \frac{\langle x_{t+dt} - x_t \rangle}{dt} = \lim_{dt \to 0} \frac{\langle d\xi \rangle}{dt} = 0 \;\Rightarrow\; \langle x \rangle(t) = 0

\frac{d}{dt}\langle x^2 \rangle = \nu \;\Rightarrow\; \langle x^2 \rangle(t) = \nu t

The conditional probability distribution of x at time t given x_0 at time 0 is Gaussian and specified by its mean and variance. Thus

\rho(x, t | x_0, 0) = \frac{1}{\sqrt{2\pi\nu t}} \exp\left( -\frac{(x - x_0)^2}{2\nu t} \right)

The process Eq. 15 is called a Wiener process.
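A short simulation sketch of this diffusion scaling; the number of walkers and the value of \nu are illustrative assumptions:

```python
import numpy as np

# Simulate many independent random walks x_{t+1} = x_t + xi_t with
# xi_t = +-sqrt(nu), and check that the variance grows as nu * t.
rng = np.random.default_rng(0)
nu, t_max, n_walkers = 0.5, 500, 10000
xi = np.sqrt(nu) * rng.choice([-1.0, 1.0], size=(n_walkers, t_max))
x = np.cumsum(xi, axis=1)                 # x_t for every walker
print(x[:, -1].mean())                    # ~ 0
print(x[:, -1].var(), nu * t_max)         # both close to nu * t_max
```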

4.2 Stochastic optimal control theory

Consider the stochastic differential equation which is a generalization of Eq. 5:

dx = f(x(t), u(t), t)dt + dξ. (16)

d\xi is a Wiener process with \langle d\xi_i d\xi_j \rangle = \nu_{ij}(t, x, u)\,dt, where \nu is a symmetric positive definite matrix.

Because the dynamics is stochastic, it is no longer the case that when x at time t and the full control path u(t \to T) are given, we know the future path x(t \to T). Therefore, we cannot minimize Eq. 6, but can only hope to be able to minimize its expectation value over all possible future realizations of the Wiener process:

C(x_0, u(0 \to T)) = \left\langle \phi(x(T)) + \int_0^T dt\, R(x(t), u(t), t) \right\rangle_{x_0} \qquad (17)

The subscript x_0 on the expectation value is to remind us that the expectation is over all stochastic trajectories that start in x_0.

The solution of the control problem proceeds very similarly to the deterministic case, with the only difference that we must add the expectation value over trajectories. Eq. 4 becomes

J(t, x_t) = \min_{u_t} \left( R(t, x_t, u_t) + \langle J(t+dt, x_{t+dt}) \rangle_{x_t} \right)

where the expectation value is over all x_{t+dt} at time t+dt conditioned on the state x_t at time t. We must again make a Taylor expansion of J in dt and

³ To understand this, imagine that we halve the discretization step. The choice \langle d\xi^2 \rangle = \nu dt ensures that Eq. 15 yields the same variance after two iterations as previously after one iteration.


dx. However, since \langle dx^2 \rangle is of order dt because of the Wiener process, we must expand up to order dx^2:

\langle J(t+dt, x_{t+dt}) \rangle = \int dx_{t+dt}\, N(x_{t+dt} | x_t, \nu dt)\, J(t+dt, x_{t+dt})
= J(t, x_t) + dt\, \partial_t J(t, x_t) + \langle dx \rangle \partial_x J(t, x_t) + \frac{1}{2}\langle dx^2 \rangle \partial_x^2 J(t, x_t)

\langle dx \rangle = f(x, u, t)\,dt, \qquad \langle dx^2 \rangle = \nu(t, x, u)\,dt

Thus, we obtain

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\partial_x J(x, t) + \frac{1}{2}\nu(t, x, u)\partial_x^2 J(x, t) \right) \qquad (18)

which is the stochastic Hamilton-Jacobi-Bellman equation with boundary condition J(x, T) = \phi(x). Eq. 18 reduces to the deterministic HJB equation Eq. 7 in the limit \nu \to 0.

4.3 Linear quadratic control

In the case that the dynamics is linear and the cost is quadratic, one can show that the optimal cost-to-go J is also a quadratic form and one can solve the stochastic HJB equation in terms of 'sufficient statistics' that describe J.

x is n-dimensional and u is p-dimensional. The dynamics is linear

dx = [A(t)x + B(t)u + b(t)]dt + \sum_{j=1}^{m} (C_j(t)x + D_j(t)u + \sigma_j(t))\, d\xi_j \qquad (19)

with dimensions A = n \times n, B = n \times p, b = n \times 1, C_j = n \times n, D_j = n \times p, \sigma_j = n \times 1 and \langle d\xi_j d\xi_{j'} \rangle = \delta_{jj'} dt. The cost function is quadratic

\phi(x) = \frac{1}{2}x^T G x + h^T x + c \qquad (20)

R(x, u, t) = \frac{1}{2}x^T Q(t) x + u^T S(t) x + \frac{1}{2}u^T R(t) u \qquad (21)

with G and Q n \times n matrices, S a p \times n matrix and R a p \times p matrix. We parametrize the optimal cost-to-go function as

J(t, x) = \frac{1}{2}x^T P(t) x + \alpha^T(t) x + \beta(t) \qquad (22)

which should satisfy the stochastic HJB equation Eq. 18 with P(T) = G, \alpha(T) = h and \beta(T) = c. P(t) is an n \times n matrix, \alpha(t) is an n-dimensional vector and \beta(t) is a scalar. Substituting this form of J in Eq. 18, the equation contains terms quadratic, linear and constant in x and u. We can thus do the minimization with respect to u exactly and the result is

u(t) = -\Psi(t)x(t) - \psi(t) \qquad (23)


with

\tilde{R} = R + \sum_{j=1}^{m} D_j^T P D_j \qquad (p \times p)

\tilde{S} = B^T P + S + \sum_{j=1}^{m} D_j^T P C_j \qquad (p \times n)

\Psi = \tilde{R}^{-1}\tilde{S} \qquad (p \times n)

\psi = \tilde{R}^{-1}\left( B^T \alpha + \sum_{j=1}^{m} D_j^T P \sigma_j \right) \qquad (p \times 1)

Collecting terms quadratic, linear and constant in x, the stochastic HJB equation decouples into three ordinary differential equations

-\dot{P} = PA + A^T P + \sum_{j=1}^{m} C_j^T P C_j + Q - \tilde{S}^T \tilde{R}^{-1}\tilde{S} \qquad (24)

-\dot{\alpha} = [A - B\tilde{R}^{-1}\tilde{S}]^T \alpha + \sum_{j=1}^{m} [C_j - D_j\tilde{R}^{-1}\tilde{S}]^T P \sigma_j + Pb \qquad (25)

\dot{\beta} = \frac{1}{2}\psi^T \tilde{R}\psi - \alpha^T b - \frac{1}{2}\sum_{j=1}^{m} \sigma_j^T P \sigma_j \qquad (26)

The way to solve these equations is to first solve Eq. 24 for P(t) with end condition P(T) = G. Use this solution in Eq. 25 to compute the solution for \alpha(t) with end condition \alpha(T) = h. Finally, use P(t) and \alpha(t) to compute \beta(t) from Eq. 26. The final solution is obtained by substituting P(t), \alpha(t) and \beta(t) in Eqs. 22 and 23.

4.4 Example of LQ control

Find the optimal control for the dynamics

dx = (x + u)dt + d\xi, \qquad \langle d\xi^2 \rangle = \nu dt

with end cost \phi(x) = 0 and path cost R(x, u) = \frac{1}{2}(x^2 + u^2).

The Riccati equations reduce to

-\dot{P} = 2P + 1 - P^2

-\dot{\alpha} = 0

\dot{\beta} = -\frac{1}{2}\nu P

with P(T) = \alpha(T) = \beta(T) = 0 and

u(x, t) = -P(t)x \qquad (27)

[Figure 2: Stochastic optimal control for a linear system with quadratic cost. T = 10, time discretization dt = 0.1, \nu = 0.05. The optimal control is to steer towards the origin with -P(t)x, where P is roughly constant until T \approx 8. Afterwards control weakens because the expected gain from minimizing x^2 over the remaining time interval reduces to zero.]

We compute the solution for P and \beta by numerical integration. The result is shown in figure 2. The optimal control is to steer towards the origin with -P(t)x, where P is roughly constant until T \approx 8. Afterwards control weakens because the expected gain from minimizing x^2 over the remaining time interval reduces to zero.

Note that in this example the optimal control law Eq. 27 is independent of \nu. It can be verified from the Riccati equations that this is true in general for 'non-multiplicative' noise (C_j = D_j = 0).
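A small sketch of this numerical integration for the example above, stepping the scalar Riccati equation backwards in time (the Euler scheme is an illustrative choice):

```python
import numpy as np

# Integrate the scalar Riccati equation  -P' = 2P + 1 - P^2  backwards from
# P(T) = 0, together with  beta' = -0.5*nu*P, for the LQ example
# dx = (x+u)dt + dxi, R = (x^2 + u^2)/2, phi = 0.
T, dt, nu = 10.0, 0.1, 0.05
n = int(T / dt)
P = np.zeros(n + 1)
beta = np.zeros(n + 1)
for k in range(n, 0, -1):                   # step backwards in time
    dP = -(2 * P[k] + 1 - P[k] ** 2)        # P' from the Riccati equation
    P[k - 1] = P[k] - dt * dP
    beta[k - 1] = beta[k] - dt * (-0.5 * nu * P[k])
print(P[0])        # plateau value, roughly 1 + sqrt(2), as in figure 2
# The optimal feedback control is then u(x, t) = -P(t) * x.
```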

5 Learning

So far, we have assumed that all aspects that define the control problem are known. But in robotics, as well as in other control applications, this is not generally the case. What happens if (part of) the state is not observed?

For instance, as a result of measurement error we do not know x_t but only know a probability distribution p(x_t | y_{0:t}) given some previous observations y_{0:t}. Or, we observe x_t, but do not know the parameters of the dynamical equation Eq. 16. Or, we do not know the cost/reward functions that appear in Eq. 17.

[Figure 3: Different types of control problems. The finite fixed horizon problem (left) is time dependent. The finite receding horizon problem (middle) or the infinite horizon discounted problem (right) is time independent. This has important consequences for learning.]

We make a distinction between control problems where time runs to infinity and problems where time is finite. The control problems that we have discussed so far have a finite horizon time that is constant. The dynamics and the environment may depend explicitly on time and the optimal control is also an explicit function of time (see fig. 3, left). We refer to these problems as finite time problems.

With little adjustment one can also consider finite horizon problems where the horizon time moves with the actual time (fig. 3, middle). In this case the dynamics and environment are necessarily static (in the probabilistic case the dynamics is a homogeneous Markov process) and the optimal control must also be independent of time. This situation is very similar to the case of reinforcement learning, where the horizon time is infinite, but future rewards are discounted exponentially (see fig. 3, right). We refer to these problems as infinite time problems. The issue of learning is quite different for finite and infinite time problems.

5.1 Reinforcement learning

Let us first consider the infinite time case. Imagine eternity, and you are ordered to cook a very good meal for tomorrow. You decide to spend all of today to learn the recipes. When the next day arrives, you are ordered to cook a very good meal for tomorrow. You decide to spend all of today to learn the recipes. Etcetera. In this scenario, the learning phase takes place for an infinite amount of time and the learning time scale is unrelated to the characteristic time scale of the underlying control problem (T for receding horizon problems or -1/\log\gamma, with \gamma the discount rate, for reinforcement learning). The performance is measured asymptotically, and the best learning method will get the best asymptotic performance as fast as possible (see fig. 4).

In this learning paradigm, there is an exploration and optimization trade-off, because some parts of the hidden state space are more relevant for the optimization (control) computation than others. The problem of optimal exploration is a distinct problem from the optimal control problem. The principled solution is to explore all space, for instance by using a dedicated exploration strategy such as proposed in the E3 algorithm [13]. See also the review by [14] for some older methods for exploration such as the actor-critic architecture.

[Figure 4: In infinite time control problems, such as reinforcement learning, the learning performance is measured asymptotically on a time scale that is unrelated to the horizon time (T for receding horizon problems or -1/\log\gamma, with \gamma the discount rate, for reinforcement learning).]

[Figure 5: When life is finite and is executed only one time, we should first learn and then act.]

5.2 Inference and control

For finite time problems, the situation is quite different. Imagine our life as we know it. It is finite and we have only one life. Our aim is to maximize accumulated reward during our lifetime, but in order to do so we have to allocate some resources to learning as well. This requires that we plan our learning, and the learning problem becomes an intricate part of the control problem. At t = 0, we have no knowledge of the world and thus making optimal control actions is impossible. At t = T, learning has become useless, because we will have no more opportunity to make use of it. So we should learn early in life and act later in life, as is schematically shown in fig. 5. This is the problem of joint inference and control and is also referred to as dual control. See [15] for a further discussion.

As an example [16, 17], consider the simple LQ control problem

dx = \alpha u\,dt + d\xi \qquad (28)

with \alpha unobserved and x observed. The path cost R(x, u, t), end cost \phi(x) and noise variance \nu are given.

Although \alpha is unobserved, we have a means to observe \alpha indirectly through the sequence x_t, u_t, t = 0, \ldots. Each time step we observe dx and u, and we can thus update our belief about \alpha using the Bayes formula:

p_{t+dt}(\alpha | dx, u) \propto p(dx | \alpha, u)\, p_t(\alpha) \qquad (29)

with p(dx|\alpha, u) a normal distribution in dx with variance \nu dt and p_t(\alpha) a probability distribution that expresses our belief at time t about the values of \alpha.⁴

The problem is that the future information that we receive about \alpha depends on u: if we use a large u, the term \alpha u\,dt is larger than the noise term d\xi and we will get reliable information about \alpha. This is beneficial, because knowing \alpha will improve our future controls. However, large u values are more costly and also may drive us away from our target state x(T). Thus, the optimal control is a balance between optimal inference and minimal control cost.

The solution is to augment the state space with parameters \theta_t (sufficient statistics) that describe p_t(\alpha) = p(\alpha|\theta_t), with \theta_0 known, which describes our initial belief in the possible values of \alpha. The cost that must be minimized is

C(x_0, \theta_0, u(0 \to T)) = \left\langle \phi(x(T)) + \int_0^T dt\, R(x, u, t) \right\rangle_{(x_0, \theta_0)} \qquad (30)

where the average is with respect to the noise d\xi as well as the uncertainty in \alpha.

For simplicity, consider the example that \alpha can have only two values \alpha = \pm 1. Then p_t(\alpha|\theta) = \sigma(\alpha\theta), with the sigmoid function \sigma(x) = \frac{1}{2}(1 + \tanh(x)). The update equation Eq. 29 implies a dynamics for \theta:

d\theta = \frac{u}{\nu}dx = \frac{u}{\nu}(\alpha u\,dt + d\xi) \qquad (31)

As for dx, d\theta is a stochastic variable with a probability distribution that depends on \theta at time t.
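The following sketch simulates the belief dynamics Eq. 31 under a fixed probing control, illustrating the trade-off discussed above: a larger |u| yields faster, more reliable information about \alpha. The constant controls and parameter values are illustrative assumptions.

```python
import numpy as np

# Simulate dtheta = (u/nu) * (alpha*u*dt + dxi) for a constant probing
# control u, and track the belief p(alpha=+1) = sigma(theta).
rng = np.random.default_rng(1)
nu, dt, T, alpha = 0.5, 0.01, 1.0, +1.0     # true (hidden) alpha
sigma = lambda th: 0.5 * (1.0 + np.tanh(th))

for u in (0.2, 1.0, 3.0):                   # weak vs strong probing
    theta = 0.0                             # flat prior, p(alpha=+1) = 0.5
    for _ in range(int(T / dt)):
        dxi = np.sqrt(nu * dt) * rng.standard_normal()
        theta += (u / nu) * (alpha * u * dt + dxi)
    print(u, sigma(theta))                  # belief in alpha = +1 after time T
```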

With z_t = (x_t, \theta_t) we obtain a standard HJB equation of the form Eq. 18:

-\partial_t J(t, z)\,dt = \min_u \left( R(t, z, u)dt + \langle dz \rangle_z \partial_z J(z, t) + \frac{1}{2}\langle dz^2 \rangle_z \partial_z^2 J(z, t) \right)

with boundary condition J(z, T) = \phi(x). The expectation values appearing in this equation are conditioned on (x_t, \theta_t) and are averages over p(\alpha|\theta_t) and the Gaussian noise. We compute \langle dx \rangle_{x,\theta} = \bar{\alpha} u\,dt, \langle d\theta \rangle_{x,\theta} = \bar{\alpha}\frac{u^2}{\nu}dt, \langle dx^2 \rangle_{x,\theta} = \nu dt, \langle d\theta^2 \rangle_{x,\theta} = \frac{u^2}{\nu}dt, \langle dx\,d\theta \rangle = u\,dt, with \bar{\alpha} = \tanh(\theta) the expected value of \alpha for a given value \theta. The result is

-\partial_t J = \min_u \left( R(x, u, t) + \bar{\alpha} u\,\partial_x J + \frac{u^2\bar{\alpha}}{\nu}\partial_\theta J + \frac{1}{2}\nu\,\partial_x^2 J + \frac{1}{2}\frac{u^2}{\nu}\partial_\theta^2 J + u\,\partial_x\partial_\theta J \right)

with boundary condition J(x, \theta, T) = \phi(x).

Thus, the joint problem of inference on \alpha and control on x has become a control problem in (x, \theta). This is also true in general. Quoting Florentin (1962): "It seems that any systematic formulation of the adaptive control problem leads to a meta-problem which is not adaptive" [16]. Note also that the dynamics for \theta is non-linear (due to the u^2 term) although the original dynamics for dx was linear. The solution to this non-linear stochastic control problem requires the solution of this PDE.

⁴ Note the difference in interpretation between the probability distributions that describe d\xi and \alpha. d\xi is a truly random variable and its value is different for each time step; p(d\xi) describes objectively all there is to know about d\xi. \alpha is not random but constant, but we do not know its value; p(\alpha) describes our subjective knowledge about \alpha, and this knowledge increases over time.


5.3 Certainty equivalence

Although in general partially observable control problems are more complex and give very different solutions than their fully observed counterparts, there exists an exception for a large class of linear quadratic problems, such as the Kalman filter [18].

Consider the dynamics

dx = (x + u)dt + d\xi
y = x + \eta

with d\xi and \eta Gaussian noise, where now x is not observed but y is observed, and all other model parameters are known.

When x is observed, we can compute the quadratic cost, which we assume of the form

C(x_t, t, u_{t:T}) = \sum_{\tau=t}^{T} \frac{1}{2}(x_\tau^2 + u_\tau^2)

Using the Riccati equations of section 4.3 we can easily compute the optimal control u(x, t).

The question is to what extent this control law is optimal if x is not observed. In that case, all we know about x is p(x_t | y_{0:t}), obtained by Kalman filtering on the past observations y_{0:t}, and the optimal control should be computed by minimizing

C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, p(x_t | y_{0:t})\, C(x_t, t, u_{t:T})

Since p(x_t | y_{0:t}) = N(x_t | \mu_t, \sigma_t^2) is Gaussian and C is quadratic in x, we can derive

C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, C(x_t, t, u_{t:T})\, N(x_t|\mu_t, \sigma_t^2) = \sum_{\tau=t}^{T} \frac{1}{2}u_\tau^2 + \left\langle \sum_{\tau=t}^{T} \frac{1}{2}x_\tau^2 \right\rangle_{\mu_t, \sigma_t}

= \sum_{\tau=t}^{T} \frac{1}{2}u_\tau^2 + \frac{1}{2}(\mu_t^2 + \sigma_t^2) + \frac{1}{2}\int dx_t\, \langle x_{t+dt}^2 \rangle_{x_t, \nu dt}\, N(x_t|\mu_t, \sigma_t^2) + \cdots

= \sum_{\tau=t}^{T} \frac{1}{2}u_\tau^2 + \frac{1}{2}(\mu_t^2 + \sigma_t^2) + \frac{1}{2}\langle x_{t+dt}^2 \rangle_{\mu_t, \nu dt} + \frac{1}{2}\sigma_t^2 + \cdots

= C(\mu_t, t, u_{t:T}) + \frac{1}{2}(T - t)\sigma_t^2

The first term is identical to the observed case with x_t \to \mu_t. The second term does not depend on u and thus does not affect the optimal control. Thus, the optimal control for the Kalman filter u_{KF}(y_{0:t}, t) is identical to the optimal control function u(x, t) of the observed case with x_t replaced by \mu_t:

u_{KF}(y_{0:t}, t) = u(\mu_t, t)

This simplification is known as certainty equivalence and holds for non-multiplicative LQ problems.
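A minimal sketch of certainty-equivalent control for this example: a scalar Kalman filter tracks \mu_t from noisy observations and the feedback gain P(t) of the fully observed LQ problem is applied to \mu_t instead of x_t. The discretization, noise levels and horizon are illustrative assumptions.

```python
import numpy as np

# Certainty equivalence for dx = (x+u)dt + dxi, y = x + eta:
# apply the fully observed LQ feedback u = -P(t)*x with x replaced by the
# Kalman filter mean mu_t.
rng = np.random.default_rng(2)
T, dt, nu, r_obs = 10.0, 0.1, 0.05, 0.1     # process and observation noise
n = int(T / dt)

# Backward pass: same scalar Riccati equation as in section 4.4
P = np.zeros(n + 1)
for k in range(n, 0, -1):
    P[k - 1] = P[k] + dt * (2 * P[k] + 1 - P[k] ** 2)

# Forward pass: simulate the true state, filter it, and control the mean
x, mu, s2 = 1.0, 0.0, 1.0                   # true state, belief mean/variance
for k in range(n):
    u = -P[k] * mu                          # certainty-equivalent control
    x += (x + u) * dt + np.sqrt(nu * dt) * rng.standard_normal()
    y = x + np.sqrt(r_obs) * rng.standard_normal()
    # Kalman filter: predict, then correct with the new observation y
    mu += (mu + u) * dt
    s2 += (2 * s2 + nu) * dt
    K = s2 / (s2 + r_obs)
    mu, s2 = mu + K * (y - mu), (1 - K) * s2
print(x, mu)    # both are steered towards the origin
```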


6 Path integral control

6.1 Introduction

As we have seen, the solution of the general stochastic optimal control problem requires the solution of a partial differential equation. For many realistic applications this is not an attractive option. The alternative that is often considered is to approximate the problem somehow by a linear quadratic problem, which can then be solved efficiently using the Riccati equations.

In this section, we discuss a special class of non-linear, non-quadratic control problems for which some progress can be made [3, 4]. For this class of problems, the non-linear Hamilton-Jacobi-Bellman equation can be transformed into a linear equation by a log transformation of the cost-to-go. The transformation stems back to the early days of quantum mechanics and was first used by Schrödinger to relate the Hamilton-Jacobi formalism to the Schrödinger equation. The log transform was first used in the context of control theory by [19] (see also [2]).

Due to the linear description, the usual backward integration in time of the HJB equation can be replaced by computing expectation values under a forward diffusion process. The computation of the expectation value requires a stochastic integration over trajectories that can be described by a path integral. This is an integral over all trajectories starting at x, t, weighted by \exp(-S/\lambda), where S is the cost of the path (also known as the Action) and \lambda is proportional to the noise.

The path integral formulation is well known in statistical physics and quantum mechanics, and several methods exist to compute path integrals approximately. The Laplace approximation approximates the integral by the path of minimal S. This approximation is exact in the limit of \nu \to 0, and the deterministic control law is recovered.

In general, the Laplace approximation may not be sufficiently accurate. A very generic and powerful alternative is Monte Carlo (MC) sampling. The theory naturally suggests a naive sampling procedure, but it is also possible to devise more efficient samplers, such as importance sampling.

We illustrate the control method on two tasks: a temporal decision task, where the agent must choose between two targets at some future time; and a simple n joint arm. The decision task illustrates the issue of spontaneous symmetry breaking and how optimal behavior is qualitatively different for high and low noise. The n joint arm illustrates how efficient approximate inference methods (the variational approximation in this case) can be used to compute optimal controls in very high dimensional problems.

6.2 Path integral control

Consider the special case of Eqs. 16 and 17 where the dynamics is linear in u and the cost is quadratic in u:

f(x, u, t) = f(x, t) + u \qquad (32)

R(x, u, t) = V(x, t) + \frac{1}{2}u^T R u \qquad (33)

with R a non-negative matrix. f(x, t) and V(x, t) are arbitrary functions of x and t. In other words, the system to be controlled can be arbitrarily complex and subject to arbitrarily complex costs. The control, instead, is restricted to the simple linear-quadratic form.

The stochastic HJB equation 18 becomes

-\partial_t J(x, t) = \min_u \left( \frac{1}{2}u^T R u + V(x, t) + (f(x, t) + u)^T\partial_x J(x, t) + \frac{1}{2}\mathrm{Tr}\left(\nu\,\partial_x^2 J(x, t)\right) \right)

where \mathrm{Tr}\,A = \sum_i A_{ii} is the trace of the matrix A. Due to the linear-quadratic appearance of u, we can minimize with respect to u explicitly, which yields

u = -R^{-1}\partial_x J(x, t) \qquad (34)

which defines the optimal control u for each x, t. The HJB equation becomes

-\partial_t J(x, t) = -\frac{1}{2}(\partial_x J(x, t))^T R^{-1}(\partial_x J(x, t)) + V(x, t) + f(x, t)^T\partial_x J(x, t) + \frac{1}{2}\mathrm{Tr}\left(\nu\,\partial_x^2 J(x, t)\right)

Note that after performing the minimization with respect to u, the HJB equation has become non-linear in J. We can, however, remove the non-linearity, and this will turn out to greatly help us to solve the HJB equation. Define \psi(x, t) through J(x, t) = -\lambda\log\psi(x, t). We further assume that there exists a constant \lambda such that the matrices R and \nu satisfy:

\nu = \lambda R^{-1} \qquad (35)

This relation basically says that directions in which control is expensive should have low noise variance. It can also be interpreted as saying that all noise directions are controllable (in the correct proportion). See [4] for further discussion. Then the HJB equation becomes

-\partial_t\psi(x, t) = \left( -\frac{V(x, t)}{\lambda} + f(x, t)^T\partial_x + \frac{1}{2}\mathrm{Tr}\left(\nu\,\partial_x^2\right) \right)\psi(x, t) \qquad (36)

Eq. 36 must be solved backwards in time with \psi(x, T) = \exp(-\phi(x)/\lambda).

The linearity allows us to reverse the direction of computation, replacing it by a diffusion process, in the following way. Let \rho(y, \tau | x, t) describe a diffusion process for \tau > t defined by the Fokker-Planck equation

\partial_\tau\rho = -\frac{V}{\lambda}\rho - \partial_y^T(f\rho) + \frac{1}{2}\mathrm{Tr}\left(\nu\,\partial_y^2\rho\right) \qquad (37)

with \rho(y, t | x, t) = \delta(y - x).

Define A(x, t) = \int dy\, \rho(y, \tau | x, t)\psi(y, \tau). It is easy to see by using the equations of motion Eqs. 36 and 37 that A(x, t) is independent of \tau. Evaluating A(x, t) for \tau = t yields A(x, t) = \psi(x, t). Evaluating A(x, t) for \tau = T yields A(x, t) = \int dy\, \rho(y, T | x, t)\psi(y, T). Thus,

\psi(x, t) = \int dy\, \rho(y, T | x, t)\exp(-\phi(y)/\lambda) \qquad (38)

We arrive at the important conclusion that the optimal cost-to-go J(x, t) = -\lambda\log\psi(x, t) can be computed either by backward integration using Eq. 36 or by forward integration of the diffusion process given by Eq. 37. The optimal control is given by Eq. 34.

Both Eq. 36 and Eq. 37 are partial differential equations and, although they are linear, still suffer from the curse of dimensionality. However, the great advantage of the forward diffusion process is that it can be simulated using standard sampling methods, which can efficiently approximate these computations. In addition, as is discussed in [4], the forward diffusion process \rho(y, T | x, t) can be written as a path integral, and in fact Eq. 38 becomes a path integral. This path integral can then be approximated using standard methods, such as the Laplace approximation.

As an example, we consider the control problem Eqs. 32 and 33 for the simplest case of controlled free diffusion:

V(x, t) = 0, \qquad f(x, t) = 0, \qquad \phi(x) = \frac{1}{2}\alpha x^2

In this case, the forward diffusion described by Eq. 37 can be solved in closed form and is given by a Gaussian with variance \sigma^2 = \nu(T - t):

\rho(y, T | x, t) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left( -\frac{(y - x)^2}{2\sigma^2} \right) \qquad (39)

Since the end cost is quadratic, the optimal cost-to-go Eq. 38 can be computed exactly as well. The result is

J(x, t) = \nu R\log\left( \frac{\sigma}{\sigma_1} \right) + \frac{1}{2}\frac{\sigma_1^2}{\sigma^2}\alpha x^2 \qquad (40)

with 1/\sigma_1^2 = 1/\sigma^2 + \alpha/(\nu R). The optimal control is computed from Eq. 34:

u = -R^{-1}\partial_x J = -R^{-1}\frac{\sigma_1^2}{\sigma^2}\alpha x = -\frac{\alpha x}{R + \alpha(T - t)}

We see that the control attracts x to the origin with a force that increases as t gets closer to T. Note that the optimal control is independent of the noise \nu, as we also saw in the previous LQ example in section 4.4.
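As a numerical check of Eq. 38 in this free-diffusion case, the sketch below evaluates \psi(x, t) on a grid and compares J = -\lambda\log\psi with the closed form Eq. 40; the parameter values are illustrative assumptions.

```python
import numpy as np

# psi(x,t) is a Gaussian convolution of exp(-phi/lambda) for V=0, f=0,
# phi(y) = alpha*y^2/2; J = -lambda*log(psi) should match Eq. 40.
nu, R, alpha, T, t, x = 1.0, 0.5, 2.0, 1.0, 0.0, 1.5
lam = nu * R                                  # lambda = nu * R, from Eq. 35
s2 = nu * (T - t)                             # variance of the forward diffusion

y, dy = np.linspace(-20, 20, 400001, retstep=True)
rho = np.exp(-(y - x) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
psi = np.sum(rho * np.exp(-0.5 * alpha * y ** 2 / lam)) * dy
J_numeric = -lam * np.log(psi)

s1_2 = 1.0 / (1.0 / s2 + alpha / (nu * R))    # sigma_1^2 from Eq. 40
J_exact = nu * R * np.log(np.sqrt(s2 / s1_2)) + 0.5 * (s1_2 / s2) * alpha * x ** 2
print(J_numeric, J_exact)                     # the two values agree
```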

6.3 The diffusion process

The diffusion equation Eq. 37 contains three terms. The second and third terms describe drift f(x, t)dt and diffusion d\xi and correspond to the diffusion process Eq. 16 with u = 0. The first term describes a process that kills a sample trajectory at a rate V(x, t)dt/\lambda. This term does not conserve the probability mass. Thus, the solution of Eq. 37 can be obtained by sampling the following process

dx = f(x, t)dt + d\xi
x = x + dx, \quad \text{with probability } 1 - V(x, t)dt/\lambda
x_i = \dagger, \quad \text{with probability } V(x, t)dt/\lambda \qquad (41)

The third line means that the simulated trajectory is killed with probability V(x, t)dt/\lambda.

We can obtain a sampling estimate of

\Psi(x, t) = \int dy\, \rho(y, T | x, t)\exp(-\phi(y)/\lambda) \approx \frac{1}{N}\sum_{i \in \mathrm{alive}} \exp(-\phi(x_i(T))/\lambda) \qquad (42)

by computing N trajectories x_i(t \to T), i = 1, \ldots, N. Each trajectory starts at the same value x and is sampled using the dynamics Eq. 41. 'Alive' denotes the subset of trajectories that do not get killed along the way by the \dagger operation.
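A naive Monte Carlo sketch of this estimator; the drift f, potential V, end cost \phi and parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Naive sampling estimate of Psi(x,t) = <exp(-phi(x(T))/lambda)> under the
# uncontrolled diffusion, with trajectories killed at rate V(x,t)dt/lambda.
rng = np.random.default_rng(3)
lam, nu, dt, T = 1.0, 1.0, 0.01, 1.0        # lambda = nu * R with R = 1
f = lambda x, t: 0.0                        # assumed drift
V = lambda x, t: 0.5 * x**2                 # assumed state cost
phi = lambda x: 0.5 * x**2                  # assumed end cost

def psi_estimate(x0, t0=0.0, n_samples=5000):
    total = 0.0
    for _ in range(n_samples):
        x, alive = x0, True
        for t in np.arange(t0, T, dt):
            if rng.random() < V(x, t) * dt / lam:
                alive = False               # trajectory killed (dagger)
                break
            x += f(x, t) * dt + np.sqrt(nu * dt) * rng.standard_normal()
        if alive:
            total += np.exp(-phi(x) / lam)
    return total / n_samples                # Eq. 42

x0 = 1.0
J = -lam * np.log(psi_estimate(x0))         # optimal cost-to-go J(x0, 0)
print(J)
```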

6.4 The path integral

The diffusion process Eq. 37 can formally be 'solved' as a path integral [4]:

\rho(y, T | x, t) = \int [dx]^y_x \exp\left( -\frac{1}{\lambda}S_{\mathrm{path}}(x(t \to T)) \right) \qquad (43)

S_{\mathrm{path}}(x(t \to T)) = \int_t^T d\tau \left[ \frac{1}{2}\left(\dot{x}(\tau) - f(x(\tau), \tau)\right)^2 + V(x(\tau), \tau) \right]

Combining Eq. 43 and Eq. 38, the cost-to-go becomes

\Psi(x, t) = \int [dx]_x \exp\left( -\frac{1}{\lambda}S(x(t \to T)) \right)

S(x(t \to T)) = S_{\mathrm{path}}(x(t \to T)) + \phi(x(T)) \qquad (44)

Note that \Psi has the general form of a partition sum. S is the energy of a path and \lambda the temperature. The corresponding probability distribution is

p(x(t \to T) | x, t) = \frac{1}{\Psi(x, t)}\exp\left( -\frac{1}{\lambda}S(x(t \to T)) \right)

J = -\lambda\log\Psi can be interpreted as a free energy. See [4] for details.

Although we have solved the optimal control problem formally as a path integral, we are still left with the problem of computing the path integral. Here one can resort to various standard methods such as Monte Carlo sampling [4], of which the naive forward sampling scheme Eq. 41 is an example. One can, however, improve on this naive scheme using importance sampling, where one changes the drift term so as to minimize the annihilation of the diffusion by the -V(x, t)dt/\lambda term.

A particularly cheap approximation is the Laplace approximation, which finds the trajectory that minimizes S in Eq. 44. This approximation is exact in the limit of \lambda \to 0, which is the noiseless limit. The Laplace approximation gives the deterministic control solution, also known as the classical path. A particularly effective forward importance sampling method is to use the classical path as a drift term in a forward sampling scheme. We will give examples of the naive and importance forward sampling schemes below for the double slit problem.

One can also use a variational approximation to approximate the path integral, using the variational approach for diffusion processes [20]. The general treatment of the variational method for path integral control is future work. An illustration for a particularly simple n joint arm is presented in section 6.7.

6.5 Double slit

In this section, we illustrate the path integral control method for the simple example of a double slit. The example is sufficiently simple that we can compute the optimal control by forward diffusion in closed form. We use this example to compare the Monte Carlo and Laplace approximations to the exact result.

Consider a stochastic particle that moves from time t to T according to the dynamics

dx = u\,dt + d\xi

The cost is given by Eq. 33 with \phi(x) = \frac{1}{2}x^2, and V(x, t_1) implements a slit at an intermediate time t_1, t < t_1 < T:

V(x, t_1) = 0 \quad \text{for } a < x < b \text{ or } c < x < d, \qquad V(x, t_1) = \infty \quad \text{otherwise}

The problem is illustrated in Fig. 6a, where the constant motion is in the t (horizontal) direction and the noise and control are in the x (vertical) direction.

The cost-to-go can be solved in closed form. The result for t > t_1 is given by Eq. 40, and for t < t_1 it is [4]:

J(x, t) = \nu R\log\left( \frac{\sigma}{\sigma_1} \right) + \frac{1}{2}\frac{\sigma_1^2}{\sigma^2}x^2 - \nu R\log\frac{1}{2}\left( F(b, x) - F(a, x) + F(d, x) - F(c, x) \right) \qquad (45)

F(x_0, x) = \mathrm{Erf}\left( \sqrt{\frac{A}{2\nu}}\left( x_0 - \frac{B(x)}{A} \right) \right), \qquad A = \frac{1}{t_1 - t} + \frac{1}{R + T - t_1}, \qquad B(x) = \frac{x}{t_1 - t}

[Figure 6: (a) The double slit. The particle moves horizontally with constant velocity from t = 0 to T = 2 and is deflected up or down by noise and control. The end cost is \phi(x) = x^2/2. A double slit is placed at t_1 = 1 with openings at -6 < x < -4 and 6 < x < 8. Also shown are two example trajectories under optimal control. (b) Cost-to-go J(x, t) as a function of x for t = 0, 0.99, 1.01, 2, as computed from Eqs. 40 and 45. R = 0.1, \nu = 1, dt = 0.02.]

Eqs. 40 and 45 together provide the solution for the control problem in terms of J, and we can compute the optimal control from Eq. 34.

A numerical example of the solution for J(x, t) is shown in fig. 6b. The two parts of the solution (compare t = 0.99 and t = 1.01) are smooth at t = t_1 for x in the slits, but discontinuous at t = t_1 outside the slits. For t = 0, the cost-to-go J is higher around the right slit than around the left slit, because the right slit is further removed from the optimal target x = 0 and thus requires more control u and/or its expected target cost \phi is higher.

6.5.1 MC sampling

We assess the quality of the naive MC sampling scheme, as given by Eqs. 41 and 42, in fig. 7, where we compare J(x, 0) as given by Eq. 45 with the MC estimate Eq. 42. The left figure shows the trajectories of the sampling procedure for one particular value of x. Note the inefficiency of the sampler: most of the trajectories are killed at the infinite potential at t = t1. The right figure shows the accuracy of the estimate of J(x, 0) for all x between −10 and 10, using N = 100000 trajectories. Note that the number of trajectories required to obtain accurate results depends strongly on the value of x and λ, due to the factor exp(−φ(x)/λ) in Eq. 41. For high λ or low ⟨φ⟩, few samples are required (see the estimates around x = −4). For small noise or high ⟨φ⟩, the estimate is strongly determined by the trajectory with minimal φ(x(T)), and many samples may be required to reach this x. In other words, sampling becomes more accurate for high noise, which is a well-known general feature of sampling.


Figure 7: Monte Carlo sampling of J(x, t = 0) with ψ from Eq. 41 for the double slit problem. The parameters are as in fig. 6. (a) Sample of trajectories that start at x to estimate J(x, t). Only trajectories that pass through a slit contribute to the estimate. (b) MC estimate of J(x, t = 0) with N = 100000 trajectories for each x.

Also, low values of the cost-to-go are easier to sample accurately than high values. This is in a sense fortunate, since the objective of the control is to move the particle to lower values of J, so that subsequent estimates become easier.

The sampling is of course particularly difficult in this example because of the infinite potential that annihilates most of the trajectories. However, similar effects should be observed in general, due to the multi-modality of the Action.
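For concreteness, here is a minimal sketch of the naive estimator, assuming the standard relation J = −λ log ψ with λ = νR and the killing/weighting of Eqs. 41 and 42; the parameters follow fig. 6.

import numpy as np

# Minimal sketch of the naive forward-sampling estimate of J(x, 0) for the
# double-slit problem: uncontrolled paths, killed at the screen outside the
# slits at t1, weighted by exp(-phi(x_T)/lambda), with lambda = nu * R.
nu, R, dt, T, t1 = 1.0, 0.1, 0.02, 2.0, 1.0
lamb = nu * R
slits = [(-6.0, -4.0), (6.0, 8.0)]
phi = lambda x: 0.5 * x**2                       # end cost

def J_naive(x0, N=100_000, seed=0):
    rng = np.random.default_rng(seed)
    n_steps, k1 = int(round(T / dt)), int(round(t1 / dt))
    x = np.full(N, float(x0))
    alive = np.ones(N, dtype=bool)
    for k in range(n_steps):
        x = x + np.sqrt(nu * dt) * rng.standard_normal(N)   # u = 0 diffusion
        if k + 1 == k1:                          # screen with two slits at t1
            in_slit = np.zeros(N, dtype=bool)
            for a, b in slits:
                in_slit |= (x > a) & (x < b)
            alive &= in_slit
    w = np.where(alive, np.exp(-phi(x) / lamb), 0.0)
    return -lamb * np.log(w.mean())

# In front of a slit the estimate is reasonable; far from the slits nearly all
# paths are killed and many more samples are needed, as discussed above.
print(J_naive(-4.0))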

6.5.2 Importance sampling

We can improve the sampling procedure by using importance sampling based on the Laplace approximation (see [4]). The Laplace approximation and the results of the importance sampler are given in fig. 8. We see that the Laplace approximation is quite good for this example, in particular when one takes into account that a constant shift in J does not affect the optimal control. The MC importance sampler dramatically improves over the naive MC results in fig. 7, in particular since 1000 times fewer samples are used, and it is also significantly better than the Laplace approximation.
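A minimal sketch of such an importance sampler is given below. The classical-path drift of [4] is not reproduced here; as a crude stand-in, the sketch steers each path toward the slit edge nearest the origin before t1 and applies the LQ control −x/(R + T − t) for the quadratic end cost after t1. The Girsanov correction for this change of drift keeps the estimator unbiased, although its variance depends on how close the chosen drift is to the true classical path.

import numpy as np

# Minimal sketch of an importance-sampled estimate of J(x, 0) for the double
# slit. The drift below is a heuristic stand-in for the classical-path drift;
# each path is reweighted with the Girsanov correction for the added drift.
nu, R, dt, T, t1 = 1.0, 0.1, 0.02, 2.0, 1.0
lamb = nu * R
slits = [(-6.0, -4.0), (6.0, 8.0)]
edges = np.array([-4.0, 6.0])                    # slit edges closest to x = 0
phi = lambda x: 0.5 * x**2

def drift(x, t):
    if t < t1:                                   # steer toward the nearest inner slit edge
        tgt = edges[np.argmin(np.abs(x[:, None] - edges[None, :]), axis=1)]
        return (tgt - x) / (t1 - t)
    return -x / (R + T - t)                      # LQ control for the end cost x^2/2

def J_importance(x0, N=100, seed=0):
    rng = np.random.default_rng(seed)
    n_steps, k1 = int(round(T / dt)), int(round(t1 / dt))
    x = np.full(N, float(x0))
    log_w = np.zeros(N)                          # Girsanov log-weight
    alive = np.ones(N, dtype=bool)
    for k in range(n_steps):
        u = drift(x, k * dt)
        dxi = np.sqrt(nu * dt) * rng.standard_normal(N)
        x = x + u * dt + dxi
        log_w -= (u * dxi + 0.5 * u**2 * dt) / nu
        if k + 1 == k1:                          # screen with two slits at t1
            in_slit = np.zeros(N, dtype=bool)
            for a, b in slits:
                in_slit |= (x > a) & (x < b)
            alive &= in_slit
    w = np.where(alive, np.exp(log_w - phi(x) / lamb), 0.0)
    return -lamb * np.log(w.mean())

print(J_importance(1.0))                         # compare with Eq. 45 at x = 1, t = 0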

6.6 The delayed choice

As a next example, we consider a dynamical system in one dimension that must reach one of two targets at locations x = ±1 at a future time T. As we mentioned earlier, the timing of the decision, that is, when the automaton decides to go left or right, is a consequence of spontaneous symmetry breaking.


Figure 8: Comparison of the Laplace approximation (dotted line) and Monte Carlo importance sampling (solid jagged line) of J(x, t = 0) with the exact result Eq. 45 (solid smooth line) for the double slit problem. The importance sampler used N = 100 trajectories for each x. The parameters are as in fig. 6.

To simplify the mathematics to its bare minimum, we take V = 0 and f = 0 in Eqs. 32 and 33, and φ(x) = ∞ for all x except for two narrow slits of infinitesimal size ε that represent the targets. At the targets we have φ(x = ±1) = 0. In this simple case we can compute J exactly (see [4]); it is given by

J(x, t) = (R/Ttg) ( x²/2 − νTtg log[ 2 cosh( x/(νTtg) ) ] ) + const.

where the constant diverges as O(log ε) independently of x, and Ttg = T − t is the time to reach the targets (the time-to-go). The expression between brackets is a typical free energy with temperature νTtg. It displays symmetry breaking at νTtg = 1 (fig. 9, left). For νTtg > 1 (far in the past or high noise) it is best to steer towards x = 0 (between the targets) and delay the choice of which slit to aim for until later. The reason why this is optimal is that from that position the expected diffusion alone, of size νTtg, is likely to reach either of the slits without control (although it is not yet clear which slit). Only sufficiently late in time (νTtg < 1) should one make a choice. The optimal control is given by the gradient of J:

u = (1/Ttg) ( tanh( x/(νTtg) ) − x )        (46)

Figure 9 (right) depicts two trajectories and their controls under the stochastic optimal control Eq. 46 and the deterministic optimal control (Eq. 46 with ν = 0), using the same realization of the noise. Note that the deterministic control drives x away from zero towards one of the targets, depending on the instantaneous value of sign(x), whereas for large Ttg the stochastic control drives x towards zero and is smaller in size. The stochastic control maintains x around zero and delays the choice of which slit to aim for until Ttg ≈ 1/ν.
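A minimal simulation sketch of this comparison, using one shared noise realization as in fig. 9 (the time step is an arbitrary choice):

import numpy as np

# Minimal sketch: stochastic control Eq. 46 versus its deterministic limit
# (nu -> 0, i.e. u = (sign(x) - x)/T_tg), driven by the same noise.
nu, T, dt = 1.0, 2.0, 0.01
n = int(round(T / dt))
rng = np.random.default_rng(0)
noise = np.sqrt(nu * dt) * rng.standard_normal(n)

def simulate(stochastic):
    x, xs, us = 0.0, [], []
    for k in range(n):
        Ttg = T - k * dt                         # time-to-go
        target = np.tanh(x / (nu * Ttg)) if stochastic else np.sign(x)
        u = (target - x) / Ttg                   # Eq. 46; nu -> 0 gives the deterministic law
        x = x + u * dt + noise[k]
        xs.append(x)
        us.append(u)
    return np.array(xs), np.array(us)

x_sto, u_sto = simulate(True)
x_det, u_det = simulate(False)
# The deterministic controller commits early, driven by sign(x); the stochastic
# controller keeps x near zero and commits only once T_tg drops below 1/nu.
print(x_sto[-1], x_det[-1])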


Figure 9: (Left) Symmetry breaking in J as a function of Ttg implies a 'delayed choice' mechanism for optimal stochastic control. When the target is far in the future, the optimal policy is to steer between the targets. Only when Ttg < 1/ν should one aim for one of the targets. ν = R = 1. (Right) Sample trajectories (top row) and controls (bottom row) under stochastic control Eq. 46 (left column) and deterministic control Eq. 46 with ν = 0 (right column), using identical initial conditions x(t = 0) = 0 and noise realization.

The fact that the symmetry breaking occurs in terms of the value of νTtg is due to the fact that the action Eq. 43 satisfies Spath ∝ 1/Ttg, which in turn is a consequence of the assumption V = 0. When V ≠ 0, Spath also contains a contribution proportional to Ttg, and the symmetry breaking pattern as a function of Ttg can be very different.

6.7 n joint arm

In this example we illustrate the use of the variational approximation. We consider a particularly simple realization of an n joint arm in two dimensions. The location of the ith joint in the 2d plane is

xi = Σ_{j=1..i} cos θj,        yi = Σ_{j=1..i} sin θj

with i = 1, . . . , n. Each joint has length 1. Each of the joint angles is controlled by a variable ui. The dynamics of each joint is

dθi = ui dt + dξi,        i = 1, . . . , n

with dξi independent Gaussian noise with ⟨dξi²⟩ = ν dt. Denote by θ = (θ1, . . . , θn) the vector of joint angles and θ0 = (θ01, . . . , θ0n) its value at t = 0. The expected cost for the control path ut:T is

C(θ0, t, ut:T) = ⟨ φ(θ(T)) + ∫_t^T dτ (1/2) u(τ)² ⟩


φ(θ) = (α/2) ( (xn(θ) − xtarget)² + (yn(θ) − ytarget)² )

with xtarget, ytarget the target coordinates of the end joint.

Because the state dependent path cost V and the intrinsic dynamics f are zero, the solution to the diffusion process Eq. 41 that starts at θ0 is a Gaussian, so that Eq. 38 becomes⁵

Ψ(θ0, t) = ∫ dθ ( 1/√(2πν(T − t)) )^n exp( − Σ_{i=1..n} (θi − θ0i)²/(2ν(T − t)) − φ(θ)/ν )

ui = (1/(T − t)) ( ⟨θi⟩ − θ0i )        (47)

where ⟨θi⟩ is the expectation value of θi computed with respect to the probability distribution

p(θ) = (1/Ψ(θ0, t)) exp( − Σ_{i=1..n} (θi − θ0i)²/(2ν(T − t)) − φ(θ)/ν )        (48)

Thus, the stochastic optimal control problem reduces to the inference problem of computing ⟨θi⟩. One can use a simple importance sampling scheme, where the proposal distribution is the n dimensional Gaussian centered on θ0 and where samples are weighted with exp(−φ(θ)/ν). I tried this, but it does not work very well (results not shown). One can also use a Metropolis-Hastings method with a Gaussian proposal distribution. This works quite well (results not shown). One can also use a very simple variational method, which we discuss now.
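A minimal sketch of the Metropolis-Hastings estimate of ⟨θi⟩ under Eq. 48 is given below. The target location, step size and chain length are illustrative choices, not values from the text; note also that p(θ) can be multi-modal, and a single chain will typically explore only one mode.

import numpy as np

# Minimal sketch: Metropolis-Hastings estimate of <theta_i> under Eq. 48 for
# the n joint arm, with a Gaussian random-walk proposal.
n, nu, T, t, alpha = 3, 0.1, 1.0, 0.0, 10.0
x_t, y_t = 1.0, 1.0                              # illustrative target for the end joint
theta0 = np.zeros(n)                             # current joint angles

def neg_log_p(theta):
    """-log p(theta) of Eq. 48, up to the normalization log Psi."""
    x, y = np.sum(np.cos(theta)), np.sum(np.sin(theta))
    phi = 0.5 * alpha * ((x - x_t)**2 + (y - y_t)**2)
    return np.sum((theta - theta0)**2) / (2 * nu * (T - t)) + phi / nu

def mh_mean(n_samples=50_000, burn_in=5_000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta, e = theta0.copy(), neg_log_p(theta0)
    total = np.zeros(n)
    for s in range(n_samples + burn_in):
        prop = theta + step * rng.standard_normal(n)   # random-walk proposal
        e_prop = neg_log_p(prop)
        if e_prop < e or rng.random() < np.exp(e - e_prop):
            theta, e = prop, e_prop                    # accept
        if s >= burn_in:
            total += theta
    return total / n_samples

theta_mean = mh_mean()
u = (theta_mean - theta0) / (T - t)              # control law Eq. 47
print(theta_mean, u)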

6.7.1 The variational approximation

We compute the expectations ⟨θi⟩ with respect to the distribution Eq. 48 by minimizing the KL divergence

KL = ∫ dθ q(θ) log( q(θ)/p(θ) )

over factorized Gaussian distributions

q(θ) = ∏_{i=1..n} qi(θi),        qi(θ) = (1/√(2πσi²)) exp( −(θ − μi)²/(2σi²) )

The KL becomes

KL = − Σ_{i=1..n} ( log √(2πσi²) + 1/2 ) + log Ψ(θ0, t) + (1/(2ν(T − t))) Σ_{i=1..n} ( σi² + (μi − θ0i)² ) + (1/ν) ⟨φ(θ)⟩_q

⁵This is not exactly correct because θ is a periodic variable; one should use the solution to diffusion on a circle instead. We can ignore this as long as √(ν(T − t)) is small compared to 2π.


Figure 10: Path integral control of an n joint arm. The objective is that the end joint reaches a target location at the end time T. (Left) n = 3. Solid line: joint configuration in 2d at t = 0. Dashed: expected joint configuration computed at the horizon time T from Eq. 48, with θ0 the current joint position. The target location of the end effector is at the origin, resulting in a triangular configuration for the arm. As time t increases, each joint moves to its expected target location due to the control law Eq. 47. At the same time the expected configuration is recomputed, resulting in a different triangular arm configuration. (Right) The same setup, with n = 10.

Because φ is quadratic in xn and yn, and these are defined in terms of sines and cosines, ⟨φ(θ)⟩ can be computed in closed form. The variational equations result from setting the derivatives of the KL with respect to μi and σi² equal to zero. The result is

μi ← θ0i + α(T − t) ( sin(μi) e^(−σi²/2) (⟨xn⟩ − xtarget) − cos(μi) e^(−σi²/2) (⟨yn⟩ − ytarget) )

1/σi² ← (1/ν) ( 1/(T − t) + α e^(−σi²) − α (⟨xn⟩ − xtarget) cos(μi) e^(−σi²/2) − α (⟨yn⟩ − ytarget) sin(μi) e^(−σi²/2) )

with ⟨xn⟩ = Σ_{i=1..n} cos(μi) e^(−σi²/2) and ⟨yn⟩ = Σ_{i=1..n} sin(μi) e^(−σi²/2). After convergence, the estimate is ⟨θi⟩ = μi.

The problem is illustrated in fig. 10. Note that the computation of ⟨θi⟩ solves the coordination problem between the different joints. Once ⟨θi⟩ is known, each joint is steered independently to its target value ⟨θi⟩ using the control law Eq. 47. The computation of ⟨θi⟩ in the variational approximation is very efficient and can be used to control arms with hundreds of joints.
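As a numerical sketch, the same variational solution can also be obtained by minimizing the KL above directly over (μi, σi²) with a generic optimizer, which avoids iterating the fixed-point equations; using ⟨cos θi⟩ = cos(μi) e^(−σi²/2) and ⟨cos² θi⟩ = (1 + cos(2μi) e^(−2σi²))/2, one finds ⟨φ⟩_q = (α/2)[ (⟨xn⟩ − xtarget)² + (⟨yn⟩ − ytarget)² + n − Σi e^(−σi²) ]. The parameter values and target location below are illustrative only.

import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the variational approximation for the n joint arm: minimize
# the KL (up to the log Psi constant) over mu_i and sigma_i^2, with <phi>_q in
# closed form; the optimum satisfies the fixed-point equations given above.
n, nu, T, t, alpha = 10, 0.1, 1.0, 0.0, 1.0
x_t, y_t = 3.0, 3.0                              # illustrative target for the end joint
theta0 = np.zeros(n)                             # current joint angles

def kl(params):
    mu, log_s2 = params[:n], params[n:]
    s2 = np.exp(log_s2)                          # parametrize sigma_i^2 > 0
    xn = np.sum(np.cos(mu) * np.exp(-s2 / 2))    # <x_n>
    yn = np.sum(np.sin(mu) * np.exp(-s2 / 2))    # <y_n>
    exp_phi = 0.5 * alpha * ((xn - x_t)**2 + (yn - y_t)**2 + n - np.sum(np.exp(-s2)))
    entropy = np.sum(0.5 * np.log(s2))           # variational entropy, up to constants
    prior = np.sum(s2 + (mu - theta0)**2) / (2 * nu * (T - t))
    return -entropy + prior + exp_phi / nu       # KL up to the additive log Psi term

x0 = np.concatenate([theta0, np.log(np.full(n, nu * (T - t)))])
res = minimize(kl, x0)
mu, s2 = res.x[:n], np.exp(res.x[n:])
u = (mu - theta0) / (T - t)                      # control law Eq. 47, with <theta_i> ~ mu_i
print(mu)
print(u)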

6.8 Other issues

One can extend the path integral control formalism to multiple agents that jointly solve a task. In this case the agents need to coordinate their actions not only through time, but also among each other.


It was recently shown that the coordination problem can be mapped onto a graphical model inference problem and solved using the junction tree algorithm or approximate inference methods. Exact or approximate control solutions can be computed, for instance, for hundreds of agents, depending on the complexity of the cost function [21, 5].

As briefly mentioned above, the finite horizon control problem with receding horizon is conceptually quite similar to reinforcement learning. One can compare these methods and show that the solutions are indeed very similar. The advantages of the path integral approach are 1) that it seems to be more efficient, since it does not require sampling the whole state space (depending on the horizon time), and 2) that solutions for short horizon times can be combined to compute solutions for longer horizon times. See [22] for a discussion. An open issue is to design efficient learning methods for receding horizon path integral control that mimic reinforcement learning. A first attempt was made in [22].

Finally, one would like to demonstrate that the path integral method is sufficiently fast for real-time robotics applications. For robotics, and also for other realistic control applications, it is important to deal with obstacles in the environment, which may represent static objects or moving actors. Hard objects should ideally be represented by infinite costs; the function V will then be non-smooth, and the probability distributions ρ(y, T |x, t) will have forbidden areas where the probability is zero. The application of variational methods to these problems is hard and remains an open problem.

Acknowledgements

This work is supported in part by the Dutch Technology Foundation and the BSIK/ICIS project.

References

[1] R. Stengel. Optimal control and estimation. Dover Publications, New York, 1993.

[2] W.H. Fleming and H.M. Soner. Controlled Markov Processes and Viscosity Solutions. Springer Verlag, 1992.

[3] H.J. Kappen. A linear theory for control of non-linear stochastic systems. Physical Review Letters, 95:200201, 2005.

[4] H.J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, page P11011, 2005.

[5] B. van den Broek, W. Wiegerinck, and H.J. Kappen. Graphical model inference in optimal control of stochastic multi-agent systems. Journal of AI Research, pages 95–122, 2008. Accepted for publication.


[6] R. Bellman and R. Kalaba. Selected papers on mathematical trends in control theory. Dover, 1964.

[7] L.S. Pontryagin, V.G. Boltyanskii, R.V. Gamkrelidze, and E.F. Mishchenko. The mathematical theory of optimal processes. Interscience, 1962.

[8] J. Yong and X.Y. Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, 1999.

[9] R. Weber. Lecture notes on optimization and control. Lecture notes of a course given autumn 2006, 2006.

[10] U. Jonsson, C. Trygger, and P. Ogren. Lectures on optimal control. 2002.

[11] G. Fraser-Andrews. A multiple-shooting technique for optimal control. Journal of Optimization Theory and Applications, 102:299–313, 1999.

[12] H. Goldstein. Classical mechanics. Addison Wesley, 1980.

[13] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, pages 209–232, 2002.

[14] S.B. Thrun. The role of exploration in learning control. In D.A. White and D.A. Sofge, editors, Handbook of intelligent control. Multiscience Press, 1992.

[15] D.P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Massachusetts, second edition, 2000.

[16] J.J. Florentin. Optimal, probing, adaptive control of a simple Bayesian system. International Journal of Electronics, 13:165–177, 1962.

[17] P.R. Kumar. Optimal adaptive control of linear-quadratic-Gaussian systems. SIAM Journal on Control and Optimization, 21(2):163–178, 1983.

[18] H. Theil. A note on certainty equivalence in dynamic planning. Econometrica, 25:346–349, 1957.

[19] W.H. Fleming. Exit probabilities and optimal stochastic control. Applied Math. Optim., 4:329–346, 1978.

[20] C. Archambeau, M. Opper, Y. Shen, D. Cornford, and J. Shawe-Taylor. Variational inference for diffusion processes. In Daphne Koller and Yoram Singer, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2008. MIT Press.

[21] W. Wiegerinck, B. van den Broek, and H.J. Kappen. Stochastic optimal control in continuous space-time multi-agent systems. In Proceedings UAI, pages 528–535. Association for Uncertainty in Artificial Intelligence, 2006.


[22] H.J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada Seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems, pages 149–181. American Institute of Physics, 2007.
