optimal control and dynamic programming - tu/e · 2018-05-13 · approximate dynamic programming...
TRANSCRIPT
4SC000 Q2 2017-2018
Optimal Control and Dynamic Programming
Duarte Antunes
Motivation
1
• In the last two lectures we discussed two alternative methods to DP for stage decision problems: discretization and static optimization.
• Discretization allows us to obtain an optimal policy (approximately) when the dimension of the problem (state and input dimension) is small, while static optimization allows us to obtain an optimal path when the problem is convex.
dimension | convexity  | discretization (n ≲ 3) | static optimization
small     | convex     | yes                    | yes
small     | non-convex | yes                    | no
large     | convex     | no                     | yes
large     | non-convex | no                     | no

(DP applies in all cases in principle, but computing the costs-to-go exactly is hard in general.)
In this lecture:
• We show how to obtain optimal policies using static optimization and certainty equivalent control (Lec. 3, slide 33), applying this to solve linear quadratic control with input constraints.
• We discuss related approximate dynamic programming techniques, such as MPC and rollout, applying them to a non-convex, large-dimensional problem (control of switched linear systems).
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
Challenge
2
Iterating the dynamic programming algorithm

J_k(x_k) = min_{u_k} [ g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) ]

for stage decision problems is hard, since it is hard to compute the costs-to-go (see e.g. slide 11 of lecture 5): for each state x_k one needs to minimize g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) over u_k, but J_{k+1} is typically unknown (it is known only at the terminal stage), so it is hard to obtain an expression for the cost-to-go.
• Moreover:
- discretization is only possible if the state and input spaces are not large;
- static optimization only assures optimality for convex problems, and only allows computing optimal paths.
• How, then, to obtain a (sub)optimal policy?
Idea I - Certainty equivalent control
3
Compute an optimal path online starting from the current state x_k and apply the first decision. Repeat this procedure at every stage for the current state x_{k+1}; at each stage the optimization runs over the remaining horizon, from the current stage k up to h.
Idea II - MPC
4
Similar to Idea I, but compute decisions only over a (short) horizon H: compute an optimal path over the horizon {k, ..., k+H} starting from the current state x_k and apply the first decision. Repeat this procedure in a receding/rolling-horizon way from x_{k+1}.
• The optimization problem to solve at each stage is then much simpler, which makes the algorithm feasible to run online. This is the fundamental idea of Model Predictive Control (MPC).
• Note that this is (in general) not optimal, not even for problems without disturbances.
Idea III - Rollout
5
Similar to Ideas I and II, but after the optimization horizon a base policy is used. Apply the first decision of the optimization procedure and repeat this procedure in a receding/rolling-horizon way from x_{k+1}.
• The optimization problem to solve at each stage is then much simpler, which makes the algorithm feasible to run online. This is the fundamental idea of rollout.
• Note that this is (in general) not optimal, not even for problems without disturbances.
Approximate dynamic programming
Approximate the cost-to-go J_{k+1} by some J̃_{k+1}, known by construction, and use the solution of

min_{u_k} [ g_k(x_k, u_k) + J̃_{k+1}(f_k(x_k, u_k)) ]

as the control input u_k* (in general different from the optimal control input!). For each state x_k we just need to minimize this function of u_k and compute one action; the true cost-to-go J_{k+1} is typically unknown, whereas J̃_{k+1} is known by construction.
More general than the previous ideas!
6
Discussion
7
• Ideas I, II, III can be seen as special cases of approximate dynamic programming.
• Idea I: for problems with disturbances, the true cost (which requires computing expected values) is approximated by a deterministic cost. For problems without disturbances, Idea I yields the same policy as DP.
• Idea II: the approximation is the deterministic cost over a short horizon. This is an approximation even for problems without disturbances.
• Idea III: the approximation is the cost of the base policy when H = 1, and it is the cost over a short horizon plus the cost of the base policy if the horizon is larger than one. Again, this is an approximation even for problems without disturbances.
• Although we mainly consider finite-horizon costs, the ideas extend naturally to infinite-horizon costs.
We formalize these statements next.
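All three ideas instantiate the same one-step-lookahead template: minimize the stage cost plus some surrogate for the cost-to-go. A minimal Python sketch (the scalar dynamics, costs, and surrogate J_tilde below are toy choices, not taken from the lecture):

```python
def lookahead_policy(x, actions, f, g, J_tilde):
    """Pick the action minimizing g(x, u) + J_tilde(f(x, u))."""
    return min(actions, key=lambda u: g(x, u) + J_tilde(f(x, u)))

# Toy scalar system x_{k+1} = x_k + u_k with stage cost x^2 + u^2 and an
# assumed cost-to-go surrogate J_tilde(x) = 2 x^2 (a guess, not the true
# cost-to-go).
f = lambda x, u: x + u
g = lambda x, u: x**2 + u**2
J_tilde = lambda x: 2.0 * x**2

actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
u0 = lookahead_policy(1.0, actions, f, g, J_tilde)
```

CEC, MPC, and rollout differ only in how J_tilde is constructed (disturbance-free tail cost, truncated horizon, or base-policy cost).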
Certainty equivalent control
8
Certainty equivalent control is the following implicit policy for a stage decision problem:
1. At each stage k (initially k = 0), assuming that the full state x_k is measured, solve the following problem (for initial condition x_k), assuming no disturbances (w_k = 0, w_{k+1} = 0, ...):

min  Σ_{l=k}^{h-1} g_l(x_l, u_l) + g_h(x_h)
s.t. x_{l+1} = f_l(x_l, u_l, 0),  l ∈ {k, k+1, ..., h-1},

and obtain estimates u_k, u_{k+1}, ..., u_{h-1} of the optimal control inputs.

2. Take the first decision u_k and apply it to the process: the system will evolve to x_{k+1} = f_k(x_k, u_k, w_k), in general different from f_k(x_k, u_k, 0) due to the disturbances. Repeat (go to 1.).
Certainty equivalent control
9
For problems with no disturbances, this is equivalent to solving

min_{u_k} g_k(x_k, u_k) + J_{k+1}(x_{k+1})
s.t. x_{l+1} = f_l(x_l, u_l, 0),  l ∈ {k, k+1, ..., h-1},    (1)

where

J_{k+1}(x_{k+1}) = min_{u_{k+1}, ..., u_{h-1}}  Σ_{l=k+1}^{h-1} g_l(x_l, u_l) + g_h(x_h)   s.t. (1), initial condition x_{k+1},

and selecting the first input u_k and applying it.

Note that, for problems without disturbances, the J_k are the optimal costs-to-go and satisfy the DP equation

J_k(x_k) = min_{u_k} g_k(x_k, u_k) + J_{k+1}(x_{k+1}).

For problems with stochastic disturbances x_{l+1} = f_l(x_l, u_l, w_l), l ∈ {k, ..., h-1}, the J_k are only approximations of the optimal costs-to-go: in the exact DP equation

J_k(x_k) = min_{u_k} g_k(x_k, u_k) + E[J_{k+1}(x_{k+1})],

the term E[J_{k+1}(x_{k+1})] is approximated by the deterministic J_{k+1}(x_{k+1}).
Model Predictive Control
10
• At each time k consider the optimal control problem only over a horizon H:

min_{u_k, u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k}^{k+H-1} g_l(x_l, u_l)
s.t. x_{l+1} = f_l(x_l, u_l),  l ∈ {k, k+1, ..., k+H-1}, initial condition x_k,    (1)

and select the first control input u_k resulting from this optimization.

• Equivalent to min_{u_k} g_k(x_k, u_k) + J̃_{k+1}(x_{k+1}) s.t. (1), where

J̃_{k+1}(x_{k+1}) = min_{u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k+1}^{k+H-1} g_l(x_l, u_l)   s.t. (1), l ∈ {k+1, k+2, ..., k+H-1}.
Model Predictive Control
11
• Note that at the last decision stages, k ∈ {h-H, ..., h-1}, the cost function is slightly different for finite-horizon problems, including also the terminal cost:

min_{u_k, u_{k+1}, ..., u_{h-1}}  Σ_{l=k}^{h-1} g_l(x_l, u_l) + g_h(x_h)   s.t. (1), l ∈ {k, k+1, ..., h-1}.

• There are several variants of Model Predictive Control; in particular, some variants use a terminal constraint (this is useful to prove stability). For example, impose that the state after the horizon must be zero, x_{k+H} = 0 (for finite-horizon problems at the last stages, x_h = 0).
Rollout
12
• At each time k consider the optimal control problem over a horizon H, assuming that after the horizon a base policy u_l = μ_l(x_l) is used:

min_{u_k, u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k}^{k+H-1} g_l(x_l, u_l) + Σ_{l=k+H}^{h-1} g_l(x_l, μ_l(x_l)) + g_h(x_h)
s.t. (1), l ∈ {k, k+1, ..., k+H-1}, initial condition x_k,

and select the first control input u_k resulting from this optimization. Similar to MPC, but with a base policy after the horizon.

• Equivalent to min_{u_k} g_k(x_k, u_k) + J̃_{k+1}(x_{k+1}) s.t. (1), where

J̃_{k+1}(x_{k+1}) = min_{u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k+1}^{k+H-1} g_l(x_l, u_l) + Σ_{l=k+H}^{h-1} g_l(x_l, μ_l(x_l)) + g_h(x_h)
s.t. (1), l ∈ {k+1, k+2, ..., k+H-1}.
Further remarks on ADP
13
• The quality of an approximation is measured by how "good" u_k* is, i.e., how close it is to the optimal decision, and not by how "good" the approximation J̃_{k+1}(x_{k+1}) of J_{k+1}(x_{k+1}) is.
• Decisions only need to be computed (in real time) for the value of the present state (no need to iterate the cost-to-go).
• There are several variants to approximate the costs-to-go.
• Due to the heuristic nature of the approximation, it is very hard to quantify when a specific approximation method is good and to establish formal results.
• For example, we have seen that the optimal policy for the infinite-horizon linear quadratic regulator problem makes the closed loop of a linear system stable. This is typically very hard to establish for approximate methods.
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
14
Dynamic model: x_{k+1} = A x_k + B u_k + w_k, with u_k ∈ R (scalar input, for simplicity) and possible disturbances w_k.
Input constraints: u_k ∈ [-c, c] for all k (given c > 0).
Cost function: Σ_{k=0}^{h-1} (x_k' Q x_k + u_k' R u_k) + x_h' Q_h x_h.
Find a policy π = {μ_0, ..., μ_{h-1}}, u_k = μ_k(x_k), which minimizes the cost function.
Let us focus first on obtaining an optimal path; then we will find a control policy with certainty equivalence and MPC.
Quadratic programming formulation
15
Define

x = [x_1' x_2' ... x_h']',   u = [u_0' u_1' ... u_{h-1}']'.

Dynamic model (equality constraints): x = Ā x_0 + B̄ u, with

Ā = [A; A^2; ...; A^h],

B̄ = [B 0 0 ... 0;
     AB B 0 ... 0;
     A^2 B AB B ... 0;
     A^3 B A^2 B AB ... 0;
     ...;
     A^{h-1} B A^{h-2} B A^{h-3} B ... AB B].

Cost function:

(Ā x_0 + B̄ u)' Q̄ (Ā x_0 + B̄ u) + u' R̄ u,

with the block-diagonal weights

Q̄ = blkdiag(Q, Q, ..., Q, Q_h),   R̄ = blkdiag(R, R, ..., R).

Inequality constraints:

[I; -I] u ≤ c [1; 1; ...; 1].

This is a standard quadratic programming problem (see, e.g., Convex Optimization, Stephen Boyd and Lieven Vandenberghe), implemented in Matlab with quadprog.m. Convex problem!
This already allows us to obtain an optimal path.
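The same construction can be sketched outside Matlab. Below is a Python sketch that builds the stacked matrices and solves the box-constrained QP by projected gradient descent instead of quadprog (an assumed substitute solver; Q = Q_h = I, and the horizon, weight R, step size, and iteration count are illustrative choices, not the lecture's parameters):

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def build_stacked(A, B, h):
    """Build Abar (hn x n) and Bbar (hn x hm) with x = Abar*x0 + Bbar*u."""
    n, m = len(A), len(B[0])
    powers, Ap = [], [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(h):
        Ap = matmul(A, Ap)
        powers.append(Ap)                  # powers[i] = A^(i+1)
    Abar = [row for P in powers for row in P]
    Bbar = [[0.0] * (h * m) for _ in range(h * n)]
    for i in range(h):                     # block row i gives x_{i+1}
        for j in range(i + 1):             # block (i, j) = A^(i-j) * B
            blk = B if i == j else matmul(powers[i - j - 1], B)
            for r in range(n):
                for s in range(m):
                    Bbar[i * n + r][j * m + s] = blk[r][s]
    return Abar, Bbar

# Double integrator discretized with sampling time tau; horizon and R are
# illustrative (the example slide uses h = 50, R = 1).
tau = 0.2
A = [[1.0, tau], [0.0, 1.0]]
B = [[tau * tau / 2.0], [tau]]
h, c, R = 10, 0.25, 0.01
x0 = [1.0, 0.0]

Abar, Bbar = build_stacked(A, B, h)
d = matvec(Abar, x0)                       # drift term Abar*x0
u = [0.0] * h
for _ in range(2000):                      # projected gradient, Q = I
    x = [d[i] + sum(Bbar[i][j] * u[j] for j in range(h)) for i in range(len(d))]
    grad = [2.0 * (sum(Bbar[i][j] * x[i] for i in range(len(x))) + R * u[j])
            for j in range(h)]
    u = [min(c, max(-c, u[j] - 0.02 * grad[j])) for j in range(h)]
```

Projected gradient is used here only to keep the sketch dependency-free; any QP solver (e.g. quadprog in Matlab) is the practical choice.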
Example
16
Double integrator (model: slide 20, II_1), with state x = [y v]' (position y, velocity v).
Parameters: x_0 = [1 0]', τ = 0.2, Q = Q_h = I, R = 1, h = 50, and input constraints u_k ∈ [-L, L], L = 0.25.
• Optimal path, no constraints (slide 30, Lec 5): run the Matlab code on slides 25-28 with parameters optpolicy=1, fdisturbances=0.
• Optimal path with constraints: optpolicy=2, fdisturbances=0 (or optpolicy=3, fdisturbances=0!).
[Plots of y(t), v(t), and u(t) for both cases.]
Example
17
What if there are disturbances? The behaviour is far from the desired one! We want an optimal policy: CEC! (and MPC)
* run the Matlab code on slides 25-28 with parameters optpolicy=2, fdisturbances=1
[Plots of y(t), v(t), and u(t) under disturbances.]
Certainty equivalent control
18
At each stage k define

x = [x_{k+1}' x_{k+2}' ... x_h']',   u = [u_k' u_{k+1}' ... u_{h-1}']'.

Dynamic model (equality constraints): x = Ā x_k + B̄ u, now with

Ā = [A; A^2; ...; A^{h-k}],

B̄ = [B 0 0 ... 0;
     AB B 0 ... 0;
     A^2 B AB B ... 0;
     A^3 B A^2 B AB ... 0;
     ...;
     A^{h-1-k} B A^{h-2-k} B A^{h-3-k} B ... AB B].

Cost function:

Σ_{l=k}^{h-1} (x_l' Q x_l + u_l' R u_l) + x_h' Q_h x_h = (Ā x_k + B̄ u)' Q̄ (Ā x_k + B̄ u) + u' R̄ u   (up to the constant x_k' Q x_k),

with Q̄ = blkdiag(Q, ..., Q, Q_h), R̄ = blkdiag(R, ..., R).

Inequality constraints: [I; -I] u ≤ c [1; ...; 1].

Take the first component of u, that is, u_k, and apply it to the plant.
Example
19
Considering disturbances affecting the system, x_{k+1} = A x_k + B u_k + w_k (see the code on slide 25 for details on the disturbances), and applying certainty equivalent control, we obtain the results below.
* run the Matlab code on slides 25-28 with parameters optpolicy=3, fdisturbances=1
[Plots of y(t), v(t), and u(t) under CEC.]
Discussion
• Although quadprog.m is quite efficient, for problems with a high sampling rate or a long time horizon of interest it might not be possible to run quadprog.m online; this motivates MPC.
• Another motivation for MPC is that the model is typically not extremely accurate, and considering a short horizon might even lead to better results.
• We shall try MPC imposing that the final state after the prediction horizon is zero. See Bertsekas' book, Ch. 6, for the close connection with rollout (after the horizon, the base policy is to apply zero control input!).
• CEC allowed us to transform an algorithm that computes optimal paths (using quadprog.m) into an algorithm that computes optimal policies.
• The fact that it works so well follows from the fact that the problem is convex: a problem with inequality constraints is convex if the cost is convex and the set of feasible solutions is convex, which can be shown to be the case for this problem.
20
MPC
21
At each stage k define

x = [x_{k+1}' x_{k+2}' ... x_{k+H-1}']',   u = [u_k' u_{k+1}' ... u_{k+H-1}']'.

Dynamic model (equality constraints): x = Ā x_k + B̄ u, now with

Ā = [A; A^2; ...; A^{H-1}],

B̄ = [B 0 0 ... 0;
     AB B 0 ... 0;
     A^2 B AB B ... 0;
     ...;
     A^{H-2} B A^{H-3} B ... B 0]

(the last input u_{k+H-1} does not affect the stacked state, but is still penalized through R̄).

Cost function:

Σ_{l=k}^{k+H-1} (x_l' Q x_l + u_l' R u_l) = (Ā x_k + B̄ u)' Q̄ (Ā x_k + B̄ u) + u' R̄ u   (up to the constant x_k' Q x_k),

with Q̄ = blkdiag(Q, ..., Q), R̄ = blkdiag(R, ..., R).

Inequality constraints: [I; -I] u ≤ c [1; ...; 1].

Take the first component of u, that is, u_k, and apply it to the plant. (In the code on slide 28, the variant with terminal constraint x_{k+H} = 0 imposes this as an additional equality constraint in quadprog.)
Example
22
First iteration (no disturbances): the optimization predicts a state and input trajectory over the horizon. Applying the first control input u_0 = -0.25 leads to the state x_1 = [0.995 -0.05]', and the process is repeated.
* run the Matlab code on slides 25-28 with parameters optpolicy=4, fdisturbances=0
[Plots of the predicted y(t), v(t), and u(t).]
Example
23
End result: note that it is different from the one predicted at iteration 0, because the state and input predictions change over time.
* run the Matlab code on slides 25-28 with parameters optpolicy=5, fdisturbances=0
[Plots of y(t), v(t), and u(t).]
Example
24
Results with disturbances.
* run the Matlab code on slides 25-28 with parameters optpolicy=5, fdisturbances=1
[Plots of y(t), v(t), and u(t).]
Matlab code
25
% Model definition
optpolicy = 1; fdisturbances = 0;
% double integrator model
Ac = [0 1; 0 0]; Bc = [0; 1]; tau = 0.2;
Q = [1 0; 0 1]; S = [0; 0]; R = 0.01; QT = [1 0; 0 1];
n = 2; m = 1; x0 = [1; 0]; c = 0.25;
sigmaw = 0.01;  % disturbance level
H = 25;         % prediction horizon for mpc
h = 50;         % simulation horizon
sysd = c2d(ss(Ac,Bc,zeros(1,n),0),tau);
A = sysd.a; B = sysd.b; C = sysd.c;

% Preliminaries for the policies
switch optpolicy
    case 1
        P{h+1} = QT;
        for k = h:-1:1  % Riccati equations
            P{k} = A'*P{k+1}*A + Q - ...
                (S+A'*P{k+1}*B)*pinv(R+B'*P{k+1}*B)*(S'+B'*P{k+1}*A);
            K{k} = -pinv(R+B'*P{k+1}*B)*(S'+B'*P{k+1}*A);
        end
    case 2
        U2 = quadconstrainedcontrol(A,B,Q,R,c,h,x0,1);
    case 4
        U3 = [quadconstrainedcontrol(A,B,Q,R,c,H,x0,2); zeros(h-H,1)];
end
Matlab code
26
% Simulate system
t = tau*(0:1:h); x = zeros(n,h); u = zeros(1,h);
x(:,1) = x0;
randn('seed',15);  % used for results in slides
if fdisturbances == 1
    gainnodisturbances = 1;
else
    gainnodisturbances = 0;
end
cost = 0;
for k = 1:h
    switch optpolicy
        case 1  % LQR
            u(:,k) = K{k}*x(:,k);
        case 2  % CEC no disturbances
            u(:,k) = U2(k);
        case 3  % CEC
            [u_] = quadconstrainedcontrol(A,B,Q,R,c,h+1-k,x(:,k),1);
            u(:,k) = u_(1);
        case 4  % MPC no disturbances
            u(:,k) = U3(k);
        case 5  % MPC
            if k <= h-H-1
                [u_] = quadconstrainedcontrol(A,B,Q,R,c,H,x(:,k),2);
            else
                [u_] = quadconstrainedcontrol(A,B,Q,R,c,h+1-k,x(:,k),1);
            end
            u(:,k) = u_(1);
    end
    w(:,k) = B*sigmaw*randn(1)*gainnodisturbances;
    x(:,k+1) = A*x(:,k) + B*u(:,k) + w(:,k);
    cost = cost + x(:,k)'*Q*x(:,k) + u(:,k)'*R*u(:,k);
end
v = x(1,:);
Matlab code
27
% Continuous-time simulation and plots
N = 1000; ts = tau/N; nl = h*N;
uc = kron(u,ones(1,N));
sysd = c2d(ss(Ac,Bc,zeros(1,size(Ac,1)),zeros(1,size(Bc,2))),ts);
Ad = sysd.a; Bd = sysd.b;
xc = zeros(n,nl+1); xc(:,1) = x(:,1);
for k = 1:h
    xc(:,(k-1)*N+1) = x(:,k);
    for l = 1:N
        xc(:,(k-1)*N+l+1) = Ad*xc(:,(k-1)*N+l) + Bd*u(:,k);
    end
end
tc = ts*(0:nl);

figure(1)
subplot(1,3,1)
plot(tc,xc(1,:))
set(gca,'Fontsize',18); xlabel('t'); ylabel('y(t)'); grid on
axis([0 h*tau -0.2 1])
subplot(1,3,2)
plot(tc(1:end),xc(2,:))
set(gca,'Fontsize',18); xlabel('t'); ylabel('v(t)'); grid on
subplot(1,3,3)
plot(tc(1:end-1),uc)
set(gca,'Fontsize',18); xlabel('t'); ylabel('u(t)'); grid on
Matlab code
28
% Key function
function [u] = quadconstrainedcontrol(A,B,Q,R,L,h,x0,opt)
% opt = 1: CEC, opt = 2: MPC
n = size(A,1); m = size(B,2);
% define the stacked matrices Abar, Bbar, Qbar, Rbar
Abar = zeros(n*h,n); Bbar = zeros(n*h,m*h);
Qbar = zeros(n*h,n*h); Rbar = zeros(m*h,m*h);
for i = 1:h
    Abar(1+(i-1)*n:n*i,:) = A^i;
    for j = 1:i
        Bbar(1+(i-1)*n:n*i,1+(j-1)*m:m*j) = A^(i-j)*B;
    end
    Qbar(1+(i-1)*n:n*i,1+(i-1)*n:n*i) = Q;
    Rbar(1+(i-1)*m:m*i,1+(i-1)*m:m*i) = R;
end
switch opt
    case 1  % CEC
        u = quadprog(Rbar+Bbar'*Qbar*Bbar,((Bbar'*Qbar*Abar)*x0)', ...
            [eye(m*h);-eye(m*h)],[L*ones(m*h,1);L*ones(m*h,1)],[],[]);
    case 2  % MPC (terminal constraint x_{k+H} = 0 as an equality constraint)
        u = quadprog(Rbar+Bbar'*Qbar*Bbar,((Bbar'*Qbar*Abar)*x0)', ...
            [eye(m*h);-eye(m*h)],[L*ones(m*h,1);L*ones(m*h,1)], ...
            Bbar(end-n+1:end,:),-A^h*x0);
end
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
Switched linear systems
29
Dynamic model: x_{k+1} = A_{σ_k} x_k, with x_k ∈ R^n and control input u_k = σ_k ∈ {1, ..., m}.
Cost function: Σ_{k=0}^{h-1} x_k' Q_{σ_k} x_k + x_h' Q_h x_h.
Goal: find an optimal path (or policy) for the control input for a given initial condition x_0.
• Applying the dynamic programming algorithm would lead to a combinatorial problem.
• Let us apply approximate dynamic programming.
CEC
30
1. At step k compute all the possible control sequences (σ_k, σ_{k+1}, σ_{k+2}, ..., σ_{h-1}).
2. Compute the cost Σ_{l=k}^{h-1} x_l' Q_{σ_l} x_l + x_h' Q_h x_h for each of these control sequences, and pick a sequence (σ̄_k, σ̄_{k+1}, ..., σ̄_{h-1}) with minimum cost.
3. Apply the first element σ_k = σ̄_k, set k = k + 1, and go again to step 1 if k ≠ h.
Still combinatorial!
MPC
31
1. At step k compute all possible control sequences in the prediction horizon:
(σ_k, σ_{k+1}, σ_{k+2}, ..., σ_{k+H-1}) if 0 ≤ k ≤ h - H, and (σ_k, σ_{k+1}, ..., σ_{h-1}) otherwise.
2. Compute the cost
Σ_{l=k}^{k+H-1} x_l' Q_{σ_l} x_l if 0 ≤ k ≤ h - H, and Σ_{l=k}^{h-1} x_l' Q_{σ_l} x_l + x_h' Q_h x_h otherwise,
for each of these control sequences, and pick a sequence (σ̄_k, σ̄_{k+1}, ..., σ̄_{k+H-1}) (respectively (σ̄_k, ..., σ̄_{h-1})) with minimum cost.
3. Apply the first element σ_k = σ̄_k, set k = k + 1, and go again to step 1 if k ≠ h.
32
Rollout
Pick a base policy (σ̂_0, σ̂_1, ..., σ̂_{h-1}) and start with k = 0, H > 0.
1. At step k compute all the control sequences with σ_k, ..., σ_{k+H-1} free, taking the form
(σ_k, ..., σ_{k+H-1}, σ̂_0, σ̂_1, ..., σ̂_{h-1-k-H}) if 0 ≤ k ≤ h - H, and (σ_k, σ_{k+1}, ..., σ_{h-1}) otherwise.
2. Compute the cost Σ_{l=k}^{h-1} x_l' Q_{σ_l} x_l + x_h' Q_h x_h for each of these control sequences, and pick a sequence (σ̄_k, σ̄_{k+1}, ..., σ̄_{h-1}) with minimum cost.
3. Apply the first element σ_k = σ̄_k, set k = k + 1, and go again to step 1 if k ≠ h.
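A minimal Python sketch of this rollout scheme (the two 2x2 modes, the horizons, and the periodic base policy are illustrative, and the base schedule is indexed here by absolute stage):

```python
from itertools import product

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def quad(Q, v):
    return sum(v[i] * Q[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))

def tail_cost(x, sigmas, A, Q, Qh):
    """Cost of following the mode sequence sigmas from state x."""
    c = 0.0
    for s in sigmas:
        c += quad(Q[s], x)
        x = matvec(A[s], x)
    return c + quad(Qh, x)

def rollout_step(x, k, h, H, A, Q, Qh, base):
    """Enumerate the m^H head sequences, append the base-policy tail,
    and return the first mode of the cheapest sequence."""
    modes = range(len(A))
    best = min(product(modes, repeat=min(H, h - k)),
               key=lambda head: tail_cost(
                   x, list(head) + [base(l) for l in range(k + len(head), h)],
                   A, Q, Qh))
    return best[0]

# Two modes: mode 0 damps the first state, mode 1 damps the second.
A = [[[0.5, 0.0], [0.0, 1.1]], [[1.1, 0.0], [0.0, 0.5]]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Q, Qh = [I2, I2], I2
base = lambda l: l % 2          # periodic base policy 0,1,0,1,...

x, h, H = [1.0, 1.0], 12, 3
cost, sig = 0.0, []
for k in range(h):
    s = rollout_step(x, k, h, H, A, Q, Qh, base)
    sig.append(s)
    cost += quad(Q[s], x)
    x = matvec(A[s], x)
cost += quad(Qh, x)
```

By construction the candidate set at each step contains the base policy's own continuation, so for this deterministic problem the realized rollout cost is never worse than the base policy's cost.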
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
• Mixing fluids
• Explicit versus implicit policies
• Resource-aware control
33
Mixing
How to mix two fluids in minimum time*? Camera images are fed to a control law, which commands the actuators.
Actuation: 4 possible rotations, σ_k ∈ {1, ..., 4}, decided once every sampling interval.
*Mixing and switching, V.S. Dolk MSc thesis
34
Periodic open loop policy
The basic mechanism for effective mixing is repetitive stretching & folding. This basic mechanism can be seen as a periodic open-loop strategy. Can a closed-loop strategy perform better?
35
Model
Divide the domain into N subdomains and denote by x^i_k, i ∈ {1, ..., N}, the concentration of fluid in subdomain i at time kτ. Then the following equation describes how the concentrations evolve as a function of the control action:

x_{k+1} = A_{σ_k} x_k,   σ_k ∈ {1, ..., 4}.
36
Cost function
• The intensity of segregation is defined as e_k' e_k, where

e_k := x_k - c [1 1 ... 1]',

and measures how far the concentration is from homogeneity.
• When homogeneity is achieved, the concentrations in all subdomains are equal to the constant c = (1/N) [1 1 ... 1] x_0, so the error e_k converges to zero.
• We are interested in minimizing the intensity of segregation over every time step: Σ_{k=0}^{h} e_k' e_k.
37
Results
[Figures: mixing patterns under the periodic and the rollout strategies.]
• The periodic sequence consists of alternating a rotation of the outer cylinder and a clockwise rotation of the inner cylinder.
• The rollout strategy consists of optimizing over a horizon of length 4, assuming that afterwards this periodic policy is used.
38
Results
• An MPC approach with horizon H = 2 boils down to picking the decision minimizing the (squared) norm of the error at the next iteration (a minimum-error "greedy" strategy):

argmin_{σ_k} [ e_k' e_k + e_{k+1}' e_{k+1} ] = argmin_{σ_k} [ e_k' e_k + e_k' A_{σ_k}' A_{σ_k} e_k ] = argmin_{σ_k} e_k' A_{σ_k}' A_{σ_k} e_k,

using e_{k+1}' e_{k+1} = e_k' A_{σ_k}' A_{σ_k} e_k.
• In this case, MPC with H = 2 leads to worse results.
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
• Mixing fluids
• Explicit versus implicit policies
• Resource-aware control
39
Introduction
• We show next that, for switched linear systems, we can write explicit rollout and MPC policies.
• Although finite-horizon costs can be considered, this is actually easier if we consider infinite-horizon costs:

x_{k+1} = A_{σ_k} x_k,   cost Σ_{k=0}^{∞} x_k' Q_{σ_k} x_k.

• Note that certainty equivalent control in the case of infinite-horizon switched-system problems would entail computing an infinite number of options and is therefore not viable.
40
Cost of base policy
• Suppose that the base policy corresponds to constant switching, σ_k = j for all k, so that x_{k+1} = A_j x_k. Then it is easy to show that

Σ_{k=l}^{∞} x_k' Q_j x_k = x_l' M_j x_l,

where M_j = Σ_{k=0}^{∞} (A_j')^k Q_j (A_j)^k is the solution to the linear system

A_j' M_j A_j - M_j = -Q_j.

• If the base policy corresponds to periodic switching, a similar expression can be obtained. For example, if m = 2, σ_k ∈ {1, 2}, and

(σ_l, σ_{l+1}, σ_{l+2}, σ_{l+3}, σ_{l+4}, ...) = (1, 2, 1, 2, 1, ...),

then Σ_{k=l}^{∞} x_k' Q_{σ_k} x_k = x_l' M̄_1 x_l, where

A_1' M̄_2 A_1 - M̄_1 = -Q_1,
A_2' M̄_1 A_2 - M̄_2 = -Q_2.
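For a stable A_j, the Lyapunov equation of the constant base policy can be solved by simply truncating the series Σ_k (A_j')^k Q_j (A_j)^k; a dependency-free Python sketch with illustrative 2x2 data:

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(r) for r in zip(*X)]

def lyapunov_series(A, Q, iters=200):
    """Approximate M with A' M A - M = -Q by summing (A')^k Q A^k."""
    M = [row[:] for row in Q]
    term = [row[:] for row in Q]
    At = transpose(A)
    for _ in range(iters):
        term = matmul(At, matmul(term, A))   # next series term (A')^k Q A^k
        M = [[M[i][j] + term[i][j] for j in range(len(M))]
             for i in range(len(M))]
    return M

# Illustrative stable mode (spectral radius 0.9) and weight Q = I.
A = [[0.9, 0.1], [0.0, 0.8]]
Q = [[1.0, 0.0], [0.0, 1.0]]
M = lyapunov_series(A, Q)
```

In practice a direct Lyapunov solver (e.g. dlyap in Matlab) would be used; the truncated series is just the definition of M_j made executable.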
41
Explicit rollout policy
Note that in the prediction horizon there are m^H options to be considered. Suppose that we index all these options by i ∈ {1, 2, 3, ..., m^H}. Then each i corresponds to a unique schedule of the free decisions at each iteration,

(σ_k, σ_{k+1}, ..., σ_{k+H-1}) = (σ^i_0, σ^i_1, σ^i_2, ..., σ^i_{H-1}),

and a unique cost matrix P_i can be obtained, with

x_k' P_i x_k = Σ_{l=k}^{k+H-1} x_l' Q_{σ_l} x_l + Σ_{l=k+H}^{∞} x_l' Q_{σ_l} x_l.

Then an explicit expression for the rollout policy is

σ_k = π(ι(x_k)),   ι(x_k) = argmin_i x_k' P_i x_k,

where π is the map that extracts the first component of the scheduling associated with i: π(i) = σ^i_0.
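Online, the explicit rollout policy only evaluates m^H quadratic forms and takes an argmin; a Python sketch with assumed (illustrative) P_i matrices for m = 2, H = 2:

```python
def quad(P, x):
    return sum(x[i] * P[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

# P[i] is the cost matrix of option i; pi[i] is the first mode of option i.
# These numbers are illustrative placeholders, not computed from a model.
P = [[[2.0, 0.0], [0.0, 5.0]],   # option 0: modes (0, 0)
     [[2.5, 0.0], [0.0, 3.0]],   # option 1: modes (0, 1)
     [[4.0, 0.0], [0.0, 2.5]],   # option 2: modes (1, 0)
     [[5.0, 0.0], [0.0, 2.0]]]   # option 3: modes (1, 1)
pi = [0, 0, 1, 1]

def explicit_rollout(x):
    """sigma_k = pi(argmin_i x' P_i x)"""
    i_star = min(range(len(P)), key=lambda i: quad(P[i], x))
    return pi[i_star]
```

With these placeholder matrices, a state dominated by its first component selects mode 0 and a state dominated by its second component selects mode 1.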
42
Example
If m = 2 and H = 2:

i | (σ^i_0, σ^i_1) | π(i) | P_i
1 | (1, 1)         | 1    | P_1 = Q_1 + A_1' Q_1 A_1 + (A_1 A_1)' M (A_1 A_1)
2 | (1, 2)         | 1    | P_2 = Q_1 + A_1' Q_2 A_1 + (A_2 A_1)' M (A_2 A_1)
3 | (2, 1)         | 2    | P_3 = Q_2 + A_2' Q_1 A_2 + (A_1 A_2)' M (A_1 A_2)
4 | (2, 2)         | 2    | P_4 = Q_2 + A_2' Q_2 A_2 + (A_2 A_2)' M (A_2 A_2)

where M is the cost matrix of the base policy used after the horizon.
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
• Mixing fluids
• Explicit versus implicit policies
• Resource-aware control
43
Resource-aware control
How to efficiently control several dynamical systems via a wireless network, by deciding how to schedule transmissions (only one user can transmit at a given time)?
[Schematic: a controller connected over the shared network to Actuator 1 and Actuator 2 of inverted pendulums 1 and 2, with angles θ_1, θ_2 and applied torques.]
44
Model
• The two processes are identical and modeled by x^i_{k+1} = A x^i_k + B u^i_k, i ∈ {1, 2}.
• Each system has an identical cost J_i = Σ_{k=0}^{∞} (x^i_k)' Q x^i_k + (u^i_k)' R u^i_k, and we wish to minimize J_1 + J_2.
• If each system could update the control law at every time instant, the optimal control law would take the form u^i_k = K x^i_k (LQR problem; the gain K is the same for both systems since the processes are identical).
• Instead, since each system can only update the control input at mutually exclusive times, we have

u^1_k = K x^1_k if σ_k = 1,   u^1_k = v^1_k if σ_k = 2,
u^2_k = K x^2_k if σ_k = 2,   u^2_k = v^2_k if σ_k = 1,

where v^i_k = u^i_{k-1}, i ∈ {1, 2} (the input is held at its previous value).
45
Model
• If we let x_k = [(x^1_k)' (v^1_k)' (x^2_k)' (v^2_k)']', we can write

model: x_{k+1} = A_{σ_k} x_k,   cost: Σ_{k=0}^{∞} x_k' Q_{σ_k} x_k,

where

A_1 = [A+BK 0 0 0;  K 0 0 0;  0 0 A B;  0 0 0 I],
A_2 = [A B 0 0;  0 I 0 0;  0 0 A+BK 0;  0 0 K 0],

Q_1 = [Q+K'RK 0 0 0;  0 0 0 0;  0 0 Q 0;  0 0 0 R],
Q_2 = [Q 0 0 0;  0 R 0 0;  0 0 Q+K'RK 0;  0 0 0 0].
46
Example
Parameters:

A_c = [0 1; 3 0],  B_c = [0; 1],  A = e^{A_c τ},  B = (∫_0^τ e^{A_c s} ds) B_c,
Q = I,  R = 0.01,  τ = 0.1,  x_0 = [-0.1 0 0 1 0 0]',

base policy: periodic.
Cost of periodic (base): 16.4974. Cost of rollout: 12.4969.
[Plots: first and second components of x^1_k and x^2_k versus time t = kτ, for the base policy and for rollout.]
47
Example
Cost of periodic (base): 16.4974. Cost of rollout: 12.4969.
Given the initial condition, paying more attention to system 2 leads to better results! The rollout schedule starts as

k:   1 2 3 4 5 6 7 8 9 10
σ_k: 2 2 2 1 2 1 2 1 2 1

[Plots: control inputs u^1_k and u^2_k versus time t = kτ, for the base policy and for rollout.]
48
Another resource-aware control problem
How to control the two arms of a SCARA robot with only one amplifier, to track a trajectory for the tip with minimum error?* Only torque τ_1 or τ_2 can be applied at each time instant.
[Schematic of the robot (joint angles θ_1, θ_2, torques τ_1, τ_2) and plots of the applied torques τ_1, τ_2 over time.]
* Control of a SCARA robot with a reduced number of amplifiers, Internship report TU/e, VDL, Yuri Steinbuch, 2015
49
Concluding remarks
To summarize:
• When discretization is not feasible, approximate dynamic programming (e.g. CEC, rollout, MPC) can be a practical way to obtain a policy. However, it is suboptimal.
• One must check case by case whether it can be used; for switched linear systems (SLS) it works very well (although it is suboptimal!).
After this lecture you should be able to:
• Apply ADP techniques to find suboptimal policies for stage decision problems.
• Apply ADP techniques to control switched linear systems and linear quadratic problems with input constraints.

dimension | convexity  | discretization (n ≲ 3) | static optimization | ADP
small     | convex     | yes                    | yes                 | ?
small     | non-convex | yes                    | no                  | ?
large     | convex     | no                     | yes                 | ?
large     | non-convex | no                     | no                  | ?

(the "?" entries reflect that ADP must be assessed case by case; for SLS it performed very well in the examples of this lecture)