optimal control and dynamic programming - tu/e · 2018-05-13 · approximate dynamic programming...
TRANSCRIPT
4SC000 Q2 2017-2018
Optimal Control and Dynamic Programming
Duarte Antunes
Motivation
1
• In the last two lectures we discussed two alternative methods to DP for stage decision problems: discretization and static optimization.
• Discretization allows us to obtain an optimal policy (approximately) when the dimension of the problem (state and input dimension) is small, while static optimization allows us to obtain an optimal path when the problem is convex.
dimension | convexity  | discretization (n ≲ 3) | static optimization
small     | convex     | yes                    | yes
small     | non-convex | yes                    | no
large     | convex     | no                     | yes
large     | non-convex | no                     | no

(DP applies in all cases in principle, but computing the costs-to-go exactly is hard in general.)
In this lecture:
• We show how to obtain optimal policies using static optimization and certainty equivalent control (Lec. 3, slide 33), applying this to solve linear quadratic control with input constraints.
• We discuss related approximate dynamic programming techniques, such as MPC and rollout, applying them to a non-convex, large-dimensional problem (control of switched linear systems).
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
Challenge
2
Iterating the dynamic programming algorithm

J_k(x_k) = min_{u_k} [ g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) ]

for stage decision problems is hard, since it is hard to compute the costs-to-go (see e.g. slide 11 of lecture 5): for each state x_k one needs to minimize g_k(x_k, u_k) + J_{k+1}(f_k(x_k, u_k)) over u_k, but J_{k+1} is typically unknown (it is known only at the terminal stage), so it is hard to obtain an expression for the cost-to-go.
• Moreover:
- discretization is only possible if the state and input spaces are not large;
- static optimization only assures optimality for convex problems, and only allows computing optimal paths.
• How, then, to obtain a (sub)optimal policy?
Idea I - Certainty equivalent control
3
Compute an optimal path online starting from the current state x_k and apply the first decision. Repeat this procedure at every stage for the current state x_{k+1}; at each stage the optimization runs over the remaining horizon, from the current stage k up to h.
Idea II - MPC
4
Similar to Idea I, but compute decisions only over a (short) horizon H: compute an optimal path over the horizon {k, ..., k+H} starting from the current state x_k and apply the first decision. Repeat this procedure in a receding/rolling-horizon way from x_{k+1}.
• The optimization problem to solve at each stage is then much simpler, which makes the algorithm feasible to run online. This is the fundamental idea of Model Predictive Control (MPC).
• Note that this is (in general) not optimal, not even for problems without disturbances.
Idea III - Rollout
5
Similar to Ideas I and II, but after the optimization horizon a base policy is used. Apply the first decision of the optimization procedure and repeat this procedure in a receding/rolling-horizon way from x_{k+1}.
• The optimization problem to solve at each stage is then much simpler, which makes the algorithm feasible to run online. This is the fundamental idea of rollout.
• Note that this is (in general) not optimal, not even for problems without disturbances.
Approximate dynamic programming
Approximate the cost-to-go J_{k+1} by some J̃_{k+1}, known by construction, and use the solution of

min_{u_k} [ g_k(x_k, u_k) + J̃_{k+1}(f_k(x_k, u_k)) ]

as the control input u_k* (in general different from the optimal control input!). For each state x_k we just need to minimize this function of u_k and compute one action; the true cost-to-go J_{k+1} is typically unknown, whereas J̃_{k+1} is known by construction.
More general than the previous ideas!
6
Discussion
7
• Ideas I, II, III can be seen as special cases of approximate dynamic programming.
• Idea I: for problems with disturbances, the true cost (which requires computing expected values) is approximated by a deterministic cost. For problems without disturbances, Idea I yields the same policy as DP.
• Idea II: the approximation is the deterministic cost over a short horizon. This is an approximation even for problems without disturbances.
• Idea III: the approximation is the cost of the base policy when H = 1, and it is the cost over a short horizon plus the cost of the base policy if the horizon is larger than one. Again, this is an approximation even for problems without disturbances.
• Although we mainly consider finite-horizon costs, the ideas extend naturally to infinite-horizon costs.
We formalize these statements next.
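All three ideas instantiate the same one-step-lookahead template: minimize the stage cost plus some surrogate for the cost-to-go. A minimal Python sketch (the scalar dynamics, costs, and surrogate J_tilde below are toy choices, not taken from the lecture):

```python
def lookahead_policy(x, actions, f, g, J_tilde):
    """Pick the action minimizing g(x, u) + J_tilde(f(x, u))."""
    return min(actions, key=lambda u: g(x, u) + J_tilde(f(x, u)))

# Toy scalar system x_{k+1} = x_k + u_k with stage cost x^2 + u^2 and an
# assumed cost-to-go surrogate J_tilde(x) = 2 x^2 (a guess, not the true
# cost-to-go).
f = lambda x, u: x + u
g = lambda x, u: x**2 + u**2
J_tilde = lambda x: 2.0 * x**2

actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
u0 = lookahead_policy(1.0, actions, f, g, J_tilde)
```

CEC, MPC, and rollout differ only in how J_tilde is constructed (disturbance-free tail cost, truncated horizon, or base-policy cost).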
Certainty equivalent control
8
Certainty equivalent control is the following implicit policy for a stage decision problem:
1. At each stage k (initially k = 0), assuming that the full state x_k is measured, solve the following problem (for initial condition x_k), assuming no disturbances (w_k = 0, w_{k+1} = 0, ...):

min  Σ_{l=k}^{h-1} g_l(x_l, u_l) + g_h(x_h)
s.t. x_{l+1} = f_l(x_l, u_l, 0),  l ∈ {k, k+1, ..., h-1},

and obtain estimates u_k, u_{k+1}, ..., u_{h-1} of the optimal control inputs.

2. Take the first decision u_k and apply it to the process: the system will evolve to x_{k+1} = f_k(x_k, u_k, w_k), in general different from f_k(x_k, u_k, 0) due to the disturbances. Repeat (go to 1.).
Certainty equivalent control
9
For problems with no disturbances, this is equivalent to solving

min_{u_k} g_k(x_k, u_k) + J_{k+1}(x_{k+1})
s.t. x_{l+1} = f_l(x_l, u_l, 0),  l ∈ {k, k+1, ..., h-1},    (1)

where

J_{k+1}(x_{k+1}) = min_{u_{k+1}, ..., u_{h-1}}  Σ_{l=k+1}^{h-1} g_l(x_l, u_l) + g_h(x_h)   s.t. (1), initial condition x_{k+1},

and selecting the first input u_k and applying it.

Note that, for problems without disturbances, the J_k are the optimal costs-to-go and satisfy the DP equation

J_k(x_k) = min_{u_k} g_k(x_k, u_k) + J_{k+1}(x_{k+1}).

For problems with stochastic disturbances x_{l+1} = f_l(x_l, u_l, w_l), l ∈ {k, ..., h-1}, the J_k are only approximations of the optimal costs-to-go: in the exact DP equation

J_k(x_k) = min_{u_k} g_k(x_k, u_k) + E[J_{k+1}(x_{k+1})],

the term E[J_{k+1}(x_{k+1})] is approximated by the deterministic J_{k+1}(x_{k+1}).
Model Predictive Control
10
• At each time k consider the optimal control problem only over a horizon H:

min_{u_k, u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k}^{k+H-1} g_l(x_l, u_l)
s.t. x_{l+1} = f_l(x_l, u_l),  l ∈ {k, k+1, ..., k+H-1}, initial condition x_k,    (1)

and select the first control input u_k resulting from this optimization.

• Equivalent to min_{u_k} g_k(x_k, u_k) + J̃_{k+1}(x_{k+1}) s.t. (1), where

J̃_{k+1}(x_{k+1}) = min_{u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k+1}^{k+H-1} g_l(x_l, u_l)   s.t. (1), l ∈ {k+1, k+2, ..., k+H-1}.
Model Predictive Control
11
• Note that at the last decision stages, k ∈ {h-H, ..., h-1}, the cost function is slightly different for finite-horizon problems, including also the terminal cost:

min_{u_k, u_{k+1}, ..., u_{h-1}}  Σ_{l=k}^{h-1} g_l(x_l, u_l) + g_h(x_h)   s.t. (1), l ∈ {k, k+1, ..., h-1}.

• There are several variants of Model Predictive Control; in particular, some variants use a terminal constraint (this is useful to prove stability). For example, impose that the state after the horizon must be zero, x_{k+H} = 0 (for finite-horizon problems at the last stages, x_h = 0).
Rollout
12
• At each time k consider the optimal control problem over a horizon H, assuming that after the horizon a base policy u_l = μ_l(x_l) is used:

min_{u_k, u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k}^{k+H-1} g_l(x_l, u_l) + Σ_{l=k+H}^{h-1} g_l(x_l, μ_l(x_l)) + g_h(x_h)
s.t. (1), l ∈ {k, k+1, ..., k+H-1}, initial condition x_k,

and select the first control input u_k resulting from this optimization. Similar to MPC, but with a base policy after the horizon.

• Equivalent to min_{u_k} g_k(x_k, u_k) + J̃_{k+1}(x_{k+1}) s.t. (1), where

J̃_{k+1}(x_{k+1}) = min_{u_{k+1}, ..., u_{k+H-1}}  Σ_{l=k+1}^{k+H-1} g_l(x_l, u_l) + Σ_{l=k+H}^{h-1} g_l(x_l, μ_l(x_l)) + g_h(x_h)
s.t. (1), l ∈ {k+1, k+2, ..., k+H-1}.
Further remarks on ADP
13
• The quality of an approximation is measured by how "good" u_k* is, i.e., how close it is to the optimal decision, and not by how "good" the approximation J̃_{k+1}(x_{k+1}) of J_{k+1}(x_{k+1}) is.
• Decisions only need to be computed (in real time) for the value of the present state (no need to iterate the cost-to-go).
• There are several variants to approximate the costs-to-go.
• Due to the heuristic nature of the approximation, it is very hard to quantify when a specific approximation method is good and to establish formal results.
• For example, we have seen that the optimal policy for the infinite-horizon linear quadratic regulator problem makes the closed loop of a linear system stable. This is typically very hard to establish for approximate methods.
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
14
Dynamic model: x_{k+1} = A x_k + B u_k + w_k, with u_k ∈ R (scalar input, for simplicity) and possible disturbances w_k.
Input constraints: u_k ∈ [-c, c] for all k (given c > 0).
Cost function: Σ_{k=0}^{h-1} (x_k' Q x_k + u_k' R u_k) + x_h' Q_h x_h.
Find a policy π = {μ_0, ..., μ_{h-1}}, u_k = μ_k(x_k), which minimizes the cost function.
Let us focus first on obtaining an optimal path; then we will find a control policy with certainty equivalence and MPC.
Quadratic programming formulation
15
Define

x = [x_1' x_2' ... x_h']',   u = [u_0' u_1' ... u_{h-1}']'.

Dynamic model (equality constraints): x = Ā x_0 + B̄ u, with

Ā = [A; A^2; ...; A^h],

B̄ = [B 0 0 ... 0;
     AB B 0 ... 0;
     A^2 B AB B ... 0;
     A^3 B A^2 B AB ... 0;
     ...;
     A^{h-1} B A^{h-2} B A^{h-3} B ... AB B].

Cost function:

(Ā x_0 + B̄ u)' Q̄ (Ā x_0 + B̄ u) + u' R̄ u,

with the block-diagonal weights

Q̄ = blkdiag(Q, Q, ..., Q, Q_h),   R̄ = blkdiag(R, R, ..., R).

Inequality constraints:

[I; -I] u ≤ c [1; 1; ...; 1].

This is a standard quadratic programming problem (see, e.g., Convex Optimization, Stephen Boyd and Lieven Vandenberghe), implemented in Matlab with quadprog.m. Convex problem!
This already allows us to obtain an optimal path.
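The same construction can be sketched outside Matlab. Below is a Python sketch that builds the stacked matrices and solves the box-constrained QP by projected gradient descent instead of quadprog (an assumed substitute solver; Q = Q_h = I, and the horizon, weight R, step size, and iteration count are illustrative choices, not the lecture's parameters):

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def build_stacked(A, B, h):
    """Build Abar (hn x n) and Bbar (hn x hm) with x = Abar*x0 + Bbar*u."""
    n, m = len(A), len(B[0])
    powers, Ap = [], [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(h):
        Ap = matmul(A, Ap)
        powers.append(Ap)                  # powers[i] = A^(i+1)
    Abar = [row for P in powers for row in P]
    Bbar = [[0.0] * (h * m) for _ in range(h * n)]
    for i in range(h):                     # block row i gives x_{i+1}
        for j in range(i + 1):             # block (i, j) = A^(i-j) * B
            blk = B if i == j else matmul(powers[i - j - 1], B)
            for r in range(n):
                for s in range(m):
                    Bbar[i * n + r][j * m + s] = blk[r][s]
    return Abar, Bbar

# Double integrator discretized with sampling time tau; horizon and R are
# illustrative (the example slide uses h = 50, R = 1).
tau = 0.2
A = [[1.0, tau], [0.0, 1.0]]
B = [[tau * tau / 2.0], [tau]]
h, c, R = 10, 0.25, 0.01
x0 = [1.0, 0.0]

Abar, Bbar = build_stacked(A, B, h)
d = matvec(Abar, x0)                       # drift term Abar*x0
u = [0.0] * h
for _ in range(2000):                      # projected gradient, Q = I
    x = [d[i] + sum(Bbar[i][j] * u[j] for j in range(h)) for i in range(len(d))]
    grad = [2.0 * (sum(Bbar[i][j] * x[i] for i in range(len(x))) + R * u[j])
            for j in range(h)]
    u = [min(c, max(-c, u[j] - 0.02 * grad[j])) for j in range(h)]
```

Projected gradient is used here only to keep the sketch dependency-free; any QP solver (e.g. quadprog in Matlab) is the practical choice.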
Example
16
Double integrator (model: slide 20, II_1), with state x = [y v]' (position y, velocity v).
Parameters: x_0 = [1 0]', τ = 0.2, Q = Q_h = I, R = 1, h = 50, and input constraints u_k ∈ [-L, L], L = 0.25.
• Optimal path, no constraints (slide 30, Lec 5): run the Matlab code on slides 25-28 with parameters optpolicy=1, fdisturbances=0.
• Optimal path with constraints: optpolicy=2, fdisturbances=0 (or optpolicy=3, fdisturbances=0!).
[Plots of y(t), v(t), and u(t) for both cases.]
Example
17
What if there are disturbances? The behaviour is far from the desired one! We want an optimal policy: CEC! (and MPC)
* run the Matlab code on slides 25-28 with parameters optpolicy=2, fdisturbances=1
[Plots of y(t), v(t), and u(t) under disturbances.]
Certainty equivalent control
18
At each stage k define

x = [x_{k+1}' x_{k+2}' ... x_h']',   u = [u_k' u_{k+1}' ... u_{h-1}']'.

Dynamic model (equality constraints): x = Ā x_k + B̄ u, now with

Ā = [A; A^2; ...; A^{h-k}],

B̄ = [B 0 0 ... 0;
     AB B 0 ... 0;
     A^2 B AB B ... 0;
     A^3 B A^2 B AB ... 0;
     ...;
     A^{h-1-k} B A^{h-2-k} B A^{h-3-k} B ... AB B].

Cost function:

Σ_{l=k}^{h-1} (x_l' Q x_l + u_l' R u_l) + x_h' Q_h x_h = (Ā x_k + B̄ u)' Q̄ (Ā x_k + B̄ u) + u' R̄ u   (up to the constant x_k' Q x_k),

with Q̄ = blkdiag(Q, ..., Q, Q_h), R̄ = blkdiag(R, ..., R).

Inequality constraints: [I; -I] u ≤ c [1; ...; 1].

Take the first component of u, that is, u_k, and apply it to the plant.
Example
19
Considering disturbances affecting the system, x_{k+1} = A x_k + B u_k + w_k (see the code on slide 25 for details on the disturbances), and applying certainty equivalent control, we obtain the results below.
* run the Matlab code on slides 25-28 with parameters optpolicy=3, fdisturbances=1
[Plots of y(t), v(t), and u(t) under CEC.]
Discussion
• Although quadprog.m is quite efficient, for problems with a high sampling rate or a long time horizon of interest it might not be possible to run quadprog.m online; this motivates MPC.
• Another motivation for MPC is that the model is typically not extremely accurate, and considering a short horizon might even lead to better results.
• We shall try MPC imposing that the final state after the prediction horizon is zero. See Bertsekas' book, Ch. 6, for the close connection with rollout (after the horizon, the base policy is to apply zero control input!).
• CEC allowed us to transform an algorithm that computes optimal paths (using quadprog.m) into an algorithm that computes optimal policies.
• The fact that it works so well follows from the fact that the problem is convex: a problem with inequality constraints is convex if the cost is convex and the set of feasible solutions is convex, which can be shown to be the case for this problem.
20
MPC
21
At each stage k define

x = [x_{k+1}' x_{k+2}' ... x_{k+H-1}']',   u = [u_k' u_{k+1}' ... u_{k+H-1}']'.

Dynamic model (equality constraints): x = Ā x_k + B̄ u, now with

Ā = [A; A^2; ...; A^{H-1}],

B̄ = [B 0 0 ... 0;
     AB B 0 ... 0;
     A^2 B AB B ... 0;
     ...;
     A^{H-2} B A^{H-3} B ... B 0]

(the last input u_{k+H-1} does not affect the stacked state, but is still penalized through R̄).

Cost function:

Σ_{l=k}^{k+H-1} (x_l' Q x_l + u_l' R u_l) = (Ā x_k + B̄ u)' Q̄ (Ā x_k + B̄ u) + u' R̄ u   (up to the constant x_k' Q x_k),

with Q̄ = blkdiag(Q, ..., Q), R̄ = blkdiag(R, ..., R).

Inequality constraints: [I; -I] u ≤ c [1; ...; 1].

Take the first component of u, that is, u_k, and apply it to the plant. (In the code on slide 28, the variant with terminal constraint x_{k+H} = 0 imposes this as an additional equality constraint in quadprog.)
Example
22
First iteration (no disturbances): the optimization predicts a state and input trajectory over the horizon. Applying the first control input u_0 = -0.25 leads to the state x_1 = [0.995 -0.05]', and the process is repeated.
* run the Matlab code on slides 25-28 with parameters optpolicy=4, fdisturbances=0
[Plots of the predicted y(t), v(t), and u(t).]
Example
23
End result: note that it is different from the one predicted at iteration 0, because the state and input predictions change over time.
* run the Matlab code on slides 25-28 with parameters optpolicy=5, fdisturbances=0
[Plots of y(t), v(t), and u(t).]
Example
24
Results with disturbances.
* run the Matlab code on slides 25-28 with parameters optpolicy=5, fdisturbances=1
[Plots of y(t), v(t), and u(t).]
Matlab code
25
% Model definition
optpolicy = 1; fdisturbances = 0;
% double integrator model
Ac = [0 1; 0 0]; Bc = [0; 1]; tau = 0.2;
Q = [1 0; 0 1]; S = [0; 0]; R = 0.01; QT = [1 0; 0 1];
n = 2; m = 1; x0 = [1; 0]; c = 0.25;
sigmaw = 0.01;  % disturbance level
H = 25;         % prediction horizon for mpc
h = 50;         % simulation horizon
sysd = c2d(ss(Ac,Bc,zeros(1,n),0),tau);
A = sysd.a; B = sysd.b; C = sysd.c;

% Preliminaries for the policies
switch optpolicy
    case 1
        P{h+1} = QT;
        for k = h:-1:1  % Riccati equations
            P{k} = A'*P{k+1}*A + Q - ...
                (S+A'*P{k+1}*B)*pinv(R+B'*P{k+1}*B)*(S'+B'*P{k+1}*A);
            K{k} = -pinv(R+B'*P{k+1}*B)*(S'+B'*P{k+1}*A);
        end
    case 2
        U2 = quadconstrainedcontrol(A,B,Q,R,c,h,x0,1);
    case 4
        U3 = [quadconstrainedcontrol(A,B,Q,R,c,H,x0,2); zeros(h-H,1)];
end
Matlab code
26
% Simulate system
t = tau*(0:1:h); x = zeros(n,h); u = zeros(1,h);
x(:,1) = x0;
randn('seed',15);  % used for results in slides
if fdisturbances == 1
    gainnodisturbances = 1;
else
    gainnodisturbances = 0;
end
cost = 0;
for k = 1:h
    switch optpolicy
        case 1  % LQR
            u(:,k) = K{k}*x(:,k);
        case 2  % CEC no disturbances
            u(:,k) = U2(k);
        case 3  % CEC
            [u_] = quadconstrainedcontrol(A,B,Q,R,c,h+1-k,x(:,k),1);
            u(:,k) = u_(1);
        case 4  % MPC no disturbances
            u(:,k) = U3(k);
        case 5  % MPC
            if k <= h-H-1
                [u_] = quadconstrainedcontrol(A,B,Q,R,c,H,x(:,k),2);
            else
                [u_] = quadconstrainedcontrol(A,B,Q,R,c,h+1-k,x(:,k),1);
            end
            u(:,k) = u_(1);
    end
    w(:,k) = B*sigmaw*randn(1)*gainnodisturbances;
    x(:,k+1) = A*x(:,k) + B*u(:,k) + w(:,k);
    cost = cost + x(:,k)'*Q*x(:,k) + u(:,k)'*R*u(:,k);
end
v = x(1,:);
Matlab code
27
% Continuous-time simulation and plots
N = 1000; ts = tau/N; nl = h*N;
uc = kron(u,ones(1,N));
sysd = c2d(ss(Ac,Bc,zeros(1,size(Ac,1)),zeros(1,size(Bc,2))),ts);
Ad = sysd.a; Bd = sysd.b;
xc = zeros(n,nl+1); xc(:,1) = x(:,1);
for k = 1:h
    xc(:,(k-1)*N+1) = x(:,k);
    for l = 1:N
        xc(:,(k-1)*N+l+1) = Ad*xc(:,(k-1)*N+l) + Bd*u(:,k);
    end
end
tc = ts*(0:nl);

figure(1)
subplot(1,3,1)
plot(tc,xc(1,:))
set(gca,'Fontsize',18); xlabel('t'); ylabel('y(t)'); grid on
axis([0 h*tau -0.2 1])
subplot(1,3,2)
plot(tc(1:end),xc(2,:))
set(gca,'Fontsize',18); xlabel('t'); ylabel('v(t)'); grid on
subplot(1,3,3)
plot(tc(1:end-1),uc)
set(gca,'Fontsize',18); xlabel('t'); ylabel('u(t)'); grid on
Matlab code
28
% Key function
function [u] = quadconstrainedcontrol(A,B,Q,R,L,h,x0,opt)
% opt = 1: CEC, opt = 2: MPC
n = size(A,1); m = size(B,2);
% define the stacked matrices Abar, Bbar, Qbar, Rbar
Abar = zeros(n*h,n); Bbar = zeros(n*h,m*h);
Qbar = zeros(n*h,n*h); Rbar = zeros(m*h,m*h);
for i = 1:h
    Abar(1+(i-1)*n:n*i,:) = A^i;
    for j = 1:i
        Bbar(1+(i-1)*n:n*i,1+(j-1)*m:m*j) = A^(i-j)*B;
    end
    Qbar(1+(i-1)*n:n*i,1+(i-1)*n:n*i) = Q;
    Rbar(1+(i-1)*m:m*i,1+(i-1)*m:m*i) = R;
end
switch opt
    case 1  % CEC
        u = quadprog(Rbar+Bbar'*Qbar*Bbar,((Bbar'*Qbar*Abar)*x0)', ...
            [eye(m*h);-eye(m*h)],[L*ones(m*h,1);L*ones(m*h,1)],[],[]);
    case 2  % MPC (terminal constraint x_{k+H} = 0 as an equality constraint)
        u = quadprog(Rbar+Bbar'*Qbar*Bbar,((Bbar'*Qbar*Abar)*x0)', ...
            [eye(m*h);-eye(m*h)],[L*ones(m*h,1);L*ones(m*h,1)], ...
            Bbar(end-n+1:end,:),-A^h*x0);
end
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
Switched linear systems
29
Dynamic model: x_{k+1} = A_{σ_k} x_k, with x_k ∈ R^n and control input u_k = σ_k ∈ {1, ..., m}.
Cost function: Σ_{k=0}^{h-1} x_k' Q_{σ_k} x_k + x_h' Q_h x_h.
Goal: find an optimal path (or policy) for the control input for a given initial condition x_0.
• Applying the dynamic programming algorithm would lead to a combinatorial problem.
• Let us apply approximate dynamic programming.
CEC
30
1. At step k compute all the possible control sequences (σ_k, σ_{k+1}, σ_{k+2}, ..., σ_{h-1}).
2. Compute the cost Σ_{l=k}^{h-1} x_l' Q_{σ_l} x_l + x_h' Q_h x_h for each of these control sequences, and pick a sequence (σ̄_k, σ̄_{k+1}, ..., σ̄_{h-1}) with minimum cost.
3. Apply the first element σ_k = σ̄_k, set k = k + 1, and go again to step 1 if k ≠ h.
Still combinatorial!
MPC
31
1. At step k compute all possible control sequences in the prediction horizon:
(σ_k, σ_{k+1}, σ_{k+2}, ..., σ_{k+H-1}) if 0 ≤ k ≤ h - H, and (σ_k, σ_{k+1}, ..., σ_{h-1}) otherwise.
2. Compute the cost
Σ_{l=k}^{k+H-1} x_l' Q_{σ_l} x_l if 0 ≤ k ≤ h - H, and Σ_{l=k}^{h-1} x_l' Q_{σ_l} x_l + x_h' Q_h x_h otherwise,
for each of these control sequences, and pick a sequence (σ̄_k, σ̄_{k+1}, ..., σ̄_{k+H-1}) (respectively (σ̄_k, ..., σ̄_{h-1})) with minimum cost.
3. Apply the first element σ_k = σ̄_k, set k = k + 1, and go again to step 1 if k ≠ h.
32
Rollout
Pick a base policy (σ̂_0, σ̂_1, ..., σ̂_{h-1}) and start with k = 0, H > 0.
1. At step k compute all the control sequences with σ_k, ..., σ_{k+H-1} free, taking the form
(σ_k, ..., σ_{k+H-1}, σ̂_0, σ̂_1, ..., σ̂_{h-1-k-H}) if 0 ≤ k ≤ h - H, and (σ_k, σ_{k+1}, ..., σ_{h-1}) otherwise.
2. Compute the cost Σ_{l=k}^{h-1} x_l' Q_{σ_l} x_l + x_h' Q_h x_h for each of these control sequences, and pick a sequence (σ̄_k, σ̄_{k+1}, ..., σ̄_{h-1}) with minimum cost.
3. Apply the first element σ_k = σ̄_k, set k = k + 1, and go again to step 1 if k ≠ h.
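A minimal Python sketch of this rollout scheme (the two 2x2 modes, the horizons, and the periodic base policy are illustrative, and the base schedule is indexed here by absolute stage):

```python
from itertools import product

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def quad(Q, v):
    return sum(v[i] * Q[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))

def tail_cost(x, sigmas, A, Q, Qh):
    """Cost of following the mode sequence sigmas from state x."""
    c = 0.0
    for s in sigmas:
        c += quad(Q[s], x)
        x = matvec(A[s], x)
    return c + quad(Qh, x)

def rollout_step(x, k, h, H, A, Q, Qh, base):
    """Enumerate the m^H head sequences, append the base-policy tail,
    and return the first mode of the cheapest sequence."""
    modes = range(len(A))
    best = min(product(modes, repeat=min(H, h - k)),
               key=lambda head: tail_cost(
                   x, list(head) + [base(l) for l in range(k + len(head), h)],
                   A, Q, Qh))
    return best[0]

# Two modes: mode 0 damps the first state, mode 1 damps the second.
A = [[[0.5, 0.0], [0.0, 1.1]], [[1.1, 0.0], [0.0, 0.5]]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Q, Qh = [I2, I2], I2
base = lambda l: l % 2          # periodic base policy 0,1,0,1,...

x, h, H = [1.0, 1.0], 12, 3
cost, sig = 0.0, []
for k in range(h):
    s = rollout_step(x, k, h, H, A, Q, Qh, base)
    sig.append(s)
    cost += quad(Q[s], x)
    x = matvec(A[s], x)
cost += quad(Qh, x)
```

By construction the candidate set at each step contains the base policy's own continuation, so for this deterministic problem the realized rollout cost is never worse than the base policy's cost.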
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
• Mixing fluids
• Explicit versus implicit policies
• Resource-aware control
33
Mixing
How to mix two fluids in minimum time*? Camera images are fed to a control law, which commands the actuators.
Actuation: 4 possible rotations, σ_k ∈ {1, ..., 4}, decided once every sampling interval.
*Mixing and switching, V.S. Dolk MSc thesis
34
Periodic open loop policy
The basic mechanism for effective mixing is repetitive stretching & folding. This basic mechanism can be seen as a periodic open-loop strategy. Can a closed-loop strategy perform better?
35
Model
Divide the domain into N subdomains and denote by x^i_k, i ∈ {1, ..., N}, the concentration of fluid in subdomain i at time kτ. Then the following equation describes how the concentrations evolve as a function of the control action:

x_{k+1} = A_{σ_k} x_k,   σ_k ∈ {1, ..., 4}.
36
Cost function
• The intensity of segregation is defined as e_k' e_k, where

e_k := x_k - c [1 1 ... 1]',

and measures how far the concentration is from homogeneity.
• When homogeneity is achieved, the concentrations in all subdomains are equal to the constant c = (1/N) [1 1 ... 1] x_0, so the error e_k converges to zero.
• We are interested in minimizing the intensity of segregation over every time step: Σ_{k=0}^{h} e_k' e_k.
37
Results
[Figures: mixing patterns under the periodic and the rollout strategies.]
• The periodic sequence consists of alternating a rotation of the outer cylinder and a clockwise rotation of the inner cylinder.
• The rollout strategy consists of optimizing over a horizon of length 4, assuming that afterwards this periodic policy is used.
38
Results
• An MPC approach with horizon H = 2 boils down to picking the decision minimizing the (squared) norm of the error at the next iteration (a minimum-error "greedy" strategy):

argmin_{σ_k} [ e_k' e_k + e_{k+1}' e_{k+1} ] = argmin_{σ_k} [ e_k' e_k + e_k' A_{σ_k}' A_{σ_k} e_k ] = argmin_{σ_k} e_k' A_{σ_k}' A_{σ_k} e_k,

using e_{k+1}' e_{k+1} = e_k' A_{σ_k}' A_{σ_k} e_k.
• In this case, MPC with H = 2 leads to worse results.
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
• Mixing fluids
• Explicit versus implicit policies
• Resource-aware control
39
Introduction
• We show next that, for switched linear systems, we can write explicit rollout and MPC policies.
• Although finite-horizon costs can be considered, this is actually easier if we consider infinite-horizon costs:

x_{k+1} = A_{σ_k} x_k,   cost Σ_{k=0}^{∞} x_k' Q_{σ_k} x_k.

• Note that certainty equivalent control in the case of infinite-horizon switched-system problems would entail computing an infinite number of options and is therefore not viable.
40
Cost of base policy
• Suppose that the base policy corresponds to constant switching, σ_k = j for all k, so that x_{k+1} = A_j x_k. Then it is easy to show that

Σ_{k=l}^{∞} x_k' Q_j x_k = x_l' M_j x_l,

where M_j = Σ_{k=0}^{∞} (A_j')^k Q_j (A_j)^k is the solution to the linear system

A_j' M_j A_j - M_j = -Q_j.

• If the base policy corresponds to periodic switching, a similar expression can be obtained. For example, if m = 2, σ_k ∈ {1, 2}, and

(σ_l, σ_{l+1}, σ_{l+2}, σ_{l+3}, σ_{l+4}, ...) = (1, 2, 1, 2, 1, ...),

then Σ_{k=l}^{∞} x_k' Q_{σ_k} x_k = x_l' M̄_1 x_l, where

A_1' M̄_2 A_1 - M̄_1 = -Q_1,
A_2' M̄_1 A_2 - M̄_2 = -Q_2.
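For a stable A_j, the Lyapunov equation of the constant base policy can be solved by simply truncating the series Σ_k (A_j')^k Q_j (A_j)^k; a dependency-free Python sketch with illustrative 2x2 data:

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(r) for r in zip(*X)]

def lyapunov_series(A, Q, iters=200):
    """Approximate M with A' M A - M = -Q by summing (A')^k Q A^k."""
    M = [row[:] for row in Q]
    term = [row[:] for row in Q]
    At = transpose(A)
    for _ in range(iters):
        term = matmul(At, matmul(term, A))   # next series term (A')^k Q A^k
        M = [[M[i][j] + term[i][j] for j in range(len(M))]
             for i in range(len(M))]
    return M

# Illustrative stable mode (spectral radius 0.9) and weight Q = I.
A = [[0.9, 0.1], [0.0, 0.8]]
Q = [[1.0, 0.0], [0.0, 1.0]]
M = lyapunov_series(A, Q)
```

In practice a direct Lyapunov solver (e.g. dlyap in Matlab) would be used; the truncated series is just the definition of M_j made executable.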
41
Explicit rollout policy
Note that in the prediction horizon there are m^H options to be considered. Suppose that we index all these options by i ∈ {1, 2, 3, ..., m^H}. Then each i corresponds to a unique schedule of the free decisions at each iteration,

(σ_k, σ_{k+1}, ..., σ_{k+H-1}) = (σ^i_0, σ^i_1, σ^i_2, ..., σ^i_{H-1}),

and a unique cost matrix P_i can be obtained, with

x_k' P_i x_k = Σ_{l=k}^{k+H-1} x_l' Q_{σ_l} x_l + Σ_{l=k+H}^{∞} x_l' Q_{σ_l} x_l.

Then an explicit expression for the rollout policy is

σ_k = π(ι(x_k)),   ι(x_k) = argmin_i x_k' P_i x_k,

where π is the map that extracts the first component of the scheduling associated with i: π(i) = σ^i_0.
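Online, the explicit rollout policy only evaluates m^H quadratic forms and takes an argmin; a Python sketch with assumed (illustrative) P_i matrices for m = 2, H = 2:

```python
def quad(P, x):
    return sum(x[i] * P[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

# P[i] is the cost matrix of option i; pi[i] is the first mode of option i.
# These numbers are illustrative placeholders, not computed from a model.
P = [[[2.0, 0.0], [0.0, 5.0]],   # option 0: modes (0, 0)
     [[2.5, 0.0], [0.0, 3.0]],   # option 1: modes (0, 1)
     [[4.0, 0.0], [0.0, 2.5]],   # option 2: modes (1, 0)
     [[5.0, 0.0], [0.0, 2.0]]]   # option 3: modes (1, 1)
pi = [0, 0, 1, 1]

def explicit_rollout(x):
    """sigma_k = pi(argmin_i x' P_i x)"""
    i_star = min(range(len(P)), key=lambda i: quad(P[i], x))
    return pi[i_star]
```

With these placeholder matrices, a state dominated by its first component selects mode 0 and a state dominated by its second component selects mode 1.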
42
Example
If m = 2 and H = 2:

i | (σ^i_0, σ^i_1) | π(i) | P_i
1 | (1, 1)         | 1    | P_1 = Q_1 + A_1' Q_1 A_1 + (A_1 A_1)' M (A_1 A_1)
2 | (1, 2)         | 1    | P_2 = Q_1 + A_1' Q_2 A_1 + (A_2 A_1)' M (A_2 A_1)
3 | (2, 1)         | 2    | P_3 = Q_2 + A_2' Q_1 A_2 + (A_1 A_2)' M (A_1 A_2)
4 | (2, 2)         | 2    | P_4 = Q_2 + A_2' Q_2 A_2 + (A_2 A_2)' M (A_2 A_2)

where M is the cost matrix of the base policy used after the horizon.
Outline
• Approximate dynamic programming
• Linear quadratic control with inequality constraints
• Control of switched systems
• Mixing fluids
• Explicit versus implicit policies
• Resource-aware control
43
Resource-aware control
How to efficiently control several dynamical systems via a wireless network, by deciding how to schedule transmissions (only one user can transmit at a given time)?
[Schematic: a controller connected over the shared network to Actuator 1 and Actuator 2 of inverted pendulums 1 and 2, with angles θ_1, θ_2 and applied torques.]
44
Model
• The two processes are identical and modeled by x^i_{k+1} = A x^i_k + B u^i_k, i ∈ {1, 2}.
• Each system has an identical cost J_i = Σ_{k=0}^{∞} (x^i_k)' Q x^i_k + (u^i_k)' R u^i_k, and we wish to minimize J_1 + J_2.
• If each system could update the control law at every time instant, the optimal control law would take the form u^i_k = K x^i_k (LQR problem; the gain K is the same for both systems since the processes are identical).
• Instead, since each system can only update the control input at mutually exclusive times, we have

u^1_k = K x^1_k if σ_k = 1,   u^1_k = v^1_k if σ_k = 2,
u^2_k = K x^2_k if σ_k = 2,   u^2_k = v^2_k if σ_k = 1,

where v^i_k = u^i_{k-1}, i ∈ {1, 2} (the input is held at its previous value).
45
Model
• If we let x_k = [(x^1_k)' (v^1_k)' (x^2_k)' (v^2_k)']', we can write

model: x_{k+1} = A_{σ_k} x_k,   cost: Σ_{k=0}^{∞} x_k' Q_{σ_k} x_k,

where

A_1 = [A+BK 0 0 0;  K 0 0 0;  0 0 A B;  0 0 0 I],
A_2 = [A B 0 0;  0 I 0 0;  0 0 A+BK 0;  0 0 K 0],

Q_1 = [Q+K'RK 0 0 0;  0 0 0 0;  0 0 Q 0;  0 0 0 R],
Q_2 = [Q 0 0 0;  0 R 0 0;  0 0 Q+K'RK 0;  0 0 0 0].
46
Example
Parameters:

A_c = [0 1; 3 0],  B_c = [0; 1],  A = e^{A_c τ},  B = (∫_0^τ e^{A_c s} ds) B_c,
Q = I,  R = 0.01,  τ = 0.1,  x_0 = [-0.1 0 0 1 0 0]',

base policy: periodic.
Cost of periodic (base): 16.4974. Cost of rollout: 12.4969.
[Plots: first and second components of x^1_k and x^2_k versus time t = kτ, for the base policy and for rollout.]
47
Example
Cost of periodic (base): 16.4974. Cost of rollout: 12.4969.
Given the initial condition, paying more attention to system 2 leads to better results! The rollout schedule starts as

k:   1 2 3 4 5 6 7 8 9 10
σ_k: 2 2 2 1 2 1 2 1 2 1

[Plots: control inputs u^1_k and u^2_k versus time t = kτ, for the base policy and for rollout.]
48
Another resource-aware control problem
How to control the two arms of a SCARA robot with only one amplifier, to track a trajectory for the tip with minimum error?* Only torque τ_1 or τ_2 can be applied at each time instant.
[Schematic of the robot (joint angles θ_1, θ_2, torques τ_1, τ_2) and plots of the applied torques τ_1, τ_2 over time.]
* Control of a SCARA robot with a reduced number of amplifiers, Internship report TU/e, VDL, Yuri Steinbuch, 2015
49
Concluding remarks
To summarize:
• When discretization is not feasible, approximate dynamic programming (e.g. CEC, rollout, MPC) can be a practical way to obtain a policy. However, it is suboptimal.
• One must check case by case whether it can be used; for switched linear systems (SLS) it works very well (although it is suboptimal!).
After this lecture you should be able to:
• Apply ADP techniques to find suboptimal policies for stage decision problems.
• Apply ADP techniques to control switched linear systems and linear quadratic problems with input constraints.

dimension | convexity  | discretization (n ≲ 3) | static optimization | ADP
small     | convex     | yes                    | yes                 | ?
small     | non-convex | yes                    | no                  | ?
large     | convex     | no                     | yes                 | ?
large     | non-convex | no                     | no                  | ?

(the "?" entries reflect that ADP must be assessed case by case; for SLS it performed very well in the examples of this lecture)