optimal adaptive control of nonlinear discrete-time...
TRANSCRIPT
Optimal Adaptive Control of Nonlinear Discrete-time Systems
S. Jagannathan, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, USA
(sarangap@mst.edu)
Outline
Introduction
Background on Optimal Adaptive Control
Optimal Adaptive Control of Unknown Dynamic Systems
Challenges of Value and/or Policy Iteration-based Schemes
Online Optimal Adaptive Control w/o Policy and/or Value Iterations
Online Optimal Tracking
Hardware Implementation
Extensions to Continuous-time
Conclusions
Introduction
Reinforcement learning (RL)/adaptive dynamic programming (ADP) techniques with online approximators such as neural networks (NNs) have been widely employed in the literature in the context of optimal adaptive control (Si, Barto, Powell & Wunsch, 2004)
Policy and/or value iterations are utilized
NN reconstruction errors are ignored
Convergence of the NN implementations is not provided
Full or partial knowledge of the system dynamics is needed, and in some cases a model of the system is needed
The Bellman equation forms the basis for a family of RL/ADP schemes for finding optimal adaptive policies forward in time
RL/ADP-based schemes do not require the system dynamics
Convergence and/or closed-loop stability can be demonstrated by using the temporal difference (TD) error method
RL/ADP-based online methods are preferred over offline-based schemes
Background
F.L. Lewis, D. Vrabie, and V. Syrmos, "Optimal Control", third edition, to appear, John Wiley Press.
Optimal Control of Discrete-time Systems
Consider the discrete-time (DT) nonlinear affine system given by

$x_{k+1} = f(x_k) + g(x_k)u_k$,   (1)

where $x_k \in R^n$ is the system state and $u_k \in R^m$ is the control input.
Value function:

$V^h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i) = \sum_{i=k}^{\infty} \gamma^{i-k} \left( Q(x_i) + u_i^T R u_i \right)$,   (2)

where $0 \le \gamma \le 1$ is a discount factor, $u_k = h(x_k)$ is a feedback control input, and the stage cost is $r(x_k, u_k) = Q(x_k) + u_k^T R u_k$ with $Q(x_k) > 0$, $R > 0$.
Note: Discounting makes the initial time steps more important, and traditional learning techniques usually learn during the first few time steps; hence discounting is not preferred for learning schemes. The value function is the positive definite solution to the Bellman equation that satisfies V(0) = 0.
Optimal Control for DT System
Discrete-time Hamilton-Jacobi-Bellman (HJB) equation:

$V^*(x_k) = \min_{h(\cdot)} \left( r(x_k, h(x_k)) + V^*(x_{k+1}) \right)$.   (3)

Optimal control policy:

$h^*(x_k) = \arg\min_{h(\cdot)} \left( r(x_k, h(x_k)) + V^*(x_{k+1}) \right) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$.   (4)

Note: To find the optimal control $h^*(x_k)$, the DT HJB equation (3) needs to be solved. However, due to the nonlinear nature of the DT HJB equation, finding its solution is difficult or impossible.
Hamiltonian
Discrete-time Hamiltonian:

$H(x_k, h(x_k), \Delta V_k^h) = r(x_k, h(x_k)) + V^h(x_{k+1}) - V^h(x_k)$,   (5)

where $\Delta V_k^h = V^h(x_{k+1}) - V^h(x_k)$ is the forward difference operator. It is important to note that the Hamiltonian is the temporal difference (TD) error for a prescribed control policy $u_k = h(x_k)$.
Iterative methods will be shown to obtain the value function and optimal control from the DT HJB.
Traditional Optimal Control of Linear Systems
Linear Quadratic Regulator
DT linear time-invariant system dynamics:

$x_{k+1} = A x_k + B u_k$.   (6)

Infinite horizon value function:

$V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right)$.   (7)

Bellman equation:

$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$.   (8)
Bellman Equation for DT LQR: ARE
Value function in quadratic form of the state: $V(x_k) = x_k^T P x_k$ with some P.
DT Hamiltonian:

$H(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k) - x_k^T P x_k$.   (9)

Use the stationarity condition $\partial H(x_k, u_k)/\partial u_k = 0$ to obtain the optimal control

$u_k = -K x_k = -(B^T P B + R)^{-1} B^T P A x_k$.   (10)

Substitute (10) into (9) to obtain the DT Algebraic Riccati Equation (DARE):

$A^T P A - P + Q - A^T P B (B^T P B + R)^{-1} B^T P A = 0$.   (11)

The DT ARE is exactly the Bellman optimality equation for the DT LQR.
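As a concrete illustration, the following is a minimal sketch (Python/NumPy, not part of the original slides) that solves the DARE (11) by fixed-point iteration of the Riccati recursion and then recovers the optimal gain from (10); the system matrices are placeholders in the spirit of the linear examples discussed later.

```python
import numpy as np

def solve_dare(A, B, Q, R, tol=1e-10, max_iter=10_000):
    """Solve the DT algebraic Riccati equation (11) by iterating
    P <- A'PA + Q - A'PB (B'PB + R)^{-1} B'PA to a fixed point."""
    P = Q.copy()
    for _ in range(max_iter):
        APB = A.T @ P @ B
        P_new = A.T @ P @ A + Q - APB @ np.linalg.solve(B.T @ P @ B + R, APB.T)
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new
    return P

# Illustrative system (assumed values):
A = np.array([[0.0, 0.8], [0.8, 1.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = solve_dare(A, B, Q, R)
K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)  # optimal gain, eq. (10)
print("P =", P, "\nK =", K)                        # control law: u_k = -K x_k
```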
Value Function Approximation (VFA)
VFA is the key for implementing PI and VI online for dynamical systems with infinite state and action spaces.
Linear system: Since the value function is quadratic in the state in the LQR case, it can be written as

$V(x_k) = x_k^T P x_k = \mathrm{vec}(P)^T (x_k \otimes x_k) = \bar{p}^T \bar{x}_k$,   (12)

where $\bar{x}_k$ collects the quadratic monomials of $x_k$.
Nonlinear system: According to the Weierstrass Higher-Order Approximation Theorem, there exists a dense basis set $\{\varphi_i(x)\}$ such that

$V(x) = \sum_{i=1}^{\infty} w_i \varphi_i(x) = \sum_{i=1}^{L} w_i \varphi_i(x) + \sum_{i=L+1}^{\infty} w_i \varphi_i(x) = W^T \phi_L(x) + \varepsilon_L(x)$,   (13)

with basis vector $\phi_L(x) = [\varphi_1(x)\ \varphi_2(x)\ \cdots\ \varphi_L(x)]^T : R^n \to R^L$, and $\varepsilon_L(x)$ converges uniformly to zero as $L \to \infty$.
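A small sketch (Python/NumPy, illustrative; the helper names are not from the slides) of the vectorization in (12): once the state is lifted to the quadratic basis x̄, the quadratic value function is linear in the parameter vector p̄, which is what makes least-squares learning of V possible.

```python
import numpy as np

def quad_basis(x):
    """Independent entries of x ⊗ x: [x1^2, x1*x2, ..., xn^2] (eq. (12))."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

# For V(x) = x'Px the parameters are p̄_ii = P_ii and p̄_ij = 2 P_ij (i < j),
# since each off-diagonal term appears twice in x'Px.
P = np.array([[1.5, 1.0],
              [1.0, 3.8]])                           # example kernel matrix
p_bar = np.array([P[0, 0], 2 * P[0, 1], P[1, 1]])
x = np.array([0.3, -0.2])
assert np.isclose(p_bar @ quad_basis(x), x @ P @ x)  # V(x) = p̄'x̄ = x'Px
```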
Optimal Adaptive Control of Linear System
Example: Solving the DT LQR online by using PI and VI
System dynamics:

$x_{k+1} = A x_k + B u_k$.   (14)

Value function:

$V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right) = x_k^T P x_k$.   (15)

Optimal control:

$u_k = -K x_k = -(B^T P B + R)^{-1} B^T P A x_k$.   (16)

Note: Solving for the optimal control needs $P$, which is the solution of the Riccati equation. Next, PI and VI will be used to solve for $P$ online.
Optimal Adaptive Control via PI
PI for DT LQR:
Initialization: select any admissible control $h_0(x_k)$. Set $j = 0$ and iterate until convergence.
Policy evaluation step:

$p_{j+1}^T (\bar{x}_k - \bar{x}_{k+1}) = r(x_k, h_j(x_k)) = x_k^T (Q + K_j^T R K_j) x_k$,   (17)

where $p_{j+1} = \mathrm{vec}(P_{j+1})$.
Policy improvement step:

$u_k = h_{j+1}(x_k) = -(B^T P_{j+1} B + R)^{-1} B^T P_{j+1} A x_k$.   (18)
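A hedged sketch of this PI loop for the DT LQR (Python/NumPy, illustrative only): the policy-evaluation step (17) is solved by least squares from simulated one-step transitions, while the improvement step (18) still uses A and B, as noted on the VI slide. The initial gain is the stabilizing gain used in the later simulation example.

```python
import numpy as np

def quad_basis(x):
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])   # x̄ for n = 2

def p_to_P(p):
    # invert the vectorization: the off-diagonal weight is split in half
    return np.array([[p[0], p[1] / 2], [p[1] / 2, p[2]]])

def policy_iteration(A, B, Q, R, K0, n_iter=20, n_samples=30):
    K = K0
    for _ in range(n_iter):
        # Policy evaluation (17): solve p'(x̄_k - x̄_{k+1}) = x_k'(Q + K'RK)x_k
        # by least squares over sampled closed-loop transitions.
        Phi, y = [], []
        for _ in range(n_samples):
            x = np.random.uniform(-1, 1, 2)
            xn = (A - B @ K) @ x                      # one step under u = -Kx
            Phi.append(quad_basis(x) - quad_basis(xn))
            y.append(x @ (Q + K.T @ R @ K) @ x)
        p, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
        P = p_to_P(p)
        # Policy improvement (18): this step requires A and B.
        K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    return P, K

A = np.array([[0.0, 0.8], [0.8, 1.8]]); B = np.array([[0.0], [1.0]])
P, K = policy_iteration(A, B, np.eye(2), np.eye(1), K0=np.array([[0.5, 1.4]]))
```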
Optimal Adaptive Control via VI
VI for DT LQR:
Initialization: select any control $h_0(x_k)$ (not necessarily admissible). Set $j = 0$ and iterate until convergence.
Value update step:

$p_{j+1}^T \bar{x}_k = r(x_k, h_j(x_k)) + p_j^T \bar{x}_{k+1} = x_k^T (Q + K_j^T R K_j) x_k + p_j^T \bar{x}_{k+1}$,   (19)

where $p_{j+1} = \mathrm{vec}(P_{j+1})$.
Control policy update:

$u_k = h_{j+1}(x_k) = -(B^T P_{j+1} B + R)^{-1} B^T P_{j+1} A x_k$.   (20)

Note: The control policy updates (18), (20) require the system dynamics $A$, $B$, which may not be known beforehand.
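The VI counterpart, reusing quad_basis and p_to_P from the PI sketch above: the value update (19) keeps the previous parameter vector p_j on the right-hand side, so no equation is solved implicitly and no initial admissible control is needed, at the price of slower convergence.

```python
def value_iteration(A, B, Q, R, n_iter=80, n_samples=30):
    K, p = np.zeros((1, 2)), np.zeros(3)    # arbitrary (not admissible) start
    for _ in range(n_iter):
        Phi, y = [], []
        for _ in range(n_samples):
            x = np.random.uniform(-1, 1, 2)
            xn = (A - B @ K) @ x
            # Value update (19): the target uses the *previous* parameters p_j.
            Phi.append(quad_basis(x))
            y.append(x @ (Q + K.T @ R @ K) @ x + p @ quad_basis(xn))
        p, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
        P = p_to_P(p)
        K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # update (20)
    return P, K
```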
Policy (PI) and Value Iteration(VI) Flowchart
Note: Differences between PI and VI: 1) VI does not require an initial admissible control; 2) the value update (19) is different from the policy evaluation step (17).
Optimal Adaptive Control via Policy Approximation
Second 'Actor' Neural Network: To relax the requirement of full knowledge of the system dynamics, a second neural network is introduced for the control policy.
Actor NN:

$u_k = h(x_k) = U^T \sigma(x_k)$,   (21)

where $\sigma(x) : R^n \to R^M$ is the activation or basis function and $U \in R^{M \times m}$ holds the weights or unknown parameters.
An update law is required for the NN weights.
Note: After convergence of the critic NN parameters to $W_{j+1}$ in PI or VI, the actor NN is used to update the control policy based on $W_{j+1}$. The actor NN avoids the need for the state drift dynamics $f(\cdot)$ (or $A$ in a linear system).
Actor-Critic Implementation of TD Learning
Fig. 1. Temporal difference learning using policy iteration.
The value function is estimated by observing the current state, the next state, and the cost incurred. Based on the new value, the action is updated.
Nonlinear Optimal Control via Offline
Discrete-time unknown nonlinear affine system:

$x_{k+1} = f(x_k) + g(x_k)u(x_k) = f_k + g_k u_k$.   (1)

Infinite horizon cost function:

$V(x_k) = \sum_{n=k}^{\infty} \left( Q(x_n) + u_n^T R u_n \right) = Q(x_k) + u_k^T R u_k + V(x_{k+1})$,   (2)

where $Q(x_k) > 0$ and $R \in R^{m \times m} > 0$. The control input must be admissible (Chen & Jagannathan, 2008):
$u_k$ is continuous on a compact set;
$u_k$ stabilizes (1) and $u(0) = 0$;
$V(x_0)$ is finite for all $x_0$.

T. Dierks, B. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence", Neural Networks, vol. 22, pp. 851-860, 2009.
Nonlinear Optimal Control (cont.)
The optimal control input for (1) is found by solving $\partial V(x_k)/\partial u_k = 0$ and is revealed to be

$u^*(x_k) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial V(x_{k+1})}{\partial x_{k+1}}$.   (3)

The optimal control (3) is generally not implementable for unknown nonlinear systems due to $g(x_k)$ and $x_{k+1}$.
This work relaxes the need for explicit knowledge of (1) while ensuring convergence and stability:
Online NN system identification scheme
Offline optimal control using only the learned NN model of the system
Identifier Design
The nonlinear system (1) is rewritten as

$x_{k+1} = f_k + g_k u_k = h(x_k, u_k) = w_s^T \sigma(v_s^T z_k) + \varepsilon(x_k) = w_s^T \sigma_s(z(x_k)) + \varepsilon(x_k)$.   (4)

The proposed nonlinear identifier is written as

$\hat{x}_{k+1} = \hat{w}_{s,k}^T \sigma_s(z_k) + v_k$,   (5)

where

$v_k = \frac{\tilde{x}_k}{\tilde{x}_k^T \tilde{x}_k + C_s}$,   (6)

$\tilde{x}_k = x_k - \hat{x}_k$ is the identification error, $v_k$ is an additional tunable (robust) term, and $C_s > 0$ is a constant.
NN Identifier (cont.)
Assumption: The term $\varepsilon(x_k)$ is assumed to be upper bounded by a function of the estimation error such that

$\varepsilon^T(x_k)\, \varepsilon(x_k) \le \varepsilon_M\, \tilde{x}_k^T \tilde{x}_k$,   (7)

where $\varepsilon_M$ is a bounding constant.
The proposed tuning laws for $\hat{w}_{s,k}$ and $v_k$ are given by

$\hat{w}_{s,k+1} = \hat{w}_{s,k} + \alpha_s \sigma_s(z_k)\, \tilde{x}_{k+1}^T$   (8)

and

$v_{k+1} = \tilde{x}_{k+1} / (\tilde{x}_{k+1}^T \tilde{x}_{k+1} + C_s)$.   (9)

The asymptotic error convergence can be demonstrated using the Lyapunov candidate

$L_k = \tilde{x}_k^T \tilde{x}_k + \frac{1}{\alpha_s} \mathrm{tr}[\tilde{w}_{s,k}^T \tilde{w}_{s,k}]$.   (10)
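A heavily simplified identifier sketch (Python/NumPy): a one-layer random-feature network trained with a gradient-style tuning law in the spirit of (8). The plant, dimensions, and learning gain are placeholders, and the robust term v_k of (6) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 2, 1, 20                   # state, input, neuron counts (assumed)
V = rng.normal(size=(n + m, L))      # fixed inner weights v_s
w_hat = np.zeros((L, n))             # tunable outer weights ŵ_s
alpha_s = 0.05                       # learning gain (assumed)

def sigma_s(z):
    return np.tanh(z @ V)            # basis σ_s(z_k) = σ(v_s' z_k)

def plant(x, u):
    # stand-in unknown dynamics for the demo (not the slides' example)
    return np.array([0.8 * x[1], 0.9 * np.sin(x[0]) + u[0]])

x = np.array([0.5, -0.5])
for k in range(2000):
    u = 0.1 * rng.standard_normal(1)                  # probing input
    z = np.concatenate([x, u])
    x_next = plant(x, u)
    x_tilde = x_next - sigma_s(z) @ w_hat             # identification error
    w_hat += alpha_s * np.outer(sigma_s(z), x_tilde)  # tuning law à la (8)
    x = x_next
print(np.linalg.norm(x_tilde))       # residual identification error
```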
Estimation of Control Coefficient Matrix
Using the online system identification scheme (5), an estimate for $g(x_k)$ is given by

$\hat{g}(x_k) = \hat{w}_{s,k}^T \sigma'(z_k)\, v_s^T (\partial z_k / \partial u_k)$,   (11)

where $\sigma'(z_k)$ is the derivative of the activation function with respect to $z_k$ and $\partial z_k / \partial u_k$ is a constant matrix.
After a sufficiently long online learning session, the robust term (6) approaches zero, and the dynamic system can be rewritten as

$x_{k+1} = f(x_k) + g(x_k) u_k \approx \hat{x}_{k+1} = \hat{w}_{s,k}^T \sigma_s(z_k)$.   (12)

Then observe $\partial x_{k+1} / \partial u_k = g(x_k) \approx \partial \hat{x}_{k+1} / \partial u_k$ to yield the relationship shown in (11).
Offline Heuristic Dynamic Programming
Only the NN model learned online is used for the offline training
1) Initialize the cost function $V_0(x_k) = 0$ and select the initial stabilizing control input:

$u_0(x_k) = \arg\min_{u_k} \left( Q(x_k) + u_k^T R u_k + V_0(x_{k+1}) \right)$.

2) Compute the value function:

$V_1(x_k) = Q(x_k) + u_0^T R u_0 + V_0(\hat{w}_{s,k}^T \sigma_s(z_k))$.

3) Optimization is achieved by iterating between a sequence of control policies and value functions (proof of convergence shown in Al-Tamimi, Lewis, & Abu-Khalaf, 2008):

$u_i(x_k) = \arg\min_{u_k} \left( Q(x_k) + u_k^T R u_k + V_i(\hat{x}_{k+1}) \right)$,
$V_{i+1}(x_k) = Q(x_k) + u_i^T R u_i + V_i(\hat{w}_{s,k}^T \sigma_s(z_k))$.
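A minimal tabular analogue of this HDP loop (Python/NumPy; the 1-D toy system and grids are assumed for illustration). In the slides, the value and action are NN-parameterized and the next state comes from the NN identifier (12) rather than a known model.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 101)          # mesh of training states
us = np.linspace(-2.0, 2.0, 81)           # candidate control actions
V = np.zeros_like(xs)                     # V_0 = 0

def model(x, u):
    return 0.9 * x + 0.1 * np.sin(x) + 0.1 * u   # stand-in for the NN model

def stage_cost(x, u):
    return x**2 + 0.1 * u**2              # Q(x) + u'Ru

for i in range(200):                      # HDP iterations
    V_new = np.empty_like(V)
    for j, x in enumerate(xs):
        # greedy minimization over the action grid using the current V_i,
        # then the value update V_{i+1}(x) = min_u [Q(x) + u'Ru + V_i(x')]
        costs = stage_cost(x, us) + np.interp(model(x, us), xs, V)
        V_new[j] = costs.min()
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
```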
NN Implementation of Offline HDP
The value function (2) and optimal control (3) are assumed to have NN representations

$V_i(x_k) = W_{Vi}^T \phi_{Vi}(x_k)$   (13)

$u_i(x_k) = W_{Ai}^T \phi_{Ai}(x_k)$.   (14)

The NN approximations of (13) and (14) are written as

$\hat{V}_i(x_k) = \hat{W}_{Vi}^T \phi(x_k)$   (15)

$\hat{u}_i(x_k) = \hat{W}_{Ai}^T \phi(x_k)$.   (16)

Critic weights are tuned at each iteration using the method of weighted residuals, yielding the least-squares solution

$\hat{W}_{Vi+1} = \left( \int \phi(x_k) \phi^T(x_k)\, dx_k \right)^{-1} \int \phi(x_k)\, d(x_k, \hat{x}_{k+1}, \hat{W}_{Vi}, \hat{W}_{Ai})\, dx_k$,   (17)

where $d(\cdot)$ is the HDP target $Q(x_k) + \hat{u}_i^T R \hat{u}_i + \hat{V}_i(\hat{x}_{k+1})$.
Tuning the Action Network
Define the action error to be the difference between the estimated control input (14) and the estimated optimal control:

$e_{ai,j} = \hat{u}_{i,j,k} - \hat{u}^*_{i,k} = \hat{W}_{Ai,j}^T \phi(x_k) + \frac{1}{2} R^{-1} \hat{g}^T(x_k) \frac{\partial \hat{V}_i(\hat{x}_{j,k+1})}{\partial \hat{x}_{j,k+1}}$,   (18)

where the gradient term is evaluated through the NN identifier (5).
Action network weight update:

$\hat{W}_{Ai,j+1} = \hat{W}_{Ai,j} - \alpha_j \frac{\phi(x_k)\, e_{ai,j}^T}{\phi^T(x_k) \phi(x_k) + 1}$.   (19)

Proof of action network convergence can be shown.
Remarks
The results indicate that only a nearly optimal control law is possible when the NN reconstruction errors are explicitly considered.
When the NN reconstruction errors are ignored, the NN estimate of the optimal control law converges to the optimal control uniformly.
Although the NN reconstruction errors are often ignored in the literature, proofs of convergence for the NN implementations of HDP are rarely studied.
Offline HDP Flowchart
Simulation Results
E l 1 Li S t Example 1: Linear System1 1
12 1
0 3 0 8 00 8 1 8 1
,kk k k
k
x . .x x u
x . .
Online identifier parameters:2 1 0 8 1 8 1,kx . .
,09.0s 0027.0s
Offline training set and learning parameter:s
],5.0,5.0[1 x ],5.0,5.0[2 x 01.0j
Value network activation functions},,,,,,{ 6
2231
41
2221
21 xxxxxxxx
Action network activation functions
DARE solution},,,,{ 5
23121 xxxx
0 691 1 3405[ ] DARE solution 0 691 1 3405k ku [ . . ] x
Linear System Results
Fig. 2: State identification error versus iteration index k.
Fig. 3: State trajectories in the (x1, x2) plane, comparing the DARE and NN solutions.
Nonlinear System Example
Example 2: Nonlinear system

$x_{k+1} = \begin{bmatrix} -\sin(0.5\, x_{2,k})\, x_{1,k} \\ -\cos(1.4\, x_{2,k}) \sin(0.9\, x_{1,k}) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k$.

NN identifier, value, and action network parameters are the same as in Example 1.
The HDP algorithm described in Al-Tamimi, Lewis, and Abu-Khalaf (2008) was also implemented using full knowledge of the nonlinear system dynamics for comparison.
Figs. 4-7: Simulation results for the nonlinear system.
Challenges of PI and VI –based Schemes
Iterative schemes may not be suitable for hardware implementation.
PI- and VI-based schemes may not converge if a suitable number of iterations is not chosen.
In certain instances, a model of the nonlinear system needs to be created before designing the PI and VI schemes.
Effect of Insufficient Iterations using HDP
Consider the discrete-time nonlinear affine system

$x_{k+1} = f(x_k) + g(x_k) u_k$,

where $f(x_k)$ contains the sinusoidal term $\sin(0.5\, x_{1,k} x_{2,k})$ and the linear term $1.5\, x_{2,k}$, and the input gain $g(x_k)$ depends on $x_{2,k}$.
Simulation parameters:
Initial states: $x_0 = [0.5\ \ {-4.5}]^T$, $Q = I_{2 \times 2}$, $R = 1$.
Sampling interval: $T_s = 0.1$ sec.
Effect of Insufficient Iterations
Fig. 8: Regulation errors e1 and e2 when the number of iterations is 1000. Fig. 9: Regulation errors e1 and e2 when the number of iterations is 10.
From these figures it is evident that when the number of iterations is decreased, policy iteration cannot force the regulation errors to converge to zero.
Online Near Optimal Regulation
Optimal control:

$u(x_k) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial J(x_{k+1})}{\partial x_{k+1}}$.   (20)

No offline scheme or internal dynamics $f(x_k)$ are needed.
To begin, the cost function (2) is assumed to have an NN representation written as

$J(x_k) = W_c^T \phi_c(x_k)$.   (21)

The NN approximation of (21) is written as

$\hat{J}(x_k) = \hat{W}_{c,k}^T \phi_c(x_k) = \hat{W}_{c,k}^T \bar{\phi}_k$.   (22)

The cost function error (residual or TD error) dynamics are formed as

$e_{c,k} = r_{k-1} + \hat{W}_{c,k}^T \bar{\phi}_k - \hat{W}_{c,k}^T \bar{\phi}_{k-1}$,
$e_{c,k+1} = r_k + \hat{W}_{c,k+1}^T \bar{\phi}_{k+1} - \hat{W}_{c,k+1}^T \bar{\phi}_k$.   (23)
Auxiliary Critic Error System
Define the auxiliary error vector

$E_{c,k} = Y_{k-1} + \hat{W}_{c,k}^T X_{k-1}$,   (24)

where $Y_{k-1} = [r_{k-1}\ \ r_{k-2}\ \cdots\ r_{k-1-j}]$ and $X_{k-1} = [\Delta\bar{\phi}_{k-1}\ \ \Delta\bar{\phi}_{k-2}\ \cdots\ \Delta\bar{\phi}_{k-1-j}]$.
Auxiliary error dynamics:

$E_{c,k+1} = Y_k + \hat{W}_{c,k+1}^T X_k$.   (25)

The system (25) can be viewed as an affine nonlinear system with control input $\hat{W}_{c,k+1}$.
Critic Weight Update
Therefore, the critic weight update is written as

$\hat{W}_{c,k+1} = X_k (X_k^T X_k)^{-1} (\alpha_c E_{c,k} - Y_k)^T$,   (26)

and (25) becomes

$E_{c,k+1} = \alpha_c E_{c,k}$.   (27)

We can show that $(X_k^T X_k)^{-1}$ exists.
The weight update (26) is comparable to the least-squares updates used in offline training. Instead of summing over a mesh of training points, the update (26) represents a sum over the system's time history stored in $E_{c,k}$. Thus, the update uses data collected in real time instead of data formed offline.
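A simplified sketch of the critic solve (Python/NumPy, names hypothetical): in the spirit of (26), the stage costs and basis-vector differences observed online are stacked and the critic weights obtained by least squares over that time history, instead of over an offline training mesh (the contraction factor α_c is omitted here).

```python
import numpy as np

def critic_update(phi_hist, r_hist):
    """phi_hist: list of basis vectors φ(x_k) along the trajectory;
    r_hist: stage costs r_k between consecutive entries of phi_hist."""
    # Bellman residual r_k + W'φ_{k+1} - W'φ_k = 0  =>  W'(φ_k - φ_{k+1}) = r_k
    X = np.column_stack([phi_hist[i] - phi_hist[i + 1]
                         for i in range(len(phi_hist) - 1)])
    W, *_ = np.linalg.lstsq(X.T, np.asarray(r_hist), rcond=None)
    return W
```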
Action Network
The optimal control (3) has an NN representation written as

$u(x_k) = W_A^T \phi_A(x_k)$.   (28)

The NN approximation of (28) takes the form

$\hat{u}_k = \hat{u}(x_k) = \hat{W}_{A,k}^T \phi_A(x_k)$.   (29)

The optimal control signal error is defined as

$e_{a,k} = \hat{W}_{A,k}^T \phi_A(x_k) + \frac{1}{2} R^{-1} \hat{g}^T(x_k) \left( \frac{\partial \phi_c(\hat{x}_{k+1})}{\partial \hat{x}_{k+1}} \right)^T \hat{W}_{c,k}$.   (30)

Action network weight update:

$\hat{W}_{A,k+1} = \hat{W}_{A,k} - \alpha_a \frac{\phi_A(x_k)\, e_{a,k}^T}{\phi_A^T(x_k) \phi_A(x_k) + 1}$.   (31)

In the absence of NN reconstruction errors, the action and critic errors converge to zero asymptotically and the action and critic NN weights remain bounded; that is, $\hat{u} \to u^*$.
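A one-line sketch of the normalized-gradient update (31) (Python/NumPy; shapes and the gain value are assumptions): the actor weights move against the control error e_{a,k}, with the regressor norm in the denominator keeping the step size bounded.

```python
import numpy as np

def actor_update(W_A, phi_A, e_a, alpha_a=0.1):
    """Normalized gradient step in the spirit of (31)."""
    return W_A - alpha_a * np.outer(phi_A, e_a) / (phi_A @ phi_A + 1.0)

# usage sketch: M basis functions, m control inputs
M, m_ = 8, 1
W_A = np.zeros((M, m_))
W_A = actor_update(W_A, phi_A=np.random.randn(M), e_a=np.array([0.3]))
```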
Offline and Online Schemes: A Comparison
Simulation Results: Example 1
Example 1: Linear system

$x_{k+1} = \begin{bmatrix} 0 & 0.8 \\ 0.8 & 1.8 \end{bmatrix} x_k + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k$.

Results of solving the discrete-time algebraic Riccati equation (DARE), which requires A to be known:

$u_k = -[0.6239\ \ 1.2561]\, x_k$,
$J_k = 1.5063\, x_{1,k}^2 + 2.0098\, x_{1,k} x_{2,k} + 3.7879\, x_{2,k}^2$.

Initial stabilizing control: $u_{0,k} = -[0.5\ \ 1.4]\, x_k$.
Implementation of the proposed optimal regulator does not require the A matrix to be known.
Optimal Regulation Results
Cost approximator activation functions: even polynomial terms of the state up to sixth order, $\{x_1^2, x_1 x_2, x_2^2, x_1^4, x_1^3 x_2, \ldots, x_2^6\}$ (15 terms).
Action network activation functions: odd polynomial terms of the state up to fifth order, $\{x_1, x_2, x_1^3, \ldots, x_2^5\}$.
Tuning constants: $\alpha_c = 10^{-6}$, $\alpha_a = 0.1$.
Final cost estimator weights:

$\hat{W}_c = [1.5071\ \ 2.0097\ \ 3.7886\ \ 0.0082\ \ 0.0015\ \ 0.0025\ \ 0.0030\ \ 0.0014\ \ 0.0020\ \ 0.0000\ \ 0.0000\ \ 0.0008\ \ 0.0003\ \ 0.0009\ \ 0.0002]^T$,

whose first three entries recover the DARE cost $J_k = 1.5063\, x_{1,k}^2 + 2.0098\, x_{1,k} x_{2,k} + 3.7879\, x_{2,k}^2$.
Final weights of the action network:

$\hat{W}_A = [0.6208\ \ 1.2586\ \ 0.0589\ \ 0.0338\ \ 0.0095\ \ \ldots\ \ 0.0092\ \ 0.0049\ \ 0.0074\ \ 0.0050\ \ 0.0075\ \ 0.0054]^T$,

whose dominant entries recover the DARE gain $u(x_k) = -[0.6239\ \ 1.2561]\, x(k)$.
Optimal Regulation Results (cont.)
Figs. 10-12: Optimal regulation results for the linear system.
Simulation Results: Example 2
Regulation Example 2: Nonlinear system

$x_{k+1} = f(x_k) + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k$,

with drift terms $x_{1,k}\, x_{2,k}$ and $1.8\, x_{2,k}$ in $f(x_k)$.
Tuning parameters and activation functions are the same as in Example 1.
Initial stabilizing control: $u_{0,k} = -[0.4\ \ 1.24]\, x_k$.
Online results are compared to results obtained via offline training (Chen & Jagannathan, 2008).
Online Results for the Nonlinear System
Figs. 13-16: Online results for the nonlinear system.
Optimal Tracking Results (cont.)
Fig. 17
Optimal Tracking
For regulation, the optimal control inputs are constructed so that $u_k \to 0$ when $x_k \to 0$.
In contrast, tracking control inputs typically consist of feedback and feedforward terms, so that $u_k$ is not zero when the tracking error $e_k$ becomes zero.
In the literature, finite horizon cost functions of the form

$J_k = \sum_{j=k}^{N} \left( e_j^T Q e_j + u_j^T R u_j \right)$

are often considered for tracking to mitigate the fact that $u_k$ does not vanish (Lewis and Syrmos, 1995).
Objective: Design the control input to ensure that the nonlinear system (1) tracks a desired trajectory in an optimal manner.

T. Dierks and S. Jagannathan, "Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics", Proc. of the IEEE Conference on Decision and Control, pp. 6750-6755, December 2009.
Optimal Tracker Development
To begin, write the desired trajectory dynamics

$x_{d,k+1} = f(x_{d,k}) + g(x_{d,k}) u_{d,k}$.   (32)

Tracking error $e_k = x_k - x_{d,k}$, with dynamics given by

$e_{k+1} = f_{e,k} + g(x_k) u_{e,k}$,   (33)

where $f_{e,k} = f(x_k) - f(x_{d,k})$ and $u_{e,k} = u_k - u_{d,k}$.
The infinite horizon cost function for tracking can be given by

$J_{e,k} = \sum_{i=0}^{\infty} r_{e,k+i} = r_{e,k} + J_{e,k+1}$,   (34)

with $r_{e,k} = Q_e(e_k) + u_{e,k}^T R_e u_{e,k}$, $Q_e(e_k) > 0$, and $R_e \in R^{m \times m} > 0$.
The signal $u_{e,k}$ is admissible, thus $J_{e,k}$ is finite.
Optimal Tracking Control Input
The optimal feedback control is found by solving $\partial J_{e,k} / \partial u_{e,k} = 0$:

$u_{e,k}^* = -\frac{1}{2} R_e^{-1} g^T(x_k) \frac{\partial J(e_{k+1})}{\partial e_{k+1}}$.   (35)

The optimal control is then given by

$u_k = u_{d,k} - \frac{1}{2} R_e^{-1} g^T(x_k) \frac{\partial J(e_{k+1})}{\partial e_{k+1}}$,   (36)

where the feedforward term is found from (32) as

$u_{d,k} = g^{-1}(x_{d,k}) \left( x_{d,k+1} - f(x_{d,k}) \right)$.   (37)

The optimal control requires the nonlinear cost function, feedback, and feedforward values, which will be approximated by OLAs (online approximators).
Feedforward Control Input Approximation
The feedforward control input (37) is approximated using an OLA and written as

$\hat{u}_{d,k} = g^{-1}(x_{d,k}) \left( x_{d,k+1} - \hat{\theta}_{d,k}^T \phi_d(x_{d,k}) \right)$,   (38)

where the OLA $\hat{\theta}_{d,k}^T \phi_d(x_{d,k})$ estimates the unknown internal dynamics $f(x_{d,k})$.
The feedforward OLA update is derived using Lyapunov methods to be

$\hat{\theta}_{d,k+1} = \hat{\theta}_{d,k} + \alpha_d \phi_{d,k} \left( e_{k+1} - g(x_k) \hat{u}_{e,k} \right)^T$.   (39)

The estimate of the system control input (36) is then written as

$\hat{u}_k = \hat{u}_{d,k} + \hat{u}_{e,k}$.   (40)
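A small sketch of evaluating the feedforward term (37)-(38) (Python/NumPy; all values are placeholders): the pseudoinverse stands in for the inverse when g is not square.

```python
import numpy as np

def feedforward(g_d, x_d_next, f_d_hat):
    """u_d = g(x_d)^+ (x_{d,k+1} - f̂(x_d)), cf. eqs. (37)-(38)."""
    return np.linalg.pinv(g_d) @ (x_d_next - f_d_hat)

g_d = np.array([[0.0], [1.0]])            # input gain at the desired state
x_d_next = np.array([0.2, -0.1])          # next desired state x_{d,k+1}
f_d_hat = np.array([0.15, -0.3])          # OLA estimate of f(x_{d,k})
u_d = feedforward(g_d, x_d_next, f_d_hat)
```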
Remarks
Remark 1: For offline ADP, the optimal cost function is learned using a mesh of training points. In contrast, the optimal cost function is learned online in this work using the system's time history.
Remark 2: By construction, the feedback control input error $e_{a,k}$ becomes zero when the tracking error $e_k$ is zero. As a result, the feedback control estimator (41) cannot be updated once the tracking error has converged to zero.
To prevent the control input error from becoming zero before the cost function and nearly optimal control policy have been learned, external probing noise may be introduced into the nonlinear system dynamics to ensure the tracking error is persistently exciting.
Online Regulation and Tracking: Comparison
Nonlinear System
Consider the nonlinear system

$x_{k+1} = \begin{bmatrix} \sin(0.5\, x_{2,k})\, x_{1,k}^2 \\ \cos(1.4\, x_{2,k}) \sin(0.9\, x_{1,k}) \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} u_k$.

Desired trajectory:

$x_{d,k} = [\sin(0.25 k)\ \ \cos(0.25 k)]^T$.

Initial stabilizing control:

$u_{e,k} = -\begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix} e_k$.

No offline training is required to implement the proposed optimal tracking control scheme; 11 past values are used to form the auxiliary cost function error.
OLA and Parameter Design
The OLAs considered for implementation were neural networks:
10 hidden-layer neurons were used in both the cost function and feedback networks, along with the sigmoid activation function.
25 hidden-layer neurons were used in the feedforward estimator, along with the radial basis activation function.
Tuning gains were selected according to the theoretical results:
Cost function parameter tuning gain: $\alpha_{ec} = 0.1$.
Feedback control signal parameter tuning gain: $\alpha_{ea} = 0.1$.
Feedforward control signal parameter tuning gain: $\alpha_d = 0.05$.
All tunable parameters were initialized to zero.
Probing noise was generated from a random distribution with zero mean and variance of 0.04 to ensure the tracking error was persistently exciting.
Optimal Tracking Results
Fig. 18: Optimal tracking results.
Figs. 19-20: Results for optimal tracking control. All parameters remain bounded and converge to constant values, as the stability theorem suggests.
Hardware Implementation
Multi-modal Engine Control
The objective is to control an SI engine to minimize emissions and maximize fuel efficiency.
Engine dynamics are unknown; states are not available; the engine is a nonstrict-feedback nonlinear discrete-time system.
High levels of exhaust gas recirculation (EGR) dilute the cylinder contents. Although a stoichiometric equivalence ratio is maintained, partial fuel burns and misfires become frequent.
Lean combustion changes the equivalence ratio, causing partial fuel burns and misfires.
Control of the fuel input attempts to reduce the cyclic dispersion of heat release.
Lean Engine Control
Daw model equations: the states are total air and total fuel; the output is heat release (or acceleration).

$x_1(k+1) = AF(k) + F(k)\, x_1(k) + R\, F(k)\, (1 - CE(k))\, x_2(k) + d_1(k)$
$x_2(k+1) = (1 - CE(k))\, F(k)\, x_2(k) + MF(k) + u(k) + d_2(k)$
$y(k) = CE(k)\, x_2(k)$,

where $CE(k)$ is the combustion efficiency, a sigmoidal function of the equivalence ratio $\phi(k) = R\, x_2(k)/x_1(k)$ that saturates at $CE_{\max}$.
This model represents spark engine dynamics as a nonlinear system in nonstrict feedback form.

S. Jagannathan and P. He, "Neural network-based state feedback control of nonlinear discrete-time system in non-strict feedback form", IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 2073-2087, December 2008.
Lean Engine Control and Diagnostics—Experimental Results
The adaptive critic neural network controller is implemented together with an NN observer: the observer reconstructs the unmeasured states from the engine measurements, and the NN controller generates the fuel control input.
The adaptive critic NN controller for lean engine operation reduces NOx by over 90%, unburned hydrocarbons by 30%, and CO by 10% from stoichiometric levels. Fuel efficiency is enhanced by 10%. Misfires have been minimized.
Figure: Cyclic dispersion in heat release without (left) and with (right) control at an equivalence ratio of 0.76; return maps of HR(k+1) versus HR(k), measured and estimated heat release during misfire, and the control input u(k) in grams.
Lean Engine Control
Time series of heat release for an equivalence ratio set point of 0.77: the heat release output becomes more stable once the control action is initiated.
Return maps of current-cycle heat release plotted against the next cycle show 64% reduced cyclic dispersion (uncontrolled COV 0.3851 versus controlled COV 0.1364 at an equivalence ratio of 0.77 + 3.9%).
Fig. 21: Return maps of heat release without (left) and with (right) control.
Reinforcement Learning Based EGR
Block diagram: an NN observer estimates the states from the output y(k), a critic NN approximates the strategic utility function Q̂(k), and two action NNs generate the control input u(k) from the estimated tracking errors ê1(k) and ê2(k).
Experimental plot: controller output tracking error at an equivalence ratio of 0.72.

P. Shih, B. Kaul, S. Jagannathan, and J. Drallmeier, "Reinforcement learning based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation", IEEE Transactions on Neural Networks, vol. 19, no. 8, pp. 1369-1388, August 2008.
Experimental Data (EGR = 18%)
Controlled estimation errors: mean errors of -4.76e-6 and 3.09e-6 for the state estimates and -0.00229 for the output estimate.
Uncontrolled versus controlled return maps of heat release at EGR = 18% show reduced cyclic dispersion with the controller active.
Heat release and control input at EGR = 18%, with the controller turned on at k = 5242.
Fig. 22
Mobile Robot Formation Control
A wedge formation consisting of a leader L and two followers F1, F2 was considered, with desired separation $L_{ijd} = 0.8$.
Initial stabilizing controls for the leader i and each follower j:

$u_{eo}^i = 1.3\, g_i(k)\, e_{ic}(k)$,  $u_{eo}^j = 1.4\, g_j(k)\, e_{jc}(k)$.

To evaluate the performance of the optimal controller, a second, non-optimal NN control scheme was selected for each robot according to (Jagannathan, 2006):

$u_{NN}(k) = B^{-1}(k) \left( K e_c(k) + \hat{W}^T(k)\, \sigma(V^T z(k)) \right)$,

with the NN weights $\hat{W}(k)$ tuned online using the formation tracking error $e_c(k+1)$.

B. Brenner, T. Dierks, and S. Jagannathan, "Neuro optimal control of robot formations", Proc. of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 234-241, April 2011.
Experimental Results
Primary Components:
1. MST Mote
2. Microstrain Inertial Sensor
3. CMU2 Camera
4. US Digital Wheel Encoders
5. Sharp IR Sensors
Missouri S&T Robot Testbed
Experimental Results
Fig. 24: Optimal path of the leader and follower in the X-Y plane.
Fig. 25: Magnitudes of the feedback, cost, and feedforward weights versus time.
Experimental Results: With Obstacle Avoidance
Figs. 26-27: Robot paths under the optimal controller $\hat{u}_e$ and the non-optimal NN controller $u_{NN}$.
Experimental Results: With Obstacle Avoidance
Figs. 28-29: Robot paths under $\hat{u}_e$ and $u_{NN}$ with obstacle avoidance.
Total cost $J_{Total} = \sum_{k=0}^{k_{final}} J_e(e_c(k))$ accumulated by each robot under the two control policies:

Robot | $\hat{u}_e$ | $u_{NN}$
L     | 126.04      | 192.71
F     | 829.18      | 2635
Continuous-time RL/ADP Schemes
Dierks & Jagannathan, ACC 2010 - a single OLA to learn the HJB equation for the online optimal control of nonlinear continuous-time systems.
Figures: optimal regulator block diagram and optimal tracker block diagram.
Conclusions
RL using value and policy iteration-based schemes helps solve optimal control forward in time without needing the system dynamics.
Due to implementation challenges, online RL/ADP schemes without an iterative approach for optimal control are preferred.
Discrete-time design and implementation is preferred for hardware implementation, even though the Lyapunov first difference is nonlinear with respect to the states. Lyapunov theory guarantees that all signals remain bounded, while under ideal conditions the approximated control input approaches the optimal input asymptotically.
Simulation and hardware implementation results verify the theoretical results.