optimal adaptive control of nonlinear discrete-time...
TRANSCRIPT
Optimal Adaptive Control of Nonlinear Discrete-time Systems
S. Jagannathan, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, USA
(sarangap@mst.edu)
Outline
Introduction
Background on Optimal Adaptive Control
Optimal Adaptive Control of Unknown Dynamic Systems
Challenges of Value and/or Policy Iteration-based Schemes
Online Optimal Adaptive Control w/o Policy and/or Value Iterations
Online Optimal Tracking
Hardware Implementation
Extensions to Continuous-time
Conclusions
Introduction
Reinforcement learning (RL)/adaptive dynamic programming (ADP) techniques with online approximators such as neural networks (NNs) have been widely employed in the literature in the context of optimal adaptive control (Si, Barto, Powell & Wunsch, 2004)
Policy and/or value iterations are utilized
NN reconstruction errors are ignored
Convergence of the NN implementations is not provided
Full or partial knowledge of the system dynamics is needed, and in some cases a model of the system is needed
The Bellman equation forms the basis for a family of RL/ADP schemes for finding optimal adaptive policies forward in time
RL/ADP-based schemes do not require the system dynamics
Convergence and/or closed-loop stability can be demonstrated by using the temporal difference (TD) error method
RL/ADP-based online methods are preferred over offline-based schemes
Background
F.L. Lewis, D. Vrabie, and V. Syrmos, "Optimal Control", third edition, to appear, John Wiley Press.
Optimal Control of Discrete-time Systems
Consider the discrete-time (DT) nonlinear affine system given by

$x_{k+1} = f(x_k) + g(x_k)u_k$,   (1)

where $x_k \in R^n$ is the system state and $u_k \in R^m$ is the control input.
Value function:

$V^h(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, u_i) = \sum_{i=k}^{\infty} \gamma^{i-k} \left( Q(x_i) + u_i^T R u_i \right)$,   (2)

where $0 \le \gamma \le 1$ is a discount factor, $u_k = h(x_k)$ is a feedback control input, and the stage cost is $r(x_k, u_k) = Q(x_k) + u_k^T R u_k$ with $Q(x_k) > 0$, $R > 0$.
Note: Discounting makes the initial time steps more important, and traditional learning techniques usually learn during the first few time steps; hence discounting is not preferred for learning schemes. The value function is the positive definite solution to the Bellman equation that satisfies V(0) = 0.
Optimal Control for DT System
Discrete-time Hamilton-Jacobi-Bellman (HJB) equation:

$V^*(x_k) = \min_{h(\cdot)} \left( r(x_k, h(x_k)) + V^*(x_{k+1}) \right)$.   (3)

Optimal control policy:

$h^*(x_k) = \arg\min_{h(\cdot)} \left( r(x_k, h(x_k)) + V^*(x_{k+1}) \right) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial V^*(x_{k+1})}{\partial x_{k+1}}$.   (4)

Note: To find the optimal control $h^*(x_k)$, the DT HJB equation (3) needs to be solved. However, due to the nonlinear nature of the DT HJB equation, finding its solution is difficult or impossible.
Hamiltonian
Discrete-time Hamiltonian:

$H(x_k, h(x_k), \Delta V_k^h) = r(x_k, h(x_k)) + V^h(x_{k+1}) - V^h(x_k)$,   (5)

where $\Delta V_k^h = V^h(x_{k+1}) - V^h(x_k)$ is the forward difference operator. It is important to note that the Hamiltonian is the temporal difference (TD) error for a prescribed control policy $u_k = h(x_k)$.
Iterative methods will be shown to obtain the value function and optimal control from the DT HJB.
Traditional Optimal Control of Linear Systems
Linear Quadratic Regulator
DT linear time-invariant system dynamics:

$x_{k+1} = A x_k + B u_k$.   (6)

Infinite horizon value function:

$V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right)$.   (7)

Bellman equation:

$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$.   (8)
Bellman Equation for DT LQR: ARE
Value function in quadratic form of the state: $V(x_k) = x_k^T P x_k$ with some P.
DT Hamiltonian:

$H(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k) - x_k^T P x_k$.   (9)

Use the stationarity condition $\partial H(x_k, u_k)/\partial u_k = 0$ to obtain the optimal control

$u_k = -K x_k = -(B^T P B + R)^{-1} B^T P A x_k$.   (10)

Substitute (10) into (9) to obtain the DT Algebraic Riccati Equation (DARE):

$A^T P A - P + Q - A^T P B (B^T P B + R)^{-1} B^T P A = 0$.   (11)

The DT ARE is exactly the Bellman optimality equation for the DT LQR.
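As a concrete illustration, the following is a minimal sketch (Python/NumPy, not part of the original slides) that solves the DARE (11) by fixed-point iteration of the Riccati recursion and then recovers the optimal gain from (10); the system matrices are placeholders in the spirit of the linear examples discussed later.

```python
import numpy as np

def solve_dare(A, B, Q, R, tol=1e-10, max_iter=10_000):
    """Solve the DT algebraic Riccati equation (11) by iterating
    P <- A'PA + Q - A'PB (B'PB + R)^{-1} B'PA to a fixed point."""
    P = Q.copy()
    for _ in range(max_iter):
        APB = A.T @ P @ B
        P_new = A.T @ P @ A + Q - APB @ np.linalg.solve(B.T @ P @ B + R, APB.T)
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new
    return P

# Illustrative system (assumed values):
A = np.array([[0.0, 0.8], [0.8, 1.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

P = solve_dare(A, B, Q, R)
K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)  # optimal gain, eq. (10)
print("P =", P, "\nK =", K)                        # control law: u_k = -K x_k
```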
Value Function Approximation (VFA)
VFA is the key for implementing PI and VI online for dynamical systems with infinite state and action spaces.
Linear system: Since the value function is quadratic in the state in the LQR case, it can be written as

$V(x_k) = x_k^T P x_k = \mathrm{vec}(P)^T (x_k \otimes x_k) = \bar{p}^T \bar{x}_k$,   (12)

where $\bar{x}_k$ collects the quadratic monomials of $x_k$.
Nonlinear system: According to the Weierstrass Higher-Order Approximation Theorem, there exists a dense basis set $\{\varphi_i(x)\}$ such that

$V(x) = \sum_{i=1}^{\infty} w_i \varphi_i(x) = \sum_{i=1}^{L} w_i \varphi_i(x) + \sum_{i=L+1}^{\infty} w_i \varphi_i(x) = W^T \phi_L(x) + \varepsilon_L(x)$,   (13)

with basis vector $\phi_L(x) = [\varphi_1(x)\ \varphi_2(x)\ \cdots\ \varphi_L(x)]^T : R^n \to R^L$, and $\varepsilon_L(x)$ converges uniformly to zero as $L \to \infty$.
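A small sketch (Python/NumPy, illustrative; the helper names are not from the slides) of the vectorization in (12): once the state is lifted to the quadratic basis x̄, the quadratic value function is linear in the parameter vector p̄, which is what makes least-squares learning of V possible.

```python
import numpy as np

def quad_basis(x):
    """Independent entries of x ⊗ x: [x1^2, x1*x2, ..., xn^2] (eq. (12))."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

# For V(x) = x'Px the parameters are p̄_ii = P_ii and p̄_ij = 2 P_ij (i < j),
# since each off-diagonal term appears twice in x'Px.
P = np.array([[1.5, 1.0],
              [1.0, 3.8]])                           # example kernel matrix
p_bar = np.array([P[0, 0], 2 * P[0, 1], P[1, 1]])
x = np.array([0.3, -0.2])
assert np.isclose(p_bar @ quad_basis(x), x @ P @ x)  # V(x) = p̄'x̄ = x'Px
```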
Optimal Adaptive Control of Linear System
Example: Solving the DT LQR online by using PI and VI
System dynamics:

$x_{k+1} = A x_k + B u_k$.   (14)

Value function:

$V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right) = x_k^T P x_k$.   (15)

Optimal control:

$u_k = -K x_k = -(B^T P B + R)^{-1} B^T P A x_k$.   (16)

Note: Solving for the optimal control needs $P$, which is the solution of the Riccati equation. Next, PI and VI will be used to solve for $P$ online.
Optimal Adaptive Control via PI
PI for DT LQR:
Initialization: select any admissible control $h_0(x_k)$. Set $j = 0$ and iterate until convergence.
Policy evaluation step:

$p_{j+1}^T (\bar{x}_k - \bar{x}_{k+1}) = r(x_k, h_j(x_k)) = x_k^T (Q + K_j^T R K_j) x_k$,   (17)

where $p_{j+1} = \mathrm{vec}(P_{j+1})$.
Policy improvement step:

$u_k = h_{j+1}(x_k) = -(B^T P_{j+1} B + R)^{-1} B^T P_{j+1} A x_k$.   (18)
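A hedged sketch of this PI loop for the DT LQR (Python/NumPy, illustrative only): the policy-evaluation step (17) is solved by least squares from simulated one-step transitions, while the improvement step (18) still uses A and B, as noted on the VI slide. The initial gain is the stabilizing gain used in the later simulation example.

```python
import numpy as np

def quad_basis(x):
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])   # x̄ for n = 2

def p_to_P(p):
    # invert the vectorization: the off-diagonal weight is split in half
    return np.array([[p[0], p[1] / 2], [p[1] / 2, p[2]]])

def policy_iteration(A, B, Q, R, K0, n_iter=20, n_samples=30):
    K = K0
    for _ in range(n_iter):
        # Policy evaluation (17): solve p'(x̄_k - x̄_{k+1}) = x_k'(Q + K'RK)x_k
        # by least squares over sampled closed-loop transitions.
        Phi, y = [], []
        for _ in range(n_samples):
            x = np.random.uniform(-1, 1, 2)
            xn = (A - B @ K) @ x                      # one step under u = -Kx
            Phi.append(quad_basis(x) - quad_basis(xn))
            y.append(x @ (Q + K.T @ R @ K) @ x)
        p, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
        P = p_to_P(p)
        # Policy improvement (18): this step requires A and B.
        K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    return P, K

A = np.array([[0.0, 0.8], [0.8, 1.8]]); B = np.array([[0.0], [1.0]])
P, K = policy_iteration(A, B, np.eye(2), np.eye(1), K0=np.array([[0.5, 1.4]]))
```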
Optimal Adaptive Control via VI
VI for DT LQR:
Initialization: select any control $h_0(x_k)$ (not necessarily admissible). Set $j = 0$ and iterate until convergence.
Value update step:

$p_{j+1}^T \bar{x}_k = r(x_k, h_j(x_k)) + p_j^T \bar{x}_{k+1} = x_k^T (Q + K_j^T R K_j) x_k + p_j^T \bar{x}_{k+1}$,   (19)

where $p_{j+1} = \mathrm{vec}(P_{j+1})$.
Control policy update:

$u_k = h_{j+1}(x_k) = -(B^T P_{j+1} B + R)^{-1} B^T P_{j+1} A x_k$.   (20)

Note: The control policy updates (18), (20) require the system dynamics $A$, $B$, which may not be known beforehand.
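The VI counterpart, reusing quad_basis and p_to_P from the PI sketch above: the value update (19) keeps the previous parameter vector p_j on the right-hand side, so no equation is solved implicitly and no initial admissible control is needed, at the price of slower convergence.

```python
def value_iteration(A, B, Q, R, n_iter=80, n_samples=30):
    K, p = np.zeros((1, 2)), np.zeros(3)    # arbitrary (not admissible) start
    for _ in range(n_iter):
        Phi, y = [], []
        for _ in range(n_samples):
            x = np.random.uniform(-1, 1, 2)
            xn = (A - B @ K) @ x
            # Value update (19): the target uses the *previous* parameters p_j.
            Phi.append(quad_basis(x))
            y.append(x @ (Q + K.T @ R @ K) @ x + p @ quad_basis(xn))
        p, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
        P = p_to_P(p)
        K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # update (20)
    return P, K
```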
Policy (PI) and Value Iteration(VI) Flowchart
Note: Differences between PI and VI: 1) VI does not require an initial admissible control; 2) the value update (19) is different from the policy evaluation step (17).
Optimal Adaptive Control via Policy Approximation
Second 'Actor' Neural Network: To relax the requirement of full knowledge of the system dynamics, a second neural network is introduced for the control policy.
Actor NN:

$u_k = h(x_k) = U^T \sigma(x_k)$,   (21)

where $\sigma(x) : R^n \to R^M$ is the activation or basis function and $U \in R^{M \times m}$ holds the weights or unknown parameters.
An update law is required for the NN weights.
Note: After convergence of the critic NN parameters to $W_{j+1}$ in PI or VI, the actor NN is used to update the control policy based on $W_{j+1}$. The actor NN avoids the need for the state drift dynamics $f(\cdot)$ (or $A$ in a linear system).
Actor-Critic Implementation of TD Learning
Fig. 1. Temporal difference learning using policy iteration.
The value function is estimated by observing the current state, the next state, and the cost incurred. Based on the new value, the action is updated.
Nonlinear Optimal Control via Offline
Discrete-time unknown nonlinear affine system:

$x_{k+1} = f(x_k) + g(x_k)u(x_k) = f_k + g_k u_k$.   (1)

Infinite horizon cost function:

$V(x_k) = \sum_{n=k}^{\infty} \left( Q(x_n) + u_n^T R u_n \right) = Q(x_k) + u_k^T R u_k + V(x_{k+1})$,   (2)

where $Q(x_k) > 0$ and $R \in R^{m \times m} > 0$. The control input must be admissible (Chen & Jagannathan, 2008):
$u_k$ is continuous on a compact set;
$u_k$ stabilizes (1) and $u(0) = 0$;
$V(x_0)$ is finite for all $x_0$.

T. Dierks, B. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence", Neural Networks, vol. 22, pp. 851-860, 2009.
Nonlinear Optimal Control (cont.)
The optimal control input for (1) is found by solving $\partial V(x_k)/\partial u_k = 0$ and is revealed to be

$u^*(x_k) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial V(x_{k+1})}{\partial x_{k+1}}$.   (3)

The optimal control (3) is generally not implementable for unknown nonlinear systems due to $g(x_k)$ and $x_{k+1}$.
This work relaxes the need for explicit knowledge of (1) while ensuring convergence and stability:
Online NN system identification scheme
Offline optimal control using only the learned NN model of the system
Identifier Design
The nonlinear system (1) is rewritten as

$x_{k+1} = f_k + g_k u_k = h(x_k, u_k) = w_s^T \sigma(v_s^T z_k) + \varepsilon(x_k) = w_s^T \sigma_s(z(x_k)) + \varepsilon(x_k)$.   (4)

The proposed nonlinear identifier is written as

$\hat{x}_{k+1} = \hat{w}_{s,k}^T \sigma_s(z_k) + v_k$,   (5)

where

$v_k = \frac{\tilde{x}_k}{\tilde{x}_k^T \tilde{x}_k + C_s}$,   (6)

$\tilde{x}_k = x_k - \hat{x}_k$ is the identification error, $v_k$ is an additional tunable (robust) term, and $C_s > 0$ is a constant.
NN Identifier (cont.)
Assumption: The term $\varepsilon(x_k)$ is assumed to be upper bounded by a function of the estimation error such that

$\varepsilon^T(x_k)\, \varepsilon(x_k) \le \varepsilon_M\, \tilde{x}_k^T \tilde{x}_k$,   (7)

where $\varepsilon_M$ is a bounding constant.
The proposed tuning laws for $\hat{w}_{s,k}$ and $v_k$ are given by

$\hat{w}_{s,k+1} = \hat{w}_{s,k} + \alpha_s \sigma_s(z_k)\, \tilde{x}_{k+1}^T$   (8)

and

$v_{k+1} = \tilde{x}_{k+1} / (\tilde{x}_{k+1}^T \tilde{x}_{k+1} + C_s)$.   (9)

The asymptotic error convergence can be demonstrated using the Lyapunov candidate

$L_k = \tilde{x}_k^T \tilde{x}_k + \frac{1}{\alpha_s} \mathrm{tr}[\tilde{w}_{s,k}^T \tilde{w}_{s,k}]$.   (10)
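A heavily simplified identifier sketch (Python/NumPy): a one-layer random-feature network trained with a gradient-style tuning law in the spirit of (8). The plant, dimensions, and learning gain are placeholders, and the robust term v_k of (6) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 2, 1, 20                   # state, input, neuron counts (assumed)
V = rng.normal(size=(n + m, L))      # fixed inner weights v_s
w_hat = np.zeros((L, n))             # tunable outer weights ŵ_s
alpha_s = 0.05                       # learning gain (assumed)

def sigma_s(z):
    return np.tanh(z @ V)            # basis σ_s(z_k) = σ(v_s' z_k)

def plant(x, u):
    # stand-in unknown dynamics for the demo (not the slides' example)
    return np.array([0.8 * x[1], 0.9 * np.sin(x[0]) + u[0]])

x = np.array([0.5, -0.5])
for k in range(2000):
    u = 0.1 * rng.standard_normal(1)                  # probing input
    z = np.concatenate([x, u])
    x_next = plant(x, u)
    x_tilde = x_next - sigma_s(z) @ w_hat             # identification error
    w_hat += alpha_s * np.outer(sigma_s(z), x_tilde)  # tuning law à la (8)
    x = x_next
print(np.linalg.norm(x_tilde))       # residual identification error
```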
Estimation of Control Coefficient Matrix
Using the online system identification scheme (5), an estimate for $g(x_k)$ is given by

$\hat{g}(x_k) = \hat{w}_{s,k}^T \sigma'(z_k)\, v_s^T (\partial z_k / \partial u_k)$,   (11)

where $\sigma'(z_k)$ is the derivative of the activation function with respect to $z_k$ and $\partial z_k / \partial u_k$ is a constant matrix.
After a sufficiently long online learning session, the robust term (6) approaches zero, and the dynamic system can be rewritten as

$x_{k+1} = f(x_k) + g(x_k) u_k \approx \hat{x}_{k+1} = \hat{w}_{s,k}^T \sigma_s(z_k)$.   (12)

Then observe $\partial x_{k+1} / \partial u_k = g(x_k) \approx \partial \hat{x}_{k+1} / \partial u_k$ to yield the relationship shown in (11).
Offline Heuristic Dynamic Programming
Only the NN model learned online is used for the offline training
1) Initialize the cost function $V_0(x_k) = 0$ and select the initial stabilizing control input:

$u_0(x_k) = \arg\min_{u_k} \left( Q(x_k) + u_k^T R u_k + V_0(x_{k+1}) \right)$.

2) Compute the value function:

$V_1(x_k) = Q(x_k) + u_0^T R u_0 + V_0(\hat{w}_{s,k}^T \sigma_s(z_k))$.

3) Optimization is achieved by iterating between a sequence of control policies and value functions (proof of convergence shown in Al-Tamimi, Lewis, & Abu-Khalaf, 2008):

$u_i(x_k) = \arg\min_{u_k} \left( Q(x_k) + u_k^T R u_k + V_i(\hat{x}_{k+1}) \right)$,
$V_{i+1}(x_k) = Q(x_k) + u_i^T R u_i + V_i(\hat{w}_{s,k}^T \sigma_s(z_k))$.
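A minimal tabular analogue of this HDP loop (Python/NumPy; the 1-D toy system and grids are assumed for illustration). In the slides, the value and action are NN-parameterized and the next state comes from the NN identifier (12) rather than a known model.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 101)          # mesh of training states
us = np.linspace(-2.0, 2.0, 81)           # candidate control actions
V = np.zeros_like(xs)                     # V_0 = 0

def model(x, u):
    return 0.9 * x + 0.1 * np.sin(x) + 0.1 * u   # stand-in for the NN model

def stage_cost(x, u):
    return x**2 + 0.1 * u**2              # Q(x) + u'Ru

for i in range(200):                      # HDP iterations
    V_new = np.empty_like(V)
    for j, x in enumerate(xs):
        # greedy minimization over the action grid using the current V_i,
        # then the value update V_{i+1}(x) = min_u [Q(x) + u'Ru + V_i(x')]
        costs = stage_cost(x, us) + np.interp(model(x, us), xs, V)
        V_new[j] = costs.min()
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
```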
NN Implementation of Offline HDP
The value function (2) and optimal control (3) are assumed to have NN representations

$V_i(x_k) = W_{Vi}^T \phi_{Vi}(x_k)$   (13)

$u_i(x_k) = W_{Ai}^T \phi_{Ai}(x_k)$.   (14)

The NN approximations of (13) and (14) are written as

$\hat{V}_i(x_k) = \hat{W}_{Vi}^T \phi(x_k)$   (15)

$\hat{u}_i(x_k) = \hat{W}_{Ai}^T \phi(x_k)$.   (16)

Critic weights are tuned at each iteration using the method of weighted residuals, yielding the least-squares solution

$\hat{W}_{Vi+1} = \left( \int \phi(x_k) \phi^T(x_k)\, dx_k \right)^{-1} \int \phi(x_k)\, d(x_k, \hat{x}_{k+1}, \hat{W}_{Vi}, \hat{W}_{Ai})\, dx_k$,   (17)

where $d(\cdot)$ is the HDP target $Q(x_k) + \hat{u}_i^T R \hat{u}_i + \hat{V}_i(\hat{x}_{k+1})$.
Tuning the Action Network
Define the action error to be the difference between the estimated control input (14) and the estimated optimal control:

$e_{ai,j} = \hat{u}_{i,j,k} - \hat{u}^*_{i,k} = \hat{W}_{Ai,j}^T \phi(x_k) + \frac{1}{2} R^{-1} \hat{g}^T(x_k) \frac{\partial \hat{V}_i(\hat{x}_{j,k+1})}{\partial \hat{x}_{j,k+1}}$,   (18)

where the gradient term is evaluated through the NN identifier (5).
Action network weight update:

$\hat{W}_{Ai,j+1} = \hat{W}_{Ai,j} - \alpha_j \frac{\phi(x_k)\, e_{ai,j}^T}{\phi^T(x_k) \phi(x_k) + 1}$.   (19)

Proof of action network convergence can be shown.
Remarks
The results indicate that only a nearly optimal control law is possible when the NN reconstruction errors are explicitly considered.
When the NN reconstruction errors are ignored, the NN estimate of the optimal control law converges to the optimal control uniformly.
Although the NN reconstruction errors are often ignored in the literature, proofs of convergence for the NN implementations of HDP are rarely studied.
Offline HDP Flowchart
Simulation Results
E l 1 Li S t Example 1: Linear System1 1
12 1
0 3 0 8 00 8 1 8 1
,kk k k
k
x . .x x u
x . .
Online identifier parameters:2 1 0 8 1 8 1,kx . .
,09.0s 0027.0s
Offline training set and learning parameter:s
],5.0,5.0[1 x ],5.0,5.0[2 x 01.0j
Value network activation functions},,,,,,{ 6
2231
41
2221
21 xxxxxxxx
Action network activation functions
DARE solution},,,,{ 5
23121 xxxx
0 691 1 3405[ ] DARE solution 0 691 1 3405k ku [ . . ] x
Linear System Results
Fig. 2: State identification error versus iteration index k.
Fig. 3: State trajectories in the (x1, x2) plane, comparing the DARE and NN solutions.
Nonlinear System Example
Example 2: Nonlinear system

$x_{k+1} = \begin{bmatrix} -\sin(0.5\, x_{2,k})\, x_{1,k} \\ -\cos(1.4\, x_{2,k}) \sin(0.9\, x_{1,k}) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k$.

NN identifier, value, and action network parameters are the same as in Example 1.
The HDP algorithm described in Al-Tamimi, Lewis, and Abu-Khalaf (2008) was also implemented using full knowledge of the nonlinear system dynamics for comparison.
Figs. 4-7: Simulation results for the nonlinear system.
Challenges of PI and VI –based Schemes
Iterative schemes may not be suitable for hardware implementation.
PI- and VI-based schemes may not converge if a suitable number of iterations is not chosen.
In certain instances, a model of the nonlinear system needs to be created before designing the PI and VI schemes.
Effect of Insufficient Iterations using HDP
Consider the discrete-time nonlinear affine system

$x_{k+1} = f(x_k) + g(x_k) u_k$,

where $f(x_k)$ contains the sinusoidal term $\sin(0.5\, x_{1,k} x_{2,k})$ and the linear term $1.5\, x_{2,k}$, and the input gain $g(x_k)$ depends on $x_{2,k}$.
Simulation parameters:
Initial states: $x_0 = [0.5\ \ {-4.5}]^T$, $Q = I_{2 \times 2}$, $R = 1$.
Sampling interval: $T_s = 0.1$ sec.
Effect of Insufficient Iterations
Fig. 8: Regulation errors e1 and e2 when the number of iterations is 1000. Fig. 9: Regulation errors e1 and e2 when the number of iterations is 10.
From these figures it is evident that when the number of iterations is decreased, policy iteration cannot force the regulation errors to converge to zero.
Online Near Optimal Regulation
Optimal control:

$u(x_k) = -\frac{1}{2} R^{-1} g^T(x_k) \frac{\partial J(x_{k+1})}{\partial x_{k+1}}$.   (20)

No offline scheme or internal dynamics $f(x_k)$ are needed.
To begin, the cost function (2) is assumed to have an NN representation written as

$J(x_k) = W_c^T \phi_c(x_k)$.   (21)

The NN approximation of (21) is written as

$\hat{J}(x_k) = \hat{W}_{c,k}^T \phi_c(x_k) = \hat{W}_{c,k}^T \bar{\phi}_k$.   (22)

The cost function error (residual or TD error) dynamics are formed as

$e_{c,k} = r_{k-1} + \hat{W}_{c,k}^T \bar{\phi}_k - \hat{W}_{c,k}^T \bar{\phi}_{k-1}$,
$e_{c,k+1} = r_k + \hat{W}_{c,k+1}^T \bar{\phi}_{k+1} - \hat{W}_{c,k+1}^T \bar{\phi}_k$.   (23)
Auxiliary Critic Error System
Define the auxiliary error vector

$E_{c,k} = Y_{k-1} + \hat{W}_{c,k}^T X_{k-1}$,   (24)

where $Y_{k-1} = [r_{k-1}\ \ r_{k-2}\ \cdots\ r_{k-1-j}]$ and $X_{k-1} = [\Delta\bar{\phi}_{k-1}\ \ \Delta\bar{\phi}_{k-2}\ \cdots\ \Delta\bar{\phi}_{k-1-j}]$.
Auxiliary error dynamics:

$E_{c,k+1} = Y_k + \hat{W}_{c,k+1}^T X_k$.   (25)

The system (25) can be viewed as an affine nonlinear system with control input $\hat{W}_{c,k+1}$.
Critic Weight Update
Therefore, the critic weight update is written as

$\hat{W}_{c,k+1} = X_k (X_k^T X_k)^{-1} (\alpha_c E_{c,k} - Y_k)^T$,   (26)

and (25) becomes

$E_{c,k+1} = \alpha_c E_{c,k}$.   (27)

We can show that $(X_k^T X_k)^{-1}$ exists.
The weight update (26) is comparable to the least-squares updates used in offline training. Instead of summing over a mesh of training points, the update (26) represents a sum over the system's time history stored in $E_{c,k}$. Thus, the update uses data collected in real time instead of data formed offline.
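A simplified sketch of the critic solve (Python/NumPy, names hypothetical): in the spirit of (26), the stage costs and basis-vector differences observed online are stacked and the critic weights obtained by least squares over that time history, instead of over an offline training mesh (the contraction factor α_c is omitted here).

```python
import numpy as np

def critic_update(phi_hist, r_hist):
    """phi_hist: list of basis vectors φ(x_k) along the trajectory;
    r_hist: stage costs r_k between consecutive entries of phi_hist."""
    # Bellman residual r_k + W'φ_{k+1} - W'φ_k = 0  =>  W'(φ_k - φ_{k+1}) = r_k
    X = np.column_stack([phi_hist[i] - phi_hist[i + 1]
                         for i in range(len(phi_hist) - 1)])
    W, *_ = np.linalg.lstsq(X.T, np.asarray(r_hist), rcond=None)
    return W
```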
Action Network
The optimal control (3) has an NN representation written as

$u(x_k) = W_A^T \phi_A(x_k)$.   (28)

The NN approximation of (28) takes the form

$\hat{u}_k = \hat{u}(x_k) = \hat{W}_{A,k}^T \phi_A(x_k)$.   (29)

The optimal control signal error is defined as

$e_{a,k} = \hat{W}_{A,k}^T \phi_A(x_k) + \frac{1}{2} R^{-1} \hat{g}^T(x_k) \left( \frac{\partial \phi_c(\hat{x}_{k+1})}{\partial \hat{x}_{k+1}} \right)^T \hat{W}_{c,k}$.   (30)

Action network weight update:

$\hat{W}_{A,k+1} = \hat{W}_{A,k} - \alpha_a \frac{\phi_A(x_k)\, e_{a,k}^T}{\phi_A^T(x_k) \phi_A(x_k) + 1}$.   (31)

In the absence of NN reconstruction errors, the action and critic errors converge to zero asymptotically and the action and critic NN weights remain bounded; that is, $\hat{u} \to u^*$.
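A one-line sketch of the normalized-gradient update (31) (Python/NumPy; shapes and the gain value are assumptions): the actor weights move against the control error e_{a,k}, with the regressor norm in the denominator keeping the step size bounded.

```python
import numpy as np

def actor_update(W_A, phi_A, e_a, alpha_a=0.1):
    """Normalized gradient step in the spirit of (31)."""
    return W_A - alpha_a * np.outer(phi_A, e_a) / (phi_A @ phi_A + 1.0)

# usage sketch: M basis functions, m control inputs
M, m_ = 8, 1
W_A = np.zeros((M, m_))
W_A = actor_update(W_A, phi_A=np.random.randn(M), e_a=np.array([0.3]))
```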
Offline and Online Schemes: A Comparison
Simulation Results: Example 1
Example 1: Linear system

$x_{k+1} = \begin{bmatrix} 0 & 0.8 \\ 0.8 & 1.8 \end{bmatrix} x_k + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k$.

Results of solving the discrete-time algebraic Riccati equation (DARE), which requires A to be known:

$u_k = -[0.6239\ \ 1.2561]\, x_k$,
$J_k = 1.5063\, x_{1,k}^2 + 2.0098\, x_{1,k} x_{2,k} + 3.7879\, x_{2,k}^2$.

Initial stabilizing control: $u_{0,k} = -[0.5\ \ 1.4]\, x_k$.
Implementation of the proposed optimal regulator does not require the A matrix to be known.
Optimal Regulation Results
Cost approximator activation functions: even polynomial terms of the state up to sixth order, $\{x_1^2, x_1 x_2, x_2^2, x_1^4, x_1^3 x_2, \ldots, x_2^6\}$ (15 terms).
Action network activation functions: odd polynomial terms of the state up to fifth order, $\{x_1, x_2, x_1^3, \ldots, x_2^5\}$.
Tuning constants: $\alpha_c = 10^{-6}$, $\alpha_a = 0.1$.
Final cost estimator weights:

$\hat{W}_c = [1.5071\ \ 2.0097\ \ 3.7886\ \ 0.0082\ \ 0.0015\ \ 0.0025\ \ 0.0030\ \ 0.0014\ \ 0.0020\ \ 0.0000\ \ 0.0000\ \ 0.0008\ \ 0.0003\ \ 0.0009\ \ 0.0002]^T$,

whose first three entries recover the DARE cost $J_k = 1.5063\, x_{1,k}^2 + 2.0098\, x_{1,k} x_{2,k} + 3.7879\, x_{2,k}^2$.
Final weights of the action network:

$\hat{W}_A = [0.6208\ \ 1.2586\ \ 0.0589\ \ 0.0338\ \ 0.0095\ \ \ldots\ \ 0.0092\ \ 0.0049\ \ 0.0074\ \ 0.0050\ \ 0.0075\ \ 0.0054]^T$,

whose dominant entries recover the DARE gain $u(x_k) = -[0.6239\ \ 1.2561]\, x(k)$.
Optimal Regulation Results (cont.)
Figs. 10-12: Optimal regulation results for the linear system.
Simulation Results: Example 2
Regulation Example 2: Nonlinear system

$x_{k+1} = f(x_k) + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_k$,

with drift terms $x_{1,k}\, x_{2,k}$ and $1.8\, x_{2,k}$ in $f(x_k)$.
Tuning parameters and activation functions are the same as in Example 1.
Initial stabilizing control: $u_{0,k} = -[0.4\ \ 1.24]\, x_k$.
Online results are compared to results obtained via offline training (Chen & Jagannathan, 2008).
Online Results for the Nonlinear System
Figs. 13-16: Online results for the nonlinear system.
Optimal Tracking Results (cont.)
Fig. 17
Optimal Tracking
For regulation, the optimal control inputs are constructed so that $u_k \to 0$ when $x_k \to 0$.
In contrast, tracking control inputs typically consist of feedback and feedforward terms, so that $u_k$ is not zero when the tracking error $e_k$ becomes zero.
In the literature, finite horizon cost functions of the form

$J_k = \sum_{j=k}^{N} \left( e_j^T Q e_j + u_j^T R u_j \right)$

are often considered for tracking to mitigate the fact that $u_k$ does not vanish (Lewis and Syrmos, 1995).
Objective: Design the control input to ensure that the nonlinear system (1) tracks a desired trajectory in an optimal manner.

T. Dierks and S. Jagannathan, "Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics", Proc. of the IEEE Conference on Decision and Control, pp. 6750-6755, December 2009.
Optimal Tracker Development
To begin, write the desired trajectory dynamics

$x_{d,k+1} = f(x_{d,k}) + g(x_{d,k}) u_{d,k}$.   (32)

Tracking error $e_k = x_k - x_{d,k}$, with dynamics given by

$e_{k+1} = f_{e,k} + g(x_k) u_{e,k}$,   (33)

where $f_{e,k} = f(x_k) - f(x_{d,k})$ and $u_{e,k} = u_k - u_{d,k}$.
The infinite horizon cost function for tracking can be given by

$J_{e,k} = \sum_{i=0}^{\infty} r_{e,k+i} = r_{e,k} + J_{e,k+1}$,   (34)

with $r_{e,k} = Q_e(e_k) + u_{e,k}^T R_e u_{e,k}$, $Q_e(e_k) > 0$, and $R_e \in R^{m \times m} > 0$.
The signal $u_{e,k}$ is admissible, thus $J_{e,k}$ is finite.
Optimal Tracking Control Input
The optimal feedback control is found by solving $\partial J_{e,k} / \partial u_{e,k} = 0$:

$u_{e,k}^* = -\frac{1}{2} R_e^{-1} g^T(x_k) \frac{\partial J(e_{k+1})}{\partial e_{k+1}}$.   (35)

The optimal control is then given by

$u_k = u_{d,k} - \frac{1}{2} R_e^{-1} g^T(x_k) \frac{\partial J(e_{k+1})}{\partial e_{k+1}}$,   (36)

where the feedforward term is found from (32) as

$u_{d,k} = g^{-1}(x_{d,k}) \left( x_{d,k+1} - f(x_{d,k}) \right)$.   (37)

The optimal control requires the nonlinear cost function, feedback, and feedforward values, which will be approximated by OLAs (online approximators).
Feedforward Control Input Approximation
The feedforward control input (37) is approximated using an OLA and written as

$\hat{u}_{d,k} = g^{-1}(x_{d,k}) \left( x_{d,k+1} - \hat{\theta}_{d,k}^T \phi_d(x_{d,k}) \right)$,   (38)

where the OLA $\hat{\theta}_{d,k}^T \phi_d(x_{d,k})$ estimates the unknown internal dynamics $f(x_{d,k})$.
The feedforward OLA update is derived using Lyapunov methods to be

$\hat{\theta}_{d,k+1} = \hat{\theta}_{d,k} + \alpha_d \phi_{d,k} \left( e_{k+1} - g(x_k) \hat{u}_{e,k} \right)^T$.   (39)

The estimate of the system control input (36) is then written as

$\hat{u}_k = \hat{u}_{d,k} + \hat{u}_{e,k}$.   (40)
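A small sketch of evaluating the feedforward term (37)-(38) (Python/NumPy; all values are placeholders): the pseudoinverse stands in for the inverse when g is not square.

```python
import numpy as np

def feedforward(g_d, x_d_next, f_d_hat):
    """u_d = g(x_d)^+ (x_{d,k+1} - f̂(x_d)), cf. eqs. (37)-(38)."""
    return np.linalg.pinv(g_d) @ (x_d_next - f_d_hat)

g_d = np.array([[0.0], [1.0]])            # input gain at the desired state
x_d_next = np.array([0.2, -0.1])          # next desired state x_{d,k+1}
f_d_hat = np.array([0.15, -0.3])          # OLA estimate of f(x_{d,k})
u_d = feedforward(g_d, x_d_next, f_d_hat)
```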
Remarks
Remark 1: For offline ADP, the optimal cost function is learned using a mesh of training points. In contrast, the optimal cost function is learned online in this work using the system's time history.
Remark 2: By construction, the feedback control input error $e_{a,k}$ becomes zero when the tracking error $e_k$ is zero. As a result, the feedback control estimator (41) cannot be updated once the tracking error has converged to zero.
To prevent the control input error from becoming zero before the cost function and nearly optimal control policy have been learned, external probing noise may be introduced into the nonlinear system dynamics to ensure the tracking error is persistently exciting.
Online Regulation and Tracking: Comparison
Nonlinear System
Consider the nonlinear system

$x_{k+1} = \begin{bmatrix} \sin(0.5\, x_{2,k})\, x_{1,k}^2 \\ \cos(1.4\, x_{2,k}) \sin(0.9\, x_{1,k}) \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} u_k$.

Desired trajectory:

$x_{d,k} = [\sin(0.25 k)\ \ \cos(0.25 k)]^T$.

Initial stabilizing control:

$u_{e,k} = -\begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix} e_k$.

No offline training is required to implement the proposed optimal tracking control scheme; 11 past values are used to form the auxiliary cost function error.
OLA and Parameter Design
The OLAs considered for implementation were neural networks:
10 hidden-layer neurons were used in both the cost function and feedback networks, along with the sigmoid activation function.
25 hidden-layer neurons were used in the feedforward estimator, along with the radial basis activation function.
Tuning gains were selected according to the theoretical results:
Cost function parameter tuning gain: $\alpha_{ec} = 0.1$.
Feedback control signal parameter tuning gain: $\alpha_{ea} = 0.1$.
Feedforward control signal parameter tuning gain: $\alpha_d = 0.05$.
All tunable parameters were initialized to zero.
Probing noise was generated from a random distribution with zero mean and variance of 0.04 to ensure the tracking error was persistently exciting.
Optimal Tracking Results
Fig. 18: Optimal tracking results.
Figs. 19-20: Results for optimal tracking control. All parameters remain bounded and converge to constant values, as the stability theorem suggests.
Hardware Implementation
Multi-modal Engine Control
The objective is to control an SI engine to minimize emissions and maximize fuel efficiency.
Engine dynamics are unknown; states are not available; the engine is a nonstrict-feedback nonlinear discrete-time system.
High levels of exhaust gas recirculation (EGR) dilute the cylinder contents. Although a stoichiometric equivalence ratio is maintained, partial fuel burns and misfires become frequent.
Lean combustion changes the equivalence ratio, causing partial fuel burns and misfires.
Control of the fuel input attempts to reduce the cyclic dispersion of heat release.
Lean Engine Control
Daw model equations: the states are total air and total fuel; the output is heat release (or acceleration).

$x_1(k+1) = AF(k) + F(k)\, x_1(k) + R\, F(k)\, (1 - CE(k))\, x_2(k) + d_1(k)$
$x_2(k+1) = (1 - CE(k))\, F(k)\, x_2(k) + MF(k) + u(k) + d_2(k)$
$y(k) = CE(k)\, x_2(k)$,

where $CE(k)$ is the combustion efficiency, a sigmoidal function of the equivalence ratio $\phi(k) = R\, x_2(k)/x_1(k)$ that saturates at $CE_{\max}$.
This model represents spark engine dynamics as a nonlinear system in nonstrict feedback form.

S. Jagannathan and P. He, "Neural network-based state feedback control of nonlinear discrete-time system in non-strict feedback form", IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 2073-2087, December 2008.
Lean Engine Control and Diagnostics—Experimental Results
The adaptive critic neural network controller is implemented together with an NN observer: the observer reconstructs the unmeasured states from the engine measurements, and the NN controller generates the fuel control input.
The adaptive critic NN controller for lean engine operation reduces NOx by over 90%, unburned hydrocarbons by 30%, and CO by 10% from stoichiometric levels. Fuel efficiency is enhanced by 10%. Misfires have been minimized.
Figure: Cyclic dispersion in heat release without (left) and with (right) control at an equivalence ratio of 0.76; return maps of HR(k+1) versus HR(k), measured and estimated heat release during misfire, and the control input u(k) in grams.
Lean Engine Control
Time series of heat release for an equivalence ratio set point of 0.77: the heat release output becomes more stable once the control action is initiated.
Return maps of current-cycle heat release plotted against the next cycle show 64% reduced cyclic dispersion (uncontrolled COV 0.3851 versus controlled COV 0.1364 at an equivalence ratio of 0.77 + 3.9%).
Fig. 21: Return maps of heat release without (left) and with (right) control.
Reinforcement Learning Based EGR
Block diagram: an NN observer estimates the states from the output y(k), a critic NN approximates the strategic utility function Q̂(k), and two action NNs generate the control input u(k) from the estimated tracking errors ê1(k) and ê2(k).
Experimental plot: controller output tracking error at an equivalence ratio of 0.72.

P. Shih, B. Kaul, S. Jagannathan, and J. Drallmeier, "Reinforcement learning based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation", IEEE Transactions on Neural Networks, vol. 19, no. 8, pp. 1369-1388, August 2008.
Experimental Data (EGR = 18%)
Controlled estimation errors: mean errors of -4.76e-6 and 3.09e-6 for the state estimates and -0.00229 for the output estimate.
Uncontrolled versus controlled return maps of heat release at EGR = 18% show reduced cyclic dispersion with the controller active.
Heat release and control input at EGR = 18%, with the controller turned on at k = 5242.
Fig. 22
Mobile Robot Formation Control
A wedge formation consisting of a leader L and two followers F1, F2 was considered, with desired separation $L_{ijd} = 0.8$.
Initial stabilizing controls for the leader i and each follower j:

$u_{eo}^i = 1.3\, g_i(k)\, e_{ic}(k)$,  $u_{eo}^j = 1.4\, g_j(k)\, e_{jc}(k)$.

To evaluate the performance of the optimal controller, a second, non-optimal NN control scheme was selected for each robot according to (Jagannathan, 2006):

$u_{NN}(k) = B^{-1}(k) \left( K e_c(k) + \hat{W}^T(k)\, \sigma(V^T z(k)) \right)$,

with the NN weights $\hat{W}(k)$ tuned online using the formation tracking error $e_c(k+1)$.

B. Brenner, T. Dierks, and S. Jagannathan, "Neuro optimal control of robot formations", Proc. of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 234-241, April 2011.
Experimental Results
Primary Components:
1. MST Mote
2. Microstrain Inertial Sensor
3. CMU2 Camera
4. US Digital Wheel Encoders
5. Sharp IR Sensors
Missouri S&T Robot Testbed
Experimental Results
Fig. 24: Optimal path of the leader and follower in the X-Y plane.
Fig. 25: Magnitudes of the feedback, cost, and feedforward weights versus time.
Experimental Results: With Obstacle Avoidance
Figs. 26-27: Robot paths under the optimal controller $\hat{u}_e$ and the non-optimal NN controller $u_{NN}$.
Experimental Results: With Obstacle Avoidance
Figs. 28-29: Robot paths under $\hat{u}_e$ and $u_{NN}$ with obstacle avoidance.
Total cost $J_{Total} = \sum_{k=0}^{k_{final}} J_e(e_c(k))$ accumulated by each robot under the two control policies:

Robot | $\hat{u}_e$ | $u_{NN}$
L     | 126.04      | 192.71
F     | 829.18      | 2635
Continuous-time RL/ADP Schemes
Dierks & Jagannathan, ACC 2010 - a single OLA to learn the HJB equation for the online optimal control of nonlinear continuous-time systems.
Figures: optimal regulator block diagram and optimal tracker block diagram.
Conclusions
RL using value and policy iteration-based schemes helps solve optimal control forward in time without needing the system dynamics.
Due to implementation challenges, online RL/ADP schemes without an iterative approach for optimal control are preferred.
Discrete-time design and implementation is preferred for hardware implementation, even though the Lyapunov first difference is nonlinear with respect to the states. Lyapunov theory guarantees that all signals remain bounded, while under ideal conditions the approximated control input approaches the optimal input asymptotically.
Simulation and hardware implementation results verify the theoretical results.