Transcript
  • NW Computational Intelligence Laboratory

    Basics of DHP type Adaptive Critics / Approximate Dynamic Programming

    and some application issues

    TUTORIAL

    George G. Lendaris
    NW Computational Intelligence Laboratory

    Portland State University, Portland, OR
    April 3, 2006

  • 2


    Basic Control Design Problem

    [Block diagram: Controller and Plant]

    Controller designer needs the following:
    -- Problem domain specifications
    -- Design objectives / Criteria for success
    -- All available a priori information about Plant and Environment

    High-Level Design Objective: Intelligent Control

  • 3


    In this tutorial, we focus on the Adaptive Critic Method (ACM)

    A methodology for designing an (approximately) optimal controller for a given plant according to a stated criterion, via a learning process.

    ACM may be implemented using two neural networks (or Fuzzy systems):
    ---> one in the role of controller, and
    ---> one in the role of critic.

  • 4


    Simple, 1-Hidden-Layer, Feedforward Neural Network [ to implement y = f(x) ]

    [Figure: network diagram -- inputs x1, x2 feed hidden nodes 3, 4 through weights w31, w32, w41, w42; hidden nodes feed output nodes 5, 6 through weights w53, w54, w63, w64; outputs o5 = y1, o6 = y2.]

    Can obtain partial derivatives via Dual of trained NN
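The "dual network" remark can be made concrete: for a trained 1-hidden-layer net, the partial derivatives dy/dx come out of a backward pass through the same weights. A minimal sketch, assuming tanh hidden units, linear outputs, and random stand-in weights (none of these choices come from the tutorial):

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass: hidden nodes use tanh, output nodes are linear."""
    o_h = np.tanh(W1 @ x)      # outputs of hidden nodes (3, 4)
    y = W2 @ o_h               # outputs y1, y2 (nodes 5, 6)
    return y, o_h

def dual(x, W1, W2):
    """Jacobian dy/dx obtained via the dual of the trained net:
    backpropagate through the same weights, scaled by tanh' at the
    hidden nodes."""
    _, o_h = forward(x, W1, W2)
    d_h = 1.0 - o_h ** 2       # derivative of tanh at hidden nodes
    return W2 @ (d_h[:, None] * W1)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input-to-hidden weights (w31, w32, w41, w42)
W2 = rng.normal(size=(2, 2))   # hidden-to-output weights (w53 ... w64)
x = np.array([0.3, -0.7])
J = dual(x, W1, W2)            # 2x2 matrix of partials dy_i / dx_j
```

A finite-difference check confirms the dual pass returns the true Jacobian of the forward map.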

  • 5


    Overview of Adaptive Critic method

    User provides the Design objectives / Criteria for success through a Utility Function, U(t) (local cost). Then, a new utility function is defined (Bellman Eqn.),

    [cost to go]

    which is to be minimized [~ Dynamic Programming].

    [We note: Bellman Recursion]

    $J(t) = \sum_{k=0}^{\infty} \gamma^k\, U(t+k)$

    $J(t) = U(t) + \gamma\, J(t+1)$
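The Bellman recursion above can be checked numerically on any finite utility sequence. A minimal sketch (the gamma value and the utility values are arbitrary stand-ins):

```python
GAMMA = 0.9  # discount factor (gamma of the Bellman Equation)

def cost_to_go(U, t, gamma=GAMMA):
    """J(t) = sum_k gamma^k U(t+k), truncated at the end of the sequence."""
    return sum(gamma ** k * U[t + k] for k in range(len(U) - t))

U_seq = [1.0, 0.5, 0.25, 0.0, 2.0]   # stand-in local costs U(t)
J0 = cost_to_go(U_seq, 0)
```

By construction, J(t) = U(t) + gamma * J(t+1) holds at every step of the sequence.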

  • 6


    Family of Adaptive Critic Methods: The critic approximates either J(t) or the gradient of J(t) wrt state vector R(t) [$\nabla J(R)$]

    Two members of this family:

    Heuristic Dynamic Programming (HDP): Critic approximates J(t) (cf. Q-Learning)

    Dual Heuristic Programming (DHP): Critic approximates $\nabla J(R(t)) = \lambda(t)$

    [Today, focus on DHP]

  • 7


    May describe DHP training process via two primary feedback loops:

  • 8


  • 9


    Desire a training Delta Rule for $w_{ij}$ to minimize cost-to-go J(t).

    Obtain this via $\partial J(t)/\partial w_{ij}(t)$ and the chain rule of differentiation.

    To develop a feel for the weight update rule, consider a partial block diagram and a little math (discrete time):

    [Block diagram: Controller (weights $w_{ij}$) receives R(t) and outputs u(t); the PLANT maps R(t), u(t) to R(t+1); the Critic evaluates J(t+1). The quantity $\partial J(t)/\partial w_{ij}(t)$ is fed back to the Controller.]

  • 10


    The weights in controller NN are updated with objective of minimizing J(t):

    $\Delta w_{ij}(t) = -\,lcoef \cdot \frac{\partial J(t)}{\partial w_{ij}(t)}$

    where

    $\frac{\partial J(t)}{\partial w_{ij}(t)} = \sum_{k=1}^{a} \frac{\partial J(t)}{\partial u_k(t)}\, \frac{\partial u_k(t)}{\partial w_{ij}(t)}$

    and

    $\frac{\partial J(t)}{\partial u_k(t)} = \frac{\partial U(t)}{\partial u_k(t)} + \gamma\, \frac{\partial J(t+1)}{\partial u_k(t)}$

    and

    $\frac{\partial J(t+1)}{\partial u_k(t)} = \sum_{s=1}^{ns} \frac{\partial J(t+1)}{\partial R_s(t+1)}\, \frac{\partial R_s(t+1)}{\partial u_k(t)}$

    Call the term $\frac{\partial J(t+1)}{\partial R_s(t+1)} = \lambda_s(t+1)$ (to be the output of the critic).

  • 11


    It follows that Controller training is based on:

    $\frac{\partial J(t)}{\partial u_k(t)} = \frac{\partial U(t)}{\partial u_k(t)} + \gamma \sum_{s=1}^{ns} \frac{\partial J(t+1)}{\partial R_s(t+1)}\, \frac{\partial R_s(t+1)}{\partial u_k(t)}$

    [$\partial J(t+1)/\partial R_s(t+1)$: via CRITIC; $\partial R_s(t+1)/\partial u_k(t)$: via Plant Model]

    Similarly, Critic training is based on:

    $\frac{\partial J(t)}{\partial R_s(t)} = \frac{dU(t)}{dR_s(t)} + \gamma \sum_{k=1}^{n} \frac{\partial J(t+1)}{\partial R_k(t+1)} \left[ \frac{\partial R_k(t+1)}{\partial R_s(t)} + \sum_{m} \frac{\partial R_k(t+1)}{\partial u_m(t)}\, \frac{\partial u_m(t)}{\partial R_s(t)} \right]$

    [the partial derivatives of R(t+1): via Plant Model]

    [Bellman Recursion & Chain Rule used in above.] A Plant model is needed to calculate the partial derivatives for DHP.
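The controller equation can be sanity-checked numerically. Below is a sketch on a hypothetical linear plant, where the critic's output lambda(t+1) is taken to be exact for a one-step-lookahead cost; the plant matrices, utility, and lambda are all illustrative assumptions, not the tutorial's pole-cart model:

```python
import numpy as np

GAMMA = 0.95
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # plant: R(t+1) = A R(t) + b u(t)
b = np.array([0.0, 1.0])

def U(R, u):
    """Local utility: quadratic state cost plus a small control cost."""
    return float(R @ R + 0.1 * u * u)

def step(R, u):
    return A @ R + b * u

# Suppose the critic has learned lambda(t+1) exactly for the terminal
# case J(t+1) = U(R(t+1), 0) = R(t+1).R(t+1):
def lam(R_next):
    return 2.0 * R_next

R = np.array([1.0, -0.5])
u = 0.3
R_next = step(R, u)
# Controller equation: dJ/du = dU/du + gamma * sum_s lambda_s(t+1) * dR_s(t+1)/du
# (lambda via CRITIC; dR(t+1)/du = b via Plant Model)
dJ_du = 0.2 * u + GAMMA * float(lam(R_next) @ b)
```

A central finite difference of the two-step cost J(u) = U(R,u) + gamma U(R(t+1),0) matches the chain-rule value, which is the whole content of the equation above.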

  • 12


  • 13


    Key ADP Process Parameters

    Specification of:

    -- state variables
    -- size of NN: structure, connection pattern, and type of activation functions
    -- learn rate of weight update rules
    -- discount factor (gamma of Bellman Equation)
    -- scaling of variable values (unit hypersphere)
    -- lesson plan for training
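One common reading of the "unit hypersphere" scaling item: divide each variable by its expected maximum magnitude so that training vectors fall inside the unit ball. A sketch (the ranges and the sqrt(n) normalization are assumptions, not the tutorial's prescription):

```python
import numpy as np

def scale_to_unit_ball(R, ranges):
    """Scale each variable by its assumed max magnitude, then by sqrt(n),
    so any state within the stated ranges has Euclidean norm <= 1."""
    z = np.asarray(R, dtype=float) / np.asarray(ranges, dtype=float)
    return z / np.sqrt(len(z))

state = [12.0, -0.4, 3.0]    # hypothetical raw state variables
ranges = [20.0, 0.5, 5.0]    # assumed maximum magnitudes
z = scale_to_unit_ball(state, ranges)
```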

  • 14


    Pole-Cart Benchmark Problem

  • 15


  • 16


    Experimental Procedure (lesson plan):

    A. Train 3 passes through sequence (5, -10, 20, -5, -20, 10) [degrees from vertical]. Train 30 sec. on each angle.

    B. Accumulate absolute values of U: C(1), C(2), C(3).

    C. Perform TEST pass through train sequence (30 sec. each angle). Accumulate U values: C(4).

    D. Perform GENERALIZE pass through sequence (-23, -18, -8, 3, 13, 23) [degrees from vertical] and accumulate U values: C(5).

    E. Perform GENERALIZE pass through sequence (-38, -33, 23, 38) [degrees from vertical] and accumulate U values: C(6).
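The pass/accumulate structure of the lesson plan can be sketched as follows. The utility function and time step here are toy stand-ins; the real C(k) scores accumulate |U| from the pole-cart simulation:

```python
def pass_score(angles, utility, seconds=30):
    """Accumulate |U| over one pass: `seconds` of control per start angle."""
    total = 0.0
    for angle in angles:
        for t in range(seconds):
            total += abs(utility(angle, t))
    return total

train_seq = (5, -10, 20, -5, -20, 10)       # degrees from vertical
general_seq = (-23, -18, -8, 3, 13, 23)
toy_U = lambda angle, t: angle / (1.0 + t)  # stand-in utility, decays in time
C4 = pass_score(train_seq, toy_U)           # TEST-pass score
C5 = pass_score(general_seq, toy_U)         # GENERALIZE-pass score
```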

  • 17


    STRATEGIES TO SOLVE EARLIER EQUATIONS:

    Strategy 1. Straight application of the equations.

    Strategy 2. Basic 2-stage process [flip/flop]. [e.g., Santiago/Werbos, Prokhorov/Wunsch] During stage 1, train criticNN, not actionNN; during stage 2, train actionNN, not criticNN.

    Strategy 3. Modified 1st stage of 2-stage process. While training criticNN during stage 1, keep parameters constant in the module that calculates the critic's desired output ($\lambda(R)$). Then adjust weights all at once at end of stage 1.

    Strategy 4. Single-stage process, using modifications introduced in Strategy 3.
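Strategy 2's flip/flop schedule can be sketched in outline; the actual NN update steps are replaced by stand-in log entries, so this shows only the alternation structure, not the tutorial's training code:

```python
def flip_flop(n_cycles, critic_epochs, action_epochs):
    """Alternate: stage 1 trains only criticNN, stage 2 only actionNN."""
    log = []
    for cycle in range(n_cycles):
        for _ in range(critic_epochs):   # stage 1: criticNN trains, actionNN frozen
            log.append(("critic", cycle))
        for _ in range(action_epochs):   # stage 2: actionNN trains, criticNN frozen
            log.append(("action", cycle))
    return log

schedule = flip_flop(n_cycles=2, critic_epochs=3, action_epochs=3)
```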

  • 18


    Progress of training process under each of the strategies. [Pole angles during the first 80 sec. of training. Note how fast Strategies 4a & 4b learn.]

  • 19


    Pole angles during part of test sequence. [Markings below and parallel to axis are plotting artifacts.]

    Controller Responses to Disturbances

  • 20


    The following table lists the value of the performance measure for each DHP training strategy, averaged over 4 separate training runs.

            1      2a     2b/3b  3a     4a     4b
    C(1)    368    740    883    209    107    160
    C(4)    3.2    3.1    2.9    2.8    2.4    2.2
    C(5)    5.1    7.9    7.4    7.5    6.2    5.5
    C(6)    30.5   24.1   25.4   20.7   14.5   14.0

    We observe in this table that strategies 4a & 4b converge the fastest [lowest value of C(1)], and also appear to yield the best controllers [lowest values in the C(4), C(5) & C(6) rows]. We note separately that strategy 1 (when it converges) yields controllers on par with the better ones.

  • 21


    S1: Simultaneous/Classical Strategy
    S2: Flip-Flop Strategy

    S4: Shadow Critic Strategy
    Introduce a copy of the criticNN in the lower loop. Run both loops simultaneously. Lower loop: train the copy for an epoch; upload the weight values from the copy (called shadow critic) into the active criticNN; repeat.

    S5: Shadow Controller Strategy
    Introduce a copy of the actionNN in the upper loop. Run both loops simultaneously. Upper loop: train the copy for an epoch; upload the weight values from the copy (called shadow controller) into the active actionNN; repeat.

    S6: Double Shadow Strategy
    Make use of the NN copies in both training loops. Run both loops simultaneously. Both loops: train the copies for an epoch; upload the weight values from the copies into their respective active NNs.
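The Shadow Critic upload cycle (S4) in outline, with plain weight lists standing in for the criticNN and a fixed stand-in gradient (this shows only the copy/train/upload rhythm, not the tutorial's networks):

```python
def shadow_critic_epochs(active, n_epochs, steps_per_epoch, lr=0.1, grad=0.1):
    """Each epoch: copy the active critic, train only the copy, then
    upload the copy's weights into the active critic at epoch end."""
    history = []
    for _ in range(n_epochs):
        shadow = list(active)              # fresh copy of the active criticNN
        for _ in range(steps_per_epoch):   # lower loop trains only the copy
            shadow = [w - lr * grad for w in shadow]
        active = list(shadow)              # upload shadow weights at epoch end
        history.append(list(active))
    return history

hist = shadow_critic_epochs(active=[0.5, -0.2], n_epochs=2, steps_per_epoch=4)
```

The active weights change only at epoch boundaries, which is the point of the strategy: the critic used in the other loop stays fixed within an epoch.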

    Lendaris, G.G., T.T. Shannon & A. Rustan (1999), "A Comparison of Training Algorithms for DHP Adaptive Critic Neuro-control," Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), Washington, DC: IEEE Press.

  • 22


    Autonomous Vehicle Example (Two-axle terrestrial vehicle)

    Objective: Demonstrate design via DHP (from scratch) of a steering controller which effects a Change of Lanes (on a straight highway).

    Controller Task: Specify sequence of steering angles to effect accelerations needed to change orientation of vehicle velocity vector.

  • 23


    Design Scenarios -- general:

    Maintain approximately constant forward velocity during lane change. Control angular velocity of wheel. Thus, Controller needs 2 outputs:

    1.) Steering angle
    2.) Wheel rotation velocity

  • 24


    Major unknown for controller: Coefficient of Friction (cof) between tire and road at any point in time.

    It can change abruptly (based on road conditions).

    Requires Robustness and/or fast on-line adaptation.

  • 25


    Tire side-force/side-slip-angle as a function of COF

  • 26


  • 27


  • 28


    Assume on-board sensors for constructing state information

    Only 3 accelerometers needed:
    -- x-direction accel. of chassis
    -- lateral accel. at rear axle
    -- lateral accel. at front axle

    Plus load cells at front & rear axles (to give estimate of c.g. location)

    [Simulation used analytic proxies]

  • 29


    Criteria for various Design Scenarios:

    1. Velocity Error -- reduce to zero.
    2. Velocity Rate -- limit rate of velocity corrections.
    3. Y-direction Error -- reduce to zero.
    4. Lateral front axle acceleration -- re. comfort design specification.
    5. Friction Sense -- estimate closeness to knee of cof curves.

  • 30


    Utility Functions for three Design Scenarios:
    [different combinations of above criteria]

    1. U(1,2,3)
    2. U(1,2,3,5)
    3. U(1,2,3,4,5)

    All applied to task of designing controller for autonomous 2-axle terrestrial vehicle.

  • 31


  • 32


  • 33


  • 34


    Design Scenario 2.

    Add Criterion 5 (friction sense) in U2. This is intended to allow aggressive lane changes on dry pavement, PLUS make lane changes on icy road conditions as aggressively as the icy road will allow.

  • 35


  • 36


  • 37


  • 38


  • 39


    Generalize Test: Ice patch at point where wheel angle is greatest in dry pavement case, retrained with coefficient modified.

  • 40


  • 41


  • 42


    Conclusions from Utility Function Expts.

    Controller Designs resulting via DHP satisfy the intuitive sense of being good. The Control Engineer knows that controller design requires careful specification of the objective, and that as design criteria change, the controllers change. For DHP, the control objectives are contained in the Utility Function. The DHP process embodied the different requirements for the three design scenarios in qualitatively distinct controllers -- all yielding intuitively good results, according to the design constraints.

  • 43


    Conclusions from Utility Function Expts., cont.

    Of particular interest to the present researchers is the ability of the DHP method to accept design criteria crafted into the Utility function by the human designers, and to then evolve a controller design whose response looks and feels like one a human designer might have designed.

  • 44


    Expanded task for Autonomous Vehicle Controller:

    Additional road condition of encountering an ice patch while doing lane change; also, accommodate side-wind load disturbances.

  • 45


  • 46


    Aircraft Control Augmentation System

  • 47


    System configuration during DHP Design of Controller Augmentation System

  • 48


    Pilot stick-x doublet signal (arbitrary scale in the Figure), and roll-rate responses of 3 aircraft: LoFLYTE w/Unaugmented control, LoFLYTE w/Augmented Control, and LoFLYTE*. (Note: Responses of latter two essentially coincide.)

  • 49


    Stick-x doublet: pilot's stick signal vs. augmented signal (the latter is sent to aircraft actuators)

  • 50


    Roll-rate error (for above stick-x signal) between LoFLYTE* and LoFLYTE w/Unaugmented Control, and between LoFLYTE* and LoFLYTE w/Augmented Control signals.

  • 51


    Blue: LoFLYTE w/Unaugmented control; Red: LoFLYTE w/Augmented Control; Black: LoFLYTE*  [Roll1]

  • 52


    Blue: LoFLYTE w/Unaugmented control; Red: LoFLYTE w/Augmented Control; Black: LoFLYTE*  [Roll2]

  • 53


    Pitch-rate error (for above stick-x signal) between LoFLYTE* and LoFLYTE w/Unaugmented Control, and between LoFLYTE* and LoFLYTE w/Augmented Control signals.

  • 54


    Yaw-rate error (for above stick-x signal) between LoFLYTE* and LoFLYTE w/Unaugmented Control, and between LoFLYTE* and LoFLYTE w/Augmented Control signals.

  • 55


    Augmentation commands for stick-y and pedal that the controller learned to provide to make the induced a) pitch (stick-y) and b) yaw (pedal) responses of LoFLYTE match those of LoFLYTE*.

  • 56


    Blue: LoFLYTE w/Unaugmented control; Red: LoFLYTE w/Augmented Control; Black: LoFLYTE*  [Pitch w/ cg Shift]

  • 57


    Conclusions re. Control Augmentation Example:

    The basic motivation behind this work is the desire to ultimately generate a non-linear controller that has control capabilities equivalent to those of an experienced pilot.

    A successful base demonstration was presented here.

  • 58


  • 59


    Perform search over Criterion Function (J function) Surface in Weight (State) Space


