
Page 1: WAYPOINT-BASED GENERALIZED ZEM/ZEV FEEDBACK GUIDANCE FOR PLANETARY LANDING …arclab.mit.edu/wp-content/uploads/2018/10/2017_07.pdf · 2018-10-20 · 1 WAYPOINT-BASED GENERALIZED


WAYPOINT-BASED GENERALIZED ZEM/ZEV FEEDBACK GUIDANCE FOR PLANETARY LANDING VIA A REINFORCEMENT LEARNING APPROACH

Roberto Furfaro,* Richard Linares†

Precision landing on large planetary bodies is a critical technology for future human and robotic exploration of the solar system. Indeed, over the past decade, landing systems for robotic Mars missions have been developed with the specific goal of deploying robotic agents (e.g. rovers, landers) on the Martian surface. In this paper, we propose a novel algorithm that can generate powered, closed-loop trajectories that enforce flight constraints (e.g. no crashing on sloped surfaces) while ensuring precision landing. More specifically, we propose a waypoint-based ZEM/ZEV algorithm that employs a dynamic programming approach via Value Iteration to determine the best location of the waypoints for constrained landings on large planetary bodies (e.g. Moon and Mars). Here, the Reinforcement Learning (RL) framework is employed to integrate ZEM/ZEV with a waypoint selection policy that is a function of the current state of the spacecraft during the powered descent phase (i.e. position and velocity). A set of open-loop, constrained, fuel-efficient trajectories is numerically computed using pseudo-spectral methods, and a set of states from these open-loop optimal trajectories is stored as candidate waypoints. The latter are employed by the ZEM/ZEV algorithm as intermediate targets to steer the spacecraft toward the final target point on the planetary surface. The problem is cast as a Markov Decision Process (MDP), and the resulting dynamic programming problem is solved via generalized policy evaluation to select the next best intermediate target point as a function of the previous one. The behavior of the integrated guidance algorithm is evaluated in Mars powered landing scenarios that involve demanding requirements both in landing location and flight path. Both constraint satisfaction and fuel efficiency are analyzed to show the effectiveness of the proposed approach.

INTRODUCTION

Precision landing on large planetary bodies is a critical technology for future human and robotic exploration of the solar system. Over the past decade, landing systems for robotic Mars missions have been developed with the specific goal of deploying robotic agents (e.g. rovers, landers)

* Associate Professor, Department of Systems and Industrial Engineering, Department of Aerospace and Mechanical Engineering, University of Arizona, 1127 E. James E. Roger Way, Tucson, Arizona, 85721, USA. † Assistant Professor, Department of Aerospace Engineering and Mechanics, University of Minnesota, Minnesota, USA.

AAS 17-112


on the Martian surface1,2. Considering the strong interest in sending humans to Mars within the next few decades, landing system technology will continue to progress to keep up with the demand for more stringent requirements. Indeed, more demanding planetary exploration requirements imply a technology development program that calls for more precise guidance systems capable of delivering rovers and/or landers with an ever higher degree of precision. For example, the landing accuracy, usually described by a 3-sigma landing ellipse, has been improved from 100 km (e.g. Phoenix Mission, MER) to the recent landing of the Mars Science Laboratory (MSL), designed to deliver the "Sky Crane" system to the surface with 10 km accuracy3. Despite these improvements, future missions may require pushing the technology toward systems capable of delivering spacecraft to Martian locations with pinpoint accuracy (< 10 m). Importantly, additional requirements to deliver landers to scientifically interesting areas may call for the development of guidance systems that optimally satisfy stringent constraints (e.g. do not crash on the surface due to elevated slopes).

Recently, substantial research efforts have been devoted toward determining reference open-loop trajectories that are fuel-optimal, i.e. trajectories that satisfy both the desired boundary conditions and any additional constraints while minimizing fuel usage4,5,6. Generally, fuel-efficient trajectories can only be computed numerically using either direct or indirect methods, i.e. no general analytical solutions exist. Solutions based on direct methods are generally obtained by converting the infinite-dimensional optimal control problem into a finite constrained Non-Linear Programming (NLP) problem7. Recently, Acikmese et al.6 devised a convex optimization approach where the minimum-fuel soft landing problem is cast as a Second Order Cone Programming (SOCP) problem. Consequently, the resulting optimization problem can be solved in polynomial time using interior-point method algorithms. In such a case, and for a prescribed accuracy, convergence to the global minimum is guaranteed within a finite number of iterations. The latter makes the method attractive for possible future on-board implementation.

However, solving optimal control problems to generate reference trajectories implies computing open-loop trajectories off-line. Conversely, generating closed-loop feedback landing trajectories that are rooted in optimal control theory is not an easy task. Recently, generalized Zero-Effort-Miss/Zero-Effort-Velocity (ZEM/ZEV) feedback guidance8 and its robustified version, known as Optimal Sliding Guidance (OSG)9, have been proposed and analyzed for planetary landing on large and small bodies. The ZEM/ZEV feedback guidance has been studied extensively and can be found in the literature for intercept, rendezvous, terminal guidance, and landing applications. Such analytical closed-loop guidance was originally conceived by Battin10, who devised an energy-optimal feedback acceleration command for powered planetary descent. Ebrahimi et al.11 introduced the ZEV concept as a partner to the well-known ZEM and integrated it with a sliding surface for missile guidance with fixed-time propulsive maneuvers. Furfaro et al.12 extended the idea to the problem of lunar landing guidance and set the basis for the theoretical development of a robust closed-loop algorithm for precision landing. The ZEM/ZEV feedback guidance is attractive because of its analytical simplicity as well as its quasi-optimal fuel performance in a constant gravitational field. When robustified by a time-dependent sliding term, the resulting OSG can be proven to be Globally Finite-Time Stable (GFTS) in spite of perturbations with known upper bound. The ZEM/ZEV feedback guidance and its robustified OSG version are generally simple to implement and mechanize on the onboard lander computer and can theoretically drive the spacecraft to a target landing site autonomously and with high precision. However, it is not generally capable of generating feedback constrained trajectories.
Whereas a few studies have been conducted on how to implement constraints in the ZEM/ZEV algorithmic formulation13, much needs to be done to address the generalized feedback ZEM/ZEV constrained problem.


In this paper, we propose a waypoint-based ZEM/ZEV algorithm that employs a Value Iteration approach to determine the best waypoints for constrained landing on large planetary bodies (e.g. Moon and Mars). More specifically, a Reinforcement Learning (RL) framework is constructed to integrate ZEM/ZEV with a waypoint selection policy that is a function of the current state of the spacecraft during the powered descent phase (i.e. position and velocity). Here, a set of open-loop, constrained, fuel-efficient trajectories is numerically computed using pseudo-spectral methods, and a set of states from the open-loop optimal trajectories is stored as candidate waypoints. The latter are employed by the ZEM/ZEV algorithm as intermediate targets to steer the spacecraft toward the final target point on the planetary surface. The problem is cast as a Markov Decision Process (MDP), and the resulting dynamic programming problem is solved via value iteration, using the action-state value function (Q-function) to select the next best intermediate target point as a function of the previous one. The behavior of the integrated guidance algorithm is evaluated in landing scenarios that involve demanding requirements both in landing location and flight path. Landing precision, constraint satisfaction, and fuel efficiency are analyzed via Monte Carlo simulations to show the effectiveness of the proposed approach.

GUIDANCE PROBLEM FORMULATION

In formulating the spacecraft landing guidance problem, we model the spacecraft dynamics using a 3-DOF dynamical model. Assuming that the spacecraft has negligible mass in comparison to the primary body, the equations of motion for the mass-varying guided system can be expressed as:

$\dot{\mathbf{r}} = \mathbf{v}$ (1)

$\dot{\mathbf{v}} = \mathbf{g}(\mathbf{r}) + \mathbf{a}_C + \mathbf{p} = \mathbf{g} + \frac{\mathbf{T}}{m} + \mathbf{p}$ (2)

$\dot{m} = -\frac{\|\mathbf{T}\|}{I_{sp}\, g_c}$ (3)

Here, $\mathbf{r}$ and $\mathbf{v}$ are the position and velocity vectors of the spacecraft, $\mathbf{a}_C$ is the commanded acceleration provided by the on-board thrusters, $\mathbf{T}$ is the thrust vector, $m$ is the spacecraft mass, $I_{sp}$ is the specific impulse of the propulsion system, and $\mathbf{g}$ is the gravitational acceleration. Importantly, $\mathbf{p}$ is the acceleration due to perturbations and unmodelled dynamics. Note that for the problem of terminal descent on a large planetary body, the gravitational acceleration can be assumed to be constant, i.e. $\mathbf{g}(\mathbf{r}) = \mathbf{g}$. Fig. 1 shows an example of the definition of the vectors for the powered descent scenario on a planetary body.
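For readers who want to experiment numerically, Eqs. (1)-(3) can be propagated directly. The sketch below is a minimal forward-Euler step; the gravity, thrust, and mass values are illustrative assumptions for the example (loosely inspired by the scenario later in the paper), not mission data, and the mass-flow constant $g_c$ is taken as the standard Earth gravity:

```python
import numpy as np

def descent_dynamics(state, thrust, g, isp, g0=9.80665):
    """Right-hand side of Eqs. (1)-(3); state = [r (3), v (3), m]."""
    v, m = state[3:6], state[6]
    a_c = thrust / m                               # commanded acceleration T/m
    r_dot = v                                      # Eq. (1)
    v_dot = g + a_c                                # Eq. (2); perturbation p omitted
    m_dot = -np.linalg.norm(thrust) / (isp * g0)   # Eq. (3); g_c taken as Earth g0
    return np.concatenate([r_dot, v_dot, [m_dot]])

# One forward-Euler step with illustrative numbers (not mission data)
g_mars = np.array([0.0, 0.0, -3.71])               # Mars surface gravity, m/s^2
state = np.array([0.0, 0.0, 1500.0,                # position (m)
                  0.0, 75.0, -75.0,                # velocity (m/s)
                  1905.0])                         # wet mass (kg)
thrust = np.array([0.0, 0.0, 8000.0])              # N, inside a 5-13 kN throttle range
dt = 0.1
state = state + dt * descent_dynamics(state, thrust, g_mars, isp=292.0)
```

A realistic propagation would use a higher-order integrator and include the perturbation term $\mathbf{p}$; the structure of the right-hand side is unchanged.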


Figure 1: Schematic of the forces and accelerations involved in the landing guidance problem.

GENERALIZED ZEM/ZEV FEEDBACK GUIDANCE ALGORITHM

The ZEM/ZEV feedback guidance is derived by setting an appropriate guidance model suitable for the analytical derivation of a closed-loop acceleration command. The model used in the ZEM/ZEV law derivation does not take into account any perturbing acceleration (i.e. $\mathbf{p} = \mathbf{0}$) and does not account for variations in the spacecraft mass, i.e. Eq. (3) does not apply. The equations representing the guidance model are expressed as follows:

$\dot{\mathbf{r}} = \mathbf{v}$ (4)

$\dot{\mathbf{v}} = \mathbf{g} + \mathbf{a}_C$ (5)

Eqs. (4) and (5) can be integrated starting from the knowledge of the position and velocity at time $t$ to formally determine the spacecraft position and velocity at a specified final time $t_f$:

$\mathbf{v}(t_f) = \mathbf{v}(t) + \int_t^{t_f} \left[\mathbf{g} + \mathbf{a}_C(\tau)\right] d\tau$ (6)

$\mathbf{r}(t_f) = \mathbf{r}(t) + \mathbf{v}(t)\, t_{go} + \int_t^{t_f}\!\int_t^{\tau} \left[\mathbf{g} + \mathbf{a}_C(\tau')\right] d\tau'\, d\tau$ (7)

Here, $t_{go} = t_f - t$ is defined as the time-to-go, i.e. the time required to reach the desired position (target) with the desired velocity. Subsequently, the following two zero-effort terms are defined.


Definition: Given the current time $t$, we define the Zero-Effort-Miss ($\mathbf{ZEM}$) as the vector distance by which the spacecraft will miss the target if no acceleration command (guidance) is generated after time $t$:

$\mathbf{ZEM}(t) = \mathbf{r}_f - \mathbf{r}(t_f), \quad \mathbf{a}_C(\tau) = \mathbf{0},\; \tau \in (t, t_f]$ (8)

Definition: Given the current time $t$, we define the Zero-Effort-Velocity ($\mathbf{ZEV}$) as the vector amount by which the spacecraft will miss the target velocity if no acceleration command (guidance) is generated after time $t$:

$\mathbf{ZEV}(t) = \mathbf{v}_f - \mathbf{v}(t_f), \quad \mathbf{a}_C(\tau) = \mathbf{0},\; \tau \in (t, t_f]$ (9)

Here, $\mathbf{r}_f$ and $\mathbf{v}_f$ are the desired position and velocity at the final time. Both $\mathbf{ZEM}(t)$ and $\mathbf{ZEV}(t)$ can be explicitly expressed as functions of the current position, velocity, and time-to-go by substituting Eqs. (6) and (7) with $\mathbf{a}_C = \mathbf{0}$ into Eqs. (8) and (9):

$\mathbf{ZEV}(t) = \mathbf{v}_f - \mathbf{v}(t) - \int_t^{t_f} \mathbf{g}\, d\tau$ (10)

$\mathbf{ZEM}(t) = \mathbf{r}_f - \mathbf{r}(t) - \mathbf{v}(t)\, t_{go} - \int_t^{t_f}\!\int_t^{\tau} \mathbf{g}\, d\tau'\, d\tau$ (11)
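For the constant-gravity case assumed here, the integrals in Eqs. (10) and (11) close in elementary form (the double integral of $\mathbf{g}$ collapses to $\frac{1}{2}\mathbf{g}\, t_{go}^2$), which a short helper can evaluate; the numbers below are purely illustrative:

```python
import numpy as np

def zem_zev(r, v, r_f, v_f, g, t_go):
    """Eqs. (10)-(11) evaluated in closed form for a constant gravity field."""
    zev = v_f - v - g * t_go                       # Eq. (10)
    zem = r_f - r - v * t_go - 0.5 * g * t_go**2   # Eq. (11)
    return zem, zev

# With no gravity and matched velocities, ZEM reduces to the position error
zem, zev = zem_zev(r=np.zeros(3), v=np.zeros(3),
                   r_f=np.array([100.0, 0.0, 0.0]), v_f=np.zeros(3),
                   g=np.zeros(3), t_go=10.0)
```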

The feedback guidance law is generated by solving an energy-optimal control problem as a function of $\mathbf{ZEM}$ and $\mathbf{ZEV}$. Indeed, given the actual spacecraft position and velocity, both quantities can be estimated on-line by numerical integration of the (unperturbed) equations of motion as functions of the time-to-go and the targeted conditions. One of the key ingredients is the ability to obtain a closed-loop guidance law that minimizes the overall guidance effort, i.e. a guidance law that minimizes the overall acceleration command. Let the optimal control problem in terms of $\mathbf{ZEM}$/$\mathbf{ZEV}$ be defined as follows:

$\min_{\mathbf{a}_C} J = \frac{1}{2}\int_t^{t_f} \mathbf{a}_C^T \mathbf{a}_C\, d\tau, \quad \text{subject to}$

$\dot{\mathbf{ZEM}} = -\mathbf{a}_C\, t_{go}$ (12)

$\dot{\mathbf{ZEV}} = -\mathbf{a}_C$

With the following boundary conditions:

$\mathbf{ZEM}(t_f) = \mathbf{0}$ (13)

$\mathbf{ZEV}(t_f) = \mathbf{0}$


The constraints in Eq. (12) represent the dynamics of the $\mathbf{ZEM}$/$\mathbf{ZEV}$ system. Such equations can be derived by taking the derivatives of Eqs. (10) and (11) and applying the Fundamental Theorem of Calculus:

$\dot{\mathbf{ZEV}} = \frac{d}{dt}\left[\mathbf{v}_f - \mathbf{v}(t) - \int_t^{t_f} \mathbf{g}\, d\tau\right] = -\dot{\mathbf{v}}(t) + \mathbf{g} = -\mathbf{a}_C$ (14)

$\dot{\mathbf{ZEM}} = \frac{d}{dt}\left[\mathbf{r}_f - \mathbf{r}(t) - \mathbf{v}(t)\, t_{go} - \int_t^{t_f}\!\int_t^{\tau} \mathbf{g}\, d\tau'\, d\tau\right] = -\dot{\mathbf{v}}(t)\, t_{go} + \mathbf{g}\, t_{go} = -\mathbf{a}_C\, t_{go}$ (15)

In order to find an analytical solution to the optimal guidance problem, the acceleration command is assumed to be unbounded. The problem is solved by a straightforward application of the Pontryagin Minimum Principle (PMP). First, the Hamiltonian is defined as follows:

$H = \frac{1}{2}\mathbf{a}_C^T \mathbf{a}_C + \boldsymbol{\lambda}_1^T\left(-\mathbf{a}_C\, t_{go}\right) + \boldsymbol{\lambda}_2^T\left(-\mathbf{a}_C\right)$ (16)

The condition for optimality of the Hamiltonian according to Pontryagin’s Minimum Principle is such that the following is true:

$\frac{\partial H}{\partial \mathbf{a}_C} = \mathbf{a}_C - \boldsymbol{\lambda}_1 t_{go} - \boldsymbol{\lambda}_2 = \mathbf{0}$ (17)

With associated costate equations found as:

$\dot{\boldsymbol{\lambda}}_1 = -\frac{\partial H}{\partial \mathbf{ZEM}} = \mathbf{0}$ (18)

$\dot{\boldsymbol{\lambda}}_2 = -\frac{\partial H}{\partial \mathbf{ZEV}} = \mathbf{0}$

Thus, the optimal acceleration command is found to be expressed as a linear function of time with two unknown coefficients, to be solved for by utilizing the final conditions of the equations of motion:

$\mathbf{a}_C = \boldsymbol{\lambda}_1 t_{go} + \boldsymbol{\lambda}_2$ (19)

Using this result, we can integrate the equations of motion to analytically solve for the Lagrange multipliers $\boldsymbol{\lambda}_1$ and $\boldsymbol{\lambda}_2$. Beginning with Eq. (14) and using the result above in Eq. (19), plus the boundary conditions in Eq. (13):

$\mathbf{ZEV}(t) = \boldsymbol{\lambda}_2 t_{go} + \frac{1}{2}\boldsymbol{\lambda}_1 t_{go}^2$ (20)


This result can be similarly employed to integrate Eq. (15) to find the following:

$\mathbf{ZEM}(t) = \frac{1}{2}\boldsymbol{\lambda}_2 t_{go}^2 + \frac{1}{3}\boldsymbol{\lambda}_1 t_{go}^3$ (21)

The Lagrange multipliers can therefore be found by substitution to express the acceleration command as a function of the current vehicle state, the desired final position, and the time-to-go:

$\boldsymbol{\lambda}_1 = \frac{12}{t_{go}^3}\mathbf{ZEM} - \frac{6}{t_{go}^2}\mathbf{ZEV}$ (22)

$\boldsymbol{\lambda}_2 = -\frac{6}{t_{go}^2}\mathbf{ZEM} + \frac{4}{t_{go}}\mathbf{ZEV}$

Thus, substituting into Eq. (19), the optimal acceleration is found to be the following:

$\mathbf{a}_C = \frac{6}{t_{go}^2}\mathbf{ZEM} - \frac{2}{t_{go}}\mathbf{ZEV}$ (23)
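The algebra leading from Eqs. (20)-(21) to Eqs. (22)-(23) can be checked numerically: pick arbitrary scalar values of ZEM, ZEV, and the time-to-go, solve the resulting 2×2 linear system for the multipliers, and compare against the closed forms (the numbers below are arbitrary test values):

```python
import numpy as np

# Eqs. (20)-(21) as a linear system in (lambda1, lambda2):
#   ZEV = (t_go^2/2) lambda1 + t_go       lambda2
#   ZEM = (t_go^3/3) lambda1 + (t_go^2/2) lambda2
zem, zev, t_go = 250.0, -40.0, 12.0            # arbitrary scalar test values
A = np.array([[t_go**2 / 2.0, t_go],
              [t_go**3 / 3.0, t_go**2 / 2.0]])
lam1, lam2 = np.linalg.solve(A, np.array([zev, zem]))

a_c_from_multipliers = lam1 * t_go + lam2      # Eq. (19)
a_c_closed_form = 6.0 * zem / t_go**2 - 2.0 * zev / t_go   # Eq. (23)
```

The two accelerations agree to machine precision, and `lam1`, `lam2` match the closed forms of Eq. (22).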

The methodology employed to determine the generalized feedback guidance algorithm as a function of ZEM and ZEV is very similar to the analysis presented by D'Souza,14 who derived the optimal acceleration command for a powered landing descent as a function of the error in position (actual position minus target position), the error in velocity (actual velocity minus target velocity), and the time-to-go. The same analysis was further developed by Guo et al.,8 who named the resulting algorithm the generalized ZEM/ZEV feedback guidance law. As a drawback, note that, as formulated, the ZEM/ZEV algorithm does not impose any constraints in terms of maximum thrust or minimum altitude. Nevertheless, the guidance algorithm is generally easy to implement and mechanize, which may justify the attractiveness of the guidance approach. Numerical simulations of the closed-loop trajectories may be analyzed a-posteriori to verify that both constraints are never violated, or alternatively to verify that the ZEM/ZEV algorithm works (i.e. guides the spacecraft to the target) even in the presence of thrust saturation.
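The law of Eq. (23) is compact enough to exercise directly in a closed-loop simulation. The sketch below is a minimal sanity check, not the paper's guidance implementation: it assumes constant Mars gravity, an unbounded acceleration command, illustrative initial conditions, and a simple semi-implicit Euler integrator:

```python
import numpy as np

def zem_zev_acceleration(r, v, r_f, v_f, g, t_go):
    """Eq. (23): a_C = (6/t_go^2) ZEM - (2/t_go) ZEV, constant gravity."""
    zem = r_f - r - v * t_go - 0.5 * g * t_go**2
    zev = v_f - v - g * t_go
    return 6.0 / t_go**2 * zem - 2.0 / t_go * zev

g = np.array([0.0, 0.0, -3.71])          # Mars gravity, m/s^2
r = np.array([0.0, -500.0, 1500.0])      # illustrative initial position (m)
v = np.array([0.0, 75.0, -75.0])         # illustrative initial velocity (m/s)
r_f, v_f = np.zeros(3), np.zeros(3)      # target: soft landing at the origin
t_f, dt, t = 60.0, 0.01, 0.0

while t_f - t > dt:                      # stop one step short of t_f
    a_c = zem_zev_acceleration(r, v, r_f, v_f, g, t_f - t)
    v = v + dt * (g + a_c)               # semi-implicit Euler step
    r = r + dt * v
    t += dt
```

Because the command is unbounded here, the a-posteriori checks discussed above (thrust saturation, minimum altitude) would still be required in practice.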

REINFORCEMENT LEARNING AND VALUE ITERATION

Sometimes called approximate dynamic programming, Reinforcement Learning15 is a machine learning technique concerned with how an agent (e.g. the lander) must take actions in uncertain environments to maximize a cumulative reward (e.g. minimizing the landing errors and the fuel consumption). The dynamical environment is conventionally formulated as a Markov Decision Process (MDP), where the transition between states for a given action is modeled using an appropriate transition probability. An MDP consists of:

• A set of states $S$ (where $s$ denotes a state, $s \in S$)

• A set of actions $A$ (where $a$ denotes an action, $a \in A$)


• A reward function $R(s) \rightarrow \mathbb{R}$ that maps a state (or possibly a state-action pair) to the set of real numbers

• State transition probabilities, which define the probability distribution over the new state $s' \in S$ that will be transitioned to given that action $a$ is taken while in state $s$

• An optional discount rate $\gamma$ that is typically used for infinite horizon problems

RL algorithms can solve/learn a policy $\pi(s): S \mapsto A$ that maps each state to an optimal action. Many RL-based algorithms approach the problem of determining the optimal policy indirectly, i.e. through the value function. For a given policy $\pi(s)$, the value function is defined as the expected sum of rewards given that the system is in state $s$ and executes policy $\pi(s)$:

$V^{\pi}(s) = E_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$ (24)

The value function can be expressed in terms of the reward associated with being in the current state plus the discounted expectation over the distribution of the next states, given that the controller is following policy $\pi(s)$:

$V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{sa}(s')\, V^{\pi}(s')$ (25)

The optimal value function $V^{*}(s)$ is generally found using dynamic programming algorithms, e.g. value iteration. Once the optimal value function is determined, the optimal policy is obtained by taking the action that maximizes the expected sum of future rewards for the given current state; using Eq. (25), the optimal policy can be formally expressed as follows:

$\pi^{*}(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V^{*}(s')$ (26)

An additional important function is the action-state value function (Q-function), defined as follows:

$Q^{\pi}(s,a) = E_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\right]$ (27)

The Q-function can be similarly decomposed into the immediate reward plus the discounted value of the successor state to obtain the Bellman equation for $Q^{\pi}$:

$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P_{sa}(s')\, Q^{\pi}(s', \pi(s'))$ (28)


The optimal policy is then formulated as follows:

$Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P_{sa}(s') \max_{a' \in A} Q^{*}(s',a')$ (29)

$\pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s,a)$ (30)

The optimal $Q^{*}(s,a)$ can be found by a Value Iteration approach (generalized policy evaluation). Starting from an initial random value for the Q-function, the contraction mapping theorem ensures convergence to the optimal value.
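As a toy illustration of this iteration (not the paper's 28-waypoint problem), consider a small deterministic MDP in which, mirroring the waypoint formulation below, the action set equals the state set and taking action $a$ jumps the system to state $a$; the reward matrix is invented for the example:

```python
import numpy as np

# 3 states; action a means "go to state a"; state 2 is the absorbing goal.
n = 3
R = np.array([[-1.0, -1.0, 10.0],    # R[s, a]: reaching the goal pays off,
              [-1.0, -1.0, 10.0],    # any intermediate hop costs -1
              [ 0.0,  0.0,  0.0]])   # no reward once at the goal
gamma = 0.9

rng = np.random.default_rng(0)
Q = rng.random((n, n))               # arbitrary random initialization
for _ in range(100):                 # Bellman backup of Eq. (29)
    V = Q.max(axis=1)                # max over a' of Q(s', a')
    V[2] = 0.0                       # no future reward past the absorbing goal
    Q = R + gamma * V[np.newaxis, :] # deterministic jump: s' = a

policy = Q.argmax(axis=1)            # Eq. (30)
```

With $\gamma = 0.9$ the backup is a contraction, so the random initialization is irrelevant; both non-goal states learn to jump directly to the goal.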

Figure 2: A) Comparison between open-loop, fuel-efficient constrained trajectories and the ZEM/ZEV guidance algorithm. B) Set of 9 open-loop, fuel-efficient constrained trajectories and selected waypoints. C) Spacecraft mass history (rightmost trajectory). D) Optimal thrust magnitude history (rightmost trajectory).

WAYPOINT ZEM/ZEV GUIDANCE VIA Q-VALUE ITERATION

The ZEM/ZEV algorithm has been developed under specific assumptions (e.g. unconstrained thrust/acceleration, constant gravitational field, etc.). As a result, the guided trajectories are not guaranteed to satisfy specific state/control constraints that may be necessary for a precise soft landing on a planetary surface. An example of such an occurrence is given in Fig. 2A. Here, we compare open-loop fuel-optimal descent trajectories (marked in red) with ZEM/ZEV guided trajectories (marked in blue). The open-loop trajectories have been numerically computed to satisfy a glide-slope constraint, where the trajectories are required not to go below a 30 deg terrain slope to avoid crashing on the surface of the planet. Clearly, while the open-loop optimal trajectories satisfy the constraint, the ZEM/ZEV feedback trajectories do not. To solve the problem, we consider a waypoint-based guidance scheme that combines the basic ZEM/ZEV algorithm with a waypoint selection scheme based on RL via value iteration. More specifically, we solve off-line a set of open-loop, constrained, fuel-efficient trajectories and select a finite number of states/waypoints (position and velocity) from the set of trajectories. Then, we employ dynamic programming, instantiated via value iteration, to find the optimal $Q^{*}(s,a)$ that enables the selection of action $a$ as the next waypoint. Importantly, the ZEM/ZEV algorithm is employed to drive the system between waypoints until the final (selected) landing site is reached.
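Structurally, the integrated scheme reduces to a short loop. In the sketch below, `fly_segment` is a hypothetical stand-in for one closed-loop ZEM/ZEV leg between waypoints, and the three-waypoint "descent" and hand-made policy are invented for illustration:

```python
def land_via_waypoints(start, goal, policy, waypoints, fly_segment):
    """Chain ZEM/ZEV legs: the learned policy picks each next waypoint,
    and fly_segment steers the vehicle between successive waypoints."""
    visited = [start]
    s = start
    while s != goal:
        a = policy[s]                        # next waypoint from pi*(s)
        fly_segment(waypoints[s], waypoints[a])
        visited.append(a)
        s = a
    return visited

# Illustrative 3-waypoint descent: 1500 m -> 500 m -> touchdown
waypoints = {0: (1500.0,), 1: (500.0,), 2: (0.0,)}
policy = {0: 1, 1: 2}                        # hand-made stand-in for pi*
path = land_via_waypoints(0, 2, policy, waypoints, lambda wp_a, wp_b: None)
```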

Waypoint Selection Based on Open-Loop Optimal Trajectories

We consider the Mars landing guidance problem subject to thrust and state constraints. A set of candidate waypoints (position and velocity) is selected by numerically computing open-loop, constrained fuel optimal trajectories that solve the following optimal control problem:

Minimum-Fuel Problem: Find the thrust program that minimizes the following cost function (negative of the lander final mass; equivalent to minimizing the amount of propellant during descent):

$\max_{t_F,\, \mathbf{T}(\cdot)} m_L(t_F) = \min_{t_F,\, \mathbf{T}(\cdot)} \int_0^{t_F} \|\mathbf{T}\|\, dt$ (31)

Subject to the following constraints (equations of motion):

$\ddot{\mathbf{r}}_L = -\mathbf{g}_L + \frac{\mathbf{T}}{m_L}$ (32)

$\frac{d}{dt} m_L = -\frac{\|\mathbf{T}\|}{I_{sp}\, g_c}$ (33)

and the following boundary conditions and additional constraints:

$0 < T_{min} < \|\mathbf{T}\| < T_{max}$ (34)

$\mathbf{r}_L(0) = \mathbf{r}_{L0}, \quad \mathbf{v}_L(0) = \dot{\mathbf{r}}_L(0) = \mathbf{v}_{L0}$ (35)

$\mathbf{r}_L(t_F) = \mathbf{r}_{LF}, \quad \mathbf{v}_L(t_F) = \dot{\mathbf{r}}_L(t_F) = \mathbf{v}_{LF}$ (36)

$m_L(0) = m_{L,wet}$ (37)

$\mathbf{T} = \|\mathbf{T}\| \sum_{i=1}^{3} \hat{t}_i, \quad \hat{t}_1 + \hat{t}_2 + \hat{t}_3 = 1$ (38)

$\theta_{alt}(t) = \arctan\left(\frac{\sqrt{r_{L2}(t)^2 + r_{L3}(t)^2}}{r_{L1}(t)}\right) \leq \bar{\theta}_{alt}$ (39)


Here, the thrust is constrained to operate between a minimum value ($T_{min}$) and a maximum value ($T_{max}$) and is subject to the glide-slope constraint of Eq. (39). The problem formulated in Eqs. (31)-(39) does not have an analytical solution and must be solved numerically. Open-loop, fuel-optimal thrust programs can be obtained using optimal control software packages such as the General Pseudospectral Optimal Control Software (GPOPS16). A set of nine (9) 2-D (vertical plane) fuel-optimal trajectories has been computed, initiated at an altitude of 1500 meters, with a downrange between −500 m and −100 m (vertical descent). For each case, both the initial descent and lateral (downrange) velocities have been kept constant, with values of −75 m/s and 75 m/s, respectively. The lander is assumed to be a small robotic vehicle with six throttlable engines ($I_{sp}$ = 292 sec). For these simulations, the only dynamical force included is the gravitational force of Mars, as seen in Eq. (32). The lander is assumed to weigh 1905 kg (wet mass) and is capable of a maximum (allowable) thrust of 13 kN as well as a minimum (allowable) thrust of 5 kN. The slope limit is set to $\bar{\theta}_{alt}$ = 10 deg. Results are shown in Fig. 2B-D, where the optimal constrained trajectories are reported together with a sample of the optimal thrust magnitude and spacecraft mass history. Fig. 2B also shows the selection of a set of waypoints along the trajectories. In this specific case, we considered three possible altitudes, i.e. 1500 m, 500 m, and 250 m. For each altitude, 9 points were selected, for a total of 28 waypoints (including the final landing state). At this stage there is no specific rationale for selecting points at those specified altitudes, but a parametric study will be conducted to understand how performance changes as a function of altitude. Simulation studies will show cases where 4 altitudes are considered, for a total of 37 waypoints.

Value Iteration for Waypoint Guidance: Computing the Q-function

Once the waypoints have been established, we employ RL and Value Iteration to determine the sequence of states that can be targeted by the ZEM/ZEV algorithm until soft landing is successfully achieved. In the algorithm, we establish a finite number of states and actions. More specifically, the number of states is set to the selected number of waypoints, i.e. $s \in S = \{WP_i\}_{i=1}^{N}$, where $WP_i = [\mathbf{r}_i, \mathbf{v}_i]^T$ is the i-th waypoint and $N$ is the number of selected waypoints. The set of actions $a \in A \equiv S$ is set exactly equal to the set of states. The rationale is that when the spacecraft is in state $s \in S$, the algorithm will select an action corresponding to the next state to be targeted by the ZEM/ZEV guidance algorithm. The system is trained by iteratively solving the Bellman optimality equation (Eq. (29)) until convergence to $Q^{*}(s,a)$. The optimal waypoint selection policy $\pi^{*}(s)$ is computed according to Eq. (30). One key point is to establish the reward function $R(s,a)$. In our case, we have the following prescription:

1. Reward is set to −1000 (negative) if the selected waypoint is at an altitude above the current state

2. Reward is set to −1000 (negative) if the selected waypoint generates a ZEM/ZEV guided trajectory that violates the glide-slope constraints

3. Reward is set to −1000 (negative) if the selected waypoint is achieved with an error of more than 1 m in position and more than 1 m/s in absolute velocity

4. Reward is set to the final spacecraft mass in any other case (positive reward)
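Items (1)-(4) map directly onto a small reward function. The sketch below assumes the outcome of the candidate segment (glide-slope flag, terminal errors, final mass) has already been simulated; the function signature is hypothetical, and item (3) is implemented with an "or" (penalizing either error exceeding its bound), the conservative reading of the prescription:

```python
def waypoint_reward(current_alt, wp_alt, glide_slope_violated,
                    pos_err_m, vel_err_ms, final_mass_kg,
                    penalty=-1000.0):
    """Reward prescription (1)-(4) for a candidate waypoint transition."""
    if wp_alt > current_alt:                   # (1) waypoint above current state
        return penalty
    if glide_slope_violated:                   # (2) segment crossed the glide slope
        return penalty
    if pos_err_m > 1.0 or vel_err_ms > 1.0:    # (3) 1 m / 1 m/s accuracy gate
        return penalty
    return final_mass_kg                       # (4) credit the remaining mass
```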

The general approach is to penalize the system if glide-slope constraints are violated during flight (2) and if accuracy requirements are not met (3). However, positive rewards are assigned in any other case, crediting higher points for lower propellant mass consumption (higher final mass). Fig. 3 shows the results of the Value Iteration for the case with $N = 28$ and waypoints selected as in Fig. 2B. The $Q^{*}(s,a)$ is a 28 × 28 matrix and converges to the optimal value in 30 iterations. For a given state, $Q^{*}(s,a)$ is employed to select the next action (i.e. the waypoint to target) based on the highest value of $Q^{*}(s,a)$.

Figure 3: A) $Q^{*}(s,a)$ after one iteration of the Bellman optimality equation. B) $Q^{*}(s,a)$ after two iterations of the Bellman optimality equation. C) Converged $Q^{*}(s,a)$. D) Top view of $Q^{*}(s,a)$. Values are encoded in a colorbar for visual representation.

RESULTS AND PERFORMANCE ANALYSIS

Once the 𝑄𝑄∗(𝑠𝑠,𝑎𝑎) is determined via Value Iteration, the Q-function is employed to select the sequence of waypoints that ensures constraints satisfaction. The resulting waypoint-based ZEM/ZEV guidance algorithm is simulated to parametrically understand performances as function of the number of altitude of the selected points. The simulations assume the same 2-D Mars powered descent scenarios employed to compute the open-loop optimal descent trajectories. Fig. 4, 5 and 6 show the guided trajectories generated by the waypoint-based ZEM/ZEV feedback guidance and the mass propellant difference with respect to the correpondent fuel-optimal solution as function of the altitude and number of the selected waypoints. In the case represented in Fig. 4, the selected points are located at an altitude 1500 𝑚𝑚, 1000 𝑚𝑚, 250 𝑚𝑚 for a total of 28 waypoints. In all cases, the glide-slope constraints are not violated. Case one (rightmost) is the most stressful as the algorithm selects a sequence of points that ensure maintaining constraints. However, the price to avoid crashing on the ground is to select a sequence that yieldsa substantially larger amount of propellant. Fig. 5 shows a set of simulations where the 28 waypoints are located at an altitude of 1500 𝑚𝑚, 500 𝑚𝑚, 75 𝑚𝑚. In this case, constaints are maintained with a reasonable amount of additional propellant (less than 12%). Fig. 6 shows a


case where 37 waypoints were selected at altitudes of 1500 m, 500 m, 250 m, and 75 m. Performance is reasonable, but no improvement in propellant consumption is recorded.
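
The guided trajectories above steer toward each intermediate waypoint with the ZEM/ZEV feedback law. A minimal sketch of the classical constant-gravity form of that law is given below; the function name, the 2-D (downrange, altitude) state vectors, the time-to-go, and the example waypoint values are illustrative assumptions, not the paper's actual numbers.

```python
import numpy as np

def zem_zev_accel(r, v, r_f, v_f, t_go, g):
    """Classical ZEM/ZEV feedback acceleration command toward the target
    state (r_f, v_f), assuming constant gravity g and time-to-go t_go."""
    zem = r_f - (r + v * t_go + 0.5 * g * t_go**2)  # zero-effort miss
    zev = v_f - (v + g * t_go)                      # zero-effort velocity
    return 6.0 * zem / t_go**2 - 2.0 * zev / t_go

g = np.array([0.0, -3.71])        # Mars surface gravity, m/s^2
r = np.array([250.0, 1500.0])     # downrange, altitude (m); example values
v = np.array([-75.0, -75.0])      # velocity (m/s); example values
r_f = np.array([100.0, 1000.0])   # hypothetical intermediate waypoint
v_f = np.array([-40.0, -50.0])
a_cmd = zem_zev_accel(r, v, r_f, v_f, t_go=20.0, g=g)
```

As a sanity check on the sketch, if the target state lies exactly on the current ballistic trajectory, both ZEM and ZEV vanish and the commanded acceleration is zero.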

Figure 4: Waypoint-based ZEM/ZEV guidance evaluation for the 28-waypoint scheme. The altitudes of the waypoints are selected to be 1500 m, 1000 m, and 250 m. The blue lines show the closed-loop descent trajectories for four different initial conditions. The waypoints are reported to illustrate the selection strategy. Panel B shows the propellant mass increase w.r.t. the optimal solution.

Figure 5: Waypoint-based ZEM/ZEV guidance evaluation for the 28-waypoint scheme. The altitudes of the waypoints are selected to be 1500 m, 500 m, and 75 m. The blue lines show the closed-loop descent trajectories for four different initial conditions. The waypoints are reported to illustrate the selection strategy. Panel B shows the propellant mass increase w.r.t. the optimal solution.


Figure 6: Waypoint-based ZEM/ZEV guidance evaluation for the 37-waypoint scheme. The altitudes of the waypoints are selected to be 1500 m, 500 m, 250 m, and 75 m. The blue lines show the closed-loop descent trajectories for four different initial conditions. The waypoints are reported to illustrate the selection strategy. Panel B shows the propellant mass increase w.r.t. the optimal solution.

Figure 7: Monte Carlo simulations. 500 closed-loop trajectories have been generated to show the waypoint selection scheme and accuracy. The right panel shows the histogram of the additional propellant mass w.r.t. the fuel-optimal solution.

Monte Carlo Simulations

A set of 500 Monte Carlo simulations is executed to evaluate the behavior of the proposed algorithm. For a similar Mars descent scenario, the downrange position has been randomly drawn from a Gaussian distribution with a mean downrange of 250 m and a standard deviation of 100 m (1σ). Likewise, the downrange velocity has been drawn from a Gaussian distribution with a mean of 75 m/s and a standard deviation of 10 m/s (1σ). The mean vertical velocity is 75 m/s with a standard deviation of 10 m/s. The initial altitude is set at 1500 m, and 28 waypoints have been selected from the open-loop optimal trajectories at altitudes of


1500 m, 500 m, and 75 m. The initial state s ∈ S is selected based on minimum Euclidean distance. Results are reported in Fig. 7. Here, the guided trajectories are reported together with the propellant mass difference (with respect to the fuel-efficient case). Importantly, the glide-slope constraints are satisfied for all cases. The sequence of target waypoints is selected by the Q*(s, a) function depending on the initial conditions. The average mass increase is computed to be 14%.
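
The Monte Carlo setup above can be sketched as follows: initial conditions are drawn from the stated Gaussians, and the initial MDP state is the stored candidate state nearest (in the Euclidean sense) to the sampled state. The table of stored states, the sign conventions, and the equal weighting of position and velocity components are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs = 500

# Initial conditions per the text (descending flight, so velocities negative)
downrange   = rng.normal(250.0, 100.0, n_runs)   # m,   mean 250, 1-sigma 100
v_downrange = rng.normal(-75.0, 10.0, n_runs)    # m/s, mean 75,  1-sigma 10
v_vertical  = rng.normal(-75.0, 10.0, n_runs)    # m/s, mean 75,  1-sigma 10
altitude    = np.full(n_runs, 1500.0)            # m, fixed initial altitude

samples = np.column_stack([downrange, altitude, v_downrange, v_vertical])

# Hypothetical table of 28 stored candidate states (position, velocity)
stored_states = rng.uniform([0.0, 1400.0, -100.0, -100.0],
                            [500.0, 1600.0, -50.0, -50.0], (28, 4))

def nearest_state(sample, states):
    """Index of the stored state closest to `sample` in Euclidean norm."""
    return int(np.argmin(np.linalg.norm(states - sample, axis=1)))

s0 = nearest_state(samples[0], stored_states)
```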

CONCLUSIONS AND FUTURE WORK

In this paper, we proposed, implemented and analyzed a novel closed-loop algorithm for planetary powered descent guidance. The waypoint-based ZEM/ZEV feedback guidance integrates the generalized ZEM/ZEV feedback guidance into a waypoint selection scheme to ensure that flight constraints are satisfied by targeting appropriate states distributed along the descent state space. Such points are determined by computing a set of open-loop, fuel-optimal trajectories and selecting them at fixed selected altitudes. A Reinforcement Learning framework is subsequently employed to determine the optimal selection strategy. The Value Iteration scheme solves the resulting dynamic programming problem and computes the optimal Q-function. The latter is directly interrogated to select the next waypoint until the desired landing location is achieved within the desired accuracy. Simulations show that flight constraints are always satisfied with relatively low additional propellant, depending on the initial position and velocity. Monte Carlo simulations show that the average additional propellant of the proposed closed-loop guidance is about 14% with respect to the computed open-loop solution.
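
The Value Iteration scheme referenced above can be illustrated with a minimal tabular sketch of the Bellman optimality backup. The toy deterministic transition model `P`, reward table `R`, and discount factor are assumptions; only the 28 × 28 state-action dimensions come from the text, and the paper's actual MDP is not reproduced here.

```python
import numpy as np

n_s, n_a = 28, 28
rng = np.random.default_rng(2)
P = rng.integers(0, n_s, size=(n_s, n_a))  # P[s, a] = next state (toy model)
R = rng.normal(size=(n_s, n_a))            # R[s, a] = immediate reward (toy)
gamma = 0.95                               # discount factor (assumed)

Q = np.zeros((n_s, n_a))
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) <- R(s,a) + gamma * max_a' Q(s',a')
    Q_new = R + gamma * Q[P].max(axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        Q = Q_new
        break
    Q = Q_new
```

Because the backup is a contraction with modulus gamma, the iteration converges to the unique fixed point Q*, which is then interrogated greedily to pick the next waypoint.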

This preliminary study demonstrates the feasibility of the proposed approach. However, additional analyses are required to evaluate whether a substantial increase in the number of waypoints may yield closed-loop solutions with performance near the open-loop, fuel-efficient constrained solutions. Ideally, increasing the number of points may give the algorithm more flexibility to achieve the desired goal. However, it may also increase the computational time required for the Value Iteration to converge. Note that the proposed method assumes that both sets a ∈ A and s ∈ S are finite. However, one can extend this approach to design an algorithm capable of generating a waypoint selection strategy as a function of any point in the state space. We are working on an efficient implementation of the Q-learning method to train a neural network capable of learning an approximation of the Q-function that models the waypoint selection as a function of the current position and velocity. Within this framework, the Stochastic Gradient Descent (SGD) algorithm is employed to compute the optimal network parameters in a simulated landing environment. Future work will report on the ongoing Q-learning effort.
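
For reference, the temporal-difference update at the core of the Q-learning method mentioned above can be sketched in tabular form (the paper targets a neural-network approximation trained with SGD; the tabular version below only illustrates the update rule, and all numerical values are assumptions).

```python
import numpy as np

n_s, n_a = 28, 28
Q = np.zeros((n_s, n_a))
alpha, gamma = 0.1, 0.95   # learning rate and discount (assumed values)

def q_learning_step(Q, s, a, r, s_next):
    """One Q-learning update toward the TD target r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example transition: state 0, action 3, reward 1.0, next state 5
Q = q_learning_step(Q, s=0, a=3, r=1.0, s_next=5)
```

Starting from an all-zero table, this single update moves Q(0, 3) from 0 to alpha × 1.0 = 0.1, which is the expected behavior of the rule.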

REFERENCES

1. Shotwell, R. (2005). "Phoenix—The First Mars Scout Mission." Acta Astronautica, 57(2), 121–134.

2. Grotzinger, J. P., Crisp, J., Vasavada, A. R., Anderson, R. C., Baker, C. J., Barry, R., ... and Gellert, R. (2012). "Mars Science Laboratory Mission and Science Investigation." Space Science Reviews, 170(1–4), 5–56.

3. Steltzner, A. D., Kipp, D. M., Chen, A., Burkhart, P. D., Guernsey, C. S., Mendeck, G. F., Mitcheltree, R. A., Powell, R. W., Rivellini, T. P., San Martin, A. M., and Way, D. W. (2006). "Mars Science Laboratory Entry, Descent, and Landing System." IEEE Aerospace Conference Paper No. 2006-1497, Big Sky, MT.

4. Topcu, U., Casoliva, J., and Mease, K. (2005). "Fuel Efficient Powered Descent Guidance for Mars Landing." AIAA Paper 2005-6286.

5. Najson, F., and Mease, K. (2005). "A Computationally Non-Expensive Guidance Algorithm for Fuel Efficient Soft Landing." AIAA Guidance, Navigation, and Control Conference, San Francisco, AIAA Paper 2005-6289.


6. Acikmese, B., and Ploen, S. R. (2007). "Convex Programming Approach to Powered Descent Guidance for Mars Landing." Journal of Guidance, Control, and Dynamics, 30(5), 1353–1366.

7. Benson, D. A., Huntington, G. T., Thorvaldsen, T. P., and Rao, A. V. (2006). "Direct Trajectory Optimization and Costate Estimation via an Orthogonal Collocation Method." Journal of Guidance, Control, and Dynamics, 29(6), 1435–1440.

8. Guo, Y., Hawkins, M., and Wie, B. (2013). "Applications of Generalized Zero-Effort-Miss/Zero-Effort-Velocity Feedback Guidance Algorithm." Journal of Guidance, Control, and Dynamics, 36(3), 810–820.

9. Wibben, D. R., and Furfaro, R. (2016). "Optimal Sliding Guidance Algorithm for Mars Powered Descent Phase." Advances in Space Research, 57(4), 948–961.

10. Battin, R. H. (1999). An Introduction to the Mathematics and Methods of Astrodynamics. AIAA.

11. Ebrahimi, B., Bahrami, M., and Roshanian, J. (2008). "Optimal Sliding-Mode Guidance with Terminal Velocity Constraint for Fixed-Interval Propulsive Maneuvers." Acta Astronautica, 62(10), 556–562.

12. Furfaro, R., Selnick, S., Cupples, M. L., and Cribb, M. W. (2011). "Non-Linear Sliding Guidance Algorithms for Precision Lunar Landing." 21st AAS/AIAA Space Flight Mechanics Meeting.

13. Guo, Y., Hawkins, M., and Wie, B. (2013). "Waypoint-Optimized Zero-Effort-Miss/Zero-Effort-Velocity Feedback Guidance for Mars Landing." Journal of Guidance, Control, and Dynamics, 36(3), 799–809.

14. D'Souza, C. (1997). "An Optimal Guidance Law for Planetary Landing." AIAA Guidance, Navigation, and Control Conference, AIAA Paper 1997-3709.

15. Sutton, R., and Barto, A. (1998). Reinforcement Learning. MIT Press, pp. 100–103.

16. Rao, A. V., Benson, D. A., Darby, C., Patterson, M. A., Francolin, C., Sanders, I., et al. (2010). "Algorithm 902: GPOPS, a MATLAB Software for Solving Multiple-Phase Optimal Control Problems Using the Gauss Pseudospectral Method." ACM Transactions on Mathematical Software, 37(2), 22:1–22:39.