Adaptive Generalized ZEM-ZEV Feedback Guidance for Planetary Landing via a Deep Reinforcement
Learning Approach
Roberto Furfaro1,∗
Department of Systems & Industrial Engineering, Department of Aerospace and Mechanical
Engineering, University of Arizona, Tucson, AZ 85721, USA
Andrea Scorsoglio2
Department of Systems & Industrial Engineering, University of Arizona, Tucson, AZ
85721, USA
Richard Linares3
Department of Aeronautics and Astronautics, Massachusetts Institute of Technology,
Cambridge, MA, 02139, USA
Mauro Massari4
Department of Aerospace Science and Technology, Politecnico di Milano, Milan, 20156, ITA
Abstract
Precision landing on large and small planetary bodies is a technology of utmost
importance for future human and robotic exploration of the solar system. In
this context, the Zero-Effort-Miss/Zero-Effort-Velocity (ZEM/ZEV) feedback
guidance algorithm has been studied extensively and is still a field of active
research. The algorithm, although powerful in terms of accuracy and ease of
implementation, has some limitations. Therefore, in this paper we present
an adaptive guidance algorithm based on classical ZEM/ZEV in which machine
learning is used to overcome these limitations and create a closed-loop guidance
algorithm that is sufficiently lightweight to be implemented on board spacecraft
and flexible enough to adapt to the given constraint scenario. The
adopted methodology is an actor-critic reinforcement learning algorithm that
learns the parameters of the above-mentioned guidance architecture according
to the given problem constraints.
Keywords: Optimal Landing Guidance, Deep Reinforcement Learning,
Closed-loop Guidance
2019 MSC: 00-01, 99-00
∗ Corresponding author: Roberto Furfaro
1 Professor, Department of Systems & Industrial Engineering, Department of Aerospace and
Mechanical Engineering, University of Arizona, Tucson, AZ 85721, USA
2 PhD student, Department of Systems & Industrial Engineering, University of Arizona,
Tucson, AZ 85721, USA
3 Charles Stark Draper Assistant Professor, Department of Aeronautics and Astronautics,
Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
4 Associate Professor, Department of Aerospace Science and Technology, Politecnico di
Milano, Milan, 20156, ITA
Preprint submitted to Acta Astronautica February 2, 2020
1. Introduction
Precision landing on large and small planetary bodies is a technology of
utmost importance for future human and robotic exploration of the solar system.
Over the past two decades, landing systems for robotic Mars missions
have been developed and have successfully deployed robotic assets on the Martian
surface (e.g. rovers, landers) [1, 2]. Considering the strong interest in sending
humans to Mars within the next few decades, as well as the renewed interest
in building infrastructure in the Earth-Moon system for easy access to the lunar
surface [3], landing system technology will need to progress to satisfy
more stringent requirements. These will call for guidance
systems capable of delivering landers and/or rovers to the selected planetary
surface with a higher degree of precision. In the case of robotic Mars landing,
the 3-sigma ellipse, which describes the landing accuracy, has seen a dramatic
improvement from the 100 km [1] required by the Phoenix mission to the 5 km
featured by the newly developed "Sky Crane" system, which delivered the Mars
Science Laboratory (MSL) to the Martian surface in 2012 [4]. Although such
improvements were needed to deliver better science through robotic devices,
future missions may require delivering cargo (including humans) to specified
locations with pinpoint accuracy (somewhere between 10 and 100 meters).
Importantly, delivering scientific packages to geologically interesting locations may
require guidance systems that are fuel-optimal while satisfying stringent flight
constraints (e.g. avoiding impact with steeply sloped terrain).
One of the most important enabling technologies for planetary landing is the
powered descent guidance algorithm. Generally, powered descent indicates a
phase in the landing concept of operations where rockets provide the necessary
thrust to steer the spacecraft trajectory toward the desired location on the
planetary surface. The corresponding guidance algorithm must determine in real time
the thrust magnitude, the thrust direction and the time of flight. The original Apollo
guidance algorithm, which was used to drive the Lunar Excursion Module (LEM)
to the lunar surface, was based on an iterative approach that computed on the
ground a flyable reference trajectory in the form of a quartic polynomial [5].
The real-time guidance algorithm generated an acceleration command that
targeted the final condition of the trajectory. A variation of the Apollo guidance
was also employed for the MSL powered descent phase [6]. Over the past two
decades, there has been tremendous interest in developing new classes of
guidance algorithms for powered descent that improve on the classical
Apollo algorithm in both precision and fuel efficiency. Trajectory optimization
methods are currently playing a major role in generating feasible, fuel-efficient
trajectories that can potentially be computed in real time. Much effort has been
placed in transforming the fuel-optimal constrained landing problem into a convex
optimization problem that guarantees finding the globally optimal solution in
polynomial time [7, 8]. This approach yielded the G-FOLD algorithm [10],
which has recently been tested on real landing systems. Importantly, the
convexification methodology has recently been applied to other aerospace guidance
problems. A review of the application areas can be found in [11]. Conversely,
another class of popular methods generally employed to solve optimal guidance
problems relies on the application of Pontryagin's Minimum Principle (PMP).
Named indirect methods, such algorithms solve the Two-Point Boundary Value
Problems (TPBVP) arising from the necessary conditions for optimality.
Recently, a three-dimensional, fuel-optimal, powered descent guidance algorithm
based on indirect methods has been developed [12]. The approach, generally
named Universal Powered Guidance (UPG), relies on a general powered descent
methodology which has been developed and applied to ascent and orbital
transfer problems by Ping Lu over the past decade [14, 15, 16]. The algorithm is
capable of delivering both humans and robotic devices to planetary surfaces
efficiently and accurately [13]. UPG provides a robust approach based on indirect
methods to 1) analyze the thrust profile structure (i.e. the bang-bang
profile) and 2) find the optimal number of burn times. Importantly, its
advantage over G-FOLD lies in its simplicity and flexibility, as it does not
require customization of the algorithm [12]. However, UPG has the
disadvantage that both inequality and thrust direction constraints are generally difficult
to enforce [12].
Besides the above-mentioned methods, over the past few years researchers
have been exploring the performance of the generalized Zero-Effort-Miss/Zero-Effort-Velocity
(ZEM/ZEV) feedback guidance algorithm [23, 24] in the context
of landing on large and small bodies of the solar system. The feedback
ZEM/ZEV guidance law is analytical in nature and is derived by a straightforward
application of optimal control theory to the powered descent landing
problem. The algorithm generates a closed-loop acceleration command that
minimizes the overall system energy (i.e. the integral of the square of the
acceleration norm). The ZEM/ZEV feedback guidance is attractive because of its
analytical simplicity and accuracy: guidance mechanization is straightforward
and it can theoretically drive the spacecraft to a target location on the planetary
surface both autonomously and with minimal guidance errors. Moreover, it has
been shown to be globally finite-time stable and robust to uncertainties in the
model if a proper sliding parameter is added (Optimal Sliding Guidance) [26].
Although attractive because of its simplicity and analytical structure, the
algorithm is not generally capable of enforcing thrust constraints or
flight constraints. There have been attempts to incorporate constraints in the
classical ZEM/ZEV algorithm through the use of intermediate waypoints
[17, 25]. Although these report good performance, they lack flexibility and the
ability to adapt in real time.
In this paper, we propose a ZEM/ZEV-based guidance algorithm for powered
descent landing that can adaptively change both the guidance gains and the
time-to-go to generate a class of closed-loop trajectories that 1) are quasi-optimal
(w.r.t. fuel efficiency) and 2) satisfy flight constraints (e.g. thrust
constraints, glide slope). The proposed algorithm exploits recent advancements in
deep reinforcement learning (e.g. deterministic policy gradient [29]) and
machine learning (e.g. Extreme Learning Machines, ELM [27, 28]). The overall
structure of the guidance algorithm is unchanged with respect to the classical
ZEM/ZEV, but the optimal guidance gains are determined at each time step as a
function of the state via a parametrized learned policy. This is achieved using
a deep reinforcement learning method based on an actor-critic algorithm that
learns the optimal policy parameters by minimizing a specific cost function. The
policy is stochastic, but only its mean, expressed as a linear combination of
radial basis functions, is updated by stochastic gradient descent. The variance
of the policy is kept constant and is used to ensure exploration of the state
space. The critic is an Extreme Learning Machine (ELM) that approximates
the value function. The approximated value function is then used by the actor
to update the policy. The power of the method resides in its capability, if an
adequate cost function is introduced, of satisfying virtually any constraint, and
in its model-free nature which, given a sufficiently accurate dynamics simulator for
the generation of sample trajectories, allows the guidance law to be learned in any
environment, regardless of its properties. This greatly expands the capabilities
of classical ZEM/ZEV guidance, allowing for its use in a wide variety of
environment and constraint combinations and giving results that are generally close to the
constrained fuel-optimal off-line solution. Additionally, because the guidance
structure is left virtually unchanged, we are able to ensure that the adaptive
algorithm remains globally stable regardless of the gain adaptation.
The paper is organized as follows. In Section 2, the landing problem setup
is described. In Section 3, the theoretical background is provided, including the
derivation of the classical ZEM/ZEV algorithm, the deep actor-critic method and
ELM. In Section 4, the proposed adaptive ZEM/ZEV algorithm is described. In
Section 5, numerical results for Mars landing scenarios are reported. In Section
6, a stability analysis is conducted to show the global stability properties of the
proposed algorithm. Conclusions are reported in Section 7.
Figure 1: Problem setup
2. Problem setup
The algorithm is developed for the Mars soft landing scenario described
in Figure 1. The problem is described in an orthonormal reference frame
centered on the nominal landing target on the ground. The equations of motion
governing the dynamics of the problem, expressed in this reference frame, are:
$$\ddot{\mathbf{r}} = \mathbf{g}(\mathbf{r}) + \frac{\mathbf{T}}{m} \tag{1}$$
$$\dot{m} = -\frac{\|\mathbf{T}\|}{I_{sp}\, g_0} \tag{2}$$
where $\mathbf{x} = [\mathbf{r}, \mathbf{v}]^T$ is the state (position and velocity),
$\mathbf{g}(\mathbf{r})$ is the gravity vector at position $\mathbf{r}$, $m$ is the
spacecraft mass, $I_{sp}$ is the engine specific impulse, $g_0$ is the standard
gravitational acceleration at the Earth's surface, and $\mathbf{T}$ is the thrust vector:
$$\mathbf{T} = [T_x,\ T_y,\ T_z]^T \tag{3}$$
It should be noted that the control policy is based on Zero-Effort-Miss/Zero-Effort-Velocity
guidance, which outputs an acceleration command rather than
a thrust command. The thrust $\mathbf{T}$ is recovered indirectly from the engine
specifications and the mass. The gravitational acceleration is aligned with the vertical
direction at all times. The rotation of the planet is neglected, as we consider
only the terminal guidance of the powered descent phase for a pinpoint landing
problem where the altitude is small with respect to the radius of the planet. The
interaction with the thin Martian atmosphere is also neglected. The spacecraft
is constrained to remain above the ground, which has a constant slope angle of
4° with respect to the horizontal direction except for a flat area of 5 m radius
around the target.
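As a minimal sketch of how Equations 1 and 2 could be propagated in a simulator, the snippet below advances position, velocity and mass by one explicit Euler step. The function names, the Euler scheme, and the numerical values used in the usage lines (Mars gravity of 3.71 m/s^2, a specific impulse of 225 s, a 1000 kg lander) are illustrative assumptions and are not taken from the paper.

```python
import math

G0 = 9.80665  # standard gravity used in the specific impulse definition, m/s^2


def derivatives(v, m, thrust, gravity, isp):
    """Right-hand side of Eqs. (1)-(2): returns (r_dot, v_dot, m_dot)."""
    thrust_norm = math.sqrt(sum(c * c for c in thrust))
    r_dot = list(v)
    v_dot = [g + t / m for g, t in zip(gravity, thrust)]
    m_dot = -thrust_norm / (isp * G0)
    return r_dot, v_dot, m_dot


def euler_step(r, v, m, thrust, gravity, isp, dt):
    """One explicit Euler step (assumed integrator; a real simulator may differ)."""
    r_dot, v_dot, m_dot = derivatives(v, m, thrust, gravity, isp)
    r_new = [x + dt * dx for x, dx in zip(r, r_dot)]
    v_new = [x + dt * dx for x, dx in zip(v, v_dot)]
    return r_new, v_new, m + dt * m_dot


def thrust_from_acceleration(a_cmd, m):
    """Recover the thrust vector from an acceleration command: T = m * a."""
    return [m * a for a in a_cmd]


# Usage: a hover command (thrust cancels gravity) leaves the velocity unchanged
# while the mass decreases according to Eq. (2).
gravity = [0.0, 0.0, -3.71]                                   # assumed value
thrust = thrust_from_acceleration([0.0, 0.0, 3.71], 1000.0)
r1, v1, m1 = euler_step([0.0, 0.0, 100.0], [0.0, 0.0, 0.0], 1000.0,
                        thrust, gravity, 225.0, 0.1)
```

The helper `thrust_from_acceleration` mirrors the remark above that the guidance law outputs an acceleration and the thrust is recovered from the current mass.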
3. Theoretical Background
3.1. Classical Zero-Effort-Miss/Zero-Effort-Velocity algorithm
Consider the problem of landing on a large planetary body of interest, with the
mission running from time $t_0$ to $t_f$. The unconstrained, energy-optimal guidance
problem can be formulated as follows: find the overall acceleration $\mathbf{a}$ that
minimizes the performance index
$$J = \frac{1}{2}\int_{t_0}^{t_f} \mathbf{a}^T\mathbf{a}\; dt \tag{4}$$
for a spacecraft subject to the following general dynamic equations, valid in
any case, even for non-inertial systems:
$$\dot{\mathbf{r}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = \mathbf{a} + \mathbf{g}, \qquad \mathbf{a} = \mathbf{T}/m \tag{5}$$
with $\mathbf{r}$, $\mathbf{v}$, $\mathbf{T}$ and $\mathbf{a}$ being the position, velocity,
thrust and acceleration command vectors, respectively, and $\mathbf{g}$ the gravitational
acceleration. In the remainder of the paper, $\mathbf{g}$ is assumed to be constant. This
assumption works well for modeling powered descent guidance starting close to the
surface of a large planetary body. Additionally, other forces (e.g. aerodynamic forces
experienced by bodies close to the Mars surface) are considered negligible. The following
boundary conditions are given:
$$\mathbf{r}(t_0) = \mathbf{r}_0, \qquad \mathbf{r}(t_f) = \mathbf{r}_f \tag{6}$$
$$\mathbf{v}(t_0) = \mathbf{v}_0, \qquad \mathbf{v}(t_f) = \mathbf{v}_f \tag{7}$$
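Importantly, no constraints on the acceleration or on the spacecraft state are
assumed. The necessary conditions can be derived by a straightforward
application of the PMP. The Hamiltonian function for this problem is
defined as
$$H = \frac{1}{2}\mathbf{a}^T\mathbf{a} + \mathbf{p}_r^T\mathbf{v} + \mathbf{p}_v^T(\mathbf{g} + \mathbf{a}) \tag{8}$$
where $\mathbf{p}_r$ and $\mathbf{p}_v$ are the costate vectors associated with the position
and velocity vectors, respectively. The time-to-go is defined as $t_{go} = t_f - t$. With
constant $\mathbf{g}$, the costate equations give $\dot{\mathbf{p}}_r = \mathbf{0}$ and
$\dot{\mathbf{p}}_v = -\mathbf{p}_r$, while the optimality condition
$\partial H/\partial\mathbf{a} = \mathbf{0}$ gives $\mathbf{a} = -\mathbf{p}_v$. The optimal
acceleration at any time $t$ is therefore
$$\mathbf{a} = -t_{go}\,\mathbf{p}_r(t_f) - \mathbf{p}_v(t_f) \tag{9}$$
By substituting Equation 9 into the dynamics equations to solve for $\mathbf{p}_r(t_f)$ and
$\mathbf{p}_v(t_f)$, the optimal control solution with specified $\mathbf{r}_f$, $\mathbf{v}_f$
and $t_{go}$ is obtained as:
$$\mathbf{a} = \frac{6\,[\mathbf{r}_f - (\mathbf{r} + t_{go}\mathbf{v})]}{t_{go}^2} - \frac{2(\mathbf{v}_f - \mathbf{v})}{t_{go}} + \frac{6\int_t^{t_f}(\tau - t)\,\mathbf{g}\,d\tau}{t_{go}^2} - \frac{4\int_t^{t_f}\mathbf{g}\,d\tau}{t_{go}} \tag{10}$$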
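The Zero-Effort-Miss (ZEM) and the Zero-Effort-Velocity (ZEV) are defined,
respectively, as the differences between the desired final position and velocity and
the projected final position and velocity if no additional control is commanded
from time $t$ onward. Consequently, ZEM and ZEV have the following expressions:
$$\begin{aligned}
\mathbf{ZEM} &= \mathbf{r}_f - \left[\mathbf{r} + t_{go}\mathbf{v} + \int_t^{t_f}(t_f - \tau)\,\mathbf{g}(\tau)\,d\tau\right] \\
\mathbf{ZEV} &= \mathbf{v}_f - \left[\mathbf{v} + \int_t^{t_f}\mathbf{g}(\tau)\,d\tau\right]
\end{aligned} \tag{11}$$
Then the optimal control law in Equation 10 can be expressed as:
$$\mathbf{a} = \frac{6}{t_{go}^2}\,\mathbf{ZEM} - \frac{2}{t_{go}}\,\mathbf{ZEV} \tag{12}$$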
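As a worked special case, under the constant-gravity assumption stated earlier in this section, the integrals in the ZEM and ZEV definitions can be evaluated in closed form. The steps below are a sketch of that substitution and use only quantities already defined in this section:

$$\int_t^{t_f}(t_f - \tau)\,\mathbf{g}\,d\tau = \tfrac{1}{2}t_{go}^2\,\mathbf{g}, \qquad \int_t^{t_f}\mathbf{g}\,d\tau = t_{go}\,\mathbf{g}$$

$$\mathbf{ZEM} = \mathbf{r}_f - \mathbf{r} - t_{go}\mathbf{v} - \tfrac{1}{2}t_{go}^2\,\mathbf{g}, \qquad \mathbf{ZEV} = \mathbf{v}_f - \mathbf{v} - t_{go}\,\mathbf{g}$$

$$\mathbf{a} = \frac{6}{t_{go}^2}\mathbf{ZEM} - \frac{2}{t_{go}}\mathbf{ZEV} = \frac{6\,(\mathbf{r}_f - \mathbf{r} - t_{go}\mathbf{v})}{t_{go}^2} - \frac{2\,(\mathbf{v}_f - \mathbf{v})}{t_{go}} - \mathbf{g}$$

The gravity terms combine as $-3\mathbf{g} + 2\mathbf{g} = -\mathbf{g}$, so in this case the command cancels gravity and adds a term that depends only on the position and velocity errors.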
Note that the solution also holds in the case where $\mathbf{g} = \mathbf{g}(t)$. In any other
case, in which $\mathbf{g}$ is neither constant nor purely time dependent, the control law is
still usable but it will not necessarily be optimal. In case the equations of motion are
nonlinear and, in general, when Equation 11 does not apply, ZEM and ZEV are expressed in a
slightly different way. The projected position and velocity cannot be recovered
analytically: they must be obtained by integrating the equations of
motion from the current time instant to the end of the mission with the control
actions set to zero:
$$\begin{aligned}
\mathbf{ZEM} &= \mathbf{r}_f - \mathbf{r}_{nc} \\
\mathbf{ZEV} &= \mathbf{v}_f - \mathbf{v}_{nc}
\end{aligned} \tag{13}$$
where $\mathbf{r}_{nc}$ and $\mathbf{v}_{nc}$ are, respectively, the position and velocity at
the end of the mission if no control action is given from the considered time onward. It
should be noted that using the formulation in Equation 12, which will be called classical
ZEM/ZEV from now on, can result in valid trajectories even for cases when the
generalized acceleration term is arbitrary. In these types of environment, however,
using a definition of ZEM and ZEV as in Equation 13, the control gains that solve the
optimal problem are no longer the ones in Equation 12. This leads to the definition of
the Generalized-ZEM/ZEV algorithm [23], which is valid in any environment
and will be used as the starting point for the development of the proposed adaptive
algorithm:
$$\mathbf{a} = \frac{K_R}{t_{go}^2}\,\mathbf{ZEM} + \frac{K_V}{t_{go}}\,\mathbf{ZEV} \tag{14}$$
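The following is a minimal sketch of how Equations 13 and 14 might be evaluated numerically: the zero-control trajectory is propagated to the final time to obtain the projected position and velocity, and the gains are then applied to the resulting ZEM and ZEV. The function names, the fixed-step propagation scheme and the numerical values in the usage lines are illustrative assumptions. The default gains 6 and -2 are those of the classical law in Equation 12.

```python
def propagate_no_control(r, v, t_go, gravity, n_steps=200):
    """Coast from (r, v) for t_go seconds with zero control.

    `gravity` is a callable returning the gravity vector at a position.
    Each step holds gravity constant, which is exact for uniform gravity
    and only an approximation otherwise.
    """
    dt = t_go / n_steps
    r, v = list(r), list(v)
    for _ in range(n_steps):
        g = gravity(r)
        r = [ri + vi * dt + 0.5 * gi * dt * dt for ri, vi, gi in zip(r, v, g)]
        v = [vi + gi * dt for vi, gi in zip(v, g)]
    return r, v


def generalized_zem_zev(r, v, r_f, v_f, t_go, gravity, k_r=6.0, k_v=-2.0):
    """Acceleration command of Eq. (14) with ZEM/ZEV computed as in Eq. (13)."""
    r_nc, v_nc = propagate_no_control(r, v, t_go, gravity)
    zem = [a - b for a, b in zip(r_f, r_nc)]
    zev = [a - b for a, b in zip(v_f, v_nc)]
    return [k_r * zm / t_go**2 + k_v * zv / t_go for zm, zv in zip(zem, zev)]


# Usage with uniform gravity (assumed value 3.71 m/s^2, pointing down).
uniform_gravity = lambda r: [0.0, 0.0, -3.71]
a_cmd = generalized_zem_zev([500.0, 0.0, 1500.0], [-20.0, 0.0, -50.0],
                            [0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                            40.0, uniform_gravity)
# a_cmd is approximately [0.125, 0.0, 3.085]
```

With uniform gravity the numerically propagated ZEM and ZEV coincide with the analytical expressions, so the command matches the classical closed-form law. The numerical route becomes necessary when gravity depends on position.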
3.2. Reinforcement learning
Reinforcement Learning (RL) can be conceived as the formalization of learning
by trial and error: it is based on the idea that a machine can autonomously
learn the optimal behavior, or policy, to carry out a particular task, given the
environment, by maximizing (or minimizing) a cumulative reward (or cost). RL
algorithms work on systems that are formalized as Markov Decision Processes
[30, 31, 29].
3.2.1. Markov decision processes
The reinforcement learning problem is generally modeled as a Markov Decision
Process (MDP), which is composed of: a state space $X$, an action space
$U$, an initial state distribution with density $p_1(x_1)$ representing the initial state
of the system, a transition dynamics distribution with conditional density
$$p(x_{t+1} \mid x_t, u_t) = \int_{x_{t+1}} f(x_t, u_t, x')\,dx' \tag{15}$$
representing the dynamic relationship between a state and the next given action
$u_t$, and a reward function $r: X \times U \to \mathbb{R}$ that depends in general on the
previous state, the current state and the action taken. It should be noted that if the
dynamics of the system are completely deterministic, this probability
is always 0 except when action $u_t$ brings the state from $x_t$ to $x_{t+1}$. The reward
function $r$ is assumed to be bounded. A policy is used by the agent to select actions
given a certain state. The policy is stochastic and denoted by
$\pi_\theta: X \to P(U)$, where $P(U)$ is the set of probability measures on $U$,
$\theta \in \mathbb{R}^n$ is a vector of $n$ parameters and $\pi_\theta(u_t \mid x_t)$ is the
probability of selecting action $u_t$ given state $x_t$. The agent uses the policy to
interact with the MDP and generate a trajectory made of a sequence of states, actions and
rewards. The return
$$r_t^\gamma = \sum_{k=t}^{\infty} \gamma^{k-t}\, r(x_k, u_k) \tag{16}$$
is the discounted reward along the trajectory from time step $t$ onward, with
$0 < \gamma \le 1$. The agent's goal is to obtain a policy that maximizes the discounted
cumulative reward from the start state to the end state, denoted by the
performance objective $J(\pi) = E[r_1^\gamma \mid \pi]$. Denote the density at state $x'$ after
transitioning for $t$ time steps from state $x$ by $p(x \to x', t, \pi)$, and the discounted
state distribution by
$$\rho^\pi(x') := \int_X \sum_{t=1}^{\infty} \gamma^{t-1}\, p_1(x)\, p(x \to x', t, \pi)\,dx \tag{17}$$
The performance objective can then be written as an expectation:
$$J(\pi_\theta) = \int_X \rho^\pi(x) \int_U \pi_\theta(u \mid x)\, r(x, u)\,du\,dx = E_{x\sim\rho^\pi,\,u\sim\pi_\theta}[r(x, u)] \tag{18}$$
where $E_{x\sim\rho^\pi}$ denotes the expected value with respect to the discounted state
distribution $\rho^\pi(x)$.
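To make the return in Equation 16 concrete, the sketch below computes the discounted return at every time step of a finite recorded trajectory with a single backward pass. Truncating the infinite sum at the end of the episode and the function name are assumptions made for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return r_t^gamma for every t of a finite episode (Eq. 16, truncated).

    Uses the recursion r_t^gamma = r_t + gamma * r_{t+1}^gamma, evaluated
    backward from the last step of the episode.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Usage: three unit rewards with gamma = 0.5
print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

The backward recursion costs one multiplication per step, whereas evaluating the sum in Equation 16 separately for each $t$ would be quadratic in the episode length.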
During training, the agent will have to estimate the reward-to-go function
$J$ for a given policy $\pi$: this procedure is called policy evaluation. The resulting
estimate of $J$ is called the value function. The latter may depend either on the state
alone or on both state and action, yielding two different possible definitions. The state
value function
$$V^\pi(x) = E\left[\sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\Big|\, x_0 = x, \pi\right] \tag{19}$$
only depends on the state $x$. The state-action value function
$$Q^\pi(x, u) = E\left[\sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\Big|\, x_0 = x, u_0 = u, \pi\right] \tag{20}$$
depends on the state $x$ but also on the action $u$. The relationship between the
two is:
$$V^\pi(x) = E\left[Q^\pi(x, u) \mid u \sim \pi(\cdot \mid x)\right] \tag{21}$$
In recursive form, $V^\pi$ and $Q^\pi$ become:
$$V^\pi(x) = E\left[r(x, u, x') + \gamma V^\pi(x')\right] \tag{22}$$
and
$$Q^\pi(x, u) = E\left[r(x, u, x') + \gamma Q^\pi(x', u')\right] \tag{23}$$
which are called Bellman equations. Optimality for both $V^\pi$ and $Q^\pi$ is governed
by the Bellman optimality equation. Let $V^*(x)$ and $Q^*(x, u)$ be the optimal
value and action-value functions, respectively. The corresponding Bellman
optimality equations are:
$$\begin{aligned}
V^*(x) &= \max_u E\left[r(x, u, x') + \gamma V^*(x')\right] \\
Q^*(x, u) &= E\left[r(x, u, x') + \gamma \max_{u'} Q^*(x', u')\right]
\end{aligned} \tag{24}$$
The goal of reinforcement learning is to find the policy $\pi$ that maximizes $V^\pi$, $Q^\pi$
or $J(\pi_\theta)$, or in other words, to find $V^*$ or $Q^*$ that satisfy the Bellman
optimality equation.
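As an illustration of the recursive form of the state value function, the sketch below evaluates a fixed policy on a small finite MDP by repeatedly applying the Bellman equation as a fixed-point iteration. The two-state chain, its transition probabilities and rewards are hypothetical and serve only to show the mechanics. The landing problem itself has continuous states, for which a function approximator is needed instead of a table.

```python
def evaluate_policy(P, R, gamma, sweeps=200):
    """Iterative policy evaluation: V <- R + gamma * P V.

    P[x][y] is the probability of moving from state x to state y under the
    fixed policy, R[x] is the expected one-step reward collected in state x.
    """
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[x] + gamma * sum(P[x][y] * V[y] for y in range(n))
             for x in range(n)]
    return V


# Hypothetical chain: state 0 pays reward 1 and stays put with probability
# 0.5; state 1 is absorbing with zero reward.
P = [[0.5, 0.5],
     [0.0, 1.0]]
R = [1.0, 0.0]
V = evaluate_policy(P, R, gamma=0.9)
# Fixed point: V[0] = 1 / (1 - 0.9 * 0.5), V[1] = 0
```

Because the update is a contraction for $\gamma < 1$, the iterates converge to the unique solution of the Bellman equation regardless of the starting guess.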
3.2.2. Stochastic policy gradient theorem
Policy gradient algorithms are among the most popular classes of continuous
action and state space reinforcement learning algorithms. The fundamental idea
on which they are based is to adjust the parameters $\theta$ of the policy $\pi_\theta$ in the
direction of the performance objective gradient $\nabla_\theta J(\pi_\theta)$. The biggest
challenge is to compute the gradient $\nabla_\theta J(\pi_\theta)$ effectively, so that at each
iteration the policy becomes better than the one at the previous iteration. It turns out, from
the work by Williams [32], who introduced the REINFORCE algorithms, that
the gradient of the performance objective can be estimated using samples from
experience, without actually computing it and without complete knowledge
of the environment (such algorithms are sometimes referred to as model-free). A direct
implication of [32] is the policy gradient theorem:
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_X \rho^\pi(x) \int_U \nabla_\theta \pi_\theta(u \mid x)\, Q^\pi(x, u)\,du\,dx \\
&= E_{x\sim\rho^\pi,\,u\sim\pi_\theta}\left[\nabla_\theta \log \pi_\theta(u \mid x)\, Q^\pi(x, u)\right]
\end{aligned} \tag{25}$$
where $Q^\pi(x, u)$ is the state-action value function expressing the expected total
discounted reward obtained by being in state $x$ and taking action $u$. The theorem is important
because it reduces the computation of the performance gradient, which could
be hard to compute analytically, to an expectation that can be estimated using
a sample-based approach. It is important to note that this estimate has been shown
to be unbiased, so the update moves, on average, in a direction that improves the policy
with respect to the previous iteration. Once $\nabla_\theta J(\pi_\theta)$ is computed, the
policy update is simply done in the direction of the gradient:
$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J_k \tag{26}$$
where $\alpha_k$ is the learning rate, which is assumed to be bounded.
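A minimal sketch of the sample-based gradient estimate and of the update in Equation 26 is given below for a one-step problem with a scalar Gaussian policy of fixed standard deviation. The quadratic reward, the batch size, the learning rate and all function names are hypothetical choices made for illustration. The score function used is the standard gradient of the log-density of a Gaussian with respect to its mean.

```python
import random


def reinforce_gradient(theta, sigma, reward, rng, batch=2000):
    """Monte Carlo estimate of E[grad_theta log pi(u) * reward(u)].

    The policy is u ~ N(theta, sigma^2), so the score function is
    grad_theta log pi(u) = (u - theta) / sigma^2.
    """
    total = 0.0
    for _ in range(batch):
        u = rng.gauss(theta, sigma)
        score = (u - theta) / sigma**2
        total += score * reward(u)
    return total / batch


def train(theta=0.0, sigma=0.5, alpha=0.05, iterations=200, seed=0):
    """Gradient ascent on the policy mean: theta <- theta + alpha * grad J."""
    rng = random.Random(seed)
    reward = lambda u: -(u - 2.0) ** 2   # hypothetical reward, best action u = 2
    for _ in range(iterations):
        theta += alpha * reinforce_gradient(theta, sigma, reward, rng)
    return theta


theta_final = train()   # ends close to 2.0, the maximizer of the expected reward
```

For this toy problem the exact gradient is $-2(\theta - 2)$, so the stochastic iterates fluctuate around the optimum with a spread set by the batch size and the learning rate.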
One important issue to be addressed is how to estimate the $Q$ function
effectively; all of the above is in fact valid in the case where $Q$ represents the true
action-value function. In the case of continuous action and state spaces, obtaining
an unbiased estimate of it is difficult. One of the simplest approaches is to use
the single-sample discounted return $r_t^\gamma$ to estimate $Q$, which is the idea behind
the REINFORCE algorithm [32]. This estimate is unbiased⁵, but
the variance is high, which leads to slow convergence. One way of estimating
the action-value function that reduces the variance while keeping the
error contained is the introduction of a critic in the algorithm.
⁵ The discounted return is unbiased because it comes directly from experience and no
approximation is introduced.
3.2.3. Stochastic actor-critic algorithm
The actor-critic is a widely used architecture based on policy gradient. It
consists of two major components. The actor adjusts the parameters $\theta$ of the
stochastic policy $\pi_\theta(u \mid x)$ by stochastic gradient ascent (or descent). The critic
evaluates the goodness of the generated policy by estimating some kind of value
function. If a critic is present, instead of the true action-value function $Q^\pi(x, u)$,
an estimated action-value function $Q_w(x, u)$ is used in Equation 25. Provided, in
fact, that the estimator is compatible with the policy parametrization, meaning
that
$$\frac{\partial Q_w(x, u)}{\partial w} = \frac{\partial \pi_\theta(u \mid x)}{\partial \theta}\, \frac{1}{\pi_\theta(u \mid x)} \tag{27}$$
then $Q_w(x, u)$ can be substituted for $Q^\pi(x, u)$ in Equation 25 and the gradient would still
assure improvement by moving in that direction. It is important to note that
$Q_w(x, u)$ is required to have zero mean for each state:
$\sum_u \pi_\theta(u \mid x)\, Q_w(x, u) = 0,\ \forall x \in X$. In this sense it is better to
think of $Q_w(x, u)$ as an approximation of the advantage function
$A^\pi(x, u) = Q^\pi(x, u) - V^\pi(x)$ rather than of $Q^\pi(x, u)$.
This is in fact what will be used in the following.
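A short derivation, sketched here under the assumption that differentiation and integration can be exchanged, shows why replacing the action-value function with the advantage function leaves the policy gradient unchanged. Since the policy is a probability density over actions for every state,

$$\int_U \nabla_\theta \pi_\theta(u \mid x)\, V^\pi(x)\, du = V^\pi(x)\, \nabla_\theta \int_U \pi_\theta(u \mid x)\, du = V^\pi(x)\, \nabla_\theta 1 = 0$$

so subtracting the state-dependent term $V^\pi(x)$ inside the policy gradient theorem does not alter its value:

$$\nabla_\theta J(\pi_\theta) = E_{x\sim\rho^\pi,\,u\sim\pi_\theta}\left[\nabla_\theta \log \pi_\theta(u \mid x)\, A^\pi(x, u)\right]$$

The advantage also has zero mean over actions in every state, $E_{u\sim\pi_\theta}[A^\pi(x, u)] = 0$, which matches the zero-mean requirement stated above.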
In general, introducing an estimate of the action-value function may introduce
bias, but the overall variance of the method is decreased, which ultimately
leads to faster convergence. The critic's goal is to estimate the action-value
function, providing a better estimate of the expectation of the reward than
the single-sample reward-to-go given state $x$ and action $u$. This
happens because the action-value function is estimated from an average over all
the samples, not just from the ones belonging to a single trajectory. This will
become clearer in Section 4, where the details of the algorithm are discussed.
It should be noted that both the critic and the deterministic part of the policy
are represented by a Single Layer Feedforward Network (SLFN). Specifically,
the critic is represented by an Extreme Learning Machine, which is a particular
instance of SLFN and is presented in the following section.
3.3. Extreme learning machines
Extreme Learning Machines (ELM) are a particular kind of Single Layer
Feedforward Network (SLFN) with a single layer of hidden neurons which does
not make use of backpropagation as the learning algorithm. Backpropagation is
a multiple-step iterative process; ELM instead uses a learning method which
allows for learning in a single step. The concepts behind ELM had already
been present in the scientific community for years before Huang theorized and formally
introduced them as Extreme Learning Machines in 2004 [28, 27]. According to
their creator, they can produce very good results with a learning time that is a
fraction of the time needed for algorithms based on backpropagation.
Figure 2: Single layer feedforward network
Consider a simple SLFN. The universal approximation theorem states that
any continuous target function $f(x)$ can be approximated by SLFNs with a set
of hidden nodes and appropriate parameters. Mathematically speaking, given
any small positive $\varepsilon$, for SLFNs with a sufficient number of neurons $L$, it holds
that:
$$\|f_L(x) - f(x)\| < \varepsilon \tag{28}$$
where
$$f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x) = H(x)\beta \tag{29}$$
is the output of the SLFN, $\beta$ being the output weights matrix and
$H(x) = \sigma(Wx + b)$ the output of the hidden layer for input $x$, with $W$ and $b$ being
the input weights and biases, respectively, and $\sigma$ the activation function
of the hidden neurons. A representation of an SLFN can be seen in Figure 2.
In a conventional SLFN, the input weights $w_i$, biases $b_i$ and output weights $\beta_i$ are
learned via backpropagation⁶. ELM are a particular type of SLFN that have
the same structure, but only the $\beta_i$ are learned, while the input weights and biases are
assigned randomly at the beginning of training, without knowledge of the
training data, and are never changed. It has been shown that, for any randomly
generated set $\{W, b\}$ of input weights and biases,
$$\lim_{L\to\infty} \|f(x) - f_L(x)\| = 0 \tag{30}$$
holds if the output weights matrix $\beta$ is chosen so that it minimizes the loss function
$\|f(x) - f_L(x)\|$ appearing in Equation 28. Equation 28, after some manipulation, becomes
$$\|H\beta - Y\| < \varepsilon \tag{31}$$
where $Y = [y_1, ..., y_N]^T$ are the target labels and
$H = [h^T(x_1), ..., h^T(x_N)]^T$ is the hidden layer output matrix. Given $N$ training
samples $\{x_i, y_i\}_{i=1}^{N}$, the training problem is reduced to:
$$H\beta = Y \tag{32}$$
The output weights are then simply:
$$\beta = H^{\dagger} Y \tag{33}$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse⁷ of $H$. This is
demonstrated to minimize the loss in Equation 31 given large enough sets of training points and
neurons. In other words, $\beta = H^{\dagger} Y$ is the
minimum-norm least-squares solution of Equation 32. This will be used in the actor-critic
algorithm and will be explained in Section 4.
⁶ Backpropagation is an optimization technique based on the concept of iteratively updating
the weights and biases of a neural network according to the gradient of the loss function to be
minimized. It is called backpropagation because the error is calculated at the output and
distributed back through the network layers. Details in [34].
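The single-step training described above can be sketched as follows for a scalar input. To keep the example dependency-free, the Moore-Penrose solution is replaced here by a ridge-regularized normal-equation solve, which approaches the pseudoinverse solution as the ridge term tends to zero. This substitution, the tanh activation, the uniform sampling range of the random weights and all function names are assumptions made for illustration.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def hidden(x, W, b):
    """Hidden layer output h(x) for a scalar input x."""
    return [math.tanh(w * x + bi) for w, bi in zip(W, b)]


def elm_fit(xs, ys, n_hidden=30, ridge=1e-8, seed=0):
    """Draw random input weights once, then solve for the output weights only."""
    rng = random.Random(seed)
    W = [rng.uniform(-3.0, 3.0) for _ in range(n_hidden)]
    b = [rng.uniform(-3.0, 3.0) for _ in range(n_hidden)]
    H = [hidden(x, W, b) for x in xs]
    # Regularized normal equations: (H^T H + ridge * I) beta = H^T Y
    A = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
          for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(h[i] * y for h, y in zip(H, ys)) for i in range(n_hidden)]
    return W, b, solve_linear(A, rhs)


def elm_predict(x, W, b, beta):
    return sum(bi * hi for bi, hi in zip(beta, hidden(x, W, b)))


# Usage: fit sin(x) on [0, pi] from 60 samples
xs = [math.pi * i / 59 for i in range(60)]
ys = [math.sin(x) for x in xs]
W, b, beta = elm_fit(xs, ys)
```

Only the output weights are fitted, so training reduces to one linear solve, which is what makes the critic update cheap compared with gradient-based training.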
⁷ Used here because the numbers of neurons and samples are not equal, so the system is not
square.
4. Adaptive-ZEM/ZEV algorithm
The Adaptive-ZEM/ZEV (A-ZEM/ZEV) is based on the idea of learning the
parameters $K_R$, $K_V$ and the time of flight $T_f$, which is related to the time-to-go
$t_{go}$, of the generalized ZEM/ZEV algorithm in Equation 14. The overall idea
is that the guidance gains and the time-to-go can be adapted during the powered
descent phase to satisfy specific constraints while maintaining quasi-fuel
optimality and closed-loop characteristics. The guidance adaptation is achieved
by using a customization of the actor-critic algorithm described in Section 3.2.
More specifically, we have developed a fast RL framework based on a combination
of the REINFORCE algorithm [32] and a critic network based on Extreme
Learning Machines for estimating the value function. The goal is to show that
learning of the adaptive generalized ZEM/ZEV guidance can occur quickly and
efficiently. The proposed RL-based learning algorithm can be broken down into
three major blocks:
1. Samples generation
2. Critic neural network fitting
3. Policy update
A high-level schematic representation of the algorithm can be seen in Figure
3. Overall, at each global iteration, a batch of sample trajectories is generated,
giving a set of states, actions, costs and next states $(x_t, u_t, c_t, x_{t+1})$. These are then
fed to the critic, which outputs an approximation $Q_w(x, u)$ of the expected cost-to-go given
a particular action and state, which is then used by the actor to update the policy.
Importantly, we do not aim at learning the full guidance policy,
which is represented by the generalized ZEM/ZEV algorithm (Eq. 14). Here,
we call policy specifically $K_R$, $K_V$ and $T_f$ as functions of the lander state,
parametrized/approximated by a neural network (see Figure 4 for a description
of the policy network). The parameters of this network are learned
during the training phase. Details of the three phases are reported in the next
sections.
Figure 3: Schematic representation of the actor-critic algorithm
4.1. Samples generation
At each global iteration, a batch of trajectories is generated by letting
the agent interact with the environment using policy $\pi_\theta(u \mid x)$, which is a
representation of the guidance gains in Equation 14, giving a series of samples
$(x_{i,t}, u_{i,t}, c_{i,t}, x_{i,t+1})$, where $i$ represents the trajectory number and $t$ is
the time step along that trajectory. At the start of each episode, the time of flight $T_f$ is
sampled using the policy and kept constant for the entire episode. The starting
position is randomly chosen by sampling a Gaussian distribution around the
nominal starting position. This ensures exploration of the state space around
the nominal starting state and also avoids singularities in the policy evaluation
step⁸. The time is discretized in a fixed number of time steps: at the beginning
of each time step the policy is sampled and $K_R$ and $K_V$ are obtained, the
acceleration command is calculated with Equation 14, and the equations of motion are
integrated forward in time. The acceleration command is kept constant during the
time interval. The cost, whose value depends on the particular case addressed,
is assigned at each time step. It should be noted that here the reward in the
definitions in Section 3.2 is substituted with a cost for reasons that will become
clearer later. Importantly, the whole machinery described in Section 3.2 is also valid
in case a cost is used to evaluate actions instead of a reward. The final time
for each episode is also fixed and the agent runs until the end time is reached,
unless an impact with the ground is detected, in which case the episode ends.
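The sampling procedure described above can be sketched for a simplified one-dimensional (vertical) descent as follows. The constant policy means, the per-step cost (commanded acceleration magnitude times the step length), the guard on the sampled time of flight, the flat ground at zero altitude and all numerical values are hypothetical placeholders, since the actual cost and scenario depend on the case addressed.

```python
import random

G_MARS = -3.71  # assumed constant vertical gravity, m/s^2


def rollout(mean_gains, sigma, rng, n_steps=50, r_nominal=1500.0, v0=-60.0):
    """Generate one episode of (state, action, cost, next_state) samples."""
    k_r_mean, k_v_mean, t_f_mean = mean_gains
    # Time of flight: sampled once and held for the whole episode
    # (the lower bound of 1 s is an assumed safeguard).
    t_f = max(rng.gauss(t_f_mean, sigma), 1.0)
    r = rng.gauss(r_nominal, 10.0)      # randomized starting altitude
    v = v0
    dt = t_f / n_steps
    samples = []
    for k in range(n_steps):
        t_go = t_f - k * dt
        # Gains are re-sampled from the stochastic policy at every step.
        k_r = rng.gauss(k_r_mean, sigma)
        k_v = rng.gauss(k_v_mean, sigma)
        zem = 0.0 - (r + t_go * v + 0.5 * G_MARS * t_go**2)
        zev = 0.0 - (v + G_MARS * t_go)
        a = k_r * zem / t_go**2 + k_v * zev / t_go
        # The command is held constant over the step (zero-order hold).
        total = a + G_MARS
        r_next = r + v * dt + 0.5 * total * dt**2
        v_next = v + total * dt
        cost = abs(a) * dt              # hypothetical effort-like cost
        samples.append(((r, v), (k_r, k_v, t_f), cost, (r_next, v_next)))
        r, v = r_next, v_next
        if r < 0.0:                     # ground impact ends the episode
            break
    return samples


# Usage: one episode around the classical gains (6, -2) and a 40 s flight
episode = rollout((6.0, -2.0, 40.0), sigma=0.1, rng=random.Random(1))
```

A batch is obtained by calling `rollout` repeatedly; in the full algorithm the means would come from the state-dependent policy network rather than from constants.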
$^8$ The critic network works well only if each state is associated with a single value. If every
episode starts from exactly the same position, identical states are associated with different
costs, which makes the regression perform poorly.

4.1.1. Policy
The policy is described by a Gaussian distribution with fixed variance $\sigma^2$
and variable mean, from which actions are sampled. The mean is parametrized
by a weight vector $\theta$, which is learned through gradient descent. The
stochasticity of the policy is essential for learning because 1) it enables
exploration of the action space and 2) the machinery developed for stochastic policy
gradients can then be applied. Since three parameters of the guidance algorithm
must be learned, i.e. $K_R$, $K_V$ and $T_f$, the policy is subdivided into three separate
parts parametrized by $\theta_{K_R}$, $\theta_{K_V}$ and $\theta_{T_f}$. The policy can be formally
expressed as:
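$$K_R = \pi_{\theta_{K_R}} = \mathcal{N}(\mu_{K_R}, \sigma^2) \tag{34}$$
$$K_V = \pi_{\theta_{K_V}} = \mathcal{N}(\mu_{K_V}, \sigma^2) \tag{35}$$
$$T_f = \pi_{\theta_{T_f}} = \mathcal{N}(\mu_{T_f}, \sigma^2) \tag{36}$$
where:
$$\mu_{K_R} = \phi(x)^T \theta_{K_R} \tag{37}$$
$$\mu_{K_V} = \phi(x)^T \theta_{K_V} \tag{38}$$
$$\mu_{T_f} = \phi(x)^T \theta_{T_f} \tag{39}$$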
$\phi(x)$ is the vector of feature functions evaluated at state $x$, and $\theta_{K_R}$, $\theta_{K_V}$ and $\theta_{T_f}$
are the weight vectors associated with each output. Note that the mean values
learned during the training phase are the ones employed in the generalized ZEM/ZEV
algorithm. The features comprise two sets of three-dimensional radial
basis functions (RBFs) with centers distributed evenly across the position and
velocity spaces. They are given by:
$$\phi(r) = e^{-\beta_R \|r - c_r\|^2} \tag{40}$$
$$\phi(v) = e^{-\beta_V \|v - c_v\|^2} \tag{41}$$
with $\beta_R$ and $\beta_V$ being constant parameters related to the variance of the radial
functions, which are set according to the particular case, $r$ and $v$ being the position
and velocity respectively, and $c_r$ and $c_v$ the centers of the RBFs. The
centers are generated by dividing the state space of the problem into a set of
intervals, thus creating a grid of equally spaced points in the position and velocity
spaces. The deterministic part of this policy can be seen as a neural network
with two three-dimensional inputs ($r$, $v$), a single hidden layer of neurons with
radial basis activation functions and a three-dimensional output layer ($K_R$, $K_V$,
$T_f$). A scheme is shown in Figure 4. The parameters $\theta$ are the weights that,
multiplied by the features, give the output, which is the mean of the stochastic
policy as stated above.

Figure 4: Policy neural network
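A minimal sketch of the policy structure described in this subsection is given below, under the assumption that the three weight vectors are stacked as the columns of a single matrix `theta`. The function names, the grid bounds and the values of `beta_r`, `beta_v` and `sigma` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def make_grid_centers(low, high, points_per_axis):
    """Equally spaced RBF centers on a 3D grid between `low` and `high`."""
    axes = [np.linspace(lo, hi, points_per_axis) for lo, hi in zip(low, high)]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([m.ravel() for m in mesh], axis=1)   # shape (points_per_axis**3, 3)

def rbf_features(r, v, c_r, c_v, beta_r, beta_v):
    """Feature vector of equations 40-41: position RBFs followed by velocity RBFs."""
    phi_r = np.exp(-beta_r * np.sum((r - c_r) ** 2, axis=1))
    phi_v = np.exp(-beta_v * np.sum((v - c_v) ** 2, axis=1))
    return np.concatenate([phi_r, phi_v])

def sample_policy(phi, theta, sigma, rng):
    """Mean mu = phi^T theta for (K_R, K_V, T_f), then a Gaussian sample with fixed sigma."""
    mu = phi @ theta                  # theta has one column per output
    action = rng.normal(mu, sigma)    # stochastic action used during training
    return mu, action
```

During training the sampled `action` would be used, whereas after training only the mean `mu` is fed to the generalized ZEM/ZEV law, as noted above.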
4.2. Critic neural network
One key part of the algorithm is the fitting of the neural network that
approximates the value function. As explained in Section 3.2.3, in actor-critic algorithms
the expectation in equation 25 is not computed exactly, but is instead expressed
using an approximated value function $Q_w(x, u)$. Here, we employ the advantage
function $A^\pi(x, u) = Q^\pi(x, u) - V^\pi(x)$ rather than $Q^\pi(x, u)$. The approximated
advantage function can be rewritten, using the definition of $Q$, as a function of $V$
only:
$$Q_w(x, u) = \hat{A}^\pi(u, x) = \hat{Q}^\pi(x, u) - \hat{V}^\pi(x) = r(x, u) + \hat{V}^\pi(x_{t+1}) - \hat{V}^\pi(x) \tag{42}$$
where $\hat{A}^\pi(u, x)$, $\hat{Q}^\pi(u, x)$ and $\hat{V}^\pi(x)$ are the approximated versions of $A^\pi(u, x)$,
$Q^\pi(u, x)$ and $V^\pi(x)$. Clearly, in order to compute the approximated advantage
function, only $\hat{V}^\pi(x)$ must be obtained. This is done by modeling the
value function with a single-layer feedforward network (SLFN) with the following sigmoid
activation function:
$$\sigma(s_i) = \frac{1}{1 + e^{-s_i}} \quad \text{with} \quad s_i = w_i x + b_i \tag{43}$$
The SLFN is used as a function approximator that maps the inputs, in
this case the 6D states, into the scalar representing the discounted cost, and is
trained at each step using ELM theory as introduced earlier. This is
done by generating, at each global iteration step, a training set on which the
SLFN is trained using the training algorithm described in Section 3.3. There
are normally two ways to define this training set, corresponding to two different types
of methods:
• Monte Carlo (MC): the value function is approximated at any given state
$x_{i,t}$ by the return, which is the discounted cost-to-go $y = \sum_{t'=t}^{T} \gamma^{t'-t} c(x_{i,t'}, u_{i,t'})$.
In this case the training set is defined by the pairs:
$$\left\{\left(x_{i,t},\ \sum_{t'=t}^{T} \gamma^{t'-t} c(x_{i,t'}, u_{i,t'})\right)\right\} \tag{44}$$
This is an unbiased way of expressing the value function but could suffer
from high variance.
• Temporal Difference (TD): the value function is approximated by a
bootstrapped estimate of the cost-to-go, meaning that the previously fitted
value function is used as an estimate of the cost-to-go from time step
$t+1$ onward. The training set in this case is given by the pairs:
$$\left\{\left(x_{i,t},\ c(x_{i,t}, u_{i,t}) + \hat{V}^\pi(x_{i,t+1})\right)\right\} \tag{45}$$
This way of expressing the value function introduces a bias, because the
estimate of $V$ is not perfect, but reduces the variance.
In this work, the possibility of using the TD targets of equation 45 for value function
approximation was explored but discarded in favor of the MC version of equation 44.
Equation 45 works well only when the bias introduced by the approximation is small.
Here, however, the states visited in two consecutive global iterations can be very
different. Consequently, the neural network approximating the value function, trained
on a particular portion of the state space, could produce very large extrapolation
errors. For this reason, we decided to employ the MC version of the algorithm
to keep the bias contained, and to find other means of reducing the variance. It
should be noted that, even if the variance is higher than in
the bootstrapped version, it is still lower than that of the vanilla REINFORCE
algorithm. Indeed, the samples come from all the generated episodes, and
therefore the learned value function is an average of the expected cost-to-go, which
is a better estimate of the value function than the simple sample
estimate. Note also that the approximated value function is the discounted cost
and not the discounted reward. This choice was made for this particular case, in
which the quality of an action is more clearly represented by a cost than by
a reward.
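The Monte Carlo critic fit can be illustrated with the sketch below. It assumes the standard ELM recipe (random, fixed hidden-layer weights and output weights obtained in one step by least squares through a pseudoinverse); the training algorithm of Section 3.3 may differ in its details. The class `ELMCritic`, the function `discounted_cost_to_go` and the random initialization are hypothetical choices made for illustration.

```python
import numpy as np

def discounted_cost_to_go(costs, gamma):
    """Targets of equation 44: y_t = sum over t' >= t of gamma^(t'-t) * c_t'."""
    returns = np.zeros(len(costs))
    running = 0.0
    for t in reversed(range(len(costs))):
        running = costs[t] + gamma * running
        returns[t] = running
    return returns

class ELMCritic:
    """Single-hidden-layer network with sigmoid units and least-squares output weights."""

    def __init__(self, n_inputs, n_hidden, rng):
        # Hidden weights and biases are drawn once and never trained (ELM assumption).
        self.W = rng.normal(size=(n_inputs, n_hidden))
        self.b = rng.normal(size=n_hidden)
        self.beta = np.zeros(n_hidden)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid of equation 43

    def fit(self, X, y):
        # One-step training: minimum-norm least-squares solution for the output weights.
        self.beta = np.linalg.pinv(self._hidden(X)) @ y

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

In practice the 6D states would likely need to be scaled before being passed to the network, since positions and velocities differ by orders of magnitude; that preprocessing is omitted here.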
4.3. Policy update
During the training phase, the policy is optimized using gradient descent
instead of gradient ascent, since a cost rather than a reward is being minimized.
This requires an estimate of the gradient to execute the
policy update step. Once the value function has been approximated by the critic network,
it is used to estimate the gradient of the objective function $J(\pi_\theta)$. In stochastic
policy gradient methods, the expectation in equation 25 is not computed directly but
is approximated by averaging the gradient over the samples. In this case, a
batch of trajectories is used to estimate the gradient. The expression of the
approximated gradient becomes:
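$$\nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(u_{i,t}|x_{i,t})\, \hat{A}^\pi(u_{i,t}, x_{i,t}) \tag{46}$$
where $N$ is the number of sample trajectories in the batch, $T$ is the number
of time instants in each trajectory, and $\nabla_\theta \log \pi_\theta(u|x)$ is the gradient of the
log-probability of the stochastic policy which, for a Gaussian policy like that of
equation 34, is obtained analytically as:
$$\nabla_\theta \log \pi_\theta = \frac{\pi_\theta - \mu}{\sigma^2}\, \phi(s) \tag{47}$$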
Here, $\hat{A}^\pi(u_t, x_t)$ is the approximated advantage function described in Section 4.2, and
indicates how much better the action $u_{i,t}$ performs with respect to the average
action. Using the advantage function generally reduces the variance (Section 3.2.3), but
it relies on an approximation that introduces bias into the process. A way to
reduce the effect of the bias is to use the advantage function formulated as:
$$\hat{A}^\pi_n(u_{i,t}, x_{i,t}) = \sum_{t'=t}^{T} \gamma^{t'-t} c(x_{i,t'}, u_{i,t'}) - \hat{V}^\pi(x_{i,t}) \tag{48}$$
which is often referred to as the Monte-Carlo formulation of the advantage
function, with the discount factor satisfying $0 < \gamma < 1$. This is unbiased
because the real cost-to-go is used to estimate the action-value function, yet it is
low in variance because the average value associated with state $x_{i,t}$ is subtracted.
To implement the gradient descent algorithm, the policy parameters are updated
by taking a step in the direction opposite to the gradient $\nabla_\theta J(\pi_\theta)$:
$$\theta_{k+1} = \theta_k - \alpha \nabla_\theta J(\pi_\theta) \tag{49}$$
where $\alpha$ is the bounded learning rate. After each update, the algorithm is tested
and the cumulative cost is computed:
$$C_k = \sum_{t=0}^{T} c(x_t, u_t) \tag{50}$$
where $k$ denotes the $k$-th iteration. The algorithm stops if the average cumulative
cost difference over the last 5 iterations is less than a tolerance $\varepsilon$ or if the
maximum number of iterations has been reached. A summary of the algorithm in the
form of pseudo-code is given in Figure 5.
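The update of equations 46 to 49 can be sketched for a single policy output (for instance $K_R$) as follows. This is an illustrative sketch only: the batch layout, a list of trajectories each holding `(phi, action, advantage)` tuples, and all function names are assumptions made here.

```python
import numpy as np

def monte_carlo_advantage(cost_to_go, value_estimate):
    """Equation 48: sampled discounted cost-to-go minus the critic's value estimate."""
    return np.asarray(cost_to_go, dtype=float) - np.asarray(value_estimate, dtype=float)

def grad_log_gaussian(action, mu, sigma, phi):
    """Equation 47 for a scalar output with mean mu = phi^T theta and fixed sigma."""
    return (action - mu) / sigma**2 * phi

def policy_gradient(batch, theta, sigma):
    """Equation 46: sum over time steps, average over the trajectories in the batch."""
    grad = np.zeros_like(theta, dtype=float)
    for trajectory in batch:
        for phi, action, advantage in trajectory:
            mu = phi @ theta
            grad += grad_log_gaussian(action, mu, sigma, phi) * advantage
    return grad / len(batch)

def descent_step(theta, grad, alpha):
    """Equation 49: step against the gradient because a cost is being minimized."""
    return theta - alpha * grad
```

For example, a sampled action above the mean with a positive cost advantage (worse than average) yields a positive gradient along the active features, so the descent step lowers the mean there.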
Figure 5: Summary of the A-ZEM/ZEV algorithm

5. Numerical Results
To evaluate the performance of the proposed algorithm, two powered descent
examples for Mars pinpoint landing are presented. The spacecraft parameters
for the selected problems are the following:
$$
\begin{aligned}
& g = [-3.7114,\ 0,\ 0]^T\ \mathrm{m/s^2}, \quad m_{dry} = 1505\ \mathrm{kg}, \\
& m_{wet} = 1905\ \mathrm{kg}, \quad I_{sp} = 225\ \mathrm{s}, \quad T = 3.1\ \mathrm{kN}, \\
& T_{min} = 0.3\,T, \quad T_{max} = 0.8\,T, \quad n = 6
\end{aligned}
\tag{51}
$$
Here, $m_{wet}$ and $m_{dry}$ are the wet and dry spacecraft masses, respectively,
and $g$ is the Martian gravity vector, assumed to be constant. Additionally, $n$ is
the number of thrusters, each with a full-throttle capability of $T$ and limited to a
thrust level between $0.3\,T$ and $0.8\,T$ at all times. The thrusters are mounted on the
spacecraft body with a cant angle $\phi$ with respect to the net thrust direction $\hat{T}$.
If $T$ is the magnitude of the instantaneous thrust of each individual thruster, the
instantaneous net thrust is:
$$T_n = n\,T \cos(\phi)\,\hat{T} \tag{52}$$
The A-ZEM/ZEV algorithm was tested on two cases: a 2D case where the
spacecraft is constrained to move in the x-z plane, and a full 3D case. The
guidance gains and the final time of the adaptive algorithm are learned to
ensure safe landing at a selected location on the Martian surface with minimum
fuel. It is assumed that the target point is at the origin of the reference frame
fixed to the Martian surface, and that it must be reached with zero velocity.
In both cases a glide slope constraint is introduced:
$$\theta(t) = \arctan\left(\frac{\sqrt{r_x(t)^2 + r_y(t)^2}}{r_z(t)}\right) \le \theta_{lim} \tag{53}$$
with an angle $\theta_{lim} = 4^\circ$ with respect to the horizon. During the descent,
the gain adaptation must ensure that this constraint is always satisfied. This is
achieved by terminating the episode whenever the agent violates the constraint,
which also leads to an increase in cost. The cost function $C(t)$ that enforces the
constraint while searching for fuel-optimal solutions is defined as follows:
$$C(t) = w_m\, dm_t + \delta(t - T_f)\left[w_{fr} \|r_t - r_f\|^2 + w_{fv} \|v_t - v_f\|^2 + b_f\right] + \delta(t - t_i)\left[w_{ir} \|r_t - r_f\|^2 + b_i\right] \tag{54}$$
where $w_m$, $w_{fr}$, $w_{fv}$ and $w_{ir}$ are weights associated with the burned mass, the
end position and velocity errors and the impact point position error respectively,
$t_i$ and $T_f$ are the time of impact and the final time respectively, and $b_f$ and $b_i$
are biases added at the end of episodes, with $b_i > b_f > 0$.
Importantly, $b_i > b_f$ ensures that the collision-less solution has a lower cost
than a solution that impacts on the constraint. Conversely, $b_f > 0$ ensures that
the value function close to the target does not get too close to 0. The latter
may cause problems during the training phase because the error introduced by the
function approximator might be high relative to the actual value. It is important
to note here that the introduction of the positive bias $b_i$ is what ensures that the
agent is incentivised to look for a collision-less solution, enforcing the constraint
in equation 53.
Setting up the cost is the most difficult part, because the agent can easily fall into
a local minimum when one of the terms in the cost function prevails
over the others. Since the guidance adaptation has to minimize fuel cost without
violating the constraints, careful tuning of the weight values is mandatory.
Here, we decided to add a high bias cost every time an episode ends with
an impact. This ensures that the minimum cost is always achieved with
a collision-less solution. In this fashion, we observed that the algorithm
always tries first to avoid the constraint, and then to lower the fuel consumption.
The introduction of these constant biases leads to discontinuous jumps in the cost
profile as training progresses. The reason is that a trajectory without collisions
has a much lower cost than one that collides with the constraint, since $b_i > b_f$.

Table 1: Initial state distributions
              x (m)    y (m)    z (m)    $\dot{x}$ (m/s)   $\dot{y}$ (m/s)   $\dot{z}$ (m/s)
Nominal 2D    1500     0        1500     100       0         -60
Nominal 3D    -500     -1000    1500     100       -60       -60
Bounds        ±500     ±500     ±0       ±5        ±5        ±5

Table 2: Hyperparameters
$w_m$    $w_{fr}$    $w_{fv}$    $w_{ir}$    $b_i$    $b_f$
0.5      1e-1        1e-1        5e-4        100      10
For both 2D and 3D guidance simulations, the starting position for each
episode was sampled from a uniform distribution around the nominal starting
state. Table 1 shows the initial state distributions for both cases. Table 2
instead shows the values of the hyperparameters used in the definition of the
cost function in equation 54.
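As an illustration of how equation 54 combines with the values in Table 2, the per-step cost can be sketched as below. The Dirac terms are treated as boolean flags for "this is the final step" and "this is the impact step", and the norms are read as squared Euclidean norms; both are assumptions of this sketch, as are the function name and its arguments.

```python
import numpy as np

# Hyperparameters from Table 2.
W_M, W_FR, W_FV, W_IR = 0.5, 1e-1, 1e-1, 5e-4
B_I, B_F = 100.0, 10.0

def step_cost(dm, r, v, r_f, v_f, at_final_time=False, at_impact=False):
    """Cost of one time step: fuel term plus terminal or impact penalties."""
    r, v = np.asarray(r, dtype=float), np.asarray(v, dtype=float)
    r_f, v_f = np.asarray(r_f, dtype=float), np.asarray(v_f, dtype=float)
    cost = W_M * dm                                   # burned mass during the step
    if at_final_time:                                 # episode reached T_f
        cost += W_FR * np.sum((r - r_f) ** 2) + W_FV * np.sum((v - v_f) ** 2) + B_F
    if at_impact:                                     # episode ended on the constraint
        cost += W_IR * np.sum((r - r_f) ** 2) + B_I
    return float(cost)
```

With these values, an episode that ends 5 m from the target at rest is charged 2.5 + 10 on its last step, while an impact at the same distance is charged 0.0125 + 100, which shows how the bias $b_i$ dominates the comparison.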
Additionally, in the 3D case, 25 test trajectories were generated at each iteration
after the policy update step. To evaluate the performance of the current
version of the policy, the initial state is selected from the same distribution used
for the training samples. This slows down the learning process but allows the
algorithm to learn a policy that works for any starting state within the above-mentioned
distribution. In both the 2D and 3D guided descents, the solutions obtained with
A-ZEM/ZEV are compared with the classical ZEM/ZEV solution and with a fuel-optimal
solution obtained with GPOPS [36] (General Pseudospectral OPtimal Control
Software). The 2D guided descent scenario results are reported in Figure 6.
Clearly, the algorithm manages to find a solution that complies with the glide
slope constraint, whereas the classical ZEM/ZEV solution violates it. Here, the
ELM-based critic is trained using 80% of the training set defined in equation 44, and the
rest is used for testing. Note that this step is repeated at each time-step with a
different training set. The number of neurons of the ELM is set to 1/10 of the
number of samples. This has been shown to work well in all situations,
the most challenging ones being iterations where only some of the sample
trajectories hit the constraint while the others arrive at the target. This
condition creates a stiff value function, with neighbouring states having very
different values because of their different future development (one hitting the
constraint, the other reaching the target). Overall, A-ZEM/ZEV has shown good
performance in terms of constraint avoidance. Fuel consumption is considered
in the cost, but there is still room for improvement as the GPOPS fuel-optimal
solution is not reached.
Some details on the training process are shown in Table 3. One can see
that the ELM has a very short learning time compared to the overall
iteration time. This ensures that most of the computational time is spent
generating the sample trajectories and updating the policy.

Table 3: RL performance
Case   No. of iterations   Total training time (hours)   Average iteration time (s)   Average critic training time (s)   Average critic NRMSE$^9$
2D     503                 1.78                          12.72                        7.849e-2                           2.860e-1
3D     804                 20.68                         92.58                        2.727e-1                           1.742e-1
$^9$ Normalized Root Mean Squared Error

The results in terms of fuel consumption are shown in Table 4. As expected, A-ZEM/ZEV
is less fuel-efficient than the optimal GPOPS solution, but it is slightly
more efficient than Classical-ZEM/ZEV while managing
to avoid collision with the constraint. Classical ZEM/ZEV instead reaches the
target but violates the constraint. This highlights the major shortcoming of classical
ZEM/ZEV, namely the impossibility of enforcing path constraints, and shows how our algorithm
overcomes it by making the gains state dependent.

Table 4: Performance comparison
Test case   Algorithm           Mass depleted (kg)   TOF (s)
2D          A-ZEM/ZEV           382.75               84.1
2D          Classical-ZEM/ZEV   385.51               84.1
2D          GPOPS               352.59               64.7
3D          A-ZEM/ZEV           376.54               84.1
3D          Classical-ZEM/ZEV   378.81               84.1
3D          GPOPS               357.25               64.8

(a) Trajectory (b) Position (c) Velocity (d) Norm of the position error with respect to the GPOPS solution.
Figure 6: $r_0 = [1500, 0, 1500]^T$ m, $\dot{r}_0 = [100, 0, -60]^T$ m/s. Note that, although it reaches the
target state, the classical ZEM/ZEV violates the slope constraint. Conversely, as seen in (a),
the Adaptive ZEM/ZEV guidance law does not violate the slope constraint.

(a) Thrust (b) Spacecraft mass
Figure 7: Thrust, mass and guidance gains for 2D case

Figure 8: Cost during training

(a) Trajectory, where ∗ is the constraint violation location (b) Position (c) Velocity
Figure 9: $r_0 = [-500, -1000, 1500]^T$ m, $\dot{r}_0 = [100, -60, -60]^T$ m/s

(a) Thrust (b) Spacecraft mass
Figure 10: Thrust, mass and guidance gains for 3D case

Figure 11: Cost during training

(a) Guidance gains - 2D (b) Guidance gains - 3D
One of the major strengths of the algorithm is its ability to provide
closed-loop guidance that is both close to optimal and compliant with the
constraints. An interesting remark emerges from Figures 12a and 12b, which show the
evolution of the control gains $K_R$ and $K_V$. It is possible to see that the
learning algorithm adjusts the values of the gains according to the constraint
scenario in a way that allows the lander to avoid collisions and get to the target
safely. The power of the method also resides in the fact that the underlying
structure of the ZEM/ZEV guidance allows the algorithm to achieve pinpoint
landing accuracy, as shown in the following section. The TOF ($T_f$) is optimized
by the learning algorithm as a function of the initial state, as explained in Section
4.1.1, and is not modified during the trajectory. The optimal value can be seen
in Table 4. It should be noted that the TOF for the Classical-ZEM/ZEV is
the one obtained after the training process, so it has the same value as the one
for A-ZEM/ZEV. The TOF of the optimal solution is instead optimized by
GPOPS itself.
5.1. Monte-Carlo analysis
A Monte-Carlo analysis was carried out on the 3D case. The objective is
to show that the trained agent is able to perform pinpoint landing with a high
degree of accuracy in terms of both final position and velocity. In this case,
following the procedure described above, the neural network was trained by
selecting the initial state of each sample trajectory from a fairly large uniform
distribution around the nominal start state. In particular, the x and y coordinates are
taken from a distribution with bounds ±500 m, while the z coordinate is kept at
1500 m. The velocity components instead have bounds of ±5 m/s. Figure 12b shows the
distribution of the final position on the ground across 1000 trials after training.
Figure 12c shows the distribution of the final velocity magnitude. The trained policy
clearly manages to drive the spacecraft to the target without ever violating the
constraint and with a high degree of accuracy in terms of position, while
keeping the final velocity below a safe 5 cm/s.

(a) Trials (b) Final position (c) Final velocity
Figure 12: Monte-Carlo analysis

6. Stability Analysis
A guidance algorithm should in general be stable so that it can be safely
used in practice. In the case of ideal, unperturbed dynamics, it has already
been demonstrated that the classical ZEM/ZEV described in Section 3.1 is
stable [26]. In this section we study the stability of the A-ZEM/ZEV algorithm.
6.1. Closed-loop Dynamics
The formulation of the guidance acceleration as a function of ZEM and ZEV
results in a linear, non-autonomous feedback dynamical system. Consequently,
classical linear-system methods of analysis can be employed. The acceleration
command for the generalized ZEM/ZEV, as expressed in Section 3.1, is
$$a_c = \frac{K_R}{t_{go}^2}\,\mathrm{ZEM} + \frac{K_V}{t_{go}}\,\mathrm{ZEV} \tag{55}$$
Considering then that
$$\dot{\mathrm{ZEM}} = -a_c\, t_{go}, \qquad \dot{\mathrm{ZEV}} = -a_c \tag{56}$$
the guidance system can be expressed as follows:
$$\begin{bmatrix} \dot{\mathrm{ZEM}} \\ \dot{\mathrm{ZEV}} \end{bmatrix} = \begin{bmatrix} -\frac{K_R}{t_{go}} & -K_V \\ -\frac{K_R}{t_{go}^2} & -\frac{K_V}{t_{go}} \end{bmatrix} \begin{bmatrix} \mathrm{ZEM} \\ \mathrm{ZEV} \end{bmatrix} \tag{57}$$
In order to study the stability of this Linear Time-Varying (LTV) system, one
can use known properties of linear systems to reduce it to
an equivalent Linear Time-Invariant (LTI) system that is much more convenient
for stability analysis.
6.2. Transformation of LTV systems into LTI systems
Following the work by Wu [33], let us consider a linear time-varying system
$$\dot{x} = A(t)\,x \tag{58}$$
Definition 1. A linear time-varying system of the form of Eq. (58) is said to be
invariable if it can be transformed into a linear time-invariant system of the form
$\dot{z} = Fz$ by some valid transformations (such as the algebraic transformation and
the $t \longleftrightarrow \tau$ transformation defined below), where $F$ is a constant matrix and $z$
may or may not be an explicit function of $t$.
Definition 2. An algebraic transformation is a transformation of states defined
by $x(t) = T(t)\,\bar{x}(t)$, where $T(t)$ is a non-singular matrix for all $t$ and $\dot{T}(t)$ exists.
Definition 3. A $t \longleftrightarrow \tau$ transformation is a transformation of time scale from
$t$ to $\tau$ and is defined by a function of the form $\tau = g(t)$.
The invariable systems can be of two different kinds:
1. Algebraically invariable systems: the LTV systems that can be
transformed into LTI systems by means of an algebraic transformation alone.
2. $\tau$-algebraically invariable systems: the LTV systems that can be
transformed into LTI systems using an algebraic transformation plus the
$t \longleftrightarrow \tau$ transformation.
It can be demonstrated that an LTV system of the form of Eq. (58) is invariable [33].
It is also algebraically invariable if the state transition matrix (STM) of the
system can be found. Unfortunately, in this case, the definition of such an STM
is extremely cumbersome. Consequently, a $t \longleftrightarrow \tau$ transformation must be
employed. The following theorem is valid for $\tau$-algebraically invariable systems.
Theorem. The linear time-varying system
$$\dot{x} = A(t)\,x \tag{59}$$
is $\tau$-algebraically invariable if the STM of the system in Eq. (59) is of the form
$$\Phi(t, t_0) = T(t, t_0) \exp[R\,g(t, t_0)], \qquad T(t_0, t_0) = I \tag{60}$$
where $\dot{g}(t, t_0)$ exists and $t_0$ is chosen so that $g(t_0, t_0) = 0$.
In particular, the algebraic transformation
$$x(t) = T(t, t_0)\,\bar{x}(t) \tag{61}$$
together with the $t \longleftrightarrow \tau$ transformation
$$\tau = g(t, t_0) \tag{62}$$
will transform the system of Eq. (59) into the time-invariant system
$$\dot{z}(\tau) = R\,z(\tau) \tag{63}$$
where $z(\tau) = \bar{x}(t)$ and $\dot{z}(\tau) = dz(\tau)/d\tau$.
Note that by using the definition of $\Phi(t, t_0)$ and the fact that $\exp[R\,g(t, t_0)]$
is non-singular, from Eq. (60) we have:
$$A(t)\,T(t, t_0) - \dot{T}(t, t_0) = T(t, t_0)\,R\,\dot{g}(t, t_0) \tag{64}$$
which means
$$R\,\dot{g}(t, t_0) = T^{-1}\left(A(t)\,T - \dot{T}\right) \tag{65}$$
which will be important in the following section. Once the LTV system has
been transformed into an LTI one, the problem of stability can be addressed. There
are two paths that can be taken in order to prove stability:
• The eigenvalues of the LTI system matrix $R$ are computed; if the real part
of all the eigenvalues is negative or at most 0, the system is stable.
• The State Transition Matrix (STM) of the original LTV system is
retrieved. If it is possible to demonstrate that it is bounded at all times,
then the system is stable.
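6.3. Stability of the A-ZEM/ZEV algorithm
For the A-ZEM/ZEV guidance, the matrix $A$ is the following:
$$A = \begin{bmatrix} -\frac{K_R}{t_{go}} & -K_V \\ -\frac{K_R}{t_{go}^2} & -\frac{K_V}{t_{go}} \end{bmatrix} \tag{66}$$
Using Eq. (65) with
$$T = \begin{bmatrix} 1 & 0 \\ 0 & \frac{t_f}{t_{go}} \end{bmatrix} \tag{67}$$
the system in Eq. (66) can be transformed into the algebraically equivalent system,
as follows:
$$R\,\dot{g}(t, t_0) = \begin{bmatrix} -K_R & -K_V t_f \\ -\frac{K_R}{t_f} & -(K_V + 1) \end{bmatrix} \frac{1}{t_{go}} \tag{68}$$
with
$$R = \begin{bmatrix} -K_R & -K_V t_f \\ -\frac{K_R}{t_f} & -(K_V + 1) \end{bmatrix} \tag{69}$$
and
$$\dot{g}(t, t_0) = \frac{1}{t_{go}} \tag{70}$$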
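The intermediate algebra behind equations 68 to 70 is not spelled out in the text. A worked version is sketched here, assuming $t_{go} = t_f - t$ (so that $\dot{t}_{go} = -1$) and gains treated as constant over the interval considered:

$$\dot{T} = \begin{bmatrix} 0 & 0 \\ 0 & \frac{t_f}{t_{go}^2} \end{bmatrix}, \qquad A\,T - \dot{T} = \begin{bmatrix} -\frac{K_R}{t_{go}} & -\frac{K_V t_f}{t_{go}} \\ -\frac{K_R}{t_{go}^2} & -\frac{(K_V + 1)\, t_f}{t_{go}^2} \end{bmatrix}$$

$$T^{-1}\left(A\,T - \dot{T}\right) = \begin{bmatrix} 1 & 0 \\ 0 & \frac{t_{go}}{t_f} \end{bmatrix} \begin{bmatrix} -\frac{K_R}{t_{go}} & -\frac{K_V t_f}{t_{go}} \\ -\frac{K_R}{t_{go}^2} & -\frac{(K_V + 1)\, t_f}{t_{go}^2} \end{bmatrix} = \frac{1}{t_{go}} \begin{bmatrix} -K_R & -K_V t_f \\ -\frac{K_R}{t_f} & -(K_V + 1) \end{bmatrix}$$

The constant matrix on the right is $R$ and the scalar factor is $\dot{g}$. Its trace and determinant are

$$\operatorname{tr} R = -(K_R + K_V + 1), \qquad \det R = K_R (K_V + 1) - K_R K_V = K_R$$

so the characteristic polynomial is $\lambda^2 + (K_R + K_V + 1)\,\lambda + K_R = 0$.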
The resulting system is simpler but still dependent on time. In order to make
it time-invariant, the $t \longleftrightarrow \tau$ transformation must be applied. The time basis
transformation is
$$\tau = g(t, t_0) = \int_{t_0}^{t} \frac{1}{t_{go}(t')}\, dt' = -\log\frac{t_{go}}{t_f} \tag{71}$$
where $t_0$ has been chosen so that $g(t_0, t_0) = 0$. With this transformation, the
system is now an LTI system
$$\dot{z}(\tau) = R\,z(\tau) \tag{72}$$
with system matrix $R$. The stability of the system can be proven by finding the
eigenvalues of this matrix and proving that they have negative real part at all times.
The $R$ matrix in Eq. (69) has eigenvalues
$$\lambda_{1,2} = \frac{-K \pm \sqrt{K^2 - 4K_R}}{2}, \qquad K = K_R + K_V + 1 \tag{73}$$
The stability conditions can be found in two cases: $\Delta \ge 0$ and $\Delta < 0$, where
$\Delta = K^2 - 4K_R$.
Case 1: When $\Delta \ge 0$. The condition $\Delta \ge 0$ translates into
$$K_V^2 + K_R^2 + 2K_R K_V + 2K_V - 2K_R + 1 \ge 0 \tag{74}$$
and means that the eigenvalues are purely real. Eq. (74) must hold in order for
the following stability condition to apply:
$$-K \pm \sqrt{\Delta} < 0 \tag{75}$$
or
$$\begin{aligned} -K + \sqrt{\Delta} < 0 \ &\rightarrow\ K > \sqrt{\Delta} \\ -K - \sqrt{\Delta} < 0 \ &\rightarrow\ K > -\sqrt{\Delta} \end{aligned} \tag{76}$$
which means that, since $\sqrt{\Delta} \ge 0$, the condition for stability is
$$K > \sqrt{\Delta} = \sqrt{K^2 - 4K_R} \tag{77}$$
Case 2: When $\Delta < 0$. If $\Delta < 0$, so that
$$K_V^2 + K_R^2 + 2K_R K_V + 2K_V - 2K_R + 1 < 0 \tag{78}$$
the eigenvalues have a real and an imaginary part. In order for the system to
be stable, the real part must be negative. So in this case the stability condition
is simply
$$-K < 0 \ \rightarrow\ K > 0 \ \rightarrow\ K_R + K_V + 1 > 0 \tag{79}$$
These two conditions offer a quick way to check the instantaneous stability of
the guidance algorithm. The check can be added as a step inside the control
loop at a relatively low computational cost, and action can be taken if the
algorithm enters unstable regions. To verify that the algorithm remains stable
in our test cases, the eigenvalues were computed along the descent trajectories
in both the 2D and 3D cases, and the results are reported in Figure 13. It
is clear that the real parts of the eigenvalues all remain strictly negative, which
ensures stability.

(a) Case 1: 2D (b) Case 2: 3D
Figure 13: Eigenvalues during A-ZEM/ZEV-guided powered descent
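A sketch of the in-loop check mentioned above is given here. It evaluates the eigenvalues of equation 73 with complex arithmetic so that both cases ($\Delta \ge 0$ and $\Delta < 0$) are handled by the same test on the real parts. The function names are hypothetical, and the gains 6 and -2 in the usage note are the classical ZEM/ZEV values, used only as an example.

```python
import cmath

def closed_loop_eigenvalues(k_r, k_v):
    """Eigenvalues of R: lambda = (-K +/- sqrt(K^2 - 4 K_R)) / 2 with K = K_R + K_V + 1."""
    k = k_r + k_v + 1.0
    root = cmath.sqrt(k * k - 4.0 * k_r)   # complex square root covers Delta < 0
    return (-k + root) / 2.0, (-k - root) / 2.0

def is_instantaneously_stable(k_r, k_v):
    """True when both eigenvalues have strictly negative real part."""
    return all(lam.real < 0.0 for lam in closed_loop_eigenvalues(k_r, k_v))
```

For $K_R = 6$ and $K_V = -2$ the eigenvalues are $-2$ and $-3$, so the check passes; for $K_R = -1$ and $K_V = 0$ they are $\pm 1$ and the check fails.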
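As stated in Section 6.2, another way of addressing the stability of the closed-loop
dynamics is to employ the State Transition Matrix (STM). In particular,
the STM of the LTV system must be bounded at all times in order for it to
be stable. The state transition matrix of the LTV system is
derived from the STM of the LTI system. Letting $\Phi^*$ be the STM of the LTI
system, the STM of the original LTV system is $\Phi(t, t_0) = T\,\Phi^*$. According
to [35], the STM of a linear system can be found from the knowledge of its
eigenvalues and eigenvectors as
$$\Phi(t, t_0) = M e^{\Lambda t} M^{-1} \tag{80}$$
where
$$M = \begin{bmatrix} m_1 \mid m_2 \mid \dots \mid m_n \end{bmatrix} \tag{81}$$
is the matrix whose columns are the eigenvectors and
$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n \end{bmatrix} \tag{82}$$
is the diagonal matrix of the eigenvalues. In this case, applying these definitions
to the algebraically equivalent system in Eq. (68) and then the $T$ transformation, the
STM of the original system turns out to be:
$$\Phi = \begin{bmatrix} \Phi_{11} & \Phi_{12} \\ \Phi_{21} & \Phi_{22} \end{bmatrix} \tag{83}$$
with
$$\Phi_{11} = \frac{\lambda_1 + K_R}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_2} - \frac{\lambda_2 + K_R}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_1} \tag{84}$$
$$\Phi_{12} = -\frac{K_V t_f}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_1} + \frac{K_V t_f}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_2} \tag{85}$$
$$\Phi_{21} = \frac{\lambda_1 + K_R}{K_V t_f}\, \frac{\lambda_2 + K_R}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_1 - 1} - \frac{\lambda_2 + K_R}{K_V t_f}\, \frac{\lambda_1 + K_R}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_2 - 1} \tag{86}$$
$$\Phi_{22} = \frac{\lambda_1 + K_R}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_1 - 1} - \frac{\lambda_2 + K_R}{\lambda_1 - \lambda_2} \left(\frac{t_{go}}{t_f}\right)^{-\lambda_2 - 1} \tag{87}$$
where
$$\lambda_{1,2} = \frac{-K \pm \sqrt{K^2 - 4K_R}}{2}, \qquad K = K_R + K_V + 1 \tag{88}$$
According to the theory of non-autonomous linear systems, the LTV system in equation
58 is stable if and only if the STM in equation 83 is bounded at all times $0 \le t \le t_f$. The
STM in equation 83 was computed along the entire guided descent trajectories
for both the 2D and 3D cases. Figure 14 shows the STM for two sampled guided
trajectories. It is clear that the components of the STM are bounded at all
times, so we can conclude that, at least in these cases, the algorithm is stable.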
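The closed-form components of equations 84 to 87 can be evaluated numerically as in the following sketch. It assumes gains held constant at the values passed in, distinct eigenvalues, $K_V \neq 0$ and $t_{go} > 0$; the function name and the example values are illustrative only.

```python
import cmath

def stm_components(k_r, k_v, t_f, t_go):
    """Return (Phi11, Phi12, Phi21, Phi22) for constant gains at a given time-to-go."""
    k = k_r + k_v + 1.0
    root = cmath.sqrt(k * k - 4.0 * k_r)
    lam1, lam2 = (-k + root) / 2.0, (-k - root) / 2.0
    if lam1 == lam2 or k_v == 0.0:
        raise ValueError("closed form needs distinct eigenvalues and K_V != 0")
    s = t_go / t_f                       # equals 1 at t0 and tends to 0 at landing
    d = lam1 - lam2
    a1, a2 = lam1 + k_r, lam2 + k_r
    phi11 = a1 / d * s ** (-lam2) - a2 / d * s ** (-lam1)
    phi12 = k_v * t_f / d * (s ** (-lam2) - s ** (-lam1))
    phi21 = a1 * a2 / (k_v * t_f * d) * (s ** (-lam1 - 1) - s ** (-lam2 - 1))
    phi22 = a1 / d * s ** (-lam1 - 1) - a2 / d * s ** (-lam2 - 1)
    return phi11, phi12, phi21, phi22
```

At $t_{go} = t_f$ the function returns the identity, as required of a state transition matrix at the initial time. Boundedness can then be checked by evaluating the components on a grid of $t_{go}$ values and inspecting their magnitudes.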
(a) Case 1: 2D (b) Case 2: 3D
Figure 14: State transition matrix components

7. Conclusions and future efforts
This work has presented a novel closed-loop spacecraft guidance algorithm for
soft landing. Using machine learning, in particular an actor-critic algorithm
based on policy gradient with advantage function estimation, it has been
possible to expand the capabilities of classical ZEM/ZEV feedback guidance. The
resulting adaptive algorithm (A-ZEM/ZEV) improves performance in terms
of fuel consumption and allows path constraints to be implemented directly in
the guidance law. The actor-critic algorithm outputs a trained guidance neural
network that can be implemented directly on board. Provided that the
spacecraft is equipped with the appropriate set of sensors for state determination, the
guidance is recovered as a function of the state in real time and can be followed by
the control system, allowing for a completely autonomous soft landing maneuver
in the presence of path constraints. Moreover, the stability of the method has been
addressed: by transforming the LTV system into an LTI one, it has been possible
to easily check for instability conditions, either during post-processing or, more
importantly, online in the control loop. This ensures that countermeasures can
be applied if unstable regions are visited.
From a machine learning perspective, ELMs have been shown to work well as the critic
in an actor-critic algorithm. This shows that when the task is a simple regression
problem, as in this case, deep learning is not needed. A shallow network is
enough, and the one-step learning process of ELMs makes them well suited to the
task. They scale well with input dimension, achieving good accuracy with a
negligible computational time compared with the total iteration time.
The work presented in this paper shows the advantages of using machine
learning to learn state-dependent parameters in an existing parametrized
policy. In this case, using ZEM/ZEV guidance as a baseline allows for extremely
precise terminal guidance with the increased flexibility given by the use of
reinforcement learning. This can be extended to other guidance problems that
rely on a parametrized law. As long as the state space can be discretized
effectively, this ELM-based actor-critic algorithm can be applied. Convergence of
the method is guaranteed by the fact that the advantage function estimation is
unbiased and the cost function is set up correctly, but convergence is still slow.
Work can be done to improve the learning performance. For example,
meta-learning could be used to learn different sequential tasks in order to speed up
the process, especially if the environmental conditions change (e.g. actuator or
sensor failure). Meta-learning could also be used to create an algorithm that is
more robust to mis-modelled dynamics. By learning over a distribution of MDPs
corresponding to different randomized instances of the environment, the agent
should in fact be able to adapt to a wider set of environmental parameters,
ultimately improving the performance in a more realistic setup.
References
[1] Shotwell, Robert, “Phoenix—the first Mars Scout mission” Acta Astro-
nautica, Vol. 57, No. 2-8, pp. 121-134, 2005.
[2] Grotzinger, John P and Crisp, Joy and Vasavada, Ashwin R and Anderson,
Robert C and Baker, Charles J and Barry, Robert and Blake, David
F and Conrad, Pamela and Edgett, Kenneth S and Ferdowski, Bobak,
“Mars Science Laboratory mission and science investigation” Space Science
Reviews, Vol. 170, No. 1-4, pp. 5-56, 2012.
[3] Burns, Jack O and Mellinkoff, Benjamin and Spydell, Matthew and Fong,
Terrence and Kring, David A and Pratt, William D and Cichan, Timothy
and Edwards, Christine M, “Science on the lunar surface facilitated by
low latency telerobotics from a Lunar Orbital Platform-Gateway” Acta
Astronautica, 2018.
[4] Steltzner, Adam D and Burkhart, P Dan and Chen, Allen and Comeaux,
Keith A and Guernsey, Carl S and Kipp, Devin M and Lorenzoni, Leila V
and Mendeck, Gavin F and Powell, Richard W and Rivellini, Tommaso P
and others, “Mars science laboratory entry, descent, and landing system
overview” Pasadena, CA: Jet Propulsion Laboratory, National Aeronautics
and Space Administration, 2010.
[5] Klumpp, Allan R, “Apollo lunar descent guidance” Automatica, Vol. 10,
No. 2, pp. 133-146, 1974.
[6] Singh, Gurkirpal and SanMartin, Alejandro M and Wong, Edward C ,
“Guidance and control design for powered descent and landing on Mars”
Aerospace Conference, 2007 IEEE, pp. 1-8, 2007.
[7] Acikmese, B. and Ploen, S.R., “Convex programming approach to powered
descent guidance for mars landing” Journal of Guidance, Control,
and Dynamics, Vol. 30, No. 5, pp. 1353-1366, 2007.
[8] Blackmore, Lars and Acikmese, Behcet and Scharf, Daniel P, “Minimum-
landing-error powered-descent guidance for Mars landing using convex
optimization” Journal of Guidance, Control, and Dynamics, Vol. 33, No. 4,
pp. 1161-1171, 2010.
[9] Acikmese, Behcet and Carson, John M and Blackmore, Lars, “Lossless
convexification of nonconvex control bound and pointing constraints of
the soft landing optimal control problem” IEEE Transactions on Control
Systems Technology, Vol. 21, No. 6, pp. 2104-2113, 2013.
[10] Trawny, Nikolas and Benito, Joel and Tweddle, Brent E and Bergh,
Charles F and Khanoyan, Garen and Vaughan, Geoffrey and Zheng, Jason
and Villalpando, Carlos and Cheng, Yang and Scharf, Daniel P and others,
“Flight testing of terrain-relative navigation and large-divert guidance
on a VTVL rocket” AIAA SPACE 2015 Conference and Exposition, pp.
4418, 2015.
[11] Liu, Xinfu and Lu, Ping and Pan, Binfeng , “Survey of convex optimiza-
tion for aerospace applications” Astrodynamics, Vol. 1, No. 1, pp. 23-40,
2017.
[12] Lu, P., “Propellant-Optimal Powered Descent Guidance” Journal of
Guidance, Control, and Dynamics, Vol. 41, No. 4, pp. 813-826, 2017.
[13] Lu, Ping and Sostaric, Ronald R and Mendeck, Gavin F , “Adaptive
Powered Descent Initiation and Fuel-Optimal Guidance for Mars Appli-
cations” 2018 AIAA Guidance, Navigation, and Control Conference, pp.
0616, 2018.
[14] Lu, Ping and Pan, Binfeng , “Highly constrained optimal launch ascent
guidance” Journal of Guidance, Control, and Dynamics, Vol. 33, No. 2,
pp. 404-414, 2010.
[15] Lu, Ping and Griffin, Brian J and Dukeman, Gregory A and Chavez, Frank
R, “Rapid optimal multiburn ascent planning and guidance” Journal of
Guidance, Control, and Dynamics, Vol. 31, No. 6, pp. 1656-1664, 2008.
[16] Lu, Ping and Forbes, Stephen and Baldwin, Morgan , “A versatile pow-
ered guidance algorithm” AIAA Guidance, Navigation, and Control Con-
ference, pp. 4843, 2012.
[17] Guo, Yanning and Hawkins, Matt and Wie, Bong, “Waypoint-optimized
zero-effort-miss/zero-effort-velocity feedback guidance for Mars landing”
Journal of Guidance, Control, and Dynamics, Vol. 36, No. 3, pp. 799-809,
2013.
[18] Pinson, R. and Lu, P. , “Trajectory design employing convex optimization
for landing on irregularly shaped asteroids”, AIAA/AAS Astrodynamics
Specialist Conference, 2016.
[19] Zhang, C. and Topputo, F. and Bernelli-Zazzera, F. and Zhao, Y., “Low-
thrust minimum-fuel optimization in the circular restricted three-body
problem” Journal of Guidance, Control, and Dynamics, Vol. 38, No. 8,
pp. 1501–1510, 2018.
[20] Lu, P. and Liu, X., “Autonomous trajectory planning for rendezvous and
proximity operations by conic optimization” Journal of Guidance, Con-
trol, and Dynamics, Vol. 36, No. 2, pp. 375-389, 2013.
[21] Rao, A.V, “A survey of numerical methods for optimal control” Advances
in the Astronautical Sciences, Vol. 135, No. 1, pp. 497-528, 2009.
[22] Betts, J.T, “Practical methods for optimal control and estimation using
nonlinear programming” Siam, Vol. 19, 2010.
[23] Guo, Y. and Hawkins, M. and Wie, B. , “Applications of generalized zero-
effort-miss/zero-effort-velocity feedback guidance algorithm” Journal of
Guidance, Control, and Dynamics, Vol. 36, No. 3, pp. 810-820, 2013.
[24] Guo, Y. and Hawkins, M. and Wie, B. , “Optimal feedback guidance
algorithms for planetary landing and asteroid intercept” AAS/AIAA as-
trodynamics specialist conference, Vol. 36, No. 3, pp. 588, 2011.
[25] Furfaro, R. and Linares, R. , “Waypoint-Based Generalized ZEM/ZEV
Feedback Guidance for Planetary Landing via a Reinforcement Learning
Approach” 3rd IAA Conference on Dynamics and Control of Space Sys-
tems, Moscow, Russia, 2017.
[26] Furfaro, R. and Wibben, R. D. , “Robustification of a class of guid-
ance algorithms for planetary landing: Theory and applications” 26th
AAS/AIAA Space Flight Mechanics Meeting, 2016. Univelt Inc., 2016.
[27] Huang, G. B. , “What are extreme learning machines? Filling the gap
between Frank Rosenblatt’s dream and John von Neumann’s puzzle” Cog-
nitive Computation 7.3, 2015.
[28] Huang, G. B. and Wang, D. H. and Lan, Y. , “Extreme learning machines:
a survey” International Journal of Machine Learning and Cybernetics, Vol.
2, No. 2, pp. 107-122, 2011.
[29] Silver, D. and Lever, G. and Heess, N. and Degris, T. and Wierstra, D.
and Riedmiller, M. , “Deterministic Policy Gradient Algorithms” ICML,
2014.
[30] Sutton, R. S. and Barto, A. G., “Reinforcement learning: An introduction”
Cambridge: MIT Press, Vol. 1, No. 1, 1998.
[31] Sutton, R. S. and McAllester, D. and Singh, S. and Mansour, Y. , “Policy
gradient methods for reinforcement learning with function approximation”
Advances in neural information processing systems, 2000.
[32] Williams, R. J., “Reinforcement learning” Springer, 1992.
[33] Wu, M. , “Transformation of a linear time-varying system into a linear
time-invariant system” International Journal of Control, Vol. 24, No. 4,
pp. 589-602, 1978.
[34] Hagan, M. T. and Menhaj, M. B. , “Training feedforward networks with
the Marquardt algorithm” IEEE Transactions on Neural Networks, Vol. 5,
No. 6, pp. 989-993, 1994.
[35] Rowell, D. , “Time-domain solution of LTI state equations” Class Handout
in Analysis and Design of Feedback Control System, 2002.
[36] Patterson, Michael A., and Anil V. Rao. “GPOPS-II: A MATLAB
software for solving multiple-phase optimal control problems using
hp-adaptive Gaussian quadrature collocation methods and sparse nonlinear
programming” ACM Transactions on Mathematical Software (TOMS)
41.1 (2014): 1-37.