
Global Convergence Using Policy Gradient Methods for Model-free Markovian Jump Linear Quadratic Control

Manoj Bhadu*,1, Santanu Rathod*,1, Abir De1
1 IIT Bombay   * equal contribution

December 1, 2021

Abstract

Owing to the growth of interest in Reinforcement Learning in the last few years, gradient-based policy control methods have been gaining popularity for control problems as well. And rightly so, since policy gradient methods have the advantage of optimizing a metric of interest in an end-to-end manner, along with being relatively easy to implement without complete knowledge of the underlying system. In this paper, we study the global convergence of gradient-based policy optimization methods for quadratic control of discrete-time, model-free Markovian jump linear systems (MJLS). We surmount the myriad challenges that arise because the system has multiple operating modes and its dynamics are unknown, and we show global convergence of the policy under gradient descent and natural policy gradient methods. We also provide simulation studies to corroborate our claims.

1 Introduction

Oftentimes, in reality, control systems do not behave as they are theoretically modelled, and sometimes the changes are too abrupt for the system to be expected to follow a fixed prior behaviour: component failures and repairs, environmental disturbances, changes in subsystem interconnections, or changes in network-based models such as air transportation or disease epidemics [13] due to some confounders, etc. In some cases these systems can be modelled by discrete-time linear systems whose state transitions are driven by an underlying Markov chain. For instance, in ship steering [15] the ship dynamics vary according to the speed, which can be measured from appropriate speed sensors, and autopilots for these systems can be improved by taking these changes into account. MJLS as a class of control problems has


been widely studied [6], [7], [5], with a wide array of practical applications; for example, [8] models the time-varying networks and switching topologies of networked systems such as air transportation via discrete-time, positive Markov jump linear systems. Often in these systems the jump probabilities are assumed to be known, although not when, if at all, the jumps will occur [2]; in subsequent sections we also entertain cases without this assumption. Another main question that arises is whether or not the operation mode ω(t) is known; oftentimes it can be observed by placing appropriate sensors. Having said that, [9] explores the scenario of no mode observation in MJLS. In this work we assume that the operation mode ω(t) can be observed.

Also, knowing the system dynamics beforehand might not be the fastest or most autonomous way to go about finding the optimal policy, especially in systems where the states alter dynamically. In this paper we show that the optimal policy can be obtained simply by gathering data and statistics, and hence in some sense more autonomously. [26] shows model-free convergence guarantees for LQR systems, but those results cannot be trivially extended to MJLS because the dynamics switch between modes instead of being static. [1] explores MJLS policy convergence when the system parameters are known, and thus our paper tries to bridge the gap between model-given MJLS [1] and model-free LQR [26].

1.1 Our work

We start by introducing the notation and conditions that we use throughout the paper, and then give Algorithm 1, which provides a procedure to estimate gradients and the state correlation matrix without knowing anything about the state dynamics. Before stating the main policy convergence results we convey some intermediary key ideas that help us get there. Primarily, we establish that (a) for a sufficient time horizon the cost function and the state correlation matrix can be estimated nearly accurately, and (b) slight perturbations in the policy do not result in large variations in the estimated gradients or correlation matrix, and the variations are bounded. We also show that for a given policy the estimates of the gradients and the correlation matrix remain close when the transition probability matrices differ by a small ε_p. These intermediary results, together with a few other arguments given in the Appendix, then help us prove policy convergence using gradient descent and natural gradient descent, which is our main result.

1.2 Summary of contributions

• We provide an algorithm for estimating the gradient and the state correlation matrix used in gradient descent and natural gradient descent policy iterations in the model-free MJLS case.

• We prove intermediary results showing that a sufficient time horizon is enough for a good estimate of the concerned quantities, and that variations resulting from slight perturbations in the policy and the transition probabilities can be bounded.

• We then prove policy convergence using both gradient descent and natural gradient techniques.

• We also provide simulation studies to corroborate our convergence claims.

1.3 Related works

Recently there have been several advances in deep Reinforcement Learning, such as AlphaGo [25], with the over-arching goal of a general learning agent, but those techniques are not theoretically well understood, and this lack of firm theoretical guarantees might prove disastrous as ML systems become more ambitious. Hence there has been a growing body of work trying to solve well-studied control problems like LQR using policy gradient techniques [26], which are comparatively easier to analyse and admit provable guarantees of convergence to the optimal policy. As an extension, one of our primary motivations to study MJLS from a model-free policy gradient perspective was to understand the theoretical challenges better and more generally, since the LTI LQR is but a special case of the MJLS LQR with Ns = 1, i.e. a single operating mode. [1] studies the MJLS problem from a model-based point of view and provides a novel Lyapunov argument for the stability of gradient-based policy iterations and their convergence. As ML evolves to tackle more ambitious tasks it is imperative that systems are able to progress and learn using data-driven approaches on the go, which calls for attention to model-free settings where one does not have access to the system/cost dynamics and is yet supposed to obtain the optimal policy using whatever sampled data is available from the environment. Along this line of thinking, apart from [26], there have been studies like [22], which solves for the optimal LQR controller when the dynamics are unknown using a multi-stage procedure called Coarse-ID control that estimates a model from a few experimental trials and estimates the error in that model with respect to the truth. [24] follows a DP approach, wherein the Q-function is estimated using RLS (recursive least squares) and is used to prove convergence. Unlike LQR model-based control settings, which have been well studied classically [16], [17], with bandit approaches [18], or in an online manner [21], the extension of these techniques to model-free settings has not been as prolific, due to the inherent difficulties one faces without knowledge of the system dynamics. Studying the MJLS LQR without knowledge of the system dynamics is harder still, due to the jump parameter ω(t) and the coupling between the state/input matrices and ω(t). Thus, finding proper gradient estimates and proving convergence in the MJLS setting becomes a non-trivial task. To that end we provide results in the following sections to tackle those problems.


2 Preliminaries and Background

2.1 Notation

We denote the set of real numbers by R. Let A be a matrix; we use the notation A^T, ‖A‖, tr(A), σ_min(A), and ρ(A) to denote its transpose, maximal singular value, trace, minimum singular value, and spectral radius respectively. Given matrices {D_i}_{i=1}^m, let diag(D_1, ..., D_m) denote the block diagonal matrix whose (i, i)-th block is D_i. We use vec(A) to denote the vectorization of the matrix A. Positive definiteness and positive semi-definiteness of a symmetric matrix Z are denoted by Z ≻ 0 and Z ⪰ 0.

We now define some notation motivated by the MJLS literature, along with operators that we will use in this paper. Let M^{Ns}_{n×m} denote the space made up of all Ns-tuples of real matrices V = (V_1, ..., V_{Ns}) with V_i ∈ R^{n×m}, i ∈ {1, ..., Ns}. For V = (V_1, ..., V_{Ns}) ∈ M^{Ns} we define:

‖V‖_1 := Σ_{i=1}^{Ns} ‖V_i‖,   ‖V‖_2^2 := Σ_{i=1}^{Ns} tr(V_i^T V_i),
‖V‖_max := max_{i=1,...,Ns} ‖V_i‖,   Λ_min(V) := min_{i=1,...,Ns} σ_min(V_i),
trace(V) := Σ_{i=1}^{Ns} tr(V_i).

For V, S ∈ M^{Ns}, their inner product is defined as

⟨V, S⟩ := Σ_{i=1}^{Ns} tr(V_i^T S_i).

We also introduce two operators, F_K(V) and T_K(V), which make the analysis easier and will be used while proving intermediary results:

F_K(V) = (F_{K,1}(V), ..., F_{K,Ns}(V)),   F_{K,j}(V) = Σ_{i=1}^{Ns} p_ij (A_i − B_i K_i) V_i (A_i − B_i K_i)^T,
T_K(V) = Σ_{t=0}^∞ F_K^t(V).
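Since the operator F_K drives most of the analysis that follows, it may help to see it concretely. The short NumPy sketch below is our own illustration (the names fk_matrix and is_mean_square_stable, and the Kronecker-product representation, are assumptions of this sketch rather than anything prescribed above): it builds F_K as a single matrix acting on the stacked vectorizations vec(V_1), ..., vec(V_Ns), and T_K(V) = Σ_t F_K^t(V) converges exactly when that matrix has spectral radius below one, which is the mean square stability test used later.

```python
import numpy as np

def fk_matrix(A, B, K, P):
    """Matrix form of F_K: F_{K,j}(V) = sum_i p_ij (A_i - B_i K_i) V_i (A_i - B_i K_i)^T,
    acting on the stacked vectorizations of (V_1, ..., V_Ns)."""
    Ns, d = len(A), A[0].shape[0]
    Gamma = [A[i] - B[i] @ K[i] for i in range(Ns)]            # closed-loop matrices
    M = np.zeros((Ns * d * d, Ns * d * d))
    for j in range(Ns):
        for i in range(Ns):
            # vec(Gamma_i V_i Gamma_i^T) = (Gamma_i kron Gamma_i) vec(V_i)
            M[j*d*d:(j+1)*d*d, i*d*d:(i+1)*d*d] = P[i, j] * np.kron(Gamma[i], Gamma[i])
    return M

def is_mean_square_stable(A, B, K, P):
    """K is mean square stabilizing iff rho(F_K) < 1, so that T_K = sum_t F_K^t converges."""
    return np.max(np.abs(np.linalg.eigvals(fk_matrix(A, B, K, P)))) < 1.0
```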

2.2 Markovian Jump Linear Systems

A Markovian jump linear system (MJLS) is governed by the following discrete-time state-space model:

x_{t+1} = A_{ω(t)} x_t + B_{ω(t)} u_t,   (1)

where x_t ∈ R^d is the system state and u_t ∈ R^k is the control action. The initial state x_0 is assumed to have a distribution D. The system matrices A_{ω(t)} ∈ R^{d×d} and B_{ω(t)} ∈ R^{d×k} depend on the switching parameter ω(t), which takes values in Ω := {1, ..., Ns}. We denote A = (A_1, ..., A_{Ns}) ∈ M^{Ns}_{d×d} and B = (B_1, ..., B_{Ns}) ∈ M^{Ns}_{d×k}. The jump parameter {ω(t)}_{t=0}^∞ comes from a time-homogeneous Markov chain whose transition probabilities are given by

p_ij = P(ω(t+1) = j | ω(t) = i).   (2)

Let P be the probability transition matrix whose (i, j)-th entry is p_ij, where p_ij ≥ 0 and Σ_{j=1}^{Ns} p_ij = 1. The initial distribution of ω(0) is given by π = [π_1, ..., π_{Ns}]^T, with Σ_{i=1}^{Ns} π_i = 1. Moreover, we assume that the system (1) can be mean square stabilized. In this work we focus on quadratic optimal control, where the objective is to choose control actions {u_t}_{t=0}^∞ to minimize the cost function

C = E_{x_0∼D, ω(0)∼π} [ Σ_{t=0}^∞ ( x_t^T Q_{ω(t)} x_t + u_t^T R_{ω(t)} u_t ) ].   (3)

Throughout the paper it is assumed that Q = (Q_1, ..., Q_{Ns}) ≻ 0 and R = (R_1, ..., R_{Ns}) ≻ 0, that π_i > 0, so that there is a non-zero probability of the system starting from each mode, and that E_{x_0∼D}[x_0 x_0^T] ≻ 0, so that the expected covariance of the initial state is full rank. These assumptions are quite standard in learning-based control and can be summarized as the persistent excitation condition from the system identification literature. The above problem can be called the "MJLS LQR", as in [1]. The optimal controller for this problem, defined by the dynamics (1), transition probabilities (2), and cost (3), can be obtained by solving a system of coupled Algebraic Riccati Equations (AREs) [2].
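As a concrete illustration of the model (1)-(3), the following minimal NumPy sketch simulates one trajectory under a linear switching policy and accumulates the quadratic cost; it assumes A, B, Q, R, K are lists of per-mode arrays and P is the row-stochastic transition matrix (the function and argument names are ours).

```python
import numpy as np

def rollout_cost(A, B, Q, R, K, P, pi0, x0, horizon, rng):
    """Simulate x_{t+1} = A_{w(t)} x_t + B_{w(t)} u_t with u_t = -K_{w(t)} x_t,
    the mode w(t) following the Markov chain P, and accumulate the cost (3)."""
    Ns = len(A)
    w = rng.choice(Ns, p=pi0)              # initial mode drawn from pi
    x, cost = x0.copy(), 0.0
    for _ in range(horizon):
        u = -K[w] @ x
        cost += x @ Q[w] @ x + u @ R[w] @ u
        x = A[w] @ x + B[w] @ u
        w = rng.choice(Ns, p=P[w])         # next mode
    return cost
```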

Now, it is well known that the optimal cost can be achieved by a linear feedback of the form

u_t = −K_{ω(t)} x_t,   (4)

with K = (K_1, ..., K_{Ns}) ∈ M^{Ns}_{k×d}. Combining the linear policy (4) with (1), we obtain the closed-loop dynamics

x_{t+1} = (A_{ω(t)} − B_{ω(t)} K_{ω(t)}) x_t = Γ_{ω(t)} x_t,   (5)

with Γ = (Γ_1, ..., Γ_{Ns}) ∈ M^{Ns}_{d×d}. Note that using (4) we can rewrite the cost (3) as

C = E_{x_0∼D, ω(0)∼π} [ Σ_{t=0}^∞ x_t^T ( Q_{ω(t)} + K_{ω(t)}^T R_{ω(t)} K_{ω(t)} ) x_t ].

To find the optimal policy {K_i^*}_{i∈Ω}, we first define an operator E : M^{Ns}_{d×d} → M^{Ns}_{d×d} as E(V) := (E_1(V), ..., E_{Ns}(V)), where V = (V_1, ..., V_{Ns}) ∈ M^{Ns}_{d×d} and E_i(V) := Σ_{j=1}^{Ns} p_ij V_j. Now let {P_i}_{i∈Ω} be the unique positive definite solution to the coupled AREs

P_i = Q_i + A_i^T E_i(P) A_i − A_i^T E_i(P) B_i ( R_i + B_i^T E_i(P) B_i )^{-1} B_i^T E_i(P) A_i.   (6)

It can then be shown that the linear state feedback controller minimizing the cost function (3) is given by

K_i^* = ( R_i + B_i^T E_i(P) B_i )^{-1} B_i^T E_i(P) A_i.   (7)
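When the model is known, one simple way to obtain K* numerically is to iterate the coupled AREs (6) to a fixed point and then apply (7). The sketch below does exactly that; the fixed-point iteration, and its convergence for a mean-square-stabilizable system, are assumptions of this illustration rather than a claim made above.

```python
import numpy as np

def solve_coupled_are(A, B, Q, R, P, iters=500):
    """Fixed-point iteration on the coupled AREs (6); returns the tuple P and K* from (7)."""
    Ns = len(A)
    Pk = [np.copy(Q[i]) for i in range(Ns)]
    for _ in range(iters):
        EP = [sum(P[i, j] * Pk[j] for j in range(Ns)) for i in range(Ns)]   # E_i(P)
        Pk = [Q[i] + A[i].T @ EP[i] @ A[i]
              - A[i].T @ EP[i] @ B[i] @ np.linalg.solve(
                    R[i] + B[i].T @ EP[i] @ B[i], B[i].T @ EP[i] @ A[i])
              for i in range(Ns)]
    EP = [sum(P[i, j] * Pk[j] for j in range(Ns)) for i in range(Ns)]
    K_star = [np.linalg.solve(R[i] + B[i].T @ EP[i] @ B[i], B[i].T @ EP[i] @ A[i])
              for i in range(Ns)]
    return Pk, K_star
```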

2.3 Problem Setup

In this section we restate the policy optimization reformulation of the MJLS problem, as was done in [1], and elaborate on its differences from our model-free setting. We also introduce a few additional operators which are useful for proving results used in later sections.

Policy optimization problem for model-based MJLS:

minimize: the cost C(K) given in (3)
subject to: (a) the state dynamics given in (1);
(b) the control actions given in (4);
(c) the transition probabilities given in (2);
(d) K stabilizes the system (1) in the mean square sense.

Let K denote the feasible set of stabilizing policies. We know that for a given K, the resulting closed-loop MJLS (5) is mean square stable (MSS) if for any initial condition x_0 ∈ R^d and ω(0) ∈ Ω we have E[x_t x_t^T] → 0 as t → ∞ [3]. From MSS we directly arrive at the conclusion that C(K) is finite if and only if K stabilizes the closed-loop dynamics in the mean square sense, i.e. K ∈ K.

It is useful to see how cost gradients are calculated in the model-based setting, to appreciate why we cannot use similar methods and instead resort to estimating gradients using the zeroth-order optimization methods [4] shown later. For K ∈ K, the cost C(K) is finite and differentiable, and we can rewrite the cost (3) as:

C(K) = E_{x_0∼D} [ x_0^T ( Σ_{i=1}^{Ns} π_i P_i^K ) x_0 ],   (8)

where π is the initial distribution and {P_i^K}_{i∈Ω} are the solution to the coupled Lyapunov equations

P_i^K = Q_i + K_i^T R_i K_i + (A_i − B_i K_i)^T E_i(P^K) (A_i − B_i K_i),   (9)

where E_i(P^K) = Σ_{j=1}^{Ns} p_ij P_j^K as defined earlier. We also define a new variable X(t) = (X_1(t), ..., X_{Ns}(t)), which we will use throughout the paper, such that X_i(t) := E[x_t x_t^T 1{ω(t)=i}]; owing to (5), this matrix satisfies the recursion

X_j(t+1) = Σ_{i=1}^{Ns} p_ij (A_i − B_i K_i) X_i(t) (A_i − B_i K_i)^T,   (10)

with X_i(0) = π_i E_{x_0∼D}[x_0 x_0^T] for all i ∈ Ω being the value at t = 0. Now, from Lemma 1 in [1], we know that for K ∈ K the gradient of the cost function (3) with respect to the policy K is

∇C(K) = 2 [ L_1(K), L_2(K), ..., L_{Ns}(K) ] χ_K,   (11)

where L_i(K) = (R_i + B_i^T E_i(P^K) B_i) K_i − B_i^T E_i(P^K) A_i and χ_K = diag( Σ_{t=0}^∞ X_1(t), ..., Σ_{t=0}^∞ X_{Ns}(t) ).

One can see that empirical estimates of the gradient cannot be obtained from (11), since it requires information about the model; this is partly what makes the model-free problem challenging.
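For reference, the model-based quantities in (9)-(11) can be computed directly when A, B, Q, R and P are available. The sketch below is our own illustration, with truncated iterations standing in for the infinite sums and assuming K is mean square stabilizing so that both iterations converge; it makes the contrast with the model-free estimates of the next section explicit.

```python
import numpy as np

def exact_gradient(A, B, Q, R, K, P, X0, iters=500):
    """Model-based gradient (11): solve the coupled Lyapunov equations (9) for P^K,
    accumulate X_i(t) via the recursion (10), and form grad_i = 2 L_i(K) sum_t X_i(t)."""
    Ns = len(A)
    Gam = [A[i] - B[i] @ K[i] for i in range(Ns)]
    PK = [np.copy(Q[i]) for i in range(Ns)]
    for _ in range(iters):                                     # iterate (9)
        EP = [sum(P[i, j] * PK[j] for j in range(Ns)) for i in range(Ns)]
        PK = [Q[i] + K[i].T @ R[i] @ K[i] + Gam[i].T @ EP[i] @ Gam[i] for i in range(Ns)]
    EP = [sum(P[i, j] * PK[j] for j in range(Ns)) for i in range(Ns)]
    X = [x.copy() for x in X0]                                 # X_i(0) = pi_i E[x0 x0^T]
    chi = [x.copy() for x in X0]
    for _ in range(iters):                                     # iterate (10) and accumulate
        X = [sum(P[i, j] * Gam[i] @ X[i] @ Gam[i].T for i in range(Ns)) for j in range(Ns)]
        chi = [chi[i] + X[i] for i in range(Ns)]
    L = [(R[i] + B[i].T @ EP[i] @ B[i]) @ K[i] - B[i].T @ EP[i] @ A[i] for i in range(Ns)]
    return [2 * L[i] @ chi[i] for i in range(Ns)]
```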

Policy optimization problem for model-free MJLS: In the model-free MJLS setting we only have black-box access to the system, meaning that the state x_t still evolves according to (1) and the cost is still accrued according to (3); we simply do not know the system matrices {A_i, B_i} or the cost matrices {Q_i, R_i}. Formally, we have:

minimize: the cost C(K) given in (3)
subject to: (a) {A_i}_{i∈Ω}, {B_i}_{i∈Ω} from the state dynamics (1) are not known;
(b) {Q_i}_{i∈Ω}, {R_i}_{i∈Ω} from the cost function (3) are not known;
(c) the control actions are given in (4);
(d) K stabilizes the system (1) in the mean square sense;
(e) for the transition probabilities given in (2) we consider two cases, one where p_ij is given and one where it is not.

As seen from the constraints above, we have to devise empirical estimates of the gradient and other relevant parameters without knowledge of the model and cost parameters; we elaborate on this in the following sections. We also need to make sure that our policy iterates, which come from empirical estimates, do not wander into the region of instability while converging towards the optimal policy. We will build on Lemma 4 in [1], which shows that a policy update obtained from a stabilizing policy K using the natural gradient with an appropriate step size η is also stabilizing, to show the stability of our policy updates based on empirical estimates.

3 The Algorithm

In this section we present our gradient estimation algorithm and discuss the intermediary results required to achieve convergence. For data-driven (or model-free) approaches in Reinforcement Learning, or control in general, a standard route is to estimate the terms of interest using simulations [24], [23] and then show that if the simulations are run for a sufficient amount of time, the estimated term resembles the actual term to a large extent, using some sort of concentration or other inequality result. We elaborate on these points further in the following subsections.

3.1 Description of Algorithm 1

We know from [1] that for the natural policy gradient method, the policy update rule is given by

K_{n+1} = K_n − η ∇C(K_n) χ_{K_n}^{-1}.   (12)

From (11) we know that the cost gradient ∇C(K_n) depends on the system and cost matrices {A_i, B_i, Q_i, R_i}_{i∈Ω}, and thus we cannot directly form empirical estimates of (11). To that end we use zeroth-order optimization (Lemma 8), also present in [4], wherein the cost-function gradient can be estimated using appropriate cost-function values only.

From Algorithm 1 we essentially want estimates of two quantities: (a) the gradient ∇C(K), and (b) the state correlation matrix χ_K; these estimates are used during each policy iteration (line 22 of Algorithm 1).

In the initial part of the algorithm, depending on whether or not the transition probability matrix is given, we either use the provided transition probability matrix or estimate it by sampling a Markov chain; the appropriate P_inp is then used for rolling out simulated trajectories.

As can be seen in Algorithm 1, we input the current policy estimate K; the number of trajectories m, which is the number of parallel trajectories we average over; the roll-out length ℓ, which is the time horizon for each of the m trajectories; the smoothing parameter r, which is the Frobenius norm of the random matrices {U_i}_{i∈[m]} used for zeroth-order optimization; and the state dimension d. For each trajectory i, as given in Algorithm 1, based on the operating mode ω(j) we accumulate the cost c_j into the appropriate Ĉ_{i,ω(j)} for each time instance j, and also update the correlation matrix X̂ accordingly. At the end of all m trajectories we obtain the empirical ∇̂C(K) and χ̂_K to be used in the next policy iteration (12) with a proper step size η. This process continues until convergence; we validate the choice of our empirical estimates shortly.


Algorithm 1 Model-Free Policy Gradient (and Natural Policy Gradient) Estimation for MJLS

1: if the probability transition matrix P = [p_ij] is known then
2:   set n = 0 and P_inp = P
3: else
4:   let
     n_1 = (1/(ε² π_*)) max{ d, ln(1/(ε δ_p)) },   n_2 = (1/(γ_ps π_*)) ln( d ‖µ/π‖_{2,π} / δ_p )
5:   then n = c · max{n_1, n_2}
6:   let ω(1), ..., ω(n) ∼ P be the mode values sampled from the Markov chain, and
     N_i = Σ_{t=1}^{n−1} 1{ω(t) = i},   N_ij = Σ_{t=1}^{n−1} 1{ω(t) = i, ω(t+1) = j}, so that
7:   P_inp = [p̂_ij] with p̂_ij = N_ij / N_i when N_i ≠ 0, and p̂_ij = 1/d when N_i = 0
8: end if
9: Input: K_0, P_inp, ∇̂C(K_0) = [0]_ij, number of trajectories m, roll-out length ℓ, smoothing parameter r, dimension d, step size η, error tolerance "error"
10: for t = n + 1 to ∞ do
11:   for i = 1, ..., m do
12:     draw U_i uniformly at random over matrices whose Frobenius norm is r
13:     choose a starting state x_0 ∼ D
14:     for j = 1, ..., ℓ do
15:       choose K from K_{t−n}, with transition probabilities [p_inp]_ij and initial distribution ρ_j
16:       sample a perturbed policy K̂_j = K + U_i
17:       simulate K̂_j and accumulate the empirical estimates
          Ĉ_{i,ω(j)} += U_i c_j / r²,   X̂_{i,ω(j)} += x_j x_j^T,
        where c_j and x_j are the cost and state on this trajectory
18:     end for
19:   end for
20:   form the (biased) estimates
      ∇̂C(K) = (1/m) Σ_{i=1}^m d Ĉ_i,   χ̂_K = (1/m) Σ_{i=1}^m diag( X̂_{i,1}, ..., X̂_{i,Ns} )
21:   we thus update the policy estimate as
22:   K_{t−n+1} = K_{t−n} − η ∇̂C(K) χ̂_K^{-1}
23:   if ‖K_{t−n+1} − K_{t−n}‖ < error then
24:     return K_{t−n+1}
25:   end if
26: end for
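The core of Algorithm 1 (lines 11-20) can be condensed as in the sketch below. This is our own illustrative condensation: `simulate` is an assumed helper that rolls out a perturbed policy for ℓ steps and yields (mode, per-step cost, state) triples, in the spirit of the roll-out sketch of Section 2.2, and the scaling dimension d is taken here to be the total number of perturbed policy entries.

```python
import numpy as np

def estimate_grad_and_corr(simulate, K, m, ell, r, Ns, rng):
    """Average m perturbed roll-outs of length ell to form the zeroth-order
    gradient estimate and the per-mode state correlation estimate of Algorithm 1."""
    d = sum(Ki.size for Ki in K)                       # dimension used in the d/r^2 scaling
    n = K[0].shape[1]                                  # state dimension
    grad = [np.zeros_like(Ki) for Ki in K]
    corr = [np.zeros((n, n)) for _ in range(Ns)]
    for _ in range(m):
        U = [rng.standard_normal(Ki.shape) for Ki in K]
        scale = r / np.sqrt(sum((Ui ** 2).sum() for Ui in U))
        U = [scale * Ui for Ui in U]                   # Frobenius norm exactly r
        Kp = [K[i] + U[i] for i in range(Ns)]          # perturbed policy K + U_i
        for mode, cost, x in simulate(Kp, ell):
            grad[mode] += (d / r ** 2) * cost * U[mode]
            corr[mode] += np.outer(x, x)
    return [g / m for g in grad], [c / m for c in corr]
```

The natural gradient step of line 22 then uses these estimates as K_i ← K_i − η · grad_i · corr_i^{-1}.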


3.2 Estimating the transition probability matrix

The estimator for the transition probability matrix of a Markov chain, from its observed samples ω(1), ..., ω(n) ∼ P, the number of visits to state i, N_i = Σ_{t=1}^{n−1} 1{ω(t) = i}, and the number of i→j transitions, N_ij = Σ_{t=1}^{n−1} 1{ω(t) = i, ω(t+1) = j}, can be written as

P_inp = [p̂_ij],   (13)

with p̂_ij = N_ij / N_i when N_i ≠ 0, and p̂_ij = 1/d when N_i = 0.
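A direct implementation of the estimator (13) is straightforward; in the sketch below (the names are ours) d = |Ω| = Ns, and unvisited states fall back to the uniform row 1/Ns as in Algorithm 1.

```python
import numpy as np

def estimate_transition_matrix(omega, Ns):
    """Empirical kernel (13): p_hat_ij = N_ij / N_i, with a uniform row when N_i = 0."""
    N = np.zeros(Ns)
    Nij = np.zeros((Ns, Ns))
    for a, b in zip(omega[:-1], omega[1:]):
        N[a] += 1
        Nij[a, b] += 1
    P_hat = np.full((Ns, Ns), 1.0 / Ns)
    visited = N > 0
    P_hat[visited] = Nij[visited] / N[visited, None]
    return P_hat
```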

While the estimator is fairly simple to write down, proving tight bounds and sample complexity guarantees for it is not straightforward, because of the presence of random variables in both the numerator and the denominator of the transition probability estimate

p̂_ij = ( Σ_{t=1}^{n−1} 1{ω(t) = i, ω(t+1) = j} ) / ( Σ_{t=1}^{n−1} 1{ω(t) = i} ).

[28] circumvent this complexity in their Theorem 3.1 by first fixing the denominator and using Marton coupling arguments to bound the numerator, and then bounding the probability of the denominator lying in some region. We restate Theorem 3.1 from [28] below.

Theorem 1 (Sample complexity upper bound w.r.t. ‖·‖_∞ when |Ω| < ∞). Let ε ∈ (0, 2), δ_p ∈ (0, 1), and let X = (X_1, ..., X_m) ∼ (M, µ), where M is ergodic with stationary distribution π. Then an estimator M̂ : Ω^m → M_Ω exists such that whenever

m ≥ c · max{ (1/(ε² π_*)) max{ d, ln(1/(ε δ_p)) }, (1/(γ_ps π_*)) ln( d ‖µ/π‖_{2,π} / δ_p ) },

we have, with probability at least 1 − δ_p,

‖M − M̂‖_∞ < ε,

where c is a universal constant, d = |Ω|, γ_ps is the pseudo-spectral gap, π_* is the minimum stationary probability, and ‖µ/π‖²_{2,π} ≤ 1/π_*. These quantities are defined as:

• γ_ps = max_{k≥1} γ( (M*)^k M^k ) / k
• π_* = min_{i∈Ω} π(i)
• ‖µ/π‖²_{2,π} = Σ_{i∈Ω} µ(i)² / π(i) ∈ [1, ∞]

3.3 Approximation and Perturbation Analysis

In this section we introduce some intermediary results which will later be required for proving convergence. More specifically, we convey two things: (a) the cost and the correlation matrix can be well approximated using finite time horizons, and (b) perturbing the policy slightly does not change the cost value or the cost gradient by much.

Lemma 1. For any K with finite C(K), let C_ℓ(K) = E_{ω(0)∼π, x_0∼D} [ Σ_{t=0}^{ℓ−1} x_t^T Q_{ω(t)} x_t + u_t^T R_{ω(t)} u_t ]. If

ℓ ≥ d · C²(K) · ( Σ_{i=1}^{Ns} ‖Q_i‖ + ‖R_i‖ ‖K_i‖² ) / ( ε µ Λ²_min(Q) ),

then C(K) − C_ℓ(K) ≤ ε.

For the proof, refer to the Appendix. From Lemma 1 we can surmise that the roll-out length ℓ has a lower bound above which the accrued cost is ε-close to the cost obtained when the policy is rolled out to infinity. This helps us set an appropriate initialization for the roll-out length ℓ, as we will see later. Next, we introduce an analogous result pertaining to the state correlation matrix χ.

Lemma 2. For any K with finite C(K), let χ_K^{(ℓ)} = diag( Σ_{t=0}^ℓ X_1(t), ..., Σ_{t=0}^ℓ X_{Ns}(t) ), where X_i(t) := E[x_t x_t^T 1{ω(t)=i}]. Now if

ℓ ≥ d · C²(K) / ( ε µ Λ²_min(Q) ),

then ‖χ_K^{(ℓ)} − χ_K‖ ≤ ε.

Akin to Lemma 1, this gives an idea of the roll-out length required so that the correlation matrices can be well approximated without rolling the simulation out to infinity, thus saving time.

Next, we elaborate on perturbation results which are conceptually necessary for proving convergence. It was argued in [1] that the almost-smoothness condition of [26], which expresses ∆C(K) in terms of ∆K, cannot be trivially extended to the MJLS case, and a novel Lyapunov argument was provided to that end. Hinging on that, we now introduce lemmas catering to policy perturbation.

Lemma 3. Suppose K' is such that

‖K' − K‖ ≤ min( Λ_min(Q) µ / ( 4 C(K) Σ_{i=1}^{Ns} ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ), Σ_{i=1}^{Ns} ‖K_i‖ ).

Then

|C(K') − C(K)| ≤ c_diff ‖K − K'‖,

where c_diff is a function of the system variables, i.e.

c_diff = f( ‖A‖, ‖B‖, ‖Q‖, ‖R‖, ‖K‖, C(K)/Λ_min(Q) ).

Lemma 3 ensures that the region around a stable policy K is also stable, something we use when concluding convergence in the coming sections; the proof can be found in the Appendix. The lemma can also be thought of as a sanity check for policy iterations, since we perturb the policy along a direction that decreases the cost.

Lemma 4. Suppose K' is such that

‖K' − K‖ ≤ min( Λ_min(Q) µ / ( 4 C(K) Σ_{i=1}^{Ns} ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ), Σ_{i=1}^{Ns} ‖K_i‖ ).

Then there is a polynomial g_diff of the system variables such that

‖∇C(K') − ∇C(K)‖ ≤ g_diff ‖K' − K‖.

Lemma 4 intuitively says that for a stable policy K the surrounding region will not blow up the cost, since the difference in gradients is bounded; this is something we need when showing that the gradient estimates ∇̂C(K), based on zeroth-order optimization [4], are close to the actual gradients. Now that we have some idea of how the landscape of C(K) behaves as we move slightly away from stabilizing policies, we proceed to our main convergence results.

3.4 Convergence for Algorithm 1

In this section we present the policy convergence result for Algorithm 1 when we use gradient descent and the natural policy gradient [27], and prove that we can converge to the optimal policy with both the sample and time requirements being polynomial in the relevant system variables. The exact proof, which uses several intermediary results, is deferred to the Appendix, but we give an outline of it below.

Theorem 2 (Main Result). Suppose C(K_0) is finite and µ > 0, and that the gradient ∇C(K_t) and the state correlation matrix χ_K are estimated via Algorithm 1. Then the following holds for the two policy update rules under the corresponding conditions.

For gradient descent: with the update rule

K_{t+1} = K_t − η ∇C(K_t),

step size η ≤ 1/etagrad, smoothing radius r = 1/f_{GD,r}(1/ε), and m ≥ f_{GD,sample}(d, 1/ε, L²/µ) samples, with the roll-outs truncated to f_{GD,ℓ}(d, 1/ε) steps, then with high probability (at least 1 − exp(−d)), in T iterations where

T > ( ‖X_{K*}‖ / (µ Λ_min(R)) ) · etagrad · log( 2 (C(K_0) − C(K*)) / ε ),

with etagrad defined in equation (25) in the Appendix, the gradient descent update rule satisfies

C(K_T) − C(K*) ≤ ε.

For natural gradient descent: with the update rule

K_{t+1} = K_t − η ∇C(K_t) (χ_{K_t})^{-1},

step size

η ≤ 1 / ( 2( ‖R‖ + ‖B‖² C(K_0)/µ ) ),

smoothing radius r = 1/f_{NGD,r}(1/ε), and m ≥ f_{NGD,sample}(d, 1/ε, L²/µ) samples, with the roll-outs truncated to f_{NGD,ℓ}(d, 1/ε) steps, then with high probability (at least 1 − exp(−d)), in T iterations where

T > ( ‖X_{K*}‖ / µ ) ( ‖R‖/Λ_min(R) + ‖B‖² C(K_0)/(µ Λ_min(R)) ) log( 2 (C(K_0) − C(K*)) / ε ),

the natural gradient descent update rule satisfies

C(K_T) − C(K*) ≤ ε.

Proof. We obtain the proof by combining the results of Theorem 3 and Theorem 4 from the Appendix.

Proofs of Theorem 3 and Theorem 4, which together yield Theorem 2, can be found in the Appendix below. The schema of the proofs involves intermediary steps showing that, with a large enough roll-out length ℓ, for a given policy the cost C(K) and the state correlation matrix χ_K are ε-close to their values when the roll-out length is infinite. Next, we show that the estimated ∇̂C(K) and χ̂_K can be made ε-close to the exact gradient and state correlation matrix with enough samples; this is then used to prove a one-step bound, which is extended by induction to yield the main result above.
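For completeness, the two update rules of Theorem 2 differ only in whether the estimated gradient is right-multiplied by the inverse of the estimated correlation matrix; a minimal sketch (our own, assuming the per-mode estimates grad and corr produced by the estimation sketch of Section 3.1) is:

```python
import numpy as np

def policy_update(K, grad, corr, eta, natural=True):
    """One policy-iteration step of Theorem 2: gradient descent, or natural
    gradient descent which right-multiplies by the inverse correlation matrix."""
    if natural:
        return [K[i] - eta * grad[i] @ np.linalg.inv(corr[i]) for i in range(len(K))]
    return [K[i] - eta * grad[i] for i in range(len(K))]
```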


4 Experiments

In this section we describe our experiments. We implemented model-free gradient descent and natural gradient descent and compared their convergence with exact (model-based) gradient descent and natural gradient descent.

We performed the experiments for 2, 4 and 6 operating modes, with the state and input dimensions equal to the number of modes.

We chose A and B at random and scaled the matrices so that λ_max(A) ≤ 1, to keep the system stable. We also ensured that C(K_0) is finite, with K_0 = 0 for all modes. The transition probabilities of the Markov chain were obtained using NumPy's Dirichlet sampler.
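A test system of the kind described here can be generated as in the sketch below; the identity cost matrices are our own assumption for illustration, and the spectral scaling mirrors the λ_max(A) ≤ 1 normalization mentioned above.

```python
import numpy as np

def random_mjls(Ns, d, k, rng):
    """Random MJLS in the spirit of Section 4: spectrally scaled A_i, random B_i,
    identity costs (an assumption of this sketch), and Dirichlet transition rows."""
    A = []
    for _ in range(Ns):
        M = rng.standard_normal((d, d))
        A.append(M / max(1.0, np.max(np.abs(np.linalg.eigvals(M)))))
    B = [rng.standard_normal((d, k)) for _ in range(Ns)]
    Q = [np.eye(d) for _ in range(Ns)]
    R = [np.eye(k) for _ in range(Ns)]
    P = rng.dirichlet(np.ones(Ns), size=Ns)    # each row is a probability vector
    return A, B, Q, R, P
```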

The results are shown below. The figures plot the variation of the normalized cost difference against the number of policy update iterations.

Figure 1: Results for 2 states, 2 inputs, 2 modes

Remarks

• From the plots we can see that model-free natural gradient descent performs approximately on par with model-based gradient descent (where A and B are known).

• As the number of modes increases, the selection of hyper-parameters becomes more difficult, and with poorly chosen hyper-parameters the algorithm may not even converge.

• As the number of modes increases, the gap between model-based NGD and model-free NGD increases.

• The roll-out length, the number of trajectories, and the value of r (the smoothing parameter) should be chosen very carefully for convergence.


Figure 2: Results for 4 states, 4 inputs, 4 modes

Figure 3: Results for 6 states, 6 inputs, 6 modes

• We ran every experiment 15 times and averaged the results to show smooth convergence.

5 Conclusion

We have thus shown policy convergence both theoretically and in simulation, and thereby approached the open problem mentioned in [1]. We envision our work providing a basis for real-world applications that can be modelled using MJLS where the system parameters are unknown and the optimal policy must be arrived at by gathering indirect information. Furthermore, it remains to be explored how policy convergence would occur in cases where we do not observe the states directly, or where the system changes continuously instead of in discrete time steps. We leave these questions for further study.

6 Appendix

Lemma 1. For any K with finite C(K), let C_ℓ(K) = E_{ω(0)∼π, x_0∼D} [ Σ_{t=0}^{ℓ−1} x_t^T Q_{ω(t)} x_t + u_t^T R_{ω(t)} u_t ]. If

ℓ ≥ d · C²(K) · ( Σ_{i=1}^{Ns} ‖Q_i‖ + ‖R_i‖ ‖K_i‖² ) / ( ε µ Λ²_min(Q) ),

then C(K) − C_ℓ(K) ≤ ε.

Proof. Define P^K = (P_1^K, ..., P_{Ns}^K), where P_i^K is defined as in equation (9). The cost function can then be written as

C(K) = E_{x_0∼D, ω(0)∼π} [ x_0^T ( Σ_{i∈Ω} π_i P_i^K ) x_0 ].

Hence C(K) can be written as

C(K) = ⟨P^K, X(0)⟩ = ⟨Q + K^T R K, T_K(X(0))⟩ = ⟨Q + K^T R K, χ_K⟩ = Σ_{i=1}^{Ns} tr( (Q_i + K_i^T R_i K_i) χ_{K,i} ) ≥ Λ_min(Q) trace(χ_K).

So

trace(χ_K) ≤ C(K) / Λ_min(Q).

Now consider

Σ_{i=0}^{ℓ−1} trace( F_K^i(X(0)) ) = trace( Σ_{i=0}^{ℓ−1} F_K^i(X(0)) ) ≤ trace( Σ_{i=0}^∞ F_K^i(X(0)) ) = trace( T_K(X(0)) ) = trace(χ_K) ≤ d C(K) / Λ_min(Q).

Since all traces are nonnegative, there must exist j ≤ ℓ such that

trace( F_K^j(X(0)) ) ≤ d · C(K) / ( Λ_min(Q) ℓ ).

Also,

χ_K ⪯ C(K) · X(0) / ( µ Λ_min(Q) ),

and

trace( F_K^j(χ_K) ) ≤ C(K) · trace( F_K^j(X(0)) ) / ( µ Λ_min(Q) ) ≤ d · C²(K) / ( ℓ µ Λ²_min(Q) ).

So, as long as ℓ ≥ d · C²(K) / ( ε µ Λ²_min(Q) ),

‖χ_K^{(ℓ)} − χ_K‖ ≤ ‖χ_K^{(j)} − χ_K‖ = ‖F_K^j(χ_K)‖ ≤ trace( F_K^j(χ_K) ) ≤ ε.

Lemma 2. For any K with finite C(K), let χ_K^{(ℓ)} = diag( Σ_{t=0}^ℓ X_1(t), ..., Σ_{t=0}^ℓ X_{Ns}(t) ), where X_i(t) := E[x_t x_t^T 1{ω(t)=i}]. Now if

ℓ ≥ d · C²(K) / ( ε µ Λ²_min(Q) ),

then ‖χ_K^{(ℓ)} − χ_K‖ ≤ ε.

Proof. We know that

C(K) = ⟨χ_K, Q + K^T R K⟩,   C_ℓ(K) = ⟨χ_K^{(ℓ)}, Q + K^T R K⟩.

Now,

C(K) − C_ℓ(K) ≤ trace( χ_K − χ_K^{(ℓ)} ) ( ‖Q‖ + ‖R‖ ‖K‖² ).

Therefore, if

ℓ ≥ d · C²(K) ( ‖Q‖ + ‖R‖ ‖K‖² ) / ( ε µ Λ²_min(Q) ),

then

trace( χ_K − χ_K^{(ℓ)} ) ≤ ε / ( ‖Q‖ + ‖R‖ ‖K‖² ),

where the latter follows from Lemma 1. Hence

C(K) − C_ℓ(K) ≤ ε.

Lemma 3. Suppose K' is such that

‖K' − K‖ ≤ min( Λ_min(Q) µ / ( 4 C(K) Σ_{i=1}^{Ns} ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ), Σ_{i=1}^{Ns} ‖K_i‖ ).

Then

|C(K') − C(K)| ≤ c_diff ‖K − K'‖,

where c_diff is a function of the system variables, i.e.

c_diff = f( ‖A‖, ‖B‖, ‖Q‖, ‖R‖, ‖K‖, C(K)/Λ_min(Q) ).

Proof. Let L_i^K = (R_i + B_i^T E_i(P^K) B_i) K_i − B_i^T E_i(P^K) A_i, and ψ = R + B^T E(P^K) B. We can then write

C(K') − C(K) = −2 ⟨∆K^T L^K, χ_{K'}⟩ + ⟨∆K^T ψ ∆K, χ_{K'}⟩.   (14)

Expanding equation (14) we get

C(K') − C(K) = 2 ⟨∆K^T [ B_i^T E_i(P^K) A_i − (R_i + B_i^T E_i(P^K) B_i) K_i ], χ_{K'}⟩ + ⟨∆K^T (R + B^T E(P^K) B) ∆K, χ_{K'}⟩.

We know that

⟨A, B⟩ = Tr(A^T B),   Tr(ABC) = Tr(CAB),   Tr(ABC) ≤ Tr(AB) Tr(C) ≤ Tr(AB) · d · ‖C‖   (d: dimension of C).

Let ∆C(K) = C(K') − C(K); therefore we can write

∆C(K) = 2 Σ_{i=1}^{Ns} Tr( [ B_i^T E_i(P^K) A_i − (R_i + B_i^T E_i(P^K) B_i) K_i ]^T ∆K_i X_i^{K'} ) + Σ_{i=1}^{Ns} Tr( ∆K_i^T (R_i + B_i^T E_i(P^K) B_i)^T ∆K_i X_i^{K'} )
≤ 2 Σ_{i=1}^{Ns} ‖∆K_i‖ · ‖B_i^T E_i(P^K) A_i − (R_i + B_i^T E_i(P^K) B_i) K_i‖ · Tr( X_i^{K'} ) + Σ_{i=1}^{Ns} ‖∆K_i‖² · ‖B_i^T E_i(P^K) B_i + R_i‖ · Tr( X_i^{K'} )
≤ 2 ‖∆K‖_max ( ‖B‖_max ‖A‖_max ‖P^K‖_max + ‖R‖_max ‖B‖_max ‖P^K‖_max ‖K‖_max ) ( Σ_{i=1}^{Ns} Tr( X_i^{K'} ) ) + ‖∆K‖²_max ( ‖B‖²_max ‖P^K‖_max + ‖R‖_max ) ( Σ_{i=1}^{Ns} Tr( X_i^{K'} ) ).

We can also write

‖∆K‖²_max = ‖K − K'‖_max × ‖K − K'‖_max ≤ ‖K‖_max ‖K − K'‖_max.

Hence we get

∆C(K) ≤ h¹_diff ‖∆K‖_max ( Σ_{i∈Ω} Tr( X_i^{K'} ) ) + h²_diff ‖∆K‖_max ( Σ_{i=1}^{Ns} Tr( X_i^{K'} ) ),

where h¹_diff, h²_diff ∼ f( ‖B‖, ‖A‖, ‖P^K‖, ‖K‖, ‖R‖ ), and this implies

C(K') − C(K) ≤ ( h¹_diff + h²_diff ) ‖∆K‖_max ( Σ_{i=1}^{Ns} Tr( X_i^{K'} ) ).   (15)

From equation A.3 in [1] we get Σ_{i=1}^{Ns} Tr( X_i^K ) ≤ C(K)/Λ_min(Q). For gradient / natural gradient descent updates K → K' we know that C(K') < C(K). Thus we get

Σ_{i=1}^{Ns} Tr( X_i^{K'} ) ≤ C(K')/Λ_min(Q) ≤ C(K)/Λ_min(Q).   (16)

Using equations (15) and (16) we get

|C(K') − C(K)| ≤ h⁴_diff ‖K − K'‖_max,

where h⁴_diff ∼ f( ‖A‖, ‖B‖, ‖Q‖, ‖R‖, ‖K‖, C(K)/Λ_min(Q) ), or explicitly

h⁴_diff = ( C(K)/Λ_min(Q) ) · ( 2 ‖B‖_max (C(K)/µ) ( ‖A‖_max + ‖R‖_max ‖K‖_max ) + ‖B‖²_max (C(K)/µ) + ‖R‖_max ),

using a side result from Theorem 3 in [1].

Lemma 4. Suppose K' is such that

‖K' − K‖ ≤ min( Λ_min(Q) µ / ( 4 C(K) Σ_{i=1}^{Ns} ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ), Σ_{i=1}^{Ns} ‖K_i‖ ).

Then there is a polynomial g_diff of the system variables such that

‖∇C(K') − ∇C(K)‖ ≤ g_diff ‖K' − K‖.

Proof. We know that ∇C(K) = 2 L^K χ_K, where L_i^K = (R_i + B_i^T E_i(P^K) B_i) K_i − B_i^T E_i(P^K) A_i. We can write

∇C(K') − ∇C(K) = 2 ( L^{K'} χ_{K'} − L^K χ_K ) = 2 ( L^{K'} χ_{K'} − L^K χ_{K'} + L^K χ_{K'} − L^K χ_K ) = 2 ( (L^{K'} − L^K) χ_{K'} + L^K (χ_{K'} − χ_K) ).

We now have four terms to bound:

• L^{K'} − L^K
• χ_{K'}
• L^K
• χ_{K'} − χ_K

From equation A.3 in the proof of Lemma 3 in [1],

Σ_{i=1}^{Ns} tr( X_i^{K'} ) ≤ C(K')/Λ_min(Q) ≤ C(K)/Λ_min(Q)  ⟹  ‖χ_{K'}‖ ≤ C(K)/Λ_min(Q).   (17)

Bounding L^K is relatively straightforward. We know that

C(K) = E_{x_0∼D}[ tr( ( Σ_{i=1}^{Ns} π_i P_i^K ) x_0 x_0^T ) ]  ⟹  ‖P^K‖_max ≤ C(K)/µ.

Using the above inequality we can bound L^K as

‖L^K‖ = Σ_{i=1}^{Ns} ‖(R_i + B_i^T E_i(P^K) B_i) K_i − B_i^T E_i(P^K) A_i‖ ≤ Σ_{i=1}^{Ns} ( ‖(R_i + B_i^T E_i(P^K) B_i) K_i‖ + ‖B_i^T E_i(P^K) A_i‖ ) ≤ Ns ( ( ‖R‖_max + ‖B‖²_max C(K)/µ ) ‖K‖_max + ‖B‖_max ‖A‖_max C(K)/µ ).

Thus, using the results of Lemmas 5 and 6 together with the above bounds on χ_{K'} and L^K, we get that ‖∇C(K') − ∇C(K)‖ ≤ g_diff ‖K' − K‖, where g_diff ∼ f( ‖A‖, ‖B‖, ‖Q‖, ‖R‖, ‖K‖ ).

Lemma 5. Suppose K' is such that

‖K' − K‖ ≤ min( Λ_min(Q) µ / ( 4 C(K) Σ_{i=1}^{Ns} ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ), Σ_{i=1}^{Ns} ‖K_i‖ ).

Then there is a polynomial g_{χ,diff} such that

‖χ_{K'} − χ_K‖ ≤ g_{χ,diff} ‖K' − K‖.

Proof. Let x_{i,t} denote the state evolving in time when the system is an LQR, i.e. p_ij = 1 if i = j and 0 otherwise, with dynamics A_i, B_i, Q_i, R_i instead of the MJLS, and with the same initial state x_0. Recall

X_i^K = Σ_{t=0}^∞ X_i(t) = Σ_{t=0}^∞ E[ x_t x_t^T 1{ω(t)=i} ],   ‖χ_{K'} − χ_K‖ = Σ_{i=1}^{Ns} ‖X_i^{K'} − X_i^K‖.

We can write

χ_{K',i} − χ_{K,i} = Σ_{t=0}^∞ E[ ( x_t^{K'} (x_t^{K'})^T − x_t^K (x_t^K)^T ) 1{ω(t)=i} ] ⪯ max_i Σ_{t=0}^∞ E[ x_{i,t}^{K'} (x_{i,t}^{K'})^T − x_{i,t}^K (x_{i,t}^K)^T ].

Now, we know from Lemma 16 in [26] that if ‖K' − K‖ ≤ σ_min(Q) µ / ( 4 C(K) ‖B‖ (‖A − BK‖ + 1) ), then it holds that

‖Σ_{K'} − Σ_K‖ ≤ 4 ( C(K)/σ_min(Q) )² ‖B‖ (‖A − BK‖ + 1) ‖K − K'‖ / µ.

Using the above result we can write

‖X_i^{K'} − X_i^K‖ ≤ max_i 4 ( C_LQR(K)_i / σ_min(Q_i) )² ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ‖K_i − K'_i‖ / µ,

where C_LQR(K)_i is the cost that would be accrued were the system an LQR with dynamics A_i, B_i, Q_i, R_i instead of the MJLS, with the same initial state x_0. We can thus write

‖χ_{K'} − χ_K‖ = Σ_{i=1}^{Ns} ‖X_i^{K'} − X_i^K‖ ≤ Ns² · max_i 4 ( C_LQR(K)_i / σ_min(Q_i) )² ‖B_i‖ (‖A_i − B_i K_i‖ + 1) / µ · ‖K − K'‖.


Lemma 6. Suppose K' is such that

‖K' − K‖ ≤ min( Λ_min(Q) µ / ( 4 C(K) Σ_{i=1}^{Ns} ‖B_i‖ (‖A_i − B_i K_i‖ + 1) ), Σ_{i=1}^{Ns} ‖K_i‖ ).

Then there is a polynomial g_{L,diff} such that

‖L^{K'} − L^K‖ ≤ g_{L,diff} ‖K' − K‖.

Proof. We know that

L_i^K = (R_i + B_i^T E_i(P^K) B_i) K_i − B_i^T E_i(P^K) A_i,

and also ‖L^{K'} − L^K‖ = Σ_{i=1}^{Ns} ‖L_i^{K'} − L_i^K‖. Thus we get

L_i^{K'} − L_i^K = R_i (K'_i − K_i) − B_i^T ( E_i(P^{K'} − P^K) ) A_i + B_i^T ( E_i(P^{K'} − P^K) ) B_i K'_i + B_i^T E_i(P^K) B_i (K'_i − K_i).

We thus need to bound E_i(P^{K'} − P^K), which we do next. We know that C(K) = E_{x_0∼D}[ tr( ( Σ_{i=1}^{Ns} π_i P_i^K ) x_0 x_0^T ) ]. Let ∆C(K) = C(K') − C(K); we then get

∆C(K) = E_{x_0∼D}[ tr( ( Σ_{i=1}^{Ns} π_i ( P_i^{K'} − P_i^K ) ) x_0 x_0^T ) ] ≥ tr( Σ_{i=1}^{Ns} π_i ( P_i^{K'} − P_i^K ) ) · σ_min( E_{x_0∼D}( x_0 x_0^T ) ) ≥ tr( Σ_{i=1}^{Ns} ( P_i^{K'} − P_i^K ) ) · min_i(π_i) · σ_min( E_{x_0∼D}( x_0 x_0^T ) )

⟹ tr( Σ_{i=1}^{Ns} ( P_i^{K'} − P_i^K ) ) ≤ |C(K') − C(K)| / µ.

From Lemma 3 we know that |C(K') − C(K)| ≤ h_diff ‖K' − K‖, so

‖ Σ_{i=1}^{Ns} ( P_i^{K'} − P_i^K ) ‖ ≤ h_diff ‖K' − K‖ / µ.

Now,

E_i(P^{K'} − P^K) = Σ_j p_ij ( P_j^{K'} − P_j^K ),   ‖E_i(P^{K'} − P^K)‖ = ‖ Σ_j p_ij ( P_j^{K'} − P_j^K ) ‖,

and hence

‖E_i(P^{K'} − P^K)‖ ≤ h_diff ‖K' − K‖ / µ.   (18)

Substituting equation (18) into the expression for L_i^{K'} − L_i^K above, we get

‖L^K − L^{K'}‖ ≤ ‖R‖_max ∆K + ‖A‖_max ‖B‖_max h_diff ∆K / µ + ‖B‖_max h_diff ∆K ‖K‖ / µ + ‖B‖²_max C(K) ∆K / µ.

Thus ‖L^K − L^{K'}‖ ≤ h_L ∆K, where h_L ∼ f( ‖A‖, ‖B‖, ‖K‖, C(K_0), h_diff ).

In Lemma 7 below we show why it makes sense to use the estimated transition probability matrix when the true one is not known beforehand. Owing to the closeness shown, with high probability, between the estimated gradients and state correlation matrices in the cases where the transition probabilities are known versus estimated, in the convergence proofs we can, for lucidity's sake, use the terms meant for the case when the transition probabilities are known.

Lemma 7. Let ∇̂C(K) and χ̂_K be the estimated gradient and expected state correlation matrix when the transition probabilities are exactly known, and correspondingly let ∇̂C(K)_est and χ̂_{est,K} be their values when the transition probabilities are estimated to a high precision (‖P − P̂‖_∞ < ε_p) with high probability. Then, for small enough ε_p, there exist ε_∇, ε_χ such that ‖∇̂C(K) − ∇̂C(K)_est‖ ≤ ε_∇ and ‖χ̂_K − χ̂_{est,K}‖ ≤ ε_χ with high probability.

Proof. The lemma is proved in two parts below: we first prove the error bound between the expected state correlation matrices χ̂_K and χ̂_{est,K}, and then use that result to prove the bound on the gradients. The proof exploits the recursive nature of the correlation matrix, bounding the difference between the two terms at each time step; this bound on the expected difference is then used to bound the actual difference using Hoeffding's inequality.

Part a:

We know that X_i(t) := E[x_t x_t^T 1{ω(t)=i}], and

X_j(t+1) = Σ_{i=1}^{Ns} p_ij (A_i − B_i K_i) X_i(t) (A_i − B_i K_i)^T.

Now, for a policy K, let X̃_j(t+1) be the corresponding matrix when the estimated transition probabilities [p̂_ij] are used:

X_j(t+1) − X̃_j(t+1) = Σ_{i=1}^{Ns} (A_i − B_i K_i) [ p_ij X_i(t) − p̂_ij X̃_i(t) ] (A_i − B_i K_i)^T.

Now assume that ‖X_i(t) − X̃_i(t)‖ ≤ ε_t w.h.p.; the assumption primarily serves to observe how the bound varies with time t. We can then write

p_ij X_i(t) − p̂_ij X̃_i(t) = (p_ij − p̂_ij) X_i(t) + p̂_ij ( X_i(t) − X̃_i(t) ).

Since we know that χ_K ⪯ C(K) · X(0) / ( µ Λ_min(Q) ), we have ‖X_i(t)‖ ≤ ε_K, where ε_K is an upper bound that is a function of K. Thus we get

‖ (p_ij − p̂_ij) X_i(t) + p̂_ij ( X_i(t) − X̃_i(t) ) ‖ ≤ ε_p ε_K + ε_t
⟹ ‖X_j(t+1) − X̃_j(t+1)‖ ≤ Ns c_K ( ε_p ε_K + ε_t ),   (19)

where c_K = Ns ( ‖A‖_max + ‖B‖_max ‖K‖_max ) is a constant. Now that we have a recursive relationship over the bound, we can start from the bound at t = 1 to get the remaining ones. For t = 1,

X_j(1) − X̃_j(1) ⪯ Σ_{i=1}^{Ns} (A_i − B_i K_i) x_0 x_0^T (A_i − B_i K_i)^T · (p_ij − p̂_ij),
‖X_j(1) − X̃_j(1)‖ ≤ c¹_K ε_p,   (20)

where c¹_K = Ns ( ‖A‖_max + ‖B‖_max ‖K‖_max )² ‖x_0 x_0^T‖ is a constant. Thus the bounds (19) can be determined recursively starting from t = 1 via (20). We thus get

Σ_{t=1}^T ( E[ x_t x_t^T 1{ω(t)=i} ] − Ẽ[ x_t x_t^T 1{ω(t)=i} ] ) ≤ Σ_{t=1}^T ε_t =: ε_{X,T}.   (21)

Now, from the algorithm we know that χ̂_K = (1/m) Σ_{i=1}^m diag( Σ_{j=1}^ℓ x_j x_j^T 1{ω(j)=1}, ..., Σ_{j=1}^ℓ x_j x_j^T 1{ω(j)=Ns} ). Let χ̂_{est,K} be the correlation matrix obtained using the estimated transition probabilities. We thus get

Σ_{i=1}^{Ns} E[ χ̂_{K,i} − χ̂_{est,K,i} ] = Σ_{i=1}^{Ns} Σ_{t=1}^ℓ ( E[ x_t x_t^T ] − Ẽ[ x_t x_t^T ] ) 1{ω(t)=i}.

Now, using Hoeffding's bound, since we average over m trajectories, we know that

P[ ‖χ̂_K − χ̂_{est,K} − E[ χ̂_K − χ̂_{est,K} ]‖ ≥ δ ] ≤ exp(−2mδ²).

Thus, using the above result and (21), we conclude that χ̂_K and χ̂_{est,K} are (δ + ε_{X,T})-close to each other with high probability.

Part b:

Let P_i^K be defined as in equation (9). The cost function can be written as

C(K) = E_{x_0∼D, ω(0)∼π}[ x_0^T ( Σ_{i∈Ω} π_i P_i^K ) x_0 ].

Thus

C(K) = ⟨P^K, X(0)⟩ = ⟨Q + K^T R K, T_K(X(0))⟩ = ⟨Q + K^T R K, χ_K⟩ = Σ_{i=1}^{Ns} tr( (Q_i + K_i^T R_i K_i) χ_{K,i} ).

Let Ĉ_{est,i} be the cost accumulated, as in Algorithm 1, by the i-th trajectory when using the estimated transition probabilities. We can thus write

Ĉ − Ĉ_est = Σ_{i=1}^{Ns} tr( (Q_i + K_i^T R_i K_i) [ χ̂_{K,i} − χ̂_{est,K,i} ] ),
∇̂C(K) − ∇̂C(K)_est = (1/m) Σ_{i=1}^m d ( Ĉ_i − Ĉ_{est,i} ).

Since we have already proven bounds on χ̂_K − χ̂_{est,K} in Part a above, ∇̂C(K)_est can thus trivially be shown to be close to ∇̂C(K) with high probability.

6.1 Analysis of the smoothing parameter and gradient calculation

In this section we see how the value of the smoothing parameter is chosen and how it affects the cost and gradient values. We have to be careful about choosing the smoothing parameter, since the cost C(K + r) might not be defined in many cases.

In the next lemma we use the standard technique (e.g. in [4]) to show that the gradient can be estimated with only an oracle for the function value. We denote by S_r the uniform distribution over points with norm r, and by B_r the uniform distribution over all points with norm at most r (the entire ball). We work with the smoothed cost

C_r(K) = E_{U∼B_r}[ C(K + U) ],   ∇C_r(K) = ( ∇C_r(K)_1, ..., ∇C_r(K)_{Ns} ),

and calculate its gradient using Lemma 2.1 in [4].

Lemma 8. ∇C_r(K) = (d/r²) E_{U∼S_r}[ C(K + U) U ]. Here C(K + U) U = ( C(K_1 + U) U, ..., C(K_{Ns} + U) U ).

Proof. Here we use the same U for the different modes rather than separate U_i. C(K) can be seen as a function of Ns variables, and we estimate the gradient with respect to every variable. The proof is similar to that of Lemma 2.1 in [4].

By Stokes' formula,

∇ ∫_{B_r} C(K_1 + U, ..., K_{Ns} + U) dU = ∫_{S_r} C(K_1 + U, ..., K_{Ns} + U) [U, ..., U] / ‖U‖_F dU.

By definition,

C_r(K) = ( ∫_{B_r} C(K_1 + U, ..., K_{Ns} + U) dU ) / vol_d(B_r).

Also,

E_{U∼S_r}[ C(K_1 + U, ..., K_{Ns} + U) [U, ..., U] ] = r · E_{U∼S_r}[ C(K_1 + U, ..., K_{Ns} + U) [U, ..., U] / r ] = r · ( ∫_{S_r} C(K_1 + U, ..., K_{Ns} + U) [U, ..., U] / ‖U‖_F dU ) / vol_{d−1}(S_r).

The lemma follows from combining these equations and using the fact that vol_d(B_r) = vol_{d−1}(S_r) · r / d.

From this lemma we can see that we only need a polynomial number of samples to estimate the gradient.
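A quick numerical sanity check of Lemma 8 (our own, on a toy quadratic cost C(K) = ⟨K, K⟩, for which the exact gradient is 2K and coincides with the smoothed gradient):

```python
import numpy as np

# Monte Carlo check of grad C_r(K) = (d/r^2) E_{U~S_r}[C(K+U) U] for C(K) = <K, K>.
rng = np.random.default_rng(0)
K = rng.standard_normal((2, 3))
d, r, m = K.size, 0.5, 200_000
est = np.zeros_like(K)
for _ in range(m):
    U = rng.standard_normal(K.shape)
    U *= r / np.linalg.norm(U)              # uniform direction on the radius-r sphere
    est += np.sum((K + U) ** 2) * U
est *= d / (m * r ** 2)
print(np.round(est, 1))                     # should roughly match the exact gradient ...
print(np.round(2 * K, 1))                   # ... 2K, up to Monte Carlo error
```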

We now define the different gradient estimates used below:

∇̃ = ( ∇̃_1, ..., ∇̃_{Ns} ),   where   ∇̃_p = (1/m) Σ_{i=1}^m d Ĉ_i / r²   and   Ĉ_i = Σ_{t=0}^∞ C(K_{ω(t)} + U_i) · U_i · 1{ω(t)=p};   (22)

∇̂ = ( ∇̂_1, ..., ∇̂_{Ns} ),   where   ∇̂_p = (1/m) Σ_{i=1}^m d Ĉ_i / r²   and   Ĉ_i = Σ_{t=0}^{ℓ−1} C(K_{ω(t)} + U_i) · U_i · 1{ω(t)=p};   (23)

∇' = ( ∇'_1, ..., ∇'_{Ns} ),   where   ∇'_p = (1/m) Σ_{i=1}^m d Ĉ_i / r²   and   Ĉ_i = Σ_{j=0}^∞ C^{(ℓ)}(K_{ω(j)} + U_i) · U_i · 1{ω(j)=p}   (24)

(where C^{(ℓ)} is the truncated cost defined in Lemma 1). In the definitions above, ∇̃ is the estimated gradient with finitely many trajectories (and an untruncated roll-out), ∇̂ is the estimated gradient with finitely many trajectories and a finite roll-out length, and ∇' is the corresponding quantity in which the cost is replaced by its truncated version C^{(ℓ)}; it plays the role of the expectation of ∇̂ over the initial states.

Lemma 9. For any ε there are fixed polynomials f_r(1/ε), f_sample(d, 1/ε) such that when r ≤ 1/f_r(1/ε) and m ≥ f_sample(d, 1/ε), with high probability (at least 1 − (d/ε)^{−d}) the estimate ∇̃ is ε-close to ∇C(K) in Frobenius norm.

Proof. We break the proof into two parts:

∇̃ − ∇C(K) = ( ∇C_r(K) − ∇C(K) ) + ( ∇̃ − ∇C_r(K) ).

For the first term we choose f_r(1/ε) = min{ 1/r_0, 2 g_diff/ε } (r_0 is chosen later). By Lemma 4, when r is smaller than 1/f_r(1/ε) = ε/(2 g_diff), every point on the sphere has

‖∇C(K + U) − ∇C(K)‖_F ≤ ε/4.

Since ∇C_r(K) is an expectation of ∇C(K + U), by the triangle inequality

‖∇C_r(K) − ∇C(K)‖_F ≤ ε/2.

Now for the second term: from Lemma 8 we know that E[∇̃] = ∇C_r(K). Also, each individual sample has norm bounded by 2dC(K)/r, so by the Vector Bernstein inequality, with m ≥ f_sample(d, 1/ε) = Θ( d ( dC(K)/(εr) )² log(d/ε) ) samples, with high probability (at least 1 − (d/ε)^{−d}),

‖∇̃ − E[∇̃]‖_F ≤ ε/2.

Adding both terms proves the claim; r_0 is chosen according to Lemma 4 up to relevant polynomial factors.


Lemma 10. For x ∼ D with ‖x‖ ≤ L, there are polynomials f_{ℓ,grad}(d, 1/ε), f_{r,trunc}(1/ε), f_{sample,trunc}(d, 1/ε, L²/µ) such that when m ≥ f_{sample,trunc}(d, 1/ε, L²/µ) and ℓ ≥ f_{ℓ,grad}(d, 1/ε), letting x_j^i, u_j^i (0 ≤ j ≤ ℓ) be a single path sampled using our algorithm, we have

‖∇̂ − ∇C(K)‖ ≤ ε.

Proof. We first divide the proof into three parts:

∇̂ − ∇C(K) = ( ∇̂ − ∇' ) + ( ∇' − ∇̃ ) + ( ∇̃ − ∇C(K) ).

We have already defined ∇'. For the bound on the term ( ∇̃ − ∇C(K) ) we use Lemma 9 and choose

f_{r,trunc}(1/ε) = f_r(2/ε),

while making sure that

f_{sample,trunc}(d, 1/ε) ≥ f_sample(d, 2/ε).

With these choices, ( ∇̃ − ∇C(K) ) is less than ε/2.

Now for the term ( ∇' − ∇̃ ), we first use Lemma 1: for any K', if we choose

ℓ ≥ 16 d² · C²(K) ( ‖Q‖ + ‖R‖ ‖K‖² ) / ( ε r µ σ²_min(Q) ) =: f_{ℓ,grad}(d, 1/ε),

then it holds that

‖C^{(ℓ)}(K') − C(K')‖ ≤ rε / (4d).

Multiplying both sides by d/r², multiplying by the perturbation U_i on the left (whose Frobenius norm is r), averaging over the m trajectories, and using the triangle inequality, we get

‖ (1/m) Σ_{i=1}^m (d/r²) [ C^{(ℓ)}(K + U_i) U_i − C(K + U_i) U_i ] ‖ ≤ ε/4,   i.e.   ‖∇' − ∇̃‖ ≤ ε/4.

For the remaining first term ( ∇̂ − ∇' ), we observe that E[∇̂] = ∇'. The randomness in ∇̂ comes from the initial state x_0, and we take the expectation over x_0 to obtain ∇'. Since we have assumed ‖x_0^i‖ ≤ L,

(x_0^i)(x_0^i)^T ⪯ (L²/µ) E[ x_0 x_0^T ].

Applying this to the cost and summing over time, we have

Σ_{j=0}^{ℓ−1} (x_j^i)^T Q_{ω(j)} x_j^i + (u_j^i)^T R_{ω(j)} u_j^i ≤ (L²/µ) C(K + U_i).

The left-hand side is the cost of a single trajectory, while the right-hand side bounds it. Summing over all trajectories gives a bound on ∇̂, and since ∇' is the expectation of ∇̂, we can apply the Vector Bernstein inequality: when f_{sample,trunc}(d, 1/ε, L²/µ) is a large enough polynomial, ‖∇̂ − ∇'‖ ≤ ε/4 with high probability. Summing all the terms,

‖∇̂ − ∇C(K)‖ ≤ ε/4 + ε/4 + ε/2 = ε.

6.2 Convergence of Gradient Descent

Using the lemmas of the previous sections we now prove the convergence of gradient descent. First we show that we can estimate X_K accurately, and then prove convergence. Recall that

X_K = ( Σ_{t=0}^∞ X_1(t), ..., Σ_{t=0}^∞ X_{Ns}(t) ),   χ_K = diag( Σ_{t=0}^∞ X_1(t), ..., Σ_{t=0}^∞ X_{Ns}(t) ).

Now define the estimate

X̂_K = ( X̂_1^K, ..., X̂_{Ns}^K ),   where   X̂_p^K = (1/m) Σ_{i=1}^m Σ_{t=0}^{ℓ−1} x_t^i (x_t^i)^T 1{ω(t)=p}.

Define

etagrad = 2 ( ‖R‖_max + ‖B‖²_max ( 1 + ‖B‖_max α / µ ) ) α / Λ_min(Q),   (25)

where

ξ = ( 1/Λ_min(Q) ) ( 1 + ‖B‖²_max α/µ + ‖R‖_max ) − 1,

and α is defined through the sublevel set

K_α := { K ∈ K : C(K) < α }   for every α ≥ C(K*).

Lemma 11. Suppose C(K_0) is finite. For any stepsize η ≤ (etagrad)^{−1}, where etagrad is defined in (25), the policy gradient method K_{t+1} = K_t − η ∇C(K_t) converges to the global minimum K* ∈ K linearly as follows:

C(K_t) − C(K*) ≤ ( 1 − 2ηµ²Λ_min(R)/‖X_{K*}‖ )^t ( C(K_0) − C(K*) ),   (26)

and

C(K_{t+1}) − C(K*) ≤ ( 1 − 2ηµ²Λ_min(R)/‖X_{K*}‖ ) ( C(K_t) − C(K*) ).

Proof. This lemma is taken from Theorem 1 and Lemma 5 of [1].

Theorem 3. Suppose C(K_0) is finite and µ > 0. Consider the update rule

K_{t+1} = K_t − η ∇̂C(K_t)

with step size η ≤ 1/etagrad, where the gradient is estimated as in Lemmas 9 and 10 with r = 1/f_{GD,r}(1/ε) and m ≥ f_{GD,sample}(d, 1/ε, L²/µ) samples, and the roll-outs are truncated to f_{GD,ℓ}(d, 1/ε) steps. Then, with high probability (at least 1 − exp(−d)), in T iterations where

T > ( ‖X_{K*}‖ / (µ Λ_min(R)) ) · etagrad · log( 2 (C(K_0) − C(K*)) / ε ),

with etagrad defined in (25), the gradient descent update rule satisfies

C(K_T) − C(K*) ≤ ε.

Proof. The proof of this theorem is easy and quite similar to that for natural gradient descent, where we have one less term to bound, and runs as follows. Let

K_0 = (K_{0,1}, ..., K_{0,Ns}),   K_t = (K_{t,1}, ..., K_{t,Ns}),
K_{t+1} = K_t − η ∇C(K_t),
K'_{t+1} = K_t − η ∇̂,
K* = (K*_1, ..., K*_{Ns}),

where K* is the optimal controller. From Lemma 11 we know that for η ≤ 1/etagrad,

C(K_{t+1}) − C(K*) ≤ ( 1 − 2ηµ²Λ_min(R)/‖X_{K*}‖ ) ( C(K_t) − C(K*) ).

So, if we can prove that

C(K_{t+1}) − C(K'_{t+1}) ≤ (ε/2) η Λ_min(R) µ² / ‖X_{K*}‖,

then from the above two relations we see that if C(K_t) − C(K*) ≥ ε, then

C(K'_{t+1}) − C(K*) ≤ ( 1 − 3ηµ²Λ_min(R)/(2‖X_{K*}‖) ) ( C(K_t) − C(K*) ).   (27)

We now use induction over T steps. From (27) the cost value shrinks as t increases, so C(K_t) ≤ C(K_0). If

T ≥ ( ‖X_{K*}‖ / µ ) · etagrad · log( 2 (C(K_0) − C(K*)) / ε ),

then

C(K_T) − C(K*) ≤ ε,

where the last step follows by induction and Lemma 11. So now we only have to prove the equation below and we are done:

C(K_{t+1}) − C(K'_{t+1}) ≤ (ε/2) η Λ_min(R) µ² / ‖X_{K*}‖.   (28)

Here we use Lemma 3 to pass from a bound on the policies to a bound on the costs: by Lemma 3, (28) holds provided

‖K'_{t+1} − K_{t+1}‖ ≤ ( ε / (2 c_diff) ) η Λ_min(R) µ² / ‖X_{K*}‖.   (29)

By the definitions of K'_{t+1} and K_{t+1}, (29) becomes

‖∇̂ − ∇C(K)‖ ≤ ( ε / (2 c_diff) ) Λ_min(R) µ² / ‖X_{K*}‖.

Using Lemma 10, this can be achieved by choosing

f_{GD,r}(1/ε) = f_{r,trunc}( 2 c_diff ‖X_{K*}‖ / ( µ² Λ_min(R) ε ) ),
f_{GD,sample}(d, 1/ε, L²/µ) = f_{sample,trunc}( d, 2 c_diff ‖X_{K*}‖ / ( µ² Λ_min(R) ε ), L²/µ ),
f_{GD,ℓ}(d, 1/ε) = f_{ℓ,grad}( d, 2 c_diff ‖X_{K*}‖ / ( µ² Λ_min(R) ε ) ).

With these choices the difference between the estimated and exact gradient satisfies

‖∇̂ − ∇C(K)‖ ≤ ( ε / (2 c_diff) ) Λ_min(R) µ² / ‖X_{K*}‖,

so (29) holds:

‖K'_{t+1} − K_{t+1}‖ ≤ ( ε / (2 c_diff) ) η Λ_min(R) µ² / ‖X_{K*}‖.

If the above bound is satisfied, then our desired bound

C(K_{t+1}) − C(K'_{t+1}) ≤ (ε/2) η Λ_min(R) µ² / ‖X_{K*}‖

also holds, and the proof is complete.

6.3 Convergence of Natural Gradient Descent

In this section we prove the convergence of natural gradient descent.

Lemma 12. For x ∼ D with ‖x‖ ≤ L, initial states x_0^1, ..., x_0^m, and m random perturbations U_i ∼ S_r with r ≤ 1/f_{r,var}(1/ε), suppose we simulate trajectories from these initial points using Algorithm 1 with roll-out length ℓ ≥ f_{ℓ,var}(d, 1/ε), repeated m ≥ f_{varsample,trunc}(d, 1/ε, L²/µ) times, and estimate X̂_K. Then almost surely there exist polynomials f_{r,var}(1/ε), f_{varsample,trunc}(d, 1/ε, L²/µ) and f_{ℓ,var}(d, 1/ε) such that, with high probability (at least 1 − (d/ε)^{−d}), the estimate X̂_K satisfies

‖X̂_K − X_K‖ ≤ ε.

Proof. X_{K,ℓ} is as defined in Lemma 2. Now let

X̄_K = (1/m) Σ_{i=1}^m X_{K+U_i},   X̄_{K,ℓ} = (1/m) Σ_{i=1}^m X_{K+U_i,ℓ}.

We divide the proof into three parts as follows:

X̂_K − X_K = ( X̂_K − X̄_{K,ℓ} ) + ( X̄_{K,ℓ} − X̄_K ) + ( X̄_K − X_K ).

For the second term, X̄_{K,ℓ} − X̄_K, we directly use Lemma 2: if we choose

ℓ ≥ f_{ℓ,var}(d, 1/ε) = 8 d · C²(K) / ( ε µ Λ²_min(Q) ),

then this term is bounded by ε/4.

For the first term, X̂_K − X̄_{K,ℓ}, we note that E[X̂_K] = X̄_{K,ℓ}, where the expectation is taken over the initial points. Here we give arguments similar to those of the previous lemma, which bounded the estimated gradient against its expectation. Since ‖x_0^i‖ ≤ L and (x_0^i)(x_0^i)^T ⪯ (L²/µ) E[ x_0 x_0^T ], for every mode i we have

Σ_{j=0}^{ℓ−1} x_j^i (x_j^i)^T 1{ω(j)=i} ⪯ (L²/µ) X_i^K.

Summing over all m trajectories gives a bound on X̂_K, so we can choose f_{varsample,trunc} to be a large enough polynomial such that ‖X̂_K − X̄_{K,ℓ}‖ ≤ ε/2 with high probability; the last step uses the Vector Bernstein inequality to bound the difference between the value and its expectation.

For the third term, X̄_K − X_K, we use the perturbation results (Lemma 5): when

r ≤ ε ( Λ_min(Q) / C(K) )² µ / ( 16 ‖B‖_max ( ‖A − BK‖_max + 1 ) ),

we have ‖X_{K+U_i} − X_K‖ ≤ ε/4. Since X̄_K is the average of the X_{K+U_i}, by the triangle inequality ‖X̄_K − X_K‖ ≤ ε/4. So, combining the three terms,

‖X̂_K − X_K‖ ≤ ε/2 + ε/4 + ε/4 = ε.

Lemma 13. Suppose C(K_0) is finite. For any stepsize

η ≤ 1 / ( 2( ‖R‖ + ‖B‖² C(K_0)/µ ) ),

the natural policy gradient method K_{t+1} = K_t − η ∇C(K_t) (χ_{K_t})^{−1} converges to the global minimum K* ∈ K linearly as follows:

C(K_t) − C(K*) ≤ ( 1 − 2ηµΛ_min(R)/‖X_{K*}‖ )^t ( C(K_0) − C(K*) ),   (30)

and

C(K_{t+1}) − C(K*) ≤ ( 1 − 2ηµΛ_min(R)/‖X_{K*}‖ ) ( C(K_t) − C(K*) ).

Proof. This lemma is taken from Theorem 3 and Lemma 9 of [1]; there, a one-step bound is first proved and then extended by induction to t steps.

Theorem 4. Suppose C(K_0) is finite and µ > 0. Consider the update rule

K_{t+1} = K_t − η ∇̂C(K_t) (χ̂_{K_t})^{−1}

with step size

η ≤ 1 / ( 2( ‖R‖ + ‖B‖² C(K_0)/µ ) ),

where the gradient and X_K are estimated as in Lemmas 9, 10 and 12 with r = 1/f_{NGD,r}(1/ε) and m ≥ f_{NGD,sample}(d, 1/ε, L²/µ) samples, and the roll-outs are truncated to f_{NGD,ℓ}(d, 1/ε) steps. Then, with high probability (at least 1 − exp(−d)), in T iterations where

T > ( ‖X_{K*}‖ / µ ) ( ‖R‖/Λ_min(R) + ‖B‖² C(K_0)/(µΛ_min(R)) ) log( 2 (C(K_0) − C(K*)) / ε ),

the natural gradient descent update rule satisfies

C(K_T) − C(K*) ≤ ε.

Proof. Let

K_0 = (K_{0,1}, ..., K_{0,Ns}),   K_t = (K_{t,1}, ..., K_{t,Ns}),
K_{t+1} = K_t − η ∇C(K_t) (χ_{K_t})^{−1},
K'_{t+1} = K_t − η ∇̂ (X̂_{K_t})^{−1},
K* = (K*_1, ..., K*_{Ns}),

where K* is the optimal controller. From Lemma 13 we know that for

η ≤ 1 / ( 2( ‖R‖ + ‖B‖² C(K_0)/µ ) ),

C(K_{t+1}) − C(K*) ≤ ( 1 − 2ηµΛ_min(R)/‖X_{K*}‖ ) ( C(K_t) − C(K*) ).

So, if we can prove that

C(K_{t+1}) − C(K'_{t+1}) ≤ (ε/2) η Λ_min(R) µ / ‖X_{K*}‖,

then from the above two relations we see that if C(K_t) − C(K*) ≥ ε, then

C(K'_{t+1}) − C(K*) ≤ ( 1 − 3ηµΛ_min(R)/(2‖X_{K*}‖) ) ( C(K_t) − C(K*) ).   (31)

We now use induction over T steps. From (31) the cost value shrinks as t increases, so C(K_t) ≤ C(K_0). If

T ≥ ( ‖X_{K*}‖ / µ ) ( ‖R‖/Λ_min(R) + ‖B‖² C(K_0)/(µΛ_min(R)) ) log( 2 (C(K_0) − C(K*)) / ε ),

then C(K_T) − C(K*) ≤ ε, where the last step follows by induction and Lemma 13. So now we only have to prove the equation below and we are done:

C(K_{t+1}) − C(K'_{t+1}) ≤ (ε/2) η Λ_min(R) µ / ‖X_{K*}‖.   (32)

Here we use Lemma 3 to pass from a bound on the policies to a bound on the costs: by Lemma 3, (32) holds provided

‖K'_{t+1} − K_{t+1}‖ ≤ ( ε / (2 c_diff) ) η Λ_min(R) µ / ‖X_{K*}‖.   (33)

By the definitions of K'_{t+1} and K_{t+1}, (33) becomes

‖∇̂ (X̂_K)^{−1} − ∇C(K) (X_K)^{−1}‖ ≤ ( ε / (2 c_diff) ) Λ_min(R) µ / ‖X_{K*}‖.

To prove this we break the left-hand side into two parts:

‖∇̂ (X̂_K)^{−1} − ∇C(K) (X_K)^{−1}‖ ≤ ‖∇̂ − ∇C(K)‖ ‖(X̂_K)^{−1}‖ + ‖∇C(K)‖ ‖(X̂_K)^{−1} − (X_K)^{−1}‖.   (34)

For the first term: for large enough samples, by Weyl's theorem, ‖(X̂_K)^{−1}‖ ≤ 2/µ. We then need to make sure that

‖∇̂ − ∇C(K)‖ ≤ ( ε / (8 c_diff) ) Λ_min(R) µ² / ‖X_{K*}‖.

Using Lemma 10, this can be done by choosing

f_{NGD,grad,r}(1/ε) = f_{r,trunc}( 8 c_diff ‖X_{K*}‖ / ( µ² Λ_min(R) ε ) ),
f_{NGD,gradsample}(d, 1/ε, L²/µ) = f_{sample,trunc}( d, 8 c_diff ‖X_{K*}‖ / ( µ² Λ_min(R) ε ), L²/µ ),
f_{NGD,ℓ,grad}(d, 1/ε) = f_{ℓ,grad}( d, 8 c_diff ‖X_{K*}‖ / ( µ² Λ_min(R) ε ) ).

For the second term we need

‖(X̂_K)^{−1} − (X_K)^{−1}‖ ≤ ( ε / (4 c_diff) ) Λ_min(R) µ / ( ‖X_{K*}‖ ‖∇C(K)‖ ).

Here we use some standard matrix perturbation facts:

Λ_min(X_K) ≥ µ,   ‖X̂_K − X_K‖ ≤ µ/2,   ‖(X̂_K)^{−1} − (X_K)^{−1}‖ ≤ 2 ‖X̂_K − X_K‖ / µ².

For this bound we use Lemma 12 and choose

f_{NGD,var,r}(1/ε) = f_{r,var}( 8 c_diff ‖X_{K*}‖ ‖∇C(K)‖ / ( µ³ Λ_min(R) ε ) ),
f_{NGD,varsample}(d, 1/ε, L²/µ) = f_{varsample,trunc}( d, 8 c_diff ‖X_{K*}‖ ‖∇C(K)‖ / ( µ³ Λ_min(R) ε ), L²/µ ),
f_{NGD,ℓ,var}(d, 1/ε) = f_{ℓ,var}( d, 8 c_diff ‖X_{K*}‖ ‖∇C(K)‖ / ( µ³ Λ_min(R) ε ) ).

Finally, if we choose

f_{NGD,r} = max{ f_{NGD,grad,r}, f_{NGD,var,r} },
f_{NGD,sample} = max{ f_{NGD,gradsample}, f_{NGD,varsample} },
f_{NGD,ℓ} = max{ f_{NGD,ℓ,grad}, f_{NGD,ℓ,var} },

then equation (34) becomes

‖∇̂ (X̂_K)^{−1} − ∇C(K) (X_K)^{−1}‖ ≤ ‖∇̂ − ∇C(K)‖ ‖(X̂_K)^{−1}‖ + ‖∇C(K)‖ ‖(X̂_K)^{−1} − (X_K)^{−1}‖
≤ ( ε / (8 c_diff) ) ( Λ_min(R) µ² / ‖X_{K*}‖ ) (2/µ) + ( ε / (4 c_diff) ) ( Λ_min(R) µ / ( ‖X_{K*}‖ ‖∇C(K)‖ ) ) ‖∇C(K)‖
≤ ( ε / (2 c_diff) ) Λ_min(R) µ / ‖X_{K*}‖,

so (33) holds:

‖K'_{t+1} − K_{t+1}‖ ≤ ( ε / (2 c_diff) ) η Λ_min(R) µ / ‖X_{K*}‖.

If the above bound is satisfied, then our desired bound

C(K_{t+1}) − C(K'_{t+1}) ≤ (ε/2) η Λ_min(R) µ / ‖X_{K*}‖

also holds, and the proof is complete.


References

[1] Policy Optimization for Markovian Jump Linear Quadratic Control: Gradient-Based Methods and Global Convergence. http://arxiv.org/abs/2011.11852v1

[2] M. Fragoso, "Discrete-time jump LQG problem," International Journal of Systems Science, vol. 20, no. 12, pp. 2539–2545, 1989.

[3] O. Costa, M. Fragoso, and R. Marques, Discrete-time Markov Jump Linear Systems. Springer London, 2006.

[4] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

[5] Y. Bar-Shalom and X. Li, "Estimation and tracking: principles, techniques, and software," Norwood, MA: Artech House, Inc., 1993.

[6] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "Bayesian nonparametric inference of switching dynamic linear models," IEEE Transactions on Signal Processing, vol. 59, no. 4, pp. 1569–1585, 2011.

[7] D. Sworder and J. Boyd, Estimation Problems in Hybrid Systems. Cambridge University Press, 1999.

[8] K. Gopalakrishnan, H. Balakrishnan, and R. Jordan, "Stability of networked systems with switching topologies," in IEEE Conference on Decision and Control, 2016, pp. 1889–1897.

[9] A. N. Vargas, E. F. Costa, and J. B. R. do Val, "On the control of Markov jump linear systems with no mode observation: application to a DC motor device," Int. J. Robust Nonlinear Control, vol. 23, no. 10, pp. 1136–1150, 2013.

[10] B. Hu, P. Seiler, and A. Rantzer, "A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints," in Conference on Learning Theory, 2017, pp. 1157–1189.

[11] V. Pavlovic, J. Rehg, and J. MacCormick, "Learning switching linear models of human motion," in Advances in Neural Information Processing Systems, 2000.

[12] B. Hu and U. Syed, "Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory," in Advances in Neural Information Processing Systems, 2019, pp. 8477–8488.

[13] M. Salathe, M. Kazandjieva, J. W. Lee, P. Levis, M. W. Feldman, and J. H. Jones, "A high resolution human contact network for infectious disease transmission," Proceedings of the National Academy of Sciences, vol. 107, no. 51, pp. 22020–22025, 2010.

[14] M. Zanin and F. Lillo, "Modelling the air transport with complex networks: A short review," The European Physical Journal Special Topics, vol. 215, pp. 5–21, 2013.

[15] K. J. Astrom and B. Wittenmark. Adaptive Control. Addison Wesley, 1989.

[16] L. Ljung, editor. System Identification (2nd Ed.): Theory for the User. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1999. ISBN 0-13-656695-2.

[17] C.-N. Fiechter. PAC adaptive control of linear systems. In Proceedings of the Seventh Annual Conference on Computational Learning Theory (COLT '94), pages 88–97, 1994.

[18] Y. Abbasi-Yadkori and C. Szepesvari. Regret bounds for the adaptive control of linear quadratic systems. Conference on Learning Theory, 2011. ISSN 15337928.

[19] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In COLT, 2018.

[20] S. Arora, E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang. Towards provable control for unknown linear dynamical systems. 2018.

[21] E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang. Spectral filtering for general linear dynamical systems. arXiv preprint arXiv:1802.03981, 2018.

[22] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. ArXiv e-prints, 2017.

[23] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011. ISSN 16726340. doi: 10.1007/s11768-011-1005-3.

[24] S. J. Bradtke, B. E. Ydstie, and A. G. Barto. Adaptive linear quadratic control using policy iteration. Proceedings of the American Control Conference, 3(2):3475–3479, 1994. doi: 10.1109/ACC.1994.735224.

[25] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529, 2016.

[26] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, "Global convergence of policy gradient methods for the linear quadratic regulator," in Proceedings of the 35th International Conference on Machine Learning, vol. 80, 2018, pp. 1467–1476.

[27] S. Kakade. A natural policy gradient. In NIPS, 2001.

[28] Statistical Estimation of Ergodic Markov Chain Kernel over Discrete State Space. https://arxiv.org/abs/1809.05014
