
On the Complexity of Markov Decision Problems

Yichen Chen

A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy

Recommended for Acceptance
by the Department of
Computer Science

Adviser: Professor Mengdi Wang

June 2020

© Copyright by Yichen Chen, 2020.
All rights reserved.

Abstract

The Markov decision problem (MDP) is one of the most basic models for sequential decision-making in a dynamic environment where outcomes are partly random. It models a stochastic control process in which a planner makes a sequence of decisions as the system evolves. The Markov decision problem provides a mathematical framework for dynamic programming, stochastic control, and reinforcement learning. In this thesis, we study the complexity of solving MDPs.

In the first part of the thesis, we propose a class of stochastic primal-dual methods for solving MDPs. We formulate the core policy optimization problem of the MDP as a stochastic saddle point problem. By utilizing the value-policy duality structure, the algorithm samples state-action transitions and makes alternating updates to value and policy estimates. We prove that our algorithm finds approximately optimal policies for Markov decision problems with small space and computational complexity. Using linear models to represent the value functions and the policies, our algorithm is capable of scaling to problems with infinite and continuous state spaces.

In the second part of the thesis, we establish computational complexity lower bounds for solving MDPs. We prove our results by modeling MDP algorithms as branching programs and then characterizing the properties of these programs by quantum arguments. The analysis is also extended to the complexity of solving two-player turn-based Markov games. Our results show that if we have a simulator that can sample according to the transition probability function in O(1) time, the lower bounds have a reduced dependence on the number of states. These results suggest a fundamental difference between Markov decision problems with and without a simulator.

We believe that our results provide a new piece of theoretical evidence for the success of simulation-based methods in solving MDPs and Markov games.

Acknowledgements

    I would like to give my deepest gratitude to my advisor, Professor Mengdi Wang, for

    her guidance, patience, and inspiration in the past five years. It is she who introduced

    me to the world of dynamic programming. From then on, she has been a continuing

    source of wisdom and support. I benefit tremendously from her taste of research,

    meticulous attention to detail, and keen insights into solving research problems. I

    am forever indebted to her for investing so much time in training me as a researcher,

    reading over my papers, and improving my presentation skills. Beyond research, she

    encouraged me to break out of my comfort zone, gave me plenty of opportunities to

    attend conferences, and introduced me to world-class scientists in the field. Working

    with her has been one of the most amazing experiences of my life.

    I would like to thank Professor Sanjeev Arora and Professor Robert Schapire

    – two leading figures in the research of theoretical computer science and machine

    learning – for being on my thesis committee. Their commitment to research is truly

    inspiring and motivates me to keep exploring the unknown. It has been a great honor

    for me to interact with them over the years within Princeton. I am also grateful to

    Professor Yuxin Chen and Professor Karthik Narasimhan for the valuable suggestions

    and comments they have made regarding my research.

    My special thanks go to my collaborator Lihong Li, who is also my mentor during

    my internship at Google. I appreciate the discussion with him about the interesting

problems we worked on together. I am grateful for his invaluable advice regarding re-

    search, presentation, writing, and others. He is always ready to help if I encounter

    any difficulties in research or life. I also want to thank my localhost Jingtao Wang

    who has made my internship at Google such an enjoyable journey.

    I had a great time working at Princeton. I would like to thank the (past and

    present) members of my research group: Saeed Ghadimi, Xudong Li, Lin Yang, Jian

    Ge, Galen Cho, Hao Lu, Yifan Sun, Yaqi Duan, Hao Gong, Zheng Yu. I would like to

thank my other friends in Princeton: Weichen Wang, Junwei Lu, Levon Avanesyan,

    Zongxi Li, Suqi Liu, Cong Ma, Kaizheng Wang, Xiaozhou Li, Nanxi Kang, Xin Jin,

    Xinyi Fan, Linpeng Tang, Haoyu Zhang, Kelvin Zou, Yixin Sun, Yinda Zhang, Jun

    Su, Zhengyu Song and Yixin Tao. Thank you all for making this journey enjoyable.

    Finally, I would like to thank my parents Tianlei Chen and Zhixia Tao for their

    unconditional love and support throughout my life. I would also like to thank my

    beloved partner, Yiqin Shen, for her love, inspiration, and company. I dedicate this

    thesis to them.

To my family.

Contents

    Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

    Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

    1 Introduction 1

    2 Stochastic Primal-Dual Methods for Markov Decision Problems 6

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.2 Preliminaries and Formulation of the Problem . . . . . . . . . . . . . 10

    2.2.1 Finite-State Discounted-Reward MDP . . . . . . . . . . . . . 10

    2.2.2 Finite-State Finite-Horizon MDP . . . . . . . . . . . . . . . . 11

    2.3 Value-Policy Duality of Markov Decision Processes . . . . . . . . . . 14

    2.3.1 Finite-State Discounted-Reward MDP . . . . . . . . . . . . . 14

    2.3.2 Finite-State Finite-Horizon MDP . . . . . . . . . . . . . . . . 18

    2.4 Stochastic Primal-Dual Methods for Markov Decision Problems . . . 22

    2.5 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    2.5.1 Sample Complexity Analysis of Discounted-Reward MDPs . . 27

    2.5.2 Sample Complexity Analysis of Finite-Horizon MDPs . . . . . 30

    2.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

    2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Primal-Dual π Learning Using State and Action Features 37

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    3.2 Preliminaries and Formulation of the Problem . . . . . . . . . . . . . 40

    3.2.1 Infinite-Horizon Average-Reward MDP . . . . . . . . . . . . . 41

    3.2.2 Infinite-Horizon Discounted-Reward MDP . . . . . . . . . . . 44

    3.3 Model Reduction of MDP using State and Action Features . . . . . . 45

    3.3.1 Using State and Action Features As Bases . . . . . . . . . . . 46

    3.3.2 Reduced-Order Bellman Saddle Point Problem for Average-

    Reward MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    3.3.3 Reduced-Order Bellman Saddle Point Problem for Discounted-

    Reward MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    3.4 Primal-Dual π Learning for Average-Reward MDPs . . . . . . . . . . 50

    3.4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    3.4.2 Sample Complexity Analysis . . . . . . . . . . . . . . . . . . . 52

    3.5 Primal-Dual π Learning for Discounted-Reward MDPs . . . . . . . . 63

    3.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    3.5.2 Sample Complexity Analysis . . . . . . . . . . . . . . . . . . . 64

    3.6 Related Literatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

    4 Complexity Lower Bounds of Discounted-Reward MDPs 76

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

    4.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    4.4 Family of Hard Instances of MDP . . . . . . . . . . . . . . . . . . . . 84

    4.4.1 Hard Instances of Standard MDP . . . . . . . . . . . . . . . . . 84

    4.4.2 Hard Instances of CDP MDP and Binary Tree MDP . . . . . . . . 85

    4.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.5.1 A Sub-Problem . . . . . . . . . . . . . . . . . . . . . . . . 87

    4.5.2 Proof of Theorem 4.3.1 . . . . . . . . . . . . . . . . . . . . . 88

    4.5.3 Proofs of Theorem 4.3.2 and 4.3.3 . . . . . . . . . . . . . . . . 92

    4.5.4 Proof of Lemma 4.5.1 . . . . . . . . . . . . . . . . . . . . . . . 93

    5 Complexity Lower Bounds of Markov Games 97

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

    5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

    5.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

    5.4 Hard Instances of Markov Games . . . . . . . . . . . . . . . . . . . . 107

    5.4.1 Hard Instances of Array Markov Game . . . . . . . . . . . . . 107

    5.4.2 Hard Instances of CDP Markov Game and Binary Tree Markov

    Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

    5.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

    5.5.1 Computational Model . . . . . . . . . . . . . . . . . . . . . . 113

    5.5.2 Proof of Theorem 5.3.1 . . . . . . . . . . . . . . . . . . . . . . 114

    5.5.3 Proof of Theorem 5.3.2 . . . . . . . . . . . . . . . . . . . . . . 118

    5.6 Extension to Markov Games with Irreducibility Property . . . . . . . 120

    A Proofs in Chapter 2 122

    A.1 Analysis of the SPD-dMDP Algorithm 2.1 . . . . . . . . . . . . . . . 122

    A.1.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . 123

    A.1.2 Proof of Theorem 2.5.1 . . . . . . . . . . . . . . . . . . . . . . 130

    A.1.3 Proof of Theorem 2.5.2 . . . . . . . . . . . . . . . . . . . . . . 131

    A.1.4 Proof of Theorem 2.5.3 . . . . . . . . . . . . . . . . . . . . . . 133

    A.2 Analysis of the SPD-fMDP Algorithm 2.2 . . . . . . . . . . . . . . . 134

    A.2.1 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . 134

    A.2.2 Proof of Theorem 2.5.4 . . . . . . . . . . . . . . . . . . . . . . 145

A.2.3 Proof of Theorem 2.5.5 . . . . . . . . . . . . . . . . . . . . . . 146

    A.2.4 Proof of Theorem 2.5.6 . . . . . . . . . . . . . . . . . . . . . . 148

    B Proofs in Chapter 3 150

    B.1 Proof of Lemmas 3.4.1 and 3.4.2 . . . . . . . . . . . . . . . . . . . . . 150

    B.2 Proof of Lemmas 3.5.1 and 3.5.2 . . . . . . . . . . . . . . . . . . . . . 155

    C Proofs in Chapter 5 159

    C.1 Supporting Lemmas for Theorem 5.3.1 . . . . . . . . . . . . . . . . . 159

    C.1.1 Proof of Lemma 5.5.1 . . . . . . . . . . . . . . . . . . . . . . . 159

    C.1.2 Proof of Lemma 5.5.2 . . . . . . . . . . . . . . . . . . . . . . . 163

    C.2 Supporting Lemmas for Theorem 5.3.2 . . . . . . . . . . . . . . . . . 165

    C.2.1 Proof of Lemma 5.5.3 . . . . . . . . . . . . . . . . . . . . . . . 165

    C.2.2 Proof of Lemma 5.5.4 . . . . . . . . . . . . . . . . . . . . . . 167

    Bibliography 168

List of Figures

4.1 Input of Standard MDP: Arrays of Transition Probabilities for M1 ∈ M1 and M2 ∈ M2 . . . . . . . 84

4.2 Input of CDP MDP: Arrays of Cumulative Probabilities for M3 ∈ M3 and M4 ∈ M4 . . . . . . . 87

4.3 Input of Binary Tree MDP: (a) The snippet of the binary tree of M3 ∈ M3, where all transitions are to bad states. (b) The snippet of the binary tree of M4 ∈ M4, where there is some (s, a) which transitions to a good state sG,2 with probability ε . . . . . . . 87

5.1 A hard instance of MDP: In case of Type I states, the state transitions to a random bad state s′ ∈ SB with probability 1 for all actions a ∈ AN. In case of Type II states, the state transitions to a rewarding state s̄ ∈ SG with probability 1 under some action ā ∈ AN . . . . . . . 107

5.2 A hard instance of CDP Markov Game and Binary Tree Markov Game: In case of Type III states, the state transitions to the first 1/ε states each with probability ε for all actions a ∈ AN. In case of Type IV states, the state transitions to a rewarding state s̄ ∈ SG with probability ε under some action ā ∈ AN . . . . . . . 110

Chapter 1

    Introduction

    Reinforcement learning lies at the intersection between control, machine learning,

    and stochastic processes. Recent empirical successes demonstrate that reinforcement

    learning, combined with sophisticated function approximation techniques (e.g., deep

    neural networks), can conquer even the most challenging tasks in video and board

    games [81, 97]. These successes have drawn attention to the research of the Markov

    decision process, which provides a mathematical framework for dynamic program-

    ming, stochastic control, and reinforcement learning. The Markov decision process

    is a controlled random walk over a state space S, where in each state s ∈ S, one

    can choose an action a from an action space A so that the random walk transitions

to another state s′ ∈ S with probability P(s, a, s′), yielding an immediate reward r(s, a).

    The key goal is to identify an optimal policy that, when running on the Markov

decision process, maximizes some cumulative function of rewards. We use the expres-

    sion Markov decision problem for a Markov decision process together with such an

    optimality criterion [90].

    Researchers have been developing methods for solving Markov decision problems

    for decades. There are three major approaches for solving the MDP: the value iter-

    ation method, the policy iteration method, and the linear programming method. In

1957, Bellman [8] developed an iteration algorithm, called value iteration, to compute

    the optimal total reward function, which is guaranteed to converge in the limit. In

    policy iteration [58], the algorithm alternates between a value determination phase,

    in which the current policy is evaluated, and a policy improvement phase, in which

    the policy is updated according to the evaluation. Around the same time, D’Epenoux

    [39] and de Ghellinck [38] discovered that the MDP has an LP formulation, allowing

    it to be solved by general LP methods such as the simplex method [35], the Ellipsoid

    method [60] or the interior-point algorithm [59].
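To make the classical approaches concrete, the following is a minimal Python sketch of value iteration on a toy discounted MDP. The array layout P[a, i, j] for transition probabilities and r[a, i] for expected rewards is an assumption made for illustration only and is not notation from the thesis.

    # Value iteration sketch: repeatedly apply the Bellman operator until convergence.
    import numpy as np

    def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
        # P has shape (|A|, |S|, |S|); r has shape (|A|, |S|).
        nA, nS = r.shape
        v = np.zeros(nS)
        for _ in range(max_iter):
            q = r + gamma * (P @ v)          # q[a, i] = r(i, a) + gamma * sum_j P_a(i, j) v(j)
            v_new = q.max(axis=0)            # Bellman update
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        return v_new, q.argmax(axis=0)       # approximate optimal values and a greedy policy

Policy iteration and the LP approach mentioned above operate on the same model data, but replace the fixed-point iteration with policy evaluation/improvement steps or with a linear program, respectively.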

    As the notion of computational complexity emerged, there were tremendous efforts

    in analyzing the complexity of MDPs and methods for solving them. All of the

    three methods mentioned above are proved to be able to solve MDPs in polynomial

    time [104, 71, 116, 68]. Here, polynomial time means that the number of arithmetic

    operations needed to compute an exact optimal policy is bounded by a polynomial in

    the number of states, actions and the bit-size of the data. However, though being able

    to solve MDPs in polynomial time, all the methods have a computational complexity

that is superlinear in the input size Θ(|S|²|A|). This is mainly due to the nature

    of the goal to find the exact optimal policy – it requires the full knowledge of the

    system, which suffers from the curse of dimensionality.

    In this thesis, we study the problem of finding the approximately optimal policies

    for MDPs: is there a way to trade the precision of the exact optimal policy for a

    better time complexity? In particular, we are interested in reducing the complexity’s

    dependence on the size of the state space and the size of the action space. To serve this

purpose, we develop a class of stochastic primal-dual methods for solving MDPs.

    We formulate the core policy optimization problem of the MDP as a stochastic saddle

    point problem. By utilizing the value-policy duality structure of MDPs and leveraging

    the powerful machinery from optimization, our primal-dual algorithms can find the

    approximately optimal policies with small space and computational complexity. By

adopting linear models to represent the high-dimensional value function and state-

    action density functions, our algorithm is capable of scaling to problems with infinite

    and continuous state spaces.

    Meanwhile, we are interested in the computational complexity lower bounds for

    solving MDPs. Here, the computational complexity lower bounds mean the minimum

    number of arithmetic operations or queries to the input data, as a function of |S| and

|A|, that is required for any algorithm to solve the MDP with high probability. In

    contrast to the large volumes of upper bound results, there are fewer results on the

    complexity lower bounds [87, 52]. In this thesis, we establish the first computational

    complexity lower bounds for solving MDPs. The analysis is also extended to the

    study of the computational complexity for two-player turn-based Markov games. Our

    results show that if we have a simulator that can sample according to the transition

probability function in O(1) time, the lower bounds have a reduced dependence on the number of states. These results suggest a fundamental difference between problems with and

    without a simulator, which explains the success of the simulation-based approaches

    for Markov decision problems in recent years [97].

    Here we give an overview of each chapter in the thesis. Chapter 2 and Chapter 3

    study the complexity of MDPs from the positive side, developing new algorithms with

    provable convergence rate guarantees. Chapter 4 and Chapter 5 study the complexity

    from the negative side, establishing the first computational complexity lower bounds

    for solving MDPs and Markov games.

    Stochastic primal-dual methods for Markov decision problems (Chapter 2

    [109, 23])

    We study the online estimation of the optimal policy of a Markov decision process

    (MDP). Central to the problem is to learn the optimal policy and/or the value function

    of the system. In the empirically successful actor-critic algorithm [62], the searching

procedure updates the value function and the policy simultaneously. In essence, it uses

    the value estimates to update the policy more accurately and uses the policy estimate

    to update the value function with less variance. In Chapter 2, we justify this procedure

using optimization theory. We show the dual relationship between the value

    function and the policy function in a saddle point formulation of the Markov decision

    process. The interpretation demonstrates the primal-dual nature of the alternating

    update methods. Based on this insight, we propose a class of Stochastic Primal-Dual

    (SPD) methods that exploit the inherent minimax duality of Bellman equations. The

    SPD methods update a few coordinates of the value and policy estimates as a new

    state transition is observed. We prove the convergence rate of these algorithms and

    show that they are both space-efficient and computationally efficient.

    Primal-dual π learning using state and action features (Chapter 3 [22])

    Approximate linear programming (ALP) represents one of the major algorithmic fam-

    ilies to solve large-scale Markov decision processes (MDPs). In this chapter, we study

    a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm

    called primal-dual π learning for reinforcement learning when a sampling oracle is

    provided. This algorithm enjoys several advantages. First, it adopts linear models to

    represent the high-dimensional value function and state-action density functions, re-

    spectively, using given state and action features. Its run-time complexity depends on

    the number of features, not the size of the underlying MDPs. Second, our algorithm

    and analysis apply to an arbitrary state space, even if the state space is continuous

    and infinite. Third, it operates in a fully online fashion without having to store any

    sample, thus having minimal memory footprint. Fourth, we prove that it is sample-

    efficient, solving for the optimal policy to high precision with a sample complexity

    linear in the dimension of the parameter space.

Complexity lower bounds of discounted-reward MDPs (Chapter 4)

    We study the computational complexity of the infinite-horizon discounted-reward

    Markov Decision Problem (MDP) with a finite state space S and a finite action space

A. We show that any randomized algorithm needs a running time of at least Ω(|S|²|A|)

to compute an ε-optimal policy with high probability. We consider two variants of

    the MDP where the input is given in specific data structures, including arrays of

    cumulative probabilities and binary trees of transition probabilities. For these cases,

we show that the complexity lower bound reduces to Ω(|S||A|/ε). These results reveal

    a surprising observation that the computational complexity of the MDP depends on

    the data structure of the input.

    Complexity lower bounds of Markov games (Chapter 5)

    Recent years have seen the huge success of reinforcement learning in strategic games

    for artificial intelligence [81, 97]. The reinforcement learning agent learns to play

    optimal actions from games of self-play. In game theory, such systems are modeled

    by stochastic games, which have been studied for decades [95]. Surprisingly, it is

    only in recent years that 2-player turn-based stochastic games have been proved to

    be solvable in strongly polynomial time [53]. In Chapter 5, we studied the compu-

    tational complexity lower bounds for solving 2-player turn-based stochastic games.

    Our results show that the worst-case computational complexity lower bounds depend

    on the data structure used in representing the stochastic game. When the transition

    probabilities of the game are given in the form of arrays, the lower bounds have a

    linear dependence on the number of states and actions. When the probabilities are

    given in the format that allows efficient simulation of the game, the lower bounds

    have a sublinear dependence on the number of states and actions. Our results sug-

    gest that the efficient simulator used in reinforcement learning methods might be a

    contributing factor to its empirical success.

Chapter 2

    Stochastic Primal-Dual Methods

    for Markov Decision Problems

    2.1 Introduction

The Markov decision process (MDP) is one of the most basic models of dynamic pro-

    gramming, stochastic control and reinforcement learning; see the textbook references

    [10, 13, 100, 90]. Given a controllable Markov chain and the distribution of state-

    to-state transition rewards, the aim is to find the optimal action to perform at each

    state in order to maximize the expected overall reward. MDP and its numerous vari-

    ants are widely applied in engineering systems, artificial intelligence, e-commerce and

    finance. Classical solvers of MDP require full knowledge of the underlying stochastic

    process and reward distribution, which are often not available in practice.

    In this chapter, we study both the infinite-horizon discounted-reward MDP and

    the finite-horizon MDP. In both cases, we assume that the MDP has a finite state

    space S and a finite action space A. We focus on the model-free learning setting

    where both transition probabilities and transitional rewards are unknown. Instead,

    a simulation oracle is available to generate random state-to-state transitions and

transitional rewards. The simulation oracle is able to model offline retrieval of static

    empirical data as well as live interaction with real-time simulation systems. The

    algorithmic goal is to estimate the optimal policy of the unknown MDP based on

    empirical state transitions, without any prior knowledge or restrictive assumption

    about the underlying process. In the literature of approximate dynamic programming

    and reinforcement learning, many methods have been developed, and some of them

    are proved to achieve near-optimal performance guarantees in certain senses; recent

examples include [85, 34, 67, 66, 109]. Although researchers have made significant

    progress in developing reinforcement learning methods, it remains unclear whether

    there is an approach that achieves both theoretical optimality and practical scalability.

    This is an active area of research.

    In this chapter, we present a novel approach motivated by the linear programming

    formulation of the nonlinear Bellman equation. We formulate the Bellman equation

    into a stochastic saddle point problem, where the optimal primal and dual solutions

    correspond to the optimal value and policy functions, respectively. We propose a class

    of Stochastic Primal-Dual algorithms (SPD) for the discounted MDP and the finite-

    horizon MDP. Each iteration of the algorithms updates the primal and dual solutions

    simultaneously using noisy partial derivatives of the Lagrangian function. We show

    that one can compute a noisy partial derivative efficiently from a single observation of

    the state transition. The SPD methods are stochastic analogs of the primal-dual iter-

ation for linear programming. They also involve alternating projections onto specially

constructed sets. The SPD methods are straightforward to implement and exhibit

favorable space complexity. To analyze their sample complexity, we adopt the notion

of “Probably Approximately Correct” (PAC), which means achieving an ε-optimal policy with high probability using a sample size polynomial in the problem parameters. The main contributions of this chapter are fourfold:

1. We study the basic linear algebra of reinforcement learning. We show that

    the optimal value and optimal policy are dual to each other, and they are

    the solutions to a stochastic saddle point problem. The value-policy duality

    implies a convenient algebraic structure that may facilitate efficient learning

    and dimension reduction.

    2. We develop a class of stochastic primal-dual (SPD) methods that maintain a

    value estimate and a policy estimate and update their coordinates while pro-

    cessing state-transition data incrementally. The SPD methods exhibit superior

    space and computational scalability. They require O(|S| × |A|) space for dis-

    counted MDP and O(|S| × |A| × H) space for finite-horizon MDP. The space

    complexity of SPD is sublinear to the input size of the MDP model. For dis-

    counted MDP, each iteration updates two coordinates of the value estimate and

    a single coordinate of the policy estimate. For finite-horizon MDP, each iter-

    ation updates 2H coordinates of the value estimate and H coordinates of the

    policy estimate.

    3. For discounted MDP, we develop the SPD-dMDP Algorithm 2.1. It yields

an ε-optimal policy with probability at least 1 − δ using the following sample size/iteration number:

O( (|S|⁴|A|²σ²) / ((1 − γ)⁶ε²) · ln(1/δ) ),

where γ ∈ (0, 1) is the discount factor, |S| and |A| are the sizes of the state space and the action space, and σ is a uniform upper bound on the state-transition rewards.

    We obtain the sample complexity results by analyzing the duality gap sequence

    and applying the Bernstein inequality to a specially constructed martingale.

    The analysis is novel to the authors’ best knowledge.

    4. For finite-horizon MDP, we develop the SPD-fMDP Algorithm 2.2. It yields

an ε-optimal policy with probability at least 1 − δ using the following sample size/iteration number:

O( (|S|⁴|A|²H⁶σ²) / ε² · ln(1/δ) ),

where H is the total number of periods. The key aspect of the finite-horizon algorithm is to adapt the learning rate/stepsize to the period being updated. In particular, the algorithm updates the policies associated with later periods more aggressively than those associated with earlier periods.

    The SPD is a model-free method and applies to a wide class of dynamic programming

    problems. Within the scope of this chapter, the sample transitions are drawn from a

    static distribution. We conjecture that the sample complexity results can be improved

    by allowing exploitation, i.e., adaptive sampling of actions. The results of this chapter

    suggest that the linear duality of MDP bears convenient structures yet to be fully

    exploited.

    Chapter Organization Section 2.2 reviews the basics of discounted and finite-

    horizon MDP and related works in this area. Section 2.3 studies the duality between

    optimal values and policies. Section 2.4 presents the SPD-dMDP and the SPD-fMDP

algorithms and discusses their implementation and complexities. Section 2.5 presents

    the main results and the proofs are deferred to the appendix.

Notations. All vectors are considered as column vectors. For a vector x ∈ R^n, we denote by x^T its transpose and by ‖x‖ = √(x^T x) its Euclidean norm. For a matrix A ∈ R^{n×n}, we denote by ‖A‖ = max{‖Ax‖ : ‖x‖ = 1} its induced Euclidean norm. For a set X ⊂ R^n and a vector y ∈ R^n, we denote by Π_X{y} = argmin_{x∈X} ‖y − x‖² the Euclidean projection of y onto X, where the minimum is always uniquely attained if X is nonempty, convex, and closed. We denote by e = (1, . . . , 1)^T the vector with all entries equal to 1, and by e_i = (0, . . . , 0, 1, 0, . . . , 0)^T the vector whose i-th entry equals 1 and all other entries equal 0. For a set X, we denote its cardinality by |X|.

    2.2 Preliminaries and Formulation of the Problem

    In this section, we review the basic models of Markov decision processes.

    2.2.1 Finite-State Discounted-Reward MDP

    We consider a discounted MDP described by a tuple M = (S,A,P , r, γ), where S

    is a finite state space, A is a finite action space, γ ∈ (0, 1) is a discount factor. If

    action a is selected while the system is in state i, the system transitions to state j

    with probability Pa(i, j) and incurs a random reward r̂ija ∈ [0, σ] with expectation

    rija.

Let π : S → A be a policy that maps a state i ∈ S to an action π(i) ∈ A. Consider the Markov chain under policy π. We denote its transition probability matrix by Pπ and its transitional reward vector by rπ, i.e.,

Pπ(i, j) = Pπ(i)(i, j),    rπ(i) = ∑_{j∈S} Pπ(i)(i, j) r_{ij π(i)},    i, j ∈ S.

The objective is to find an optimal policy π∗ : S → A such that the infinite-horizon discounted reward is maximized, regardless of the initial state:

max_{π:S→A}  E[ ∑_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} ],

    where γ ∈ (0, 1) is a discount factor, (i0, i1, . . .) are state transitions generated by the

    Markov chain under policy π, and the expectation is taken over the entire process.

    We assume throughout that there exists a unique optimal policy π∗ to the MDP tuple

    M = (S,A, P, r, γ). In other words, there exists one optimal action for each state.

We review the standard definitions of value functions.

Definition 2.2.1. The value vector vπ ∈ R^{|S|} of a fixed policy π is defined as

vπ(i) = Eπ[ ∑_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} | i_0 = i ],    i ∈ S.

Definition 2.2.2. The optimal value vector v∗ ∈ R^{|S|} is defined as

v∗(i) = max_{π:S→A} Eπ[ ∑_{k=0}^∞ γ^k r̂_{i_k i_{k+1} π(i_k)} | i_0 = i ],    i ∈ S.

For the sample complexity analysis of the proposed algorithm, we need a notion of sub-optimality of policies. We give its definition below.

Definition 2.2.3. We say that a policy π is absolute-ε-optimal if

max_{i∈S} |vπ(i) − v∗(i)| ≤ ε.

If a policy is absolute-ε-optimal, it achieves an ε-optimal reward regardless of the initial state distribution. We note that absolute-ε-optimality is one of the strongest

    notions of sub-optimality for policies. In comparison, some literature analyzes the

    expected sub-optimality of a policy when the state i follows a certain distribution.
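As a toy illustration of Definitions 2.2.1-2.2.3, the sketch below evaluates a fixed policy via vπ = (I − γPπ)^{-1} rπ and checks absolute-ε-optimality against a given v∗. The array layout is the same assumed toy layout as in the earlier sketch, not the thesis's notation.

    import numpy as np

    def policy_value(P, r, gamma, policy):
        # P[a, i, j], r[a, i]; policy[i] is the action chosen in state i.
        nS = P.shape[1]
        P_pi = P[policy, np.arange(nS), :]     # row i is P_{policy(i)}(i, .)
        r_pi = r[policy, np.arange(nS)]        # r_pi(i) = expected reward of (i, policy(i))
        return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

    def is_absolute_eps_optimal(v_pi, v_star, eps):
        # Definition 2.2.3: max_i |v_pi(i) - v*(i)| <= eps
        return np.max(np.abs(v_pi - v_star)) <= eps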

    2.2.2 Finite-State Finite-Horizon MDP

    We also consider the finite-horizon Markov decision process, which can be formu-

    lated as a tuple M = (S,A, H,P , r), where S is a finite state space with transition

probabilities encoded by P = (Pa)_{a∈A} ∈ R^{|S|×|S|×|A|}, A is a finite action space, and H

    is the horizon. If action a is selected while the system is in state i ∈ S at period

    h = 0, . . . , H − 1, the system transitions to state j and period h+ 1 with probability

    Pa(i, j) and incurs a random reward r̂ija with expectation rija. For the simplicity

of presentation, we assume that the reward we receive does not depend on the time

    period. Our algorithm can be readily extended to the case when the reward varies

    with the time period. We assume that both P and r are unknown but they can be

    estimated by sampling.

    We augment the state space with the time period to obtain an augmented Markov

    chain. Now we have a replica of S at each period h, denoted by Sh. Let S[H]

    be the state space of the augmented MDP where [H] denotes the set of integers

    {0, 1, . . . , H − 1}. If we select action a in state (i, h) ∈ S[H], the state transitions

    to a new state (j, h + 1) ∈ S[H] with probability Pa(i, j). The transition incurs a

    random reward r̂ija ∈ [0, σ] with expectation rija. At period H − 1, the state i will

    transition to the terminal state with reward r̂ija. In the rest of the chapter, we use

    Πa(i′, j′) to denote the transition probability of the augmented Markov chain where

    i′, j′ ∈ {(i, h)|i ∈ S, h ∈ [H]}.

    Let π = (π0, . . . , πH−1) be a sequence of one-step policies such that πh maps a

    state i ∈ S to an action πh(i) ∈ A in the h-th period. Consider the augmented

    Markov chain under policy π. We denote its transition probability matrix by Ππ,

    where Ππ ((i, h), (j, h+ 1)) = Pπh(i)(i, j) for all h ∈ [H − 1], i, j ∈ S, which is given

    by

Ππ =
    [ 0   Pπ0   0     . . .   0
      0   0     Pπ1   . . .   0
      .   .     .     . . .   .
      0   0     0     . . .   PπH−2
      0   0     0     . . .   0 ].

We denote by ra ∈ R^{|S|} the expected state-transition reward under action a, i.e.,

ra(i) = ∑_{j∈S} Pa(i, j) r_{ija},    i ∈ S.

    We review the standard definitions of value functions for finite-horizon MDP.

Definition 2.2.4. The h-period value function vπ_h : S_h → R under policy π is defined as

vπ_h(i) = Eπ[ ∑_{τ=h}^{H−1} r̂_{i_τ i_{τ+1} π_τ(i_τ)} | i_h = i ],    for all h ∈ [H], i ∈ S.

The random variables (i_h, i_{h+1}, . . .) are state transitions generated by the Markov chain under policy π starting from state i and period h, and the expectation Eπ[·] is taken over the entire process. We denote the overall value function by v = (v_0^T, . . . , v_{H−1}^T)^T ∈ R^{|S|H}.

The objective of the finite-horizon MDP is to find an optimal policy π∗ : S[H] → A such that the finite-horizon reward is maximized, regardless of the starting state. Based on the optimal policy, we define the optimal value function as follows.

Definition 2.2.5. The optimal value vector v∗ = (v∗_h)_{h∈[H]} ∈ R^{H×|S|} is defined as

v∗_h(i) = max_{π:S[H]→A} Eπ[ ∑_{τ=h}^{H−1} r̂_{i_τ i_{τ+1} π_τ(i_τ)} | i_h = i ] = Eπ∗[ ∑_{τ=h}^{H−1} r̂_{i_τ i_{τ+1} π∗_τ(i_τ)} | i_h = i ],

for all h ∈ [H], i ∈ S.

In order to analyze the sample complexity, we use the following notion of absolute-ε-optimality for the finite-horizon MDP.

Definition 2.2.6. We say that a policy π is absolute-ε-optimal if

max_{h∈[H], i∈S} |vπ_h(i) − v∗_h(i)| ≤ ε.

Note that if a policy is absolute-ε-optimal, it achieves an ε-optimal cumulative reward from all states and all intermediate periods. This is one of the strongest notions of sub-optimality for finite-horizon policies.

2.3 Value-Policy Duality of Markov Decision Processes

    In this section, we study the Bellman equation of the Markov decision process from

    the perspective of linear duality. We show that the optimal value and policy functions

    are dual to each other and they are the solutions to a special saddle point problem.

    We analyze the value-policy duality for the infinite-horizon discounted-reward case

    and the finite-horizon case separately.

    2.3.1 Finite-State Discounted-Reward MDP

Consider the discounted MDP described by the tuple M = (S, A, P, r, γ) as in Section 2.2.1. According to the theory of dynamic programming [10], a vector v∗ is the optimal value function of the MDP if and only if it satisfies the following system of |S| equations, known as the Bellman equation:

v∗(i) = max_{a∈A} { γ ∑_{j∈S} Pa(i, j) v∗(j) + ∑_{j∈S} Pa(i, j) r_{ija} },    i ∈ S.    (2.3.1)

When γ ∈ (0, 1), the Bellman equation has a unique fixed-point solution v∗, and it equals the optimal value function of the MDP. Moreover, a policy π∗ is an optimal policy for the MDP if and only if it attains the maximization in the Bellman equation. Note that this is a nonlinear system of fixed-point equations. Interestingly, the Bellman equation (2.3.1) is equivalent to the following linear programming (LP) problem with |S| variables and |S||A| constraints (see Puterman, 2014 [90], Section 6.9, and the paper by de Farias and Van Roy [36]):

minimize    ξ^T v
subject to  (I − γPa) v − ra ≥ 0,    a ∈ A,    (2.3.2)

where ξ is an arbitrary vector with positive entries, Pa ∈ R^{|S|×|S|} is the matrix whose (i, j)-th entry equals Pa(i, j), I is the |S| × |S| identity matrix, and ra ∈ R^{|S|} is the expected state-transition reward under action a, given by ra(i) = ∑_{j∈S} Pa(i, j) r_{ija}, i ∈ S. The dual linear program of (2.3.2) is

maximize    ∑_{a∈A} λ_a^T ra
subject to  ∑_{a∈A} (I − γPa^T) λ_a = ξ,    λ_a ≥ 0,    a ∈ A.    (2.3.3)

    We will show that the optimal solution λ∗ to the dual problem (2.3.3) corresponds

    to the optimal policy π∗ of the MDP. The duality between the optimal value vector

    and the optimal policy is established in Theorem 2.3.1. We remark that part of the

    results was known in the classical literature of MDP; see Puterman, 2014 [90] Section

    6.9. We provide a short proof for the completeness of the analysis.

Theorem 2.3.1 (Value-Policy Duality for Discounted MDP). Assume that the discounted-reward infinite-horizon MDP tuple M = (S, A, P, r, γ) has a unique optimal policy π∗. Then (v∗, λ∗) is the unique pair of primal and dual solutions to (2.3.2), (2.3.3) if and only if

v∗ = (I − γPπ∗)^{-1} rπ∗,    (λ∗_{π∗(i),i})_{i∈S} = (I − γ(Pπ∗)^T)^{-1} ξ,    λ∗_{a,i} = 0 if a ≠ π∗(i).

Proof. The proof is based on the fundamental property of linear programming, i.e., (v∗, λ∗) is the optimal pair of primal and dual solutions if and only if:

(a) v∗ is primal feasible, i.e., (I − γPa) v∗ − ra ≥ 0 for all a ∈ A.

(b) λ∗ is dual feasible, i.e., ∑_{a∈A} (I − γPa^T) λ∗_a = ξ and λ∗_a ≥ 0 for all a ∈ A.

(c) (v∗, λ∗) satisfies the complementary slackness condition

λ∗_{a,i} · (v∗_i − γP_{a,i} v∗ − r_{a,i}) = 0    for all i ∈ S, a ∈ A,

where λ∗_{a,i} is the i-th element of λ∗_a and P_{a,i} is the i-th row of Pa.

Suppose that (v∗, λ∗) is primal-dual optimal. As a result, it satisfies (a), (b), (c), and v∗ is the optimal value vector. By the definition of the optimal value function, we know that v∗_i − γP_{π∗(i),i} v∗ − r_{π∗(i),i} = 0. Since π∗ is unique, we have v∗_i − γP_{a,i} v∗ − r_{a,i} > 0 if a ≠ π∗(i). As a result, λ∗_{a,i} = 0 for all a ≠ π∗(i). This means that the optimal dual variable λ∗ has exactly |S| nonzeros, corresponding to the |S| active row constraints of the primal problem (2.3.2). We combine this observation with the dual feasibility relation ∑_{a∈A} (I − γPa^T) λ∗_a = ξ and obtain

(I − γ(Pπ∗)^T) (λ∗_{π∗(i),i})_{i∈S} = ξ.

Note that all eigenvalues of Pπ∗ lie in the unit ball, so (I − γ(Pπ∗)^T) is invertible. Then we have (λ∗_{π∗(i),i})_{i∈S} = (I − γ(Pπ∗)^T)^{-1} ξ, which together with the complementarity condition implies that λ∗ is unique. Similarly, we can show that v∗ = (I − γPπ∗)^{-1} rπ∗ from the primal feasibility and slackness conditions.

Now suppose that (v∗, λ∗) satisfies the three conditions stated in Theorem 2.3.1. Then (a), (b), (c) hold directly, which proves that (v∗, λ∗) is primal-dual optimal.

Theorem 2.3.1 suggests a critical correspondence between the optimal dual solution λ∗ and the optimal policy π∗. In particular, one can recover the optimal policy π∗ from the basis of λ∗ as follows:

π∗(i) = a    if λ∗_{a,i} > 0.

In other words, finding the optimal policy is equivalent to finding the basis of the optimal dual solution. This suggests that learning the optimal policy is a special case of solving a stochastic linear program.
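As an illustration of this correspondence, the sketch below solves the dual LP (2.3.3) for a tiny randomly generated MDP with scipy.optimize.linprog and reads the optimal policy off the support of λ∗. The toy instance and variable names are assumptions for illustration; they are not part of the thesis.

    import numpy as np
    from scipy.optimize import linprog

    nS, nA, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, i, :] = transition probabilities of (i, a)
    r = rng.uniform(0.0, 1.0, size=(nA, nS))        # r[a, i]   = expected reward of (i, a)
    xi = np.ones(nS)                                # any vector with positive entries

    # Dual LP (2.3.3): maximize sum_a lambda_a^T r_a
    #                  subject to sum_a (I - gamma P_a)^T lambda_a = xi, lambda >= 0.
    A_eq = np.hstack([(np.eye(nS) - gamma * P[a]).T for a in range(nA)])
    c = -np.concatenate([r[a] for a in range(nA)])  # linprog minimizes, so negate
    res = linprog(c, A_eq=A_eq, b_eq=xi, bounds=(0, None), method="highs")
    lam = res.x.reshape(nA, nS)

    policy = lam.argmax(axis=0)                     # pi*(i) = the action a with lambda*_{a,i} > 0
    print("recovered optimal policy:", policy)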

Saddle Point Formulation of Discounted MDP. We rewrite the LP program (2.3.2) into an equivalent minimax problem, given by

min_{v∈R^{|S|}} max_{λ≥0}  L(v, λ) = ξ^T v + ∑_{a∈A} λ_a^T ((γPa − I)v + ra).    (2.3.4)

The primal variable v is of dimension |S|, and the dual variable λ = (λ_a)_{a∈A} = (λ_{a,i})_{a∈A, i∈S} is of dimension |S| · |A|. Each subvector λ_a ∈ R^{|S|} is the vector multiplier corresponding to the constraint inequalities (I − γPa) v − ra ≥ 0. Each entry λ_{a,i} > 0 is the scalar multiplier associated with the i-th row of (I − γPa) v − ra ≥ 0.

In order to develop an efficient algorithm, we modify the saddle point problem as follows:

min_{v∈R^{|S|}} max_{λ∈R^{|S|·|A|}}  L(v, λ) = ξ^T v + ∑_{a∈A} λ_a^T ((γPa − I)v + ra),
subject to  v ∈ V,  λ ∈ Ξ ∩ ∆,    (2.3.5)

where

V = { v | v ≥ 0, ‖v‖_∞ ≤ σ/(1 − γ) },
Ξ = { λ | ∑_{a∈A} λ_{a,i} ≥ ξ_i, ∀ i ∈ S },
∆ = { λ | λ ≥ 0, ‖λ‖_1 = ‖ξ‖_1/(1 − γ) }.    (2.3.6)

We will show later that v∗ and λ∗ belong to V and Ξ ∩ ∆, respectively (Lemma A.1.1). As a result, the modified saddle point problem (2.3.5) is equivalent to the original problem (2.3.4).
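For a model-based sanity check, the small helper below (a hypothetical function, not from the thesis) evaluates the Lagrangian in (2.3.4) and its exact partial derivatives. The SPD methods of Section 2.4 replace these exact derivatives, which require full knowledge of P and r, with single-transition stochastic estimates.

    import numpy as np

    def lagrangian_and_grads(v, lam, P, r, xi, gamma):
        # L(v, lam) = xi^T v + sum_a lam_a^T ((gamma P_a - I) v + r_a); lam has shape (|A|, |S|).
        nA, nS = r.shape
        I = np.eye(nS)
        L = xi @ v + sum(lam[a] @ ((gamma * P[a] - I) @ v + r[a]) for a in range(nA))
        grad_v = xi + sum((gamma * P[a] - I).T @ lam[a] for a in range(nA))
        grad_lam = np.stack([(gamma * P[a] - I) @ v + r[a] for a in range(nA)])
        return L, grad_v, grad_lam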

    2.3.2 Finite-State Finite-Horizon MDP

Consider the finite-horizon Markov decision process described by a tuple M = (S, A, H, P, r) as in Section 2.2.2. The Bellman equation of the finite-horizon MDP is given by

v∗_h(i) = max_{a∈A} { ra(i) + ∑_{j∈S} Pa(i, j) v∗_{h+1}(j) },    ∀ h ∈ [H], i ∈ S,
v∗_H = 0,    (2.3.7)

where Pa is the transition probability matrix under a fixed action a. The vector form of the Bellman equation is

v∗_h = max_{a∈A} { ra + Pa v∗_{h+1} },    ∀ h ∈ [H],
v∗_H = 0,

where the maximization is carried out component-wise. A vector v∗ satisfies the Bellman equation if and only if it is the optimal value function.

The Bellman equation is equivalent to the following linear program:

minimize    ∑_{h=0}^{H−1} ξ_h^T v_h
subject to  v_h ≥ Pa v_{h+1} + ra,    a ∈ A, h ∈ [H],
            v_H = 0,    (2.3.8)

where ξ = (ξ_0, . . . , ξ_{H−1}) is an arbitrary vector with positive entries and v = (v_0^T, . . . , v_{H−1}^T)^T is the primal variable. The above linear program has |S||A|H constraints and |S|H variables. The dual linear program of (2.3.8) is given by

maximize    ∑_{h=0}^{H−1} ∑_{a∈A} λ_{h,a}^T ra
subject to  ∑_{a∈A} ( λ_{h,a} − Pa^T λ_{h−1,a} ) = ξ_h,    h ∈ [H],
            λ_{−1,a} = 0,  λ_{h,a} ≥ 0,    h ∈ [H], a ∈ A,    (2.3.9)

    where the dual variable λ = (λh,a)h∈[H],a∈A is a vector of dimension |S||A|H. In the

    remainder of the chapter, we use the notation λh to denote the vector λh = (λh,a)a∈A

    and λa to denote the vector λa = (λh,a)h∈[H]. We denote by λ∗ the optimal dual

    solution. Now we establish the value-policy duality for finite-horizon MDP.

    Theorem 2.3.2 (Value-Policy Duality for Finite-Horizon MDP). Assume that the

    finite-horizon MDP tuple M = (S,A, H,P , r) has a unique optimal policy π∗ =

    (π∗h)h∈[H]. The vector pair (v∗, λ∗) is the unique pair of primal and dual solutions to

    (2.3.8), (2.3.9) if and only if:

v∗ = (I − Ππ∗)^{-1} (R_{h,π∗})_{h∈[H]},    (λ∗_{h,π∗_h(i)}(i))_{h∈[H], i∈S} = (I − Ππ∗^T)^{-1} ξ,    λ∗_{h,a}(i) = 0 if a ≠ π∗_h(i),

where Ππ∗ is the transition probability matrix under the optimal policy π∗ and R_{h,π∗} is the expected transitional reward under that policy, i.e., R_{h,π∗}(i) = r_{π∗_h(i)}(i) for all i ∈ S and h ∈ [H].

Proof. The proof is based on the fundamental property of linear duality, i.e., (v∗, λ∗) is the optimal pair of primal and dual solutions if and only if:

(a) v∗ is primal feasible, i.e., v∗_h − Pa v∗_{h+1} − ra ≥ 0 for all h ∈ [H], a ∈ A.

(b) λ∗ is dual feasible, i.e., ∑_{a∈A} ( λ∗_{h,a} − Pa^T λ∗_{h−1,a} ) = ξ_h and λ∗_{h,a} ≥ 0 for all h ∈ [H], a ∈ A.

(c) (v∗, λ∗) satisfies the complementary slackness condition

λ∗_{h,a}(i) · ( v∗_h(i) − ra(i) − P_{a,i} v∗_{h+1} ) = 0    for all h ∈ [H], i ∈ S, a ∈ A,

where λ_{h,a}(i) is the i-th element of λ_{h,a} and P_{a,i} is the i-th row of Pa.

Suppose that (v∗, λ∗) is primal-dual optimal. As a result, it satisfies (a), (b), (c), and v∗ is the optimal value vector. By the definition of the optimal value function, we know that v∗_h(i) − P_{π∗_h(i),i} v∗_{h+1} − r_{π∗_h(i)}(i) = 0. Since π∗ is unique, we have v∗_h(i) − P_{a,i} v∗_{h+1} − ra(i) > 0 if a ≠ π∗_h(i). As a result, we have λ∗_{h,a}(i) = 0 for all a ≠ π∗_h(i). Together with the dual feasibility relation ∑_{a∈A} ( λ∗_{h,a} − Pa^T λ∗_{h−1,a} ) = ξ_h, we have

(I − Ππ∗^T) (λ∗_{h,π∗_h(i)}(i))_{h∈[H], i∈S} = ξ,

where Ππ∗ is defined in Section 2.2.2. Note that (I − Ππ∗^T) is invertible since its determinant equals one. Then we have (λ∗_{h,π∗_h(i)}(i))_{h∈[H], i∈S} = (I − Ππ∗^T)^{-1} ξ, which together with the complementarity condition implies that λ∗ is unique. Similarly, we can show that v∗ = (I − Ππ∗)^{-1} (R_{h,π∗})_{h∈[H]} from the primal feasibility and slackness conditions.

Now suppose that (v∗, λ∗) satisfies the three conditions stated in Theorem 2.3.2. Then (a), (b), (c) hold directly, which proves that (v∗, λ∗) is primal-dual optimal.

Saddle Point Formulation of Finite-Horizon MDP. According to Lagrangian duality, we can rewrite the linear program (2.3.8) as the following saddle point problem:

min_{v∈V} max_{λ∈Ξ∩∆}  L(v, λ) = ∑_{h=0}^{H−1} ξ_h^T v_h + ∑_{h=0}^{H−1} ∑_{a∈A} λ_{h,a}^T ( ra + Pa v_{h+1} − v_h ),    (2.3.10)

where

V = { v | v_h ≥ 0, ‖v_h‖_∞ ≤ (H − h)σ, ∀ h ∈ [H] },
Ξ = { λ | ∑_{a∈A} λ_{h,a} ≥ ξ_h, ∀ h ∈ [H] },
∆ = { λ | λ ≥ 0, ‖λ_h‖_1 = ∑_{τ=0}^{h} ‖ξ_τ‖_1, ∀ h ∈ [H] }.    (2.3.11)

We will prove later that the optimal primal solution v∗ and the optimal dual solution λ∗ satisfy the additional constraints V and Ξ ∩ ∆, respectively (Lemma A.2.1). As a result, (v∗, λ∗) is a pair of primal and dual solutions to the saddle point problem (2.3.10).

We now state the matrix form of the saddle point problem. Let I be the identity matrix of dimension |S|H and let Πa ∈ R^{|S|H×|S|H} be the augmented transition probability matrix of the augmented Markov chain, taking the form

Πa =
    [ 0   Pa   0    . . .   0
      0   0    Pa   . . .   0
      .   .    .    . . .   .
      0   0    0    . . .   Pa
      0   0    0    . . .   0 ].

Then the Lagrangian L(v, λ) is equivalent to

L(v, λ) = ξ^T v + ∑_{a∈A} λ_a^T ( Ra + (Πa − I)v ),    (2.3.12)

where Ra = (ra^T, . . . , ra^T)^T and λ_a = (λ_{h,a})_{h∈[H]}. The partial derivatives of the Lagrangian are

∇_v L(v, λ) = ξ + ∑_{a∈A} (Πa^T − I) λ_a,    ∇_{λa} L(v, λ) = Ra + (Πa − I) v.

In what follows, we will utilize the linear structure of the partial derivatives to develop stochastic primal-dual algorithms.

2.4 Stochastic Primal-Dual Methods for Markov Decision Problems

    We are interested in developing algorithms that not only apply to explicitly given

    MDP models but also apply to reinforcement learning. In particular, we focus on the

model-free learning setting, which is summarized below.

Model-Free Learning Setting of MDP

(i) The state space S, the action space A, the reward upper bound σ, and the discount factor γ (or the horizon H) are known.

(ii) The transition probabilities P and reward functions r are unknown.

(iii) There is a Sampling Oracle (SO) that takes input (i, a) and generates a new state j with probability Pa(i, j) and a random reward r̂_{ija} ∈ [0, σ] with expectation r_{ija}. (A minimal sketch of such an oracle is given below.)
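The following is a minimal sketch of such a sampling oracle (an illustrative stand-in, not code from the thesis): the true P and r live inside the oracle, and a learner may only call sample(i, a) to observe a next state and a reward in [0, σ].

    import numpy as np

    class SamplingOracle:
        def __init__(self, P, r, sigma, seed=0):
            # P[a, i, j] and r[a, i] (with r <= sigma) are hidden from the learner.
            self.P, self.r, self.sigma = P, r, sigma
            self.rng = np.random.default_rng(seed)

        def sample(self, i, a):
            j = self.rng.choice(self.P.shape[2], p=self.P[a, i])
            # any reward distribution supported on [0, sigma] with mean r[a, i] is allowed;
            # here we use a Bernoulli draw scaled by sigma.
            reward = self.sigma * self.rng.binomial(1, self.r[a, i] / self.sigma)
            return j, reward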

    Motivated by the value-policy duality (Theorems 2.3.1, 2.3.2), we develop a class of

stochastic primal-dual methods for the saddle point formulation of the Bellman equation.

    In particular, we develop the Stochastic Primal-Dual method for Discounted Markov

    Decision Process (SPD-dMDP), as given in Algorithm 2.1.

    The algorithm keeps an estimate of the value function v and the dual iterates λ

    in the minimax problem (2.3.4) and makes updates to them while sampling random

    states and actions. At the k-th iteration, the algorithm samples a random state i and

    a random action a. Then the algorithm uses the sampled state and action to compute

    noisy gradients of the minimax problem (2.3.4) with respect to the value function v

    and the dual iterates λ. After updating the value function and the dual iterates using

    the gradients, the algorithm projects v and λ back to the corresponding constraints

V and Ξ ∩ ∆ in the optimization problem (2.3.5). Essentially, each iteration is a

    stochastic primal-dual iteration for solving the saddle point problem.

    We also develop the Stochastic Primal-Dual method for Finite-horizon Markov

    Decision Process (SPD-fMDP), as given by Algorithm 2.2. The SPD algorithms

    maintain a running estimate of the optimal value function (i.e., the primal solution)

    and the optimal policy (i.e., the dual solution). They make simple updates to the

    value and policy estimates as new state and reward observations are drawn from the

    sampling oracle.

    Implementation and Computational Complexities The proposed SPD algo-

    rithms are essentially stochastic coordinate descent methods. They exhibit favorable

    properties such as small space complexity and small computational complexity per

    iteration.

The SPD-dMDP Algorithm 2.1 updates the value and policy estimates by pro-

    cessing one sample state-transition at a time. It keeps track of the dual variable λ and

    primal variable v, which utilizes |A|× |S|+ |S| = O(|A|× |S|) space. In comparison,

    the dimension of the discounted MDP is |A|× |S|2. Thus the space complexity of Al-

    gorithm 2.1 is sublinear with respect to the problem size. Moreover, the SPD-dMDP

    Algorithm 2.1 is a coordinate descent method. It updates two coordinates of the

    value estimation and one coordinate of the policy estimate per each iteration. After

    the dual update, the algorithm first projects the dual iterates onto the simplex ∆ and

    then projects the dual iterates onto the box constraint Ξ. The projection onto the

simplex can be implemented using O(|S| × |A|) arithmetic operations [25]. The projection onto Ξ under

    the simplex constraint can be implemented using O(|S| × |A|) arithmetic operations

as well.¹ The overall computational complexity per iteration is O(|S| × |A|) arithmetic operations. Therefore, SPD Algorithm 2.1 uses sublinear space complexity and sublinear computational complexity per iteration.

¹ Consider the projection of x ∈ ∆ onto Ξ ∩ ∆. We can write the problem as min_{y∈∆} ∑_a ∑_i (y_{a,i} − x_{a,i})², subject to ∑_a y_{a,j} ≥ c for all j, where c is a constant. The derivative of the objective with respect to y_{a,i} is 2(y_{a,i} − x_{a,i}). Denote by y∗ the optimal solution. For two variables y∗_{a1,i1} and y∗_{a2,i2}, the derivatives of the objective with respect to them, 2(y∗_{a1,i1} − x_{a1,i1}) and 2(y∗_{a2,i2} − x_{a2,i2}), must be equal if ∑_a y∗_{a,i1} > c and ∑_a y∗_{a,i2} > c; otherwise, we could move the values of y∗_{a1,i1} and y∗_{a2,i2} to get a smaller objective value. Thus we can compute y∗ explicitly as follows. For each i such that ∑_a x_{a,i} < c, we set y∗_{·,i} so that ∑_a y∗_{a,i} = c. For the remaining variables y∗_{a,i}, we set y∗_{a,i} to x_{a,i} plus a constant shift so that ∑_a ∑_i y∗_{a,i} is 1. Note that this projection also works for other distance metrics such as the KL divergence ∑_{a,i} y_{a,i} log(y_{a,i}/x_{a,i}).
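The explicit construction in the footnote can be sketched in code as follows. This is a rough, hypothetical rendering of the footnote's procedure (lift deficient columns up to c, then apply a uniform shift to the remaining entries to restore the total mass); it ignores corner cases and is not the thesis's implementation.

    import numpy as np

    def project_onto_xi_within_delta(x, c, total):
        # x[a, i] lies on the scaled simplex with mass 'total'; enforce sum_a y[a, i] >= c for all i.
        y = x.copy()
        nA = y.shape[0]
        low = y.sum(axis=0) < c                          # columns violating the constraint
        y[:, low] += (c - y[:, low].sum(axis=0)) / nA    # lift each deficient column up to c
        excess = y.sum() - total
        if (~low).any():
            y[:, ~low] -= excess / ((~low).sum() * nA)   # uniform shift to restore the total mass
        return y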

Algorithm 2.1 Stochastic Primal-Dual Algorithm for Discounted MDP (SPD-dMDP)

Input: Sampling Oracle SO, n = |S|, m = |A|, γ ∈ (0, 1), σ ∈ (0, ∞).
Initialize v^{(0)} : S → [0, σ/(1 − γ)] and λ^{(0)} : S × A → [0, ‖ξ‖_1 σ/(1 − γ)] arbitrarily.
Set ξ = (σ/√n) e.
for k = 1, 2, . . . , T do
    Sample i uniformly from S.
    Sample a uniformly from A.
    Sample j and r̂_{ija} conditioned on (i, a) from SO.
    Set β = √n/k.
    Update the primal iterates:
        v^{(k)}(i) ← max{ min{ v^{(k−1)}(i) − β( ξ(i)/m − λ^{(k−1)}_a(i) ), σ/(1 − γ) }, 0 },
        v^{(k)}(j) ← max{ min{ v^{(k−1)}(j) − γβ λ^{(k−1)}_a(i), σ/(1 − γ) }, 0 },
        v^{(k)}(s) ← v^{(k−1)}(s) for all s ≠ i, j.
    Update the dual iterates:
        λ^{(k−1/2)}_a(i) ← λ^{(k−1)}_a(i) + β( γ v^{(k−1)}(j) − v^{(k−1)}(i) + r̂_{ija} ),
        λ^{(k−1/2)}_{a′}(i′) ← λ^{(k−1)}_{a′}(i′) for all (a′, i′) with a′ ≠ a or i′ ≠ i.
    Project the dual iterates:
        λ^{(k)} ← Π_Ξ Π_∆ λ^{(k−1/2)}, where Ξ and ∆ are given by (2.3.6).
end for
Output: Averaged dual iterate λ̂ = (1/T) ∑_{k=1}^{T} λ^{(k)} and randomized policy π̂ with P(π̂(i) = a) = λ̂_a(i) / ∑_{a∈A} λ̂_a(i).
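For concreteness, here is a simplified Python sketch of the loop in Algorithm 2.1, built on the SamplingOracle sketched earlier in this section. The stepsize schedule and the crude renormalization used in place of the exact projections Π_Ξ Π_∆ are assumptions for illustration; they do not reproduce the algorithm's guarantees.

    import numpy as np

    def spd_dmdp_sketch(oracle, nS, nA, gamma, sigma, T, seed=0):
        rng = np.random.default_rng(seed)
        xi = (sigma / np.sqrt(nS)) * np.ones(nS)
        total = xi.sum() / (1.0 - gamma)                 # required total dual mass on Delta
        v = np.zeros(nS)                                 # primal iterate, kept in [0, sigma/(1-gamma)]
        lam = np.full((nA, nS), total / (nA * nS))       # dual iterate, started on Delta
        lam_sum = np.zeros((nA, nS))
        v_max = sigma / (1.0 - gamma)
        for k in range(1, T + 1):
            i, a = int(rng.integers(nS)), int(rng.integers(nA))
            j, reward = oracle.sample(i, a)
            beta = np.sqrt(nS / k)                       # stepsize ~ 1/sqrt(k); exact schedule assumed
            v_i, v_j = v[i], v[j]
            # primal coordinate updates (corner case i == j ignored for brevity)
            v[i] = np.clip(v_i - beta * (xi[i] / nA - lam[a, i]), 0.0, v_max)
            v[j] = np.clip(v_j - gamma * beta * lam[a, i], 0.0, v_max)
            # dual coordinate update with a noisy temporal difference
            lam[a, i] += beta * (gamma * v_j - v_i + reward)
            # crude surrogate for the projection onto Xi and Delta
            lam = np.maximum(lam, 0.0)
            lam *= total / max(lam.sum(), 1e-12)
            lam_sum += lam
        lam_hat = lam_sum / T
        return lam_hat / lam_hat.sum(axis=0, keepdims=True)   # randomized policy P(pi_hat(i) = a)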

Algorithm 2.2 has a similar spirit to Algorithm 2.1. It keeps track of the dual variables (λ_{h,a})_{h∈[H], a∈A} (randomized policies for all periods) and the primal variables v_0, . . . , v_{H−1} (value functions for all periods). Algorithm 2.2 is specially designed for the H-period MDP in two respects. It uses a non-uniform weight vector with ξ_0 = e/H and ξ_h = e/((H − h)(H − h + 1)) for h ≠ 0, which places more weight on later periods to balance their smaller values. It uses larger stepsizes to update policies associated with later periods more aggressively, while using smaller stepsizes to update earlier-period policies more conservatively. The space complexity is O(|S| × |A| × H), which is sublinear with respect to the problem dimension O(|S|² × |A| × H). Algorithm 2.2 is essentially a coordinate descent method that involves projections onto simple sets. The computational complexity per iteration is O(|S| × |A| × H), which is mainly due to the projection step.

Algorithm 2.2 Stochastic Primal-Dual Algorithm for Finite-Horizon MDP (SPD-fMDP)

Input: Sampling Oracle SO, n = |S|, m = |A|, H, σ ∈ (0, ∞).
Initialize v_h : S → [0, (H − h)σ] and λ_h : S × A → [0, n/(H − h)] for all h ∈ [H] arbitrarily.
Set ξ_0 = e/H and ξ_h = e/((H − h)(H − h + 1)) for all h ≠ 0.
for k = 1, 2, . . . , T do
    Sample i uniformly from S.
    Sample a uniformly from A.
    Sample j and r̂_{ija} conditioned on (i, a) from SO.
    Update the primal iterates, for all h ∈ [H]:
        v^{(k)}_h(i) ← max{ min{ v^{(k−1)}_h(i) − ((H − h)²σ/√k)( ξ_h(i) − m λ^{(k−1)}_{h,a}(i) ), (H − h)σ }, 0 },
        v^{(k)}_h(j) ← max{ min{ v^{(k−1)}_h(j) − (m(H − h)²σ/√k) λ^{(k−1)}_{h−1,a}(i), (H − h)σ }, 0 },
        v^{(k)}_h(s) ← v^{(k−1)}_h(s) for all s ≠ i, j.
    Update the dual iterates, for all h ∈ [H]:
        λ^{(k−1/2)}_{h,a}(i) ← λ^{(k−1)}_{h,a}(i) + ( n/((H − h)²σ√k) )( v^{(k−1)}_{h+1}(j) − v^{(k−1)}_h(i) + r̂_{ija} ),
        λ^{(k−1/2)}_{h,a′}(i′) ← λ^{(k−1)}_{h,a′}(i′) for all (a′, i′) with a′ ≠ a or i′ ≠ i.
    Project the dual iterates:
        λ^{(k)} ← Π_Ξ Π_∆ λ^{(k−1/2)}, where Ξ and ∆ are given by (2.3.11).
end for
Output: Averaged dual iterate λ̂ = (1/T) ∑_{k=1}^{T} λ^{(k)} and randomized policy π̂ = (π̂_0, . . . , π̂_{H−1}) with P(π̂_h(i) = a) = λ̂_{h,a}(i) / ∑_{a∈A} λ̂_{h,a}(i).
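The period dependence of Algorithm 2.2 can be seen in a few lines (a toy illustration with assumed argument shapes, not thesis code): the dual (policy) stepsize grows for later periods, which is what makes their policies be updated more aggressively.

    import numpy as np

    def fmdp_schedules(H, n, sigma, k):
        h = np.arange(H)
        # xi_0 = e/H and xi_h = e/((H - h)(H - h + 1)) for h != 0 (per-entry weight of xi_h)
        xi_scale = np.where(h == 0, 1.0 / H, 1.0 / ((H - h) * (H - h + 1)))
        primal_step = (H - h) ** 2 * sigma / np.sqrt(k)              # larger for earlier periods
        dual_step = n / ((H - h) ** 2 * sigma * np.sqrt(k))          # larger for later periods
        return xi_scale, primal_step, dual_step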

    Comparisons with Existing Methods Our SPD algorithms are fundamentally

    different from the existing reinforcement learning methods. From a theoretical per-

    spective, our SPD algorithms are based on the stochastic saddle point formulation of

    the Bellman equation. To the authors’ best knowledge, this idea has not been used

    in any existing method. From a practical perspective, the SPD methods are easy to

    implement and have small space and computational complexity (one of the smallest

    compared to existing methods). In what follows, we compare the newly proposed

SPD methods with several popular existing methods.

• The new SPD methods are similar in spirit to the Q-learning and delayed

    Q-learning methods. Both of them maintain and update a value for each state-

    action pair (i, a). Delayed Q-learning maintains estimates of the value function

    at each state-action pair (i, a), i.e., the Q values. Our SPD maintains estimates

    of probabilities for choosing each (i, a), i.e., the randomized policy. In both

    cases, the values associated with state-action pairs are used to determine how

    to choose the actions. Delayed Q-learning uses O(|S||A|) space and O(ln |A|)


Algorithm 2.2 Stochastic Primal-Dual Algorithm for Finite-horizon MDP (SPD-fMDP)

Input: Sampling Oracle $\mathcal{SO}$, $n = |S|$, $m = |A|$, $H$, $\sigma \in (0,\infty)$
Initialize $v_h : S \mapsto [0, (H-h)\sigma]$ and $\lambda_h : S \times A \mapsto \big[0, \tfrac{n}{H-h}\big]$ for all $h \in [H]$ arbitrarily
Set $\xi_0 = \tfrac{e}{H}$ and $\xi_h = \tfrac{e}{(H-h)(H-h+1)}$ for all $h \neq 0$
for $k = 1, 2, \dots, T$ do
    Sample $i$ uniformly from $S$
    Sample $a$ uniformly from $A$
    Sample $j$ and $\hat{r}_{ija}$ conditioned on $(i,a)$ from $\mathcal{SO}$
    Update the primal iterates by
        $v_h^{(k)}(i) \leftarrow \max\big\{\min\big\{v_h^{(k-1)}(i) - \tfrac{(H-h)^2\sigma}{\sqrt{k}}\big(\xi_h(i) - m\lambda^{(k-1)}_{h,a}(i)\big),\ (H-h)\sigma\big\},\ 0\big\}$,
        $v_h^{(k)}(j) \leftarrow \max\big\{\min\big\{v_h^{(k-1)}(j) - \tfrac{m(H-h)^2\sigma}{\sqrt{k}}\,\lambda^{(k-1)}_{h-1,a}(i),\ (H-h)\sigma\big\},\ 0\big\}$,
        $v_h^{(k)}(s) \leftarrow v_h^{(k-1)}(s)$ for all $h \in [H]$, $s \neq i, j$
    Update the dual iterates by
        $\lambda^{(k-\frac{1}{2})}_{h,a}(i) \leftarrow \lambda^{(k-1)}_{h,a}(i) + \tfrac{n}{(H-h)^2\sigma\sqrt{k}}\big(v^{(k-1)}_{h+1}(j) - v^{(k-1)}_h(i) + \hat{r}_{ija}\big)$ for all $h \in [H]$
        $\lambda^{(k-\frac{1}{2})}_{h,a'}(i') \leftarrow \lambda^{(k-1)}_{h,a'}(i')$ for all $h \in [H]$ and all $(a', i')$ such that $a' \neq a$ or $i' \neq i$
    Project the dual iterates by
        $\lambda^{(k)} \leftarrow \Pi_\Xi \Pi_\Delta \lambda^{(k-\frac{1}{2})}$, where $\Xi$ and $\Delta$ are given by (2.3.11)
end for
Output: Averaged dual iterate $\hat{\lambda} = \frac{1}{T}\sum_{k=1}^{T} \lambda^{(k)}$ and randomized policy $\hat{\pi} = (\hat{\pi}_0, \dots, \hat{\pi}_{H-1})$, where $\mathbb{P}(\hat{\pi}_h(i) = a) = \frac{\hat{\lambda}_{h,a}(i)}{\sum_{a \in A}\hat{\lambda}_{h,a}(i)}$

arithmetic operations per iteration [99]. Our SPD methods enjoy computational advantages similar to those of the delayed Q-learning method.

    • The SPD methods are also related to the class of actor-critic methods. Our

    dual variable mimics the actor, while the primal variable mimics the critic.

    In particular, the dual update step in SPD turns out to be very similar to

    the actor update: both updates use a noisy temporal difference. Actor-critic

    methods are two-timescale methods in which the actor updates on a faster scale

in comparison to the critic. In contrast, the new SPD methods have only one

    timescale: the primal and dual variables are updated using a single sequence of

    stepsizes. As a result, SPD methods are more efficient in utilizing new data as

they emerge and achieve an $O(1/\sqrt{T})$ rate of convergence.

    • Upper confidence methods maintain and update a value or interval for each

state-action-state triplet; see the works by Lattimore and Hutter [66] and Dann and Brunskill [34]. These methods use space up to $O(|S|^2|A|)$, which is linear

    with respect to the size of the MDP model. In contrast, SPD does not estimate

    transition probabilities of the unknown MDP and uses only O(|S||A|) space.

    To sum up, the main advantage of the SPD methods is the small storage and small

    computational complexity per iteration. We note that the main computational com-

    plexity of SPD is due to the projection of dual variables.

    2.5 Main Results

    In this section, we study the convergence of the two SPD methods: the SPD-dMDP

    Algorithm 2.1 and the SPD-fMDP Algorithm 2.2. Our main results show that the

SPD methods output a randomized policy that is absolute-$\epsilon$-optimal using finite sam-

    ples with high probability. We analyze the cases of discounted MDP and finite-horizon

    MDP separately. All the proofs of the theorems are deferred to the appendix.

    2.5.1 Sample Complexity Analysis of Discounted-Reward

    MDPs

    We analyze the SPD-dMDP Algorithm 2.1 as a stochastic analog of a deterministic

    primal-dual iteration. We show that the duality gap associated with the primal and

    dual iterates decreases to zero with the following guarantee.

Theorem 2.5.1 (PAC Duality Gap). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\lambda} = (\hat{\lambda}_a)_{a\in A} \in \mathbb{R}^{|S\times A|}$ be the averaged dual iterates generated by the SPD-dMDP Algorithm 2.1 using the following sample size/iteration number
\[
O\!\left(\frac{|S|^3|A|^2\sigma^4}{(1-\gamma)^4\epsilon^2}\,\ln\frac{1}{\delta}\right).
\]
Then the dual iterate $\hat{\lambda}$ satisfies
\[
\sum_{a\in A}(\hat{\lambda}_a)^{\mathsf T}\big(v^* - \gamma P_a v^* - r_a\big) \;\le\; \epsilon
\]
with probability at least $1-\delta$.

Proof Outline. The SPD-dMDP Algorithm 2.1 can be viewed as a stochastic approximation scheme for the saddle point problem (2.3.5). Upon drawing a triplet $(i_k, a_k, j_k)$, we obtain noisy samples of the partial derivatives $\nabla_v L(v^{(k)},\lambda^{(k)})$ and $\nabla_\lambda L(v^{(k)},\lambda^{(k)})$. The SPD-dMDP Algorithm 2.1 is equivalent to
\[
v^{k+1} = \Pi_V\big[v^k - \beta_k\big(\nabla_v L(v^{(k)},\lambda^{(k)}) + \epsilon_k\big)\big], \qquad
\lambda^{k+1} = \Pi_\Xi\Pi_\Delta\big[\lambda^k + \beta_k\big(\nabla_\lambda L(v^{(k)},\lambda^{(k)}) + \varepsilon_k\big)\big],
\]
where $V$, $\Xi$ and $\Delta$ are specially constructed sets, $\epsilon_k, \varepsilon_k$ are zero-mean noise vectors, and $\beta_k$ is the stepsize. By analyzing the distance between $(v^k,\lambda^k)$ and $(v^*,\lambda^*)$, we obtain that, per iteration, the squared distance decreases by an amount involving the duality gap. We then construct a martingale based on the sequence of duality gaps and apply Bernstein's inequality. The formal proof is deferred to the appendix.

    Note that the dual variable is always nonnegative λ̂ ≥ 0 by the projection on the

    simplex ∆. Also note that the nonnegative vector v∗ − (ra + γPav∗) ≥ 0 is a vector

    of primal constraint tolerances attained by the primal optimal solution v∗. Theorem

2.5.1 essentially gives an error bound for entries of λ̂ corresponding to inactive primal

    row constraints.

    Now we are ready to present the sample complexity of SPD for discounted MDP.

    Theorem 2.5.2 shows that the averaged dual iterate λ̂ gives a randomized policy that

    approximates the optimal policy π∗. The performance of the randomized policy can

    be analyzed using the diminishing duality gap from Theorem 2.5.1.

Theorem 2.5.2 (PAC Sample Complexity). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-dMDP Algorithm 2.1 iterate with the following sample size/iteration number
\[
O\!\left(\frac{|S|^4|A|^2\sigma^2}{(1-\gamma)^6\epsilon^2}\,\ln\frac{1}{\delta}\right);
\]
then the output policy $\hat{\pi}$ is absolute-$\epsilon$-optimal with probability at least $1-\delta$.

    Next, we consider how to recover the optimal policy π∗ from the dual iterates λ̂

    generated by the SPD-dMDP Algorithm 2.1. Note that the policy space is discrete,

    which makes it possible to distinguish the optimal one from others when the estimated

    policy π̂ is close enough to the optimal one.

Definition 2.5.1. Let the minimal action discrimination constant $\bar{d}$ be the minimal efficiency loss of deviating from the optimal policy $\pi^*$ by making a single wrong action. It is given by
\[
\bar{d} = \min_{(i,a)\,:\,\pi^*(i)\neq a}\ \big(v^*(i) - \gamma P_{a,i}\,v^* - r_a(i)\big).
\]

    As long as there exists a unique optimal policy π∗, we have d̄ > 0. A large value

    of d̄ means that it is easy to discriminate optimal actions from suboptimal actions.

    A small value of d̄ means that some suboptimal actions perform similarly to optimal

    actions.
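When the model quantities $v^*$, $P$, $r$, and $\pi^*$ are available (for instance, in a synthetic experiment used to validate the recovery guarantee below), $\bar{d}$ can be computed directly. A minimal Python sketch for a tabular MDP, assuming P of shape (n, m, n), r of shape (n, m), and pi_star of shape (n,); the names are illustrative.

import numpy as np

def min_action_discrimination(v_star, P, r, pi_star, gamma):
    # d_bar = min over (i, a) with a != pi_star[i] of v_star[i] - gamma * P[i, a] @ v_star - r[i, a].
    n, m, _ = P.shape
    gaps = []
    for i in range(n):
        for a in range(m):
            if a != pi_star[i]:
                gaps.append(v_star[i] - gamma * P[i, a] @ v_star - r[i, a])
    return min(gaps)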

Theorem 2.5.3 (Exact Recovery of the Optimal Policy). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-dMDP Algorithm 2.1 iterate with the following sample size
\[
O\!\left(\frac{|S|^4|A|^4\sigma^2}{\bar{d}^2(1-\gamma)^4}\,\ln\frac{1}{\delta}\right).
\]
Let $\hat{\pi}^{\mathrm{Tr}}$ be obtained by rounding the randomized policy $\hat{\pi}$ to the nearest deterministic policy, given by
\[
\hat{\pi}^{\mathrm{Tr}}(i) = \operatorname*{argmax}_{a\in A}\,\hat{\lambda}_a(i), \qquad i \in S.
\]
Then $\mathbb{P}\big(\hat{\pi}^{\mathrm{Tr}} = \pi^*\big) \ge 1-\delta$.

To the best of our knowledge, this is the first result that shows how to recover the exact optimal policy in reinforcement learning. The discrete nature of the policy space makes exact recovery possible.
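The recovery step in Theorem 2.5.3 (and in its finite-horizon analogue later in this section) is a per-state argmax over the averaged dual iterate, and the randomized policy output by Algorithm 2.1 is a row-wise normalization of the same array. A minimal sketch, assuming lam_hat is stored as an array of shape (n, m):

import numpy as np

def round_policy(lam_hat):
    # Deterministic policy: pi_Tr(i) = argmax_a lam_hat[i, a].
    return np.argmax(lam_hat, axis=1)

def randomized_policy(lam_hat):
    # Randomized policy: P(pi_hat(i) = a) = lam_hat[i, a] / sum_a lam_hat[i, a].
    return lam_hat / lam_hat.sum(axis=1, keepdims=True)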

    2.5.2 Sample Complexity Analysis of Finite-Horizon MDPs

    Now we analyze the SPD-fMDP Algorithm 2.2. Again we start with the duality gap

    analysis. We have the following theorem.

Theorem 2.5.4 (PAC Duality Gap). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\lambda} = (\hat{\lambda}_{h,a})_{h\in[H],\,a\in A}$ be the averaged dual iterates generated by the SPD-fMDP Algorithm 2.2 using the following sample size/iteration number
\[
O\!\left(\frac{|S|^4|A|^2 H^2\sigma^2}{\epsilon^2}\,\ln\frac{1}{\delta}\right).
\]
Then the dual iterate $\hat{\lambda}$ satisfies
\[
\sum_{h=0}^{H-1}\sum_{a\in A}(\hat{\lambda}_{h,a})^{\mathsf T}\big(v_h^* - P_a v_{h+1}^* - r_a\big) \;\le\; \epsilon
\]
with probability at least $1-\delta$.

Proof Outline. We can view the SPD-fMDP Algorithm 2.2 as a stochastic approximation scheme for the saddle point problem (2.3.10). The SPD-fMDP Algorithm 2.2 is equivalent to the following iteration:
\[
v_h^{k+1} = \Pi_V\Big[v_h^k - \frac{(H-h)^2\sigma\gamma_k}{n}\big(\nabla_{v_h} L(v^{(k)},\lambda^{(k)}) + \epsilon_k\big)\Big], \qquad
\lambda_h^{k+1} = \Pi_\Xi\Pi_\Delta\Big[\lambda_h^k + \frac{\gamma_k}{m(H-h)^2\sigma}\big(\nabla_{\lambda_h} L(v^{(k)},\lambda^{(k)}) + \varepsilon_k\big)\Big],
\]
where $\Xi$ and $\Delta$ are given by (2.3.11), $\epsilon_k$ and $\varepsilon_k$ are zero-mean noise vectors, and $\gamma_k$ is the stepsize. The formal proof is deferred to Section 6.

    Next we present the sample complexity for the SPD-fMDP Algorithm 2.2. The

    analysis is obtained from the duality gap.

Theorem 2.5.5 (PAC Sample Complexity). For any $\epsilon > 0$, $\delta \in (0,1)$, let the SPD-fMDP Algorithm 2.2 iterate with the following sample size
\[
O\!\left(\frac{|S|^4|A|^2 H^6\sigma^2}{\epsilon^2}\,\ln\frac{1}{\delta}\right);
\]
then the output policy $\hat{\pi}$ is absolute-$\epsilon$-optimal with probability at least $1-\delta$.

Next we consider how to recover the optimal policy $\pi^*$ from the dual iterates $\hat{\lambda}$ generated by the SPD-fMDP Algorithm 2.2. We slightly abuse notation and use $\bar{d}$ to denote the minimal action discrimination constant for the finite-horizon MDP.

Definition 2.5.2. Let the minimal action discrimination constant $\bar{d}$ be the minimal efficiency loss of deviating from the optimal policy $\pi^*$ by making a single wrong action. It is given by
\[
\bar{d} = \min_{(h,i,a)\,:\,\pi_h^*(i)\neq a}\ \big(v_h^*(i) - P_{a,i}\,v_{h+1}^* - r_a(i)\big).
\]

    As long as there exists a unique optimal policy π∗, we have d̄ > 0. Now we state

    our last theorem.

Theorem 2.5.6 (Exact Recovery of the Optimal Policy). For any $\epsilon > 0$, $\delta \in (0,1)$, let $\hat{\pi}^{\mathrm{Tr}}$ be the truncated pure policy such that
\[
\hat{\pi}_h^{\mathrm{Tr}}(i) = \operatorname*{argmax}_{a\in A}\,\hat{\lambda}_{h,a}(i), \qquad i \in S_h.
\]
Let the SPD-fMDP Algorithm 2.2 iterate with the following iteration number/sample size
\[
O\!\left(\frac{|S|^4|A|^4 H^6\sigma^2}{\bar{d}^2}\,\ln\frac{1}{\delta}\right).
\]
Then $\mathbb{P}\big(\hat{\pi}^{\mathrm{Tr}} = \pi^*\big) \ge 1-\delta$.

The results of Theorems 2.5.4, 2.5.5 and 2.5.6 for the finite-horizon MDP are analogous to Theorems 2.5.1, 2.5.2 and 2.5.3 for the discounted-reward MDP. The horizon $H$ plays a role similar to the discounted infinite sum $\sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma}$.

    2.6 Related Works

Our proposed methods use ideas from the linear programming approach to Bellman's equation and from the stochastic approximation method. The linear program formulation of Bellman's equation dates back to around the time Bellman's equation itself became known; see [10, 90]. Ye [116] shows that policy iteration for the discounted MDP is a form of the dual simplex method, which is strongly polynomial for the equivalent linear program and terminates in run time $O\big(\frac{|A\times S|^2}{1-\gamma}\big)$. Cogill [28] considered the exact primal-dual method for the MDP with full knowledge of the model and interpreted it as a form of value iteration. Approximate linear programming approaches have been developed for solving large-scale MDPs on a low-dimensional subspace, starting with de Farias and Van Roy [36] and followed by Veatch [106] and Abbasi-Yadkori et al. [2].

    Our algorithm and analysis are closely related to the class of stochastic approxi-

    mation (SA) methods. For textbook references on stochastic approximation, please

see [63, 9, 16, 12]. We also use the averaging idea by [89]. In particular, our algorithm

    can be viewed as a stochastic approximation method for stochastic saddle point prob-

lems, which were first studied by Nemirovski and Rubinstein [84] without a rate-of-convergence analysis. The recent works of Dang and Lan [33] and Chen et al. [24] studied

    first-order stochastic methods for a class of general stochastic convex-concave saddle

    point problems and obtained optimal and near-optimal convergence rates.

In the reinforcement learning literature, there have been works on dual temporal difference learning that use primal-dual-type methods; see, for example, [112, 72, 75,

    76]. These works focused on evaluating the value function for a fixed policy. This is

    different from our work, where the aim is to learn the optimal policy. We also remark

    that a primal-dual learning method has been considered for the optimal stopping

    problem by Borkar et al. [17]. However, no explicit sample complexity analysis is

    available.

In recent years, a growing body of work has provided various reinforcement learning methods that are able to “learn” the optimal policy with sample complexity guarantees. The notion of “Probably Approximately Correct” (PAC) was considered for MDPs by Strehl et al. [98]; it requires that the learning method outputs an $\epsilon$-optimal policy, with high probability, using a sample size that is polynomial in the parameters of the MDP and in $1/\epsilon$. Since then, many methods have been developed for the discounted MDP and proved to achieve various PAC guarantees. Strehl et al. [98] showed that R-MAX has sample complexity $O(S^2A/(\epsilon^3(1-\gamma)^6))$ and Delayed Q-learning has $O(SA/(\epsilon^4(1-\gamma)^8))$. Lattimore and Hutter [66] proposed Upper Confidence Reinforcement Learning and obtained PAC upper and lower bounds of order $\frac{SA}{\epsilon^2(1-\gamma)^3}$, up to a logarithmic factor, under a restrictive assumption: one can only move to two states from any given state. Lattimore et al. [67] extended the analysis to more general RL models. Azar et al. [6] showed that model-based value iteration achieves the optimal rate $O\big(\frac{|S\times A|\log(|S\times A|/\delta)}{(1-\gamma)^3\epsilon^2}\big)$ for the discounted MDP. Dann and Brunskill [34] de-

veloped an upper confidence method for the fixed-horizon MDP and obtained a near-optimal rate $O\big(\frac{S^2AH^2}{\epsilon^2}\ln\frac{1}{\delta}\big)$. They also provided a lower bound $\Omega\big(\frac{SAH^2}{\epsilon^2}\ln\frac{1}{\delta+c}\big)$. Based on their

    PAC bounds, Osband and Van Roy [86] conjecture that the regret lower bound for

    reinforcement learning is Ω(√SAT ). Although the above confidence methods achieve

    the close-to-optimal PAC complexity in some cases, they require maintaining a con-

fidence interval for each state-action-state triplet. Thus these methods are not yet satisfactory in terms of space complexity. It remains unclear whether there is an approach

    that achieves both the space efficiency and the near-optimal sample complexity guar-

    antee, without estimating the transition probabilities. This motivates the research in

    this chapter.

    We emphasize that the SPD methods proposed in this chapter differ fundamentally

from the existing methods mentioned above. Instead, the SPD methods are more closely related to first-order stochastic methods for convex optimization and convex-concave saddle point problems. The closest prior work to the current one is the work of Wang and Chen [109], which proposed the first primal-dual-type learning

    method and a loose error bound. No PAC analysis was given in [109]. In this chapter,

    we develop a new class of stochastic primal-dual methods, which are substantially

    improved in both practical efficiency and theoretical complexity. Practically, the new

    algorithms are essentially coordinate descent algorithms involving projections onto

    simple sets. As a result, each iteration is straightforward to implement, making the

    algorithms practically favorable. Theoretically, these methods come with rigorous

    sample complexity guarantees. The results of this chapter provide the first PAC

    guarantee for primal-dual reinforcement learning.

2.7 Summary

    We have presented a novel primal-dual approach for solving the MDP in the model-

    free learning setting. A significant practical advantage of primal-dual learning meth-

ods is their small storage requirement and computational efficiency. The SPD methods use $O(|S|\times|A|)$ space and $O(|S|\times|A|)$ arithmetic operations per iteration, which is sublinear with respect to the dimensions of the MDP. We show that the SPD methods output an absolute-$\epsilon$-optimal solution using $O\big(\frac{|S|^4|A|^2}{\epsilon^2}\big)$ samples. In comparison, it is known that the sample complexity of reinforcement learning is bounded from below by $\Omega\big(\frac{|S||A|}{\epsilon^2}\big)$ in a slightly different setting [98, 34]. Clearly, our sample complexity results do not yet match the

    lower bounds. However, these results represent our first steps in the study of rein-

    forcement learning using the primal-dual approach. We will improve our algorithms

in Chapter 3 using a better update scheme for the dual variables, and show that the new algorithms match the theoretically optimal sample complexity lower bounds.

We make several remarks about potential improvements and extensions of the

    primal-dual learning methods.

• The SPD Algorithms 2.1 and 2.2 require that the state-action pair be sampled uniformly from $S \times A$. In other words, Algorithms 2.1 and 2.2 use pure exploration without any exploitation. Such a sampling

    oracle is suitable for offline learning when a fixed-size static data set is given.

    In the online learning setting, the sample complexity will be improved when

    actions are sampled according to the latest value and policy estimates rather

    than uniformly.

• Another extension is to consider the average-reward MDP. In the

    average-reward case, the discount factor γ and horizon H will disappear from

    the sample complexity. One cannot derive the sample complexity for average-

reward MDP from the current result. We will study this problem in Chapter 3

and give algorithms that achieve the theoretically optimal bounds.

In the authors' view, the primal-dual approach studied in this chapter has significant theoretical and practical potential. The bilinear stochastic saddle point formulation

    of Bellman equations is amenable to online learning and dimension reduction. The

    intrinsic linear duality between the optimal policy and value functions implies a con-

    venient structure for efficient learning.

Chapter 3

    Primal-Dual π Learning Using

    State and Action Features

    3.1 Introduction

    Reinforcement learning lies at the intersection between control, machine learning,

    and stochastic processes [14, 100]. The objective is to learn an optimal policy of a

    controlled system from interaction data. The most studied model for a controlled

    stochastic system is the Markov decision process (MDP), i.e., a controlled random

    walk over a (possibly continuous) state space S, where in each state s ∈ S one can

    choose an action a from an action space A so that the random walk transitions to

another state s′ ∈ S with density p(s, a, s′). In this chapter, we do not assume the

    MDP model is explicitly known but consider the setting where a generative model is

    given (see, e.g., [7]). In other words, there is an oracle that takes (s, a) as input and

    outputs a random s′ with density p(s, a, s′) and an immediate reward r(s, a) ∈ [0, 1].

This is also known as the simulator-defined MDP in parts of the literature [40, 102]. Our

    goal is to find an optimal policy that, when running on the MDP to generate an

infinitely long trajectory, yields the highest average per-step reward in the limit or the highest cumulative discounted reward.
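To fix ideas, a minimal Python sketch of such a generative model for a finite MDP is given below; the class name and the tabular representation are illustrative assumptions and are not part of the formal model.

import numpy as np

class SamplingOracle:
    # Generative model: given (s, a), return s' ~ p(s, a, .) and the reward r(s, a) in [0, 1].
    def __init__(self, P, r, seed=0):
        self.P = P                              # transition tensor of shape (n, m, n)
        self.r = r                              # reward table of shape (n, m)
        self.rng = np.random.default_rng(seed)

    def sample(self, s, a):
        n = self.P.shape[2]
        s_next = self.rng.choice(n, p=self.P[s, a])
        return s_next, self.r[s, a]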

Here, we focus on problems where the state and action spaces S and A are too large

    to be enumerated. In practice, it might be computationally challenging even to store a

    single state of the process (e.g., states could be high-resolution images). Suppose that

we are given a collection of state-action feature functions $\phi : S \times A \mapsto \mathbb{R}^m$ and value feature functions $\psi : S \mapsto \mathbb{R}^n$. They map each state-action pair $(s,a)$ and each state $s \in S$ to column vectors $\phi(s,a) = (\phi_1(s,a), \dots, \phi_m(s,a))^{\mathsf T}$ and $\psi(s) = (\psi_1(s), \dots, \psi_n(s))^{\mathsf T}$,

    respectively, where m and n are much smaller than the sizes of S and A.

    Our primary interest is to develop a sample-efficient and computationally scalable

    algorithm, which takes advantage of the given features to solve an MDP with very

    large or continuous state space and huge action space. Given the feature maps, φ

    and ψ, we adopt linear models for approximating both the value function and the

    stationary state-action density function of the MDP. By doing so, we can represent

    the value functions and state-action density functions, which are high-dimensional

    quantities, using a much smaller number of parameters.

    Contributions. Our main contribution is a tractable, model-free primal-dual π-

    learning algorithm for such compact parametric representations. The algorithm ap-

    plies to a general state space, which may be continuous and infinite. It incrementally

    updates

    • The new algorithm is inspired by a saddle point formulation of policy optimiza-

    tion in MDPs, which we refer to as the Bellman saddle point problem. We show

    a strong relationship between the parametric saddle point problem and the orig-

    inal Bellman equation. In particular, the difference between solutions to these

    two problems can be quantified using the L∞- and L1-errors of the parametric

    function classes that are used to approximate the optimal value function and

state-action density function, respectively. In the special case where the approx-

    imation error is zero (which we refer to as the “realizable” scenario), solving the

    parametric Bellman saddle point problem is equivalent to solving the original

    Bellman equation.

    • Each iteration of the algorithm can be viewed as a stochastic primal-dual iter-

    ation for solving the Bellman saddle point problem, where the value and policy

    updates are coupled in light of strong duality. We study the sample complexity

    of the π learning algorithm by analyzing the coupled primal-dual convergence

    process. We show that finding an �-optimal policy (comparing to the best

    approximate policy) requires a sample size that is linear in m+n logn�2

    , ignoring

    constants. The sample complexity depends only on the numbers of state and

    action features. It is invariant to the actual sizes of the state and action spaces.

Notations. The following notations are used throughout this chapter. For any integer

    n, we use [n] to denote the set of integers {1, 2, . . . , n}. Denote (Ω,F , ζ) to be an

    arbitrary measure space. Let f : Ω 7→ R be a measurable function defined on (Ω,F , ζ).

Denote by $\int_\Omega f\,d\zeta$ the Lebesgue integral of $f$. If the meaning of $\zeta$ is clear from the context, we use $\int_\Omega f(x)\,dx$ to denote the Lebesgue integral of $f$. If $f$ is absolutely integrable, we define the $L_1$ norm of $f$ to be $\|f\|_1 = \int_\Omega |f|\,d\zeta$. If $f$ is square integrable, we define the $L_2$ norm of $f$ to be $\|f\|_2 = \big(\int_\Omega f^2\,d\zeta\big)^{1/2}$. The $L_\infty$ norm of $f$ is defined to be the infimum of all quantities $M \ge 0$ that satisfy $|f(x)| \le M$ for almost every $x$. Given two measurable functions $f$ and $g$, their inner product is defined to be $\langle f, g\rangle = \int_\Omega f g\,d\zeta$. For two probability distributions $u$ and $w$ over a finite set $X$, we denote by $D_{\mathrm{KL}}(u\|w)$ their Kullback-Leibler divergence, i.e., $D_{\mathrm{KL}}(u\|w) = \sum_{x\in X} u(x)\log\frac{u(x)}{w(x)}$. For two functions $f(x)$ and $g(x)$, we say that

    f(x) = O(g(x)) if there exists a constant C such that |f(x)| ≤ Cg(x) for all x.

3.2 Preliminaries and Formulation of the Problem

    Let (S,F) be the state space that is equipped with an appropriate measure, and

we use $\int_S f(s)\,ds$ to denote the integral over this measurable space. We let $A$ be a finite discrete action space, and we use $\int_A f(a)\,da$ to denote the integral over the action space. A function $p(s,a,s')$, $s \in S$, $a \in A$, $s' \in S$, is called a Markov transition function if for each state-action pair $(s,a) \in S \times A$, the function $p(s,a,s')$, as a function of $s' \in S$, is a probability density function over $S$ with $\int_S p(s,a,s')\,ds' = 1$, and for

    each fixed s′ ∈ S, the function p(s, a, s′) is measurable as a function of (s, a) ∈ S×A.

    Let v : S 7→ R be a function defined on S. A Markov transition function p(s, a, s′)

    defines the transition operator P that maps v to a function Pv : S ×A 7→ R

\[
(Pv)(s,a) = \int_S v(s')\,p(s,a,s')\,ds'. \qquad (3.2.1)
\]

    Let s′ be a state sampled from p(s, a, ·) given a state-action pair (s, a). The transition

    operator has the equivalent definition

\[
(Pv)(s,a) = \mathbb{E}_{s'|s,a}\big[v(s')\big].
\]
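Because $(Pv)(s,a)$ is a conditional expectation, it can be approximated with the generative model by averaging $v$ over sampled next states. A minimal sketch for a finite state space, reusing the illustrative SamplingOracle from Section 3.1; here v is assumed to be an array of values indexed by state.

import numpy as np

def estimate_Pv(oracle, v, s, a, num_samples=100):
    # Monte Carlo estimate of (Pv)(s, a) = E[v(s') | s, a].
    samples = [v[oracle.sample(s, a)[0]] for _ in range(num_samples)]
    return float(np.mean(samples))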

    We denote by M = (S,A, p, p0, r) a Markov decision process where S,A, p are

    defined above, p0 : S 7→ R is a bounded initial state density function and r : S×A 7→

    R is a reward function with r(s, a) ∈ [0, 1]. The agent starts from the initial state

    drawn from p0 and takes actions sequentially. After taking action a at state s, the

    agent transitions to the next state with density p(s, a, ·). During the transition, the

    agent receives a reward r(s, a).

    In this work, we focus on the case where the agent applies a randomized stationary

    policy. A randomized stationary policy is a function π(s, a), s ∈ S, a ∈ A, where

    π(s, ·) is a distribution over the action space. Denote pπ to be the probability density

function of p under a fixed policy π where

\[
p_\pi(s,s') = \int_A \pi(s,a)\,p(s,a,s')\,da,
\]

for all $s, s' \in S$. A Markov transition function $p_\pi(s,s')$ also defines the operator $P_\pi^{\mathsf T}$ that acts on probability density functions via
\[
(P_\pi^{\mathsf T}\nu)(s') = \int_S p_\pi(s,s')\,\nu(s)\,ds, \qquad (3.2.2)
\]

    where ν is a bounded probability density function defined on S. Suppose that ν is the

distribution of the agent's current state. Then $P_\pi^{\mathsf T}\nu$ is the distribution of the agent's

    next state after the agent takes an action according to π.
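For a finite state space, $p_\pi$ and $P_\pi^{\mathsf T}$ reduce to ordinary matrix operations; a minimal numpy sketch under that assumption (P of shape (n, m, n), pi of shape (n, m), nu of shape (n,)):

import numpy as np

def state_transition_matrix(P, pi):
    # p_pi(s, s') = sum_a pi(s, a) * p(s, a, s').
    return np.einsum('sa,sat->st', pi, P)

def next_state_distribution(P, pi, nu):
    # (P_pi^T nu)(s') = sum_s p_pi(s, s') * nu(s).
    return state_transition_matrix(P, pi).T @ nu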

    3.2.1 Infinite-Horizon Average-Reward MDP

    Denote by Π the space of all randomized stationary policies. The policy optimization

    problem is to maximize the infinite-horizon average reward over Π:

\[
\bar{v}^* = \max_{\pi\in\Pi}\ \Big\{\, \bar{v}_\pi = \limsup_{T\to\infty}\ \mathbb{E}_\pi\Big[\frac{1}{T}\sum_{t=0}^{T-1} r(s_t, a_t)\,\Big|\, s_0 \sim p_0\Big] \Big\}, \qquad (3.2.3)
\]

    where (s0, a0, s1, a1, . . . , st, at, . . .) are state-action transitions generated by the

    Markov decision process under π from the initial distribution p0, and the expectation

    Eπ[·] is taken over the entire trajectory.
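The average reward of a fixed randomized policy can be estimated by rolling out a long trajectory with the generative model and averaging the observed rewards, which mimics the limit in (3.2.3) along a single trajectory. A minimal sketch, again reusing the illustrative SamplingOracle; pi is an array of shape (n, m) whose rows are action distributions.

import numpy as np

def estimate_average_reward(oracle, pi, s0, T=100000, seed=0):
    # Simulate T steps under pi starting from s0 and return the empirical per-step reward.
    rng = np.random.default_rng(seed)
    s, total = s0, 0.0
    m = pi.shape[1]
    for _ in range(T):
        a = rng.choice(m, p=pi[s])              # draw an action from pi(s, .)
        s, reward = oracle.sample(s, a)
        total += reward
    return total / T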

    Under certain continuity and ergodicity assumptions [90, 56], the optimal average

    reward v̄∗ is independent of the initial distribution p0, and the policy optimization

problem (3.2.3) is equivalent to the following optimization problem

\[
\begin{aligned}
\bar{v}^* = \max_{\pi\in\Pi,\ \xi: S\mapsto\mathbb{R}}\quad & \int_S \xi(s)\int_A \pi(s,a)\,r(s,a)\,da\,ds \\
\text{s.t.}\quad & \xi(s') = \int_S p_\pi(s,s')\,\xi(s)\,ds, \quad \forall s', \\
& \xi \ge 0, \quad \int_S \xi(s)\,ds = 1,
\end{aligned} \qquad (3.2.4)
\]

where the constraint ensures that ξ is the stationary density function of the states under the policy π. Let µ(s, a) = ξ(s)π(s, a). Then the policy optimization problem (3.2.4)

becomes a (possibly infinite-dimensional) linear program
\[
\begin{aligned}
\bar{v}^* = \max_{\mu: S\times A\mapsto\mathbb{R}}\quad & \int_S\int_A \mu(s,a)\,r(s,a)\,da\,ds \\
\text{s.t.}\quad & \int_A \mu(s',a)\,da = \int_S\int_A p(s,a,s')\,\mu(s,a)\,da\,ds, \quad \forall s', \\
& \mu \ge 0, \quad \int_S\int_A \mu(s,a)\,da\,ds = 1,
\end{aligned} \qquad (3.2.5)
\]

    where the constraint ensures that µ is a stationary state-action density of the MDP

    under some policy. We denote by µ∗ the optim