
An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

Yuanhao Wang* Ruosong Wang† Sham M. Kakade‡

March 24, 2021

Abstract

A fundamental question in the theory of reinforcement learning is: suppose the optimal Q-function lies in the linear span of a given d dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in d) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: there exists a constant suboptimality gap between the optimal Q-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption.

This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves this question in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal Q-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).

1 Introduction

There has been substantial recent theoretical interest in understanding the means by which we can avoid the curse of dimensionality and obtain sample-efficient reinforcement learning (RL) methods [Wen and Van Roy, 2017, Du et al., 2019c,b, Wang et al., 2019, Yang and Wang, 2019, Lattimore et al., 2020, Yang and Wang, 2020, Jin et al., 2020, Cai et al., 2020, Zanette et al., 2020, Weisz et al., 2020, Du et al., 2020, Zhou et al., 2020b,a, Modi et al., 2020, Jia et al., 2020, Ayoub et al., 2020]. Here, the extant body of literature largely focuses on sufficient conditions for efficient reinforcement learning. Our understanding of the necessary conditions for efficient reinforcement learning is far more limited. With regards to the latter, arguably the most natural assumption is linear realizability: we assume that the optimal Q-function lies in the linear span of a given feature map. The goal is to obtain polynomial sample complexity under this linear realizability assumption alone.

This “linear Q∗ problem” was a major open problem (see Du et al. [2019b] for discussion), and a recent hardness result by Weisz et al. [2020] provided a negative answer. In particular, the result showed that even with access to a generative model, any algorithm requires an exponential number of samples (in the dimension d of the feature mapping) to find a near-optimal policy, provided the action space has exponential size.

With this question resolved, one may naturally ask what is the source of hardness for the construction in Weisz et al. [2020] and whether there are additional assumptions that can serve to bypass the underlying source of this hardness.

*Princeton University. Email: [email protected]  †Carnegie Mellon University. Email: [email protected]  ‡University of Washington and Microsoft Research. Email: [email protected]


Minimum Gap?   Generative Model                   Online RL
No             Exponential [Weisz et al., 2020]   Exponential [Weisz et al., 2020]
Yes            Polynomial [Du et al., 2019b]      Exponential (This work, Theorem 1)

Table 1: Known sample complexity results for RL with linear function approximation under realizability. “Exponential” refers to an exponential lower bound (in the dimension or horizon), while “polynomial” refers to a polynomial upper bound.

Here, arguably, it is most natural to further examine the suboptimality gap in the problem, which is the gap between the optimal Q-value of the best action and that of the second-best action; the construction in Weisz et al. [2020] does in fact fundamentally rely on having an exponentially small gap. Instead, if we assume the gap is lower bounded by a constant for all states, we may hope that the problem is substantially easier, since with a finite number of samples (appropriately obtained) we can identify the optimal policy itself (i.e., gap assumptions allow us to translate value-based accuracy to identification of the optimal policy itself). In fact, this intuition is correct in the following sense: with a generative model, it is not difficult to see that polynomial sample complexity is possible under the linear realizability assumption plus the suboptimality gap assumption, since the suboptimality gap assumption allows us to easily identify an optimal action at every state, thus making the problem tractable (see Section C in Du et al. [2019b] for a formal argument).

More generally, the suboptimality gap assumption is widely discussed in the bandit literature [Dani et al., 2008, Audibert and Bubeck, 2010, Abbasi-Yadkori et al., 2011] and the reinforcement learning literature [Simchowitz and Jamieson, 2019, Yang et al., 2020] to obtain fine-grained sample complexity upper bounds. More specifically, under the realizability assumption and the suboptimality gap assumption, it has been shown that polynomial sample complexity is possible if the transition is nearly deterministic [Du et al., 2020] (also see Wen and Van Roy [2017]). However, it remains unclear whether the suboptimality gap assumption is sufficient to bypass the hardness result in Weisz et al. [2020], or whether the same exponential lower bound still holds even under the suboptimality gap assumption, when the transition could be stochastic and the generative model is unavailable. For the construction in Weisz et al. [2020], at the final stage, the gap between the value of the optimal action and its non-optimal counterparts is exponentially small, and therefore the same construction does not imply an exponential sample complexity lower bound under the suboptimality gap assumption.

Our contributions. In this work, we significantly strengthen the hardness result in Weisz et al. [2020]. In particular, we show that in the online RL setting (where a generative model is unavailable) with exponential-sized action space, the exponential sample complexity lower bound still holds even under the suboptimality gap assumption. Complementing our hardness result, we show that under the realizability assumption and the suboptimality gap assumption, our hardness result can be bypassed if one further assumes the low variance assumption in Du et al. [2019c]¹, or a hypercontractivity assumption. Hypercontractive distributions include Gaussian distributions (with arbitrary covariance matrices), uniform distributions over hypercubes and strongly log-concave distributions [Kothari and Steinhardt, 2017]. This condition has been shown powerful for outlier-robust linear regression [Kothari and Steurer, 2017], but has not yet been introduced for reinforcement learning with linear function approximation.

Our results have several interesting implications, which we discuss in detail in Section 6. Most notably, our results imply an exponential separation between the standard reinforcement learning setting and the generative model setting. Moreover, our construction enjoys greater simplicity, making it more amenable to generalization to other RL problems and to pedagogical presentation.

Organization. This paper is organized as follows. In Section 2, we review related work in the literature. In Section 3, we introduce the necessary notation, definitions and assumptions. In Section 4, we present our hardness result. In Section 5, we present the upper bound. Finally, we discuss implications of our results and open problems in Section 6.

¹We note that the sample complexity of the algorithm in Du et al. [2019c] has at least linear dependency on the number of actions, which is not sufficient for bypassing our hardness result, which assumes an exponential-sized action space.


2 Related work

Previous hardness results. Existing exponential lower bounds in RL [Krishnamurthy et al., 2016, Chen and Jiang, 2019] usually construct unstructured MDPs with an exponentially large state space. Du et al. [2019b] proved that under the approximate version of the realizability assumption, i.e., the optimal Q-function lies in the linear span of a given feature mapping approximately, any algorithm requires an exponential number of samples to find a near-optimal policy. The main idea in Du et al. [2019b] is to use the Johnson-Lindenstrauss lemma [Johnson and Lindenstrauss, 1984] to construct a large set of near-orthogonal feature vectors. This idea was later generalized to other settings, including those in Wang et al. [2020a], Kumar et al. [2020], Van Roy and Dong [2019], Lattimore et al. [2020]. Whether the exponential lower bound still holds under the exact version of the realizability assumption was left as an open problem in Du et al. [2019b].

The above open problem was recently solved by Weisz et al. [2020]. They show that under the exact version of the realizability assumption, any algorithm requires an exponential number of samples to find a near-optimal policy, assuming an exponential-sized action space. The construction in Weisz et al. [2020] also uses the Johnson-Lindenstrauss lemma to construct a large set of near-orthogonal feature vectors, with additional subtleties to ensure exact realizability.

Very recently, under the exact realizability assumption, strong lower bounds were proved in the offline setting [Wang et al., 2020b, Zanette, 2020]. These works focus on the offline RL setting, where a fixed data distribution with sufficient coverage is given and the agent cannot interact with the environment in an online manner. Instead, we focus on the online RL setting in this paper.

Existing upper bounds. For RL with linear function approximation, most existing upper bounds require representation conditions stronger than realizability. For example, the algorithms in Yang and Wang [2019, 2020], Jin et al. [2020], Cai et al. [2020], Zhou et al. [2020b,a], Modi et al. [2020], Jia et al. [2020], Ayoub et al. [2020] assume that the transition model lies in the linear span of a given feature mapping, and the algorithms in Wang et al. [2019], Lattimore et al. [2020], Zanette et al. [2020] assume completeness properties of the given feature mapping. In the remainder of this section, we mostly focus on previous upper bounds that require only realizability as the representation condition.

For deterministic systems, under the realizability assumption, Wen and Van Roy [2017] provide an algorithm that achieves polynomial sample complexity. Later, under the realizability assumption and the suboptimality gap assumption, polynomial sample complexity upper bounds were shown if the transition is deterministic [Du et al., 2020], a generative model is available [Du et al., 2019b], or a low-variance condition holds [Du et al., 2019c]. Compared to the original algorithm in Du et al. [2019c], our modified algorithm in Section 5 works under a similar low-variance condition. However, the sample complexity in Du et al. [2019c] has at least linear dependency on the number of actions, whereas our sample complexity in Section 5 has no dependency on the size of the action space. Finally, Shariff and Szepesvari [2020] obtain a polynomial upper bound under the realizability assumption when the features of all state-action pairs lie inside the convex hull of a polynomial-sized coreset and the generative model is available to the agent.

3 Preliminaries

Throughout the paper, we use [N] to denote the set {1, 2, . . . , N}. For a set S, we use ∆_S to denote the probability simplex over S.

3.1 Markov decision process (MDP) and reinforcement learning

An MDP is specified by (S, A, H, P, {R_h}_{h∈[H]}), where S is the state space, A is the action space with |A| = A, H ∈ Z₊ is the planning horizon, P : S × A → ∆_S is the transition function and R_h : S × A → ∆_R is the reward distribution. Throughout the paper, we occasionally abuse notation and use a scalar a to denote the single-point distribution at a.

A (stochastic) policy takes the form π = {π_h}_{h∈[H]}, where each π_h : S → ∆_A assigns a distribution over actions for each state. We assume that the initial state is drawn from a fixed distribution, i.e., s_1 ∼ µ. Starting from the initial


state, a policy π induces a random trajectory s_1, a_1, r_1, · · · , s_H, a_H, r_H via the process a_h ∼ π_h(· | s_h), r_h ∼ R_h(· | s_h, a_h) and s_{h+1} ∼ P(· | s_h, a_h). For a policy π, we denote the distribution of s_h in its induced trajectory by D^π_h.

Given a policy π, the Q-function (action-value function) is defined as

Q^π_h(s, a) := E[ ∑_{h′=h}^{H} r_{h′} | s_h = s, a_h = a, π ],

while V^π_h(s) := E_{a∼π_h(s)}[Q^π_h(s, a)]. We denote the optimal policy by π∗, and the associated optimal Q-function and value function by Q∗ and V∗ respectively. Note that Q∗ and V∗ can also be defined via the Bellman optimality equation²:

V∗_h(s) = max_{a∈A} Q∗_h(s, a),
Q∗_h(s, a) = E[ R_h(s, a) + V∗_{h+1}(s_{h+1}) | s_h = s, a_h = a ].

The online RL setting. In this paper, we aim to prove lower and upper bounds in the online RL setting. In this setting, in each episode, the agent interacts with the unknown environment using a policy and observes rewards and the next states. We remark that the hardness result by Weisz et al. [2020] operates in the setting where a generative model is available to the agent, so that the agent can transition to any state. Also, it is known that with a generative model, under the linear realizability assumption plus the suboptimality gap assumption, one can find a near-optimal policy with a polynomial number of samples (see Section C in Du et al. [2019b] for a formal argument).
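To make the distinction between the two access models concrete, the following minimal Python sketch contrasts them. The class and method names (reset, step, query) are illustrative conventions of ours, not an interface from the paper, and the underlying mdp object is assumed to expose samplers for the initial state, transitions, and rewards.

from typing import Hashable, Tuple

State = Hashable
Action = Hashable

class OnlineRLAccess:
    """Online RL: the agent can only roll out full episodes starting from s_1 ~ mu."""
    def __init__(self, mdp):
        self.mdp = mdp   # assumed to expose sample_initial_state / sample_transition / sample_reward
        self.state: State = None
        self.h = 1

    def reset(self) -> State:
        self.state, self.h = self.mdp.sample_initial_state(), 1
        return self.state

    def step(self, a: Action) -> Tuple[float, State]:
        r = self.mdp.sample_reward(self.h, self.state, a)
        self.state = self.mdp.sample_transition(self.state, a)
        self.h += 1
        return r, self.state

class GenerativeModelAccess:
    """Generative model: the agent may query any (h, s, a) it likes, in any order."""
    def __init__(self, mdp):
        self.mdp = mdp

    def query(self, h: int, s: State, a: Action) -> Tuple[float, State]:
        return self.mdp.sample_reward(h, s, a), self.mdp.sample_transition(s, a)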

3.2 Linear Q∗ function approximation

When the state space is large or infinite, structure on the state space is necessary for efficient reinforcement learning. In this work we consider linear function approximation. Specifically, there exists a feature map φ : S × A → R^d, and we will use linear functions of φ to represent Q-functions of the MDP. To ensure that such function approximation is viable, we assume that the optimal Q-function is realizable.

Assumption 1 (Realizability). For all h ∈ [H], there exists θ∗_h ∈ R^d such that for all (s, a) ∈ S × A, Q∗_h(s, a) = φ(s, a)⊤θ∗_h.

This assumption is widely used in the existing reinforcement learning and contextual bandit literature [Du et al., 2019c, Foster and Rakhlin, 2020]. However, even for linear function approximation, realizability alone is not sufficient for sample-efficient reinforcement learning [Weisz et al., 2020].

In this work, we also impose the regularity condition that ‖θ∗_h‖₂ = O(1) and ‖φ(s, a)‖₂ = O(1), which can always be achieved via rescaling.

Another assumption that we will use is that the minimum suboptimality gap is lower bounded.

Assumption 2 (Minimum Gap). For any state s ∈ S and action a ∈ A, the suboptimality gap is defined as ∆_h(s, a) := V∗_h(s) − Q∗_h(s, a). We assume that min{∆_h(s, a) : h ∈ [H], s ∈ S, a ∈ A, ∆_h(s, a) > 0} ≥ ∆_min.

As mentioned in the introduction, this assumption is common in bandit and reinforcement learning literature.

4 Hard Instance with Constant Suboptimality Gap

We now present our main hardness result:

Theorem 1. Consider an arbitrary online RL algorithm that takes the feature mapping φ : S × A → R^d as input. In the online RL setting, there exists an MDP with a feature mapping φ satisfying Assumption 1 and Assumption 2, such that the algorithm requires min{2^Ω(d), 2^Ω(H)} samples to find a policy π with

E_{s_1∼µ} V^π(s_1) ≥ E_{s_1∼µ} V∗(s_1) − 0.05

with probability 0.1.


Figure 1: The Leaking Complete Graph Construction. Illustration of a hard MDP. There are m + 1 states in the MDP, where f is an absorbing terminal state. Starting from any non-terminal state a, regardless of the action, there is at least γ = 1/4 probability that the next state will be f.

The remainder of this section provides the construction of a hard family of MDPs where Q∗ is linearly realizable and has a constant suboptimality gap, and where it takes exponentially many samples to learn a near-optimal policy. Each of these hard MDPs can roughly be seen as a “leaking complete graph” (see Fig. 1). Information about the optimal policy can only be gained by: (1) taking the optimal action; or (2) reaching a non-terminal state at level H. We will show that when there are exponentially many actions, either event happens with negligible probability unless exponentially many trajectories are played.

4.1 Construction of the MDP family

In this subsection we describe the construction of the hard instance (the hard MDP family) in detail. Let m be an integer to be determined. The state space is {1, · · · , m, f}. The special state f is called the terminal state. At state i, the set of available actions is [m] \ {i}; at the terminal state f, the set of available actions is [m − 1]. In other words, there are m − 1 actions available at each state. Each MDP in this family is specified by an index a∗ ∈ [m] and denoted by M_{a∗}. In other words, there are m MDPs in this family.

In order to construct the MDP family, we first find a set of approximately orthogonal vectors by leveraging the Johnson-Lindenstrauss lemma [Johnson and Lindenstrauss, 1984].

Lemma 1 (Johnson-Lindenstrauss). For any γ > 0, if m ≤ exp(γ²d′/8), there exist m unit vectors v_1, · · · , v_m in R^{d′} such that for all i, j ∈ [m] with i ≠ j, |⟨v_i, v_j⟩| ≤ γ.

We will set γ = 1/4 and m = ⌊exp(γ²d/8)⌋. By Lemma 1, we can find such a set of d-dimensional unit vectors v_1, · · · , v_m. For clarity of presentation, we will use v_i and v(i) interchangeably. The construction of M_{a∗} is specified below.

Features. The feature map, which maps state-action pairs to d-dimensional vectors, is defined as

φ(a_1, a_2) := (⟨v(a_1), v(a_2)⟩ + 2γ) · v(a_2),   ∀a_1, a_2 ∈ [m], a_1 ≠ a_2,
φ(f, ·) := 0.

Note that the feature map is independent of a∗ and is shared across the MDP family.

²We additionally define V_{H+1}(s) = 0 for all s ∈ S.
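For concreteness, the short numpy sketch below instantiates the feature map defined above on a small scale: it draws random unit vectors and resamples until the pairwise inner products satisfy |⟨v_i, v_j⟩| ≤ γ (in the spirit of Lemma 1), and then evaluates φ. The function names and the small values of m and d are ours, chosen only for illustration.

import numpy as np

def sample_near_orthogonal_vectors(m: int, d: int, gamma: float, seed: int = 0) -> np.ndarray:
    """Draw m unit vectors in R^d with pairwise |<v_i, v_j>| <= gamma (resample until it holds)."""
    rng = np.random.default_rng(seed)
    while True:
        v = rng.standard_normal((m, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        gram = v @ v.T
        np.fill_diagonal(gram, 0.0)
        if np.abs(gram).max() <= gamma:
            return v

def feature(v: np.ndarray, a1: int, a2: int, gamma: float) -> np.ndarray:
    """phi(a1, a2) = (<v(a1), v(a2)> + 2*gamma) * v(a2); the terminal state f has phi(f, .) = 0."""
    return (v[a1] @ v[a2] + 2 * gamma) * v[a2]

if __name__ == "__main__":
    gamma, d, m = 0.25, 256, 16   # small m for illustration; the construction takes m = exp(Theta(d))
    v = sample_near_orthogonal_vectors(m, d, gamma)
    phi = feature(v, 0, 1, gamma)
    print(phi.shape, float(np.linalg.norm(phi)))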


Rewards. For 1 ≤ h < H, the rewards are defined as

R_h(a_1, a∗) := ⟨v(a_1), v(a∗)⟩ + 2γ,
R_h(a_1, a_2) := −2γ[⟨v(a_1), v(a_2)⟩ + 2γ],   (a_2 ≠ a∗, a_2 ≠ a_1)
R_h(f, ·) := 0.

For h = H, r_H(s, a) := ⟨φ(s, a), v(a∗)⟩ for every state-action pair.

Transitions. The initial state distribution µ is set as the uniform distribution over {1, · · · , m}. The transition probabilities are set as follows.

Pr[f | a_1, a∗] = 1,
Pr[· | a_1, a_2] = { a_2 with probability ⟨v(a_1), v(a_2)⟩ + 2γ;  f with probability 1 − ⟨v(a_1), v(a_2)⟩ − 2γ },   (a_2 ≠ a∗, a_2 ≠ a_1)
Pr[f | f, ·] = 1.

After taking action a_2, the next state is either a_2 or f. Thus this MDP looks roughly like a “leaking complete graph” (see Fig. 1): starting from state a, it is possible to visit any other state (except for a∗); however, there is always at least 1 − 3γ probability of going to the terminal state f. The transition probabilities are indeed valid, because

0 < γ ≤ ⟨v(a_1), v(a_2)⟩ + 2γ ≤ 3γ < 1.

We now verify that realizability, i.e. Assumption 1, is satisfied. In particular, we claim the following.

Lemma 2. In the MDPMa∗ , ∀h ∈ [H], for any state-action pair (s, a), Q∗h(s, a) = 〈φ(s, a), v(a∗)〉.Proof. We first verify the statement for the terminal state f . Observe that at the terminal state f , regardless of theaction taken, the next state is always f and the reward is always 0. Hence Q∗h(f, ·) = V ∗h (f) = 0 for all h ∈ [H].Thus Q∗h(f, ·) = 〈φ(f, ·), v(a∗)〉 = 0.

We now verify realizability for the other states via induction on h = H, H − 1, · · · , 1. The induction hypothesis is that ∀a_1 ∈ [m], a_2 ≠ a_1,

Q∗_h(a_1, a_2) = (⟨v(a_1), v(a_2)⟩ + 2γ) · ⟨v(a_2), v(a∗)⟩,   (1)

and that ∀a_1 ≠ a∗,

V∗_h(a_1) = Q∗_h(a_1, a∗) = ⟨v(a_1), v(a∗)⟩ + 2γ.   (2)

When h = H, (1) holds by the definition of the rewards. Next, note that for all h, (2) follows from (1). This is because for a_2 ≠ a∗, a_1,

Q∗_h(a_1, a_2) = (⟨v(a_1), v(a_2)⟩ + 2γ) · ⟨v(a_2), v(a∗)⟩ ≤ 3γ²,

while

Q∗_h(a_1, a∗) = ⟨v(a_1), v(a∗)⟩ + 2γ ≥ γ > 3γ².

In other words, (1) implies that a∗ is always the optimal action. Thus, it remains to show that (1) holds for h assuming (2) holds for h + 1. By Bellman's optimality equation,

Q∗_h(a_1, a_2) = R_h(a_1, a_2) + E_{s_{h+1}}[ V∗_{h+1}(s_{h+1}) | a_1, a_2 ]
= −2γ[⟨v(a_1), v(a_2)⟩ + 2γ] + Pr[s_{h+1} = a_2] · V∗_{h+1}(a_2) + Pr[s_{h+1} = f] · V∗_{h+1}(f)
= −2γ[⟨v(a_1), v(a_2)⟩ + 2γ] + [⟨v(a_1), v(a_2)⟩ + 2γ] · (⟨v(a_2), v(a∗)⟩ + 2γ)
= (⟨v(a_1), v(a_2)⟩ + 2γ) · ⟨v(a_2), v(a∗)⟩.


This is exactly (1) for h. Hence both (1) and (2) hold for all h ∈ [H].
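Lemma 2 can also be verified numerically on a small instance: the sketch below runs exact backward induction on M_{a∗} and compares Q∗_h against ⟨φ(s, a), v(a∗)⟩ at every level. The helper name and the tiny value of m are ours (the actual construction takes m exponential in d); the check passes because the sampled vectors satisfy |⟨v_i, v_j⟩| ≤ γ.

import numpy as np

def verify_lemma2(d=256, m=12, H=6, gamma=0.25, a_star=3, seed=0):
    rng = np.random.default_rng(seed)
    # draw unit vectors with pairwise |<v_i, v_j>| <= gamma (resample until the condition holds)
    while True:
        v = rng.standard_normal((m, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        G = v @ v.T
        if np.abs(G - np.diag(np.diag(G))).max() <= gamma:
            break
    V_next = np.zeros(m)                     # V*_{H+1} = 0; the terminal state f has V*(f) = 0
    for h in range(H, 0, -1):
        Q = np.full((m, m), -np.inf)         # Q[a1, a2]; action a1 is unavailable at state a1
        for a1 in range(m):
            for a2 in range(m):
                if a2 == a1:
                    continue
                p = G[a1, a2] + 2 * gamma    # transition probability to a2
                if h == H:
                    Q[a1, a2] = p * G[a2, a_star]               # r_H = <phi(a1, a2), v(a*)>
                elif a2 == a_star:
                    Q[a1, a2] = p                               # reward, then deterministic move to f
                else:
                    Q[a1, a2] = -2 * gamma * p + p * V_next[a2] # reward + p * V*_{h+1}(a2)
        claimed = (G + 2 * gamma) * G[:, a_star][None, :]       # <phi(a1, a2), v(a*)>
        mask = ~np.eye(m, dtype=bool)
        assert np.allclose(Q[mask], claimed[mask]), f"realizability fails at level h={h}"
        V_next = Q.max(axis=1)
    print("Q*_h(s, a) = <phi(s, a), v(a*)> verified at every level")

verify_lemma2()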

4.2 Suboptimality gap

From Eq. (1) and (2), it is easy to see that at state a_1 ≠ a∗, for a_2 ≠ a∗, the suboptimality gap is

∆_h(a_1, a_2) := V∗_h(a_1) − Q∗_h(a_1, a_2) > γ − 3γ² ≥ γ/4.

Thus in this MDP, Assumption 2 is satisfied with ∆_min ≥ γ/4 = Ω(1).

Remark 1. Here we ignored the terminal state f and the essentially unreachable state a∗ for simplicity. Intuitively this seems fair, since reaching f is effectively the end of the episode, and the state a∗ can only be reached with negligible probability. We can also address this issue by including an auxiliary dimension (see Appendix B for details). The resulting MDP satisfies Assumption 2 rigorously, and the modifications do not affect the information-theoretic arguments below.

4.3 The information-theoretic argument

Now we are ready to state and prove our main technical lemma.

Lemma 3. For any algorithm, there exists a∗ ∈ [m] such that in order to output π with

E_{s_1∼µ} V^π(s_1) ≥ E_{s_1∼µ} V∗(s_1) − 0.05

with probability at least 0.1 for M_{a∗}, the number of samples required is 2^Ω(min{d,H}).

We provide a proof sketch for the lower bound below. The full proof can be found in Appendix C. Our main result, Theorem 1, is a direct consequence of Lemma 3.

Proof sketch. Observe that the feature map of M_{a∗} does not depend on a∗, and that for h < H and a_2 ≠ a∗, the reward R_h(a_1, a_2) also contains no information about a∗. The transition probabilities are likewise independent of a∗, unless the action a∗ is taken. Moreover, the reward at state f is always 0. Thus, to receive information about a∗, the agent either needs to take the action a∗, or needs to be at a non-terminal state at the final time step (h = H).

However, note that the probability of remaining at a non-terminal state at the next layer is at most

sup_{a_1 ≠ a_2} ⟨v(a_1), v(a_2)⟩ + 2γ ≤ 3γ ≤ 3/4.

Thus for any algorithm, Pr[s_H ≠ f] ≤ (3/4)^H, which is exponentially small.

In other words, any algorithm that does not know a∗ either needs to “be lucky” so that s_H ≠ f, or needs to take a∗ “by accident”. Since the number of actions is m = 2^Θ(d), neither event can happen with constant probability unless the number of episodes is exponential in min{d, H}.

In order to make this claim rigorous, we can construct a reference MDP M_0 as follows. The state space, action space, and features of M_0 are the same as those of M_{a∗}. The transitions are defined as follows:

Pr[· | a_1, a_2] = { a_2 with probability ⟨v(a_1), v(a_2)⟩ + 2γ;  f with probability 1 − ⟨v(a_1), v(a_2)⟩ − 2γ },   (∀a_1, a_2 s.t. a_1 ≠ a_2)
Pr[f | f, ·] = 1.

The rewards are defined as follows:

R_h(a_1, a_2) := −2γ[⟨v(a_1), v(a_2)⟩ + 2γ],   (∀a_1, a_2 s.t. a_1 ≠ a_2)
R_h(f, ·) := 0.


Note that M_0 is identical to M_{a∗}, except when a∗ is taken, or when a trajectory ends at a non-terminal state. Since the latter event happens with exponentially small probability, we can show that for any algorithm the probability of taking a∗ in M_{a∗} is close to the probability of taking a∗ in M_0. Since M_0 is independent of a∗, unless an exponential number of samples is used, for any algorithm there exists a∗ ∈ [m] such that the probability of taking a∗ in M_0 is o(1). It then follows that the probability of taking a∗ in M_{a∗} is o(1). Since a∗ is the optimal action at every state, such an algorithm cannot output a near-optimal policy for M_{a∗}.
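The proof sketch can be illustrated empirically. The rough Monte Carlo sketch below runs uniformly random exploration in M_{a∗} and counts how many episodes are "informative", i.e., either play a∗ at a non-terminal state or survive to a non-terminal state at step H. The helper name is ours, and m here is far smaller than the 2^Θ(d) of the actual construction, so the fraction is only moderately small.

import numpy as np

def informative_fraction(d=2048, m=1024, H=20, gamma=0.25, episodes=2000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((m, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # near-orthogonal w.h.p. for these sizes
    a_star = int(rng.integers(m))
    informative = 0
    for _ in range(episodes):
        s = int(rng.integers(m))                        # s_1 ~ Uniform{1, ..., m}
        took_a_star = False
        for h in range(1, H + 1):
            a = int(rng.integers(m))
            while a == s:                               # action s is unavailable at state s
                a = int(rng.integers(m))
            if a == a_star:                             # event (1): the optimal action was played
                took_a_star = True
                break
            if h < H:
                p = float(v[s] @ v[a]) + 2 * gamma
                s = a if rng.random() < p else None     # leak to the terminal state f
                if s is None:
                    break
        if took_a_star or s is not None:                # event (2): s_H != f
            informative += 1
    print(f"informative episodes: {informative} / {episodes}")

informative_fraction()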

5 Upper Bound

Theorem 1 suggests that Assumption 1 and Assumption 2 are not sufficient for sample-efficient RL when the number of actions could be exponential, and that additional assumptions are needed to achieve polynomial sample complexity. One style of assumption is to assume a global representation property of the features, such as completeness [Zanette et al., 2020].

In this section, we consider two assumptions that impose additional structure on the transitions of the MDP, rather than on the feature representation, and that enable good rates for linear regression with sparse bias. The first condition is a variant of the low-variance condition in Du et al. [2019c].

Assumption 3 (Low variance condition). There exists a constant 1 ≤ C_var < ∞ such that for any h ∈ [H] and any policy π,

E_{s∼D^π_h}[ |V^π(s) − V∗(s)|² ] ≤ C_var · ( E_{s∼D^π_h}[ |V^π(s) − V∗(s)| ] )².

The second assumption is that the feature distribution is hypercontractive.

Assumption 4. There exists a constant 1 ≤ C_hyper < ∞ such that for any h ∈ [H] and any policy π, the distribution of φ(s, a) with (s, a) ∼ D^π_h is (C_hyper, 4)-hypercontractive. In other words, ∀π, ∀h ∈ [H], ∀v ∈ R^d,

E_{(s,a)∼D^π_h}[ (φ(s, a)⊤v)⁴ ] ≤ C_hyper · ( E_{(s,a)∼D^π_h}[ (φ(s, a)⊤v)² ] )².

Intuitively, hypercontractivity characterizes the anti-concentration of a distribution. A broad class of distributions are hypercontractive with C_hyper = O(1), including Gaussian distributions (with arbitrary covariance matrices), uniform distributions over the hypercube and sphere, and strongly log-concave distributions [Kothari and Steurer, 2017]. Hypercontractivity has previously been used for outlier-robust linear regression [Klivans et al., 2018, Bakshi and Prasad, 2020] and moment estimation [Kothari and Steurer, 2017].
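As a rough numerical intuition (not a procedure from the paper), the sketch below estimates the smallest C such that E[(x⊤v)⁴] ≤ C · (E[(x⊤v)²])² over random directions v, for a Gaussian distribution and for a heavier-tailed one; the helper name and the specific distributions are our own choices.

import numpy as np

def estimate_hypercontractivity(samples: np.ndarray, n_directions: int = 200, seed: int = 0) -> float:
    """Estimate sup_v E[(x^T v)^4] / (E[(x^T v)^2])^2 over random unit directions v."""
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    worst = 0.0
    for _ in range(n_directions):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        proj = samples @ v
        ratio = np.mean(proj ** 4) / np.mean(proj ** 2) ** 2
        worst = max(worst, float(ratio))
    return worst

rng = np.random.default_rng(1)
gaussian = rng.standard_normal((100_000, 8))               # projections are Gaussian, so the ratio is about 3
heavy_tailed = rng.standard_t(df=4.5, size=(100_000, 8))   # heavier tails give a much larger constant
print("Gaussian:", estimate_hypercontractivity(gaussian))
print("Student-t(4.5):", estimate_hypercontractivity(heavy_tailed))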

We show that under Assumptions 1, 2, and 3, or under Assumptions 1, 2, and 4, a modified version of the Difference Maximization Q-learning (DMQ) algorithm [Du et al., 2019c] is able to learn a near-optimal policy using a polynomial number of trajectories, with no dependency on the number of actions.

5.1 Optimal experiment design

Given a set of d-dimensional vectors, (optimal) experiment design aims at finding a distribution ρ over the vectors such that when sampling from this distribution, linear regression performs optimally based on some criterion. In this paper, we will use the G-optimality criterion, which minimizes the maximum prediction variance over the set. The following lemma on G-optimal design is a direct corollary of the Kiefer-Wolfowitz theorem [Kiefer and Wolfowitz, 1960].

Lemma 4 (Existence of G-optimal design). For any set X ⊆ R^d, there exists a distribution K_X supported on X, known as the G-optimal design, such that

max_{x∈X} x⊤( E_{z∼K_X} zz⊤ )^{−1} x ≤ d.


Efficient algorithms for finding such a distribution can be found in Todd [2016]. In the context of reinforcement learning, the set X corresponds to the set of all features, which is inaccessible. Instead, one can only observe one state s at a time, and choose a ∈ A based on the features {φ(s, a)}_{a∈A}. Such a problem is closer to the distributional optimal design problem described by Ruan et al. [2020]. For our purpose, the following simple approach suffices: given a state s, perform exploration by sampling from the G-optimal design on {φ(s, a)}_{a∈A}. The performance of this exploration strategy is guaranteed by the following lemma, which will be used in the analysis in Section 5.

Lemma 5 (Lemma 4 in Ruan et al. [2020]). For any state s, denote the G-optimal design over its features by ρ_s(·) ∈ ∆_A, and the corresponding covariance matrix by Σ_s := ∑_a ρ_s(a) φ(s, a) φ(s, a)⊤. Given a distribution ν over states, denote the average covariance matrix by Σ := E_{s∼ν} Σ_s. Then

E_{s∼ν}[ max_{a∈A} φ(s, a)⊤ Σ^{−1} φ(s, a) ] ≤ d².

We provide a proof in the appendix for completeness. Note that the performance of this strategy is only worse by a factor of d (compared to the case where one can query all features), and has no dependency on the number of actions.
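Lemma 4 only asserts existence; in practice a design satisfying the G-optimality criterion can be computed, for example, by the classical multiplicative-weight iteration for D-optimal design (which coincides with G-optimal design by the Kiefer-Wolfowitz theorem; algorithms of this kind are surveyed in Todd [2016]). The sketch below is a minimal illustration with our own function names.

import numpy as np

def g_optimal_design(X: np.ndarray, iters: int = 1000, tol: float = 1e-3) -> np.ndarray:
    """Multiplicative-weight iteration over the rows of X (n x d): returns weights rho with
    max_i x_i^T Sigma(rho)^{-1} x_i close to d (the Kiefer-Wolfowitz optimum)."""
    n, d = X.shape
    rho = np.full(n, 1.0 / n)
    for _ in range(iters):
        Sigma = X.T @ (rho[:, None] * X)                                # sum_i rho_i x_i x_i^T
        g = np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)        # x_i^T Sigma^{-1} x_i
        if g.max() <= d + tol:
            break
        rho *= g / d                                                    # multiplicative update
        rho /= rho.sum()
    return rho

# usage: a design over candidate feature vectors {phi(s, a)}_{a in A} of a single state
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
rho = g_optimal_design(X)
Sigma = X.T @ (rho[:, None] * X)
print("max_a phi^T Sigma^{-1} phi =", np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X).max())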

5.2 The modified DMQ algorithm

Overview. During the execution of the Difference Maximization Q-learning (DMQ) algorithm, for each level h ∈ [H], we maintain three variables: the estimated linear coefficients θ_h ∈ R^d, a set of exploratory policies Π_h, and the empirical feature covariance matrix Σ_h associated with Π_h. We initialize θ_h = 0 ∈ R^d, Σ_h := 0_{d×d}, and Π_h as a single purely random exploration policy, i.e., Π_h = {π} where π chooses an action uniformly at random at all states.³

Each time we execute Algorithm 1, the goal is to update the estimated linear coefficients θ_h ∈ R^d, so that for all π ∈ Π_h, θ_h is a good estimate of θ∗_h with respect to the distribution induced by π. We run ridge regression on the data distribution induced by the policies in Π_h, and the regression targets are collected by invoking the greedy policy induced by {θ_{h′}}_{h′>h}.

However, there are two apparent issues with such an approach. First, for levels h′ > h, θ_{h′} is guaranteed to achieve low estimation error only with respect to the distributions induced by the policies in Π_{h′}. It is possible that for some π ∈ Π_h, the estimation error of θ_{h′} is high for the distribution induced by π (followed by the greedy policy). To resolve this issue, the main idea in Du et al. [2019c] is to explicitly check whether θ_{h′} also predicts well on the new distribution (see Line 5 in Algorithm 1). If not, we add the new policy to Π_{h′} and invoke Algorithm 1 recursively. The analysis in Du et al. [2019c] upper bounds the total number of recursive calls by a potential function argument, which also gives an upper bound on the sample complexity of the algorithm.

Second, the exploratory policies Π_h only induce a distribution over states at level h, and the algorithm still needs an exploration strategy to choose actions at level h. To this end, the algorithm in Du et al. [2019c] explores all actions uniformly at random, and therefore its sample complexity has at least linear dependency on the number of actions. We note that similar issues also appear in the linear contextual bandit literature [Lattimore and Szepesvari, 2020, Ruan et al., 2020], and indeed our solution here is to explore by sampling from the G-optimal design over the features at a single state. As shown by Lemma 5, for all possible roll-in distributions, such an exploration strategy achieves good coverage of the feature space, and it is therefore sufficient for eliminating the dependency on the size of the action space.

The algorithm. The formal description of the algorithm is given in Algorithm 1. The algorithm should be run by calling LearnLevel on input h = 0.

Here, for a policy π_h ∈ Π_h, the associated exploratory policy π̄_h is defined as

π̄_h(s_{h′}) = π_h(s_{h′})   (if h′ < h),
π̄_h(s_{h′}) ∼ ρ_{s_{h′}}(·)   (if h′ = h),
π̄_h(s_{h′}) = argmax_a φ(s_{h′}, a)⊤ θ_{h′}   (if h′ > h).   (3)

3We also define a special Π0 in the same manner.


Algorithm 1: LearnLevel
Input: A level h ∈ {0, · · · , H}
 1: for π_h ∈ Π_h do
 2:   for h′ = H, H − 1, · · · , h + 1 do
 3:     Collect N samples {(s^j_{h′}, a^j_{h′})}_{j∈[N]} with s^j_{h′} ∼ D^{π̄_h}_{h′} and a^j_{h′} ∼ ρ_{s^j_{h′}}   (π̄_h defined in (3))
 4:     Σ̂_{h′} ← (1/N) ∑_{j=1}^{N} φ(s^j_{h′}, a^j_{h′}) φ(s^j_{h′}, a^j_{h′})⊤
 5:     if ‖Σ_{h′}^{−1/2} Σ̂_{h′} Σ_{h′}^{−1/2}‖_2 > β|Π_{h′}| then
 6:       Π_{h′} ← Π_{h′} ∪ {π̄_h}
 7:       Run LearnLevel recursively on level h′
 8:       Goto Line 1 (restart the algorithm)
 9: if h = 0 then
10:   Output the greedy policy with respect to {θ_h}_{h∈[H]}
11: Σ_h ← (λ_r / |Π_h|) I,  w_h ← 0 ∈ R^d
12: for i = 1, · · · , N|Π_h| do
13:   Sample π_h from the uniform distribution over Π_h
14:   Execute π̄_h (see (3)) to collect (s^i_h, a^i_h, y_i), where y_i := ∑_{h′≥h} r^i_{h′} is the on-the-go reward
15:   Σ_h ← Σ_h + (1/(N|Π_h|)) φ(s^i_h, a^i_h) φ(s^i_h, a^i_h)⊤
16:   w_h ← w_h + (1/(N|Π_h|)) φ(s^i_h, a^i_h) y_i
17: θ_h ← ( (λ_ridge − λ_r/|Π_h|) I + Σ_h )^{−1} w_h

Here ρ_s(·) is the G-optimal design on the set of vectors {φ(s, a)}_{a∈A}, as defined in Lemma 4. Note that when h = 0, π̄_h is always the greedy policy with respect to {θ_h}_{h∈[H]}. The choice of the algorithmic parameters (β, λ_r, λ_ridge) can be found in the proof of Theorem 2.
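For concreteness, the estimation step of Algorithm 1 (Lines 11 to 17) reduces to the regularized least-squares computation sketched below. The variable names are ours, and the sketch assumes the features and on-the-go rewards have already been collected by executing π̄_h with G-optimal exploration at level h.

import numpy as np

def dmq_regression_step(Phi: np.ndarray, y: np.ndarray, n_policies: int,
                        lam_r: float, lam_ridge: float) -> np.ndarray:
    """Lines 11-17 of Algorithm 1 (sketch): ridge regression of the on-the-go rewards y_i
    on the explored features phi(s_h^i, a_h^i), with the regularization split used there.
    Phi has shape (N * n_policies, d); y has shape (N * n_policies,)."""
    n, d = Phi.shape
    Sigma_hat = (lam_r / n_policies) * np.eye(d) + (Phi.T @ Phi) / n   # Lines 11 and 15
    w_hat = (Phi.T @ y) / n                                            # Line 16
    theta = np.linalg.solve((lam_ridge - lam_r / n_policies) * np.eye(d) + Sigma_hat, w_hat)  # Line 17
    return theta

# synthetic usage
rng = np.random.default_rng(0)
N, n_policies, d = 200, 3, 8
Phi = rng.standard_normal((N * n_policies, d))
theta_true = rng.standard_normal(d)
y = Phi @ theta_true + 0.1 * rng.standard_normal(N * n_policies)
theta = dmq_regression_step(Phi, y, n_policies, lam_r=1e-6, lam_ridge=1e-2)
print(float(np.linalg.norm(theta - theta_true)))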

5.3 Analysis

We show the following theorem regarding the modified algorithm.

Theorem 2. Assume that Assumptions 1 and 2 hold, together with one of Assumption 3 and Assumption 4. Also assume that

ε ≤ poly(∆_min, 1/C_var, 1/d, 1/H)   (under Assumption 3)
or ε ≤ poly(∆_min, 1/C_hyper, 1/d, 1/H).   (under Assumption 4)

Let µ be the initial state distribution. Then with probability 1 − ε, running Algorithm 1 on input 0 returns a policy π which satisfies E_{s_1∼µ} V^π(s_1) ≥ E_{s_1∼µ} V∗(s_1) − ε using poly(1/ε) trajectories.

Note that here both the algorithm and the theorem have no dependence on A, the number of actions. The proof of the theorem under Assumption 3 is largely based on the analysis in Du et al. [2019c]. The main difference is that we use Lemma 5 instead of the original union bound argument when controlling Pr[ sup_a |θ_h⊤φ(s, a) − Q∗_h(s, a)| > γ/2 ]. The proof under Assumption 4 relies on a novel analysis of least squares regression under hypercontractivity. The full proof is deferred to Appendix E.

6 Discussion

Exponential separation between the generative model and the online setting. When a generative model (also known as a simulator) is available, Assumption 1 and Assumption 2 are sufficient for designing an algorithm with poly(1/ε, 1/∆_min, d, H) sample complexity [Du et al., 2019b, Theorem C.1]. As shown by Theorem 1, under the standard online RL setting (i.e., without access to a generative model), the sample complexity is lower bounded by 2^Ω(min{d,H}) when ∆_min = Θ(1), under the same set of assumptions. This implies that the generative model is exponentially more powerful than the standard online RL setting.

Although the generative model is conceptually much stronger than the online RL model, previously little was known about the extent to which the former is more powerful. In tabular RL, for instance, the known sample complexity bounds with or without access to a generative model are nearly the same [Zhang et al., 2020, Agarwal et al., 2020], and both match the lower bound (up to logarithmic factors). To the best of our knowledge, the only existing example of such a separation is shown by Wang et al. [2020a] under the following set of conditions: (i) deterministic system; (ii) realizability (Assumption 1); (iii) no reward feedback (a.k.a. reward-free exploration). In comparison, our separation result holds under fewer restrictions (it allows stochasticity) and for the usual RL environment (instead of reward-free exploration), and it is therefore far more natural.

Connecting Theorem 1 and Theorem 2. Our hardness result in Theorem 1 shows that under Assumption 1 and Assumption 2, any algorithm requires an exponential number of samples to find a near-optimal policy, and therefore sample-efficient RL is impossible without further assumptions (e.g., Assumption 3 assumed in Theorem 2). Indeed, Theorem 1 and Theorem 2 together imply that the coefficient C_var in Assumption 3 must be at least exponential for the hard MDP family used in Theorem 1, which can also be verified easily.

Minimum reaching probability. We would also like to mention the following reachability condition, which is assumed by Du et al. [2019a], Misra et al. [2020]. Denote the probability under policy π of visiting state s at step h by Pr^π_h[s]; then it is assumed that min_{s∈S, h>1} max_π Pr^π_h[s] = η_min > 0. Although in M_{a∗}, η_min ≤ 2^{−Θ(H)} is exponentially small, one can slightly modify the construction so that η_min = 1 (see Appendix D for details). The rationale is straightforward: in M_{a∗}, with high probability the action a∗ will not be taken within polynomially many samples. Thus, one can exploit this by setting a∗ to lead to a special state from which all states are reachable. This suggests that even under Assumption 1, Assumption 2, and the condition that η_min = 1, there is an exponential sample complexity lower bound.

Open problems. The first open problem, perhaps an obvious one, is whether a sample complexity lower bound under Assumption 1 can be shown with a polynomial number of actions. This would further rule out poly(A, d, H)-style upper bounds, which are still possible given the current results. An even stronger lower bound would be one with a polynomial number of actions under both realizability and a minimum suboptimality gap. Such a lower bound would likely settle the sample complexity of reinforcement learning with linear function approximation under the realizability assumption. Another open problem is whether Assumptions 3 and 4 can be replaced by, or understood as, more natural characterizations of the complexity of the MDP.

Acknowledgments

The authors would like to thank Kefan Dong and Dean Foster for helpful discussions. Sham M. Kakade acknowledges funding from ONR award N00014-18-1-2247. RW was supported in part by NSF IIS1763562, US Army W911NF1920104, and ONR Grant N000141812861.

References

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020.


Jean-Yves Audibert and Sebastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.

Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.

Ainesh Bakshi and Adarsh Prasad. Robust linear regression: Optimal rates in polynomial time. arXiv preprint arXiv:2007.01394, 2020.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019a.

Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2019b.

Simon S Du, Yuping Luo, Ruosong Wang, and Hanrui Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8060–8070, 2019c.

Simon S Du, Jason D Lee, Gaurav Mahajan, and Ruosong Wang. Agnostic Q-learning with function approximation in deterministic systems: Near-optimal bounds on approximation error and sample complexity. Advances in Neural Information Processing Systems, 33, 2020.

Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210. PMLR, 2020.

Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. Model-based reinforcement learning with value-targeted regression. In Learning for Dynamics and Control, pages 666–686. PMLR, 2020.

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.

William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.

Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pages 1420–1430. PMLR, 2018.

Pravesh K Kothari and Jacob Steinhardt. Better agnostic clustering via relaxed tensor norms. arXiv preprint arXiv:1711.07465, 2017.

Pravesh K Kothari and David Steurer. Outlier-robust moment-estimation via sum-of-squares. arXiv preprint arXiv:1711.11581, 2017.


Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1848–1856, 2016.

Aviral Kumar, Abhishek Gupta, and Sergey Levine. DisCor: Corrective feedback in reinforcement learning via distribution correction. arXiv preprint arXiv:2003.07305, 2020.

Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.

Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International Conference on Machine Learning, pages 6961–6971. PMLR, 2020.

Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.

Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. arXiv preprint arXiv:2007.01980, 2020.

Roshan Shariff and Csaba Szepesvari. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.

Max Simchowitz and Kevin Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. arXiv preprint arXiv:1905.03814, 2019.

Michael J Todd. Minimum-Volume Ellipsoids: Theory and Algorithms, volume 23. SIAM, 2016.

Joel A Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.

Ruosong Wang, Simon S Du, Lin F Yang, and Ruslan Salakhutdinov. On reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2006.11274, 2020a.

Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020b.

Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.

Gellert Weisz, Philip Amortila, and Csaba Szepesvari. Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions. arXiv preprint arXiv:2010.01374, 2020.

Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.

Kunhe Yang, Lin F Yang, and Simon S Du. Q-learning with logarithmic regret. arXiv preprint arXiv:2006.09118, 2020.

Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.


Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.

Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. arXiv preprint arXiv:2012.08005, 2020.

Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.

Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020.

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. arXiv preprint arXiv:2012.08507, 2020a.

Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. arXiv preprint arXiv:2006.13165, 2020b.


A Proof of Lemma 5

Proof. We state a proof of this lemma for completeness. By Lemma 4, for all s,

max_{a∈A} φ(s, a)⊤ Σ_s^{−1} φ(s, a) ≤ d.

It follows that for all a ∈ A,

φ(s, a) φ(s, a)⊤ ⪯ d Σ_s.

Therefore,

E_{s∼ν}[ max_{a∈A} φ(s, a)⊤ Σ^{−1} φ(s, a) ] = E_{s∼ν} max_{a∈A} Tr( φ(s, a) φ(s, a)⊤ Σ^{−1} ) ≤ E_{s∼ν} Tr( d Σ_s Σ^{−1} ) = d².

B Addressing Remark 1

Let us redefine M_{a∗} as follows. The state space is again {1, · · · , m, f}. The action space is [m] for every state. We will also use the same set of m d-dimensional vectors v_1, · · · , v_m. In this construction, we will reset γ := 1/6.

Features. The feature map now maps state-action pairs to (d + 1)-dimensional vectors, and is defined as follows.

φ(a_1, a_2) := ( 0, (⟨v(a_1), v(a_2)⟩ + 2γ) · v(a_2) ),   (∀a_1, a_2 ∈ [m], a_1 ≠ a_2)
φ(a_1, a_1) := ( (3/4)γ, 0 ),   (∀a_1 ∈ [m])
φ(f, 1) := (0, 0),
φ(f, a) := (−1, 0).   (∀a ≠ 1)

Note that the feature map is again independent of a∗. Define θ∗ := (1, v(a∗)).

Rewards. For 1 ≤ h < H, the rewards are defined as

R_h(a_1, a∗) := ⟨v(a_1), v(a∗)⟩ + 2γ,   (a_1 ≠ a∗)
R_h(a_1, a_2) := −2γ[⟨v(a_1), v(a_2)⟩ + 2γ],   (a_1 ≠ a∗, a_2 ≠ a∗, a_2 ≠ a_1)
R_h(a_1, a_1) := (3/4)γ,   (∀a_1)
R_h(a∗, a_2) := [⟨v(a∗), v(a_2)⟩ + 2γ] · ⟨v(a∗), v(a_2)⟩,   (a_2 ≠ a∗)
R_h(f, 1) := 0,
R_h(f, a) := −1.   (a ≠ 1)

For h = H, r_H(s, a) := ⟨φ(s, a), θ∗⟩ for every state-action pair.


Transitions. The initial state distribution is set as the uniform distribution over {1, · · · , m}. The transition probabilities are set as follows.

Pr[f | a_1, a∗] = 1,
Pr[f | a_1, a_1] = 1,
Pr[· | a_1, a_2] = { a_2 with probability ⟨v(a_1), v(a_2)⟩ + 2γ;  f with probability 1 − ⟨v(a_1), v(a_2)⟩ − 2γ },   (a_2 ≠ a∗, a_2 ≠ a_1)
Pr[f | f, ·] = 1.

We now check realizability in the new MDP. Note that now we want to show Q∗_h(s, a) = φ(s, a)⊤θ∗, where θ∗ = (1, v(a∗)). We claim that ∀h ∈ [H],

V∗_h(a_1) = ⟨v(a_1), v(a∗)⟩ + 2γ,   (a_1 ≠ a∗)
Q∗_h(a_1, a_2) = (⟨v(a_1), v(a_2)⟩ + 2γ) · ⟨v(a_2), v(a∗)⟩,   (a_2 ≠ a_1)
Q∗_h(a_1, a_1) = (3/4)γ.   (∀a_1)

To see this, first notice that the expression for Q∗_h implies that the optimal action is a∗ at any non-terminal state. Suppose a_1 ≠ a∗; then for a_2 ≠ a_1, a∗, Q∗_h(a_1, a_2) ≤ 3γ² < γ ≤ Q∗_h(a_1, a∗). Moreover,

Q∗_h(a_1, a_1) = (3/4)γ < γ ≤ Q∗_h(a_1, a∗).

Thus, a∗ is indeed the optimal action for a_1 if a_1 ≠ a∗. For a∗ and a_1 ≠ a∗, Q∗_h(a∗, a_1) ≤ 3γ² < (3/4)γ = Q∗_h(a∗, a∗). What remains is to compute V∗_{h−1} via Bellman's equations, which is the same as in the analysis in Section 4.

Also, it is easy to see that Q∗_h(f, 1) = 0, and that ∀a ≠ 1, Q∗_h(f, a) = −1.

Suboptimality Gap. In M_{a∗}, ∀a_1 ≠ a∗, ∀a_2 ≠ a∗, Q∗_h(a_1, a_2) ≤ max{3γ², (3/4)γ}. Thus

∆_h(a_1, a_2) ≥ γ − max{3γ², (3/4)γ} = 1/24.

For a∗, V∗_h(a∗) = (3/4)γ, while for a_1 ≠ a∗,

Q∗_h(a∗, a_1) = (⟨v(a∗), v(a_1)⟩ + 2γ) · ⟨v(a∗), v(a_1)⟩ ≤ 3γ².

Thus ∆_h(a∗, a_1) ≥ (3/4)γ − 3γ² = 1/24. As for the terminal state f, the suboptimality gap is obviously 1. Therefore ∆_min ≥ 1/24 in this new MDP.

Information theoretic arguments. The modifications here do not affect the proof of Theorem 1. Suppose action a_2 is taken at state a_1. If a_1 ≠ a_2, then the behavior (transitions and rewards) is identical to the original MDP. If a_1 = a_2 ≠ a∗, neither the transitions nor the rewards depend on a∗. Hence, we can still construct a reference MDP as in the proof of Theorem 1, such that information about a∗ can only be gained by: (1) taking a∗; or (2) reaching s_H ≠ f.

C Proof of Theorem 1

Proof. We consider K episodes of interaction between the algorithm and the MDP M_a. Since each trajectory is a sequence of H states, we define the total number of samples as KH. Denote the state, the action and the reward at episode k and timestep h by s^k_h, a^k_h and r^k_h respectively.


Consider the following reference MDP, denoted by M_0. The state space, action space, and features of this MDP are the same as those of the MDP family. The transitions are defined as follows:

Pr[· | a_1, a_2] = { a_2 with probability ⟨v(a_1), v(a_2)⟩ + 2γ;  f with probability 1 − ⟨v(a_1), v(a_2)⟩ − 2γ },   (∀a_1, a_2 s.t. a_1 ≠ a_2)
Pr[f | f, ·] = 1.

The rewards are defined as follows:

R_h(a_1, a_2) := −2γ[⟨v(a_1), v(a_2)⟩ + 2γ],   (∀a_1, a_2 s.t. a_1 ≠ a_2)
R_h(f, ·) := 0.

Intuitively, this MDP is very similar to the MDPs in the family, except that the optimal action a∗ is removed. More specifically, M_0 is identical to M_a except when the action a is taken at a non-terminal state, or when an episode ends at a non-terminal state.

More specifically, we claim that for t < H, ∀s_t, a_t such that a_t ≠ a,

Pr_{M_a}[s_{t+1} | s_t, a_t] = Pr_{M_0}[s_{t+1} | s_t, a_t],

and that for t < H, ∀s_t, a_t such that a_t ≠ a,

r^{M_a}_t(s_t, a_t) = r^{M_0}_t(s_t, a_t).

Also, r^{M_a}_H(s_t, a_t) = r^{M_0}_H(s_t, a_t) if s_t = f. It follows that

Pr_{M_a}[ s^1_1, a^1_1, r^1_1, · · · , s^k_h, a^k_h, r^k_h | a ∉ A^k_h, ∀k′ ≤ k, s^{k′}_H = f ]
= Pr_{M_0}[ s^1_1, a^1_1, r^1_1, · · · , s^k_h, a^k_h, r^k_h | a ∉ A^k_h, ∀k′ ≤ k, s^{k′}_H = f ].

Here A^k_h is shorthand for {a^1_1, a^1_2, · · · , a^1_H, · · · , a^k_h}, i.e., all actions taken up to timestep h of episode k. By marginalizing the states and the actions, we get

Pr_{M_a}[ a^k_h | a ∉ A^k_h, ∀k′ ≤ k, s^{k′}_H = f ] = Pr_{M_0}[ a^k_h | a ∉ A^k_h, ∀k′ ≤ k, s^{k′}_H = f ].

It then follows that

Pr_{M_a}[ a^k_h = a | a ∉ A^k_h, ∀k′ ≤ k, s^{k′}_H = f ] = Pr_{M_0}[ a^k_h = a | a ∉ A^k_h, ∀k′ ≤ k, s^{k′}_H = f ].

Next, we prove via induction that

Pr_{M_a}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ] = Pr_{M_0}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ].   (4)

Suppose that (4) holds up to (k, h − 1). Then

Pr_{M_a}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ]
= Pr_{M_a}[ a ∉ A^k_{h−1} ] · Pr_{M_a}[ a^k_h = a | a ∉ A^k_{h−1}, ∀k′ ≤ k, s^{k′}_H = f ] + Pr_{M_a}[ a ∈ A^k_{h−1} | ∀k′ ≤ k, s^{k′}_H = f ]
= Pr_{M_0}[ a ∉ A^k_{h−1} ] · Pr_{M_0}[ a^k_h = a | a ∉ A^k_{h−1}, ∀k′ ≤ k, s^{k′}_H = f ] + Pr_{M_0}[ a ∈ A^k_{h−1} | ∀k′ ≤ k, s^{k′}_H = f ]
= Pr_{M_0}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ].


That is, (4) holds for (k, h) as well. By induction, (4) holds for all h, k. Thus,

Pr_{M_a}[ a ∈ A^k_h ] ≤ Pr_{M_a}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ] + Pr[ ∃k′ ≤ k, s^{k′}_H ≠ f ]
≤ Pr_{M_0}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ] + k · (3/4)^H.

Since |A^k_h| ≤ kH, we have ∑_{a∈[m]} Pr_{M_0}[ a ∈ A^k_h | ∀k′ ≤ k, s^{k′}_H = f ] ≤ kH. It follows that there exists a∗ ∈ [m] such that

Pr_{M_0}[ a∗ ∈ A^K_H | ∀k′ ≤ K, s^{k′}_H = f ] ≤ KH/m = KH · e^{−Θ(d)}.

As a result,

Pr_{M_{a∗}}[ a∗ ∈ A^K_H ] ≤ KH · e^{−Θ(d)} + K · (3/4)^H.

In other words, unless KH = 2^Ω(min{d,H}), the probability of taking the optimal action a∗ in the interaction with M_{a∗} is o(1).

From the suboptimality gap condition, it follows that if E_{s_1∼µ} V^π(s_1) ≥ E_{s_1∼µ} V∗(s_1) − 0.05, then Pr[ a_1 ≠ a∗ ∧ s_1 ≠ a∗ ] · ∆_min ≤ 0.05. Hence

Pr[ a_1 = a∗ ] ≥ 1 − (0.8 + 1/m) = 0.2 − 1/m.

Therefore, if the algorithm is able to output such a policy with probability 0.1, it is able to take the action a∗ in the next episode with Θ(1) probability by executing π. However, as proved above, this is impossible unless KH = 2^Ω(min{d,H}).

D Ensuring Minimum Reachability

Based on the MDP with an auxiliary feature dimension in the previous section, we modify M_{a∗} in the following manner.

1. Remove the state a∗.

2. Create a deterministic initial state s; at state s, taking an action a ≠ a∗ leads to state a; taking a special action 0 leads to the terminal state f; taking action a∗ leads to state s again.

3. Set φ(s, 0) = (−1, 0) and φ(s, a) = (−1, v(a)).

4. R(s, a) = −1 − 2γ for a ∈ [m] \ {a∗}, R(s, a∗) = 0, and R(s, 0) = −1.

It can be seen that now η_min = 1, because to reach state a at level h, one can simply take the action sequence a∗, · · · , a∗, a. Realizability at the initial state s can also be verified (with θ∗ = (1, v(a∗)) as before).

E Proof of Theorem 2

Proof under Assumption 3. Let us set β = 8, λ_ridge = ε², λ_r = ε⁶, B = 2d log(d/λ_r), ε_1 = ε², ε_2 = λ_r/(2B), and N = d·log(1/ε_2)/ε_2². Recall that ε ≤ poly(∆_min, 1/C_var, 1/d, 1/H). First, by Lemma 8, the event Ω holds with probability 1 − ε; we condition on this event in the following proof. By Lemma 10, when the algorithm terminates, |Π_h| ≤ B for all h ∈ [H]. Note that this implies that Algorithm 1 is called or restarted at most H · (1 + B) times. In each call or restart of Algorithm 1, at most NB + N trajectories are sampled. Therefore, when the algorithm terminates, at most

H(1 + B) · (NB + N) ≤ poly(1/ε)

trajectories are sampled. It remains to show that the greedy policy with respect to θ_1, · · · , θ_H is indeed ε-optimal with high probability. To that end, let us state the following claims about the algorithm.

1. Each time Line 9 is reached in Algorithm 1, for every $\pi \in \Pi_h$, define $\pi_h$ as in (3); then for all $h' > h$,
\[
\mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_{h'}, a)^\top(\theta_h - \theta^*_h)\right|^2\right] \le \frac{\Delta^2_{\min}\varepsilon}{4H}. \qquad (5)
\]

2. Each time $\theta_h$ is updated at Line 17, for every $\pi \in \Pi_h$, define the associated covariance matrix at step $h$ as $\Sigma^\pi_h = \mathbb{E}_{s_h\sim D^\pi_h,\, a_h\sim\rho_{s_h}}\!\left[\phi(s_h, a_h)\phi(s_h, a_h)^\top\right]$. Then $\|\theta_h - \theta^*_h\|^2_{\Sigma^\pi_h} \le 6BC_{\mathrm{var}}\varepsilon^2$. It follows that
\[
\mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_h, a)^\top(\theta_h - \theta^*_h)\right|^2\right] \le \frac{\Delta^2_{\min}\varepsilon}{4H}. \qquad (6)
\]

Note that by the first claim with $h = 0$, it follows that for the greedy policy $\pi$ w.r.t. $\{\theta_h\}_{h\in[H]}$ ($\pi^0$ is always the greedy policy), for all $h \in [H]$,
\[
\mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_h, a)^\top(\theta_h - \theta^*_h)\right|^2\right] \le \frac{\Delta^2_{\min}\varepsilon}{4H}.
\]
Consequently, by Markov's inequality,
\[
\Pr_{s_h\sim D^\pi_h}\!\left[\exists a\in\mathcal{A}:\ \left|\phi(s_h, a)^\top(\theta_h - \theta^*_h)\right| > \frac{\Delta_{\min}}{2}\right] \le \frac{\varepsilon}{H}.
\]
By Assumption 2 and the fact that $\pi$ takes the greedy action w.r.t. $\theta_h$, this implies that
\[
\Pr_{s_h\sim D^\pi_h}\!\left[\pi_h(s_h) \ne \pi^*_h(s_h)\right] \le \frac{\varepsilon}{H}.
\]
Thus for a random trajectory induced by $\pi$, with probability at least $1-\varepsilon$, $\pi_h(s_h) = \pi^*_h(s_h)$ for all $h = 1, \dots, H$, which proves the theorem.
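For concreteness, the policy whose optimality is argued above is simply the greedy policy with respect to the estimated parameters $\{\theta_h\}$; the sketch below spells this out with a synthetic feature map (the feature map, dimensions, and parameters are stand-in assumptions, not the output of Algorithm 1).

```python
import numpy as np

# Greedy policy with respect to estimated parameters theta_1, ..., theta_H.
# Assumption: phi below is a synthetic stand-in for the true feature map.
d, H, num_actions = 5, 4, 3
rng = np.random.default_rng(0)
theta = [rng.standard_normal(d) for _ in range(H)]   # placeholder estimates

def phi(state, action):
    """Deterministic synthetic feature map phi(s, a) in R^d."""
    local = np.random.default_rng(1000 * state + action)
    return local.standard_normal(d)

def greedy_action(state, h):
    """pi_h(s) = argmax_a phi(s, a)^T theta_h.
    This matches pi*_h(s) whenever |phi(s, a)^T (theta_h - theta*_h)| < Delta_min / 2
    for every action a, which is exactly the event bounded above."""
    scores = [phi(state, a) @ theta[h] for a in range(num_actions)]
    return int(np.argmax(scores))

print([greedy_action(state=0, h=h) for h in range(H)])
```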

It remains to prove the two claims.

Proof of (6). We first prove the second claim, under the assumption that the first claim holds when Line 9 is reached in the same execution of LearnLevel. By the first claim and the same arguments as above, for every $\pi \in \Pi_h$, constructing $\pi_h$ as in (3), we have $\Pr_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\pi_h(s_{h'}) \ne \pi^*(s_{h'})\right] \le \varepsilon/H$. Thus,

\[
\mathbb{E}_{s_{h+1}\sim D^{\pi_h}_{h+1}}\!\left[V^{\pi_h}_{h+1}(s_{h+1})\right] \ge \mathbb{E}_{s_{h+1}\sim D^{\pi_h}_{h+1}}\!\left[V^*_{h+1}(s_{h+1})\right] - \varepsilon.
\]
By Assumption 3, this implies that
\[
\mathbb{E}_{s_{h+1}\sim D^{\pi_h}_{h+1}}\!\left[\left(V^{\pi_h}_{h+1}(s_{h+1}) - V^*_{h+1}(s_{h+1})\right)^2\right] \le C_{\mathrm{var}}\varepsilon^2.
\]

When $(s_h, a_h, y)$ is sampled,
\[
\mathbb{E}\left[y \mid s_h, a_h\right]
= \mathbb{E}\!\left[R(s_h, a_h) + V^{\pi_h}_{h+1}(s_{h+1}) \mid s_h, a_h\right]
= Q^*(s_h, a_h) + \mathbb{E}\!\left[V^{\pi_h}_{h+1}(s_{h+1}) - V^*_{h+1}(s_{h+1}) \mid s_h, a_h\right],
\]
where the expectation is over trajectories induced by $\pi_h$. In other words, $y_i := \sum_{h' \ge h} r^i_{h'}$ can be written as $\phi(s^i_h, a^i_h)^\top\theta^*_h + b_i + \xi_i$, where $\xi_i$ is mean-zero independent noise with $|\xi_i| \le 2$ almost surely, and $b_i := \sum_{h' > h} r^i_{h'} - V^*_{h+1}(s^i_{h+1})$ satisfies $\mathbb{E}[b_i^2] \le C_{\mathrm{var}}\varepsilon^2$. Note that $\theta_h$ is the ridge regression estimator for this linear model. By Lemma 7,
\[
\mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_h),\, s_h\sim D^\pi_h,\, a_h\sim\rho_{s_h}}\!\left[\left|\phi(s_h, a_h)^\top(\theta_h - \theta^*_h)\right|^2\right] \le 4\left(C_{\mathrm{var}}\varepsilon^2 + \varepsilon_1 + \lambda_{\mathrm{ridge}}\right) \le 6C_{\mathrm{var}}\varepsilon^2.
\]


It follows that for every $\pi \in \Pi_h$,
\[
\mathbb{E}_{s_h\sim D^\pi_h,\, a_h\sim\rho_{s_h}}\!\left[\left|\phi(s_h, a_h)^\top(\theta_h - \theta^*_h)\right|^2\right] \le |\Pi_h|\cdot 6C_{\mathrm{var}}\varepsilon^2 \le 6BC_{\mathrm{var}}\varepsilon^2.
\]

Now, by Lemma 5,
\[
\begin{aligned}
\mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_h, a)^\top(\theta_h - \theta^*_h)\right|^2\right]
&\le \mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\|\phi(s_h, a)\|^2_{(\Sigma^\pi_h)^{-1}}\right]\cdot\|\theta_h - \theta^*_h\|^2_{\Sigma^\pi_h} \\
&\le d^2\cdot 6BC_{\mathrm{var}}\varepsilon^2 \le \frac{\Delta^2_{\min}\varepsilon}{4H}.
\end{aligned}
\]
This proves the second claim.

Proof of (5). Now, let us prove the first claim, assuming that the second claim holds for the last update of any $\theta_h$. By observing Algorithm 1, if Line 9 is reached, then during the last execution of the first for loop (i.e., Lines 1 to 8), the if clause at Line 5 must have returned False every time (otherwise the algorithm would restart). It follows that during the last execution of Lines 1 to 8, neither $\{\theta_h\}_{h\in[H]}$ nor $\{\Pi_h\}_{h\in[H]}$ is updated.

Consider the if clause when checking $\pi \in \Pi_h$ for layer $h'$. Recall that
\[
\Sigma^{\pi_h}_{h'} = \mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'},\, a_{h'}\sim\rho_{s_{h'}}}\!\left[\phi(s_{h'}, a_{h'})\phi(s_{h'}, a_{h'})^\top\right].
\]
Also define $\Sigma^*_{h'} := \frac{\lambda_r}{|\Pi_{h'}|}I + \mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_{h'})}\Sigma^{\pi}_{h'}$. Then by Lemma 9,
\[
\left\|(\Sigma^*_{h'})^{-\frac{1}{2}}\,\Sigma^{\pi_h}_{h'}\,(\Sigma^*_{h'})^{-\frac{1}{2}}\right\|_2 \le 3\beta|\Pi_{h'}|.
\]

It follows that
\[
\begin{aligned}
\|\theta_h - \theta^*_h\|^2_{\Sigma^{\pi_h}_{h'}}
&= (\theta_h - \theta^*_h)^\top\Sigma^{\pi_h}_{h'}(\theta_h - \theta^*_h) \\
&= \left((\Sigma^*_{h'})^{\frac{1}{2}}(\theta_h - \theta^*_h)\right)^\top\left((\Sigma^*_{h'})^{-\frac{1}{2}}\Sigma^{\pi_h}_{h'}(\Sigma^*_{h'})^{-\frac{1}{2}}\right)\left((\Sigma^*_{h'})^{\frac{1}{2}}(\theta_h - \theta^*_h)\right) \\
&\le \|\theta_h - \theta^*_h\|^2_{\Sigma^*_{h'}}\cdot\left\|(\Sigma^*_{h'})^{-\frac{1}{2}}\Sigma^{\pi_h}_{h'}(\Sigma^*_{h'})^{-\frac{1}{2}}\right\|_2 \\
&\le 3\beta B\cdot\left(\lambda_r\cdot\left(\frac{2}{\lambda_{\mathrm{ridge}}}\right)^2 + 6BC_{\mathrm{var}}\varepsilon^2\right) \le 24B^2\cdot 10\,C_{\mathrm{var}}\varepsilon^2 = 240B^2C_{\mathrm{var}}\varepsilon^2.
\end{aligned}
\]

By Lemma 5,
\[
\mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\sup_{a\in\mathcal{A}}\|\phi(s_{h'}, a)\|^2_{(\Sigma^{\pi_h}_{h'})^{-1}}\right] \le d^2.
\]

As a result,
\[
\begin{aligned}
\mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_{h'}, a)^\top(\theta_h - \theta^*_h)\right|^2\right]
&\le \mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\|\theta_h - \theta^*_h\|^2_{\Sigma^{\pi_h}_{h'}}\cdot\sup_{a\in\mathcal{A}}\|\phi(s_{h'}, a)\|^2_{(\Sigma^{\pi_h}_{h'})^{-1}}\right] \\
&\le 240B^2C_{\mathrm{var}}\varepsilon^2\cdot d^2 \le \frac{\varepsilon\Delta^2_{\min}}{4H}.
\end{aligned}
\]
This proves the first claim. The failure probability of the algorithm is controlled by Lemma 8.


Proof under Assumption 4. The proof under Assumption 4 is quite similar, except that we use Lemma 14 instead of Lemma 7 for the analysis of ridge regression.

Let us set $\beta = 8$, $\varepsilon_0 = \varepsilon^2$, $\lambda_{\mathrm{ridge}} = \varepsilon^3$, $\lambda_r = \varepsilon^9$, $B = 2d\log(d/\lambda_r)$, $\varepsilon_1 = \varepsilon^3$, $\varepsilon_2 = \frac{\lambda_r}{2B}$, and $N = \frac{d}{\varepsilon_2^3}$. Recall that $\varepsilon \le \mathrm{poly}(\Delta_{\min}, 1/C_{\mathrm{hyper}}, 1/d, 1/H)$. We will state similar claims about the algorithm.

1. Each time Line 9 is reached in Algorithm 1, for every $\pi \in \Pi_h$, define $\pi_h$ as in (3); then for all $h' > h$,
\[
\mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_{h'}, a)^\top(\theta_h - \theta^*_h)\right|^2\right] \le \frac{\Delta^2_{\min}\varepsilon_0}{4H}. \qquad (7)
\]

2. Each time $\theta_h$ is updated at Line 17, for every $\pi \in \Pi_h$, define the associated covariance matrix at step $h$ as $\Sigma^\pi_h = \mathbb{E}_{s_h\sim D^\pi_h,\, a_h\sim\rho_{s_h}}\!\left[\phi(s_h, a_h)\phi(s_h, a_h)^\top\right]$. Then $\|\theta_h - \theta^*_h\|^2_{\Sigma^\pi_h} \le \frac{\Delta^2_{\min}\varepsilon_0}{120HBd^2}$. It follows that
\[
\mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_h, a)^\top(\theta_h - \theta^*_h)\right|^2\right] \le \frac{\Delta^2_{\min}\varepsilon_0}{4H}. \qquad (8)
\]

We now prove the two claims in similar fashion.

Proof of (8). We first prove the second claim, under the assumption that the first claim holds when Line 9 is reached in the same execution of LearnLevel. By the first claim, for every $\pi \in \Pi_h$, constructing $\pi_h$ as in (3), we have
\[
\Pr_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\pi_h(s_{h'}) \ne \pi^*(s_{h'})\right] \le \varepsilon_0/H. \qquad (9)
\]

When $(s_h, a_h, y)$ is sampled,
\[
\mathbb{E}\left[y \mid s_h, a_h\right]
= \mathbb{E}\!\left[R(s_h, a_h) + V^{\pi_h}_{h+1}(s_{h+1}) \mid s_h, a_h\right]
= Q^*(s_h, a_h) + \mathbb{E}\!\left[V^{\pi_h}_{h+1}(s_{h+1}) - V^*_{h+1}(s_{h+1}) \mid s_h, a_h\right],
\]
where the expectation is over trajectories induced by $\pi_h$. In other words, $y_i := \sum_{h' \ge h} r^i_{h'}$ can be written as $\phi(s^i_h, a^i_h)^\top\theta^*_h + b_i + \xi_i$, where $\xi_i$ is mean-zero independent noise with $|\xi_i| \le 2$ almost surely, and $b_i$ is defined as
\[
b_i := -\sum_{h' > h}\left(V^*(s^i_{h'}) - Q^*(s^i_{h'}, a^i_{h'})\right).
\]

Here $\mathbb{E}[\xi_i] = 0$ because
\[
\mathbb{E}[\xi_i]
= \mathbb{E}\Big[\sum_{h' \ge h} r^i_{h'}\Big] - Q^*_h(s^i_h, a^i_h) - \mathbb{E}[b_i]
= Q^{\pi_h}(s^i_h, a^i_h) - Q^*(s^i_h, a^i_h) + \left(Q^*(s^i_h, a^i_h) - Q^{\pi_h}(s^i_h, a^i_h)\right)
= 0.
\]

By (9), $\Pr[b_i \ne 0] \le \varepsilon_0$. Thus by Lemma 14,
\[
\mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_h),\, s_h\sim D^\pi_h,\, a_h\sim\rho_{s_h}}\!\left[\left|\phi(s_h, a_h)^\top(\theta_h - \theta^*_h)\right|^2\right]
\le 8\left(\varepsilon_1 + \lambda_{\mathrm{ridge}}\right) + 288\,\varepsilon_0^{1.5}C_{\mathrm{hyper}}^{2.5}d^{4.5}\left(\frac{2B}{\varepsilon}\right)^{0.5}
\le 16\varepsilon^3 + 288\,\varepsilon^{2.5}C_{\mathrm{hyper}}^{2.5}d^{4.5}(2B)^{0.5}.
\]

It follows that for every $\pi \in \Pi_h$,
\[
\mathbb{E}_{s_h\sim D^\pi_h,\, a_h\sim\rho_{s_h}}\!\left[\left|\phi(s_h, a_h)^\top(\theta_h - \theta^*_h)\right|^2\right]
\le |\Pi_h|\cdot\left(16\varepsilon^3 + 288\,\varepsilon^{2.5}C_{\mathrm{hyper}}^{2.5}d^{4.5}(2B)^{0.5}\right)
\le \frac{\Delta^2_{\min}\varepsilon_0}{120HBd^2},
\]

where we used the fact that $\varepsilon \le \mathrm{poly}(\Delta_{\min}, 1/C_{\mathrm{hyper}}, 1/d, 1/H)$. Now, by Lemma 5,
\[
\begin{aligned}
\mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_h, a)^\top(\theta_h - \theta^*_h)\right|^2\right]
&\le \mathbb{E}_{s_h\sim D^\pi_h}\!\left[\sup_{a\in\mathcal{A}}\|\phi(s_h, a)\|^2_{(\Sigma^\pi_h)^{-1}}\right]\cdot\|\theta_h - \theta^*_h\|^2_{\Sigma^\pi_h} \\
&\le d^2\cdot\frac{\Delta^2_{\min}\varepsilon_0}{120HBd^2} \le \frac{\Delta^2_{\min}\varepsilon_0}{4H}.
\end{aligned}
\]
This proves the second claim.


Proof of (7). Now, let us prove the first claim, assuming that the second claim holds for the last update of any $\theta_h$. Consider Line 9 when checking $\pi \in \Pi_h$ for layer $h'$. Recall that
\[
\Sigma^{\pi_h}_{h'} = \mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'},\, a_{h'}\sim\rho_{s_{h'}}}\!\left[\phi(s_{h'}, a_{h'})\phi(s_{h'}, a_{h'})^\top\right].
\]
Similarly to the proof under Assumption 3, we can bound $\|\theta_h - \theta^*_h\|^2_{\Sigma^{\pi_h}_{h'}}$ by

\[
\begin{aligned}
\|\theta_h - \theta^*_h\|^2_{\Sigma^{\pi_h}_{h'}}
&\le \|\theta_h - \theta^*_h\|^2_{\Sigma^*_{h'}}\cdot\left\|(\Sigma^*_{h'})^{-\frac{1}{2}}\Sigma^{\pi_h}_{h'}(\Sigma^*_{h'})^{-\frac{1}{2}}\right\|_2 \\
&\le 3\beta B\cdot\left(\lambda_r\cdot\left(\frac{2}{\lambda_{\mathrm{ridge}}}\right)^2 + \frac{\Delta^2_{\min}\varepsilon_0}{120HBd^2}\right) \le 96B\varepsilon^3 + \frac{\Delta^2_{\min}\varepsilon_0}{5Hd^2}.
\end{aligned}
\]

By Lemma 5, $\mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\sup_{a\in\mathcal{A}}\|\phi(s_{h'}, a)\|^2_{(\Sigma^{\pi_h}_{h'})^{-1}}\right] \le d^2$. Consequently,

\[
\begin{aligned}
\mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\sup_{a\in\mathcal{A}}\left|\phi(s_{h'}, a)^\top(\theta_h - \theta^*_h)\right|^2\right]
&\le \mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'}}\!\left[\|\theta_h - \theta^*_h\|^2_{\Sigma^{\pi_h}_{h'}}\cdot\sup_{a\in\mathcal{A}}\|\phi(s_{h'}, a)\|^2_{(\Sigma^{\pi_h}_{h'})^{-1}}\right] \\
&\le 96B\varepsilon^3 d^2 + \frac{\Delta^2_{\min}\varepsilon_0}{5H} \le \frac{\Delta^2_{\min}\varepsilon_0}{4H}.
\end{aligned}
\]
In the last inequality we used $\varepsilon_0 = \varepsilon^2$ and $\varepsilon \le \mathrm{poly}(\Delta_{\min}, 1/d, 1/H)$. This proves (7). Finally, the failure probability is controlled by Lemma 8.

Lemma 6 (Covariance concentration [Tropp, 2015]). Suppose $M_1, \dots, M_N \in \mathbb{R}^{d\times d}$ are i.i.d. random matrices drawn from a distribution $\mathcal{D}$ over positive semi-definite matrices. If $\|M_t\|_F \le 1$ almost surely and $N = \Omega\!\left(\frac{d\log(d/\delta)}{\varepsilon^2}\right)$, then with probability $1-\delta$,
\[
\left\|\frac{1}{N}\sum_{t=1}^{N}M_t - \mathbb{E}_{M\sim\mathcal{D}}[M]\right\|_2 \le \varepsilon.
\]

Lemma 7 (Risk bound for ridge regression, Lemma A.2 in Du et al. [2019c]). Suppose that $(x_1, y_1), \dots, (x_N, y_N)$ are i.i.d. data drawn from $\mathcal{D}$ with
\[
y_i = \theta^\top x_i + b_i + \xi_i,
\]
where $\mathbb{E}_{(x_i, y_i)\sim\mathcal{D}}[b_i^2] \le \eta$, $|\xi_i| \le 2$ almost surely, and $\mathbb{E}[\xi_i] = 0$. Let the ridge regression estimator be
\[
\hat\theta = \left(\sum_{i=1}^{N}x_ix_i^\top + N\lambda_{\mathrm{ridge}}\cdot I\right)^{-1}\cdot\sum_{i=1}^{N}x_iy_i.
\]
If $N = \Omega\!\left(\frac{d}{\varepsilon_N^2}\log\!\left(\frac{d}{\delta}\right)\right)$, then with probability at least $1-\delta$,
\[
\mathbb{E}_{x\sim\mathcal{D}}\!\left[\left((\hat\theta - \theta)^\top x\right)^2\right] \le 4\left(\eta + \varepsilon_N + \lambda_{\mathrm{ridge}}\right).
\]

Lemma 8 (Failure probability). Define the following events regarding the execution of Algorithm 1.

1. $\Omega_1$: Each time $\Sigma_h$ is updated,
\[
\left\|\Sigma_h - \mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_h),\,s_h\sim D^\pi_h,\,a_h\sim\rho_{s_h}}\!\left[\phi(s_h, a_h)\phi(s_h, a_h)^\top\right]\right\|_2 \le \varepsilon_2. \qquad (10)
\]


2. $\Omega_2$: Each time $\theta_h$ is updated,
\[
\mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_h),\,s\sim D^\pi_h,\,a\sim\rho_{s}}\!\left[\left((\theta_h - \theta^*_h)^\top\phi(s, a)\right)^2\right] \le 4\left(\eta + \varepsilon_1 + \lambda_{\mathrm{ridge}}\right), \qquad (11)
\]
where $\eta$ is defined as in Lemma 7.

3. $\Omega_3$: Each time $\theta_h$ is updated,
\[
\mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_h),\,s\sim D^\pi_h,\,a\sim\rho_{s}}\!\left[\left((\theta_h - \theta^*_h)^\top\phi(s, a)\right)^2\right] \le 288\,\eta^{1.5}C^{2.5}d^{4.5}\left(\frac{2B}{\varepsilon}\right)^{0.5}, \qquad (12)
\]
where $\eta$ and $C$ are defined as in Lemma 14.

Then under Assumption 3, $\Pr[\Omega_1\cap\Omega_2] \ge 1-\varepsilon$. Alternatively, under Assumption 4, $\Pr[\Omega_1\cap\Omega_3] \ge 1-\varepsilon$.

Proof. Note that $N \ge \frac{d\log(1/\varepsilon_2)}{\varepsilon_2^2}$, where $\varepsilon_2 \le \frac{\varepsilon^6}{d}$. Therefore, by Lemma 6, each time $\Sigma_h$ is updated, (10) holds with probability at least $1-\varepsilon_2$.

As for (11), note that $N \ge \frac{d\log(1/\varepsilon_2)}{\varepsilon_2^2} \ge \frac{d}{\varepsilon_1^2}\log\!\left(\frac{d}{\varepsilon_2}\right)$. Thus by Lemma 7, each time $\theta_h$ is updated, (11) holds with probability at least $1-\varepsilon_2$.

Similarly, for (12), under the choice of parameters under Assumption 4, $N \ge \frac{d}{\varepsilon_2^3} \ge \left(\frac{d}{\varepsilon_2^2} + \frac{1}{\eta}\right)\ln\frac{2dB}{\varepsilon} + \frac{2B}{\varepsilon}$. Thus by Lemma 14, the probability that (12) is violated at any given update is at most $\varepsilon/(2B)$.

Note that when the algorithm terminates, $\Sigma_h$ and $\theta_h$ have been updated at most $|\Pi_h|$ times. Also note that if, during the first $B$ updates, neither (10) nor (11) is violated, then by Lemma 10 it follows that $|\Pi_h| \le B$ when the algorithm terminates. In other words,
\[
\Pr[\Omega_1\cap\Omega_2] \ge 1 - B\cdot 2\varepsilon_2 \ge 1-\varepsilon.
\]
Similarly, under Assumption 4,
\[
\Pr[\Omega_1\cap\Omega_3] \ge 1 - B\cdot\varepsilon_2 - B\cdot\frac{\varepsilon}{2B} \ge 1-\varepsilon.
\]

Lemma 9 (Distribution shift error checking). Assume that $\varepsilon_2 < \min\left\{\frac{1}{2}\beta\lambda_r,\ \frac{\lambda_r}{2B}\right\}$. Consider the if clause when checking $\pi_h \in \Pi_h$ for layer $h'$, i.e., when computing $\left\|\Sigma_{h'}^{-\frac{1}{2}}\hat\Sigma_{h'}\Sigma_{h'}^{-\frac{1}{2}}\right\|_2$, where $\Sigma_{h'}$ denotes the regularized empirical covariance matrix maintained for $\Pi_{h'}$ and $\hat\Sigma_{h'}$ denotes the empirical covariance matrix collected for $\pi_h$ at layer $h'$. Define
\[
M_1 := \frac{\lambda_r}{|\Pi_{h'}|}I + \mathbb{E}_{\pi\sim\mathrm{Unif}(\Pi_{h'}),\,s_{h'}\sim D^\pi_{h'},\,a_{h'}\sim\rho_{s_{h'}}}\!\left[\phi(s_{h'}, a_{h'})\phi(s_{h'}, a_{h'})^\top\right],
\]
and
\[
M_2 := \mathbb{E}_{s_{h'}\sim D^{\pi_h}_{h'},\,a_{h'}\sim\rho_{s_{h'}}}\!\left[\phi(s_{h'}, a_{h'})\phi(s_{h'}, a_{h'})^\top\right].
\]
Then under the event $\Omega$ defined in Lemma 8, when $\left\|\Sigma_{h'}^{-\frac{1}{2}}\hat\Sigma_{h'}\Sigma_{h'}^{-\frac{1}{2}}\right\|_2 \le \beta|\Pi_{h'}|$,
\[
\left\|M_1^{-1/2}M_2M_1^{-1/2}\right\|_2 \le 3\beta|\Pi_{h'}|,
\]
and when $\left\|\Sigma_{h'}^{-\frac{1}{2}}\hat\Sigma_{h'}\Sigma_{h'}^{-\frac{1}{2}}\right\|_2 \ge \beta|\Pi_{h'}|$,
\[
\left\|M_1^{-1/2}M_2M_1^{-1/2}\right\|_2 \ge \frac{1}{4}\beta|\Pi_{h'}|.
\]

Proof. By Lemma 6,
\[
\|M_1 - \Sigma_{h'}\|_2 \le \varepsilon_2 \le \frac{\lambda_r}{2B} \le \frac{1}{2}\lambda_{\min}(\Sigma_{h'}).
\]


Thus $\frac{1}{2}\Sigma_{h'} \preceq M_1 \preceq 2\Sigma_{h'}$. Also by Lemma 6, $\|M_2 - \hat\Sigma_{h'}\|_2 \le \varepsilon_2$. Therefore, if $\left\|\Sigma_{h'}^{-\frac{1}{2}}\hat\Sigma_{h'}\Sigma_{h'}^{-\frac{1}{2}}\right\|_2 \ge \beta|\Pi_{h'}|$, then
\[
\begin{aligned}
\left\|M_1^{-1/2}M_2M_1^{-1/2}\right\|_2
&\ge \frac{1}{2}\left\|\Sigma_{h'}^{-1/2}M_2\Sigma_{h'}^{-1/2}\right\|_2
\ge \frac{1}{2}\left\|\Sigma_{h'}^{-1/2}\hat\Sigma_{h'}\Sigma_{h'}^{-1/2}\right\|_2 - \frac{1}{2}\varepsilon_2\left\|\Sigma_{h'}^{-1}\right\|_2 \\
&\ge \frac{1}{2}\beta|\Pi_{h'}| - \frac{1}{2}\varepsilon_2\cdot\frac{|\Pi_{h'}|}{\lambda_r} \ge \frac{1}{4}\beta|\Pi_{h'}|.
\end{aligned}
\]
Similarly, when $\left\|\Sigma_{h'}^{-\frac{1}{2}}\hat\Sigma_{h'}\Sigma_{h'}^{-\frac{1}{2}}\right\|_2 \le \beta|\Pi_{h'}|$,
\[
\begin{aligned}
\left\|M_1^{-1/2}M_2M_1^{-1/2}\right\|_2
&\le 2\left\|\Sigma_{h'}^{-1/2}M_2\Sigma_{h'}^{-1/2}\right\|_2
\le 2\left\|\Sigma_{h'}^{-1/2}\hat\Sigma_{h'}\Sigma_{h'}^{-1/2}\right\|_2 + 2\varepsilon_2\left\|\Sigma_{h'}^{-1}\right\|_2 \\
&\le 2\beta|\Pi_{h'}| + 2\varepsilon_2\cdot\frac{|\Pi_{h'}|}{\lambda_r} \le 3\beta|\Pi_{h'}|.
\end{aligned}
\]
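The quantity appearing in the if clause and in Lemma 9 is a plain matrix computation; a short numpy sketch of it (with synthetic covariance matrices standing in for the empirical ones maintained by the algorithm) is:

```python
import numpy as np

# Distribution-shift check of Lemma 9: compute ||Sigma^{-1/2} Sigma_hat Sigma^{-1/2}||_2
# and compare it with beta * |Pi_{h'}|. All matrices below are synthetic stand-ins.
rng = np.random.default_rng(2)
d, beta, num_policies, lam_r = 6, 8, 4, 1e-3

A = rng.standard_normal((500, d)) / np.sqrt(d)
Sigma = (lam_r / num_policies) * np.eye(d) + A.T @ A / A.shape[0]  # covariance for Pi_{h'}
Bmat = rng.standard_normal((500, d)) / np.sqrt(d)
Sigma_hat = Bmat.T @ Bmat / Bmat.shape[0]                          # covariance for the new policy

# Sigma^{-1/2} via eigendecomposition (Sigma is symmetric positive definite).
evals, evecs = np.linalg.eigh(Sigma)
Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T

shift = np.linalg.norm(Sigma_inv_half @ Sigma_hat @ Sigma_inv_half, ord=2)
print(f"shift = {shift:.2f}, threshold = {beta * num_policies}")
# A shift above the threshold indicates that the new policy's state distribution is
# not covered by Pi_{h'}; in that case the if clause triggers and the algorithm
# restarts (cf. the discussion of Line 5 above). Otherwise the check passes.
```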

Lemma 10 (Lemma A.6 in Du et al. [2019c]). Under the event $\Omega_1$ defined in Lemma 8, $|\Pi_h| \le B$ for all $h \in [H]$.

F Analysis of Ridge Regression under Hypercontractivity

Recall that a distribution $\mathcal{D}$ is $(C, 4)$-hypercontractive if for all $v$,
\[
\mathbb{E}_{x\sim\mathcal{D}}\!\left[(x^\top v)^4\right] \le C\cdot\left(\mathbb{E}_{x\sim\mathcal{D}}\!\left[(x^\top v)^2\right]\right)^2.
\]

In this section we prove a strengthened version of Lemma 7 for hypercontractive distributions (Lemma 14), which may be of independent interest.
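As a quick numerical illustration of the definition (not part of the proofs), the sketch below estimates the ratio $\mathbb{E}[(x^\top v)^4]/(\mathbb{E}[(x^\top v)^2])^2$ over random directions for a Gaussian design, for which every direction gives exactly $3$, so the constant $C = 3$ suffices; the sample sizes are arbitrary assumptions.

```python
import numpy as np

# Empirical check of (C, 4)-hypercontractivity: for each direction v, compare
# E[(x^T v)^4] with (E[(x^T v)^2])^2. For a Gaussian design the ratio is 3.
rng = np.random.default_rng(0)
d, n, num_directions = 8, 200_000, 50
X = rng.standard_normal((n, d))                 # isotropic Gaussian samples

worst_ratio = 0.0
for _ in range(num_directions):
    v = rng.standard_normal(d)
    proj = X @ v
    ratio = np.mean(proj ** 4) / np.mean(proj ** 2) ** 2
    worst_ratio = max(worst_ratio, ratio)

print(f"largest empirical 4th-to-2nd moment ratio: {worst_ratio:.3f}  (Gaussian value: 3)")
```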

Lemma 11. Let $x$ be a $d$-dimensional random vector. If the distribution of $x$ is $(C, 4)$-hypercontractive and isotropic (i.e., $\mathbb{E}[xx^\top] = I$), then
\[
\Pr\!\left[\|x\|_2 > t\right] \le \frac{Cd^2}{t^4}.
\]

Proof. Consider a Gaussian random vector $v\sim N(0, I)$. Then
\[
\mathbb{E}_v\!\left[(x^\top v)^4\right] = \|x\|^4\cdot\mathbb{E}_{\xi\sim N(0,1)}\!\left[\xi^4\right] = 3\|x\|^4.
\]
Therefore,
\[
\mathbb{E}_x\!\left[\|x\|^4\right]
= \frac{1}{3}\mathbb{E}_{x,v}\!\left[(x^\top v)^4\right]
\le \frac{C}{3}\mathbb{E}_v\!\left[\left(\mathbb{E}_x\!\left[(x^\top v)^2\right]\right)^2\right]
\le \frac{C}{3}\mathbb{E}_v\!\left[\|v\|^4\right]
= \frac{C\cdot(d^2+2d)}{3} \le d^2 C.
\]
The claim then follows from Markov's inequality.

Lemma 12. Suppose $x_1, \dots, x_n$ are i.i.d. samples from a $(C, 4)$-hypercontractive and isotropic distribution. Let $\sigma(\cdot)$ denote the decreasing order of $\|x_i\|_2$. Then with probability $1-\delta$,
\[
\sum_{k=1}^{m}\|x_{\sigma(k)}\|_2 \le 3\delta^{-1/4}n^{1/4}m^{3/4}C^{1/4}d^{1/2}.
\]

Proof. Fix $k \in [m]$. Set $t = \alpha\left(\frac{Cd^2 n}{k}\right)^{1/4}$. By Lemma 11,
\[
\Pr\!\left[\|x_{\sigma(k)}\|_2 > t\right]
\le \binom{n}{k}\Pr\!\left[\|x\| > t\right]^{k}
\le \binom{n}{k}\cdot\left(\frac{Cd^2}{t^4}\right)^{k}
\le \frac{n^k}{k!}\cdot\frac{k^k}{\alpha^{4k}n^k}
\le \left(\frac{e}{\alpha^4}\right)^{k}.
\]


Choosing $\alpha = \left(\frac{2e}{\delta}\right)^{1/4}$ gives $\Pr\!\left[\|x_{\sigma(k)}\|_2 > t\right] \le (\delta/2)^k$. By a union bound, with probability $1-\delta$,
\[
\sum_{k=1}^{m}\|x_{\sigma(k)}\|_2
\le \sum_{k=1}^{m}(2e/\delta)^{1/4}\left(\frac{Cd^2 n}{k}\right)^{1/4}
\le 3\delta^{-1/4}n^{1/4}m^{3/4}C^{1/4}d^{1/2}.
\]

Lemma 13 (Lemma 3.4 in Bakshi and Prasad [2020]). Suppose $\mathcal{D}$ is $(C, 4)$-hypercontractive and $x_1, \dots, x_n$ are i.i.d. samples drawn from $\mathcal{D}$. Let $\Sigma := \mathbb{E}_{x\sim\mathcal{D}}[xx^\top]$. Then with probability $1-\delta$,
\[
\left(1 - \frac{Cd^2}{\sqrt{n\delta}}\right)\Sigma \preceq \frac{1}{n}\sum_{i=1}^{n}x_ix_i^\top \preceq \left(1 + \frac{Cd^2}{\sqrt{n\delta}}\right)\Sigma.
\]

Lemma 14 (Risk bound for ridge regression with hypercontractivity). Suppose that $(x_1, y_1), \dots, (x_N, y_N)$ are i.i.d. data drawn from $\mathcal{D}$ with
\[
y_i = \theta^\top x_i + b_i + \xi_i,
\]
where $\Pr[b_i \ne 0] \le \eta$, $\|b\|_\infty \le 1$, $|\xi_i| \le 1$, and $\mathbb{E}[\xi_i] = 0$. Assume that the distribution of $x$ is $(C, 4)$-hypercontractive (see Assumption 4). Let the ridge regression estimator be
\[
\hat\theta = \left(\sum_{i=1}^{N}x_ix_i^\top + N\lambda_{\mathrm{ridge}}\cdot I\right)^{-1}\cdot\sum_{i=1}^{N}x_iy_i.
\]
If $N = \Omega\!\left(\left(\frac{d}{\varepsilon_N^2} + \frac{1}{\eta}\right)\log\!\left(\frac{d}{\delta}\right) + \frac{1}{\delta}\right)$, then with probability at least $1-\delta$,
\[
\mathbb{E}_{x\sim\mathcal{D}}\!\left[\left((\hat\theta - \theta)^\top x\right)^2\right] \le 8\left(\varepsilon_N + \lambda_{\mathrm{ridge}}\right) + 288\,\eta^{1.5}C^{2.5}d^{4.5}\delta^{-0.5}.
\]

Proof. Define $\hat\Sigma := \frac{1}{N}\sum_{i=1}^{N}x_ix_i^\top$ and $\Sigma := \mathbb{E}_{x\sim\mathcal{D}}[xx^\top]$. Then
\[
\begin{aligned}
\hat\theta &= \frac{1}{N}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\sum_{i=1}^{N}\left(x_ix_i^\top\theta + x_i\xi_i + x_ib_i\right) \\
&= \underbrace{\frac{1}{N}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\sum_{i=1}^{N}b_ix_i}_{(a)}
\;+\; \underbrace{\frac{1}{N}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\sum_{i=1}^{N}\left(x_ix_i^\top\theta + x_i\xi_i\right)}_{(b)}.
\end{aligned}
\]

By Lemma 7, $\|\theta - (b)\|^2_\Sigma \le 4(\varepsilon_N + \lambda_{\mathrm{ridge}})$. It remains to bound the $\|\cdot\|_\Sigma$ norm of $(a)$. First, by Hoeffding's inequality, with probability $1-\delta$, $\|b\|_0 = \sum_{i=1}^{N}\mathbb{I}[b_i \ne 0] \le 2\eta N$. Define $z_i := \Sigma^{-1/2}x_i$ to be the normalized input. It can be seen that $\mathbb{E}[z_iz_i^\top] = I$ and that the distribution of $z_i$ is also $(C, 4)$-hypercontractive. By Lemma 12, with probability $1-2\delta$,
\[
\sum_{i=1}^{N}\|z_i\|_2\cdot\mathbb{I}[b_i \ne 0] \le 3\delta^{-1/4}N^{1/4}(2\eta N)^{3/4}(Cd^2)^{1/4}.
\]


It follows that with probability $1-2\delta$,
\[
\begin{aligned}
\|(a)\|_\Sigma
&= \frac{1}{N}\left\|\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\sum_{i=1}^{N}x_ib_i\right\|_\Sigma
\le \frac{1}{N}\sum_{i=1}^{N}\left\|\Sigma^{1/2}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}x_ib_i\right\|_2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\left\|\Sigma^{1/2}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\Sigma^{1/2}z_ib_i\right\|_2 \\
&\le \frac{1}{N}\left\|\Sigma^{1/2}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\Sigma^{1/2}\right\|_2\cdot\sum_{i=1}^{N}\|z_i\|_2\cdot\|b\|_\infty\cdot\mathbb{I}[b_i \ne 0] \\
&\le 3\left(1 + \frac{Cd^2}{\sqrt{N\delta}}\right)\cdot\delta^{-1/4}N^{-3/4}(2\eta N)^{3/4}(Cd^2)^{1/4} \\
&\le 12\,\eta^{0.75}\,C^{\frac{5}{4}}d^{\frac{9}{4}}\,\delta^{-\frac{1}{4}},
\end{aligned}
\]
where the second-to-last inequality uses Lemma 13 to bound $\left\|\Sigma^{1/2}\left(\lambda_{\mathrm{ridge}}I + \hat\Sigma\right)^{-1}\Sigma^{1/2}\right\|_2$ together with $\|b\|_\infty \le 1$.

Therefore,
\[
\|\hat\theta - \theta\|^2_\Sigma \le 2\|\theta - (b)\|^2_\Sigma + 2\|(a)\|^2_\Sigma \le 8\left(\varepsilon_N + \lambda_{\mathrm{ridge}}\right) + 288\,\eta^{1.5}C^{2.5}d^{4.5}\delta^{-0.5}.
\]
