Breaking the Sample Size Barrier in Model-Based Reinforcement Learning Yuting Wei Carnegie Mellon University Nov, 2020


Breaking the Sample Size Barrier in Model-Based Reinforcement Learning

Yuting Wei

Carnegie Mellon University

Nov, 2020


Gen Li

Tsinghua EE

Yuejie Chi

CMU ECE

Yuantao Gu

Tsinghua EE

Yuxin Chen

Princeton EE


Reinforcement learning (RL)


RL challenges

• Unknown or changing environment

• Credit assignment problem

• Enormous state and action space


Provable efficiency

• Collecting samples might be expensive or impossible: sample efficiency

• Training deep RL algorithms might take a long time: computational efficiency


This talk

Question: can we design sample- and computation-efficient RL algorithms?

Inspired by numerous prior works:

[Kearns and Singh, 1999, Sidford et al., 2018a, Agarwal et al., 2019]...


Background: Markov decision processes


Markov decision process (MDP)

• $\mathcal{S}$: state space

• $\mathcal{A}$: action space

• $r(s, a) \in [0, 1]$: immediate reward

• $\pi(\cdot \mid s)$: policy (or action selection rule)

• $P(\cdot \mid s, a)$: unknown transition probabilities


Help the mouse!

• state space $\mathcal{S}$: positions in the maze

• action space $\mathcal{A}$: up, down, left, right

• immediate reward $r$: cheese, electric shocks, cats

• policy $\pi(\cdot \mid s)$: the way to find cheese


Value function

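Value function of policy $\pi$: long-term discounted reward

$$\forall s \in \mathcal{S}: \quad V^{\pi}(s) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s \right]$$

• $\gamma \in [0, 1)$: discount factor

• $(a_0, s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$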
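
As a quick sanity check on this definition (a short derivation added for illustration, not taken from the slides): because every reward lies in $[0, 1]$, the geometric series bounds any value function. This is why $\frac{1}{1-\gamma}$ plays the role of an effective horizon in the bounds that follow.

```latex
0 \;\le\; V^{\pi}(s)
  = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s \Big]
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t}
  = \frac{1}{1-\gamma}.
```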


Action-value function (a.k.a. Q-function)

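Q-function of policy $\pi$

$$\forall (s, a) \in \mathcal{S} \times \mathcal{A}: \quad Q^{\pi}(s, a) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s, a_0 = a \right]$$

• $(s_1, a_1, s_2, a_2, \cdots)$: generated under policy $\pi$ (the initial action $a_0 = a$ is fixed rather than drawn from $\pi$)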


Optimal policy

• optimal policy $\pi^{\star}$: maximizing value function

• optimal value / Q function: $V^{\star} := V^{\pi^{\star}}$; $Q^{\star} := Q^{\pi^{\star}}$


Practically, learn the optimal policy from data samples . . .


This talk: sampling from a generative model

For each state-action pair $(s, a)$, collect $N$ samples $\{(s, a, s'_{(i)})\}_{1 \le i \le N}$

How many samples are sufficient to learn an $\varepsilon$-optimal policy?


An incomplete list of prior art

• [Kearns and Singh, 1999]

• [Kakade, 2003]

• [Kearns et al., 2002]

• [Azar et al., 2012]

• [Azar et al., 2013]

• [Sidford et al., 2018a]

• [Sidford et al., 2018b]

• [Wang, 2019]

• [Agarwal et al., 2019]

• [Wainwright, 2019a]

• [Wainwright, 2019b]

• [Pananjady and Wainwright, 2019]

• [Yang and Wang, 2019]

• [Khamaru et al., 2020]

• [Mou et al., 2020]

• . . .


An even shorter list of prior art

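| algorithm | sample size range | sample complexity | $\varepsilon$-range |
|---|---|---|---|
| Empirical QVI [Azar et al., 2013] | $\big[ \frac{\lvert\mathcal{S}\rvert^2 \lvert\mathcal{A}\rvert}{(1-\gamma)^2}, \infty \big)$ | $\frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^3 \varepsilon^2}$ | $\big( 0, \frac{1}{\sqrt{(1-\gamma) \lvert\mathcal{S}\rvert}} \big]$ |
| Sublinear randomized VI [Sidford et al., 2018b] | $\big[ \frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^2}, \infty \big)$ | $\frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^4 \varepsilon^2}$ | $\big( 0, \frac{1}{1-\gamma} \big]$ |
| Variance-reduced QVI [Sidford et al., 2018a] | $\big[ \frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^3}, \infty \big)$ | $\frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^3 \varepsilon^2}$ | $( 0, 1 ]$ |
| Randomized primal-dual [Wang, 2019] | $\big[ \frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^2}, \infty \big)$ | $\frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^4 \varepsilon^2}$ | $\big( 0, \frac{1}{1-\gamma} \big]$ |
| Empirical MDP + planning [Agarwal et al., 2019] | $\big[ \frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^2}, \infty \big)$ | $\frac{\lvert\mathcal{S}\rvert \lvert\mathcal{A}\rvert}{(1-\gamma)^3 \varepsilon^2}$ | $\big( 0, \frac{1}{\sqrt{1-\gamma}} \big]$ |

important parameters:

• # states $|\mathcal{S}|$, # actions $|\mathcal{A}|$

• the discounted complexity $\frac{1}{1-\gamma}$

• approximation error $\varepsilon \in \big( 0, \frac{1}{1-\gamma} \big]$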


All prior theory requires sample size $> \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (the sample size barrier)
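
A short calculation (added for illustration; it only rearranges quantities already listed in the table of prior art) shows what the barrier costs. A sample size of the minimax order $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2}$ exceeds the barrier exactly when the target accuracy is small enough:

```latex
\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}
  \;\ge\; \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{2}}
\quad\Longleftrightarrow\quad
\varepsilon^{2} \le \frac{1}{1-\gamma}
\quad\Longleftrightarrow\quad
\varepsilon \le \frac{1}{\sqrt{1-\gamma}}.
```

So a theory that holds only above the barrier covers accuracies $\varepsilon \le \frac{1}{\sqrt{1-\gamma}}$, which is the $\varepsilon$-range listed for [Agarwal et al., 2019], and says nothing about the coarser range up to $\frac{1}{1-\gamma}$.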


This talk: break the sample complexity barrier


Two approaches

Model-based approach (“plug-in”)

1. build empirical estimate $\widehat{P}$ for $P$

2. planning based on empirical $\widehat{P}$

Model-free approach: learning without constructing a model explicitly


Model estimation

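Sampling: for each $(s, a)$, collect $N$ independent samples $\{(s, a, s'_{(i)})\}_{1 \le i \le N}$

Empirical estimates: estimate $\widehat{P}(s' \mid s, a)$ by the empirical frequency

$$\widehat{P}(s' \mid s, a) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ s'_{(i)} = s' \}$$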
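
The empirical-frequency estimate can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function names, the array layout `(S, A, S)`, and the randomly drawn 4-state, 2-action kernel `P_true` are all assumptions made for the example.

```python
import numpy as np

def sample_generative_model(P, N, rng):
    """Draw N independent next states for every (s, a) from the kernel P.

    P has shape (S, A, S); the returned integer array has shape (S, A, N).
    """
    S, A, _ = P.shape
    samples = np.empty((S, A, N), dtype=np.int64)
    for s in range(S):
        for a in range(A):
            samples[s, a] = rng.choice(S, size=N, p=P[s, a])
    return samples

def empirical_model(samples, num_states):
    """Empirical frequency: P_hat[s, a, s'] = (1/N) * #{i : s'_(i) = s'}."""
    S, A, N = samples.shape
    P_hat = np.zeros((S, A, num_states))
    for s in range(S):
        for a in range(A):
            counts = np.bincount(samples[s, a], minlength=num_states)
            P_hat[s, a] = counts / N
    return P_hat

rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(4), size=(4, 2))  # hypothetical MDP kernel
samples = sample_generative_model(P_true, N=50, rng=rng)
P_hat = empirical_model(samples, num_states=4)
```

Each row `P_hat[s, a]` is a valid probability vector by construction, even when `N` is far too small for it to be close to `P_true[s, a]`.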


Our method: plug-in estimator + perturbation


Challenges in the sample-starved regime

truth: $P \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$; empirical estimate: $\widehat{P}$

• can’t recover $P$ faithfully if sample size $\ll |\mathcal{S}|^2 |\mathcal{A}|$!

Can we trust our policy estimate when reliable model estimation is infeasible?


Main result

Theorem (Li, Wei, Chi, Gu, Chen ’20)

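For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the policy $\widehat{\pi}^{\star}_{p}$ of the perturbed empirical MDP achieves

$$\| V^{\widehat{\pi}^{\star}_{p}} - V^{\star} \|_{\infty} \le \varepsilon \quad \text{and} \quad \| Q^{\widehat{\pi}^{\star}_{p}} - Q^{\star} \|_{\infty} \le \gamma \varepsilon$$

with sample complexity at most

$$\widetilde{O}\left( \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2} \right).$$

• $\widehat{\pi}^{\star}_{p}$: obtained by empirical QVI or PI within $\widetilde{O}\big( \frac{1}{1-\gamma} \big)$ iterations

• minimax lower bound: $\widetilde{\Omega}\big( \frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \varepsilon^2} \big)$ [Azar et al., 2013]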
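
The procedure behind the theorem (perturb the rewards, then plan in the empirical MDP) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the names `q_value_iteration` and `perturbed_plugin_policy`, the perturbation size `xi`, the choice of Uniform(0, xi) noise, and the fixed iteration count are all hypothetical.

```python
import numpy as np

def q_value_iteration(P, r, gamma, num_iters):
    """Q-value iteration: Q <- r + gamma * P V, with V(s) = max_a Q(s, a)."""
    Q = np.zeros_like(r)
    for _ in range(num_iters):
        V = Q.max(axis=1)          # shape (S,)
        Q = r + gamma * (P @ V)    # P is (S, A, S), so P @ V is (S, A)
    return Q

def perturbed_plugin_policy(P_hat, r, gamma, xi, rng, num_iters=1000):
    """Greedy policy of the empirical MDP with slightly perturbed rewards.

    xi and the Uniform(0, xi) noise are illustrative assumptions; the
    slides only state that the rewards are slightly perturbed.
    """
    r_p = r + rng.uniform(0.0, xi, size=r.shape)
    Q_p = q_value_iteration(P_hat, r_p, gamma, num_iters)
    return Q_p.argmax(axis=1)

rng = np.random.default_rng(0)
P_hat = rng.dirichlet(np.ones(3), size=(3, 2))  # stand-in for an empirical model
r = rng.uniform(size=(3, 2))
policy = perturbed_plugin_policy(P_hat, r, gamma=0.9, xi=1e-6, rng=rng)
```

The planner only ever sees `P_hat`, never the true kernel, which is what makes this a plug-in approach.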


A sketch of the main proof ingredients


Notation and Bellman equation

• $V^{\pi}$: true value function under policy $\pi$

  ▸ Bellman equation: $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$ [Sutton and Barto, 2018]

• $\widehat{V}^{\pi}$: estimate of value function under policy $\pi$

  ▸ Bellman equation: $\widehat{V}^{\pi} = (I - \gamma \widehat{P}_{\pi})^{-1} r$

• $\pi^{\star}$: optimal policy w.r.t. true value function

• $\widehat{\pi}^{\star}$: optimal policy w.r.t. empirical value function

• $V^{\star} := V^{\pi^{\star}}$: optimal values under true models

• $\widehat{V}^{\star} := \widehat{V}^{\widehat{\pi}^{\star}}$: optimal values under empirical models
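
The matrix form of the Bellman equation translates directly into a linear solve. The helper below is a sketch for a deterministic policy; the name `policy_value` and the array shapes are assumptions made for this example. Passing the true kernel gives $V^{\pi}$, and passing an empirical kernel gives the plug-in value $\widehat{V}^{\pi}$.

```python
import numpy as np

def policy_value(P, r, policy, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for a deterministic policy.

    P: (S, A, S) transition kernel, r: (S, A) rewards,
    policy: (S,) integer array of chosen actions.
    """
    S = r.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]   # (S, S): row s is P(. | s, policy[s])
    r_pi = r[idx, policy]   # (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```

Solving the linear system avoids forming the inverse $(I - \gamma P_{\pi})^{-1}$ explicitly; the system is always solvable because $\gamma < 1$ and $P_{\pi}$ is a stochastic matrix.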


Proof ideas (cont.)

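Elementary decomposition:

$$V^{\star} - V^{\widehat{\pi}^{\star}} = \big( V^{\star} - \widehat{V}^{\pi^{\star}} \big) + \big( \widehat{V}^{\pi^{\star}} - \widehat{V}^{\widehat{\pi}^{\star}} \big) + \big( \widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}} \big)$$

$$\le \big( V^{\star} - \widehat{V}^{\pi^{\star}} \big) + 0 + \big( \widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}} \big)$$

(the middle term is nonpositive because $\widehat{\pi}^{\star}$ is optimal for the empirical MDP)

• Step 1: control $V^{\pi} - \widehat{V}^{\pi}$, for fixed $\pi$ (Bernstein’s inequality + high-order decomposition)

• Step 2: control $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$ (decouple statistical dependence)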


Step 1: high order decomposition

Bellman equation: $V^{\pi} = (I - \gamma P_{\pi})^{-1} r$

[Agarwal et al., 2019]

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big( I - \gamma P_{\pi} \big)^{-1} \big( \widehat{P}_{\pi} - P_{\pi} \big) \widehat{V}^{\pi} \qquad (\star)$$

[ours] Writing $\widehat{V}^{\pi} = V^{\pi} + [\widehat{V}^{\pi} - V^{\pi}]$ on the right-hand side of $(\star)$ gives

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big( I - \gamma P_{\pi} \big)^{-1} \big( \widehat{P}_{\pi} - P_{\pi} \big) V^{\pi} + \gamma \big( I - \gamma P_{\pi} \big)^{-1} \big( \widehat{P}_{\pi} - P_{\pi} \big) \big[ \widehat{V}^{\pi} - V^{\pi} \big]$$

and applying the same identity repeatedly to the last term yields the high-order expansion

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big( I - \gamma P_{\pi} \big)^{-1} \big( \widehat{P}_{\pi} - P_{\pi} \big) V^{\pi} + \gamma^{2} \Big( \big( I - \gamma P_{\pi} \big)^{-1} \big( \widehat{P}_{\pi} - P_{\pi} \big) \Big)^{2} V^{\pi} + \gamma^{3} \Big( \big( I - \gamma P_{\pi} \big)^{-1} \big( \widehat{P}_{\pi} - P_{\pi} \big) \Big)^{3} V^{\pi} + \ldots$$

Bernstein’s inequality:

$$\big| \big( \widehat{P}_{\pi} - P_{\pi} \big) V^{\pi} \big| \lesssim \sqrt{\frac{\mathsf{Var}[V^{\pi}]}{N}} + \frac{\| V^{\pi} \|_{\infty}}{N}$$
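
The identities on this slide are easy to check numerically. The script below is an illustrative sketch: the 6-state chain, the discount factor, the sample size, and the truncation at 60 terms are assumptions chosen so that the series converges quickly, and nothing here reproduces the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma, N = 6, 0.8, 5000

# Hypothetical fixed-policy chain: P_pi is the true (S x S) transition matrix.
P_pi = rng.dirichlet(np.ones(S), size=S)
r = rng.uniform(size=S)

# Empirical transition matrix built from N samples per state.
P_hat = np.stack([rng.multinomial(N, P_pi[s]) / N for s in range(S)])

I = np.eye(S)
V = np.linalg.solve(I - gamma * P_pi, r)       # true value V^pi
V_hat = np.linalg.solve(I - gamma * P_hat, r)  # plug-in value

# M = (I - gamma P_pi)^{-1} (P_hat - P_pi)
M = np.linalg.solve(I - gamma * P_pi, P_hat - P_pi)

lhs = V_hat - V
star = gamma * M @ V_hat                              # identity (star)
recursion = gamma * M @ V + gamma * M @ (V_hat - V)   # first-order form

# Truncated high-order expansion: sum_{k=1}^{60} (gamma M)^k V
series = np.zeros(S)
term = V.copy()
for _ in range(60):
    term = gamma * M @ term
    series += term

print(np.abs(lhs - star).max(),
      np.abs(lhs - recursion).max(),
      np.abs(lhs - series).max())
```

The first two identities are exact for any sample size. The series form additionally needs the spectral radius of $\gamma (I - \gamma P_{\pi})^{-1} (\widehat{P}_{\pi} - P_{\pi})$ to be below one, which is why the example uses a fairly large `N`.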


Byproduct: policy evaluation

Theorem (Li, Wei, Chi, Gu, Chen ’20)

Fix any policy $\pi$. For every $0 < \varepsilon \le \frac{1}{1-\gamma}$, the plug-in estimator $\widehat{V}^{\pi}$ obeys

$$\| \widehat{V}^{\pi} - V^{\pi} \|_{\infty} \le \varepsilon$$

with sample complexity at most

$$\widetilde{O}\left( \frac{|\mathcal{S}|}{(1-\gamma)^3 \varepsilon^2} \right).$$

• minimax lower bound [Azar et al., 2013, Pananjady and Wainwright, 2019]

• tackle sample size barrier: prior work requires sample size $> \frac{|\mathcal{S}|}{(1-\gamma)^2}$ [Agarwal et al., 2019, Pananjady and Wainwright, 2019, Khamaru et al., 2020]


Step 2: controlling $\widehat{V}^{\widehat{\pi}^{\star}} - V^{\widehat{\pi}^{\star}}$

A natural idea: apply our policy evaluation theory + union bound

• highly suboptimal!

key idea 2: a leave-one-out argument to decouple statistical dependency between $\widehat{\pi}$ and samples

(inspired by [Agarwal et al., 2019] but quite different . . .)


Key idea 2: leave-one-out argument

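• state-action absorbing MDP for each $(s, a)$: $(\mathcal{S}, \mathcal{A}, \widehat{P}^{(s,a)}, r, \gamma)$

• $(\widehat{P} - P)_{s,a} V^{\widehat{\pi}^{\star}} = (\widehat{P} - P)_{s,a} V^{\widehat{\pi}^{\star}_{s,a}}$ ($\widehat{\pi}^{\star}_{s,a}$: optimal for new MDP)

Caveat: require $\widehat{\pi}^{\star}$ to stand out from other policies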


Key idea 3: tie-breaking via perturbation

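• How to ensure the optimal policy stands out from other policies?

$$\forall s \in \mathcal{S}: \quad \widehat{Q}^{\star}(s, \widehat{\pi}^{\star}(s)) - \max_{a:\, a \ne \widehat{\pi}^{\star}(s)} \widehat{Q}^{\star}(s, a) \ge \omega$$

• Solution: slightly perturb rewards $r$ $\Longrightarrow$ $\widehat{\pi}^{\star}_{p}$

  ▸ ensures the uniqueness of $\widehat{\pi}^{\star}_{p}$

  ▸ $V^{\widehat{\pi}^{\star}_{p}} \approx V^{\widehat{\pi}^{\star}}$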
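
A toy example makes the tie-breaking effect concrete. In the hypothetical 2-state MDP below both actions are identical, so the separation gap is exactly zero until the rewards are perturbed. The helper names, the perturbation size `xi`, and the Uniform(0, xi) noise are assumptions made for illustration only.

```python
import numpy as np

def optimal_q(P, r, gamma, num_iters=2000):
    """Optimal Q-function via Q-value iteration."""
    Q = np.zeros_like(r)
    for _ in range(num_iters):
        Q = r + gamma * (P @ Q.max(axis=1))
    return Q

def separation_gap(Q):
    """Minimum over states of (best Q-value minus second-best Q-value)."""
    top_two = np.sort(Q, axis=1)[:, -2:]
    return float((top_two[:, 1] - top_two[:, 0]).min())

# Hypothetical MDP: both actions share the same transitions and rewards,
# so every policy is optimal and the gap is zero.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])
r = np.array([[1.0, 1.0],
              [0.0, 0.0]])
gamma = 0.9

rng = np.random.default_rng(0)
xi = 1e-3                                     # illustrative perturbation size
r_p = r + rng.uniform(0.0, xi, size=r.shape)  # slightly perturbed rewards

gap_before = separation_gap(optimal_q(P, r, gamma))
gap_after = separation_gap(optimal_q(P, r_p, gamma))
print(gap_before, gap_after)
```

Because the perturbation is at most `xi` per reward, the value of any policy changes by at most `xi / (1 - gamma)`, so the perturbed optimal policy stays near-optimal for the unperturbed problem while becoming unique.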


Concluding remarks

Understanding RL requires modern statistics and optimization

Future directions

• beyond the tabular setting

[Feng et al., 2020, Jin et al., 2019, Duan and Wang, 2020]

• finite-horizon episodic MDPs

[Dann and Brunskill, 2015, Jiang and Agarwal, 2018, Wang et al., 2020]


Paper:

“Breaking the sample size barrier in model-based reinforcement learning with a generative model,” G. Li, Y. Wei, Y. Chi, Y. Gu, Y. Chen, arXiv:2005.12900, 2020
