Variance Reduction for Policy Gradient Methods
March 13, 2017
Reward Shaping
- Reward shaping: r̃(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s) for an arbitrary "potential" function Φ
- Theorem: r̃ admits the same optimal policies as r.[1] (A small numerical check appears below.)
- Proof sketch: suppose Q∗ satisfies the Bellman equation (T Q = Q). If we transform r → r̃, the corresponding value function satisfies Q̃(s, a) = Q∗(s, a) − Φ(s)
- Q∗ satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation
[1] A. Y. Ng, D. Harada, and S. Russell. "Policy invariance under reward transformations: Theory and application to reward shaping". ICML. 1999.
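A minimal numerical check of the theorem (my own sketch, not from the lecture): build a small random tabular MDP, shape its reward with an arbitrary potential Φ, run value iteration on both reward functions, and confirm that Q̃∗(s, a) = Q∗(s, a) − Φ(s) and that the greedy optimal policy is unchanged.

```python
# Sketch (assumptions: tabular MDP, numpy only; all names are illustrative).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
R = rng.normal(size=(nS, nA, nS))               # r(s, a, s')
Phi = rng.normal(size=nS)                       # arbitrary potential

def value_iteration(R, iters=2000):
    """Q(s,a) <- sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * max_a' Q(s',a'))."""
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q

# Shaped reward r~(s,a,s') = r(s,a,s') + gamma*Phi(s') - Phi(s)
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]

Q_star = value_iteration(R)
Q_shaped = value_iteration(R_shaped)

assert np.allclose(Q_shaped, Q_star - Phi[:, None], atol=1e-6)    # Q~* = Q* - Phi(s)
assert (Q_shaped.argmax(axis=1) == Q_star.argmax(axis=1)).all()   # same greedy policy
print("shaping shifts Q* by -Phi(s); optimal policy unchanged")
```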
Reward Shaping

- Theorem: r̃ admits the same optimal policies as r. (A. Y. Ng, D. Harada, and S. Russell. "Policy invariance under reward transformations: Theory and application to reward shaping". ICML. 1999)
- Alternative proof: the advantage function is invariant. Look at the effect on V^π and Q^π, conditioning on either s_0 or (s_0, a_0):

  E[r̃_0 + γ r̃_1 + γ² r̃_2 + ...]
  = E[(r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...]
  = E[r_0 + γ r_1 + γ² r_2 + ... − Φ(s_0)]

  Thus

  Ṽ^π(s) = V^π(s) − Φ(s)
  Q̃^π(s, a) = Q^π(s, a) − Φ(s)
  Ã^π(s, a) = A^π(s, a)

- max_a A^π(s, a) = 0 at all states ⇒ π is optimal; since the advantage is invariant, r̃ and r admit the same optimal policies
Reward Shaping and Problem Difficulty
- Shape with Φ = V∗ ⇒ the problem is solved in one step of value iteration
- Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):

  E[∇_θ log π_θ(a_0 | s_0) ((r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...)]
  = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γ r_1 + γ² r_2 + ... − Φ(s_0))]
  = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γ r_1 + γ² r_2 + ...)]
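The last equality holds because a term that depends only on s_0 acts as a baseline: E_{a∼π_θ}[∇_θ log π_θ(a | s)] = 0, so subtracting Φ(s_0) does not bias the gradient. A quick Monte Carlo sanity check of this identity for a softmax policy (my own sketch, not from the slides):

```python
# Sketch: E_{a ~ pi_theta}[grad_theta log pi_theta(a|s) * baseline] = 0 for any
# state-dependent baseline, shown empirically for a 4-action softmax policy.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=4)                     # softmax logits (policy parameters)
probs = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - probs
    g = -probs.copy()
    g[a] += 1.0
    return g

baseline = 3.7                                 # stands in for Phi(s_0); any constant works
actions = rng.choice(4, size=100_000, p=probs)
estimate = np.mean([grad_log_pi(a) * baseline for a in actions], axis=0)
print(estimate)                                # ~ [0, 0, 0, 0] up to Monte Carlo noise
```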
Reward Shaping and Policy Gradients

- First note the connection between the shaped reward and the advantage function:

  E_{s_1}[r_0 + γ V^π(s_1) − V^π(s_0) | s_0 = s, a_0 = a] = A^π(s, a)

- Now, considering the policy gradient and ignoring all but the first shaped reward (i.e., pretend γ = 0 after shaping with Φ = V^π), we get

  E[∑_t ∇_θ log π_θ(a_t | s_t) r̃_t]
  = E[∑_t ∇_θ log π_θ(a_t | s_t) (r_t + γ V^π(s_{t+1}) − V^π(s_t))]
  = E[∑_t ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t)]

- Compromise: use the more aggressive discount γλ, with λ ∈ (0, 1), on the shaped rewards: this is called generalized advantage estimation

  E[∑_t ∇_θ log π_θ(a_t | s_t) ∑_{k=0}^∞ (γλ)^k r̃_{t+k}]
Reward Shaping – Summary
- The reward shaping transformation leaves the policy gradient and the optimal policy invariant
- Shaping with Φ ≈ V^π makes the consequences of actions more immediate
- Shaping, and then ignoring all but the first term, gives the policy gradient
[Aside] Reward Shaping: Very Important in Practice

- I. Mordatch, E. Todorov, and Z. Popovic. "Discovery of complex behaviors through contact-invariant optimization". ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43
- Y. Tassa, T. Erez, and E. Todorov. "Synthesis and stabilization of complex behaviors through online trajectory optimization". Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE. 2012, pp. 4906–4913
Variance Reduction for Policy Gradients
Variance Reduction
- We have the following policy gradient formula:

  ∇_θ E_τ[R] = E_τ[∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t)]

- A^π is not known, but we can plug in Â_t, an advantage estimator
- Previously, we showed that taking

  Â_t = r_t + r_{t+1} + r_{t+2} + ... − b(s_t)

  for any function b(s_t) gives an unbiased policy gradient estimator; b(s_t) ≈ V^π(s_t) gives variance reduction.
The Delayed Reward Problem
- With policy gradient methods, we are confounding the effect of multiple actions:

  Â_t = r_t + r_{t+1} + r_{t+2} + ... − b(s_t)

  mixes the effects of a_t, a_{t+1}, a_{t+2}, ...
- The SNR of Â_t scales roughly as 1/T
- Only a_t contributes to the signal A^π(s_t, a_t), but a_{t+1}, a_{t+2}, ... contribute to the noise.
Variance Reduction with Discounts

- The discount factor γ, 0 < γ < 1, downweights the effect of rewards that are far in the future: it ignores long-term dependencies
- We can form an advantage estimator using the discounted return:

  Â^γ_t = (r_t + γ r_{t+1} + γ² r_{t+2} + ...) − b(s_t)

  where the first term is the discounted return; this reduces to our previous estimator when γ = 1.
- So that the advantage has expectation zero, we should fit the baseline to the discounted value function (a fitting sketch follows below):

  V^{π,γ}(s) = E_τ[r_0 + γ r_1 + γ² r_2 + ... | s_0 = s]

- The discount γ is similar to using a horizon of 1/(1 − γ) timesteps
- Â^γ_t is a biased estimator of the advantage function
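One concrete way to "fit the baseline to the discounted value function" (a sketch under assumptions; the lecture does not prescribe a particular fitting procedure): compute empirical discounted returns along a sampled trajectory and regress a simple linear value function on them. The feature map and linear form are illustrative choices, not from the slides.

```python
# Sketch: fit b(s) ~ V^{pi,gamma}(s) by least squares on empirical discounted returns.
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = r_t + gamma r_{t+1} + gamma^2 r_{t+2} + ..., computed backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def fit_linear_baseline(features, rewards, gamma, reg=1e-3):
    """Ridge least squares so that features @ w ~ discounted returns."""
    targets = discounted_returns(rewards, gamma)
    A = features.T @ features + reg * np.eye(features.shape[1])
    return np.linalg.solve(A, features.T @ targets)

# Toy usage with made-up data:
rng = np.random.default_rng(2)
feats = rng.normal(size=(100, 8))     # phi(s_t), t = 0..99
rews = rng.normal(size=100)           # r_t
w = fit_linear_baseline(feats, rews, gamma=0.99)
baseline = feats @ w                  # b(s_t) ~ V^{pi,gamma}(s_t)
```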
Value Functions in the Future
- The baseline accounts for and removes the effect of past actions
- We can also use the value function to estimate future rewards:

  r_t + γ V(s_{t+1})                          cut off at one timestep
  r_t + γ r_{t+1} + γ² V(s_{t+2})             cut off at two timesteps
  ...
  r_t + γ r_{t+1} + γ² r_{t+2} + ...          ∞ timesteps (no V)
Value Functions in the Future
- Subtracting out baselines, we get advantage estimators:

  Â^(1)_t = r_t + γ V(s_{t+1}) − V(s_t)
  Â^(2)_t = r_t + γ r_{t+1} + γ² V(s_{t+2}) − V(s_t)
  ...
  Â^(∞)_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... − V(s_t)

- Â^(1)_t has low variance but high bias; Â^(∞)_t has high variance but low bias.
- Using an intermediate k (say, 20) gives an intermediate amount of bias and variance (see the sketch below)
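A sketch of the general k-step estimator over one collected segment (my own helper; `values` is assumed to hold V(s_0), ..., V(s_T), with the final entry used to bootstrap past the end of the segment):

```python
# Sketch: A^(k)_t = r_t + gamma r_{t+1} + ... + gamma^{k-1} r_{t+k-1}
#                   + gamma^k V(s_{t+k}) - V(s_t)
# rewards has length T; values has length T + 1 (last entry bootstraps).
import numpy as np

def k_step_advantages(rewards, values, gamma, k):
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(k, T - t)                       # truncate at the segment end
        ret = sum(gamma**i * rewards[t + i] for i in range(horizon))
        ret += gamma**horizon * values[t + horizon]   # bootstrap with the value function
        adv[t] = ret - values[t]
    return adv
```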
Finite-Horizon Methods: Advantage Actor-Critic
- A2C / A3C uses this fixed-horizon advantage estimator
- Pseudocode (a PyTorch sketch of one such update follows the reference below):

  for iteration = 1, 2, ... do
      Agent acts for T timesteps (e.g., T = 20)
      For each timestep t, compute
          R̂_t = r_t + γ r_{t+1} + ... + γ^{T−1−t} r_{T−1} + γ^{T−t} V(s_T)
          Â_t = R̂_t − V(s_t)
      (R̂_t is the target value in the regression problem; Â_t is the estimated advantage function)
      Compute the loss gradient g = ∇_θ ∑_t [−log π_θ(a_t | s_t) Â_t + c (V(s_t) − R̂_t)²]
      g is plugged into a stochastic gradient descent variant, e.g., Adam.
  end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. “Asynchronous methods for deep reinforcement learning”. (2016)
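A compact sketch of one such update in PyTorch (my own illustration under assumptions: discrete actions, a single synchronous worker, a small shared-torso actor-critic; the actual A3C additionally uses asynchronous workers and an entropy bonus):

```python
# Sketch of one A2C-style update following the pseudocode above (names are illustrative).
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)   # policy logits
        self.v = nn.Linear(hidden, 1)            # state-value head

    def forward(self, obs):
        h = self.body(obs)
        return self.pi(h), self.v(h).squeeze(-1)

def a2c_update(model, optimizer, obs, actions, rewards, last_obs, gamma=0.99, c=0.5):
    """obs: [T, obs_dim], actions: [T] (long), rewards: [T] (float), last_obs: [obs_dim]."""
    logits, values = model(obs)
    with torch.no_grad():
        _, bootstrap = model(last_obs.unsqueeze(0))          # V(s_T), treated as a constant
    # R_t = r_t + gamma r_{t+1} + ... + gamma^{T-t} V(s_T), computed backwards
    returns = torch.zeros_like(rewards)
    running = bootstrap.squeeze(0)
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values.detach()                   # A_t = R_t - V(s_t)
    log_pi = torch.log_softmax(logits, dim=-1)[torch.arange(len(actions)), actions]
    # loss whose gradient is g: policy term plus value-regression term
    loss = (-(log_pi * advantages) + c * (values - returns) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup: model = ActorCritic(obs_dim, n_actions)
#                optimizer = torch.optim.Adam(model.parameters(), lr=7e-4)
```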
A3C Video
A3C Results
[Plots: score vs. training time (hours) on Beamrider, Breakout, Pong, Q*bert, and Space Invaders, comparing DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]
Reward Shaping
TD(λ) Methods: Generalized Advantage Estimation

- Recall the finite-horizon advantage estimators

  Â^(k)_t = r_t + γ r_{t+1} + ... + γ^{k−1} r_{t+k−1} + γ^k V(s_{t+k}) − V(s_t)

- Define the TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t)
- By a telescoping sum,

  Â^(k)_t = δ_t + γ δ_{t+1} + ... + γ^{k−1} δ_{t+k−1}

- Take an exponentially weighted average of the finite-horizon estimators:

  Â^λ_t = (1 − λ)(Â^(1)_t + λ Â^(2)_t + λ² Â^(3)_t + ...)

- We obtain

  Â^λ_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...

  (computed in the sketch below)
- This scheme was named generalized advantage estimation (GAE) in [1], though versions appeared earlier, e.g., [2]. It is related to TD(λ).
[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. "High-dimensional continuous control using generalized advantage estimation". (2015)
[2] H. Kimura and S. Kobayashi. "An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function." ICML. 1998
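The standard backward recursion for this estimator, as a short sketch (`values` again holds V(s_0), ..., V(s_T), with the last entry bootstrapping the final step):

```python
# Sketch: A^lambda_t = delta_t + (gamma*lambda) delta_{t+1} + (gamma*lambda)^2 delta_{t+2} + ...
# where delta_t = r_t + gamma V(s_{t+1}) - V(s_t).
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# lam = 0 recovers the one-step estimator A^(1)_t = delta_t;
# lam = 1 recovers the discounted return minus the baseline V(s_t).
```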
Choosing parameters γ, λ
Performance as γ, λ are varied
[Plot: cost vs. number of policy iterations on the 3D Biped task for γ ∈ {0.96, 0.98, 0.99, 0.995, 1} and λ ∈ {0.92, 0.96, 0.98, 0.99, 1.0}, plus γ = 1 with no value function.]
TRPO+GAE Video