TRANSCRIPT
Reinforcement Learning
Policy Search
Continuous state/action spaces, policy gradient, natural policy gradient, actor-critic, natural actor-critic
Vien Ngo, MLR, University of Stuttgart
Outline
• Policy search overview: exploration, evaluation, updates
• Policy gradient methods
  – Plain policy gradient
  – Second-order policy gradient: Gauss-Newton, natural gradient
  – Actor-critic, natural actor-critic
• Inference-based methods
• Information-theoretic methods
• Black-box optimization methods
Policy Search: motivations
• Drawbacks of value-based methods:
  – Inefficient in high-dimensional spaces
  – Value function approximation suffers from bias due to the use of bootstrapping (value updates based on current estimates)
  – The approximated value function might be discontinuous (e.g. in under-actuated systems)
• Advantages of policy search:
  – Guaranteed convergence properties
  – Learning in high-dimensional problems
  – Enables imitation learning and learning from demonstration effectively
  – Integrates other optimal control policies straightforwardly
• Disadvantage: computationally inefficient
Policy Search
• Directly optimize over the parameter space θ ∈ Θ of parameterized policies. For example:
  – Linear policies: π(s) = θᵀφ(s)
  – Stochastic Gaussian policies: π(a|s, θ) ∝ e^{−f(s,a)}
• Linear parametrization: f(s, a) = ½ (Wφ(s) − a)ᵀ Σ⁻¹ (Wφ(s) − a), where a ∈ R^n, there is one parameter vector θ_i ∈ Θ per action dimension, and W = (θ_1ᵀ; …; θ_nᵀ) stacks them as rows; hence θ = (W, Σ)
• Functional policy: f(s, a) = ½ (h(s) − a)ᵀ Σ⁻¹ (h(s) − a); hence θ = (h, Σ)
• (Recurrent) neural network: a = f(θ, s) (e.g. autonomous helicopter, AlphaGo, robotic control tasks)
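As a sketch, the linear-Gaussian parametrization above can be written in a few lines of Python (the feature map, dimensions, and noise level here are hypothetical choices, not from the lecture):

```python
import numpy as np

def phi(s):
    """Hypothetical feature map: the raw state plus a constant bias feature."""
    return np.append(s, 1.0)

class LinearGaussianPolicy:
    """pi(a|s, theta) = N(a | W phi(s), Sigma), with theta = (W, Sigma)."""
    def __init__(self, n_features, n_actions, noise=0.1):
        self.W = np.zeros((n_actions, n_features))   # row i is theta_i^T
        self.Sigma = noise * np.eye(n_actions)       # exploration covariance

    def mean(self, s):
        return self.W @ phi(s)

    def sample(self, s):
        return np.random.multivariate_normal(self.mean(s), self.Sigma)

policy = LinearGaussianPolicy(n_features=4, n_actions=2)
a = policy.sample(np.array([0.1, -0.3, 0.5]))   # 3 state dims + bias = 4 features
```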
Policy Search
Algorithm 1 Model-Free Policy Search
1: while (!converged) do
2: EXPLORATION: generate trajectories D = {τ_i} using π(θ_k)
3: EVALUATION: evaluate the quality of π(θ_k) using D
4: UPDATE: update π(θ_{k+1}) given the evaluations and trajectories
5: end while
(Deisenroth, Neumann and Peters, 2011)
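Algorithm 1 translates into a short generic loop; the sketch below assumes problem-specific callables `rollout`, `evaluate`, and `update` (hypothetical names) that implement the three steps:

```python
def policy_search(theta0, rollout, evaluate, update, n_iters=100):
    """Generic explore-evaluate-update loop of Algorithm 1 (a sketch)."""
    theta = theta0
    for k in range(n_iters):
        D = rollout(theta)                  # EXPLORATION: trajectories from pi(theta_k)
        scores = evaluate(theta, D)         # EVALUATION: e.g. returns or Q estimates
        theta = update(theta, D, scores)    # UPDATE: e.g. a gradient step
    return theta
```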
Exploration in Model-Free Policy Search
• Exploration in action space vs. exploration in parameter space
• Episode-based vs. step-based exploration
• Uncorrelated vs. correlated exploration
Exploration in Model-Free Policy Search
• Exploration in action space: exploration noise is added directly to the executed actions (e.g. REINFORCE, eNAC); with step-based exploration (e.g. PoWER, PI²) and u_t = φ_t(x)ᵀθ,
  π_θ(u_t|x_t) = N(u_t | θᵀφ(x_t), φ(x_t)ᵀ Σ φ(x_t))
• Exploration in parameter space: the parameters are perturbed at the beginning of an episode (or, alternatively, at each time step), e.g. in black-box approaches:
  u = µ_u(x, θ), θ ∼ N(θ | µ_ω, Σ_θ), where ω parametrizes the upper-level policy
[Figure: graphical model of learning an upper-level policy (Deisenroth, Neumann, Peters, 2011)]
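The two exploration schemes can be contrasted in code; this is a minimal sketch in which the feature map and the noise parameters are assumptions:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map."""
    return np.append(x, 1.0)

def act_with_action_noise(theta, x, sigma_u=0.1):
    """Action-space exploration: Gaussian noise is added to the executed
    action at every time step, u_t ~ N(theta^T phi(x_t), sigma_u^2 I)."""
    mean_u = theta.T @ phi(x)
    return mean_u + sigma_u * np.random.randn(*mean_u.shape)

def sample_episode_parameters(mu_omega, Sigma_theta):
    """Parameter-space exploration: draw a perturbed parameter vector once
    per episode from the upper-level policy N(mu_omega, Sigma_theta), then
    execute u = mu_u(x, theta) deterministically for the whole episode."""
    return np.random.multivariate_normal(mu_omega, Sigma_theta)
```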
Exploration in Model-Free Policy Search
• Step-based exploration (e.g. PoWER, PI²): e.g. u_t = φ_t(x)ᵀθ as a lower-level policy, which can be rewritten as
  π_θ(u_t|x_t, t) = N(u_t | φ_t(x)ᵀµ_θ, φ_t(x)ᵀ Σ_θ φ_t(x))
• Episode-based exploration: e.g. sampling θ ∼ N(θ | µ_ω, Σ_θ) at the beginning of each episode
Exploration in Model-Free Policy Search
• Uncorrelated exploration: Σ_θ is diagonal
• Correlated exploration: Σ_θ is a full covariance matrix. This strategy often gives better learning speed, but requires a large amount of data to estimate Σ_θ accurately (CMA-ES is an efficient choice here, but should not be used when |Θ| > 50)
Step-based Policy Evaluation
Given a set of sampled trajectories D = {ξ_i = (x_t^i, u_t^i)_{t=0}^{T_i}}_{i=1}^{M}:
• the expected cost-to-go of (x_t^i, u_t^i) is
  Q_t^i = Q_t^π(x_t^i, u_t^i) = E_{π_θ(ξ)}[ Σ_{k=t}^{T_i} c(x_k, u_k) | x_t = x_t^i, u_t = u_t^i ] ≈ Σ_{k=t}^{T_i} c(x_k^i, u_k^i)
• the step-based update is based on the data set D′ = {{x_t^i, u_t^i, Q_t^i}_{t=0}^{T_i}}_{i=1}^{M}
• algorithms: REINFORCE, G(PO)MDP, NAC, eNAC, PoWER, PI².
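A sketch of the Monte-Carlo cost-to-go computation above; the trajectory and cost containers are assumed formats, not a prescribed interface:

```python
import numpy as np

def step_based_evaluation(trajectories, costs):
    """Build D' = {(x_t^i, u_t^i, Q_t^i)} with the Monte-Carlo estimate
    Q_t^i ~ sum_{k=t}^{T_i} c(x_k^i, u_k^i) (undiscounted, as on the slide).
    trajectories[i] is a list of (x_t, u_t) pairs and costs[i] the matching
    per-step costs of trajectory i."""
    D_prime = []
    for traj, c in zip(trajectories, costs):
        # a reversed cumulative sum yields the cost-to-go at every step
        Q = np.cumsum(np.asarray(c)[::-1])[::-1]
        D_prime.append([(x, u, q) for (x, u), q in zip(traj, Q)])
    return D_prime
```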
Episode-based Policy Evaluation
Given a set of sampled trajectories D = {ξ_i = (x_t^i, u_t^i)_{t=0}^{T_i}}_{i=1}^{M}:
• use the expected cost of a parameter θ_i,
  C_i = C(θ_i) = E_{π_θ(ξ)}[ Σ_{k=0}^{T_i} c(x_k, u_k) | θ = θ_i ]
• the episode-based update is based on the data set D′ = {θ_i, C_i}_{i=1}^{M}
• algorithms: episode-based REPS, episode-based PI², PEPG, NES, CMA-ES, RWR
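The episode-based counterpart only needs one scalar per sampled parameter vector; a sketch, where `rollout_cost(theta)` is an assumed environment interface returning one episode's total cost:

```python
import numpy as np

def episode_based_evaluation(thetas, rollout_cost, n_rollouts=5):
    """Estimate C(theta_i) = E[sum_k c(x_k, u_k) | theta = theta_i] by
    averaging a few rollouts per sampled parameter vector."""
    return [(theta, np.mean([rollout_cost(theta) for _ in range(n_rollouts)]))
            for theta in thetas]
```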
Step-based vs. Episode-based
• Episode-based evaluation might suffer from large variance.
• Episode-based methods explore (almost) directly in parameter space.
• The choice between the two depends on the given problem.
Policy Update Strategies
• Policy gradient methods: finite difference methods, REINFORCE, G(PO)MDP, natural gradient, Natural Actor-Critic (NAC), episodic NAC, etc.
• Inference-based techniques
• Information-theoretic techniques
• Black-box techniques
Policy gradient
• Family of randomized policies µ(s, a) = p(a|s) (a deterministic policy is a special case).
• A performance measure is
  J(µ) = E_µ[ r_0 + γ r_1 + γ² r_2 + … ]
• Parameterize the policy family µ_θ(s, a) by a parameter vector θ ∈ R^d.
• The parametric performance measure becomes
  J(θ) = E_θ[ r_0 + γ r_1 + γ² r_2 + … ]
• Solution: gradient-ascent algorithms
  θ_{k+1} = θ_k + α ∇_θ J(θ_k)
Policy gradient
• Policy gradient update:
  θ_{µ′} = θ_µ + α ∇_θ J(θ_µ)
• Guaranteed performance improvement: J(θ_{µ′}) ≥ J(θ_µ) ⇒ µ′ is at least as good as µ
Policy gradient: Black-box approaches
• Approximate the gradient using supervised learning (regression).
• Collect data D = {δθ_i, δJ_i}_{i=1}^{M} (the sampled gradient information) by
  – perturbing the parameters: θ + δθ_i
  – applying the perturbed policy µ(θ + δθ_i) to get δJ_i = J(θ + δθ_i) − J(θ)
• The finite-difference (FD) gradient estimate is obtained by regression:
  g_FD(θ) = (∆Θᵀ∆Θ)⁻¹ ∆Θᵀ ∆J
  where ∆Θ = (δθ_1ᵀ; …; δθ_Mᵀ) stacks the perturbations as rows and ∆J = (δJ_1, …, δJ_M)ᵀ
• Gradient update:
  θ ← θ + α g_FD(θ)
(Jan Peters, Scholarpedia 2010)
• This method works well if the return J is not too noisy.
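A sketch of the FD estimator; the perturbation scale and sample count are hypothetical, and `J(theta)` stands for one (possibly noisy) rollout return:

```python
import numpy as np

def fd_gradient(J, theta, n_samples=20, eps=1e-2):
    """Finite-difference gradient by regression,
    g_FD = (DTheta^T DTheta)^{-1} DTheta^T DJ."""
    J0 = J(theta)
    dTheta = eps * np.random.randn(n_samples, len(theta))   # perturbations
    dJ = np.array([J(theta + d) - J0 for d in dTheta])      # dJ_i = J(theta + dtheta_i) - J(theta)
    # lstsq solves the same normal equations, but is numerically safer
    g, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)
    return g
```

When J is noisy, averaging several evaluations of J(θ + δθ_i) per perturbation reduces the variance of the regression targets.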
Policy gradient: Likelihood Ratio Gradient
• Rewrite the performance measure J(θ):
  J(θ) = E_θ[ r_0 + γ r_1 + γ² r_2 + … ] = ∫ p(ξ|µ_θ) R(ξ) dξ
  where ξ = {s_0, a_0, r_0, s_1, a_1, r_1, …} is a trajectory,
  R(ξ) = r_0 + γ r_1 + γ² r_2 + …
  p(ξ|µ_θ) = p(s_0) ∏_t p(s_{t+1}|s_t, a_t) µ_θ(a_t|s_t)
Policy gradient: Likelihood Ratio Gradient
• Gradient derivation:
  ∇_θ J(θ) = ∫ ∇_θ p(ξ|µ_θ) R(ξ) dξ
           = ∫ p(ξ|µ_θ) ∇_θ log p(ξ|µ_θ) R(ξ) dξ   (the trick is ∇f = f ∇ log f)
           = E[ ∇_θ log p(ξ|µ_θ) R(ξ) ]
• Using Monte-Carlo simulation, sample M trajectories ξ_i from policy µ_θ:
  ∇_θ J(θ) ≈ (1/M) Σ_{i=1}^{M} ∇_θ log p(ξ_i|µ_θ) R(ξ_i) = (1/M) Σ_{i=1}^{M} Σ_{t=0}^{T_i} ∇_θ log µ_θ(a_t|s_t) R(ξ_i)
  because p(s_{t+1}|s_t, a_t) does not depend on θ, so its gradient w.r.t. θ is 0.
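The Monte-Carlo estimator above, as a sketch; `grad_log_pi(s, a)` (the score function of the policy) and the return function `R` are assumed to be supplied:

```python
import numpy as np

def likelihood_ratio_gradient(trajectories, grad_log_pi, R):
    """grad J ~ (1/M) sum_i [ sum_t grad_theta log mu(a_t|s_t) ] R(xi_i),
    where each trajectory is a list of (s, a) pairs."""
    grads = []
    for traj in trajectories:
        score = sum(grad_log_pi(s, a) for s, a in traj)   # sum of per-step scores
        grads.append(score * R(traj))
    return np.mean(grads, axis=0)
```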
Policy gradient: Likelihood Ratio Gradient
A vanilla policy gradient algorithm:
• initialize θ_0
• for k = 0 : ∞ (until convergence)
  – generate data: M trajectories ξ_i from policy µ_{θ_k}
  – evaluate: ∇_θ J(θ_k) from {ξ_i}_{i=1}^{M}
  – update: θ_{k+1} = θ_k + α ∇_θ J(θ_k)
• end for
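The loop above written out as a sketch that plugs in any gradient estimator (e.g. the likelihood-ratio estimate from the previous slide); `sample_trajectories` is an assumed environment interface, and the step size and iteration count are hypothetical:

```python
def vanilla_policy_gradient(theta0, sample_trajectories, estimate_gradient,
                            alpha=1e-2, n_iters=200):
    """Vanilla policy gradient: generate data, evaluate the gradient, update."""
    theta = theta0
    for k in range(n_iters):
        trajs = sample_trajectories(theta)        # M trajectories from mu_theta_k
        grad = estimate_gradient(theta, trajs)    # e.g. the likelihood-ratio estimate
        theta = theta + alpha * grad              # gradient-ascent step
    return theta
```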
Variance Reduction
• Vanilla policy gradient method suffers from large variance.
• Variance reduction using a baseline:
  – retains unbiased gradient estimates
  – reduces the variance of the gradient estimates
• REINFORCE and G(PO)MDP algorithms
Variance Reduction: REINFORCE
• Introduce a baseline b that keeps the estimate unbiased:
  ∇_{θ_i}^{RF} J(θ) = E_{p_θ(ξ)}[ Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t) (R(ξ) − b) ]
                    = E_{p_θ(ξ)}[ Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t) R(ξ) ]
• Unbiased gradient estimate: ∇_θ J(θ) = ∇_θ^{RF} J(θ)
• Choose b to minimize the variance of ∇_θ^{RF} J(θ):
  ∂/∂b_i Var[∇_{θ_i}^{RF} J(θ)] = ∂/∂b_i { E[(∇_{θ_i}^{RF} J(θ))²] − E[∇_{θ_i}^{RF} J(θ)]² }
                                = ∂/∂b_i E[(∇_{θ_i}^{RF} J(θ))²] = 0
  where i denotes the i-th dimension of θ.
• The resulting baseline (for each dimension i) is
  b_i = E_{p_θ(ξ)}[ (Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t))² R(ξ) ] / E_{p_θ(ξ)}[ (Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t))² ]
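A sketch of this baseline estimated from samples; `score_sums[m]` holds Σ_t ∇_θ log µ(a_t|s_t) for trajectory m and `returns[m]` its return R(ξ_m) (both are assumed inputs):

```python
import numpy as np

def optimal_baseline(score_sums, returns):
    """Per-dimension baseline b_i = E[S_i^2 R] / E[S_i^2],
    with S_i = sum_t grad_{theta_i} log mu(a_t|s_t)."""
    S2 = np.asarray(score_sums) ** 2            # (M, d) squared score sums
    R = np.asarray(returns)[:, None]            # (M, 1) returns
    return (S2 * R).mean(axis=0) / (S2.mean(axis=0) + 1e-12)

def reinforce_gradient(score_sums, returns):
    """REINFORCE gradient with the variance-minimizing baseline."""
    b = optimal_baseline(score_sums, returns)   # (d,)
    S = np.asarray(score_sums)                  # (M, d)
    R = np.asarray(returns)[:, None]            # (M, 1)
    return (S * (R - b)).mean(axis=0)
```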
Variance Reduction: G(PO)MDP Algorithm
• The rewards at a given time are independent of future actions:
  E_{p_θ(ξ)}[ ∂_θ log µ_θ(a_t|s_t) r_j ] = 0, ∀j < t
• The gradient computed by G(PO)MDP is
  ∇_{θ_i}^{G(PO)MDP} J(θ) = E_{p_θ(ξ)}[ Σ_{j=0}^{T−1} Σ_{t=0}^{j} ∇_{θ_i} log µ_θ(a_t|s_t) (r_j − b_j) ]
• The baseline is
  b_{j,i} = E_{p_θ(ξ)}[ (Σ_{t=0}^{j} ∇_{θ_i} log µ_θ(a_t|s_t))² r_j ] / E_{p_θ(ξ)}[ (Σ_{t=0}^{j} ∇_{θ_i} log µ_θ(a_t|s_t))² ]
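A sketch of the G(PO)MDP estimator; trajectories are assumed to be lists of (s, a, r) triples, `grad_log_pi` is the policy's score function, and the per-step baselines are optional:

```python
import numpy as np

def gpomdp_gradient(trajectories, grad_log_pi, baselines=None):
    """Each reward r_j is paired only with the scores of the actions
    taken up to and including time j."""
    grads = []
    for traj in trajectories:
        score, g = 0.0, 0.0
        for j, (s, a, r) in enumerate(traj):
            score = score + grad_log_pi(s, a)              # running sum_{t<=j}
            b = 0.0 if baselines is None else baselines[j]
            g = g + score * (r - b)
        grads.append(g)
    return np.mean(grads, axis=0)
```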
Second-order methods
• Natural policy gradient (Sham Kakade, 2003)
• Gauss-Newton policy gradient (Thomas Furmston, Guy Lever, David Barber, 2016)
Actor-Critic methods
• The policy structure is known as the actor (→ tuned by gradient updates).
• The estimated value function is known as the critic (→ tuned by TD errors).
(Introduction to RL, Sutton & Barto)
Policy Gradient Theorem Algorithm
• R(ξ) can be replaced by the estimate Q_t^π(s_t, a_t):
  ∇_θ J(θ) = E[ Σ_{t=0}^{T} ∇_θ log µ_θ(a_t|s_t) (Σ_{j=t}^{T} r_j) ]
           = E[ Σ_{t=0}^{T} ∇_θ log µ_θ(a_t|s_t) Q_t^π(s_t, a_t) ]
  (Baxter & Bartlett, 2001)
• Policy gradient with function approximation:
  Q_t^π(s_t, a_t) ≈ φ(s_t, a_t)ᵀ w
• Compatible function approximation: how should φ(s_t, a_t) be chosen such that it
  – does not introduce bias
  – reduces the variance?
Compatible function approximation
The function approximation is compatible with the policy parametrization.
• The feature function is φ(s, a) = ∇_θ log µ_θ(a|s), found by solving the least-squares problem
  min_w E_{p_θ(ξ)}[ Σ_{t=0}^{T−1} (Q(s_t, a_t) − φ(s_t, a_t)ᵀ w)² ]
  (Sutton, 1999; Konda & Tsitsiklis, 1999)
• The gradient with compatible function approximation:
  ∇_θ J(θ) = E[ Σ_{t=0}^{T} ∇_θ log µ_θ(a_t|s_t) (∇_θ log µ_θ(a_t|s_t))ᵀ ] w = G_θ w
  where G_θ is the Fisher information matrix (actor-critic algorithms)
• The natural policy gradient with compatible function approximation:
  ∇_θ^{NG} J(θ) = G_θ⁻¹ G_θ w = w (natural actor-critic algorithms)
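Since the compatible critic is linear in w, fitting it is a least-squares problem, and its solution is directly the natural gradient; a sketch, with `grad_log_pi` and the Q estimates assumed given:

```python
import numpy as np

def natural_gradient_weights(state_actions, Q_estimates, grad_log_pi):
    """Fit Q(s, a) ~ phi(s, a)^T w with compatible features
    phi(s, a) = grad_theta log mu_theta(a|s); the least-squares
    solution w equals the natural gradient grad^NG J(theta)."""
    Phi = np.array([grad_log_pi(s, a) for s, a in state_actions])   # (N, d)
    Q = np.asarray(Q_estimates)                                     # (N,)
    w, *_ = np.linalg.lstsq(Phi, Q, rcond=None)
    return w
```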
Actor-critic algorithms
A regular-gradient actor-critic algorithm (a stochastic update version):
• initialize θ_0
• for t = 0 : ∞ (until convergence)
  – choose an action a_t ∼ µ_{θ_t}(a_t|s_t)
  – take a_t, observe r_t and s_{t+1}
  – compute the TD error: δ_t = r_t + γ φ(s_{t+1}, a_{t+1})ᵀ w_t − φ(s_t, a_t)ᵀ w_t
  – critic update: w_{t+1} = w_t + α δ_t φ(s_t, a_t)
  – actor update: θ_{t+1} = θ_t + β φ(s_t, a_t) φ(s_t, a_t)ᵀ w_t
• end for
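One stochastic update of the loop above, as a sketch; `phi(s, a)` is the compatible feature map, and the step sizes and discount are hypothetical values:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      phi, alpha=0.1, beta=0.01, gamma=0.99):
    """TD error, critic update, then actor update, following the slide."""
    delta = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w   # TD error
    w = w + alpha * delta * phi(s, a)                             # critic
    theta = theta + beta * np.outer(phi(s, a), phi(s, a)) @ w     # actor
    return theta, w
```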
Episodic natural actor-critic (eNAC)
• initialize θ_0
• for t = 0 : ∞ (until convergence)
  – sample a set of trajectories {ξ_i} from π(θ_t)
  – critic update by regression: (w; v) = (ΦᵀΦ)⁻¹ ΦᵀR
  – actor update: ∇_θ^{eNAC} J(θ) = w
• end for
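A sketch of one eNAC iteration: the regression matrix stacks each episode's summed score features with a constant offset feature, and the weight part w of the solution is the natural gradient (`grad_log_pi` and the step size are assumptions):

```python
import numpy as np

def enac_update(theta, trajectories, returns, grad_log_pi, alpha=0.05):
    """Critic by regression [w; v] = (Phi^T Phi)^{-1} Phi^T R, actor step on w."""
    Phi = np.array([np.append(sum(grad_log_pi(s, a) for s, a in traj), 1.0)
                    for traj in trajectories])       # rows: [sum_t score_t, 1]
    sol, *_ = np.linalg.lstsq(Phi, np.asarray(returns), rcond=None)
    w, v = sol[:-1], sol[-1]                         # w: natural gradient, v: offset
    return theta + alpha * w
```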
Information-theoretic Policy Search
• Information-theoretic methods bound the update of the trajectory distribution, i.e. regularization limits the information loss in each policy update (the natural policy gradient is an example).
• Regularization helps avoid premature convergence to local optima, as occurs in EM-based methods.
• Information-theoretic policy search combines advantages of inference-based methods (updates without a step size) and of the natural policy gradient (a bound on the information loss).
Relative Entropy Policy Search (REPS) (Jan Peters 2010).
Evolutionary policy search
• Evolutionary policy search: cross-entropy method (CE), covariance matrix adaptation evolution strategy (CMA-ES), Natural Evolution Strategies (NES), etc.