Continuous Control with Deep Reinforcement Learning (DDPG)
TRANSCRIPT
Continuous control with deep reinforcement learning
2016-06-28
Taehoon Kim
Motivation
• DQN can only handle
  • discrete (not continuous) action spaces
  • low-dimensional action spaces
• A simple approach to adapting DQN to continuous domains is to discretize the actions
  • e.g. a 7-degree-of-freedom system discretized as $a_i \in \{-k, 0, k\}$ per dimension
  • the action space now has dimensionality $3^7 = 2187$
  • an explosion of the number of discrete actions
Contribution
• Present a model-free, off-policy actor-critic algorithm
  • learns policies in high-dimensional, continuous action spaces
• Work based on DPG (Deterministic Policy Gradient)
Background
• actions $a_t \in \mathbb{R}^N$, action space $\mathcal{A} = \mathbb{R}^N$
• history of observation-action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$
  • assume full observability, so $s_t = x_t$
• policy $\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$
• Model the environment as a Markov decision process
  • initial state distribution $p(s_1)$
  • transition dynamics $p(s_{t+1} \mid s_t, a_t)$
Background
• Discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$
• The goal of RL is to learn a policy $\pi$ which maximizes the expected return
  • from the start distribution: $J = \mathbb{E}_{r_i, s_i \sim E,\, a_i \sim \pi}[R_1]$
• Discounted state visitation distribution for a policy $\pi$: $\rho^\pi$
Background
• action-value function $Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E,\, a_{i > t} \sim \pi}[R_t \mid s_t, a_t]$
  • the expected return after taking action $a_t$ in state $s_t$ and thereafter following policy $\pi$
• Bellman equation
  • $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[Q^\pi(s_{t+1}, a_{t+1})]\big]$
• With a deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$
  • $Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[r(s_t, a_t) + \gamma\, Q^\mu(s_{t+1}, \mu(s_{t+1}))\big]$
Background
• The expectation depends only on the environment
  • so it is possible to learn $Q^\mu$ off-policy, using transitions generated from a different stochastic behavior policy $\beta$
• Q-learning (a commonly used off-policy algorithm) uses the greedy policy $\mu(s) = \arg\max_a Q(s, a)$
  • $L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}\big[\big(Q(s_t, a_t \mid \theta^Q) - y_t\big)^2\big]$
  • where $y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$
• To scale Q-learning to large non-linear approximators: a replay buffer and a separate target network (a sketch of the resulting critic update follows)
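As an illustration (not the authors' code), a minimal PyTorch sketch of this critic update, assuming hypothetical names: `q_net` and `q_target` are the critic and target critic, `mu_target` is the target actor, and `batch` is a minibatch drawn from the replay buffer.

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, q_target, mu_target, batch, gamma=0.99):
    """Q-learning loss on a minibatch sampled from the replay buffer."""
    s, a, r, s_next = batch  # tensors with a leading batch dimension
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1})), computed with target networks
        y = r + gamma * q_target(s_next, mu_target(s_next))
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]
    return F.mse_loss(q_net(s, a), y)
```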
Deterministic Policy Gradient (DPG)
• In continuous action spaces, finding the greedy policy requires an optimization over $a_t$ at every timestep
  • too slow for large, unconstrained function approximators and nontrivial action spaces
• Instead, use an actor-critic approach based on the DPG algorithm (both networks are sketched below)
  • actor: $\mu(s \mid \theta^\mu): \mathcal{S} \rightarrow \mathcal{A}$
  • critic: $Q(s, a \mid \theta^Q)$
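A minimal PyTorch sketch of the two networks, assuming the 400/300-unit layout from the experiment details later in the slides; names and sizes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu): S -> A."""
    def __init__(self, state_dim, action_dim, hidden=(400, 300)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], action_dim), nn.Tanh(),  # tanh bounds the actions
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a | theta^Q)."""
    def __init__(self, state_dim, action_dim, hidden=(400, 300)):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden[0])
        # the action enters at the 2nd hidden layer, as in the experiment details
        self.fc2 = nn.Linear(hidden[0] + action_dim, hidden[1])
        self.out = nn.Linear(hidden[1], 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)
```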
Learning algorithm
• The actor is updated by applying the chain rule to the expected return from the start distribution $J$ w.r.t. $\theta^\mu$ (a sketch of this update follows)
  • $\nabla_{\theta^\mu} J \approx \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_{\theta^\mu} Q(s, a \mid \theta^Q)\big|_{s = s_t,\, a = \mu(s_t \mid \theta^\mu)}\big] = \mathbb{E}_{s \sim \rho^\beta}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_t,\, a = \mu(s_t)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_t}\big]$
• Silver et al. (2014) proved that this is the policy gradient
  • the gradient of the policy's performance
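In an automatic-differentiation framework this chain rule does not need to be written by hand; a minimal PyTorch sketch (hypothetical names, illustrative only):

```python
def actor_update(actor, critic, actor_optimizer, states):
    """One deterministic policy-gradient step on a batch of states from the replay buffer."""
    actor_optimizer.zero_grad()
    # maximizing E[Q(s, mu(s | theta^mu))] == minimizing its negative;
    # backward() applies grad_a Q * grad_{theta^mu} mu via the chain rule
    loss = -critic(states, actor(states)).mean()
    loss.backward()
    actor_optimizer.step()
```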
Contributions
• Introducing non-linear function approximators means that convergence is no longer guaranteed
  • but they are essential to learn and generalize on large state spaces
• Contribution
  • provide modifications to DPG, inspired by the success of DQN
  • allow neural network function approximators to learn in large state and action spaces online
Challenges 1
• NNs for RL usually assume that the samples are i.i.d.
  • but when the samples are generated by exploring sequentially in an environment, this assumption no longer holds
• As in DQN, a replay buffer is used to address this issue
• As in DQN, target networks are used for stable learning, but with "soft" target updates (sketched below)
  • $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$, with $\tau \ll 1$
• The target networks change slowly, which greatly improves the stability of learning
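A minimal sketch of the soft update, applied to both target networks after each training step (PyTorch, illustrative):

```python
def soft_update(target_net, net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta'."""
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)
```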
Challenges 2
• When learning from low-dimensional feature vectors, observations may have different physical units (e.g. positions and velocities)
  • this makes it difficult to learn effectively and to find hyper-parameters which generalize across environments
• Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to have unit mean and variance
  • it also maintains a running average of the mean and variance for normalization during testing
  • used on all layers of $\mu$ and on all layers of $Q$ prior to the action input (sketched below)
• Allows training on inputs with different units without manually ensuring they fall within a set range
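One way this could look for the actor, as a sketch under these assumptions (PyTorch; batch normalization on the state input and both hidden layers, sizes taken from the experiment details):

```python
import torch.nn as nn

class BatchNormActor(nn.Module):
    """Actor with batch normalization on the state input and the hidden layers."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_dim),   # normalizes raw observations (positions, velocities, ...)
            nn.Linear(state_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
            nn.Linear(400, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        # in eval() mode the running mean/variance are used instead of minibatch statistics
        return self.net(s)
```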
Challenges 3
• An advantage of off-policy algorithms (such as DDPG) is that the problem of exploration can be treated independently from the learning algorithm
• Construct an exploration policy $\mu'$ by adding noise sampled from a noise process $\mathcal{N}$
  • $\mu'(s_t) = \mu(s_t \mid \theta_t^\mu) + \mathcal{N}$
• Use an Ornstein-Uhlenbeck process (sketched below) to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia
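A minimal NumPy sketch of the Ornstein-Uhlenbeck noise process with the $\theta$ and $\sigma$ values listed in the experiment details (illustrative; parameter names are assumptions):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta * (mu - x) dt + sigma dW."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu, dtype=np.float64)

    def reset(self):
        """Reset to the mean at the start of each episode."""
        self.x = np.full_like(self.x, self.mu)

    def sample(self):
        """Advance one step and return the noise added to mu(s_t | theta^mu)."""
        self.x = self.x + self.theta * (self.mu - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        return self.x
```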
Experiment details
• Adam, with learning rates $10^{-4}$ for the actor and $10^{-3}$ for the critic
• $Q$ includes an $L_2$ weight decay of $10^{-2}$, and $\gamma = 0.99$
• $\tau = 0.001$
• ReLU for hidden layers, tanh for the output layer of the actor to bound the actions
• NN: 2 hidden layers with 400 and 300 units
  • the action is not included until the 2nd hidden layer of $Q$
• The final layer weights and biases are initialized from a uniform distribution $[-3 \times 10^{-3}, 3 \times 10^{-3}]$
  • to ensure the initial outputs of the policy and value estimates are near zero
• The other layers are initialized from uniform distributions $[-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]$, where $f$ is the fan-in of the layer (an initialization sketch follows)
• Replay buffer size $|\mathcal{R}| = 10^{6}$; Ornstein-Uhlenbeck process: $\theta = 0.15$, $\sigma = 0.2$
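A sketch of how these two initialization rules might be applied to a linear layer (PyTorch; the hypothetical `final=True` flag selects the small range for the output layers of the actor and critic):

```python
import math
import torch.nn as nn

def fan_in_uniform_init(layer, final=False):
    """U[-1/sqrt(f), 1/sqrt(f)] with f the fan-in; the final layer uses U[-3e-3, 3e-3]."""
    bound = 3e-3 if final else 1.0 / math.sqrt(layer.weight.size(1))
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)
```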