Introduction to Dueling Network

Latest Trends in Deep Learning: Reinforcement Learning Collaboration, Part 3: Dueling Network

2016/7/5 株式会社ウェブファーマー

大政　孝充

The paper covered this time

[1] Z. Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning,” arXiv:1511.06581, 2016. It improves performance by separating the Q value into the state value V and the advantage of each action a.

For explanations of DQN and DDQN

For DQN, see my [2] “Latest Trends in Deep Learning: Reinforcement Learning Collaboration, Part 1: DQN” http://www.slideshare.net/ssuser07aa33/introduction-to-deep-q-learning. For DDQN, see my [3] “Latest Trends in Deep Learning: Reinforcement Learning Collaboration, Part 2: DDQN” http://www.slideshare.net/ssuser07aa33/introduction-to-double-deep-qlearning.

How the Dueling Network works

From Figure 1 of [1]

This part is the distinguishing feature

DQN

Dueling Network

From DQN to the Dueling Network

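DQN (NIPS 2013)

DQN (Nature 2015): copies Q every so often (target network)

DDQN: separates the Q used for evaluation from the Q used for selection

Prioritized Replay: selects which training data to use?

Dueling Networks: separates the value of state s from the advantage of action a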

First, the basics of reinforcement learning

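The value of the state-action pair: $Q^{\pi}(s,a) = \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]$

The value of the state: $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s,a) \right]$

[Diagram: tree of states $s_t$, $s_{t+1}$, $s_{t+2}$ branching on actions $a_t^1$, $a_t^2$, $a_t^3$, annotated with $Q^{\pi}(s,a)$ and $V^{\pi}(s)$]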

Defining the advantage function

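The value of the state-action pair: $Q^{\pi}(s,a) = \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]$

The value of the state: $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s,a) \right]$

The advantage function: $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$

[Diagram: tree of states $s_t$, $s_{t+1}$, $s_{t+2}$ branching on actions $a_t^1$, $a_t^2$, $a_t^3$, annotated with $Q^{\pi}(s,a)$ and $V^{\pi}(s)$]

It takes the difference: subtract $V^{\pi}$ from $Q^{\pi}$ to obtain $A^{\pi}$.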
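One consequence of this definition is worth spelling out: averaged over the policy, the advantage is zero. The short derivation below uses only the definitions of $Q^{\pi}$, $V^{\pi}$, and $A^{\pi}$ given on this slide, and relies on $V^{\pi}(s)$ not depending on $a$.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi(s)}\left[ A^{\pi}(s,a) \right]
  &= \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s,a) - V^{\pi}(s) \right] \\
  &= \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s,a) \right] - V^{\pi}(s) \\
  &= V^{\pi}(s) - V^{\pi}(s) = 0
\end{aligned}
```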

What is the advantage function?

What does that mean? For example, suppose the Q values of the actions $a_t$ available from state $s_t$ are:

$Q^{\pi}(s,a_1) = 3$

$Q^{\pi}(s,a_2) = 4$

$Q^{\pi}(s,a_3) = 2$


Then V is, roughly (treating the three actions as equally likely):

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[ Q^{\pi}(s,a) \right] = \frac{3+4+2}{3} = 3$
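As a minimal sketch of the averaging above, the snippet below computes V as the policy-weighted mean of the Q values. The uniform policy (each action with probability 1/3) is an assumption that matches the rough calculation on the slide, and the names `state_value`, `q_values`, and `policy` are illustrative only.

```python
from fractions import Fraction


def state_value(q_values, policy):
    """Return V(s) = sum over a of pi(a|s) * Q(s, a)."""
    if len(q_values) != len(policy):
        raise ValueError("q_values and policy must have the same length")
    if sum(policy) != 1:
        raise ValueError("policy probabilities must sum to 1")
    return sum(p * q for p, q in zip(policy, q_values))


# Q values from the example: Q(s, a1) = 3, Q(s, a2) = 4, Q(s, a3) = 2.
q_values = [3, 4, 2]

# Assumption: a uniform policy, so every action has probability 1/3.
policy = [Fraction(1, 3)] * 3

v = state_value(q_values, policy)
print(v)  # 3
```

Under a non-uniform policy the same function gives a different V, which is why the slide's figure of 3 is only a rough value.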


A then becomes:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) = \begin{cases} 3 - 3 = 0 & \cdots\ A^{\pi}(s,a_1) \\ 4 - 3 = 1 & \cdots\ A^{\pi}(s,a_2) \\ 2 - 3 = -1 & \cdots\ A^{\pi}(s,a_3) \end{cases}$$

[Diagram: $V^{\pi}(s)$ together with $A^{\pi}(s,a_1)$, $A^{\pi}(s,a_2)$, $A^{\pi}(s,a_3)$]
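The same arithmetic can be sketched in a few lines of Python. This is an illustration under the assumption of a uniform policy over the three actions; the function name `advantages` and the variable names are hypothetical.

```python
from fractions import Fraction


def advantages(q_values, policy):
    """Return (V(s), [A(s, a) for each a]) where A(s, a) = Q(s, a) - V(s)."""
    v = sum(p * q for p, q in zip(policy, q_values))
    return v, [q - v for q in q_values]


# Q values from the example: Q(s, a1) = 3, Q(s, a2) = 4, Q(s, a3) = 2.
q_values = [3, 4, 2]

# Assumption: a uniform policy over the three actions.
policy = [Fraction(1, 3)] * 3

v, adv = advantages(q_values, policy)
print([str(a) for a in adv])  # ['0', '1', '-1']

# The policy-weighted mean of the advantages is zero.
mean_adv = sum(p * a for p, a in zip(policy, adv))
print(mean_adv)  # 0
```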

The Dueling Network model

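[Diagram: state $s_t$, actions $a_t^1$, $a_t^2$, $a_t^3$, and next states $s_{t+1}$, with $V^{\pi}$, $A^{\pi}$, and $Q^{\pi}$ marked]

$V^{\pi}$ is computed here

$A^{\pi}$ is computed here

Add both to get $Q^{\pi}$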
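A small sketch of why the plain sum Q = V + A needs care: moving a constant from one stream to the other leaves Q unchanged, so Q alone cannot pin down V and A. The numbers reuse the earlier worked example; the shift `c` and the function name `naive_q` are hypothetical, chosen only for illustration.

```python
def naive_q(v, adv):
    """Combine the two streams by plain addition: Q(s, a) = V(s) + A(s, a)."""
    return [v + a for a in adv]


# Stream outputs from the worked example: V = 3, A = [0, 1, -1].
v, adv = 3, [0, 1, -1]
q = naive_q(v, adv)

# Hypothetical shift: move a constant c out of A and into V.
c = 10
q_shifted = naive_q(v + c, [a - c for a in adv])

print(q)          # [3, 4, 2]
print(q_shifted)  # [3, 4, 2]
```

Both decompositions reproduce the same Q values, which is the ambiguity that fixing the mean of A to 0 removes.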

This is how it looks in the actual model

The actual computation

Shift A so that its mean is 0, then add it to V

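$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \left( A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a';\theta,\alpha) \right)$$

The term in parentheses subtracts the mean of A over all actions. Here $\theta$ denotes the shared parameters, $\beta$ the parameters of the value stream, and $\alpha$ the parameters of the advantage stream.

[Diagram labels: $Q(s,a;\theta,\alpha,\beta)$, $V(s;\theta,\beta)$, $A(s,a;\theta,\alpha)$]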
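The aggregation above can be sketched for a single state as follows. This is a plain-Python illustration, not the paper's implementation: the stream outputs `v` and `adv` are hypothetical numbers standing in for what the value and advantage streams would produce, and `dueling_q` is an illustrative name.

```python
def dueling_q(v, adv):
    """Return Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))."""
    if not adv:
        raise ValueError("adv must contain at least one action")
    mean_adv = sum(adv) / len(adv)
    return [v + (a - mean_adv) for a in adv]


# Hypothetical raw stream outputs for one state with three actions.
v = 3.0
adv = [10.0, 11.0, 9.0]  # mean is 10.0

q = dueling_q(v, adv)
print(q)  # [3.0, 4.0, 2.0]

# Adding the same constant to every advantage does not change Q.
q_offset = dueling_q(v, [a + 5.0 for a in adv])
print(q_offset == q)  # True
```

With the mean removed, the average of the resulting Q values over the actions equals V.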

The end
