Reinforcement Learning - Algorithms · Institute of Computing Science, Poznan University of...
Institute of Computing Science
Poznan University of Technology
Reinforcement Learning
Algorithms
Michał Kempka
April 9, 2018
Recap
Michał Kempka | Algorithms
Recap: MDP
How do we formalize reinforcement learning?
Recap: MDP formulas
- $S$: a finite set of states,
- $A$: a finite set of actions,
- $P_a(s, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$: the transition model, i.e. the probability that taking action $a$ in state $s$ leads to state $s'$,
- $R_a(s, s')$: the reward received for moving from state $s$ to state $s'$ by taking action $a$,
- $\gamma \in [0, 1]$: the discount factor, i.e. how much weight we give to the future.
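As a concrete illustration, the five ingredients can be written down as plain Python data. This is only a sketch: the two-state MDP and every name in it (`STATES`, `ACTIONS`, `P`, `R`, `GAMMA`, `is_valid_model`) are hypothetical and do not come from the lecture.

```python
# A hypothetical two-state MDP (illustration only, not from the lecture).
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
GAMMA = 0.9  # discount factor

# P[(s, a)] maps each reachable next state s' to the probability P_a(s, s').
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}

# R[(s, a, s')] is the reward R_a(s, s'); a missing entry means reward 0.
R = {("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 0.5}


def is_valid_model(P, states, actions, tol=1e-9):
    """Check that every (s, a) pair has a proper distribution over next states."""
    for s in states:
        for a in actions:
            dist = P[(s, a)]
            if any(p < 0 for p in dist.values()):
                return False
            if abs(sum(dist.values()) - 1.0) > tol:
                return False
    return True
```

Storing only the reachable next states keeps the tables small when most transitions have probability zero.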
MDP: the solution
$$\max \sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})$$
Analogies to supervised learning:
- actions → classes
- loss → -rewards
- agent → classifier
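For a finite episode, the discounted sum in the objective can be computed as in the sketch below. The function name `discounted_return` is an assumption made for illustration, and the infinite sum is truncated to the rewards actually observed.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Walk backwards so each step is a single multiply-add: G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma = 0` only the first reward counts; with `gamma = 1` all rewards count equally, which is the sense in which the discount factor controls how far ahead the agent looks.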
Bellman equation [1]
State Value
$$V(s) = \max_{a \in A} \sum_{s'} P_a(s, s') \bigl(R_a(s, s') + \gamma V(s')\bigr)$$
Action-State Value (Q-value)
$$Q(s, a) = \sum_{s'} P_a(s, s') \bigl(R_a(s, s') + \gamma \max_{a' \in A} Q(s', a')\bigr)$$
Full Disclosure
I borrowed a little from Wojtek Jaskowski's slides for the MiSIO (ISWD) course.
Policy
A policy is a mapping $\pi$:
$$\pi : S \to A$$
State Value
$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \bigl(R_a(s, s') + \gamma V_\pi(s')\bigr)$$
Action-State Value (Q-value)
$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \bigl(R_a(s, s') + \gamma \max_{a' \in A} Q_\pi(s', a')\bigr)$$
Action-State Value (Q-value) (model-free)
$$a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)$$
Value Iteration
Algorithm: Value Iteration
    given P(s,a,s′), R(s,a,s′); initialize V[.] arbitrarily
    repeat
        V_prev = V
        for s ∈ S do
            for a ∈ A do
                Q(a) = 0
                for s′ ∈ S do
                    Q(a) += P(s,a,s′)(R(s,a,s′) + γV_prev(s′))
                end for
            end for
            V(s) = max_{a∈A} Q(a)
        end for
    until V(.) converges
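A minimal Python sketch of value iteration follows. It assumes the model is stored as dictionaries (`P[(s, a)]` mapping next states to probabilities, `R[(s, a, s')]` giving rewards, missing entries meaning 0). The two-state example, the tolerance `tol`, and all names are hypothetical.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Repeat the Bellman backup until the largest change in V drops below tol."""
    V = {s: 0.0 for s in states}
    while True:
        V_prev = dict(V)
        for s in states:
            # Best expected one-step return over all actions, using the old values.
            V[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V_prev[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions
            )
        if max(abs(V[s] - V_prev[s]) for s in states) < tol:
            return V


# Hypothetical deterministic two-state example: staying in s1 pays 1 per step.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}
R = {("s1", "stay", "s1"): 1.0}
V = value_iteration(states, actions, P, R, gamma=0.5)
```

In this example the fixed point can be checked by hand: V(s1) = 1 + 0.5 V(s1) gives 2, and V(s0) = 0.5 V(s1) gives 1.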
Temporal Difference Learning (TD)
Algorithm: Temporal Difference Learning
    initialize V[.] arbitrarily
    const η ∈ (0,∞) - learning rate
    s = s0 - whatever that is
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose an action somehow
        observe <s, a, s′, r>
        V_target = r + γV(s′)
        TD_error = V(s) − V_target
        V(s) = V(s) − η · TD_error
        s = s′
    until V(.) converges
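The loop can be sketched in Python as below. The environment interface (`step`, `is_terminal`, `policy`), the fixed step budget in place of a convergence test, and the three-state chain are assumptions made for illustration only.

```python
def td0(step, start_state, is_terminal, policy, gamma, eta, n_steps):
    """Tabular TD learning of V for a fixed policy; unseen states count as 0."""
    V = {}
    s = start_state
    for _ in range(n_steps):
        if is_terminal(s):
            s = start_state
        a = policy(s)
        s_next, r = step(s, a)                      # observe <s, a, s', r>
        v_target = r + gamma * V.get(s_next, 0.0)
        td_error = V.get(s, 0.0) - v_target
        V[s] = V.get(s, 0.0) - eta * td_error
        s = s_next
    return V


# Hypothetical chain 0 -> 1 -> 2 (terminal); reaching state 2 pays reward 1.
def step(s, a):
    s_next = s + 1
    return s_next, (1.0 if s_next == 2 else 0.0)


V = td0(step, start_state=0, is_terminal=lambda s: s == 2,
        policy=lambda s: "right", gamma=0.9, eta=0.5, n_steps=400)
```

On this chain the values settle at V(1) = 1 and V(0) = 0.9, since the only reward arrives one step after state 1.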
Temporal Difference Learning (TD)
Algorithm: Temporal Difference Learning
    initialize V[.] arbitrarily
    const η ∈ (0,∞) - learning rate
    s = s0 - whatever that is
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose an action somehow
        observe <s, a, s′, r>
        V(s) = V(s) − η(V(s) − (r + γV(s′)))
        s = s′
    until V(.) converges
What happens if η = 1?
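One way to answer the question is to substitute η = 1 into the update rule and simplify (a short derivation that uses nothing beyond the update itself):

```latex
\begin{align*}
V(s) &\leftarrow V(s) - 1 \cdot \bigl(V(s) - (r + \gamma V(s'))\bigr) \\
     &= r + \gamma V(s')
\end{align*}
```

The old estimate of V(s) is discarded and replaced by the latest one-sample target. In a stochastic environment the estimate then follows the noise of individual transitions instead of averaging over them.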
Exploration
How should we choose actions?
ε-greedy policy! Start with ε ≈ 1:
- with probability ε take a random action,
- with probability 1 − ε take the best action according to your current policy π(s),
- decrease ε as you wish (unless ε = 0).
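The rule can be sketched in a few lines of Python. Representing the current policy by a dictionary of per-action value estimates, and the linear decay schedule with its default numbers, are assumptions for illustration; the slides leave both open.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """action_values: dict mapping each action to its current value estimate."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)               # explore: uniformly random action
    return max(actions, key=action_values.get)   # exploit: best action so far


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """One hypothetical schedule: linear decay from start to end, then constant."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

A typical call would be `epsilon_greedy(values, decayed_epsilon(t))`, where `t` counts the steps taken so far.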
But how do we choose the best action?
How do we choose $a = \pi(s)$ given $V_\pi(s)$?
$$a = \pi(s) = \arg\max_{a \in A} \sum_{s'} P_a(s, s') \bigl(R_a(s, s') + \gamma V_\pi(s')\bigr)$$
We need to have $P$ and $R$!
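The dependence on the model is easy to see in code: the sketch below cannot pick an action from `V` alone, because the one-step lookahead reads both `P` and `R`. The dictionary layout and the two-state example are hypothetical.

```python
def greedy_action_from_v(s, actions, P, R, V, gamma):
    """Pick the action with the best one-step lookahead; requires the model P, R."""
    def backup(a):
        return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2, p in P[(s, a)].items())
    return max(actions, key=backup)


# Hypothetical two-state model and value table.
actions = ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"): {"s0": 1.0},
}
R = {("s1", "stay", "s1"): 1.0}
V = {"s0": 1.0, "s1": 2.0}

best_s0 = greedy_action_from_v("s0", actions, P, R, V, gamma=0.5)
best_s1 = greedy_action_from_v("s1", actions, P, R, V, gamma=0.5)
```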
Q-learning to the rescue!
Algorithm: Q-learning
    initialize Q[..] arbitrarily
    const η ∈ (0,∞) - learning rate
    s = s0 - whatever that is
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose action (ε-greedy)
        observe <s, a, s′, r>
        Q(s,a) = Q(s,a) − η(Q(s,a) − (r + γ max_{a′∈A} Q(s′,a′)))
        s = s′
    until Q(..) converges
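A tabular version of the algorithm might look like the sketch below. The environment interface (`step`, `is_terminal`), the fixed step budget instead of a convergence test, the constant ε, and the small chain environment are all assumptions for illustration.

```python
import random
from collections import defaultdict


def q_learning(step, actions, start_state, is_terminal,
               gamma=0.9, eta=0.5, epsilon=0.3, n_steps=5000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)               # Q[(s, a)], every entry starts at 0
    s = start_state
    for _ in range(n_steps):
        if is_terminal(s):
            s = start_state
        if rng.random() < epsilon:       # epsilon-greedy action choice
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r = step(s, a)           # observe <s, a, s', r>
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] -= eta * (Q[(s, a)] - target)
        s = s_next
    return Q


# Hypothetical chain 0 - 1 - 2 (terminal); entering state 2 pays reward 1.
def step(s, a):
    s_next = s + 1 if a == "right" else max(0, s - 1)
    return s_next, (1.0 if s_next == 2 else 0.0)


Q = q_learning(step, ["left", "right"], start_state=0,
               is_terminal=lambda s: s == 2)
```

Terminal states are never updated, so their Q-values stay at 0 and the target for the last transition reduces to the reward alone.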
How do we choose the best action?
How do we choose $a = \pi(s)$ given $Q_\pi(s, \cdot)$?
$$a = \pi(s) = \arg\max_{a \in A} Q_\pi(s, a)$$
We do not need $P$ and $R$, that is, a model of the world! That is why Q-learning is called model-free and everybody loves it (or at least loved it for many years).
Approximators
Storing Q in LUTs (look-up tables) is prohibitive for real-world problems, so we need parameterized approximators of the Q-values, i.e. functions:
$$Q : Q(s, a, \Theta) \to \mathbb{R}$$
Q-learning to the rescue!
Algorithm: Q-learning
    initialize Θ arbitrarily
    const η ∈ (0,∞) - learning rate
    s = s0 - whatever that is
    repeat
        if s is a terminal state then
            s = s0
        end if
        a = choose action (ε-greedy)
        observe <s, a, s′, r>
        loss = (Q(s,a,Θ) − (r + γ max_{a′∈A} Q(s′,a′,Θ)))²
        Θ = Θ − η ∇_Θ loss
        s = s′
    until Θ converges
SARSA - on-policy learning
Algorithm: SARSA
    initialize Θ arbitrarily
    const η ∈ (0,∞) - learning rate
    s = s0 - whatever that is
    a = choose action in s (ε-greedy)
    repeat
        if s is a terminal state then
            s = s0
            a = choose action in s (ε-greedy)
        end if
        observe <s, a, s′, r>
        a′ = choose action in s′ (ε-greedy)
        loss = (Q(s,a,Θ) − (r + γQ(s′,a′,Θ)))²
        Θ = Θ − η ∇_Θ loss
        s = s′, a = a′
    until Θ converges
Feature extraction
The state space is often huge and complicated, so people did what is done in other areas of ML, namely feature engineering, and our function becomes:
$$Q : Q(\mathrm{extract\_features}(s), a, \Theta) \to \mathbb{R}$$
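As a sketch of what such hand-crafted features might look like, the example below assumes a grid-world state given as `(x, y, goal_x, goal_y)` and one weight vector per action. Both the state layout and the chosen features are hypothetical.

```python
def extract_features(state):
    """Hand-crafted features for a hypothetical grid-world state."""
    x, y, goal_x, goal_y = state
    dx, dy = goal_x - x, goal_y - y
    # Bias term, signed offsets to the goal, and Manhattan distance.
    return [1.0, float(dx), float(dy), float(abs(dx) + abs(dy))]


def q_value(state, action, theta):
    """Q(extract_features(s), a, Theta) with one weight vector per action."""
    features = extract_features(state)
    return sum(w * f for w, f in zip(theta[action], features))
```

The approximator now sees four numbers instead of the raw state, so states at the same offset from the goal share their value estimates.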
Deep Learning
But with the arrival of deep neural networks, things changed a bit!
Atari and DeepMind
Human-level control through deep reinforcement learning
Atari Games
DQN Code
DQN: key ideas
- replay memory
- network freezing
- deep networks
- frame stacking
- frame skipping
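Two of these ideas are plain data structures and can be sketched without any deep-learning library. The class names, the transition layout `(s, a, r, s_next, done)`, and the use of `deque` are assumptions for illustration, not the code from the DQN paper.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


class FrameStack:
    """Keeps the last k frames so that a state carries short-term motion."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        self.frames.clear()
        for _ in range(self.frames.maxlen):    # pad with copies of the first frame
            self.frames.append(frame)
        return tuple(self.frames)

    def push(self, frame):
        self.frames.append(frame)
        return tuple(self.frames)
```

Training then draws mini-batches from the memory with `sample` instead of learning from each transition once, in the order it occurred.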
What next?
- the Berkeley AI course on edX (lectures also on YouTube)
- David Silver's lectures (from DeepMind)
- take a look at AIGym
References
[1] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.