
Page 1: Markov Decision Processes III + RL

CS 188: Artificial Intelligence
Markov Decision Processes III + RL

Instructor: Nathan Lambert

University of California, Berkeley

Page 2: Markov Decision Processes III + RL

Recap: Defining MDPs

o Markov decision processes:
  o Set of states S
  o Start state s0
  o Set of actions A
  o Transitions P(s'|s,a) (or T(s,a,s'))
  o Rewards R(s,a,s') (and discount γ)

o MDP quantities so far:
  o Policy = choice of action for each state
  o Utility = sum of (discounted) rewards

[Figure: expectimax-style tree rooted at state s, with action nodes (s, a) and outcome nodes (s, a, s') leading to successor states s'.]

Page 3: Markov Decision Processes III + RL

Values of States

o Recursive definition of value:


V^*(s) = \max_a Q^*(s, a)

Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]

V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
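Read as code, these definitions might look like the following minimal sketch (not from the slides); the dictionary layout for T, R, and V is an assumption made here purely for illustration:

```python
# Assumed layout (illustration only):
#   T[(s, a)]      -> list of (s2, prob) pairs, i.e. T(s, a, s')
#   R[(s, a, s2)]  -> reward R(s, a, s')
#   V[s]           -> current value estimate V(s)

def q_value(s, a, T, R, V, gamma):
    """Q(s, a) = sum_{s'} T(s, a, s') [ R(s, a, s') + gamma * V(s') ]"""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

def state_value(s, actions_for_s, T, R, V, gamma):
    """V(s) = max_a Q(s, a)"""
    return max(q_value(s, a, T, R, V, gamma) for a in actions_for_s)
```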

Page 4: Markov Decision Processes III + RL

Value Iteration

o Start with V0(s) = 0: no time steps left means an expected reward sum of zero

o Given vector of Vk(s) values, do one ply of expectimax from each state:

o Repeat until convergence

o (Complexity of each iteration: O(S²A))

V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]
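A minimal value-iteration sketch (an illustration, not the course's reference implementation), assuming the same T and R dictionary layout as the snippet above, plus a list of states and a per-state action list:

```python
# Minimal value-iteration sketch (assumed T/R dictionary layout as above).
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                  # V_0(s) = 0: no time steps left
    while True:
        V_new = {}
        for s in states:
            if not actions[s]:                    # terminal state: nothing to back up
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```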

Page 5: Markov Decision Processes III + RL

Computing Actions from Values

o Let’s imagine we have the optimal values V*(s)

o How should we act?
  o It's not obvious!

o We need to do a mini-expectimax (one step)

o This is called policy extraction, since it gets the policy implied by the values
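One way to write this one-step lookahead as code (a sketch, reusing the assumed T/R layout from the earlier snippets):

```python
# Policy extraction sketch:
#   pi(s) = argmax_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V*(s') ]
def extract_policy(states, actions, T, R, V, gamma):
    policy = {}
    for s in states:
        policy[s] = max(
            actions[s],
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
        )
    return policy
```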

Page 6: Markov Decision Processes III + RL

Policy Evaluation

o How do we calculate the V's for a fixed policy π?

o Idea 1: Turn recursive Bellman equations into updates (like value iteration)

o Efficiency: O(S²) per iteration

o Idea 2: Without the maxes, the Bellman equations are just a linear system
  o Solve with Matlab (or your favorite linear system solver)

V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') \right]
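For Idea 2, a small numpy sketch (numpy standing in for Matlab; the T/R dictionary layout is the same assumption carried over from the earlier snippets, and `states` is assumed to be a list):

```python
import numpy as np

# Sketch of "Idea 2": solve (I - gamma * T_pi) V = R_pi as a linear system.
def evaluate_policy_linear(states, policy, T, R, gamma):
    n = len(states)
    A = np.eye(n)                 # will become I - gamma * T_pi
    b = np.zeros(n)               # expected immediate reward under pi
    for i, s in enumerate(states):
        a = policy[s]
        for s2, p in T[(s, a)]:
            A[i, states.index(s2)] -= gamma * p
            b[i] += p * R[(s, a, s2)]
    return np.linalg.solve(A, b)  # V^pi as a vector indexed like `states`
```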

Page 7: Markov Decision Processes III + RL

Policy Iteration

o Evaluation: For fixed current policy π, find values with policy evaluation:
  o Iterate until values converge:

V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi_i(s), s') \left[ R(s, \pi_i(s), s') + \gamma V^{\pi_i}_k(s') \right]

o Improvement: For fixed values, get a better policy using policy extraction
  o One-step look-ahead:

\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]
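Putting the two phases together, a policy-iteration sketch under the same assumed T/R dictionary layout as the earlier snippets:

```python
# Policy-iteration sketch (assumed T/R dictionary layout as above).
def policy_iteration(states, actions, T, R, gamma, tol=1e-6):
    policy = {s: actions[s][0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Evaluation: iterate the fixed-policy Bellman update until values converge.
        while True:
            V_new = {s: sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                            for s2, p in T[(s, policy[s])])
                     for s in states}
            delta = max(abs(V_new[s] - V[s]) for s in states)
            V = V_new
            if delta < tol:
                break
        # Improvement: one-step lookahead (policy extraction) with the evaluated values.
        new_policy = {s: max(actions[s],
                             key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                               for s2, p in T[(s, a)]))
                      for s in states}
        if new_policy == policy:                  # policy is stable: done
            return policy, V
        policy = new_policy
```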

Page 8: Markov Decision Processes III + RL

Comparison

o Both value iteration and policy iteration compute the same thing (all optimal values)

o In value iteration:
  o Every iteration updates both the values and (implicitly) the policy
  o We don't track the policy, but taking the max over actions implicitly recomputes it

o In policy iteration:
  o We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
  o After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
  o The new policy will be better (or we're done)

o Both are dynamic programs for solving MDPs

Page 9: Markov Decision Processes III + RL

Convergence when Solving MDPs

o Redefine the value update as a general Bellman utility update
  o Recursive update of utility (sum of discounted rewards)

o How does this converge?
  o Assume a fixed policy π(s).
  o R(s) is the short-term reward of being in s

U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

Page 10: Markov Decision Processes III + RL

Convergence when Solving MDPs

o How does this update rule converge?

o Re-write update:
  o B is a linear operator (like a matrix)
  o U is a vector

o Interested in delta between Utilities:

U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

U_{i+1} \leftarrow B\, U_i

\lVert U_{i+1} - U_i \rVert

\lVert B U_{i+1} - B U_i \rVert \le \gamma \lVert U_{i+1} - U_i \rVert
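A small numerical illustration of the contraction property (the tiny two-state MDP here is made up for the demo, not taken from the slides): applying the Bellman backup to two different utility vectors shrinks their max-norm distance by at least a factor of γ each time.

```python
import random

# Illustration on a made-up 2-state MDP: the Bellman backup B is a gamma-contraction
# in the max norm, so the distance between any two utility vectors shrinks each backup.
gamma = 0.9
states = [0, 1]
actions = {0: ["stay", "go"], 1: ["stay", "go"]}
P = {  # P[(s, a)] -> list of (s', prob)
    (0, "stay"): [(0, 1.0)], (0, "go"): [(1, 0.8), (0, 0.2)],
    (1, "stay"): [(1, 1.0)], (1, "go"): [(0, 0.8), (1, 0.2)],
}
Rw = {0: 0.0, 1: 1.0}  # R(s): short-term reward of being in s

def backup(U):
    return {s: Rw[s] + gamma * max(sum(p * U[s2] for s2, p in P[(s, a)])
                                   for a in actions[s])
            for s in states}

U = {s: random.uniform(-5, 5) for s in states}
W = {s: random.uniform(-5, 5) for s in states}
for _ in range(5):
    print(max(abs(U[s] - W[s]) for s in states))  # max-norm distance, shrinks by <= gamma
    U, W = backup(U), backup(W)
```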

Page 11: Markov Decision Processes III + RL

Convergence when Solving MDPs

o How does this delta converge?

o Utility error estimate is reduced by γ each iteration:
  o Total utilities are bounded:

o Consider the maximum initial error:
  o (Max norm)

o Max error: reduced by the discount each step.

\lVert B U_{i+1} - B U_i \rVert \le \gamma \lVert U_{i+1} - U_i \rVert

|U(s)| \le \sum_{i=0}^{\infty} \gamma^i R_{\max} = \frac{R_{\max}}{1-\gamma}, \quad \text{i.e. } U(s) \in \pm \frac{R_{\max}}{1-\gamma}

\lVert U_0 - U \rVert \le \frac{2 R_{\max}}{1-\gamma}

Page 12: Markov Decision Processes III + RL

Utility Error Bound

o Error at step 0:

o Error at step N:

o Steps for error below ε:

\lVert U_0 - U \rVert \le \frac{2 R_{\max}}{1-\gamma}

\gamma^N \cdot \frac{2 R_{\max}}{1-\gamma} < \epsilon

N = \frac{\log\!\left( \frac{2 R_{\max}}{\epsilon (1-\gamma)} \right)}{\log(1/\gamma)}
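As a quick sanity check on the bound (numbers chosen here for illustration, not from the slides): with γ = 0.9, R_max = 1, and ε = 0.01,

```python
import math

# Illustrative numbers (not from the slides): iterations until the value-iteration
# error bound above drops below epsilon.
gamma, R_max, eps = 0.9, 1.0, 0.01
N = math.log(2 * R_max / (eps * (1 - gamma))) / math.log(1 / gamma)
print(math.ceil(N))   # about 73 iterations for these numbers
```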

Page 13: Markov Decision Processes III + RL

MDP Convergence Visualized

o Value iteration converges exponentially (with discount factor)

o Policy iteration will converge linearly to 0.

Page 14: Markov Decision Processes III + RL

Summary: MDP Algorithms

o So you want to…
  o Compute optimal values: use value iteration or policy iteration
  o Compute values for a particular policy: use policy evaluation
  o Turn your values into a policy: use policy extraction (one-step lookahead)

o These all look the same!
  o They basically are: they are all variations of Bellman updates
  o They all use one-step lookahead expectimax fragments
  o They differ only in whether we plug in a fixed policy or max over actions

Page 15: Markov Decision Processes III + RL

Partially Observed MDPs

o How accurate is our model of MDPs?
  o Example: Robot
  o Example: Video game

o Do the pixels constitute all the information of a system?
  o Notion of observing some states!
  o Belief distribution

o Partially Observed MDP (POMDP)

Page 16: Markov Decision Processes III + RL

Partially Observed MDPs

o How accurate is our model of MDPs?
  o Example: Robot
  o Example: Video game

o Do the pixels constitute all the information of a system?
  o Notion of observing some states!
  o Belief distribution over true states (from observations)

o Partially Observed MDP (POMDP)

Page 17: Markov Decision Processes III + RL

Partially Observed MDPs

o Consider an example:
  o Finite number of pixels
  o Finite number of actions
  o Finite number of timesteps
  o Huge state space.

o Can we solve an MDP with observations instead of states?
  o How is this solved?
  o Deep Q-learning (approximate Q values)
  o Are the transitions and reward functions known?

Page 18: Markov Decision Processes III + RL

Seminal Paper

o Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

Page 19: Markov Decision Processes III + RL

DeepMind Atari (©Two Minute Lectures)


Page 20: Markov Decision Processes III + RL

Reinforcement Learning

Page 21: Markov Decision Processes III + RL

Double Bandits

Page 22: Markov Decision Processes III + RL

Double-Bandit MDP

o Actions: Blue, Red
o States: Win, Lose

[Figure: two-state MDP over W (Win) and L (Lose). From either state, the Blue action pays $1 with probability 1.0; the Red action pays $2 with probability 0.75 and $0 with probability 0.25.]

No discount, 10 time steps

Both states have the same value

Page 23: Markov Decision Processes III + RL

Offline Planning

o Solving MDPs is offline planning

  o You determine all quantities through computation
  o You need to know the details of the MDP
  o You do not actually play the game!

No discount, 10 time steps
Both states have the same value

Value of playing Red: 15
Value of playing Blue: 10

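The two values follow directly from the expected per-step payouts; a tiny check (a sketch, with the payout probabilities read off the diagram above):

```python
# Expected total payout over 10 undiscounted steps for each fixed policy,
# using the payout probabilities from the double-bandit diagram above.
steps = 10
blue_per_step = 1.0 * 1.0               # Blue always pays $1
red_per_step = 0.75 * 2.0 + 0.25 * 0.0  # Red pays $2 w.p. 0.75, $0 w.p. 0.25
print(steps * blue_per_step)            # 10.0
print(steps * red_per_step)             # 15.0
```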

Page 24: Markov Decision Processes III + RL

Let’s Play!

$2 $2 $0 $2 $2 $2 $2 $0 $0 $0

Page 25: Markov Decision Processes III + RL

Online Planning

o Rules changed! Red’s win chance is different.

[Figure: the same double-bandit MDP, but Red's payout probabilities are now unknown (shown as ??); Blue still pays $1 with probability 1.0.]

Page 26: Markov Decision Processes III + RL

Let’s Play!

$0 $0 $0 $2 $2 $2 $0 $0 $0 $2

Page 27: Markov Decision Processes III + RL

What Just Happened?

o That wasn't planning, it was learning!
  o Specifically, reinforcement learning
  o There was an MDP, but you couldn't solve it with just computation
  o You needed to actually act to figure it out

o Important ideas in reinforcement learning that came up
  o Exploration: you have to try unknown actions to get information
  o Exploitation: eventually, you have to use what you know
  o Regret: even if you learn intelligently, you make mistakes
  o Sampling: because of chance, you have to try things repeatedly
  o Difficulty: learning can be much harder than solving a known MDP

Page 28: Markov Decision Processes III + RL

Reinforcement Learning

o Still assume a Markov decision process (MDP):
  o A set of states s ∈ S
  o A set of actions (per state) A
  o A model T(s,a,s')
  o A reward function R(s,a,s')

o Still looking for a policy π(s)

o New twist: don't know T or R
  o I.e., we don't know which states are good or what the actions do
  o Must actually try actions and states out to learn

Page 29: Markov Decision Processes III + RL

Reinforcement Learning

o Basic idea:
  o Receive feedback in the form of rewards
  o Agent's utility is defined by the reward function
  o Must (learn to) act so as to maximize expected rewards
  o All learning is based on observed samples of outcomes!

[Figure: agent-environment loop. The agent sends actions a to the environment; the environment returns the next state s and reward r to the agent.]
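The loop in the figure might look like the following sketch; the `env` object and its reset()/step() interface are hypothetical placeholders introduced here, not anything defined in the lecture.

```python
import random

# Sketch of the agent-environment loop in the figure above.
# `env` is a hypothetical object with reset() -> s and step(a) -> (s', r, done).
def run_episode(env, actions, policy=None):
    s = env.reset()                                           # initial state
    total_reward = 0.0
    done = False
    while not done:
        a = policy(s) if policy else random.choice(actions)   # choose an action
        s, r, done = env.step(a)                              # observe sampled (s', r)
        total_reward += r                                     # utility = sum of rewards
    return total_reward
```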

Page 30: Markov Decision Processes III + RL

Example: Learning to Fly

[Video: quad-fly] [Lambert et al., RA-L 2019]

Page 31: Markov Decision Processes III + RL

Example: Sidewinding

[Andrew Ng] [Video: SNAKE – climbStep+sidewinding]

Page 32: Markov Decision Processes III + RL

Example: Toddler Robot

[Tedrake, Zhang and Seung, 2005] [Video: TODDLER – 40s]

Page 33: Markov Decision Processes III + RL

The Crawler!

[Demo: Crawler Bot (L10D1)] [You, in Project 3]

Page 34: Markov Decision Processes III + RL

Video of Demo Crawler Bot

Page 35: Markov Decision Processes III + RL

Reinforcement Learning

o Still assume a Markov decision process (MDP):
  o A set of states s ∈ S
  o A set of actions (per state) A
  o A model T(s,a,s')
  o A reward function R(s,a,s')

o Still looking for a policy π(s)

o New twist: don't know T or R
  o I.e., we don't know which states are good or what the actions do
  o Must actually try actions and states out to learn
  o Get 'measurement' of R at each step

Page 36: Markov Decision Processes III + RL

Offline (MDPs) vs. Online (RL)

Offline Solution Online Learning

Page 37: Markov Decision Processes III + RL

Model-Free Learning

Page 38: Markov Decision Processes III + RL

Model-Based Learning

Page 39: Markov Decision Processes III + RL

Passive Reinforcement Learning