
Page 1: Markov Decision Processes III + RL

CS 188: Artificial Intelligence
Markov Decision Processes III + RL

Instructor: Nathan Lambert

University of California, Berkeley

Page 2: Markov Decision Processes III + RL

Recap: Defining MDPs

o Markov decision processes:
  o Set of states S
  o Start state s0
  o Set of actions A
  o Transitions P(s'|s,a) (or T(s,a,s'))
  o Rewards R(s,a,s') (and discount γ)

o MDP quantities so far:
  o Policy = choice of action for each state
  o Utility = sum of (discounted) rewards

[Figure: expectimax-style tree rooted at state s, with action nodes (s, a) and outcome nodes (s, a, s') leading to successor states s'.]

Page 3: Markov Decision Processes III + RL

Values of States

o Recursive definition of value:


V^*(s) = \max_a Q^*(s, a)

Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]

V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
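Read as code, these definitions might look like the following minimal sketch (not from the slides); the dictionary layout for T, R, and V is an assumption made here purely for illustration:

```python
# Assumed layout (illustration only):
#   T[(s, a)]      -> list of (s2, prob) pairs, i.e. T(s, a, s')
#   R[(s, a, s2)]  -> reward R(s, a, s')
#   V[s]           -> current value estimate V(s)

def q_value(s, a, T, R, V, gamma):
    """Q(s, a) = sum_{s'} T(s, a, s') [ R(s, a, s') + gamma * V(s') ]"""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

def state_value(s, actions_for_s, T, R, V, gamma):
    """V(s) = max_a Q(s, a)"""
    return max(q_value(s, a, T, R, V, gamma) for a in actions_for_s)
```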

Page 4: Markov Decision Processes III + RL

Value Iteration

o Start with V0(s) = 0: no time steps left means an expected reward sum of zero

o Given vector of Vk(s) values, do one ply of expectimax from each state:

o Repeat until convergence

o (Complexity of each iteration: O(S²A))

V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]
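A minimal value-iteration sketch (an illustration, not the course's reference implementation), assuming the same T and R dictionary layout as the snippet above, plus a list of states and a per-state action list:

```python
# Minimal value-iteration sketch (assumed T/R dictionary layout as above).
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                  # V_0(s) = 0: no time steps left
    while True:
        V_new = {}
        for s in states:
            if not actions[s]:                    # terminal state: nothing to back up
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```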

Page 5: Markov Decision Processes III + RL

Computing Actions from Values

o Let’s imagine we have the optimal values V*(s)

o How should we act?
  o It's not obvious!

o We need to do a mini-expectimax (one step)

o This is called policy extraction, since it gets the policy implied by the values
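One way to write this one-step lookahead as code (a sketch, reusing the assumed T/R layout from the earlier snippets):

```python
# Policy extraction sketch:
#   pi(s) = argmax_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V*(s') ]
def extract_policy(states, actions, T, R, V, gamma):
    policy = {}
    for s in states:
        policy[s] = max(
            actions[s],
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
        )
    return policy
```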

Page 6: Markov Decision Processes III + RL

Policy Evaluation

o How do we calculate the V's for a fixed policy π?

o Idea 1: Turn recursive Bellman equations into updates (like value iteration)

o Efficiency: O(S²) per iteration

o Idea 2: Without the maxes, the Bellman equations are just a linear system
  o Solve with Matlab (or your favorite linear system solver)

V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}_k(s') \right]
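For Idea 2, a small numpy sketch (numpy standing in for Matlab; the T/R dictionary layout is the same assumption carried over from the earlier snippets, and `states` is assumed to be a list):

```python
import numpy as np

# Sketch of "Idea 2": solve (I - gamma * T_pi) V = R_pi as a linear system.
def evaluate_policy_linear(states, policy, T, R, gamma):
    n = len(states)
    A = np.eye(n)                 # will become I - gamma * T_pi
    b = np.zeros(n)               # expected immediate reward under pi
    for i, s in enumerate(states):
        a = policy[s]
        for s2, p in T[(s, a)]:
            A[i, states.index(s2)] -= gamma * p
            b[i] += p * R[(s, a, s2)]
    return np.linalg.solve(A, b)  # V^pi as a vector indexed like `states`
```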

Page 7: Markov Decision Processes III + RL

Policy Iteration

o Evaluation: For fixed current policy π, find values with policy evaluation:
  o Iterate until values converge:

V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi_i(s), s') \left[ R(s, \pi_i(s), s') + \gamma V^{\pi_i}_k(s') \right]

o Improvement: For fixed values, get a better policy using policy extraction
  o One-step look-ahead:

\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]
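Putting the two phases together, a policy-iteration sketch under the same assumed T/R dictionary layout as the earlier snippets:

```python
# Policy-iteration sketch (assumed T/R dictionary layout as above).
def policy_iteration(states, actions, T, R, gamma, tol=1e-6):
    policy = {s: actions[s][0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Evaluation: iterate the fixed-policy Bellman update until values converge.
        while True:
            V_new = {s: sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                            for s2, p in T[(s, policy[s])])
                     for s in states}
            delta = max(abs(V_new[s] - V[s]) for s in states)
            V = V_new
            if delta < tol:
                break
        # Improvement: one-step lookahead (policy extraction) with the evaluated values.
        new_policy = {s: max(actions[s],
                             key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                               for s2, p in T[(s, a)]))
                      for s in states}
        if new_policy == policy:                  # policy is stable: done
            return policy, V
        policy = new_policy
```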

Page 8: Markov Decision Processes III + RL

Comparison

o Both value iteration and policy iteration compute the same thing (all optimal values)

o In value iteration:
  o Every iteration updates both the values and (implicitly) the policy
  o We don't track the policy, but taking the max over actions implicitly recomputes it

o In policy iteration:
  o We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them)
  o After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
  o The new policy will be better (or we're done)

o Both are dynamic programs for solving MDPs

Page 9: Markov Decision Processes III + RL

Convergence when Solving MDPs

o Redefine the value update as a general Bellman utility update
  o Recursive update of utility (sum of discounted rewards)

o How does this converge?
  o Assume a fixed policy π(s).
  o R(s) is the short-term reward of being in s

U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

Page 10: Markov Decision Processes III + RL

Convergence when Solving MDPs

o How does this update rule converge?

o Re-write update:
  o B is a linear operator (like a matrix)
  o U is a vector

o Interested in delta between Utilities:

U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

U_{i+1} \leftarrow B\, U_i

\lVert U_{i+1} - U_i \rVert

\lVert B U_{i+1} - B U_i \rVert \le \gamma \lVert U_{i+1} - U_i \rVert
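A small numerical illustration of the contraction property (the tiny two-state MDP here is made up for the demo, not taken from the slides): applying the Bellman backup to two different utility vectors shrinks their max-norm distance by at least a factor of γ each time.

```python
import random

# Illustration on a made-up 2-state MDP: the Bellman backup B is a gamma-contraction
# in the max norm, so the distance between any two utility vectors shrinks each backup.
gamma = 0.9
states = [0, 1]
actions = {0: ["stay", "go"], 1: ["stay", "go"]}
P = {  # P[(s, a)] -> list of (s', prob)
    (0, "stay"): [(0, 1.0)], (0, "go"): [(1, 0.8), (0, 0.2)],
    (1, "stay"): [(1, 1.0)], (1, "go"): [(0, 0.8), (1, 0.2)],
}
Rw = {0: 0.0, 1: 1.0}  # R(s): short-term reward of being in s

def backup(U):
    return {s: Rw[s] + gamma * max(sum(p * U[s2] for s2, p in P[(s, a)])
                                   for a in actions[s])
            for s in states}

U = {s: random.uniform(-5, 5) for s in states}
W = {s: random.uniform(-5, 5) for s in states}
for _ in range(5):
    print(max(abs(U[s] - W[s]) for s in states))  # max-norm distance, shrinks by <= gamma
    U, W = backup(U), backup(W)
```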

Page 11: Markov Decision Processes III + RL

Convergence when Solving MDPs

o How does this delta converge?

o Utility error estimate is reduced by γ each iteration:
  o Total utilities are bounded:

o Consider the maximum initial error:
  o (Max norm)

o Max error: reduced by the discount each step.

\lVert B U_{i+1} - B U_i \rVert \le \gamma \lVert U_{i+1} - U_i \rVert

|U(s)| \le \sum_{i=0}^{\infty} \gamma^i R_{\max} = \frac{R_{\max}}{1-\gamma}, \quad \text{i.e. } U(s) \in \pm \frac{R_{\max}}{1-\gamma}

\lVert U_0 - U \rVert \le \frac{2 R_{\max}}{1-\gamma}

Page 12: Markov Decision Processes III + RL

Utility Error Bound

o Error at step 0:

o Error at step N:

o Steps for error below ε:

\lVert U_0 - U \rVert \le \frac{2 R_{\max}}{1-\gamma}

\gamma^N \cdot \frac{2 R_{\max}}{1-\gamma} < \epsilon

N = \frac{\log\!\left( \frac{2 R_{\max}}{\epsilon (1-\gamma)} \right)}{\log(1/\gamma)}
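As a quick sanity check on the bound (numbers chosen here for illustration, not from the slides): with γ = 0.9, R_max = 1, and ε = 0.01,

```python
import math

# Illustrative numbers (not from the slides): iterations until the value-iteration
# error bound above drops below epsilon.
gamma, R_max, eps = 0.9, 1.0, 0.01
N = math.log(2 * R_max / (eps * (1 - gamma))) / math.log(1 / gamma)
print(math.ceil(N))   # about 73 iterations for these numbers
```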

Page 13: Markov Decision Processes III + RL

MDP Convergence Visualized

o Value iteration converges exponentially (with discount factor)

o Policy iteration will converge linearly to 0.

Page 14: Markov Decision Processes III + RL

Summary: MDP Algorithms

o So you want to…
  o Compute optimal values: use value iteration or policy iteration
  o Compute values for a particular policy: use policy evaluation
  o Turn your values into a policy: use policy extraction (one-step lookahead)

o These all look the same!
  o They basically are: they are all variations of Bellman updates
  o They all use one-step lookahead expectimax fragments
  o They differ only in whether we plug in a fixed policy or max over actions

Page 15: Markov Decision Processes III + RL

Partially Observed MDPs

o How accurate is our model of MDPs?
  o Example: Robot
  o Example: Video game

o Do the pixels constitute all the information of a system?
  o Notion of observing some states!
  o Belief distribution

o Partially Observed MDP (POMDP)

Page 16: Markov Decision Processes III + RL

Partially Observed MDPs

o How accurate is our model of MDPs?
  o Example: Robot
  o Example: Video game

o Do the pixels constitute all the information of a system?
  o Notion of observing some states!
  o Belief distribution over true states (from observations)

o Partially Observed MDP (POMDP)

Page 17: Markov Decision Processes III + RL

Partially Observed MDPs

o Consider an example:
  o Finite number of pixels
  o Finite number of actions
  o Finite number of timesteps
  o Huge state space.

o Can we solve an MDP with observations instead of states?
  o How is this solved?
  o Deep Q-learning (approximate Q values)
  o Are the transitions and reward functions known?

Page 18: Markov Decision Processes III + RL

Seminal Paper

o Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

Page 19: Markov Decision Processes III + RL

DeepMind Atari (©Two Minute Lectures)


Page 20: Markov Decision Processes III + RL

Reinforcement Learning

Page 21: Markov Decision Processes III + RL

Double Bandits

Page 22: Markov Decision Processes III + RL

Double-Bandit MDP

o Actions: Blue, Red
o States: Win, Lose

[Figure: two-state MDP over W (Win) and L (Lose). From either state, the Blue action pays $1 with probability 1.0; the Red action pays $2 with probability 0.75 and $0 with probability 0.25.]

No discount, 10 time steps

Both states have the same value

Page 23: Markov Decision Processes III + RL

Offline Planning

o Solving MDPs is offline planning

  o You determine all quantities through computation
  o You need to know the details of the MDP
  o You do not actually play the game!

No discount, 10 time steps
Both states have the same value

Value of playing Red: 15
Value of playing Blue: 10

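The two values follow directly from the expected per-step payouts; a tiny check (a sketch, with the payout probabilities read off the diagram above):

```python
# Expected total payout over 10 undiscounted steps for each fixed policy,
# using the payout probabilities from the double-bandit diagram above.
steps = 10
blue_per_step = 1.0 * 1.0               # Blue always pays $1
red_per_step = 0.75 * 2.0 + 0.25 * 0.0  # Red pays $2 w.p. 0.75, $0 w.p. 0.25
print(steps * blue_per_step)            # 10.0
print(steps * red_per_step)             # 15.0
```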

Page 24: Markov Decision Processes III + RL

Let’s Play!

$2 $2 $0 $2 $2 $2 $2 $0 $0 $0

Page 25: Markov Decision Processes III + RL

Online Planning

o Rules changed! Red’s win chance is different.

[Figure: the same double-bandit MDP, but Red's payout probabilities are now unknown (shown as ??); Blue still pays $1 with probability 1.0.]

Page 26: Markov Decision Processes III + RL

Let’s Play!

$0 $0 $0 $2 $2 $2 $0 $0 $0 $2

Page 27: Markov Decision Processes III + RL

What Just Happened?

o That wasn't planning, it was learning!
  o Specifically, reinforcement learning
  o There was an MDP, but you couldn't solve it with just computation
  o You needed to actually act to figure it out

o Important ideas in reinforcement learning that came up
  o Exploration: you have to try unknown actions to get information
  o Exploitation: eventually, you have to use what you know
  o Regret: even if you learn intelligently, you make mistakes
  o Sampling: because of chance, you have to try things repeatedly
  o Difficulty: learning can be much harder than solving a known MDP

Page 28: Markov Decision Processes III + RL

Reinforcement Learning

o Still assume a Markov decision process (MDP):
  o A set of states s ∈ S
  o A set of actions (per state) A
  o A model T(s,a,s')
  o A reward function R(s,a,s')

o Still looking for a policy π(s)

o New twist: don't know T or R
  o I.e., we don't know which states are good or what the actions do
  o Must actually try actions and states out to learn

Page 29: Markov Decision Processes III + RL

Reinforcement Learning

o Basic idea:
  o Receive feedback in the form of rewards
  o Agent's utility is defined by the reward function
  o Must (learn to) act so as to maximize expected rewards
  o All learning is based on observed samples of outcomes!

[Figure: agent-environment loop. The agent sends actions a to the environment; the environment returns the next state s and reward r to the agent.]
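The loop in the figure might look like the following sketch; the `env` object and its reset()/step() interface are hypothetical placeholders introduced here, not anything defined in the lecture.

```python
import random

# Sketch of the agent-environment loop in the figure above.
# `env` is a hypothetical object with reset() -> s and step(a) -> (s', r, done).
def run_episode(env, actions, policy=None):
    s = env.reset()                                           # initial state
    total_reward = 0.0
    done = False
    while not done:
        a = policy(s) if policy else random.choice(actions)   # choose an action
        s, r, done = env.step(a)                              # observe sampled (s', r)
        total_reward += r                                     # utility = sum of rewards
    return total_reward
```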

Page 30: Markov Decision Processes III + RL

Example: Learning to Fly

[Video: quad-fly] [Lambert et al., RA-L 2019]

Page 31: Markov Decision Processes III + RL

Example: Sidewinding

[Andrew Ng] [Video: SNAKE – climbStep+sidewinding]

Page 32: Markov Decision Processes III + RL

Example: Toddler Robot

[Tedrake, Zhang and Seung, 2005] [Video: TODDLER – 40s]

Page 33: Markov Decision Processes III + RL

The Crawler!

[Demo: Crawler Bot (L10D1)] [You, in Project 3]

Page 34: Markov Decision Processes III + RL

Video of Demo Crawler Bot

Page 35: Markov Decision Processes III + RL

Reinforcement Learning

o Still assume a Markov decision process (MDP):
  o A set of states s ∈ S
  o A set of actions (per state) A
  o A model T(s,a,s')
  o A reward function R(s,a,s')

o Still looking for a policy π(s)

o New twist: don't know T or R
  o I.e., we don't know which states are good or what the actions do
  o Must actually try actions and states out to learn
  o Get 'measurement' of R at each step

Page 36: Markov Decision Processes III + RL

Offline (MDPs) vs. Online (RL)

Offline Solution Online Learning

Page 37: Markov Decision Processes III + RL

Model-Free Learning

Page 38: Markov Decision Processes III + RL

Model-Based Learning

Page 39: Markov Decision Processes III + RL

Passive Reinforcement Learning