Machine Learning for Robotics: Intelligent Systems Series
Lecture 7
Georg Martius
MPI for Intelligent Systems, Tübingen, Germany
June 12, 2017
Georg Martius Machine Learning for Robotics June 12, 2017 1 / 31
Markov Decision Processes
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, . . . with the Markov property.

Reminder: Markov property
A state St is Markov if and only if

P(St+1 | St) = P(St+1 | S1, . . . , St)

Definition (Markov Process / Markov Chain)
A Markov Process (or Markov Chain) is a tuple (S, P)
• S is a (finite) set of states
• P is a state transition probability matrix,

Pss′ = P(St+1 = s′ | St = s)
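To make the definition concrete, a Markov chain can be simulated by repeatedly sampling the next state from the row of P belonging to the current state. A minimal sketch; the 3-state chain below (state names and probabilities) is made up for illustration and is not the student chain from the slides:

```python
import numpy as np

# Simulating a Markov chain (S, P). The chain below is illustrative only.
states = ["C1", "C2", "Sleep"]
P = np.array([
    [0.2, 0.6, 0.2],   # transition probabilities out of C1
    [0.1, 0.3, 0.6],   # transition probabilities out of C2
    [0.0, 0.0, 1.0],   # Sleep is absorbing (terminal)
])

def sample_episode(P, start=0, terminal=2, max_steps=100, seed=0):
    """Sample S1, S2, ..., ST by drawing each next state from the row
    of P that belongs to the current state."""
    rng = np.random.default_rng(seed)
    s, episode = start, [start]
    for _ in range(max_steps):
        if s == terminal:
            break
        s = rng.choice(len(P), p=P[s])
        episode.append(s)
    return [states[i] for i in episode]

print(sample_episode(P))
```

Each row of P must sum to 1: row s is exactly the distribution P(St+1 | St = s).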
Example: Student Markov Chain
Example: Student Markov Chain Episodes
Sample episodes starting from S1 = C1:

S1, S2, . . . , ST

• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
Example: Student Markov Chain Transition Matrix
Markov Reward Process
A Markov reward process is a Markov chain with values.

Definition (MRP)
A Markov Reward Process is a tuple (S, P, R, γ)
• S is a finite set of states
• P is a state transition probability matrix,

Pss′ = P(St+1 = s′ | St = s)

• R is a reward function, Rs = E[Rt+1 | St = s]
• γ is a discount factor, γ ∈ [0, 1]

Note that the reward can be stochastic (Rs is an expectation).
Example: Student MRP
Return
Definition
The return Gt is the total discounted reward from time-step t:

Gt = Rt+1 + γRt+2 + · · · = ∑_{k=0}^{∞} γ^k R_{t+k+1}

• The discount γ ∈ [0, 1] devaluates future rewards: a reward R received after k + 1 time-steps is counted as γ^k R.
• Extreme cases:
  γ close to 0 leads to maximizing immediate rewards only
  γ close to 1 leads to far-sighted evaluation
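For a finite episode the sum can be evaluated directly. A small sketch; the reward numbers are made up for illustration:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for one finite episode.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10]                     # R_{t+1}, R_{t+2}, ... (made up)
print(discounted_return(rewards, gamma=0.5))   # -2 - 1 - 0.5 + 1.25 = -2.25
print(discounted_return(rewards, gamma=1.0))   # undiscounted sum: 4
```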
Discussion on discounting
Why is discounting often used?
• It is mathematically convenient to discount rewards (keeps returns finite)
• It is a way to model uncertainty about the future (since the model may not be exact)
• Animal/human behaviour shows a preference for immediate rewards

In some cases undiscounted Markov reward processes (i.e. γ = 1) are considered, e.g. if all sequences terminate.
Value Function
The value function describes the long-term value of a state.

Definition
The state value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[Gt | St = s]
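Since v(s) is defined as an expectation, it can be estimated by averaging the discounted returns of many sampled episodes started in s (a Monte Carlo estimate). A sketch on a made-up 2-state MRP, not the student example:

```python
import numpy as np

# Monte Carlo estimate of v(s): average the discounted returns of many
# episodes started in s. Made-up MRP: state 1 is terminal, state 0 pays
# reward 1 and returns to itself with probability 0.5.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
gamma = 0.9

def mc_value(s, n_episodes=20000, seed=1):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        state, g, discount = s, 0.0, 1.0
        while state != 1:                      # run until the terminal state
            g += discount * R[state]
            discount *= gamma
            state = rng.choice(2, p=P[state])
        total += g
    return total / n_episodes

# Exact value solves v(0) = 1 + 0.9 * 0.5 * v(0), i.e. v(0) = 1/0.55 ≈ 1.818
print(mc_value(0))
```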
Example: Value Function for Student MRP
from David Silver
Bellman Equation (MRP) I
Idea: Make the value computation recursive by splitting the return into:
• the immediate reward
• and the discounted future rewards

v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + · · · | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]

The last line still contains an expectation over St+1. Use the transition matrix to obtain the probabilities of the succeeding states:

v(s) = Rs + γ ∑s′∈S Pss′ v(s′)
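The recursive equation suggests computing v by fixed-point iteration: start from v = 0 and repeatedly apply the Bellman backup v ← R + γPv, which converges for γ < 1 because the backup is a contraction. A sketch on a made-up 3-state MRP (not the student example):

```python
import numpy as np

# Fixed-point iteration on v(s) = R_s + gamma * sum_{s'} P_{ss'} v(s').
# P and R describe a made-up 3-state MRP; state 2 is terminal.
P = np.array([[0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

v = np.zeros(3)
for _ in range(1000):
    v = R + gamma * P @ v   # one Bellman backup for all states at once

print(v)
```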
Example: Bellman Equation for Student MRP
from David Silver
Bellman Equation (MRP) II
Bellman equations in matrix form:
v = R + γPv

where v ∈ R^|S| and R are vectors over states.

The Bellman equation can be solved directly:

v = (I − γP)^−1 R

• computational complexity is O(|S|³)
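The direct solution can be sketched in a few lines, using a linear solver rather than forming the inverse explicitly; the MRP below is made up for illustration:

```python
import numpy as np

# Direct solution of v = R + gamma * P v, i.e. v = (I - gamma P)^{-1} R.
# Made-up MRP; state 2 is terminal with zero reward.
P = np.array([[0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Solving the linear system costs O(|S|^3), so this is only practical
# for small state spaces.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```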
Markov Decision Process
A Markov reward process has no agent: there is no way to influence the system. An MRP with an active agent forms a Markov Decision Process.
• The agent takes decisions by executing actions
• The state is Markovian

Definition (MDP)
A Markov Decision Process is a tuple (S, A, P, R, γ)
• S is a finite set of states
• A is a finite set of actions
• P is a state transition probability matrix,

Pass′ = P(St+1 = s′ | St = s, At = a)

• R is a reward function, Ras = E[Rt+1 | St = s, At = a]
• γ is a discount factor, γ ∈ [0, 1]
Example: Student MDP
from David Silver
How to model decision taking?
The agent has an action-selection function called a policy.

Definition
A policy π is a distribution over actions given states,

π(a|s) = P(At = a | St = s)

• Since the state is Markov, the policy only needs to depend on the current state
• Implication: policies are stationary (independent of time)

An MDP with a given policy turns into an MRP:

Pπss′ = ∑a∈A π(a|s) Pass′

Rπs = ∑a∈A π(a|s) Ras
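The reduction is just two sums over actions weighted by the policy. A sketch with a made-up 2-state, 2-action MDP and a uniform random policy (all numbers are illustrative):

```python
import numpy as np

# Collapsing an MDP with a fixed policy into an MRP:
#   P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s
P_a = np.array([                 # P_a[a, s, s']: per-action transitions
    [[0.8, 0.2], [0.0, 1.0]],    # action 0
    [[0.1, 0.9], [0.5, 0.5]],    # action 1
])
R_a = np.array([[1.0, 0.0],      # R_a[a, s]: per-action rewards
                [0.0, 2.0]])
pi = np.array([[0.5, 0.5],       # pi[s, a]: uniform random policy
               [0.5, 0.5]])

# Sum over actions, weighted by the policy
P_pi = np.einsum("sa,ast->st", pi, P_a)
R_pi = np.einsum("sa,as->s", pi, R_a)

print(P_pi)   # each row is again a probability distribution
print(R_pi)
```

The resulting (P_pi, R_pi) can then be evaluated with the MRP machinery above, e.g. the matrix-form Bellman equation.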
Modelling expected returns in MDP
How good is each state when we follow the policy π?

Definition
The state-value function vπ(s) of an MDP is the expected return when starting from state s and following policy π:

vπ(s) = E[Gt | St = s]

Should we change the policy? How much does choosing a different action change the value?

Definition
The action-value function qπ(s, a) of an MDP is the expected return when starting from state s, taking action a, and then following policy π:

qπ(s, a) = E[Gt | St = s, At = a]
Example: State-Value function for Student MDP
(figure from David Silver's RL lectures, not reproduced here)
Bellman Expectation Equation
Recall the Bellman equation: decompose the expected return into the immediate reward plus the discounted value of the successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1)|St = s]
The action-value function can be similarly decomposed,
qπ(s, a) = Eπ[Rt+1 + γqπ(St+1, At+1)|St = s,At = a]
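Treating this decomposition as an assignment and iterating it converges to vπ for γ < 1 (it is the backup of iterative policy evaluation). A minimal sketch on a made-up two-state, two-action MDP (all numbers invented for illustration):

```python
import numpy as np

# Made-up two-state, two-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)   # uniform random policy

# Iterate the Bellman expectation backup until v is (numerically) a fixed point:
# v(s) <- sum_a pi(a|s) * ( R[s,a] + gamma * sum_s' P[a,s,s'] * v(s') )
v = np.zeros(2)
for _ in range(500):
    v = np.array([sum(pi[s, a] * (R[s, a] + gamma * P[a, s] @ v)
                      for a in range(2))
                  for s in range(2)])
print(v)   # at the fixed point, v satisfies the Bellman expectation equation
```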
Bellman Equation: joint update of vπ and qπ
The value function can be derived from qπ:

vπ(s) = ∑_{a∈A} π(a|s) qπ(s, a)

. . . and q can be computed from the transition model:

qπ(s, a) = R_s^a + γ ∑_{s′∈S} P_{ss′}^a vπ(s′)

Substituting q in v:

vπ(s) = ∑_{a∈A} π(a|s) ( R_s^a + γ ∑_{s′∈S} P_{ss′}^a vπ(s′) )

Substituting v in q:

qπ(s, a) = R_s^a + γ ∑_{s′∈S} P_{ss′}^a ∑_{a′∈A} π(a′|s′) qπ(s′, a′)
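Both conversion formulas are directly computable once the model (P, R) and one of the two value functions are known. A sketch on a made-up two-state, two-action MDP (numbers invented for illustration): compute vπ exactly, derive qπ from it, then recover vπ back from qπ.

```python
import numpy as np

# Made-up two-state, two-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)   # uniform random policy

# v_pi solved exactly from the policy-averaged MRP
P_pi = np.einsum('sa,asn->sn', pi, P)   # P_pi[s,s'] = sum_a pi(a|s) P[a,s,s']
R_pi = (pi * R).sum(axis=1)             # R_pi[s]    = sum_a pi(a|s) R[s,a]
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q from v:  q_pi(s,a) = R_s^a + gamma * sum_s' P_ss'^a v_pi(s')
q = R + gamma * np.einsum('asn,n->sa', P, v)

# v from q:  v_pi(s) = sum_a pi(a|s) q_pi(s,a)  -- recovers v exactly
v_back = (pi * q).sum(axis=1)
print(np.allclose(v_back, v))
```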
Example: Bellman update for v in Student MDP
(figure from David Silver's RL lectures, not reproduced here; uniform random policy π(a|s) = 0.5)
Explicit solution for vπ
Since a policy induces an MRP, vπ can be directly computed (as before):
v = (I− γPπ)−1Rπ
But do we want vπ?
We want to find the optimal policy and its value function!
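In code this is one linear solve. A sketch on a made-up two-state, two-action MDP (numbers invented for illustration); for γ < 1 the matrix I − γPπ is invertible, so the solution exists.

```python
import numpy as np

# Made-up two-state, two-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)   # uniform random policy

# The policy averages the MDP into an MRP:
P_pi = np.einsum('sa,asn->sn', pi, P)   # P_pi[s,s'] = sum_a pi(a|s) P[a,s,s']
R_pi = (pi * R).sum(axis=1)             # R_pi[s]    = sum_a pi(a|s) R[s,a]

# v_pi = (I - gamma * P_pi)^{-1} R_pi  (a solve is preferable to an explicit inverse)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(v)
```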
Optimal Value Function
Definition
The optimal state-value function v∗(s) is the maximum value function over all policies:

v∗(s) = max_π vπ(s)

Definition
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies:

q∗(s, a) = max_π qπ(s, a)

What does it mean?
• v∗ specifies the best possible performance in an MDP
• Knowing v∗ solves the MDP (how? we will see. . . )
Example: Optimal Value Function v∗ in Student MDP
from David Silver
Example: Optimal Action-Value Function q∗ in Student MDP
from David Silver
Optimal Policy
Fully solving the MDP means we also have the optimal policy.
Define a partial ordering over policies
π ≥ π′ if vπ(s) ≥ vπ′(s),∀s
Theorem
For any Markov Decision Process:
• There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
• All optimal policies achieve the optimal state-value function, vπ∗(s) = v∗(s)
• All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)
Finding an Optimal Policy
Given the optimal action-value function q∗, the optimal policy is obtained by maximizing it:

π∗(a|s) = Ja = argmax_{a∈A} q∗(s, a)K
J·K is the Iverson bracket: 1 if the condition is true, otherwise 0.
• There is always a deterministic optimal policy for any MDP
• If we know q∗(s, a), we immediately have the optimal policy (greedy)
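Greedy extraction is one argmax per state. Sketched below on a made-up two-state, two-action MDP (numbers invented for illustration), with q∗ obtained by iterating the Bellman optimality backup for q from the following slide:

```python
import numpy as np

# Made-up two-state, two-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

# q* as the fixed point of q(s,a) <- R_s^a + gamma * sum_s' P_ss'^a max_a' q(s',a')
q = np.zeros((2, 2))
for _ in range(500):
    q = R + gamma * np.einsum('asn,n->sa', P, q.max(axis=1))

# Greedy deterministic policy: pi*(a|s) = [a = argmax_a' q*(s,a')]  (Iverson bracket)
greedy = q.argmax(axis=1)
pi_star = np.zeros((2, 2))
pi_star[np.arange(2), greedy] = 1.0
print(greedy)   # best action in each state
```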
Bellman Equation for optimal value functions
The optimal value functions also satisfy Bellman's optimality equations:

v∗(s) = max_{a∈A} q∗(s, a)

q∗(s, a) = R_s^a + γ ∑_{s′∈S} P_{ss′}^a v∗(s′)

Substituting q in v:

v∗(s) = max_{a∈A} ( R_s^a + γ ∑_{s′∈S} P_{ss′}^a v∗(s′) )

Substituting v in q:

q∗(s, a) = R_s^a + γ ∑_{s′∈S} P_{ss′}^a max_{a′∈A} q∗(s′, a′)
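The optimality backup for v, iterated to its fixed point, is exactly value iteration (one of the methods previewed on the next slide): the backup is a γ-contraction, so it converges to v∗ for γ < 1. A sketch on a made-up two-state, two-action MDP (numbers invented for illustration):

```python
import numpy as np

# Made-up two-state, two-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

# Value iteration: v(s) <- max_a ( R_s^a + gamma * sum_s' P_ss'^a v(s') )
v = np.zeros(2)
for _ in range(500):
    v = (R + gamma * np.einsum('asn,n->sa', P, v)).max(axis=1)

# Recover q* (and hence the greedy optimal policy) from v*
q = R + gamma * np.einsum('asn,n->sa', P, v)
print(v, q.argmax(axis=1))
```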
Solving the Bellman Optimality Equation
The Bellman Optimality Equation is non-linear:
• No closed-form solution (in general)
• Many iterative solution methods:
  – Value Iteration
  – Policy Iteration
  – Q-learning
  – SARSA
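Of the listed methods, Q-learning stands out because it is model-free: it learns q∗ from sampled transitions without ever reading P or R. A minimal tabular sketch on a made-up two-state, two-action MDP (the MDP numbers, step size, exploration rate, and step count are all invented choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up two-state, two-action MDP (numbers invented for illustration);
# the agent only samples transitions and rewards, it never uses P, R directly.
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma, alpha, eps = 0.9, 0.1, 0.2

q = np.zeros((2, 2))
s = 0
for _ in range(60000):
    # epsilon-greedy behaviour policy
    a = int(rng.integers(2)) if rng.random() < eps else int(q[s].argmax())
    s2 = 0 if rng.random() < P[a, s, 0] else 1   # sample the next state
    r = R[s, a]                                  # observe the reward
    # Q-learning update: bootstrap from the greedy value in s2
    q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])
    s = s2

print(q.argmax(axis=1))   # learned greedy actions
```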