CS 598 Statistical Reinforcement Learning
Nan Jiang
Overview
What’s this course about?

• A grad-level seminar course on theory of RL
  • with a focus on sample complexity analyses
  • all about proofs, some perspectives, zero implementation
• A seminar course can be anywhere between students presenting papers and a rigorous course
  • In this course I will deliver most (or all) of the lectures
• No textbook; the material (course notes) is created by myself
  • I will share slides/notes on the course website
Who should take this course?

• This course will be a good fit for you if you…
  • are interested in understanding RL mathematically
  • want to do research in theory of RL or a related area
  • want to work with me
  • are comfortable with maths
• This course will not be a good fit if you…
  • are mostly interested in implementing RL algorithms and making things work; we won’t cover any engineering tricks (which are essential to, e.g., deep RL). Check the other RL courses offered across campus.
Prerequisites

• Maths
  • Linear algebra, probability & statistics, basic calculus
  • Markov chains
  • Optional: stochastic processes, numerical analysis
  • Useful for research: TCS background, empirical processes and statistical learning theory, optimization, online learning
• Exposure to ML
  • e.g., CS 446 Machine Learning
• Experience with RL
Coursework

• Some readings before/after class
• Ad hoc homework to help digest particular material. Deadlines will be lenient and announced when each assignment is given
• Main assignment: course project (work on your own)
  • Baseline: reproduce the theoretical analysis in existing papers
  • Advanced: identify an interesting/challenging extension to the paper and explore the novel research question yourself
  • Or, just work on a novel research question (it must have a significant theoretical component; discuss with me first)
Course project (cont.)

• See the list of references and potential topics on the website
• You will need to submit:
  • A brief proposal (~1/2 page). Tentative deadline: end of February
    • what the topic is and which papers you plan to work on
    • why you chose the topic: what interests you?
    • which aspect(s) will you focus on?
  • Final report: clarity, precision, and brevity are greatly valued. More details to come…
• All documents should be in PDF. The final report should be prepared using LaTeX.
Course project (cont. 2)

Rule of thumb:
1. learn something that interests you
2. teach me something! (I wouldn’t learn if I could not understand your report due to lack of clarity)
3. write a report similar to (or better than!) the notes I will share with you
Contents of the course

• Many important topics in RL will not be covered in depth (e.g., TD). Read more if you want a more comprehensive view of RL.
• The other opportunity to learn what’s not covered in lectures is the project, as the potential project topics are much broader than what’s covered in class.
My goals

• Encourage you to do research in this area
• Introduce useful mathematical tools to you
• Learn from you
Logistics

• Course website: http://nanjiang.cs.illinois.edu/cs598/
  • logistics, links to slides/notes (uploaded after lectures), resources (e.g., textbooks to consult, related courses), and deadlines & announcements
• Time & location: Tue & Thu 12:30-1:45pm, 1304 Siebel
• TA: Jinglin Chen (jinglinc)
• Office hours: by appointment, 3322 Siebel
• Questions about the material: ad hoc meetings after lectures, subject to my availability
• Other things about the class (e.g., the project): by appointment
Introduction to MDPs and RL
Reinforcement Learning (RL): Applications

[Figure: example applications of RL; references on the slide: [Levine et al’16], [Ng et al’03], [Mandel et al’16], [Singh et al’02], [Mnih et al’15], [Silver et al’16], [Lei et al’12], [Tesauro et al’07]]
![Page 63: CS 598 Statistical Reinforcement Learningnanjiang.cs.illinois.edu/files/cs598/slides_intro_s19.pdf · What’s this course about? • A grad-level seminar course on theory of RL •](https://reader034.vdocuments.net/reader034/viewer/2022042101/5e7d6ae75058af22f145f4db/html5/thumbnails/63.jpg)
14
Shortest Path
[Figure: a directed graph from start s0 through nodes b, c, d, e, f to goal g; nodes are states, edges are actions with costs from 1 to 14]
Greedy is suboptimal due to delayed effects. Need long-term planning.
Bellman Equation: V*(d) = min{3 + V*(g) , 2 + V*(f)}
V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2, V*(c) = 4, V*(b) = 5, V*(s0) = 6
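The Bellman equation above can be solved by repeated backups. A minimal Python sketch on a graph of the same shape (the edge costs are made up for illustration, so they need not match the figure exactly):

```python
# Bellman backups on a deterministic shortest-path graph.
# The graph shape follows the lecture's figure, but the edge
# costs here are made up for illustration.
costs = {
    "s0": {"b": 1, "c": 2, "g": 14},
    "b":  {"d": 3},
    "c":  {"d": 2, "e": 2},
    "d":  {"g": 3, "f": 1},
    "e":  {"f": 1},
    "f":  {"g": 1},
    "g":  {},
}

V = {s: float("inf") for s in costs}
V["g"] = 0.0  # goal state costs nothing

# Repeat the backup V(s) = min over edges (s -> s2) of cost + V(s2)
# until it stops changing (at most |S| sweeps suffice here).
for _ in range(len(costs)):
    for s, succ in costs.items():
        if succ:
            V[s] = min(c + V[s2] for s2, c in succ.items())

print(V)  # V["s0"] == 6.0, V["d"] == 2.0 with these costs
```

Each sweep propagates cost-to-go information one more edge backward from the goal, which is exactly the long-term planning the greedy rule misses.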
Stochastic Shortest Path
[Figure: the same graph with stochastic transitions, e.g. one action from c reaches d with probability 0.7 and e with probability 0.3; another transition splits 0.5/0.5]
Markov Decision Process (MDP): states, actions, and a transition distribution
Stochastic Shortest Path
[Figure: the stochastic graph, with the optimal policy π*'s action at each state marked ✓]
Greedy is suboptimal due to delayed effects. Need long-term planning.
Bellman Equation: V*(c) = min{4 + 0.7 × V*(d) + 0.3 × V*(e) , 2 + V*(e)}
V*(g) = 0, V*(f) = 1, V*(d) = 2, V*(e) = 2.5, V*(c) = 4.5, V*(b) = 6, V*(s0) = 6.5
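With stochastic transitions, the backup takes an expectation over next states. A sketch under assumed costs and transition probabilities (only the update for c mirrors the slide's equation; the rest is illustrative):

```python
# Value iteration for a stochastic shortest path. Each action is
# (cost, distribution over next states). The numbers are illustrative;
# only the backup for state c mirrors the slide's Bellman equation.
A = {
    "s0": [(1, {"b": 1.0}), (2, {"c": 1.0}), (14, {"g": 1.0})],
    "b":  [(4, {"d": 1.0})],
    "c":  [(4, {"d": 0.7, "e": 0.3}), (2, {"e": 1.0})],
    "d":  [(1, {"f": 1.0})],
    "e":  [(1, {"f": 0.5, "d": 0.5})],
    "f":  [(1, {"g": 1.0})],
    "g":  [],  # goal: no actions, V*(g) = 0
}

V = {s: 0.0 for s in A}
for _ in range(100):  # sweep Bellman backups until convergence
    for s, acts in A.items():
        if acts:
            # V(s) = min over actions of cost + E[V(next state)]
            V[s] = min(cost + sum(p * V[s2] for s2, p in dist.items())
                       for cost, dist in acts)

print(V)  # V["c"] == 4.5, V["s0"] == 6.5 with these numbers
```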
Stochastic Shortest Path via trial-and-error
Trajectory 1: s0 ↘ c ↗ d → g
Trajectory 2: s0 ↘ c ↗ e → f ↗ g
…
Model-based RL: estimate the transition distribution from the observed trajectories (e.g. empirical estimates 0.72/0.28 and 0.55/0.45 in place of the true probabilities)
Sample complexity: how many trajectories do we need to compute a near-optimal policy?
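Model-based RL's first step, estimating the transition distribution by counting, can be sketched as follows (the trajectories and action names are made up):

```python
from collections import Counter, defaultdict

# Model-based RL, first step: estimate the transition distribution
# by empirical counts over observed (state, action, next state)
# triples. The trajectories and action names here are made up.
trajs = [
    [("c", "up", "d"), ("d", "right", "g")],
    [("c", "up", "e"), ("e", "right", "f"), ("f", "up", "g")],
    [("c", "up", "d"), ("d", "right", "g")],
    [("c", "up", "d"), ("d", "right", "g")],
]

counts = defaultdict(Counter)
for traj in trajs:
    for s, a, s2 in traj:
        counts[(s, a)][s2] += 1

# Normalize counts into an estimated transition distribution P_hat.
P_hat = {sa: {s2: k / sum(c.values()) for s2, k in c.items()}
         for sa, c in counts.items()}

print(P_hat[("c", "up")])  # {'d': 0.75, 'e': 0.25}
```

Planning (e.g. value iteration) in the estimated model then yields a policy; the sample-complexity question is how accurate P_hat must be for that policy to be near-optimal.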
Stochastic Shortest Path via trial-and-error
How many trajectories do we need to compute a near-optimal policy?
• Assume states & actions are visited uniformly
• #trajectories needed ≤ n ⋅ (#state-action pairs), where n is the #samples needed to estimate a multinomial distribution
[Figure: the graph with the unknown transition probabilities marked "?"]
Nontrivial! Need exploration
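The multinomial-estimation subproblem can be checked empirically: the L1 error of the empirical estimate shrinks roughly like 1/√n (the true distribution below is made up):

```python
import random

# How many samples to estimate a multinomial distribution?
# Empirical check with a made-up two-outcome distribution:
# the L1 error of the empirical estimate shrinks roughly like 1/sqrt(n).
random.seed(0)
p = {"d": 0.7, "e": 0.3}

def l1_error(n):
    draws = random.choices(list(p), weights=list(p.values()), k=n)
    return sum(abs(draws.count(s) / n - q) for s, q in p.items())

for n in (10, 100, 10000):
    print(n, round(l1_error(n), 3))
```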
Video game playing
• state st ∈ S
• action at ∈ A
• reward rt = R(st , at), e.g. +20
• transition dynamics P( ⋅ | st , at) (unknown), e.g. random spawn of enemies
• policy π: S → A
• objective: maximize E[ ∑t=1..H rt | π ] (or E[ ∑t=1..∞ γ^(t−1) rt | π ])
[Plots: the effect of the horizon (H = 5 vs. H = 10) and the discount factor (γ = 0.8 vs. γ = 0.9)]
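The two objectives can be evaluated on a single rollout's reward sequence (the rewards below are made up):

```python
# The two objectives, evaluated on one made-up reward sequence:
# the finite-horizon sum and the discounted sum
# (the infinite sum is truncated here at the sequence length).
rewards = [1.0, 0.0, 20.0, 1.0, 1.0]

H = 3
finite = sum(rewards[:H])  # sum_{t=1}^{H} r_t

gamma = 0.9
discounted = sum(gamma ** (t - 1) * r
                 for t, r in enumerate(rewards, start=1))

print(finite, discounted)  # 21.0 and about 18.585
```

Note how the discount γ plays a role similar to the horizon H: both control how much far-future reward matters.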
Video game playing: need generalization
Value function approximation: f(x; θ), where x are state features and θ are parameters
Find θ s.t. f(x; θ) ≈ r + γ ⋅ Ex′|x[ f(x′; θ) ]  ⇒  f( ⋅ ; θ) ≈ V*
[Figure: a parametric model mapping state features x through θ to a value estimate, trained on observed transitions x → x′ with reward r]
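One standard way to find such a θ is TD(0) with a linear approximator, sketched here on a made-up 4-state chain (an illustrative instance, not the course's algorithm):

```python
import random

# TD(0) with a linear value function f(x; theta) = theta0 + theta1 * x,
# chasing the slide's condition f(x) ≈ r + gamma * E[f(x')]. The
# environment is a made-up 4-state chain 0 -> 1 -> 2 -> 3 with
# reward 1 on leaving state 3 (then the episode ends).
random.seed(0)
gamma, alpha = 0.9, 0.05
theta = [0.0, 0.0]

def f(x, th):
    return th[0] + th[1] * x

for _ in range(2000):
    x = random.randint(0, 3)                 # sample a state
    r = 1.0 if x == 3 else 0.0
    # bootstrap from the next state unless the episode ends
    target = r + (0.0 if x == 3 else gamma * f(x + 1, theta))
    td_err = target - f(x, theta)
    theta[0] += alpha * td_err               # gradient of f wrt theta0 is 1
    theta[1] += alpha * td_err * x           # gradient wrt theta1 is x

print(theta)  # f(3, theta) ends up near 1.0
```

Because θ is shared across states, each update generalizes to states never seen, which is the whole point of function approximation.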
Adaptive medical treatment
• State: diagnosis
• Action: treatment
• Reward: progress in recovery
A Machine Learning view of RL
Many Faces of Reinforcement Learning
• Computer Science: Machine Learning
• Engineering: Optimal Control
• Mathematics: Operations Research
• Economics: Bounded Rationality
• Neuroscience: Reward System
• Psychology: Classical/Operant Conditioning
slide credit: David Silver, Lecture 1: Introduction to Reinforcement Learning
Supervised Learning
Given {(x(i), y(i))}, learn f : x ↦ y
• Online version: for round t = 1, 2, …, the learner
• observes x(t)
• predicts ŷ(t)
• receives y(t)
• Want to maximize # of correct predictions
• e.g. classify whether an image shows a dog, a cat, a plane, etc. (multi-class classification)
• Dataset is fixed for everyone
• "Full information setting"
• Core challenge: generalization
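The online protocol above can be written as a loop; here with a deliberately trivial learner and a made-up data stream:

```python
from collections import Counter

# The online supervised-learning protocol, with a deliberately
# trivial learner that predicts the most common label seen so far.
# The data stream is made up for this sketch.
stream = [("img1", "cat"), ("img2", "cat"), ("img3", "dog"),
          ("img4", "cat"), ("img5", "cat")]

counts, correct = Counter(), 0
for x, y in stream:
    # observe x(t), predict y_hat(t)
    y_hat = counts.most_common(1)[0][0] if counts else "cat"
    correct += (y_hat == y)
    counts[y] += 1  # full information: the true y(t) is revealed
print(correct)  # 4 of 5 predictions correct on this stream
```

Full information means the true label arrives regardless of what was predicted, which is exactly what the bandit setting below removes.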
Contextual bandits
For round t = 1, 2, …, the learner
• Given x(t), chooses an action a(t) ∈ A
• Receives reward r(t) ~ R(x(t), a(t)) (i.e., can be random)
• Want to maximize total reward
• You generate your own dataset {(x(t), a(t), r(t))}!
• e.g. for an image, the learner guesses a label and is told only whether it is correct (reward = 1 if correct, 0 otherwise); it never learns the true label.
• e.g. for a user, the website recommends a movie and observes whether the user likes it; it never learns which movies the user really wants to see.
• "Partial information setting"
Contextual Bandits (cont.)
• Simplification: no x → Multi-Armed Bandits (MAB)
• Bandits is a research area in its own right. We will not do a lot of bandits, but may go through some material that has important implications for general RL (e.g., lower bounds).
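Dropping the context gives MAB. A minimal ε-greedy sketch (the three Bernoulli arms and their means are illustrative assumptions, not from the slides) shows the quantity bandit theory typically analyzes: regret against always playing the best arm.

```python
import random

random.seed(1)

MEANS = [0.2, 0.5, 0.8]  # hypothetical (unknown) Bernoulli arm means; arm 2 is best

def eps_greedy(T=5000, eps=0.1):
    pulls = [0] * 3    # how many times each arm was pulled
    total = [0.0] * 3  # total reward collected per arm
    for t in range(T):
        if random.random() < eps:  # explore uniformly at random
            a = random.randrange(3)
        else:                      # exploit the best empirical mean so far
            a = max(range(3), key=lambda i: total[i] / max(pulls[i], 1))
        r = 1.0 if random.random() < MEANS[a] else 0.0
        pulls[a] += 1
        total[a] += r
    return pulls, sum(total)

pulls, reward_sum = eps_greedy()
regret = max(MEANS) * 5000 - reward_sum  # shortfall vs. always playing the best arm
print(pulls, round(regret, 1))
```

With constant ε the regret grows linearly in T (the ε·T exploration rounds keep paying the gap); decaying ε or UCB-style algorithms achieve sublinear regret.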
RL

For round t = 1, 2, …,
• For time step h = 1, 2, …, H, the learner
  • observes x_h(t)
  • chooses a_h(t)
  • receives r_h(t) ~ R(x_h(t), a_h(t))
  • next, x_{h+1}(t) is generated as a function of x_h(t) and a_h(t) (or sometimes, of all previous x’s and a’s within round t)
• Bandits + “delayed rewards/consequences”
• The protocol here is for episodic RL (each round t is an episode).
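The nested protocol (rounds containing H time steps) and the delayed-reward flavor can be sketched on a toy chain. The environment, horizon, and policies below are illustrative assumptions, not from the slides: reward arrives only at the end of the episode, so it is a delayed consequence of every earlier action.

```python
import random

random.seed(2)

H = 5  # horizon: time steps per episode (round)

# Hypothetical chain environment over states 0..H: action 1 moves right, action 0 resets.
def transition(x, a):
    return x + 1 if a == 1 else 0

def reward(x, a):
    # Reward only on the final step into state H: delayed consequences.
    return 1.0 if (x == H - 1 and a == 1) else 0.0

def run_episode(policy):
    x, ret = 0, 0.0
    for h in range(1, H + 1):  # time steps h = 1, ..., H within one round
        a = policy(x, h)       # learner observes x_h, chooses a_h
        ret += reward(x, a)    # receives r_h ~ R(x_h, a_h)
        x = transition(x, a)   # x_{h+1} is generated from (x_h, a_h)
    return ret

always_right = lambda x, h: 1
uniform = lambda x, h: random.randrange(2)

print(run_episode(always_right))  # 1.0: five correct actions in a row reach the goal
wins = sum(run_episode(uniform) for _ in range(1000))
print(wins)  # roughly 1000 / 2**5, since all five actions must be 1
```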
Why statistical RL?

Two types of scenarios in RL research
1. Solving a large planning problem using a learning approach
   - e.g., AlphaGo, video game playing, simulated robotics
   - Transition dynamics (the rules of Go) are known, but there are too many states
   - Run the simulator to collect data
2. Solving a learning problem
   - e.g., adaptive medical treatment
   - Transition dynamics unknown (and too many states)
   - Interact with the environment to collect data
Two types of scenarios in RL research (cont.)
1. Solving a large planning problem using a learning approach
2. Solving a learning problem
• I am more interested in #2; it is more challenging in some aspects.
• Data (real-world interactions) has the highest priority; computation comes second.
• Even for #1, sample complexity lower-bounds computational complexity, so a statistical-first approach is reasonable.
• Caveat to this argument: you can do a lot more in a simulator; see http://hunch.net/?p=8825714
MDP Planning
Infinite-horizon discounted MDPs

An MDP M = (S, A, P, R, γ)
• State space S.
• Action space A. (We will only consider discrete and finite state/action spaces in this course.)
• Transition function P: S×A→∆(S), where ∆(S) is the probability simplex over S, i.e., all non-negative vectors of length |S| that sum to 1.
• Reward function R: S×A→ℝ (deterministic reward function).
• Discount factor γ ∈ [0, 1).
• The agent starts in some state s1, takes action a1, receives reward r1 ~ R(s1, a1), transitions to s2 ~ P(s1, a1), takes action a2, and so on; the process continues indefinitely.
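In code, such an MDP is just a tuple of tables. The 2-state, 2-action numbers below are illustrative assumptions; the check at the end verifies that each P(⋅|s, a) really lies in the simplex ∆(S).

```python
# M = (S, A, P, R, gamma) for a hypothetical 2-state, 2-action MDP.
S = [0, 1]
A = [0, 1]
P = {  # P[(s, a)] is a distribution over next states, i.e., a point in Delta(S)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}  # deterministic rewards
gamma = 0.9  # discount factor in [0, 1)

# Every row of P must be a valid element of Delta(S):
# a non-negative vector of length |S| whose entries sum to 1.
for (s, a), dist in P.items():
    assert len(dist) == len(S)
    assert all(p >= 0 for p in dist)
    assert abs(sum(dist) - 1.0) < 1e-12
print("P: S x A -> Delta(S) is valid")
```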
Value and policy

• Want to take actions in a way that maximizes value (or return): 𝔼[∑_{t=1}^∞ γ^{t−1} r_t]
• This value depends on where you start and how you act.
• Often assume boundedness of rewards: r_t ∈ [0, Rmax].
• What’s the range of 𝔼[∑_{t=1}^∞ γ^{t−1} r_t]? It lies in [0, Rmax/(1−γ)].
• A (deterministic) policy π: S→A describes how the agent acts: at state s_t, it chooses action a_t = π(s_t).
• More generally, the agent may choose actions randomly (π: S→∆(A)), or even in a way that varies across time steps (“non-stationary policies”).
• Define Vπ(s) = 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, π].
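The range follows from the geometric series: with r_t = Rmax at every step, the return is Rmax·∑_{t≥1} γ^{t−1} = Rmax/(1−γ), and no reward sequence in [0, Rmax] can exceed that. A quick numerical check (γ and Rmax chosen arbitrarily for illustration):

```python
gamma, Rmax = 0.9, 1.0
bound = Rmax / (1 - gamma)  # = 10.0 here

# Worst case for the upper bound: r_t = Rmax at every step.
partial = 0.0
for t in range(1, 500):
    partial += gamma ** (t - 1) * Rmax
    assert partial <= bound + 1e-9  # partial sums never exceed Rmax / (1 - gamma)

print(round(partial, 6), bound)  # -> 10.0 10.0 (the truncated sum has converged)
```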
Bellman equation for policy evaluation

Vπ(s) = 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, π]
 = 𝔼[r_1 + ∑_{t=2}^∞ γ^{t−1} r_t | s_1 = s, π]
 = R(s, π(s)) + ∑_{s′∈S} P(s′|s, π(s)) 𝔼[γ ∑_{t=2}^∞ γ^{t−2} r_t | s_1 = s, s_2 = s′, π]
 = R(s, π(s)) + ∑_{s′∈S} P(s′|s, π(s)) 𝔼[γ ∑_{t=2}^∞ γ^{t−2} r_t | s_2 = s′, π]   (by the Markov property)
 = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, π(s)) 𝔼[∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s′, π]
 = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, π(s)) Vπ(s′)
 = R(s, π(s)) + γ ⟨P(·|s, π(s)), Vπ(·)⟩
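The Bellman equation characterizes Vπ as a fixed point, so one way to compute Vπ is to apply its right-hand side repeatedly; that map is a γ-contraction, so the iterates converge. A sketch on an assumed 2-state MDP with a fixed policy (all numbers are illustrative):

```python
gamma = 0.9
# Under a fixed policy pi: P_pi[s][s'] = P(s'|s, pi(s)) and R_pi[s] = R(s, pi(s)).
P_pi = [[0.5, 0.5], [0.2, 0.8]]
R_pi = [1.0, 0.0]

# Iterate V <- R_pi + gamma * P_pi V from V = 0; the error shrinks by gamma each step.
V = [0.0, 0.0]
for _ in range(1000):
    V = [R_pi[s] + gamma * sum(P_pi[s][sp] * V[sp] for sp in (0, 1))
         for s in (0, 1)]

# At convergence, V satisfies the Bellman equation (up to float precision).
for s in (0, 1):
    rhs = R_pi[s] + gamma * sum(P_pi[s][sp] * V[sp] for sp in (0, 1))
    assert abs(V[s] - rhs) < 1e-9
print([round(v, 4) for v in V])
```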
Bellman equation for policy evaluation
38
Matrix form: define
• Vπ as the |S|×1 vector [Vπ(s)]s∈S
• Rπ as the vector [R(s, π(s))]s∈S
• Pπ as the matrix [P(s’|s, π(s))]s∈S, s’∈S
Vπ(s) = R(s, π(s)) + γ⟨P( ⋅ |s, π(s)), Vπ( ⋅ )⟩
Vπ = Rπ + γPπVπ
(I − γPπ)Vπ = Rπ
Vπ = (I − γPπ)−1Rπ
![Page 183: CS 598 Statistical Reinforcement Learningnanjiang.cs.illinois.edu/files/cs598/slides_intro_s19.pdf · What’s this course about? • A grad-level seminar course on theory of RL •](https://reader034.vdocuments.net/reader034/viewer/2022042101/5e7d6ae75058af22f145f4db/html5/thumbnails/183.jpg)
Bellman equation for policy evaluation
38
Matrix form: define
• Vπ as the |S|×1 vector [Vπ(s)]s∈S
• Rπ as the vector [R(s, π(s))]s∈S
• Pπ as the matrix [P(s’|s, π(s))]s∈S, s’∈S
Vπ(s) = R(s, π(s)) + γ⟨P( ⋅ |s, π(s)), Vπ( ⋅ )⟩
Vπ = Rπ + γPπVπ
(I − γPπ)Vπ = Rπ
Vπ = (I − γPπ)−1Rπ
This is always invertible. Proof?
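The linear-system view can be checked numerically. A minimal sketch in NumPy, using a made-up 3-state MDP under a fixed policy (the matrix `P_pi` and vector `R_pi` are illustrative numbers, not from the slides):

```python
import numpy as np

gamma = 0.9  # discount factor

# Hypothetical 3-state MDP under a fixed policy pi (illustrative numbers):
# P_pi[s, s'] = P(s' | s, pi(s)),  R_pi[s] = R(s, pi(s))
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])

# Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
# A linear solve is preferable to forming the inverse explicitly.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Sanity check: V_pi satisfies the Bellman equation V = R_pi + gamma * P_pi V.
print(np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi))  # True
```

In practice one uses `np.linalg.solve` rather than computing (I − γPπ)−1, but the closed form above is what makes the later occupancy interpretation possible.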
State occupancy

Each row (indexed by s) of (I − γPπ)−1 is the discounted state occupancy dπs, whose s′-th entry is

dπs(s′) = 𝔼[ ∑t=1…∞ γt−1 𝕀[st = s′] | s1 = s, π ]

• Each row is like a distribution vector, except that the entries sum up to 1/(1−γ). Let ηπs = (1 − γ)dπs denote the normalized vector.
• Vπ(s) is the dot product between dπs and the reward vector Rπ.
• dπs(s′) can also be interpreted as the value function at s for the indicator reward function 𝕀[ ⋅ = s′].
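Both occupancy facts are easy to verify numerically. A sketch on a made-up 3-state MDP under a fixed policy (all numbers illustrative): the rows of (I − γPπ)−1 sum to 1/(1−γ), and Vπ is the occupancy matrix applied to the reward vector.

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state MDP under a fixed policy (illustrative numbers).
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])

# Row s of M is the discounted occupancy d_s^pi.
M = np.linalg.inv(np.eye(3) - gamma * P_pi)

# Rows sum to 1/(1-gamma): (I - gamma P)^{-1} 1 = sum_t gamma^t P^t 1 = 1/(1-gamma).
print(np.allclose(M.sum(axis=1), 1.0 / (1.0 - gamma)))  # True

# eta_s^pi = (1 - gamma) d_s^pi is a proper probability distribution over states.
eta = (1.0 - gamma) * M
print(np.allclose(eta.sum(axis=1), 1.0))  # True

# V^pi(s) is the dot product <d_s^pi, R^pi>.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(np.allclose(M @ R_pi, V_pi))  # True
```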
Optimality

• For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously.
  Proof: Puterman '94, Thm 6.2.7 (reference due to Shipra Agrawal)
• Let π⋆ denote this optimal policy, and V⋆ := Vπ⋆
• Bellman Optimality Equation:

V⋆(s) = maxa∈A ( R(s, a) + γ𝔼s′∼P(s,a)[V⋆(s′)] )

• If we know V⋆, how to get π⋆?
• Easier to work with Q-values Q⋆(s, a), since

π⋆(s) = arg maxa∈A Q⋆(s, a)

where Q⋆ satisfies

Q⋆(s, a) = R(s, a) + γ𝔼s′∼P(s,a)[ maxa′∈A Q⋆(s′, a′) ]
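Since the Bellman optimality backup is a γ-contraction, iterating it (value iteration) converges to Q⋆, and the greedy policy then reads off π⋆. A minimal sketch on a made-up 2-state, 2-action MDP (all numbers illustrative):

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.7, 0.3],
               [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# Value iteration on Q: each sweep applies the Bellman optimality backup
#   Q(s,a) <- R(s,a) + gamma * E_{s' ~ P(s,a)}[ max_{a'} Q(s', a') ].
Q = np.zeros_like(R)
for _ in range(1000):
    Q = R + gamma * P @ Q.max(axis=1)  # P @ v sums over s'

# Q is (numerically) a fixed point of the backup, i.e. Q ~= Q*.
print(np.allclose(Q, R + gamma * P @ Q.max(axis=1)))  # True

# Greedy policy: pi*(s) = argmax_a Q*(s, a)
pi_star = Q.argmax(axis=1)
```

With γ = 0.9, a thousand sweeps shrink the initial error by a factor of 0.9^1000, so the fixed-point check passes to machine precision.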
Ad Hoc Homework 1

• uploaded on course website
• helps understand the relationships between alternative MDP formulations
• more like readings w/ questions to think about
• no need to submit
• will go through in class next week