
Page 1: A Peek into RL

A Peek into RL

Introduction to Reinforcement Learning

Page 2: A Peek into RL

01 Terminology

Basic concepts in RL; the major part of this workshop

02 Markov Process

Markov Decision Process, Markov Reward Process, Bellman Equations…

03 Classic Algorithms

MC, TD, DP… A simple introduction

04 Extensions and Recommendation

Real-life applications; courses and materials for our audience to study independently

Page 3: A Peek into RL

Introduction

What's Reinforcement Learning?

Page 4: A Peek into RL
Page 5: A Peek into RL

Terminologies: Rewards

• Reward Rt: a scalar feedback signal

• Indicates how well the agent is doing at step t

• The agent's job is to maximize cumulative reward (the Reward Hypothesis)

• E.g., a +/−ve reward for winning/losing a game

Page 6: A Peek into RL

Find an optimal way to make decisions

- Sequential decision making

- With delayed rewards / penalties

- No immediate feedback on future / long-term consequences

Page 7: A Peek into RL

- Policy: Agent's behavior function

- Value function: How good each state and/or action is

- Model: Agent's representation of the environment

Page 8: A Peek into RL

Terminologies: Policy

• Policy: Agent's behavior, a map from state to action

• Deterministic policy: a = π(s)

• Stochastic policy: π(a | s) = P[At = a | St = s]
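
To make the distinction concrete, here is a minimal sketch (not from the slides; the two-state setup and action names are made up):

```python
import random

# Hypothetical state and action sets for illustration only.
states = ["s0", "s1"]
actions = ["left", "right"]

# Deterministic policy: a = pi(s), a direct state -> action map.
det_pi = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a|s) = P[At = a | St = s], a distribution per state.
sto_pi = {"s0": {"left": 0.9, "right": 0.1},
          "s1": {"left": 0.4, "right": 0.6}}

def act(policy, s, stochastic=False):
    if not stochastic:
        return policy[s]                  # deterministic lookup
    probs = policy[s]                     # sample from pi(.|s)
    return random.choices(list(probs), weights=probs.values())[0]

print(act(det_pi, "s0"))                  # always "left"
print(act(sto_pi, "s1", stochastic=True)) # "left" with probability 0.4
```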

Page 9: A Peek into RL

Terminologies: Value Function

• Value Function: A prediction of future reward

• Used to evaluate the goodness/badness of states (to select between actions)

• Future reward / return: the total sum of discounted rewards going forward, Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + …

• State value: V_π(s) = E_π[Gt | St = s]

• Action value: Q_π(s, a) = E_π[Gt | St = s, At = a]
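
The return is easy to compute from a reward sequence; a minimal sketch, with γ and the rewards chosen arbitrarily for illustration:

```python
# Discounted return Gt = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # Accumulate from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```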

Page 10: A Peek into RL

Optimal Value and Policy

• The optimal value function produces the maximum return: V*(s) = max_π V_π(s), Q*(s, a) = max_π Q_π(s, a)

• The optimal policy achieves the optimal value functions: π* = argmax_π V_π(s)

• Relationship: V*(s) = max_a Q*(s, a), π*(s) = argmax_a Q*(s, a)
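
Given Q*, these relationships make the optimal policy a simple greedy lookup; a sketch over a hypothetical Q-table:

```python
# Extract V*(s) = max_a Q*(s, a) and pi*(s) = argmax_a Q*(s, a)
# from a hypothetical Q-table (all numbers invented for illustration).
q_star = {("s0", "left"): 1.2, ("s0", "right"): 0.7,
          ("s1", "left"): 0.3, ("s1", "right"): 2.1}

def v_star(s, actions=("left", "right")):
    return max(q_star[(s, a)] for a in actions)

def pi_star(s, actions=("left", "right")):
    return max(actions, key=lambda a: q_star[(s, a)])

print(v_star("s1"), pi_star("s1"))  # 2.1 right
```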

Page 11: A Peek into RL

Terminologies: Model

• A model predicts what the environment will do next

• P predicts the next state

• R predicts the next (immediate) reward
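
One way to picture this is a tabular model stored explicitly; a minimal sketch with invented transition probabilities and rewards:

```python
# A tabular model: P[s][a] is a dict of next-state probabilities,
# R[s][a] is the expected immediate reward (all numbers hypothetical).
P = {"s0": {"go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"go": {"s0": 1.0}}}
R = {"s0": {"go": 0.0},
     "s1": {"go": 5.0}}

# The model "predicts what the environment will do next":
print(P["s0"]["go"])  # {'s0': 0.2, 's1': 0.8}
print(R["s1"]["go"])  # 5.0
```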

Page 12: A Peek into RL

- Knows the model / model-based RL: planning with perfect information

  Find the optimal solution by dynamic programming (DP), e.g., longest increasing subsequence

- Does not know the model: learning with incomplete information

  Model-free RL, or learn the model inside the algorithm

Page 13: A Peek into RL

Categories of RL Algorithms

• Model-based: Relies on a model of the environment; either the model is known or the algorithm learns it explicitly.

• Model-free: No dependency on the model during learning.

• On-policy: Uses the deterministic outcomes or samples from the target policy to train the algorithm.

• Off-policy: Trains on a distribution of transitions or episodes produced by a different behavior policy rather than the target policy.

Page 14: A Peek into RL

Markov Process

MRP, MDP, Bellman Equations

Page 15: A Peek into RL

MDPs

• Markov decision processes (MDPs) formally describe an environment for reinforcement learning where the environment is fully observable

• All states in an MDP have the Markov property: the future depends only on the current state, not the history. A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, ..., St]

Page 16: A Peek into RL

MDPs

• Recap: State transition probability P(s'|s) = P[St+1 = s' | St = s]

• A Markov process is a memoryless random process, i.e., a sequence of random states S1, S2, ... with the Markov property. Represented as a tuple ⟨S, P⟩

Page 17: A Peek into RL

Example: Markov Chain Transition Matrix
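
The slide's concrete matrix is not reproduced in this transcript, but sampling a trajectory works the same way for any row-stochastic matrix; a sketch with a made-up 3-state chain:

```python
import numpy as np

# Row-stochastic transition matrix for a hypothetical 3-state Markov chain:
# P[i, j] = P[S_{t+1} = j | S_t = i].
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    s = rng.choice(3, p=P[s])   # next state depends only on the current one
    trajectory.append(s)
print(trajectory)
```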

Page 18: A Peek into RL

MDPs

• Markov Reward Process (MRP): a Markov chain with values.

• Represented as a tuple ⟨S, P, R, γ⟩

• A Markov decision process (MDP) consists of five elements: M = ⟨S, A, P, R, γ⟩

Page 19: A Peek into RL

Bellman Equations

• Bellman equations refer to a set of equations that decompose the value function into the immediate reward plus the discounted future values: V(s) = E[Rt+1 + γ V(St+1) | St = s]

Page 20: A Peek into RL

Bellman Equations

• Bellman equations refer to a set of equations that decompose the value function into the immediate reward plus the discounted future values.

• Matrix form: V = R + γPV
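
Since the matrix form is linear in V, a small MRP can be solved in closed form, V = (I − γP)^{-1} R; a sketch with hypothetical P and R:

```python
import numpy as np

# Direct solution of the MRP Bellman equation V = R + gamma * P V,
# i.e., V = (I - gamma*P)^{-1} R (hypothetical 3-state MRP).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

V = np.linalg.solve(np.eye(3) - gamma * P, R)  # O(n^3) in the number of states
print(V)
```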

Page 21: A Peek into RL

Bellman Equations

• Linear equations → can be solved directly

• Computational complexity is O(n^3) for n states

• Direct solution is only possible for small MRPs

• There are many iterative methods for large MRPs, e.g.:

Dynamic programming (DP)

Monte-Carlo evaluation (MC)

Temporal-Difference learning (TD)

Page 22: A Peek into RL

Bellman Expectation Equation (For MDPs)

Page 23: A Peek into RL

Bellman Expectation Equations

V_π(s) = Σ_a π(a|s) ( R(s, a) + γ Σ_{s'} P(s'|s, a) V_π(s') )

Q_π(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) Σ_{a'} π(a'|s') Q_π(s', a')

Page 24: A Peek into RL

Bellman Expectation Equations

Matrix form: V_π = R_π + γ P_π V_π

Page 25: A Peek into RL

Bellman Optimality Equations

V*(s) = max_a ( R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') )

- Non-linear (because of the max)

- No closed-form solutions in general

- Many iterative solution methods:

Value Iteration

Policy Iteration

Q-learning

Sarsa
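
As a concrete instance of the first method above, here is a minimal tabular value-iteration sketch (the 2-state, 2-action MDP is made up, not the slides' own example):

```python
import numpy as np

# Value iteration: V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a) + gamma * V(s')]
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([                 # P[a, s, s']: hypothetical dynamics
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.7, 0.3]],
])
R = np.array([[1.0, 0.0],      # R[a, s]: expected reward for action a in state s
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(200):
    Q = R + gamma * P @ V      # Q[a, s] = R[a, s] + gamma * sum_{s'} P[a,s,s'] V[s']
    V_new = Q.max(axis=0)      # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V, Q.argmax(axis=0))     # optimal values and the greedy policy
```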

Page 26: A Peek into RL

Classical Algorithms

Page 27: A Peek into RL

Dynamic Programming

• Policy Evaluation: iteratively apply the Bellman expectation backup, V(s) ← Σ_a π(a|s) ( R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') )

• Policy Improvement: act greedily with respect to the evaluated values, π'(s) = argmax_a Q_π(s, a)

• Policy Iteration: An iterative procedure to improve the policy by combining policy evaluation and improvement; a sketch follows below.
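
A minimal sketch of policy iteration on a hypothetical tabular MDP, alternating exact evaluation with greedy improvement:

```python
import numpy as np

# Policy iteration on a hypothetical 2-state, 2-action MDP.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[a, s, s']
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],                 # R[a, s]
              [0.0, 2.0]])

pi = np.zeros(2, dtype=int)               # start with an arbitrary policy
while True:
    # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
    P_pi = P[pi, np.arange(2)]            # (s, s') dynamics under current policy
    R_pi = R[pi, np.arange(2)]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: greedy with respect to Q(s, a).
    Q = R + gamma * P @ V                 # Q[a, s]
    pi_new = Q.argmax(axis=0)
    if np.array_equal(pi_new, pi):
        break                             # stable policy -> optimal
    pi = pi_new
print(pi, V)
```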

Page 28: A Peek into RL

About the Convergence

Someone raised a good question in the workshop: the convergence property of policy iteration.

The proof in this link would be helpful.

Also, it should be noted that for policy-based RL, only a local optimum is guaranteed.

Here are the pros and cons of the policy-based RL:

Advantages:

- Better convergence properties

- Effective in high-dimensional or continuous action spaces

- Can learn stochastic policies

Disadvantages:

- Typically converge to a local rather than global optimum

- Evaluating a policy is typically inefficient and has high variance

Page 29: A Peek into RL

Monte-Carlo Learning

• MC methods need to learn from complete episodes
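
A minimal first-visit Monte-Carlo evaluation sketch (the episode data and γ are hypothetical); note the update can only happen once the episode has finished:

```python
from collections import defaultdict

# First-visit MC policy evaluation: V(s) is the running mean of the
# returns observed from the first visit to s in each COMPLETE episode.
gamma = 0.9
V, counts = defaultdict(float), defaultdict(int)

def mc_update(episode):
    """episode: list of (state, reward) pairs from one complete episode."""
    g, returns = 0.0, []
    for s, r in reversed(episode):      # compute G_t backwards
        g = r + gamma * g
        returns.append((s, g))
    returns.reverse()
    seen = set()
    for s, g in returns:                # first visit to each state only
        if s in seen:
            continue
        seen.add(s)
        counts[s] += 1
        V[s] += (g - V[s]) / counts[s]  # incremental mean

mc_update([("s0", 0.0), ("s1", 0.0), ("s0", 1.0)])
print(dict(V))
```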

Page 30: A Peek into RL

Temporal Difference Learning

• TD learning can learn from incomplete episodes
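
By contrast, TD(0) bootstraps from the current value estimate after every single transition; a minimal sketch with hypothetical states and numbers:

```python
# TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
# It needs only one transition (s, r, s'), not a complete episode.
gamma, alpha = 0.9, 0.1
V = {"s0": 0.0, "s1": 0.0}

def td0_update(s, r, s_next):
    td_target = r + gamma * V[s_next]   # bootstrap from the current estimate
    V[s] += alpha * (td_target - V[s])  # move toward the TD target

td0_update("s0", 1.0, "s1")
print(V)  # {'s0': 0.1, 's1': 0.0}
```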

Page 31: A Peek into RL

Extensions

- Exploration and Exploitation

- DeepMind AI: Link

- Games

- Different methods to solve the same problem

- More to be explored…

Page 32: A Peek into RL
Page 34: A Peek into RL

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.

THANKS! Do you have any questions?

Contact: Nickname, WISE@CUHK | Subscribe: Google Form | Telegram Group: Join