
  • Reinforcement Learning (Machine Learning, SIR)

    Matthieu Geist (CentraleSupélec), matthieu.geist@centralesupelec.fr

  • The perception-action cycle

    [Figure: The perception-action cycle in reinforcement learning.]

  • Applications

    playing games (Backgammon, Go, Tetris, Atari...)
    robotics
    autonomous acrobatic helicopter control [1]
    operations research (pricing, vehicle routing...)
    human-computer interaction (dialogue management, e-learning...)
    virtually any control problem [2]

    [1] http://heli.stanford.edu/
    [2] An old list: http://umichrl.pbworks.com/w/page/7597597/Successes_of_Reinforcement_Learning

  • Outline

    1 Formalism: Markov Decision Processes; Policy and value function; Bellman operators
    2 Dynamic Programming: Linear programming; Value iteration; Policy iteration
    3 Approximate Dynamic Programming: State-action value function; Approximate value iteration; Approximate policy iteration
    4 Online learning: SARSA and Q-learning; The exploration-exploitation dilemma
    5 Policy search and actor-critic methods: The policy gradient theorem; Actor-critic methods


  • Markov Decision Processes

    A Markov Decision Process (MDP) is a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$ where:

    $\mathcal{S}$ is the (finite) state space;
    $\mathcal{A}$ is the (finite) action space;
    $P \in \Delta_{\mathcal{S}}^{\mathcal{S} \times \mathcal{A}}$ is the Markovian transition kernel. The term $P(s'|s,a)$ denotes the probability of transitioning to state $s'$ given that action $a$ was chosen in state $s$;
    $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ is the reward function; it associates the reward $r(s,a)$ with taking action $a$ in state $s$. The reward function is assumed to be uniformly bounded;
    $\gamma \in (0, 1)$ is a discount factor that favors shorter-term rewards (usually set to a value close to 1).
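
    To make the notation concrete, here is a minimal sketch (not taken from the slides) of one possible numerical representation of the tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$: a numpy array P of shape |S| x |A| x |S|, an array r of shape |S| x |A|, and a scalar gamma. The two-state, two-action MDP below is an invented toy example.

    import numpy as np

    # Invented toy MDP with |S| = 2 states and |A| = 2 actions.
    # P[s, a, s'] = P(s'|s, a): Markovian transition kernel.
    P = np.array([
        [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0, for actions 0 and 1
        [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1, for actions 0 and 1
    ])
    # r[s, a]: (uniformly bounded) reward for taking action a in state s.
    r = np.array([
        [1.0, 0.0],
        [0.0, 2.0],
    ])
    gamma = 0.95                    # discount factor in (0, 1), close to 1

    # Sanity check: each P(.|s, a) is a probability distribution over S.
    assert np.allclose(P.sum(axis=2), 1.0)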


  • Policy and value function

    Policy: $\pi \in \mathcal{A}^{\mathcal{S}}$; in state $s$, an agent applying policy $\pi$ chooses the action $\pi(s)$.

    Value function (quantifies the quality of a policy):
    $$v_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big].$$

    Comparing policies (partial ordering):
    $$\pi_1 \geq \pi_2 \ \Leftrightarrow\ \forall s \in \mathcal{S},\ v_{\pi_1}(s) \geq v_{\pi_2}(s).$$

    Optimal policy:
    $$\pi_* \in \operatorname*{argmax}_{\pi \in \mathcal{A}^{\mathcal{S}}} v_\pi.$$
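
    A minimal sketch of what the value function means operationally, on the same invented toy MDP: $v_\pi(s)$ can be estimated by averaging truncated discounted returns over sampled trajectories that follow $\pi$. The deterministic policy and all numbers below are illustrative assumptions, not from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same invented toy MDP as above.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95

    pi = np.array([0, 1])   # deterministic policy: pi[s] is the action chosen in state s

    def mc_value(s0, n_traj=2000, horizon=200):
        # Monte Carlo estimate of v_pi(s0): average of (truncated) discounted returns
        # of trajectories generated by following pi from s0.
        returns = []
        for _ in range(n_traj):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):        # gamma^200 is negligible, so truncation is safe
                a = pi[s]
                ret += discount * r[s, a]
                discount *= gamma
                s = rng.choice(2, p=P[s, a])
            returns.append(ret)
        return float(np.mean(returns))

    print([mc_value(s) for s in range(2)])  # estimates of v_pi(0) and v_pi(1)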


  • Rewriting the Bellman equation

    $$\begin{aligned}
    v_\pi(s) &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big] \\
    &= r(s, \pi(s)) + \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big] \\
    &= r(s, \pi(s)) + \gamma\, \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(S_{t+1}, \pi(S_{t+1})) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big] \\
    \Leftrightarrow\ v_\pi(s) &= r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi(s))\, v_\pi(s').
    \end{aligned}$$
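
    A small numerical check of the derived identity, assuming the same invented toy MDP and policy as above: compute $v_\pi$ from the definition as a truncated discounted sum, then verify that it (approximately) satisfies the Bellman equation.

    import numpy as np

    # Same invented toy MDP and policy as before.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95
    pi = np.array([0, 1])

    n = P.shape[0]
    P_pi = P[np.arange(n), pi]      # P_pi[s, s'] = P(s'|s, pi(s))
    r_pi = r[np.arange(n), pi]      # r_pi[s]     = r(s, pi(s))

    # v_pi from the definition, truncated: sum_{t=0}^{T-1} gamma^t * (P_pi^t r_pi).
    v = np.zeros(n)
    expected_reward_t = r_pi.copy() # componentwise E[r(S_t, pi(S_t)) | S_0 = s]
    for t in range(500):
        v += gamma**t * expected_reward_t
        expected_reward_t = P_pi @ expected_reward_t

    # Bellman equation residual: should be ~0 up to the truncation error.
    print(np.max(np.abs(v - (r_pi + gamma * P_pi @ v))))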

  • Rewriting the Bellman equation (cont.)

    Define the stochastic matrix $P_\pi \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}$ and the vector $r_\pi \in \mathbb{R}^{\mathcal{S}}$ as
    $$P_\pi = \big(P(s'|s, \pi(s))\big)_{s, s' \in \mathcal{S}} \quad\text{and}\quad r_\pi = \big(r(s, \pi(s))\big)_{s \in \mathcal{S}}.$$

    Using these notations, we have:
    $$v_\pi = r_\pi + \gamma P_\pi v_\pi \ \Leftrightarrow\ v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
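
    Given the closed form $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, a minimal sketch of exact policy evaluation by solving the corresponding linear system, on the same invented toy MDP and policy:

    import numpy as np

    # Same invented toy MDP and policy as before.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95
    pi = np.array([0, 1])

    n = P.shape[0]
    P_pi = P[np.arange(n), pi]      # stochastic matrix induced by pi
    r_pi = r[np.arange(n), pi]      # reward vector induced by pi

    # v_pi = (I - gamma * P_pi)^{-1} r_pi; solving the system is preferable to
    # forming the inverse explicitly.
    v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    print(v_pi)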

  • Bellman evaluation operator

    Define the Bellman evaluation operator $T_\pi : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ as
    $$\forall v \in \mathbb{R}^{\mathcal{S}},\quad T_\pi v = r_\pi + \gamma P_\pi v,$$
    or equivalently, componentwise,
    $$\forall s \in \mathcal{S},\quad [T_\pi v](s) = r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi(s))\, v(s').$$

    $T_\pi$ is a contraction (in supremum norm) and $v_\pi$ is its unique fixed point:
    $$v_\pi = T_\pi v_\pi.$$
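
    Because $T_\pi$ is a contraction with $v_\pi$ as its unique fixed point, repeatedly applying it from any starting vector converges to $v_\pi$. A minimal sketch on the same invented toy MDP and policy:

    import numpy as np

    # Same invented toy MDP and policy as before.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95
    pi = np.array([0, 1])

    n = P.shape[0]
    P_pi = P[np.arange(n), pi]
    r_pi = r[np.arange(n), pi]

    def T_pi(v):
        # (T_pi v)(s) = r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) v(s')
        return r_pi + gamma * P_pi @ v

    v = np.zeros(n)                 # any starting point works, since T_pi is a contraction
    for _ in range(10_000):
        v_new = T_pi(v)
        if np.max(np.abs(v_new - v)) < 1e-10:   # supremum-norm stopping test
            break
        v = v_new
    print(v)                        # converges to the unique fixed point v_pi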

  • Optimal value function and policies

    Assume that $v_* = v_{\pi_*}$ is known; an optimal policy is obtained by acting greedily with respect to $v_*$:
    $$\pi_*(s) \in \operatorname*{argmax}_{a \in \mathcal{A}} \Big( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a)\, v_*(s') \Big).$$

    Characterizing $v_*$:
    $$\forall s \in \mathcal{S},\quad v_*(s) = \max_{a \in \mathcal{A}} \Big( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a)\, v_*(s') \Big).$$
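
    A minimal sketch of acting greedily with respect to a value function, on the same invented toy MDP. The slide's statement concerns $v_*$, but the mechanics are the same for any $v$; the vector used below is a placeholder for illustration, not a computed $v_*$.

    import numpy as np

    # Same invented toy MDP as before.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95

    def greedy(v):
        # pi(s) in argmax_a [ r(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]
        q = r + gamma * P @ v       # q[s, a]; P @ v sums over s'
        return np.argmax(q, axis=1)

    # With v = v_*, the greedy policy is optimal; here we only show the mechanics.
    v = np.array([20.0, 40.0])      # placeholder values
    print(greedy(v))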

  • Bellman optimality operator

    Define the Bellman optimality operator $T_* : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ as
    $$\forall v \in \mathbb{R}^{\mathcal{S}},\quad T_* v = \max_{\pi \in \mathcal{A}^{\mathcal{S}}} (r_\pi + \gamma P_\pi v),$$
    or equivalently, componentwise,
    $$\forall s \in \mathcal{S},\quad [T_* v](s) = \max_{a \in \mathcal{A}} \Big( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a)\, v(s') \Big).$$

    $T_*$ is a contraction (in supremum norm) and $v_*$ is its unique fixed point:
    $$v_* = T_* v_*.$$
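
    Since $v_*$ is the unique fixed point of the contraction $T_*$, iterating $T_*$ converges to $v_*$ (this is the value-iteration scheme announced in the outline). A minimal sketch on the same invented toy MDP:

    import numpy as np

    # Same invented toy MDP as before.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.95

    def T_star(v):
        # (T_* v)(s) = max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]
        return np.max(r + gamma * P @ v, axis=1)

    v = np.zeros(P.shape[0])
    for _ in range(10_000):
        v_new = T_star(v)
        if np.max(np.abs(v_new - v)) < 1e-10:
            break
        v = v_new
    print(v)                                        # approaches v_*
    print(np.argmax(r + gamma * P @ v, axis=1))     # a policy greedy w.r.t. v_*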


  • Dynamic Programming

    DP: solve an MDP when the model is known.
