
Page 1 (source: web.mit.edu/dimitrib/www/Slides_RL_and_Optimal_Control.pdf)

Reinforcement Learning and Optimal Control
A Selective Overview

Dimitri P. Bertsekas

Laboratory for Information and Decision Systems
Massachusetts Institute of Technology

2018 CDC

December 2018


Page 2

Reinforcement Learning (RL): A Happy Union of AI and Decision/Control Ideas

[Figure: two complementary bodies of ideas, merged in the late 80s-early 90s — Decision/Control/DP (principle of optimality, Markov decision problems, POMDP, policy iteration, value iteration) and AI/RL (learning through experience, simulation and model-free methods, feature-based representations, A*/games/heuristics).]

Historical highlights
Exact DP, optimal control (Bellman, Shannon, 1950s ...)

First major successes: Backgammon programs (Tesauro, 1992, 1996)

Algorithmic progress, analysis, applications, first books (mid 90s ...)

Machine Learning, BIG Data, Robotics, Deep Neural Networks (mid 2000s ...)

AlphaGo and AlphaZero (DeepMind, 2016, 2017)


Page 3

AlphaGo (2016) and AlphaZero (2017)

AlphaZero (Google DeepMind)

Plays different!

Learned from scratch ... with 4 hours of training!

Plays much better than all chess programs

Same algorithm learned multiple games (Go, Shogi)


Page 4

AlphaZero was Trained Using Self-Generated Data

[Figure: the self-play training cycle — the current policy µ plays games; position “values” and move “probabilities” are represented by a neural network, i.e., a feature-based parametric architecture approximating the cost Jµ; policy improvement by MCTS lookahead minimization then generates an “improved” policy, and the cycle repeats.]

The “current” player plays games that are used to “train” an “improved” player

At a given position, the “move probabilities” and the “value” of a position are approximated by a deep neural net (NN)

Successive NNs are trained using self-generated data and a form of regression

A form of randomized policy improvement, Monte-Carlo Tree Search (MCTS), generates move probabilities

AlphaZero bears similarity to earlier works, e.g., TD-Gammon (Tesauro, 1992), but is more complicated because of the MCTS and the deep NN

The success of AlphaZero is due to a skillful implementation/integration of known ideas, and awesome computational power
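To make this loop concrete, here is a minimal Python sketch of the same pattern on a tiny discounted MDP (no neural network or MCTS): the current policy generates its own simulation data, its cost is fit by regression (here a per-state sample average, i.e., the least-squares fit for one-hot features), and an "improved" policy is obtained by one-step lookahead over the fitted cost. The MDP data, truncation length, and sample counts are hypothetical illustration choices, not from the slides.

import random

STATES, CONTROLS, ALPHA, T = [0, 1, 2], [0, 1], 0.9, 40
# Hypothetical transition probabilities and stage cost (state 2 is the "cheap" state).
P = {(x, u): [(x, 0.7), ((x + 1 + u) % 3, 0.3)] for x in STATES for u in CONTROLS}
g = lambda x, u: 1.0 if x != 2 else 0.0

def step(x, u):
    """Sample the next state under control u."""
    r, acc = random.random(), 0.0
    for y, p in P[(x, u)]:
        acc += p
        if r <= acc:
            return y
    return P[(x, u)][-1][0]

def mc_cost(policy, x0):
    """Truncated Monte Carlo estimate of the discounted cost of the policy, starting at x0."""
    x, total, disc = x0, 0.0, 1.0
    for _ in range(T):
        u = policy[x]
        total += disc * g(x, u)
        disc *= ALPHA
        x = step(x, u)
    return total

policy = {x: 0 for x in STATES}
for _ in range(5):  # self-learning: alternate evaluation (regression on self-generated data) and improvement
    J = {x: sum(mc_cost(policy, x) for _ in range(200)) / 200 for x in STATES}
    policy = {x: min(CONTROLS, key=lambda u, x=x: g(x, u) + ALPHA * sum(p * J[y] for y, p in P[(x, u)]))
              for x in STATES}
print(J, policy)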


Page 5

Approximate DP/RL Methodology is now Ambitious and Universal

Exact DP applies (in principle) to a very broad range of optimization problems:

Deterministic <—> Stochastic

Combinatorial optimization <—> Optimal control w/ infinite state/control spaces

One decision maker <—> Two-player games

... BUT is plagued by the curse of dimensionality and need for a math model

Approximate DP/RL overcomes the difficulties of exact DP by:

Approximation (use neural nets and other architectures to reduce dimension)

Simulation (use a computer model in place of a math model)

State of the art:

Broadly applicable methodology: Can address a broad range of challenging problems (deterministic-stochastic-dynamic, discrete-continuous, games, etc.)

There are no methods that are guaranteed to work for all or even most problems

There are enough methods to try with a reasonable chance of success for most types of optimization problems

Role of the theory: Guide the art, delineate the sound ideas


Page 6

Approximation in Value Space

Central Idea: Lookahead with an approximate cost

Compute an approximation J̃ to the optimal cost function J∗

At the current state, apply the control that attains the minimum in

Current Stage Cost + J̃(Next State)

Multistep lookahead extension

At the current state solve an ℓ-step DP problem using terminal cost J̃

Apply the first control in the optimal policy for the ℓ-step problem

Example approaches to compute J̃:

Problem approximation: Use as J̃ the optimal cost function of a simpler problem

Rollout and model predictive control: Use a single policy iteration, with cost evaluated on-line by simulation or limited optimization

Self-learning/approximate policy iteration (API): Use as J̃ an approximation to the cost function of the final policy obtained through a policy iteration process

Role of neural networks: “Learn” the cost functions of policies in the context of API; “learn” policies obtained by value space approximation
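As a small illustration of the central idea above, here is a minimal Python sketch of one-step lookahead with an approximate cost J̃: at the current state, pick the control minimizing (current stage cost) + J̃(next state), with the expectation over the disturbance estimated by sampling. The helper callables and the toy scalar system in the usage lines are hypothetical stand-ins, not from the slides.

import random

def one_step_lookahead(x, controls, sample_next, J_tilde, num_samples=100):
    """Return the control u minimizing an estimate of E{ g(x, u, w) + J_tilde(f(x, u, w)) }."""
    best_u, best_val = None, float("inf")
    for u in controls(x):
        total = 0.0
        for _ in range(num_samples):
            x_next, g = sample_next(x, u)  # samples w internally; returns (f(x, u, w), g(x, u, w))
            total += g + J_tilde(x_next)
        val = total / num_samples
        if val < best_val:
            best_u, best_val = u, val
    return best_u

# Hypothetical example: scalar system x_{k+1} = x + u + w, stage cost x^2 + u^2,
# and a guessed quadratic approximation J_tilde(x) = 2 x^2.
controls = lambda x: [-1.0, -0.5, 0.0, 0.5, 1.0]
sample_next = lambda x, u: (x + u + random.gauss(0.0, 0.1), x**2 + u**2)
print(one_step_lookahead(2.0, controls, sample_next, J_tilde=lambda x: 2.0 * x**2))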


Page 7

Aims and References of this Talk

The purpose of this talk
To selectively review some of the methods, and bring out some of the AI-DP connections

References
Quite a few exact DP books (1950s-present, starting with Bellman; my latest book “Abstract DP” came out earlier this year)

Quite a few DP/Approximate DP/RL/Neural Nets books (1996-present):
Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, 1996
Sutton and Barto, 1998, Reinforcement Learning (new edition 2019, draft on-line)
NEW DRAFT BOOK: Bertsekas, Reinforcement Learning and Optimal Control, 2019, on-line

Many surveys on all aspects of the subject; Tesauro's papers on computer backgammon, and Silver et al. papers on AlphaZero


Page 8

Terminology in RL/AI and DP/Control

RL uses Max/Value, DP uses Min/Cost
Reward of a stage = (Opposite of) Cost of a stage.

State value = (Opposite of) State cost.

Value (or state-value) function = (Opposite of) Cost function.

Controlled system terminology
Agent = Decision maker or controller.

Action = Control.

Environment = Dynamic system.

Methods terminology
Learning = Solving a DP-related problem using simulation.

Self-learning (or self-play in the context of games) = Solving a DP problem using simulation-based policy iteration.

Planning vs Learning distinction = Solving a DP problem with model-based vs model-free simulation.


Page 9

Outline

1 Approximation in Value Space

2 Problem Approximation

3 Rollout and Model Predictive Control

4 Parametric Approximation - Neural Networks

5 Neural Networks and Approximation in Value Space

6 Model-free DP in Terms of Q-Factors

7 Policy Iteration - Self-Learning


Page 10

Finite Horizon Problem - Exact DP

[Figure: closed-loop system — the system xk+1 = fk(xk, uk, wk), driven by the disturbance wk, is controlled by µk via uk = µk(xk).]

System: xk+1 = fk(xk, uk, wk), k = 0, 1, . . . , N − 1,

where xk: state, uk: control, wk: random disturbance

Cost function:

E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) }

Perfect state information: uk is applied with (exact) knowledge of xk

Optimization over feedback policies {µ0, . . . , µN−1}: Rules that specify the control µk(xk) to apply at each possible state xk that can occur


Page 11

The DP Algorithm and Approximation in Value Space

Go backwards, k = N − 1, . . . , 0, using

JN(xN) = gN(xN)

Jk(xk) = min_uk E_wk{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

Jk(xk): Optimal cost-to-go starting from state xk

Approximate DP is motivated by the ENORMOUS computational demands of exact DP

Approximation in value space: Use an approximate cost-to-go function J̃k+1

µk(xk) ∈ arg min_uk E_wk{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) }

There is also a multistep lookahead version

At state xk solve an ℓ-step DP problem with terminal cost function approximation J̃k+ℓ. Use the first control in the optimal ℓ-step sequence.
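For contrast with the approximate version, here is a minimal Python sketch of the exact backward recursion on a small finite problem; replacing Jk+1 by an approximation J̃k+1 in the inner minimization gives the one-step approximation above. The toy inventory data in the last lines are hypothetical, chosen only so the code runs.

def exact_dp(N, states, controls, f, g, gN, w_dist):
    """Backward recursion: J_N = gN;  J_k(x) = min_u E_w{ g(k, x, u, w) + J_{k+1}(f(k, x, u, w)) }."""
    J = [dict() for _ in range(N + 1)]
    policy = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)
    for k in range(N - 1, -1, -1):
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)]) for w, p in w_dist)
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], policy[k][x] = best_val, best_u
    return J, policy

# Hypothetical toy inventory problem: order u, demand w, stock clipped to {0, 1, 2}.
f = lambda k, x, u, w: max(0, min(2, x + u - w))
g = lambda k, x, u, w: u + 2.0 * max(0, w - x - u)  # ordering cost plus shortage penalty
J, policy = exact_dp(N=3, states=[0, 1, 2], controls=lambda x: [0, 1, 2],
                     f=f, g=g, gN=lambda x: 0.0, w_dist=[(0, 0.3), (1, 0.5), (2, 0.2)])
print(J[0], policy[0])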


Page 12

Approximation in Value Space Methods

One-step case at state xk:

min_uk E{ gk(xk, uk, wk) + J̃k+1(xk+1) }

Multistep (ℓ-step) case at state xk:

min_{uk, µk+1, ..., µk+ℓ−1} E{ gk(xk, uk, wk) + Σ_{m=k+1}^{k+ℓ−1} gm(xm, µm(xm), wm) + J̃k+ℓ(xk+ℓ) }

The first ℓ steps are optimized explicitly; the “future” is summarized by the terminal cost approximation J̃k+ℓ.

Approximations:
In the lookahead minimization: approximate minimization; replacement of E{·} with nominal values (certainty equivalent control); limited/adaptive simulation (Monte Carlo tree search).
In the computation of J̃k+1 or J̃k+ℓ (which could itself be approximate): simple choices; parametric approximation; problem approximation; aggregation; rollout; model predictive control.

[Figures: ℓ-step lookahead from the current state xk over states xk+1, xk+2 (selective-depth/adaptive simulation tree, MCTS lookahead minimization, rollout, cost-to-go approximation at the leaves); a feature-based cost approximation r′φ(x, v) built from feature extraction and a linear weighting of features, as in a chess position evaluator with material balance, mobility, and safety features.]
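Here is a minimal Python sketch of the ℓ-step lookahead minimization above, for a problem with finitely many disturbance values so the expectation is an exact weighted sum; all problem data are hypothetical placeholders. In a receding-horizon scheme only the returned first control is applied, and the computation is repeated at the next state.

def multistep_lookahead(x, k, ell, controls, f, g, w_dist, J_tilde):
    """Value and minimizing first control of the ℓ-step lookahead problem at (k, x):
    min over uk, µ_{k+1}, ..., µ_{k+ℓ-1} of E{ sum of stage costs + J_tilde(x_{k+ℓ}) }."""
    if ell == 0:
        return J_tilde(x), None
    best_u, best_val = None, float("inf")
    for u in controls(x):
        val = 0.0
        for w, p in w_dist:
            x_next = f(k, x, u, w)
            tail, _ = multistep_lookahead(x_next, k + 1, ell - 1, controls, f, g, w_dist, J_tilde)
            val += p * (g(k, x, u, w) + tail)
        if val < best_val:
            best_u, best_val = u, val
    return best_val, best_u

# Hypothetical toy usage (inventory-style data) with ℓ = 3 and J_tilde ≡ 0:
f = lambda k, x, u, w: max(0, min(2, x + u - w))
g = lambda k, x, u, w: u + 2.0 * max(0, w - x - u)
print(multistep_lookahead(1, 0, 3, lambda x: [0, 1, 2], f, g,
                          [(0, 0.3), (1, 0.5), (2, 0.2)], lambda x: 0.0))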


Page 13

Problem Approximation: Simplify the Tail Problem and Solve it Exactly

Use as cost-to-go approximation J̃k+1 the exact cost-to-go of a simpler problem

Many problem-dependent possibilities:

Probabilistic approximation
- Certainty equivalence: Replace stochastic quantities by deterministic ones (makes the lookahead minimization deterministic); see the sketch after this list
- Approximate expected values by limited simulation
- Partial versions of certainty equivalence

Enforced decomposition of coupled subsystems
- One-subsystem-at-a-time optimization
- Constraint decomposition
- Lagrangian relaxation

Aggregation: Group states together and view the groups as aggregate states
- Hard aggregation: J̃k+1 is a piecewise constant approximation to Jk+1
- Feature-based aggregation: The aggregate states are defined by “features” of the original states
- Biased hard aggregation: J̃k+1 is a piecewise constant local correction to some other approximation, e.g., one provided by a neural net
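As an illustration of the certainty equivalence item in the list above, a minimal Python sketch: the disturbance is fixed at a nominal value w_bar, so the one-step lookahead minimization becomes deterministic. The callables and w_bar are hypothetical placeholders, not from the slides.

def certainty_equivalent_control(x, k, controls, f, g, J_tilde, w_bar):
    """One-step lookahead with the disturbance replaced by its nominal value w_bar."""
    return min(controls(x), key=lambda u: g(k, x, u, w_bar) + J_tilde(f(k, x, u, w_bar)))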


Page 14

Rollout: On-Line Simulation-Based Approximation in Value Space

[Figure: rollout from the current state xk — a lookahead tree over states xk+1, xk+2 is evaluated by adaptive simulation of a heuristic/suboptimal base policy, with a terminal cost approximation; variants include limited, variable-length, and selective-depth rollout, and Monte Carlo tree search.]

If Jµ

!F (i), r

"=

#sℓ=1 Fℓ(i)rℓ it is a linear feature-based architecture

(r1, . . . , rs: Scalar weights)

Wp: Functions J ≥ Jp with J(xk) → 0 for all p-stable π

Wp′ : Functions J ≥ Jp′ with J(xk) → 0 for all p′-stable π

W+ =$J | J ≥ J+, J(t) = 0

%

1

The base policy can be any suboptimal policy (obtained by another method)

One-step or multistep lookahead; exact minimization or a "randomized form of lookahead" that involves "adaptive" simulation and Monte Carlo tree search

With or without terminal cost approximation (obtained by another method)

Some forms of model predictive control can be viewed as special cases (base policy is a short-term deterministic optimization)

Important theoretical fact: With exact lookahead and no terminal cost approximation, the rollout policy improves over the base policy
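A minimal sketch of one-step rollout in this spirit, assuming generic user-supplied callables (controls, step, stage_cost, base_policy, terminal_cost) and Gaussian disturbances; all names are illustrative assumptions, not the implementation behind the slides:

```python
import random

def rollout_control(x, controls, step, stage_cost, base_policy,
                    horizon, terminal_cost=lambda x: 0.0, num_sims=10):
    """For each candidate control at state x, simulate the base policy for the
    remaining stages, add a terminal cost approximation, and pick the control
    with the smallest average sampled cost (Monte Carlo rollout)."""
    best_u, best_cost = None, float("inf")
    for u in controls(x):
        total = 0.0
        for _ in range(num_sims):
            w = random.gauss(0.0, 1.0)              # sampled disturbance (assumption)
            cost = stage_cost(x, u, w)
            xs = step(x, u, w)
            for _ in range(horizon - 1):            # follow the suboptimal base policy
                ub = base_policy(xs)
                wb = random.gauss(0.0, 1.0)
                cost += stage_cost(xs, ub, wb)
                xs = step(xs, ub, wb)
            total += cost + terminal_cost(xs)
        avg = total / num_sims
        if avg < best_cost:
            best_u, best_cost = u, avg
    return best_u
```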

Bertsekas (M.I.T.) Reinforcement Learning 17 / 33


Example of Rollout: Backgammon (Tesauro, 1996)

[Figure: from the current position, each of the possible moves is scored by its average score under Monte Carlo simulation of the base player.]

Base policy was a backgammon player developed by a different RL method [TD(λ) trained with a neural network]; it was also used for terminal cost approximation

The best backgammon players are based on rollout ... but are too slow for real-time play (MC simulation takes too long)

AlphaGo has a similar structure to the backgammon program

The base policy and terminal cost approximation are obtained with a deep neural net

In AlphaZero the rollout-with-base-policy part was dropped (long lookahead suffices)

Bertsekas (M.I.T.) Reinforcement Learning 18 / 33


Parametric Approximation in Value Space

min_{u_k, µ_{k+1}, ..., µ_{k+ℓ−1}} E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, µ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

[Figure: the first ℓ steps are handled by the lookahead minimization; the "future" beyond stage k + ℓ is summarized by the cost-to-go approximation J̃_{k+ℓ}.]


Approximations to the DP minimization: replace E{·} with nominal values (certainty equivalent control); limited simulation (Monte Carlo tree search)

Computation of J̃_{k+ℓ}: simple choices; parametric approximation; problem approximation; rollout
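As a concrete illustration of the first simplification, here is a minimal sketch of certainty-equivalent one-step lookahead, assuming hypothetical callables controls, step, stage_cost, and J_tilde (none of these names come from the slides):

```python
def ce_one_step_lookahead(x, controls, step, stage_cost, J_tilde, w_nominal=0.0):
    """Certainty-equivalent one-step lookahead: replace the expectation over the
    disturbance w with a single nominal value, then minimize the stage cost plus
    the cost-to-go approximation J_tilde at the resulting next state."""
    return min(controls(x),
               key=lambda u: stage_cost(x, u, w_nominal)
                             + J_tilde(step(x, u, w_nominal)))
```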

[Figure: at the current state xk — lookahead minimization (possibly MCTS), rollout (simulation with a fixed policy, Monte Carlo tree search), and a cost-to-go / parametric approximation at the end; tail problem approximation and constraint relaxation are also indicated.]

J̃_k comes from a class of functions J̃_k(x_k, r_k), where r_k is a tunable parameter vector

Feature-based architectures: The linear case

[Figure: State i → Feature Extraction Mapping → Feature Vector φ(i) → Linear Cost Approximator φ(i)′r.]
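A minimal sketch of such a linear feature-based cost approximator; the quadratic feature map below is a made-up example, not one from the slides:

```python
import numpy as np

def linear_cost_approximator(features, r):
    """Linear feature-based architecture: J_tilde(x, r) = r' phi(x), where
    `features` maps a state x to its feature vector phi(x) and r holds the
    scalar weights r_1, ..., r_s."""
    return lambda x: float(np.dot(r, features(x)))

# Hypothetical example: quadratic features of a scalar state.
features = lambda x: np.array([1.0, x, x ** 2])
J_tilde = linear_cost_approximator(features, r=np.array([0.5, -1.0, 2.0]))
print(J_tilde(1.5))   # 0.5 - 1.5 + 4.5 = 3.5
```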

Bertsekas (M.I.T.) Reinforcement Learning 20 / 33


Training with Fitted Value Iteration

This is just DP with intermediate approximation at each step

Start with JN = gN and sequentially train going backwards, until k = 0

Given J̃_{k+1}, we construct a number of samples (x_k^s, β_k^s), s = 1, . . . , q, where

β_k^s = min_u E{ g_k(x_k^s, u, w_k) + J̃_{k+1}(f_k(x_k^s, u, w_k), r_{k+1}) },   s = 1, . . . , q

We "train" J̃_k on the set of samples (x_k^s, β_k^s), s = 1, . . . , q

Training by least squares/regression: we minimize over r_k

Σ_{s=1}^{q} ( J̃_k(x_k^s, r_k) − β_k^s )^2 + γ ‖r_k − r̄‖^2

where r̄ is an initial guess for r_k and γ > 0 is a regularization parameter
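A minimal sketch of this training step for the special case of a linear architecture J̃_k(x, r_k) = φ(x)′r_k, where the regularized least-squares problem has a closed-form solution; the feature map, the sample states, and the targets β are assumed given, and all names are illustrative:

```python
import numpy as np

def fit_stage(phi, sample_states, betas, r_guess, gamma=1e-3):
    """Solve  min_r  sum_s (phi(x^s)' r - beta^s)^2 + gamma * ||r - r_guess||^2
    in closed form for a linear architecture; returns the new weight vector r_k."""
    Phi = np.array([phi(x) for x in sample_states])     # q x s feature matrix
    b = np.asarray(betas, dtype=float)
    r_guess = np.asarray(r_guess, dtype=float)
    s = Phi.shape[1]
    A = Phi.T @ Phi + gamma * np.eye(s)
    rhs = Phi.T @ b + gamma * r_guess
    return np.linalg.solve(A, rhs)
```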

Bertsekas (M.I.T.) Reinforcement Learning 21 / 33


Neural Networks for Constructing Cost-to-Go Approximations Jk

Major fact about neural networks: they automatically construct features to be used in a linear architecture

Neural nets are approximation architectures of the form

J̃(x, v, r) = Σ_{i=1}^{m} r_i φ_i(x, v) = r′φ(x, v)

involving two parameter vectors r and v with different roles

View φ(x , v) as a feature vector

View r as a vector of linear weights for φ(x , v)

By training v jointly with r , we obtain automatically generated features!

Neural nets can be used in the fitted value iteration scheme

Train the stage k neural net (i.e., compute J̃_k) using a training set generated with the stage k + 1 neural net (which defines J̃_{k+1})

Bertsekas (M.I.T.) Reinforcement Learning 23 / 33


Neural Network with a Single Nonlinear Layer

[Figure: x → State Encoding y(x) → Linear Layer Ay(x) + b (parameter v = (A, b)) → Sigmoidal Layer producing φ_1(x, v), . . . , φ_m(x, v) → Linear Weighting r → Cost Approximation r′φ(x, v).]

State encoding (could be the identity, could include special features of the state)

Linear layer Ay(x) + b [parameters to be determined: v = (A, b)]

Nonlinear layer produces m outputs φ_i(x, v) = σ((Ay(x) + b)_i), i = 1, . . . , m

σ is a scalar nonlinear differentiable function; several types have been used (hyperbolic tangent, logistic, rectified linear unit)

Training problem is to use the training set (x^s, β^s), s = 1, . . . , q, for

min_{v, r} Σ_{s=1}^{q} ( Σ_{i=1}^{m} r_i φ_i(x^s, v) − β^s )^2 + (Regularization Term)

Solved often with incremental gradient methods (known as backpropagation)

Universal approximation theorem: With a sufficiently large number of parameters, "arbitrarily" complex functions can be closely approximated
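A minimal sketch of this architecture and its training, assuming a logistic σ, a batch (rather than incremental) gradient method, and made-up shapes; it only illustrates the structure described above, not the implementation behind the slides:

```python
import numpy as np

def sigma(z):
    """Logistic nonlinearity (one of the choices mentioned above)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_single_layer(Y, beta, m=16, steps=2000, lr=0.05, reg=1e-4, seed=0):
    """Fit r' phi(x, v) with phi_i(x, v) = sigma((A y(x) + b)_i) by batch gradient
    descent on the squared error plus a quadratic regularization term.
    Y: q x d array of encoded states y(x^s); beta: length-q array of targets."""
    rng = np.random.default_rng(seed)
    q, d = Y.shape
    A = 0.1 * rng.standard_normal((m, d))
    b = np.zeros(m)
    r = np.zeros(m)
    for _ in range(steps):
        Phi = sigma(Y @ A.T + b)                        # q x m feature matrix
        err = Phi @ r - beta                            # prediction errors
        g_r = 2 * Phi.T @ err + 2 * reg * r             # gradient w.r.t. linear weights r
        g_Z = 2 * np.outer(err, r) * Phi * (1 - Phi)    # backpropagate through sigma
        g_A = g_Z.T @ Y + 2 * reg * A
        g_b = g_Z.sum(axis=0) + 2 * reg * b
        A -= lr * g_A / q
        b -= lr * g_b / q
        r -= lr * g_r / q
    return A, b, r
```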

Bertsekas (M.I.T.) Reinforcement Learning 24 / 33


Deep Neural Networks

[Figure: deep network — a chain of alternating linear layers Ay + b and nonlinear (sigmoidal) layers, ending in a linear weighting r of the final features; the same labels as in the single-layer figure, repeated per layer.]

More complex NNs are formed by concatenation of multiple layers

The outputs of each nonlinear layer become the inputs of the next linear layer

A hierarchy of features

Considerable success has been achieved in major contexts

Possible reasons for the success

With more complex features, the number of parameters in the linear layers may be drastically decreased

We may use matrices A with a special structure that encodes special linear operations such as convolution
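As a purely illustrative sketch (not taken from the talk), the following NumPy code strings together several linear-plus-sigmoidal layers and a final linear weighting r, so that the output of each nonlinear layer feeds the next linear layer and the last layer's features are combined into the cost approximation r'φ(x, v); the class name, layer sizes, and initialization are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MultilayerCostApproximator:
    """Concatenation of (linear layer -> sigmoidal layer) blocks, followed by a
    final linear weighting r of the last layer's features (illustrative sketch)."""

    def __init__(self, layer_sizes, rng=np.random.default_rng(0)):
        # layer_sizes = [dim of encoding y(x), hidden sizes ..., final feature dim m]
        self.params = []  # one (A, b) pair per linear layer
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            A = 0.1 * rng.standard_normal((n_out, n_in))
            b = np.zeros(n_out)
            self.params.append((A, b))
        self.r = np.zeros(layer_sizes[-1])  # final linear weighting of features

    def features(self, y):
        """phi(x, v): each nonlinear layer's output is the next linear layer's input."""
        h = np.asarray(y, dtype=float)
        for A, b in self.params:
            h = sigmoid(A @ h + b)
        return h

    def cost(self, y):
        """Cost approximation r' phi(x, v)."""
        return float(self.r @ self.features(y))

# Hypothetical example: 4-dimensional encoding, two hidden layers, 8 final features
approximator = MultilayerCostApproximator([4, 16, 16, 8])
print(approximator.cost(np.array([1.0, 0.5, -0.3, 2.0])))
```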

Bertsekas (M.I.T.) Reinforcement Learning 25 / 33


Q-Factors - Model-Free RL

The Q-factor of a state-control pair (x_k, u_k) at time k is defined by

Q_k(x_k, u_k) = E{ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) }

where J_{k+1} is the optimal cost-to-go function for stage k + 1

Note that

J_k(x_k) = min_{u ∈ U_k(x_k)} Q_k(x_k, u)

so the DP algorithm can be written in terms of Q-factors:

Q_k(x_k, u_k) = E{ g_k(x_k, u_k, w_k) + min_{u ∈ U_{k+1}(x_{k+1})} Q_{k+1}(x_{k+1}, u) }

We can approximate Q-factors instead of costs
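To make the recursion concrete, here is a small illustrative sketch (not from the slides) of exact backward DP over Q-factors for a finite-state, finite-control problem with known transition probabilities p_ij(u) and stage costs g(i, u, j); the arrays and horizon in the example are made up.

```python
import numpy as np

def q_factor_dp(P, G, N, terminal_cost):
    """Exact backward DP over Q-factors for a finite problem (illustrative sketch).

    P[u][i, j]       : transition probability p_ij(u)
    G[u][i, j]       : stage cost g(i, u, j)
    N                : horizon
    terminal_cost[i] : terminal cost J_N(i)
    Returns the Q-factor tables Q[k][i, u], k = 0, ..., N-1.
    """
    num_controls = len(P)
    num_states = P[0].shape[0]
    J = np.asarray(terminal_cost, dtype=float)   # holds J_{k+1}, initialized to J_N
    Q = [None] * N
    for k in reversed(range(N)):
        Qk = np.empty((num_states, num_controls))
        for u in range(num_controls):
            # Q_k(i, u) = sum_j p_ij(u) * ( g(i, u, j) + J_{k+1}(j) )
            Qk[:, u] = (P[u] * (G[u] + J[None, :])).sum(axis=1)
        Q[k] = Qk
        J = Qk.min(axis=1)                       # J_k(i) = min_u Q_k(i, u)
    return Q

# Toy example: 2 states, 2 controls, horizon 3 (all numbers are arbitrary)
P = [np.array([[0.9, 0.1], [0.4, 0.6]]), np.array([[0.2, 0.8], [0.7, 0.3]])]
G = [np.array([[1.0, 4.0], [2.0, 3.0]]), np.array([[2.0, 1.0], [0.5, 5.0]])]
print(q_factor_dp(P, G, N=3, terminal_cost=[0.0, 0.0])[0])
```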

Bertsekas (M.I.T.) Reinforcement Learning 27 / 33


Fitted Value Iteration for Q-Factors: Model-Free Approximate DP

Consider fitted value iteration with Q-factor parametric approximations

Q̃_k(x_k, u_k, r_k) ≈ E{ g_k(x_k, u_k, w_k) + min_{u ∈ U_{k+1}(x_{k+1})} Q̃_{k+1}(x_{k+1}, u, r_{k+1}) }

(Note a bit of mathematical magic: the order of E{·} and min has been reversed.)

We obtain Q̃_k(x_k, u_k, r_k) by training with many pairs ((x_k^s, u_k^s), β_k^s), where β_k^s is a sample of the approximate Q-factor of (x_k^s, u_k^s). No need to compute E{·}

No need for a model to obtain β_k^s. It is sufficient to have a simulator that generates random samples of state-control-cost-next state, ((x_k, u_k), (g_k(x_k, u_k, w_k), x_{k+1}))

Having computed r_k, the one-step lookahead control is obtained on-line as

μ̃_k(x_k) = arg min_{u ∈ U_k(x_k)} Q̃_k(x_k, u, r_k)

without the need of a model or expected value calculations

Also the on-line calculation of the control is simplified
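Below is a minimal sketch, under stated assumptions, of one backward pass of fitted value iteration for Q-factors with a linear feature-based approximation Q̃_k(x, u, r_k) = r_k'φ(x, u), trained by least squares on simulator samples; the simulator interface (sample_state, simulate), the feature map φ, the control-set function, and the assumption of zero terminal cost are all hypothetical choices for illustration.

```python
import numpy as np

def fitted_q_iteration(sample_state, simulate, phi, controls, N, d, num_samples,
                       rng=np.random.default_rng(0)):
    """One backward pass of fitted VI for Q-factors (illustrative sketch only).

    sample_state(k)     -> a sampled state x_k^s
    simulate(k, x, u)   -> (cost_sample, next_state): one random transition
    phi(x, u)           -> length-d feature vector of the state-control pair
    controls(k, x)      -> list of admissible controls U_k(x)
    Returns the weight vectors r_0, ..., r_{N-1} of Q_tilde_k(x, u, r_k) = r_k' phi(x, u).
    """
    r = [np.zeros(d) for _ in range(N)]
    for k in reversed(range(N)):
        Phi, beta = [], []
        for _ in range(num_samples):
            x = sample_state(k)                       # sample a state x_k^s
            U = controls(k, x)
            u = U[rng.integers(len(U))]               # sample a control u_k^s
            g, x_next = simulate(k, x, u)             # one transition sample; no E{.} computed
            if k == N - 1:
                target = g                            # zero terminal cost assumed here
            else:
                target = g + min(float(r[k + 1] @ phi(x_next, v))
                                 for v in controls(k + 1, x_next))
            Phi.append(phi(x, u))                     # regression pair ((x_k^s, u_k^s), beta_k^s)
            beta.append(target)
        # least-squares fit of r_k to the sampled targets beta_k^s
        r[k], *_ = np.linalg.lstsq(np.array(Phi), np.array(beta), rcond=None)
    return r

def one_step_lookahead_control(k, x, r, phi, controls):
    """On-line control: mu_tilde_k(x) = arg min_u Q_tilde_k(x, u, r_k); no model needed."""
    return min(controls(k, x), key=lambda u: float(r[k] @ phi(x, u)))
```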

Bertsekas (M.I.T.) Reinforcement Learning 28 / 33


A Few Remarks on Infinite Horizon Problems

[Figure: the approximate policy iteration cycle — guess an initial policy; evaluate the approximate cost J̃µ(r) = Φr of the current policy µ using simulation; generate an "improved" policy µ̄ by the one-step lookahead minimization µ̄(i) ∈ arg min_{u∈U(i)} Σ_j p_ij(u)( g(i, u, j) + α J̃(j, r) ); repeat.]

Most popular setting: Stationary finite-state system, stationary policies, discounting or termination state

The policy iteration (PI) method generates a sequence of policies:

    The current policy µ is evaluated using a parametric architecture: J̃_µ(x, r)

    An "improved" policy µ̄ is obtained by one-step lookahead using J̃_µ(x, r)

The architecture is trained using simulation data with µ

Thus the system "observes itself" under µ and uses the data to "learn" the improved policy µ̄ - "self-learning"

Exact PI converges to an optimal policy; approximate PI "converges" to within an "error zone" of the optimal, then oscillates

TD-Gammon, AlphaGo, and AlphaZero all use forms of approximate PI for training
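As a rough illustration only (not the scheme used by TD-Gammon, AlphaGo, or AlphaZero), the following sketch runs a few cycles of simulation-based approximate PI with a linear feature architecture J̃µ(i, r) = φ(i)'r: the current policy is evaluated by regressing truncated discounted cost samples onto features, and an improved policy is obtained by one-step lookahead; the model arrays, feature matrix, and sample sizes are all assumed for the example.

```python
import numpy as np

def approximate_policy_iteration(P, G, phi, alpha, num_cycles, num_samples,
                                 rng=np.random.default_rng(0)):
    """Simulation-based approximate PI with J_tilde_mu(i, r) = phi(i)' r (illustrative sketch).

    P[u][i, j] : transition probabilities p_ij(u)
    G[u][i, j] : stage costs g(i, u, j)
    phi        : (n, d) feature matrix, one row per state
    alpha      : discount factor
    """
    n, _ = phi.shape
    num_controls = len(P)
    mu = np.zeros(n, dtype=int)                      # guessed initial policy

    for _ in range(num_cycles):
        # --- Approximate policy evaluation: regress sampled discounted costs onto features
        states, returns = [], []
        for _ in range(num_samples):
            i = rng.integers(n)
            total, disc, j = 0.0, 1.0, i
            for _ in range(200):                     # truncated discounted cost sample under mu
                u = mu[j]
                j_next = rng.choice(n, p=P[u][j])
                total += disc * G[u][j, j_next]
                disc *= alpha
                j = j_next
            states.append(i)
            returns.append(total)
        r, *_ = np.linalg.lstsq(phi[states], np.array(returns), rcond=None)

        # --- Policy improvement: one-step lookahead using J_tilde(j, r) = phi(j)' r
        J_tilde = phi @ r
        Q = np.stack([(P[u] * (G[u] + alpha * J_tilde[None, :])).sum(axis=1)
                      for u in range(num_controls)], axis=1)
        mu = Q.argmin(axis=1)                        # the "improved" policy mu_bar
    return mu, r
```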

Bertsekas (M.I.T.) Reinforcement Learning 30 / 33


A Few Topics we did not Cover in this Talk

Infinite horizon extensions: Approximate value and policy iteration methods, error bounds, model-based and model-free methods

Temporal difference methods: A class of methods for policy evaluation in infinite horizon problems with a rich theory, issues of variance-bias tradeoff

Sampling for exploration, in the context of policy iteration

Monte Carlo tree search, and related methods

Aggregation methods, synergism with other approximate DP methods

Approximation in policy space, actor-critic methods, policy gradient and cross-entropy methods

Special aspects of imperfect state information problems, connections with traditional control schemes

Infinite spaces optimal control, connections with aggregation schemes

Special aspects of deterministic problems: Shortest paths and their use in approximate DP

A broad view of using simulation for large-scale computations: Methods for large systems of equations and linear programs, connection to proximal algorithms

Bertsekas (M.I.T.) Reinforcement Learning 31 / 33


Concluding Remarks

Some words of caution

There are challenging implementation issues in all approaches, and no fool-proof methods

Problem approximation and feature selection require domain-specific knowledge

Training algorithms are not as reliable as you might think by reading the literature

Approximate PI involves oscillations

Recognizing success or failure can be a challenge!

The RL successes in game contexts are spectacular, but they have benefited from perfectly known and stable models and a small number of controls (per state)

Problems with partial state observation remain a big challenge

On the positive side

Massive computational power, together with distributed computation, is a source of hope

Silver lining: We can begin to address practical problems of unimaginable difficulty!

There is an exciting journey ahead!

Bertsekas (M.I.T.) Reinforcement Learning 32 / 33


Thank you!

Bertsekas (M.I.T.) Reinforcement Learning 33 / 33