Learning to Trade via Direct Reinforcement
DESCRIPTION
Learning to Trade via Direct Reinforcement. John Moody, International Computer Science Institute, Berkeley & J E Moody & Company LLC, Portland. [email protected] [email protected] Global Derivatives Trading & Risk Management, Paris, May 2008.
TRANSCRIPT
![Page 1: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/1.jpg)
Learning to Trade via Direct Reinforcement
John Moody
International Computer Science Institute, Berkeley
&
J E Moody & Company LLC, Portland
[email protected]@JEMoody.Com
Global Derivatives Trading & Risk Management
Paris, May 2008
![Page 2: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/2.jpg)
What is Reinforcement Learning?
RL Considers:
• A Goal-Directed “Learning” Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility
RL is an Active Paradigm:
• Agent “Learns” by “Trial & Error” Discovery
• Actions result in Reinforcement
RL Paradigms:
• Value Function Learning (Dynamic Programming)
• Direct Reinforcement (Adaptive Control)
![Page 3: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/3.jpg)
I. Why Direct Reinforcement?
Direct Reinforcement Learning:
• Finds predictive structure in financial data
• Integrates Forecasting w/ Decision Making
• Balances Risk vs. Reward
• Incorporates Transaction Costs
Discover Trading Strategies!
![Page 4: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/4.jpg)
Optimizing Trades based on Forecasts
Indirect Approach:
• Two sets of parameters
• Forecast error is not Utility
• Forecaster ignores transaction costs
• Information bottleneck
![Page 5: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/5.jpg)
Learning to Trade via Direct Reinforcement
Trader Properties:
• One set of parameters
• A single utility function U
• U includes transaction costs
• Direct mapping from inputs to actions
![Page 6: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/6.jpg)
Direct RL Trader (USD/GBP): Return_A = 15%, SR_A = 2.3, DDR_A = 3.3
![Page 7: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/7.jpg)
II. Direct Reinforcement: Algorithms & Illustrations
Algorithms:
• Recurrent Reinforcement Learning (RRL)
• Stochastic Direct Reinforcement (SDR)
Illustrations:
• Sensitivity to Transaction Costs
• Risk-Averse Reinforcement
![Page 8: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/8.jpg)
Learning to Trade via Direct Reinforcement
DR Trader:
• Recurrent policy (Trading signals, Portfolio weights): $F_t = F(\theta_t; F_{t-1}, I_t)$
• Takes action, Receives reward (Trading Return w/ Transaction Costs): $R_t(F_{t-1}, F_t; S_t)$
• Causal performance function (Generally path-dependent): $U_t = U(R_t, R_{t-1}, \ldots, R_1)$
• Learn policy by varying $\theta$
GOAL: Maximize performance $U_T$ or marginal performance $D_t \propto \Delta U_t = U_t - U_{t-1}$
![Page 9: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/9.jpg)
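Recurrent Reinforcement Learning (RRL) (Moody & Wu 1997)
Deterministic gradient (batch):
$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta} \right\}$$
with recursion:
$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{d\theta}$$
Stochastic gradient (on-line):
$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
stochastic recursion:
$$\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}$$
Stochastic parameter update (on-line):
$$\Delta\theta_t = \rho\, \frac{dU_t(\theta_t)}{d\theta_t}$$
Constant $\rho$: adaptive learning. Declining $\rho$: stochastic approx.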
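The on-line RRL recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the trader is a single tanh unit, the utility is plain cumulative profit (so $dU_t/dR_t = 1$), the reward is $R_t = F_{t-1} r_t - \delta|F_t - F_{t-1}|$, and the function name `rrl_online` and the default values of `n_lags`, `delta` and `rho` are hypothetical.

```python
import math

def rrl_online(returns, n_lags=2, delta=0.001, rho=0.1):
    """One on-line pass of a toy RRL trader (illustrative assumptions only).

    Returns the final parameter list and the realized per-period returns R_t.
    """
    n_params = n_lags + 2                # lag weights, recurrent weight, bias
    theta = [0.0] * n_params
    f_prev = 0.0                         # F_{t-1}
    df_prev = [0.0] * n_params           # dF_{t-1}/dtheta
    realized = []
    for t in range(n_lags, len(returns)):
        # Inputs: the last n_lags returns, the previous position, a bias term.
        x = list(returns[t - n_lags + 1:t + 1]) + [f_prev, 1.0]
        f = math.tanh(sum(w * xi for w, xi in zip(theta, x)))
        # Stochastic recursion: dF_t/dtheta = dF_t/dtheta (partial)
        #                                   + dF_t/dF_{t-1} * dF_{t-1}/dtheta
        g = 1.0 - f * f                  # derivative of tanh
        u = theta[n_lags]                # weight on F_{t-1}
        df = [g * (xi + u * dp) for xi, dp in zip(x, df_prev)]
        # Reward and its derivatives with respect to F_t and F_{t-1}.
        r_t = returns[t]
        s = (f > f_prev) - (f < f_prev)  # sign(F_t - F_{t-1})
        reward = f_prev * r_t - delta * abs(f - f_prev)
        d_reward_df = -delta * s
        d_reward_dfprev = r_t + delta * s
        # Assumption: U is cumulative profit, hence dU_t/dR_t = 1.
        grad = [d_reward_df * a + d_reward_dfprev * b
                for a, b in zip(df, df_prev)]
        theta = [w + rho * gi for w, gi in zip(theta, grad)]
        realized.append(reward)
        f_prev, df_prev = f, df
    return theta, realized

theta, realized = rrl_online([0.01] * 50)
```

On a steadily rising series the bias weight drifts positive, so the toy trader leans long. Swapping in a risk-adjusted utility only changes the factor $dU_t/dR_t$.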
![Page 10: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/10.jpg)
Structure of Traders
• Single Asset
- Price series $z_t$
- Return series $r_t = z_t - z_{t-1}$
• Traders
- Discrete position size $F_t \in \{-1, 0, 1\}$
- Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Observations: $I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$
– Full system State is not known
• Simple Trading Returns and Profit:
$$R_t = F_{t-1} r_t - \delta\,|F_t - F_{t-1}|, \qquad P_T = \sum_{t=1}^{T} R_t$$
• Transaction Costs: represented by $\delta$.
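As a worked example of the return and profit definitions, the sketch below evaluates $R_t = F_{t-1} r_t - \delta|F_t - F_{t-1}|$ and $P_T = \sum_t R_t$ for a short, made-up price path. The function name `trading_returns` and all numbers are illustrative assumptions.

```python
def trading_returns(prices, positions, delta):
    """Per-period returns R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|.

    prices:    z_0, ..., z_T
    positions: F_0, ..., F_T, each in {-1, 0, 1}
    delta:     transaction cost per unit change of position
    """
    if len(prices) != len(positions):
        raise ValueError("prices and positions must have equal length")
    returns = []
    for t in range(1, len(prices)):
        r_t = prices[t] - prices[t - 1]               # r_t = z_t - z_{t-1}
        cost = delta * abs(positions[t] - positions[t - 1])
        returns.append(positions[t - 1] * r_t - cost)
    return returns

prices = [100.0, 101.0, 100.5, 102.0]     # hypothetical price series
positions = [0, 1, 1, -1]                 # neutral, long, long, short
R = trading_returns(prices, positions, delta=0.1)
profit = sum(R)                           # P_T
```

Reversing from long to short changes the position by two units, so that step pays twice the cost of opening a position from neutral.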
![Page 11: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/11.jpg)
Risk-Averse Reinforcement: Financial Performance Measures
Performance Functions:
• Path independent: $U_t = U(W_t)$ (Standard Utility Functions)
• Path dependent: $U_t = U(R_t, R_{t-1}, \ldots, W_0)$
Performance Ratios:
• Sharpe Ratio: $\dfrac{\text{Average}(R_t)}{\text{Standard Deviation}(R_t)}$
• Downside Deviation Ratio: $\dfrac{\text{Average}(R_t)}{\text{Downside Deviation}(R_t)}$
For Learning:
• Per-Period Returns: $R_t$
• Marginal Performance: $D_t \propto \Delta U_t = U_t - U_{t-1}$, e.g. Differential Sharpe Ratio.
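The two performance ratios can be computed directly from a list of per-period returns. The sketch below is illustrative only: it assumes a population standard deviation (dividing by the number of periods) and takes the downside deviation to be the root mean square of the negative returns, a common convention that the slide itself does not spell out. The function names are hypothetical.

```python
import math

def sharpe_ratio(returns):
    """Average(R_t) / Standard Deviation(R_t), population convention."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    if var == 0.0:
        raise ValueError("standard deviation is zero")
    return mean / math.sqrt(var)

def downside_deviation_ratio(returns):
    """Average(R_t) / Downside Deviation(R_t).

    Assumption: downside deviation = sqrt(mean(min(R_t, 0)^2)).
    """
    mean = sum(returns) / len(returns)
    dd_sq = sum(min(r, 0.0) ** 2 for r in returns) / len(returns)
    if dd_sq == 0.0:
        raise ValueError("no negative returns, downside deviation is zero")
    return mean / math.sqrt(dd_sq)

rets = [0.02, -0.01, 0.03, -0.02]         # hypothetical returns
sr = sharpe_ratio(rets)
ddr = downside_deviation_ratio(rets)
```

Because only losses enter its denominator, the downside deviation ratio does not penalize upside volatility the way the Sharpe ratio does.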
![Page 12: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/12.jpg)
Long / Short Trader Simulation: Sensitivity to Transaction Costs
• Learns from scratch and on-line
• Moving average Sharpe Ratio with $\eta$ = 0.01
![Page 13: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/13.jpg)
Trader Simulation: Transaction Costs vs. Performance
100 Runs; Costs = 0.2%, 0.5%, and 1.0%
[Figure panels: Sharpe Ratio, Trading Frequency]
![Page 14: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/14.jpg)
Minimizing Downside Risk: Artificial Price Series w/ Heavy Tails
![Page 15: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/15.jpg)
Comparison of Risk-Averse Traders: Underwater Curves
![Page 16: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/16.jpg)
Comparison of Risk-Averse Traders: Draw-Downs
![Page 17: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/17.jpg)
III. Direct Reinforcement vs. Dynamic Programming
Algorithms:
• Value Function Method (Q-Learning)
• Direct Reinforcement Learning (RRL)
Illustration:
• Asset Allocation: S&P 500 & T-Bills
• RRL vs. Q-Learning
![Page 18: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/18.jpg)
RL Paradigms Compared
Value Function Learning (“Indirect”)
• Origins: Dynamic Programming
• Learn “optimal” Q-Function, Q: state × action → value
• Solve Bellman’s Equation
• Action: $a = \arg\max_b \hat{Q}(x, b, \theta)$
Direct Reinforcement (“Direct”)
• Origins: Adaptive Control
• Learn “good” Policy P, P: observations → p(action)
• Optimize “Policy Gradient”
• Action: $a = \hat{P}(\text{obs}, \theta)$
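The difference in how the two paradigms choose an action can be shown with a toy example. In this sketch the Q-values and policy probabilities are made-up numbers and the function names are hypothetical. The value-function learner takes the argmax over estimated action values, while the direct learner draws its action from the policy's own output.

```python
import random

def value_based_action(q_values):
    """Indirect: pick the action with the largest estimated Q-value."""
    return max(q_values, key=q_values.get)

def direct_action(policy_probs, rng):
    """Direct: sample an action from the policy's action probabilities."""
    actions = list(policy_probs)
    weights = [policy_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# Hypothetical estimates for one state / observation.
q_hat = {"short": -0.2, "neutral": 0.0, "long": 0.3}
p_hat = {"short": 0.1, "neutral": 0.2, "long": 0.7}

rng = random.Random(0)
a_value = value_based_action(q_hat)       # argmax over Q-values
a_direct = direct_action(p_hat, rng)      # draw from the policy
```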
![Page 19: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/19.jpg)
S&P-500 / T-Bill Asset Allocation: Maximizing the Differential Sharpe Ratio
![Page 20: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/20.jpg)
S&P-500: Opening Up the Black Box
85 series: Learned relationships are nonstationary over time
![Page 21: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/21.jpg)
Closing Remarks
• Direct Reinforcement Learning:
– Discovers Trading Opportunities in Markets
– Integrates Forecasting w/ Trading
– Maximizes Risk-Adjusted Returns
– Optimizes Trading w/ Transaction Costs
• Direct Reinforcement Offers Advantages Over:
– Trading based on Forecasts (Supervised Learning)
– Dynamic Programming RL (Value Function Methods)
• Illustrations:
– Controlled Simulations
– FX Currency Trader
– Asset Allocation: S&P 500 vs. Cash
![Page 22: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/22.jpg)
Selected References:
[1] John Moody and Lizhong Wu. Optimization of trading systems and portfolios. Decision Technologies for Financial Engineering, 1997.
[2] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17:441-470, 1998.
[3] Jonathan Baxter and Peter L. Bartlett. Direct gradient-based reinforcement learning: Gradient estimation algorithms. 2001.
[4] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875-889, July 2001.
[5] Carl Gold. FX Trading via Recurrent Reinforcement Learning. Proceedings of IEEE CIFEr Conference, Hong Kong, 2003.
[6] John Moody, Y. Liu, M. Saffell and K.J. Youn. Stochastic Direct Reinforcement: Application to Simple Games with Recurrence. In Artificial Multiagent Learning, Sean Luke et al. eds, AAAI Press, 2004.
![Page 23: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/23.jpg)
Supplemental Slides
• Differential Sharpe Ratio
• Portfolio Optimization
• Stochastic Direct Reinforcement (SDR)
![Page 24: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/24.jpg)
Maximizing the Sharpe Ratio
Sharpe Ratio:
$$S_T = \frac{\text{Average}(R_t)}{\text{Standard Deviation}(R_t)}$$
Exponential Moving Average Sharpe Ratio:
$$S(t) = \frac{A_t}{K_\eta\,(B_t - A_t^2)^{1/2}}$$
with time scale $\eta^{-1}$ and
$$A_t = A_{t-1} + \eta\,(R_t - A_{t-1}), \qquad B_t = B_{t-1} + \eta\,(R_t^2 - B_{t-1}), \qquad K_\eta = \left(\frac{1 - \eta/2}{1 - \eta}\right)^{1/2}$$
Motivation:
• EMA Sharpe ratio emphasizes recent patterns;
• can be updated incrementally.
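The incremental update behind the exponential moving average Sharpe ratio, $A_t = A_{t-1} + \eta(R_t - A_{t-1})$, $B_t = B_{t-1} + \eta(R_t^2 - B_{t-1})$ and $S(t) = A_t / (K_\eta (B_t - A_t^2)^{1/2})$, fits in a short loop. The sketch below is illustrative. The zero initial values for $A$ and $B$ and the choice to report 0.0 while the variance estimate is not yet positive are assumptions, and `ema_sharpe` is a hypothetical name.

```python
import math

def ema_sharpe(returns, eta, a0=0.0, b0=0.0):
    """Exponential moving average Sharpe ratio S(t) for each period.

    a0, b0 are assumed starting values for the moving moments A and B.
    """
    a, b = a0, b0
    k = math.sqrt((1.0 - eta / 2.0) / (1.0 - eta))    # K_eta
    history = []
    for r in returns:
        a += eta * (r - a)              # A_t = A_{t-1} + eta (R_t - A_{t-1})
        b += eta * (r * r - b)          # B_t = B_{t-1} + eta (R_t^2 - B_{t-1})
        var = b - a * a
        # Assumption: report 0.0 until the variance estimate is positive.
        history.append(a / (k * math.sqrt(var)) if var > 0.0 else 0.0)
    return history

s = ema_sharpe([0.02, -0.01, 0.03, -0.02], eta=0.1)   # hypothetical returns
```

Each step needs only the previous $A$ and $B$, so the statistic can be maintained on-line without storing the return history.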
![Page 25: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/25.jpg)
Differential Sharpe Ratio for Adaptive Optimization
Expand $S(t)$ to first order in $\eta$:
$$S(t) \approx S(t-1) + \eta\,\frac{dS(t)}{d\eta}\Big|_{\eta=0} + O(\eta^2).$$
Define Differential Sharpe Ratio as:
$$D(t) \equiv \frac{dS(t)}{d\eta} = \frac{B_{t-1}\,\Delta A_t - \frac{1}{2} A_{t-1}\,\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
where
$$\Delta A_t = R_t - A_{t-1}, \qquad \Delta B_t = R_t^2 - B_{t-1}.$$
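A small numerical illustration of the differential Sharpe ratio $D(t) = (B_{t-1}\Delta A_t - \tfrac{1}{2}A_{t-1}\Delta B_t)/(B_{t-1} - A_{t-1}^2)^{3/2}$ with $\Delta A_t = R_t - A_{t-1}$ and $\Delta B_t = R_t^2 - B_{t-1}$ follows. The moment values are made up for the example, and `differential_sharpe` is a hypothetical name.

```python
def differential_sharpe(r, a_prev, b_prev):
    """Differential Sharpe ratio D(t) given R_t, A_{t-1} and B_{t-1}."""
    var_prev = b_prev - a_prev ** 2
    if var_prev <= 0.0:
        raise ValueError("B_{t-1} - A_{t-1}^2 must be positive")
    d_a = r - a_prev                    # Delta A_t
    d_b = r * r - b_prev                # Delta B_t
    return (b_prev * d_a - 0.5 * a_prev * d_b) / var_prev ** 1.5

# Hypothetical moments: mean 0.01, second moment 0.0005 (std 0.02).
d_same = differential_sharpe(0.01, a_prev=0.01, b_prev=0.0005)
d_loss = differential_sharpe(-0.03, a_prev=0.01, b_prev=0.0005)
```

A return equal to the running mean still gives a positive $D(t)$ here because it lowers the variance estimate, whereas the large loss gives a negative value.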
![Page 26: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/26.jpg)
Learning with the Differential SR
Evaluate “Marginal Utility” Gradient:
$$\frac{dD(t)}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
Motivation for DSR:
• isolates contribution of $R_t$ to $U_t$ (“marginal utility”);
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
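The marginal utility gradient $dD(t)/dR_t = (B_{t-1} - A_{t-1}R_t)/(B_{t-1} - A_{t-1}^2)^{3/2}$ can be checked against a central finite difference of $D(t)$. The sketch below does this for made-up moment values. The function names and the step size `h` are assumptions for illustration.

```python
def differential_sharpe(r, a_prev, b_prev):
    """D(t) = (B dA - 0.5 A dB) / (B - A^2)^(3/2), using t-1 moments."""
    d_a = r - a_prev
    d_b = r * r - b_prev
    return (b_prev * d_a - 0.5 * a_prev * d_b) / (b_prev - a_prev ** 2) ** 1.5

def dsr_gradient(r, a_prev, b_prev):
    """Analytic dD(t)/dR_t = (B - A R_t) / (B - A^2)^(3/2)."""
    return (b_prev - a_prev * r) / (b_prev - a_prev ** 2) ** 1.5

a_prev, b_prev, r = 0.01, 0.0005, 0.01    # hypothetical values
h = 1e-6                                  # finite-difference step
numeric = (differential_sharpe(r + h, a_prev, b_prev)
           - differential_sharpe(r - h, a_prev, b_prev)) / (2.0 * h)
analytic = dsr_gradient(r, a_prev, b_prev)
```

The gradient turns negative once $R_t$ exceeds $B_{t-1}/A_{t-1}$, which reflects that the Sharpe ratio penalizes large positive returns as well as losses.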
![Page 27: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/27.jpg)
Trader Simulation: Transaction Costs vs. Performance
100 runs; Costs = 0.2%, 0.5%, and 1.0%
[Figure panels: Trading Frequency, Cumulative Profit, Sharpe Ratio]
![Page 28: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/28.jpg)
Portfolio Optimization (3 Securities)
![Page 29: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/29.jpg)
Stochastic Direct Reinforcement: Probabilistic Policies
![Page 30: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/30.jpg)
Learning to Trade
• Single Asset
- Price series $z_t$
- Return series $r_t = z_t - z_{t-1}$
• Trader
- Discrete position size $a_t \in \{-1, 0, 1\}$
- Recurrent policy $a_t = P(\theta_t; a_{t-1}, I_t)$
• Observations: $I_t = \{r_t, r_{t-1}, r_{t-2}, \ldots; i_t, i_{t-1}, i_{t-2}, \ldots\}$
– Full system State is not known
• Simple Trading Returns and Profit:
$$R_t = a_{t-1} r_t - \delta\,|a_t - a_{t-1}|, \qquad P_T = \sum_{t=1}^{T} R_t$$
• Transaction cost rate $\delta$.
![Page 31: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/31.jpg)
Why does Reinforcement need Recurrence?
Consider a learning agent with stochastic policy function
$$P(a_t; \theta; o_{t-1}, o_{t-2}, \ldots, a_{t-1}, a_{t-2}, \ldots)$$
whose inputs include recent observations o and actions a.
Why should past actions (recurrence) be included?
Examples:
• Games (observations o are opponent’s actions): Model opponent’s responses o to previous actions a
• Trading financial markets: Minimize transaction costs, market impact
In General:
Recurrence enables discovery of better policies that capture an agent’s impact on the world!
![Page 32: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/32.jpg)
Stochastic Direct Reinforcement (SDR): Maximize Performance
Expected total performance of a sequence of T actions:
$$U_T = \sum_{t=1}^{T} \sum_{H_{t-1}} \sum_{a_t} u_t(a_t)\, p(a_t \mid H_{t-1})\, p(H_{t-1})$$
Maximize performance via direct gradient ascent:
$$\Delta\theta_t = \rho\, \frac{dU_t}{d\theta_t}$$
Must evaluate total policy gradient
$$\frac{d}{d\theta}\, p(a_t) = \frac{d}{d\theta} \sum_{H_{t-1}} p(a_t \mid H_{t-1})\, p(H_{t-1})$$
for a policy represented by $P(a_t; \theta; H_{t-1}^{(n,m)})$ with $H_{t-1}^{(n,m)} = (O_{t-1}^{(n)}, A_{t-1}^{(m)})$.
![Page 33: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/33.jpg)
Stochastic Direct Reinforcement (SDR): Maximize Performance
The goal of SDR is to maximize expected total performance of a sequence of T actions
$$U_T = \sum_{t=1}^{T} \sum_{H_{t-1}} \sum_{a_t} u_t(a_t)\, p(a_t \mid H_{t-1})\, p(H_{t-1})$$
via direct gradient ascent
$$\frac{dU_T}{d\theta} = \frac{d}{d\theta} \sum_{t=1}^{T} \sum_{H_{t-1}} \sum_{a_t} u_t(a_t)\, p(a_t \mid H_{t-1})\, p(H_{t-1})$$
Must evaluate
$$\frac{d}{d\theta}\, p(a_t) = \frac{d}{d\theta} \sum_{H_{t-1}} p(a_t \mid H_{t-1})\, p(H_{t-1})$$
for a policy represented by $P(a_t; \theta; O_{t-1}^{(n)}, A_{t-1}^{(m)})$.
Notation: The complete history is denoted $H_t = (O_t, A_t)$. $H_t^{(n,m)} = (O_t^{(n)}, A_t^{(m)})$ is a partial history of length (n,m).
![Page 34: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/34.jpg)
Stochastic Direct Reinforcement: First Order Recurrent Policy Gradient
For first order recurrence (m=1), conditional action probability is given by the policy:
$$p(a_t \mid a_{t-1}) = P(a_t; \theta; O_{t-1}^{(n)}, a_{t-1})$$
The probabilities of current actions depend upon the probabilities of prior actions:
$$p(a_t) = \sum_{a_{t-1}} p(a_t \mid a_{t-1})\, p(a_{t-1})$$
The total (recurrent) policy gradient is computed as:
$$\frac{dp(a_t)}{d\theta} = \sum_{a_{t-1}} \left\{ \frac{\partial p(a_t \mid a_{t-1})}{\partial\theta}\, p(a_{t-1}) + p(a_t \mid a_{t-1})\, \frac{dp(a_{t-1})}{d\theta} \right\}$$
with partial (naïve) policy gradient:
$$\frac{\partial}{\partial\theta}\, p(a_t \mid a_{t-1}) = \frac{\partial}{\partial\theta}\, P(a_t; \theta; O_{t-1}^{(n)}, a_{t-1})$$
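The first-order recursion for $p(a_t)$ and its total gradient can be checked numerically on a toy problem. The sketch below makes several assumptions that are not in the slides: two actions $\{0, 1\}$, no observations, and a logistic policy $p(a_t = 1 \mid a_{t-1}) = \sigma(\theta_0 + \theta_1 a_{t-1})$. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cond_prob(theta, a, a_prev):
    """p(a_t = a | a_{t-1} = a_prev) for the assumed logistic policy."""
    p1 = sigmoid(theta[0] + theta[1] * a_prev)
    return p1 if a == 1 else 1.0 - p1

def cond_grad(theta, a, a_prev):
    """Partial (naive) gradient of cond_prob with respect to theta."""
    p1 = sigmoid(theta[0] + theta[1] * a_prev)
    sign = 1.0 if a == 1 else -1.0
    common = sign * p1 * (1.0 - p1)
    return [common, common * a_prev]

def propagate(theta, p0, steps):
    """Return p(a_t) and the total gradient dp(a_t)/dtheta after `steps`."""
    p = list(p0)                              # p(a_0) for a in {0, 1}
    dp = [[0.0, 0.0], [0.0, 0.0]]             # p(a_0) does not depend on theta
    for _ in range(steps):
        new_p = [0.0, 0.0]
        new_dp = [[0.0, 0.0], [0.0, 0.0]]
        for a in (0, 1):
            for a_prev in (0, 1):
                c = cond_prob(theta, a, a_prev)
                g = cond_grad(theta, a, a_prev)
                new_p[a] += c * p[a_prev]
                for i in range(2):
                    # naive term + recurrent term through p(a_{t-1})
                    new_dp[a][i] += g[i] * p[a_prev] + c * dp[a_prev][i]
        p, dp = new_p, new_dp
    return p, dp

p, dp = propagate([0.3, -0.5], [0.5, 0.5], steps=3)
```

Dropping the `c * dp[a_prev][i]` term leaves only the naive gradient, which ignores how earlier action probabilities shift when the parameters change.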
![Page 35: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/35.jpg)
SDR Trader Simulation w/ Transaction Costs
![Page 36: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/36.jpg)
Trading Frequency vs. Transaction Costs
[Figure panels: Recurrent SDR, Non-Recurrent]
![Page 37: Learning to Trade via Direct Reinforcement](https://reader033.vdocuments.net/reader033/viewer/2022050802/56814325550346895daf8d9f/html5/thumbnails/37.jpg)
Sharpe Ratio vs. Transaction Costs
[Figure panels: Recurrent SDR, Non-Recurrent]