Dynamic Replication and Hedging: A Reinforcement Learning Approach
Petter Kolm
Courant Institute, NYU
petter.kolm@nyu.edu · https://www.linkedin.com/in/petterkolm

Published in Kolm and Ritter (2019), "Dynamic Replication and Hedging: A Reinforcement Learning Approach", The Journal of Financial Data Science, Winter 2019, 1 (1), pp. 159-171.

Matlab Computational Finance Conference, October 15, 2019
Outline
- Background & motivation: Machine learning in finance; Replication & hedging; What we do
- Reinforcement learning for hedging: Brief introduction to RL; Automatic hedging in theory; Automatic hedging in practice; Examples
- Conclusions
Background & motivation
Machine learning in finance I
Adoption of ML has been rapid in the financial industry. Why?
- The data
- The algos & tools
- The infrastructure for data storage & computation
Differences today versus three years ago:
- Many use cases are available now
- New technology is constantly being adopted: robo-advisors, blockchains, algo & systematic trading; many hedge funds use the Amazon S3 cloud, etc.
- Alternative data & its handling/preparation are better understood: there is a shift towards incorporating it into existing models; common issues are debiasing, tagging & paneling
Machine learning in finance II
Buyer beware:
- Testing models: in- versus out-of-sample. Where are your algos and data from?
- "Black box" risk on steroids: Is your data biased? Are your algos correct? Do you know what they do? Did you implement them yourself?
- Data science & ML are not cure-alls (who knew?)
What do we want in the financial industry?
- Financial domain expertise is often critical
- More business use cases
- Interpretable ML
Machine learning for trading I
While the role of trading has not changed, modern ML allows us to:
- Create more realistic models
- Build models that adapt to changing market conditions
- Solve the models faster
Machine learning for trading II
Examples of ML tools that are being leveraged for trading:
- LASSO-based techniques yield very fast, derivative-free solvers for single- and multi-period problems
- Bayesian learning and Gaussian processes are used for building forecasts (forecasting distributions) and dealing with estimation/model risk
- Unsupervised learning techniques for building statistical risk models
- Regime-switching models for dealing with different market environments (e.g. risk on/off)
- Mixed distributions to model skewed distributions and tail behavior
- Bootstrap and cross-validation techniques for backtesting
Machine learning for trading III
- Reinforcement learning (often tree- or NN-based models) for complex trading/planning problems in the presence of uncertainty (where the value function is not easily obtainable)
- NLP, LDA & extensions, and ICA for analysis of news, filings and reports
Replication & hedging
- Replicating and hedging an option position is fundamental in finance.
- The core idea of the seminal work by Black-Scholes-Merton (BSM): in a complete and frictionless market there is a continuously rebalanced dynamic trading strategy in the stock and riskless security that perfectly replicates the option (Black and Scholes (1973), Merton (1973)).
- In practice, continuous trading of arbitrarily small amounts of stock is infinitely costly, and the replicating portfolio is adjusted at discrete times. Perfect replication is impossible, and an optimal hedging strategy will depend on the desired trade-off between replication error and trading costs.
Related work I
While a number of articles have considered hedging in discrete time or transaction costs alone:
- Leland (1985) was first to address discrete hedging under transaction costs. His work was subsequently followed by others.¹ The majority of these studies treat proportionate transaction costs.
- More recently, several studies have considered option pricing and hedging subject to both permanent and temporary market impact in the spirit of Almgren and Chriss (1999), including Rogers and Singh (2010), Almgren and Li (2016), Bank, Soner and Voß (2017), and Saito and Takahashi (2017).
- Halperin (2017) applies reinforcement learning to options, but the approach is specific to the BSM model and does not consider transaction costs.
Related work II
- Buehler et al. (2018) evaluate NN-based hedging under coherent risk measures subject to proportional transaction costs.

¹ See, for example, Figlewski (1989), Boyle and Vorst (1992), Henrotte (1993), Grannan and Swindle (1996), Toft (1996), Whalley and Wilmott (1997), and Martellini (2000).
What we do
In the article we:
- Show how to build a reinforcement learning (RL) system which can learn how to optimally hedge an option (or other derivative securities) in a fully realistic setting: discrete time, nonlinear transaction costs, and round-lotting.
- The method allows the user to "plug in" any option pricing and simulation library and then train the system with no further modifications.
- The system learns how to optimally trade off trading costs versus hedging variance for that security. It uses a continuous state space and relies on nonlinear regression applied to the "sarsa targets" derived from the Bellman equation (see the sketch below).
- The method extends in a straightforward way to arbitrary portfolios of derivative securities.
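To make the "regress on sarsa targets" idea concrete, here is a minimal Python sketch of one round of fitted sarsa. Everything in it is our own illustrative choice (the function names, the transition-tuple layout, and the use of scikit-learn's RandomForestRegressor); it is not the authors' implementation.

```python
# One round of fitted sarsa: regress targets R + gamma * q_prev(s', a')
# on (state, action) features, assuming pre-collected transitions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_q_on_sarsa_targets(transitions, q_prev, gamma=0.99):
    """transitions: iterable of (s, a, r, s_next, a_next, done),
    with s a 1-D feature array and a a scalar action."""
    X, y = [], []
    for s, a, r, s_next, a_next, done in transitions:
        target = r if done else r + gamma * q_prev(s_next, a_next)
        X.append(np.concatenate([s, [a]]))
        y.append(target)
    model = RandomForestRegressor(n_estimators=100).fit(np.array(X), np.array(y))
    return lambda s, a: model.predict(np.concatenate([s, [a]])[None, :])[0]
```

Iterating this fit, while regenerating transitions under the current greedy policy, is the usual batch-RL training loop.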
Reinforcement learning for hedging
Brief introduction to RL I
- The RL agent interacts with its environment. The "environment" is the part of the system outside of the agent's direct control.
- At each time step t, the agent observes the current state of the environment s_t and chooses an action a_t from the action set.
- This choice influences both the transition to the next state and the reward R_t the agent receives.
[Diagram: the RL agent chooses an action; the environment returns the next state and a reward to the agent.]
Brief introduction to RL II
- A policy π is a way of choosing an action a_t conditional on the current state s_t.
- RL is the search for policies which maximize the expectation of the cumulative reward G_t:

  E[G_t] = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯],

  where γ is a discount factor (such that the infinite sum converges).
- Mathematically speaking, RL is a way to solve multi-period optimal control problems.
- Standard texts on RL include Sutton and Barto (2018) and Szepesvári (2010).
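As a small illustration of the discounted return, here is a Python sketch; the backward recursion G = R + γG is standard, but the function itself is ours.

```python
# Discounted cumulative reward G_t for a finite episode, computed by the
# backward recursion G = R + gamma * G; rewards[0] plays the role of R_{t+1}.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```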
Brief introduction to RL III
- The action-value function expresses the value of starting in state s, taking an arbitrary action a, and then following policy π thereafter:

  q_π(s, a) = E_π[G_t | S_t = s, A_t = a],   (1)

  where E_π denotes the expectation under the assumption that policy π is followed.
- If we knew the q-function corresponding to the optimal policy, q*, we would know the optimal policy itself: namely, choose the a in the action set that maximizes q*(s_t, a). This is called the greedy policy.
- Hence the problem is reduced to finding q*, or producing a sequence of iterates that converges to q*. Methods for producing those iterates are based on the Bellman equations.
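In code, the greedy policy is a one-liner over a finite action set; this sketch assumes q_star is a callable such as the regression model fitted earlier.

```python
# Greedy policy: pick the action maximizing q*(s, a) over a finite action set.
def greedy_action(q_star, state, action_set):
    return max(action_set, key=lambda a: q_star(state, a))
```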
A visual example
[Video shown in the talk: a YouTube example.]
Supervised learning vs RL
[Diagram: supervised learning maps inputs to outputs, with training signals equal to the target outputs from the training set; reinforcement learning maps states to actions, with training signals equal to the rewards.]
RL has a feedback loop
[Diagram: the reinforcement learner maps states to actions, receives rewards as training signals, and its actions feed back into the environment, which produces the next state.]
Is RL worth the trouble?
RL is often harder than supervised learning:
- Joint learning and planning/optimizing from correlated samples
- The distribution of the data changes with the choice of actions
- Need access to the environment for training
So when may it be a good idea to use RL?
- Data comes in the form of trajectories (i.e. non-IID sequences)
- We need to make a sequence of decisions (i.e. non-IID decisions)
- We are able to observe feedback to the state or choice of actions; this information can be partial and/or noisy
- There is a gain to be made by optimizing action choice over a portion of the trajectory (i.e. the time consistency needed for the Bellman equation to hold)
- Other ML techniques cannot easily deal with these situations
Common challenges when solving RL problems
- Specifying the model: representing the state; choosing the set of actions; designing the reward
- Acquiring data for training: exploration/exploitation; high-cost actions; time-delayed rewards
- Function approximation (random forests, CNNs)
- Validation; confidence measures
Automatic hedging in theory I
- We define automatic hedging to be the practice of using trained RL agents to handle hedging.
- With no trading frictions and where continuous trading is possible, there may be a dynamic replicating portfolio which hedges the option position perfectly, meaning that the overall portfolio (option minus replication) has zero variance.
- With frictions, and where only discrete trading is possible, the goal becomes to minimize variance and cost. We will use this to define the reward.
Automatic hedging in theory II
- This suggests we can seek the agent's optimal portfolio as the solution to a mean-variance optimization problem with risk aversion κ:

  max ( E[w_T] − (κ/2) V[w_T] ),   (2)

  where the final wealth w_T is the sum of individual wealth increments δw_t:

  w_T = w_0 + Σ_{t=1}^{T} δw_t.

- We will let wealth increments include trading costs.
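A minimal sketch of estimating objective (2) from simulated wealth increments; the array layout and function name are our own assumptions.

```python
# Mean-variance objective E[w_T] - kappa/2 * V[w_T], estimated by Monte Carlo.
import numpy as np

def mean_variance_objective(delta_w, w0=0.0, kappa=0.1):
    """delta_w: array of shape (n_paths, T) of per-period wealth increments,
    net of trading costs. Returns the estimated objective (2)."""
    w_T = w0 + delta_w.sum(axis=1)   # terminal wealth on each simulated path
    return w_T.mean() - 0.5 * kappa * w_T.var()
```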
Automatic hedging in theory III
- In the random-walk case, this leads to solving

  min over permissible strategies of Σ_{t=0}^{T} ( E[−δw_t] + (κ/2) V[δw_t] ),   (3)

  where −δw_t = c_t and c_t is the total trading cost paid in period t (including commissions, bid-offer spread cost, market impact cost, and other sources of slippage).
- With an appropriate choice of reward function, the problem of maximizing this mean-variance objective can be recast as an RL problem.
Automatic hedging in theory IV
- We choose the reward in each period to be²

  R_t = δw_t − (κ/2) (δw_t)².   (4)

- Thus, training reinforcement learners with this kind of reward function amounts to training automatic hedgers who trade off costs versus hedging variance.

² See Ritter (2017) for a general discussion of reward functions in trading.
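Reward (4) translates directly into code; this one-liner is a sketch, with κ matching the risk aversion above.

```python
# Per-period reward R_t = delta_w - kappa/2 * delta_w^2, where delta_w is
# the period's wealth increment net of trading costs.
def reward(delta_w, kappa=0.1):
    return delta_w - 0.5 * kappa * delta_w**2
```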
Automatic hedging in practice I
- Simplest possible example: a European call option with strike price K and expiry T on a non-dividend-paying stock.
- We take the strike and maturity as fixed, exogenously given constants. For simplicity, we assume the risk-free rate is zero.
- The agent we train will learn to hedge this specific option with this strike and maturity. It is not being trained to hedge any option with any possible strike/maturity.
- For European options, the state must minimally contain (1) the current price S_t of the underlying, (2) the time τ = T − t > 0 remaining to expiry, and (3) our current position of n shares.
- The state is thus naturally an element of³

  S = R²₊ × Z = {(S, τ, n) : S > 0, τ > 0, n ∈ Z}.
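A minimal representation of this state in Python; the NamedTuple and field names are our own illustrative choices.

```python
# State (S, tau, n): underlying price, time to expiry, and integer position.
from typing import NamedTuple

class HedgingState(NamedTuple):
    price: float   # current underlying price S > 0
    tau: float     # time remaining to expiry, tau = T - t > 0
    shares: int    # current hedge position n (an integer number of shares)

state = HedgingState(price=100.0, tau=10 / 252, shares=-52)
```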
Automatic hedging in practice II
- The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state. We expect the agent to learn such nonlinear functions on its own.
- A key point: this has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.

³ If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.
Let's put RL at a disadvantage
- The RL agent is at a disadvantage: it does not know any of the following information: the strike price K; that the stock price process is a geometric Brownian motion (GBM); the volatility of the price process; the BSM formula; the payoff function (S − K)⁺ at maturity; any of the Greeks.
- Thus it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.
Simulation assumptions I
- We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S_0 and daily lognormal volatility σ_day.
- We consider an initially at-the-money European call option (struck at K = S_0) with T days to maturity.
- We discretize time with D periods per day; hence each "episode" has T · D total periods.
- We require trades (hence also holdings) to be integer numbers of shares.
- We assume that our agent's job is to hedge one contract of this option.
- In the specific examples below, the parameters are σ = 0.01, S_0 = 100, T = 10, and D = 5. We set the risk aversion κ = 0.1.
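A sketch of simulating one such episode under these parameters; the zero-drift discretization is our own reading of the "discrete BSM world", not code from the talk.

```python
# Simulate a zero-drift GBM price path over T*D periods, with daily
# lognormal volatility sigma_day and D periods per day.
import numpy as np

def simulate_episode(s0=100.0, sigma_day=0.01, T=10, D=5, rng=None):
    rng = rng or np.random.default_rng()
    dt = 1.0 / D                                   # period length, in days
    z = rng.standard_normal(T * D)
    log_rets = -0.5 * sigma_day**2 * dt + sigma_day * np.sqrt(dt) * z
    return s0 * np.exp(np.concatenate([[0.0], np.cumsum(log_rets)]))

prices = simulate_episode()   # array of length T*D + 1
```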
Simulation assumptions II
- T-costs: for a trade size of n shares, we define

  cost(n) = multiplier × TickSize × (|n| + 0.01 n²),

  where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents the cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact.
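The same cost function as a Python sketch:

```python
# Transaction cost: spread-crossing term plus a simple quadratic impact term.
def cost(n, multiplier=1.0, tick_size=0.1):
    return multiplier * tick_size * (abs(n) + 0.01 * n**2)
```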
- All of the examples were trained on a single CPU, and the longest training time allowed was one hour.
- Baseline agent = RL agent trained in a frictionless world.
Example: Baseline agent (discrete & no t-costs)
[Figure 1: Time series over 50 periods (D · T) of delta-hedge shares, option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.] Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent's position tracks the delta.
Example: Baseline agent (discrete & t-costs)
[Figure 2: Time series over 50 periods (D · T) of cost P&L, delta-hedge shares, option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.] Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · Δ, rounded to shares.
Example: T-cost aware agent (discrete & t-costs)
[Figure: Time series over 50 periods (D · T) of cost P&L, delta-hedge shares, option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares, for the t-cost aware agent.]
Kernel density estimates of total cost & volatility
[Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations, for the "delta" and "reinf" methods.] The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.
Kernel density estimates of total P&L
[Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs, for both policies ("delta" and "reinf").] The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.
Conclusions I
- We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.
- The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.
- It accomplishes this without the user providing any of the following information: the strike price K; the fact that the stock price process is a geometric Brownian motion; the volatility of the price process; the Black-Scholes-Merton formula; the payoff function (S − K)⁺ at maturity; any of the Greeks.
Conclusions II
- A key strength of the RL approach: it does not make any assumptions about the form of the t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately.
References

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61-63.
Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.
Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215-239.
Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637-654.
Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271-293.
Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.
Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289-1311.
Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341-364.
Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.
Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Graduate School of Business, Stanford University.
Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159-171.
Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283-1301.
Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107-131.
Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141-183.
Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84-89.
Rogers, Leonard C. G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597-615.
Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154-165.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition. MIT Press, Cambridge.
Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233-263.
Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307-324.
2 40
Outline
Background amp motivationMachine learning in financeReplication amp hedgingWhat we do
Reinforcement learning for hedgingBrief introduction to RLAutomatic hedging in theoryAutomatic hedging in practiceExamples
Conclusions
2 40
Background amp motivation
3 40
Machine learning in finance IAdoption of ML has been rapid in the financial industry Why
The data The algos amp tools The infrastructure for data storage amp computation
Difference today versus 3 years ago Many use-cases are available now New technology constantly being adopted
Robo advisors block chains algo amp systematic trading Many hedge funds use the Amazon S3 cloud etc
Alternative data amp its handlingpreparation is betterunderstood A shift towards being incorporated into existing models Common issues Debiasing tagging amp paneling
4 40
Machine learning in finance IIBuyer beware Testing models In- versus out-of-sample Where are your algos and data from
ldquoBlack boxrdquo risk on steroids Is your data biased Are your algos correct Do you know what they do Did you
implement them yourself
Data science amp ML are not cure-alls (who knew)
What do we want in the financial indsutry Financial domain expertise often critical More business use cases Interpretable ML
5 40
Machine learning for trading I
While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster
6 40
Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers
for single- and multi-period problems Bayesian learning and Gaussian processes used for building
forecasts (forecasting distributions) and dealing withestimationmodel risk
Unsupervised learning techniques for building statistical riskmodels
Regime-switching models for dealing with different marketenvironments (eg risk onoff)
Mixed-distributions to model skewed distributions andtail-behavior
Bootstrap and cross-validation techniques for backtesting
7 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Outline
Background amp motivationMachine learning in financeReplication amp hedgingWhat we do
Reinforcement learning for hedgingBrief introduction to RLAutomatic hedging in theoryAutomatic hedging in practiceExamples
Conclusions
2 40
Background amp motivation
3 40
Machine learning in finance IAdoption of ML has been rapid in the financial industry Why
The data The algos amp tools The infrastructure for data storage amp computation
Difference today versus 3 years ago Many use-cases are available now New technology constantly being adopted
Robo advisors block chains algo amp systematic trading Many hedge funds use the Amazon S3 cloud etc
Alternative data amp its handlingpreparation is betterunderstood A shift towards being incorporated into existing models Common issues Debiasing tagging amp paneling
4 40
Machine learning in finance IIBuyer beware Testing models In- versus out-of-sample Where are your algos and data from
ldquoBlack boxrdquo risk on steroids Is your data biased Are your algos correct Do you know what they do Did you
implement them yourself
Data science amp ML are not cure-alls (who knew)
What do we want in the financial indsutry Financial domain expertise often critical More business use cases Interpretable ML
5 40
Machine learning for trading I
While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster
6 40
Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers
for single- and multi-period problems Bayesian learning and Gaussian processes used for building
forecasts (forecasting distributions) and dealing withestimationmodel risk
Unsupervised learning techniques for building statistical riskmodels
Regime-switching models for dealing with different marketenvironments (eg risk onoff)
Mixed-distributions to model skewed distributions andtail-behavior
Bootstrap and cross-validation techniques for backtesting
7 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
What we do
In the article we:
- Show how to build a reinforcement learning (RL) system that can learn how to optimally hedge an option (or other derivative security) in a fully realistic setting: discrete time, nonlinear transaction costs, and round-lotting.
- Present a method that allows the user to "plug in" any option pricing and simulation library and then train the system with no further modifications.
- The system learns how to optimally trade off trading costs against hedging variance for that security. It uses a continuous state space and relies on nonlinear regression of the "sarsa targets" derived from the Bellman equation (a sketch of this step follows below).
- The method extends in a straightforward way to arbitrary portfolios of derivative securities.
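To make the regression step concrete, here is a minimal sketch of one round of fitting a q-function to sarsa targets on a batch of stored transitions. The data layout and the choice of scikit-learn's RandomForestRegressor are our illustrative assumptions, not the exact implementation from the article:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_q_on_sarsa_targets(transitions, q_prev=None, gamma=0.99):
    """One round of fitted sarsa: regress targets R + gamma * q_prev(s', a').

    transitions: list of (s, a, r, s_next, a_next) tuples, where states and
                 actions are 1-D numpy arrays (e.g. s = [S, tau, n], a = [trade]).
    q_prev:      previous q estimate (callable on stacked state-action rows),
                 or None on the first round.
    """
    X = np.array([np.concatenate([s, a]) for s, a, r, s2, a2 in transitions])
    X_next = np.array([np.concatenate([s2, a2]) for s, a, r, s2, a2 in transitions])
    r = np.array([t[2] for t in transitions])
    targets = r if q_prev is None else r + gamma * q_prev(X_next)
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X, targets)
    return model.predict  # the new q estimate

Iterating this fit, with fresh simulated episodes between rounds, produces the sequence of q-iterates discussed in the RL introduction below.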
Reinforcement learning for hedging
Brief introduction to RL I
- An RL agent interacts with its environment. The "environment" is the part of the system outside of the agent's direct control.
- At each time step t, the agent observes the current state of the environment s_t and chooses an action a_t from the action set.
- This choice influences both the transition to the next state and the reward R_t the agent receives.
[Diagram: the RL agent sends an action to the environment; the environment returns the new state and a reward.]
Brief introduction to RL II
- A policy π is a way of choosing an action a_t conditional on the current state s_t.
- RL is the search for policies which maximize the expectation of the cumulative reward G_t:

  E[G_t] = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ···]

  where γ is a discount factor (chosen such that the infinite sum converges).
- Mathematically speaking, RL is a way to solve multi-period optimal control problems.
- Standard texts on RL include Sutton and Barto (2018) and Szepesvári (2010).
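As a toy illustration of this return (the names are ours, not from the slides), the discounted sum can be accumulated backwards over an episode's rewards:

def discounted_return(rewards, gamma=0.99):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    `rewards` holds R_{t+1}, R_{t+2}, ... for one episode.
    """
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
    return g

# Example: three periods of rewards
print(discounted_return([1.0, 0.5, -0.2], gamma=0.9))  # 1.0 + 0.9*0.5 + 0.81*(-0.2)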
Brief introduction to RL III
- The action-value function expresses the value of starting in state s, taking an arbitrary action a, and then following policy π thereafter:

  q_π(s, a) = E_π[G_t | S_t = s, A_t = a]   (1)

  where E_π denotes the expectation under the assumption that policy π is followed.
- If we knew the q-function corresponding to the optimal policy, q*, we would know the optimal policy itself: we choose the a in the action set that maximizes q*(s_t, a). This is called the greedy policy.
- Hence the problem is reduced to finding q*, or producing a sequence of iterates that converges to q*.
- Methods for producing those iterates are based on the Bellman equations.
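For a finite action set, the greedy policy is a one-line argmax over an estimated q-function; `q_hat` and the `actions` grid below are hypothetical stand-ins for illustration:

import numpy as np

def greedy_action(q_hat, state, actions):
    """Return the action maximizing the estimated q-function in `state`.

    q_hat:   callable (state, action) -> estimated q-value
    actions: list of candidate actions (e.g. integer share trades)
    """
    values = [q_hat(state, a) for a in actions]
    return actions[int(np.argmax(values))]

# Example with a toy quadratic q-function whose best action equals the state
actions = list(range(-5, 6))
q_hat = lambda s, a: -(a - s) ** 2
print(greedy_action(q_hat, 2, actions))  # -> 2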
A visual example
[Video shown in the talk: YouTube example]
Supervised learning vs. RL
[Diagram: in supervised learning, inputs map to outputs and the training signals are the target outputs from the training set; in reinforcement learning, states map to actions and the training signals are the rewards.]
RL has a feedback loop
[Diagram: the reinforcement learner's actions feed back into the environment, which in turn produces the next state and reward.]
Is RL worth the trouble?
RL is often harder than supervised learning:
- Joint learning and planning/optimizing from correlated samples.
- The distribution of the data changes with the choice of actions.
- Need access to the environment for training.

So when may it be a good idea to use RL?
- Data comes in the form of trajectories (i.e. non-IID sequences).
- We need to make a sequence of decisions (i.e. non-IID decisions).
- We are able to observe feedback to states or choices of actions. This information can be partial and/or noisy.
- There is a gain to be made by optimizing action choice over a portion of the trajectory (i.e. the time consistency needed for the Bellman equation to hold).
- Other ML techniques cannot easily deal with these situations.
Common challenges when solving RL problems
- Specifying the model: representing the state, choosing the set of actions, designing the reward.
- Acquiring data for training: exploration vs. exploitation, high-cost actions, time-delayed rewards.
- Function approximation (random forests, CNNs, ...).
- Validation, confidence measures.
Automatic hedging in theory I
- We define automatic hedging to be the practice of using trained RL agents to handle hedging.
- With no trading frictions and where continuous trading is possible, there may be a dynamic replicating portfolio which hedges the option position perfectly, meaning that the overall portfolio (option minus replication) has zero variance.
- With frictions, and where only discrete trading is possible, the goal becomes to minimize variance and cost. We will use this to define the reward.
Automatic hedging in theory II
- This suggests we can seek the agent's optimal portfolio as the solution to a mean-variance optimization problem with risk-aversion κ:

  max ( E[w_T] − (κ/2) · V[w_T] )   (2)

  where the final wealth w_T is the sum of individual wealth increments δw_t:

  w_T = w_0 + Σ_{t=1}^{T} δw_t

- We will let wealth increments include trading costs.
Automatic hedging in theory III
- In the random walk case, this leads to solving

  min over permissible strategies:  Σ_{t=0}^{T} ( E[−δw_t] + (κ/2) · V[δw_t] )   (3)

  where −δw_t = c_t, and c_t is the total trading cost paid in period t (including commissions, bid-offer spread cost, market impact cost, and other sources of slippage).
- With an appropriate choice of reward function, this mean-variance problem can be recast as an RL problem.
Automatic hedging in theory IV
- We choose the reward in each period to be²

  R_t = δw_t − (κ/2) · (δw_t)²   (4)

- Thus, training reinforcement learners with this kind of reward function amounts to training automatic hedgers who trade off costs versus hedging variance.

² See Ritter (2017) for a general discussion of reward functions in trading.
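Equation (4) translates directly into code; `dw` stands for the per-period wealth increment, net of trading costs:

def reward(dw, kappa=0.1):
    """Per-period reward R_t = dw - (kappa/2) * dw**2 (equation 4).

    dw:    wealth increment over the period, net of trading costs
    kappa: risk-aversion parameter
    """
    return dw - 0.5 * kappa * dw ** 2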
Automatic hedging in practice I
- Simplest possible example: a European call option with strike price K and expiry T on a non-dividend-paying stock.
- We take the strike and maturity as fixed, exogenously given constants. For simplicity, we assume the risk-free rate is zero.
- The agent we train will learn to hedge this specific option with this strike and maturity. It is not being trained to hedge any option with any possible strike/maturity.
- For European options, the state must minimally contain (1) the current price S_t of the underlying, (2) the time τ = T − t > 0 remaining to expiry, and (3) our current position of n shares.
- The state is thus naturally an element of³

  S = R²₊ × Z = {(S, τ, n) : S > 0, τ > 0, n ∈ Z}
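For illustration, one minimal way to carry this state around in code (the class and field names are our own, not the article's):

from dataclasses import dataclass

@dataclass(frozen=True)
class HedgeState:
    """State (S, tau, n): underlying price, time to expiry, share position."""
    price: float      # S > 0, current price of the underlying
    tau: float        # tau = T - t > 0, time remaining to expiry (in days here)
    position: int     # n, current share position (integer: round-lotted)

state = HedgeState(price=100.0, tau=10.0, position=-50)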
Automatic hedging in practice II
- The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state. We expect the agent to learn such nonlinear functions on its own.
- A key point: this has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.

³ If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.
Let's put RL at a disadvantage
The RL agent is at a disadvantage: it does not know any of the following information:
- the strike price K,
- that the stock price process is a geometric Brownian motion (GBM),
- the volatility of the price process,
- the BSM formula,
- the payoff function (S − K)+ at maturity,
- any of the Greeks.
Thus, it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.
Simulation assumptions I
- We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S_0 and daily lognormal volatility σ_day.
- We consider an initially at-the-money European call option (struck at K = S_0) with T days to maturity.
- We discretize time with D periods per day; hence each "episode" has T · D total periods.
- We require trades (hence also holdings) to be integer numbers of shares.
- We assume that our agent's job is to hedge one contract of this option.
- In the specific examples below, the parameters are σ = 0.01, S_0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1.
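Under these assumptions, a single training episode's price path can be simulated in a few lines. This is a sketch in our own notation (zero drift, since the risk-free rate is zero):

import numpy as np

def simulate_episode(s0=100.0, sigma_day=0.01, T=10, D=5, seed=None):
    """Simulate one episode: a GBM price path with T*D periods.

    With zero risk-free rate, the per-step log-price drift is -sigma^2/2.
    Returns the price at each of the T*D + 1 period boundaries.
    """
    rng = np.random.default_rng(seed)
    sig = sigma_day * np.sqrt(1.0 / D)   # per-period lognormal volatility
    z = rng.standard_normal(T * D)
    log_increments = -0.5 * sig ** 2 + sig * z
    return s0 * np.exp(np.concatenate(([0.0], np.cumsum(log_increments))))

path = simulate_episode(seed=42)
print(path[:3], path[-1])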
Simulation assumptions II
- T-costs: for a trade size of n shares, we define

  cost(n) = multiplier × TickSize × (|n| + 0.01 n²)

  where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents the cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact.
- All of the examples were trained on a single CPU, and the longest training time allowed was one hour.
- Baseline agent = RL agent trained in a frictionless world.
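The cost schedule above, transcribed as a function:

def trading_cost(n, multiplier=1.0, tick_size=0.1):
    """cost(n) = multiplier * TickSize * (|n| + 0.01 * n^2) for a trade of n shares."""
    return multiplier * tick_size * (abs(n) + 0.01 * n ** 2)

print(trading_cost(100))  # spread cost plus quadratic impact: 0.1 * (100 + 100) = 20.0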
Example: Baseline agent (discrete & no t-costs)
[Line plot: value (dollars or shares) vs. timestep (D·T); series: delta.hedge.shares, option.pnl, stock.pnl, stock.pos.shares, total.pnl]
Figure 1: Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent's position tracks the delta.
Example: Baseline agent (discrete & t-costs)
[Line plot: value (dollars or shares) vs. timestep (D·T); series: cost.pnl, delta.hedge.shares, option.pnl, stock.pnl, stock.pos.shares, total.pnl]
Figure 2: Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · ∆, rounded to shares.
Example: T-cost aware agent (discrete & t-costs)
[Line plot: value (dollars or shares) vs. timestep (D·T); series: cost.pnl, delta.hedge.shares, option.pnl, stock.pnl, stock.pos.shares, total.pnl]
Kernel density estimates of total cost & volatility
[Two density plots comparing methods "delta" and "reinf": total cost (left panel) and realized vol (right panel)]
Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.
Kernel density estimates of total P&L
[Density plot of the t-statistic of total P&L, comparing methods "delta" and "reinf"]
Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs, for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that its t-statistic is much more often close to zero and insignificant.
Conclusions I
We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.
- The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.
- It accomplishes this without the user providing any of the following information: the strike price K, the fact that the stock price process is a geometric Brownian motion, the volatility of the price process, the Black-Scholes-Merton formula, the payoff function (S − K)+ at maturity, or any of the Greeks.
Conclusions II
- A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately.
References I
Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.
Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.
Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.
Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.
Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.
Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.
Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.
Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.
Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes (-Merton) Worlds". In: arXiv:1712.04609.
Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Working paper, Graduate School of Business, Stanford University.
References II
Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.
Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.
Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.
Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.
Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.
Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.
Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition (in progress). MIT Press, Cambridge.
Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.
References III
Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.
Machine learning in finance IAdoption of ML has been rapid in the financial industry Why
The data The algos amp tools The infrastructure for data storage amp computation
Difference today versus 3 years ago Many use-cases are available now New technology constantly being adopted
Robo advisors block chains algo amp systematic trading Many hedge funds use the Amazon S3 cloud etc
Alternative data amp its handlingpreparation is betterunderstood A shift towards being incorporated into existing models Common issues Debiasing tagging amp paneling
4 40
Machine learning in finance IIBuyer beware Testing models In- versus out-of-sample Where are your algos and data from
ldquoBlack boxrdquo risk on steroids Is your data biased Are your algos correct Do you know what they do Did you
implement them yourself
Data science amp ML are not cure-alls (who knew)
What do we want in the financial indsutry Financial domain expertise often critical More business use cases Interpretable ML
5 40
Machine learning for trading I
While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster
6 40
Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers
for single- and multi-period problems Bayesian learning and Gaussian processes used for building
forecasts (forecasting distributions) and dealing withestimationmodel risk
Unsupervised learning techniques for building statistical riskmodels
Regime-switching models for dealing with different marketenvironments (eg risk onoff)
Mixed-distributions to model skewed distributions andtail-behavior
Bootstrap and cross-validation techniques for backtesting
7 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Machine learning in finance IIBuyer beware Testing models In- versus out-of-sample Where are your algos and data from
ldquoBlack boxrdquo risk on steroids Is your data biased Are your algos correct Do you know what they do Did you
implement them yourself
Data science amp ML are not cure-alls (who knew)
What do we want in the financial indsutry Financial domain expertise often critical More business use cases Interpretable ML
5 40
Machine learning for trading I
While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster
6 40
Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers
for single- and multi-period problems Bayesian learning and Gaussian processes used for building
forecasts (forecasting distributions) and dealing withestimationmodel risk
Unsupervised learning techniques for building statistical riskmodels
Regime-switching models for dealing with different marketenvironments (eg risk onoff)
Mixed-distributions to model skewed distributions andtail-behavior
Bootstrap and cross-validation techniques for backtesting
7 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Machine learning for trading I
While the role of trading has not changed modern ML allowsus to Create more realistic models Build models that adapt to changing market conditions Solve the models faster
6 40
Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers
for single- and multi-period problems Bayesian learning and Gaussian processes used for building
forecasts (forecasting distributions) and dealing withestimationmodel risk
Unsupervised learning techniques for building statistical riskmodels
Regime-switching models for dealing with different marketenvironments (eg risk onoff)
Mixed-distributions to model skewed distributions andtail-behavior
Bootstrap and cross-validation techniques for backtesting
7 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the trouble?
RL is often harder than supervised learning:
Joint learning and planning/optimizing from correlated samples.
The distributions of the data change with the choice of actions.
Need access to the environment for training.
So when may it be a good idea to use RL?
Data comes in the form of trajectories (i.e., non-IID sequences).
We need to make a sequence of decisions (i.e., non-IID decisions).
We are able to observe feedback to states or choices of actions; this information can be partial and/or noisy.
There is a gain to be made by optimizing action choice over a portion of the trajectory (i.e., the time consistency needed for the Bellman equation to hold).
Other ML techniques cannot easily deal with these situations.
Common challenges when solving RL problems
Specifying the model: representing the state, choosing the set of actions, designing the reward.
Acquiring data for training: exploration vs. exploitation, high-cost actions, time-delayed rewards.
Function approximation (random forests, CNNs).
Validation, confidence measures.
Automatic hedging in theory I
We define automatic hedging to be the practice of using trained RL agents to handle hedging.
With no trading frictions, and where continuous trading is possible, there may be a dynamic replicating portfolio which hedges the option position perfectly, meaning that the overall portfolio (option minus replication) has zero variance.
With frictions, and where only discrete trading is possible, the goal becomes to minimize variance and cost. We will use this to define the reward.
Automatic hedging in theory II
This suggests we can seek the agent's optimal portfolio as the solution to a mean-variance optimization problem with risk-aversion κ:

max ( E[w_T] − (κ/2) V[w_T] )    (2)

where the final wealth w_T is the sum of individual wealth increments δw_t:

w_T = w_0 + Σ_{t=1}^{T} δw_t

We will let wealth increments include trading costs.
Automatic hedging in theory III
In the random walk case, this leads to solving

min over permissible strategies:  Σ_{t=0}^{T} ( E[−δw_t] + (κ/2) V[δw_t] )    (3)

where
−δw_t = c_t
and c_t is the total trading cost paid in period t (including commissions, bid-offer spread cost, market impact cost, and other sources of slippage).
With an appropriate choice of reward function, this mean-variance problem can be recast as an RL problem.
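Why (2) reduces to (3): under the random-walk assumption the wealth increments δw_t are uncorrelated, so the mean and variance of terminal wealth decompose term by term (a short step we spell out here for completeness):

E[w_T] = w_0 + Σ_{t=1}^{T} E[δw_t],    V[w_T] = Σ_{t=1}^{T} V[δw_t]

Hence maximizing (2) is equivalent to minimizing Σ_t ( E[−δw_t] + (κ/2) V[δw_t] ), which is exactly problem (3).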
Automatic hedging in theory IV
We choose the reward in each period to be2

R_t = δw_t − (κ/2) (δw_t)²    (4)

Thus, training reinforcement learners with this kind of reward function amounts to training automatic hedgers who trade off costs versus hedging variance.

2 See Ritter (2017) for a general discussion of reward functions in trading.
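A minimal sketch of this reward in code, assuming the wealth increment passed in already includes trading costs (the function name is ours):

def reward(delta_wealth, kappa=0.1):
    # R_t = delta_w_t - (kappa / 2) * (delta_w_t)^2, eq. (4);
    # delta_wealth is the period's wealth increment, net of trading costs.
    return delta_wealth - 0.5 * kappa * delta_wealth ** 2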
Automatic hedging in practice I
Simplest possible example: a European call option with strike price K and expiry T on a non-dividend-paying stock.
We take the strike and maturity as fixed, exogenously-given constants. For simplicity, we assume the risk-free rate is zero.
The agent we train will learn to hedge this specific option with this strike and maturity. It is not being trained to hedge any option with any possible strike/maturity.
For European options, the state must minimally contain (1) the current price St of the underlying, (2) the time τ = T − t > 0 remaining to expiry, and (3) our current position of n shares.
The state is thus naturally an element of3

S = ℝ₊² × ℤ = {(S, τ, n) | S > 0, τ > 0, n ∈ ℤ}
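One natural encoding of this state as a container (an illustrative sketch, not the article's code):

from typing import NamedTuple

class HedgeState(NamedTuple):
    # An element (S, tau, n) of the state space:
    price: float   # current underlying price S > 0
    tau: float     # time remaining to expiry, tau > 0
    shares: int    # current position n, an integer number of shares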
Automatic hedging in practice II
The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state. We expect the agent to learn such nonlinear functions on its own.
A key point: this has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.

3 If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation, the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.
Let's put RL at a disadvantage
The RL agent is at a disadvantage: it does not know any of the following information:
the strike price K,
that the stock price process is a geometric Brownian motion (GBM),
the volatility of the price process,
the BSM formula,
the payoff function (S − K)+ at maturity,
any of the Greeks.
Thus, it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.
Simulation assumptions I
We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S0 and daily lognormal volatility σday.
We consider an initially at-the-money European call option (struck at K = S0) with T days to maturity.
We discretize time with D periods per day; hence each "episode" has T · D total periods.
We require trades (hence also holdings) to be integer numbers of shares.
We assume that our agent's job is to hedge one contract of this option.
In the specific examples below, the parameters are σ = 0.01, S0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1.
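Under these assumptions, one episode of prices could be simulated as follows. This is a sketch with zero drift (matching the zero risk-free rate); the function and parameter names are ours:

import numpy as np

def simulate_episode(s0=100.0, sigma_day=0.01, days=10, periods_per_day=5, seed=None):
    # GBM with zero drift, sampled at D periods per day: T*D steps per episode.
    rng = np.random.default_rng(seed)
    dt = 1.0 / periods_per_day  # time step, measured in days
    z = rng.standard_normal(days * periods_per_day)
    log_ret = -0.5 * sigma_day**2 * dt + sigma_day * np.sqrt(dt) * z
    return s0 * np.exp(np.concatenate(([0.0], np.cumsum(log_ret))))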
Simulation assumptions II
T-costs: for a trade size of n shares, we define

cost(n) = multiplier × TickSize × (|n| + 0.01 n²)

where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents a cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact.
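In code, this t-cost schedule is a one-liner (a direct transcription; the function name is ours):

def trading_cost(n, tick_size=0.1, multiplier=1.0):
    # cost(n) = multiplier * TickSize * (|n| + 0.01 * n^2):
    # the linear term ~ crossing a two-tick-wide spread,
    # the quadratic term ~ a simplistic market-impact model.
    return multiplier * tick_size * (abs(n) + 0.01 * n**2)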
All of the examples were trained on a single CPU, and the longest training time allowed was one hour.
Baseline agent = RL agent trained in a frictionless world.
Example: Baseline agent (discrete & no t-costs)
[Figure: time series over 50 timesteps of delta hedge (shares), option P&L, stock P&L, stock position (shares), and total P&L; x-axis: timestep (D·T); y-axis: value (dollars or shares).]
Figure 1: Stock & option P&L roughly cancel to give the (relatively low variance) total P&L. The agent's position tracks the delta.
Example: Baseline agent (discrete & t-costs)
[Figure: time series over 50 timesteps of cost P&L, delta hedge (shares), option P&L, stock P&L, stock position (shares), and total P&L; x-axis: timestep (D·T); y-axis: value (dollars or shares).]
Figure 2: Stock & option P&L roughly cancel to give the (relatively low variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · ∆, rounded to shares.
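The "delta" benchmark used in Figures 2–4 can be sketched as follows, using the standard BSM call delta with zero rates; the helper names are ours, and one contract is taken to be 100 shares:

from math import log, sqrt
from statistics import NormalDist

def bsm_call_delta(s, k, sigma, tau):
    # BSM call delta with zero rates: N(d1).
    d1 = (log(s / k) + 0.5 * sigma**2 * tau) / (sigma * sqrt(tau))
    return NormalDist().cdf(d1)

def delta_hedge_target(s, k, sigma, tau):
    # Next-period position: -100 * delta, rounded to an integer share count.
    return round(-100 * bsm_call_delta(s, k, sigma, tau))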
Example: T-cost aware agent (discrete & t-costs)
[Figure: time series over 50 timesteps of cost P&L, delta hedge (shares), option P&L, stock P&L, stock position (shares), and total P&L; x-axis: timestep (D·T); y-axis: value (dollars or shares).]
Kernel density estimates of total cost & volatility
[Figure: kernel density estimates by method ("delta" vs. "reinf"); left panel: total cost; right panel: realized vol.]
Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.
Kernel density estimates of total P&L
[Figure: kernel density estimates of the t-statistic of total P&L by method ("delta" vs. "reinf").]
Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs, and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.
Conclusions I
We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.
The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.
It accomplishes this without the user providing any of the following information:
the strike price K,
the fact that the stock price process is a geometric Brownian motion,
the volatility of the price process,
the Black-Scholes-Merton formula,
the payoff function (S − K)+ at maturity,
any of the Greeks.
Conclusions II
A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately.
References I
Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.
Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.
Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.
Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.
Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.
Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.
Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.
Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.
Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes (-Merton) Worlds". In: arXiv:1712.04609.
Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

References II
Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.
Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.
Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.
Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.
Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.
Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.
Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement learning: An introduction. Second edition, in progress. MIT Press, Cambridge.
Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

References III
Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.
Machine learning for trading IIExamples of ML tools that are being leveraged for trading LASSO-based techniques yield very fast derivative free solvers
for single- and multi-period problems Bayesian learning and Gaussian processes used for building
forecasts (forecasting distributions) and dealing withestimationmodel risk
Unsupervised learning techniques for building statistical riskmodels
Regime-switching models for dealing with different marketenvironments (eg risk onoff)
Mixed-distributions to model skewed distributions andtail-behavior
Bootstrap and cross-validation techniques for backtesting
7 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Machine learning for trading III Reinforcement learning (often tree- or NN-based models) for
complex trading planning problems in the presence ofuncertainty (where the value function is not easily obtainable)
NLP LDA amp extensions ICA for analysis of news filings andreports
8 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Replication amp hedging
Replicating and hedging an option position is fundamental infinance
The core idea of the seminal work by Black-Scholes-Merton(BSM) In a complete and frictionless market there is a continuously
rebalanced dynamic trading strategy in the stock and risklesssecurity that perfectly replicates the option (Black and Scholes(1973) Merton (1973))
In practice continuous trading of arbitrarily small amounts ofstock is infinitely costly and the replicating portfolio isadjusted at discrete times Perfect replication is impossible and an optimal hedging
strategy will depend on the desired trade-off betweenreplication error and trading costs
9 40
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.
Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.
Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.
Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.
Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.
Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.
Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.
Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.
Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.
Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Graduate School of Business, Stanford University.
References II
Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.
Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.
Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.
Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.
Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.
Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.
Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition (in progress). MIT Press, Cambridge.
Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.
References III
Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.
Related work IWhile a number of articles have considered hedging in discrete timeor transaction costs alone
Leland (1985) was first to address discrete hedging undertransaction costs His work was subsequently followed by others1 The majority of these studies treat proportionate transaction
costs
More recently several studies have considered option pricingand hedging subject to both permanent and temporary marketimpact in the spirit of Almgren and Chriss (1999) includingRogers and Singh (2010) Almgren and Li (2016) BankSoner and Voszlig (2017) and Saito and Takahashi (2017)
Halperin (2017) applies reinforcement learning to options butthe approach is specific to the BSM model and does notconsider transaction costs
10 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Related work II Buehler et al (2018) evaluate NN-based hedging under
coherent risk measures subject to proportional transactioncosts
1See for example Figlewski (1989) Boyle and Vorst (1992) Henrotte(1993) Grannan and Swindle (1996) Toft (1996) Whalley and Wilmott(1997) and Martellini (2000)
11 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
What we doIn the article we Show how to build a reinforcement learning (RL) system which
can learn how to optimally hedge an option (or otherderivative securities) in a fully realistic setting Discrete time Nonlinear transaction costs Round-lotting
Method allows the user to ldquoplug-inrdquo any option pricing andsimulation library and then train the system with no furthermodifications
The system learns how to optimally trade-off trading costsversus hedging variance for that security Uses a continuous state space Relies on nonlinear regression techniques to the ldquosarsa targetsrdquo
derived from the Bellman equation Method extends in a straightforward way to arbitrary portfolios
of derivative securities12 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Reinforcement learning for hedging
13 40
Brief introduction to RL I
RL agent interacts with its environment The ldquoenvironmentrdquo isthe part of the system outside of the agentrsquos direct control
At each time step t the agent observes the current state of theenvironment st and chooses an action at from the action set
This choice influences both the transition to the next state aswell as the reward Rt the agent receives
Environment
Reward ActionState
RL Agent
14 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Graduate School of Business, Stanford University.

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition (in progress). Cambridge: MIT Press.

Szepesvari, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Brief introduction to RL II A policy π is a way of choosing an action at conditional on
the current state st RL is the search for policies which maximize the expectation of
the cumulative reward Gt
E[Gt ] = E[Rt+1 + γRt+2 + γ2Rt+3 + ]
where γ is discount factor (such that the infinite sumconverges)
Mathematically speaking RL is a way to solve multi-periodoptimal control problems
Standard texts on RL includes Sutton and Barto (2018) andSzepesvari (2010)
15 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Brief introduction to RL III The action-value function expresses the value of starting in
state s taking an arbitrary action a and then following policyπ thereafter
qπ(s a) = Eπ[Gt | St = sAt = a] (1)
where Eπ denotes the expectation under the assumption thatpolicy π is followed
If we knew the q-function corresponding to the optimal policyqlowast we would know the optimal policy itself namely We choose a in the action set that maximizes qlowast(st a)
This is called the greedy policy Hence the problem is reduced to finding qlowast or producing a
sequence of iterates that converges to qlowast Methods for producing those iterates are based on the Bellman
equations
16 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
A visual example
YouTube example
17 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Supervised learning vs RL
Training signals equiv target output from training set
Supervisedlearning
Inputs Outputs
Training signals equiv rewards
ReinforcementlearningStates Actions
18 40
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II
The state does not need to contain the option Greeks, because they are (nonlinear) functions of the variables the agent has access to via the state. We expect the agent to learn such nonlinear functions on its own.
A key point: this has the advantage of not requiring any special model-specific calculations that may not extend beyond BSM models.
³If the option is American, then it may be optimal to exercise early just before an ex-dividend date. In this situation, the state must be augmented with one additional variable: the size of the anticipated dividend in period t + 1.
Let's put RL at a disadvantage
The RL agent is at a disadvantage: it does not know any of the following information:
the strike price K;
that the stock price process is a geometric Brownian motion (GBM);
the volatility of the price process;
the BSM formula;
the payoff function (S − K)^+ at maturity;
any of the Greeks.
Thus, it must infer the relevant information, insofar as it affects the value function, by interacting with a simulated environment.
Simulation assumptions I
We simulate a discrete BSM world where the stock price process is a geometric Brownian motion (GBM) with initial price S_0 and daily lognormal volatility σ_day.
We consider an initially at-the-money European call option (struck at K = S_0) with T days to maturity.
We discretize time with D periods per day; hence each "episode" has T · D total periods.
We require trades (hence also holdings) to be integer numbers of shares.
We assume that our agent's job is to hedge one contract of this option.
In the specific examples below, the parameters are σ = 0.01, S_0 = 100, T = 10, and D = 5. We set the risk-aversion κ = 0.1.
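A minimal price-path simulator under these assumptions (zero drift, consistent with the zero risk-free rate; the function and parameter names are illustrative):

```python
import numpy as np

def simulate_episode(S0=100.0, sigma_day=0.01, T_days=10, D=5, seed=None):
    """Simulate one episode of GBM prices on the discrete grid:
    D periods per day over T_days days, i.e., T_days * D steps."""
    rng = np.random.default_rng(seed)
    n_steps = T_days * D
    vol = sigma_day / np.sqrt(D)  # per-period lognormal volatility
    # Zero-drift (martingale) GBM: log-return = -vol^2/2 + vol * z
    log_ret = -0.5 * vol ** 2 + vol * rng.standard_normal(n_steps)
    return S0 * np.exp(np.concatenate(([0.0], np.cumsum(log_ret))))

prices = simulate_episode(seed=0)  # array of 51 prices: S_0 plus 50 steps
```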
Simulation assumptions II
T-costs: for a trade of n shares we define

    cost(n) = multiplier × TickSize × (|n| + 0.01 n²)

where we take TickSize = 0.1. With multiplier = 1, the term TickSize × |n| represents the cost, relative to the midpoint, of crossing a bid-offer spread that is two ticks wide. The quadratic term is a simplistic model for market impact.
All of the examples were trained on a single CPU, and the longest training time allowed was one hour.
Baseline agent = RL agent trained in a frictionless world.
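A direct transcription of this t-cost model (names are illustrative):

```python
def trading_cost(n: int, multiplier: float = 1.0, tick_size: float = 0.1) -> float:
    """cost(n) = multiplier * TickSize * (|n| + 0.01 * n**2) for n shares.

    The |n| term models crossing a two-tick-wide bid-offer spread
    (relative to the midpoint); the quadratic term is a simplistic
    market-impact model.
    """
    return multiplier * tick_size * (abs(n) + 0.01 * n ** 2)
```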
Example: Baseline agent (discrete & no t-costs)
[Plot: time series over the 50 timesteps (D · T) of delta hedge (shares), option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.]
Figure 1: Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent's position tracks the delta.
Example: Baseline agent (discrete & t-costs)
[Plot: time series over the 50 timesteps (D · T) of cost P&L, delta hedge (shares), option P&L, stock P&L, stock position (shares), and total P&L, in dollars or shares.]
Figure 2: Stock & options P&L roughly cancel to give the (relatively low-variance) total P&L. The agent trades so that the position in the next period will be the quantity −100 · Δ, rounded to shares.
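A sketch of this delta-hedge benchmark, assuming the standard BSM call delta with zero rates and dividends (σ and τ must be in consistent units, e.g., annualized volatility with τ a year fraction; names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def bsm_call_delta(S, K, sigma, tau):
    """BSM delta of a European call with zero rates and dividends."""
    d1 = (np.log(S / K) + 0.5 * sigma ** 2 * tau) / (sigma * np.sqrt(tau))
    return norm.cdf(d1)

def delta_hedge_position(S, K, sigma, tau, shares_per_contract=100):
    """Benchmark target stock position for one long call contract:
    -(shares_per_contract * delta), rounded to whole shares."""
    delta = bsm_call_delta(S, K, sigma, tau)
    return -int(round(float(delta) * shares_per_contract))
```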
Example: T-cost aware agent (discrete & t-costs)
[Plot: time series over the 50 timesteps (D · T) of cost P&L, delta hedge (shares), option P&L, stock P&L, stock position (shares), and total P&L for the t-cost aware agent, in dollars or shares.]
Kernel density estimates of total cost & volatility
[Plot: kernel density estimates; left panel, total cost; right panel, realized volatility; for the "delta" and "reinf" methods.]
Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.
Kernel density estimates of total P&L
[Plot: kernel densities of the t-statistic of total P&L, roughly over the range −8 to 0, for the "delta" and "reinf" methods.]
Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs, and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that its t-statistic is much more often close to zero and insignificant.
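One plausible reading of this per-run statistic (an assumption, not spelled out on the slide): compute a one-sample t-statistic of the per-period wealth increments against a zero mean, so that a small |t| means the episode's total P&L is statistically indistinguishable from zero:

```python
import numpy as np

def pnl_t_stat(increments):
    """t-statistic of total P&L for one run: t = mean / (std / sqrt(n)),
    computed from the run's per-period wealth increments."""
    x = np.asarray(increments, dtype=float)
    return x.mean() / (x.std(ddof=1) / np.sqrt(x.size))
```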
Conclusions I
We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.
The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.
It accomplishes this without the user providing any of the following information:
the strike price K;
the fact that the stock price process is a geometric Brownian motion;
the volatility of the price process;
the Black-Scholes-Merton formula;
the payoff function (S − K)^+ at maturity;
any of the Greeks.
Conclusions II
A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator in which t-costs and options prices are simulated accurately.
References I
Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.
Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.
Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.
Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.
Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.
Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.
Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.
Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.
Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.
Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Graduate School of Business, Stanford University.

References II
Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.
Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.
Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.
Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.
Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.
Rogers, Leonard C. G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.
Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition (in progress). MIT Press, Cambridge.
Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

References III
Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.
RL has a feedback loop
Environment
Training signals equiv rewards
ReinforcementlearningStates Actions
19 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Is RL worth the troubleRL is often harder than supervised learning Joint learning and planningoptimizing from correlated samples The distributions of the data changes with the choice of
actions Need access to the environment for training
So when may it be a good idea to use RL Data comes in the form of trajectories (ie non-IID sequences) We need to make a sequence of decisions (ie non-IID
decisions) We are able to observe feedback to state or choice of actions
This information can be partial andor noisy There is a gain to be made by optimizing action choice over a
portion of the trajectory (ie time consistency needed for theBellman equation to hold)
Other ML techniques cannot easily deal with these situations20 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Common challenges when solving RL problems
Specifying the model Representing the state Choosing the set of actions Designing the reward
Acquiring data for training Exploration exploitation High cost actions Time-delayed reward
Function approximation (random forests CNNs) Validation confidence measures
21 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Automatic hedging in theory I
We define automatic hedging to be the practice of usingtrained RL agents to handle hedging
With no trading frictions and where continuous trading ispossible there may be a dynamic replicating portfolio whichhedges the option position perfectly meaning that the overallportfolio (option minus replication) has zero variance
With frictions and where only discrete trading is possible thegoal becomes to minimize variance and cost We will use this to define the reward
22 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Automatic hedging in theory II This suggest we can seek the agentrsquos optimal portfolio as the
solution to a mean-variance optimization problem withrisk-aversion κ
max983043E[wT ]minus
κ
2V[wT ]
983044(2)
where the final wealth wT is the sum of individual wealthincrements δwt
wT = w0 +T983131
t=1
δwt
We will let wealth increments include trading costs
23 40
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost & volatility

[Figure: kernel density estimates, by method ("delta" vs. "reinf"): total cost (left panel) and realized volatility (right panel).]

Figure 3: Kernel density estimates for total cost (left panel) and volatility of total P&L (right panel) from N = 10,000 out-of-sample simulations. The "reinf" policy achieves much lower cost (t-statistic = −143.22) with no significant difference in volatility of total P&L.
Kernel density estimates of total P&L

[Figure: kernel density estimates of the t-statistic of total P&L, by method ("delta" vs. "reinf").]

Figure 4: Kernel density estimates of the t-statistic of total P&L for each of our out-of-sample simulation runs, and for both policies represented above ("delta" and "reinf"). The "reinf" method is seen to outperform in the sense that the t-statistic is much more often close to zero and insignificant.
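One natural reading of the per-run statistic is the usual one-sample t-statistic for mean P&L equal to zero; a sketch, assuming pnl holds one run's per-period total P&L:

    import numpy as np

    def pnl_tstat(pnl: np.ndarray) -> float:
        # t-statistic for H0: mean per-period P&L = 0
        return pnl.mean() / (pnl.std(ddof=1) / np.sqrt(len(pnl)))

A well-hedged run should produce a t-statistic near zero, i.e., total P&L statistically indistinguishable from zero.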
Conclusions I

We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.

The approach does not depend on the existence of perfect dynamic replication. The system learns to optimally trade off variance and cost, as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.

It accomplishes this without the user providing any of the following information:
- the strike price K
- the fact that the stock price process is a geometric Brownian motion
- the volatility of the price process
- the Black-Scholes-Merton formula
- the payoff function (S − K)+ at maturity
- any of the Greeks
Conclusions II

A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator, in which t-costs and options prices are simulated accurately.
References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.
Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.
Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.
Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.
Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.
Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.
Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.
Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.
Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.
Henrotte, Philippe (1993). "Transaction costs and duplication strategies". In: Graduate School of Business, Stanford University.

References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.
Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.
Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.
Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.
Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.
Rogers, Leonard C.G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.
Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.
Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition, in progress. MIT Press, Cambridge.
Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.

References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.
Automatic hedging in theory III In the random walk case this leads to solving
minpermissible strategies
T983131
t=0
983043E[minusδwt ] +
κ
2V[δwt ]
983044(3)
whereminusδwt = ct
and ct is the total trading cost paid in period t (includingcommissions bid-offer spread cost market impact cost andother sources of slippage)
With an appropriate choice of reward function the problem ofmaximizing this mean-variance problem can be recast as a RLproblem
24 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Automatic hedging in theory IV We choose the reward in each period to be2
Rt = δwt minusκ
2(δwt)
2 (4)
Thus training reinforcement learners with this kind of rewardfunction amounts to training automatic hedgers who tradeoffcosts versus hedging variance
2See Ritter (2017) for a general discussion of reward functions in trading25 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Automatic hedging in practice I
Simplest possible example A European call option with strikeprice K and expiry T on a non-dividend-paying stock
We take the strike and maturity as fixed exogenously-givenconstants For simplicity we assume the risk-free rate is zero
The agent we train will learn to hedge this specific option withthis strike and maturity It is not being trained to hedge anyoption with any possible strikematurity
For European options the state must minimally contain (1)the current price St of the underlying (2) the timeτ = T minus t gt 0 remaining to expiry and (3) our currentposition of n shares
The state is thus naturally an element of3
S = R2+ times Z = (S τ n) | S gt 0 τ gt 0 n isin Z
26 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Automatic hedging in practice II The state does not need to contain the option Greeks because
they are (nonlinear) functions of the variables the agent hasaccess to via the state We expect the agent to learn such nonlinear functions on their
own
A key point This has the advantage of not requiring anyspecial model-specific calculations that may not extendbeyond BSM models
3If the option is American then it may be optimal to exercise early justbefore an ex-dividend date In this situation the state must be augmented withone additional variable The size of the anticipated dividend in period t + 1
27 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Letrsquos put RL at a disadvantage
The RL agent is at a disadvantage It does not know any ofthe following information the strike price K that the stock price process is a geometric Brownian motion
(GBM) the volatility of the price process the BSM formula the payoff function (S minus K )+ at maturity any of the Greeks
Thus it must infer the relevant information insofar as itaffects the value function by interacting with a simulatedenvironment
28 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Simulation assumptions I
We simulate a discrete BSM world where the stock priceprocess is a geometric Brownian motion (GBM) with initialprice S0 and daily lognormal volatility of σday
We consider an initially at-the-money European call option(struck at K = S0) with T days to maturity
We discretize time with D periods per day hence eachldquoepisoderdquo has T middot D total periods
We require trades (hence also holdings) to be integer numbersof shares
We assume that our agentrsquos job is to hedge one contract ofthis option
In the specific examples below the parameters areσ = 001 S0 = 100T = 10 and D = 5 We set therisk-aversion κ = 01
29 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions IWe have introduced a RL system that hedges an option underrealistic conditions of discrete trading and nonlinear t-costs
The approach does not depend on the existence of perfectdynamic replication The system learns to optimally trade offvariance and cost as best as possible using whatever securitiesit is given as potential candidates for inclusion in thereplicating portfolio
It accomplishes this without the user providing any of thefollowing information the strike price K the fact that the stock price process is a geometric Browning
motion the volatility of the price process the Black-Scholes-Merton formula the payoff function (S minus K )+ at maturity any of the Greeks
36 40
Conclusions II A key strength of the RL approach It does not make any
assumptions about the form of t-costs RL learns theminimum variance hedge subject to whatever t-cost functionone provides All it needs is a good simulator in which t-costsand options prices are simulated accurately
37 40
References I
Almgren Robert and Neil Chriss (1999) ldquoValue under liquidationrdquo In Risk 1212 pp 61ndash63
Almgren Robert and Tianhui Michael Li (2016) ldquoOption hedging with smooth market impactrdquo InMarket Microstructure and Liquidity 21 p 1650002
Bank Peter H Mete Soner and Moritz Voszlig (2017) ldquoHedging with temporary price impactrdquo InMathematics and Financial Economics 112 pp 215ndash239
Black Fischer and Myron Scholes (1973) ldquoThe pricing of options and corporate liabilitiesrdquo In Journalof Political Economy 813 pp 637ndash654
Boyle Phelim P and Ton Vorst (1992) ldquoOption replication in discrete time with transaction costsrdquo InThe Journal of Finance 471 pp 271ndash293
Buehler Hans et al (2018) ldquoDeep hedgingrdquo In arXiv180203042
Figlewski Stephen (1989) ldquoOptions arbitrage in imperfect marketsrdquo In The Journal of Finance 445pp 1289ndash1311
Grannan Erik R and Glen H Swindle (1996) ldquoMinimizing transaction costs of option hedgingstrategiesrdquo In Mathematical Finance 64 pp 341ndash364
Halperin Igor (2017) ldquoQLBS Q-Learner in the Black-Scholes (-Merton) Worldsrdquo In arXiv171204609
Henrotte Philippe (1993) ldquoTransaction costs and duplication strategiesrdquo In Graduate School ofBusiness Stanford University
38 40
References IIKolm Petter and Gordon Ritter (2019) ldquoDynamic Replication and Hedging A Reinforcement Learning
Approachrdquo In The Journal of Financial Data Science 11 pp 159ndash171
Leland Hayne E (1985) ldquoOption pricing and replication with transactions costsrdquo In The Journal ofFinance 405 pp 1283ndash1301
Martellini Lionel (2000) ldquoEfficient option replication in the presence of transactions costsrdquo In Reviewof Derivatives Research 42 pp 107ndash131
Merton Robert C (1973) ldquoTheory of rational option pricingrdquo In The Bell Journal of Economics andManagement Science pp 141ndash183
Ritter Gordon (2017) ldquoMachine Learning for Tradingrdquo In Risk 3010 pp 84ndash89
Rogers Leonard CG and Surbjeet Singh (2010) ldquoThe cost of illiquidity and its effects on hedgingrdquo InMathematical Finance 204 pp 597ndash615
Saito Taiga and Akihiko Takahashi (2017) ldquoDerivatives pricing with market impact and limit orderbookrdquo In Automatica 86 pp 154ndash165
Sutton Richard S and Andrew G Barto (2018) Reinforcement learning An introduction Secondedition in progress MIT press Cambridge
Szepesvari Csaba (2010) Algorithms for Reinforcement Learning Morgan amp Claypool Publishers
Toft Klaus Bjerre (1996) ldquoOn the mean-variance tradeoff in option replication with transactions costsrdquoIn Journal of Financial and Quantitative Analysis 312 pp 233ndash263
39 40
References IIIWhalley A Elizabeth and Paul Wilmott (1997) ldquoAn asymptotic analysis of an optimal hedging model
for option pricing with transaction costsrdquo In Mathematical Finance 73 pp 307ndash324
40 40
Simulation assumptions II T-costs For a trade size of n shares we define
cost(n) = multiplier times TickSize times (|n|+ 001n2)
where we take TickSize = 01 With multiplier = 1 the termTickSize times |n| represents a cost relative to the midpoint ofcrossing a bid-offer spread that is two ticks wide Thequadratic term is a simplistic model for market impact
All of the examples were trained on a single CPU and thelongest training time allowed was one hour
Baseline agent = RL agent trained in a friction-less world
30 40
Example Baseline agent (discrete amp no t-costs)
minus200
minus100
0
100
200
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 1 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agentrsquos position tracks the delta
31 40
Example Baseline agent (discrete amp t-costs)
minus100
0
100
0 10 20 30 40 50
timestep (DT)
valu
e (
dolla
rs o
r sh
are
s)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
Figure 2 Stock amp options PampL roughly cancel to give the (relatively lowvariance) total PampL The agent trades so that the position in the nextperiod will be the quantity minus100 middot∆ rounded to shares
32 40
Example T-cost aware agent (discrete amp t-costs)
minus100
minus50
0
50
100
0 10 20 30 40 50
timestep (DT)
valu
e (
do
llars
or
sha
res)
costpnl
deltahedgeshares
optionpnl
stockpnl
stockposshares
totalpnl
33 40
Kernel density estimates of total cost amp volatility
000
001
002
003
100 200 300
totalcost
densi
ty
method
delta
reinf
000
025
050
075
100
5 10 15
realizedvol
densi
ty
method
delta
reinf
Figure 3 Kernel density estimates for total cost (left panel) and volatilityof total PampL (right panel) from N = 10 000 out-of-sample simulationsThe ldquoreinfrdquo policy achieves much lower cost (t-statistic = minus14322) withno significant difference in volatility of total PampL
34 40
Kernel density estimates of total PampL
00
01
02
03
04
minus8 minus6 minus4 minus2 0
studenttstatistictotalpnl
densi
ty
method
delta
reinf
Figure 4 Kernel density estimates of the t-statistic of total PampL for eachof our out-of-sample simulation runs and for both policies representedabove (ldquodeltardquo and ldquoreinfrdquo) The ldquoreinfrdquo method is seen to outperform inthe sense that the t-statistic is much more often close to zero andinsignificant 35 40
Conclusions I

We have introduced an RL system that hedges an option under realistic conditions of discrete trading and nonlinear t-costs.

The approach does not depend on the existence of perfect dynamic replication. The system learns to trade off variance and cost as best as possible, using whatever securities it is given as potential candidates for inclusion in the replicating portfolio.

It accomplishes this without the user providing any of the following information:
- the strike price K
- the fact that the stock price process is a geometric Brownian motion
- the volatility of the price process
- the Black-Scholes-Merton formula
- the payoff function (S − K)+ at maturity
- any of the Greeks
Conclusions II

A key strength of the RL approach: it does not make any assumptions about the form of t-costs. RL learns the minimum-variance hedge subject to whatever t-cost function one provides. All it needs is a good simulator in which t-costs and options prices are simulated accurately (a sketch follows below).
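Since "all it needs is a good simulator", here is a minimal sketch of the market-simulation piece, assuming GBM price dynamics as in the conclusions above; every parameter value and name is illustrative rather than the paper's setting.

```python
import numpy as np


def simulate_gbm_paths(S0: float = 100.0, mu: float = 0.0,
                       sigma: float = 0.2, T: float = 0.2,
                       n_steps: int = 50, n_paths: int = 10_000,
                       seed: int = 0) -> np.ndarray:
    """Simulate geometric Brownian motion price paths; returns an
    (n_paths, n_steps) array of prices after each step."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    log_incr = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    return S0 * np.exp(np.cumsum(log_incr, axis=1))
```

Combined with the cost and pricing sketches above, this is enough to generate training and out-of-sample episodes.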
References I

Almgren, Robert and Neil Chriss (1999). "Value under liquidation". In: Risk 12.12, pp. 61–63.

Almgren, Robert and Tianhui Michael Li (2016). "Option hedging with smooth market impact". In: Market Microstructure and Liquidity 2.1, p. 1650002.

Bank, Peter, H. Mete Soner, and Moritz Voß (2017). "Hedging with temporary price impact". In: Mathematics and Financial Economics 11.2, pp. 215–239.

Black, Fischer and Myron Scholes (1973). "The pricing of options and corporate liabilities". In: Journal of Political Economy 81.3, pp. 637–654.

Boyle, Phelim P. and Ton Vorst (1992). "Option replication in discrete time with transaction costs". In: The Journal of Finance 47.1, pp. 271–293.

Buehler, Hans et al. (2018). "Deep hedging". In: arXiv:1802.03042.

Figlewski, Stephen (1989). "Options arbitrage in imperfect markets". In: The Journal of Finance 44.5, pp. 1289–1311.

Grannan, Erik R. and Glen H. Swindle (1996). "Minimizing transaction costs of option hedging strategies". In: Mathematical Finance 6.4, pp. 341–364.

Halperin, Igor (2017). "QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds". In: arXiv:1712.04609.

Henrotte, Philippe (1993). "Transaction costs and duplication strategies". Graduate School of Business, Stanford University.
References II

Kolm, Petter and Gordon Ritter (2019). "Dynamic Replication and Hedging: A Reinforcement Learning Approach". In: The Journal of Financial Data Science 1.1, pp. 159–171.

Leland, Hayne E. (1985). "Option pricing and replication with transactions costs". In: The Journal of Finance 40.5, pp. 1283–1301.

Martellini, Lionel (2000). "Efficient option replication in the presence of transactions costs". In: Review of Derivatives Research 4.2, pp. 107–131.

Merton, Robert C. (1973). "Theory of rational option pricing". In: The Bell Journal of Economics and Management Science, pp. 141–183.

Ritter, Gordon (2017). "Machine Learning for Trading". In: Risk 30.10, pp. 84–89.

Rogers, Leonard C. G. and Surbjeet Singh (2010). "The cost of illiquidity and its effects on hedging". In: Mathematical Finance 20.4, pp. 597–615.

Saito, Taiga and Akihiko Takahashi (2017). "Derivatives pricing with market impact and limit order book". In: Automatica 86, pp. 154–165.

Sutton, Richard S. and Andrew G. Barto (2018). Reinforcement Learning: An Introduction. Second edition. MIT Press, Cambridge.

Szepesvári, Csaba (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Toft, Klaus Bjerre (1996). "On the mean-variance tradeoff in option replication with transactions costs". In: Journal of Financial and Quantitative Analysis 31.2, pp. 233–263.
References III

Whalley, A. Elizabeth and Paul Wilmott (1997). "An asymptotic analysis of an optimal hedging model for option pricing with transaction costs". In: Mathematical Finance 7.3, pp. 307–324.