


SN Computer Science (2020) 1:50 https://doi.org/10.1007/s42979-019-0051-7


ORIGINAL RESEARCH

Discrete‑Event Simulation‑Based Q‑Learning Algorithm Applied to Financial Leverage Effect

E. Barbieri · L. Capocchi · J. F. Santucci

UMR CNRS 6134, University of Corsica, Campus Grimaldi, 20250 Corte, France

Received: 13 September 2019 / Accepted: 26 November 2019 © Springer Nature Singapore Pte Ltd 2019

Abstract  Discrete-event modeling and simulation and machine learning are two frameworks suited for system of systems modeling which, when combined, can give a powerful tool for system optimization and decision making. One of the less explored application domains is finance, where this combination can provide a decision-driven tool for investors. This paper presents a discrete-event specification as a universal framework to implement a machine learning algorithm into a modular and hierarchical environment. This approach has been validated on a financial leverage effect based on a Markov decision-making policy.

Keywords Machine learning · Modeling · Simulation · Discrete event

Introduction

Modeling and simulation (M&S) and artificial intelligence (AI) are two domains that can complement each other. For instance, AI can help the "simulationist" in the modeling of complex systems that are impossible to represent mathematically [15]. On the other hand, M&S can help AI models that fail to deal with complex systems of systems for lack of simple or workable heuristics. In [26], the authors rightly point out that: "AI components can be embedded into a simulation to provide learning or adaptive behavior. And, simulation can be used to evaluate the impact of introducing AI into a 'real world system' such as supply chains or production processes.".

Machine learning is a type of AI that uses three types of algorithms [supervised learning, unsupervised learning, reinforcement learning (RL)] to build models that map an input set to a predicted output set using statistical analysis. A Markov decision process (MDP) is used to describe an environment for reinforcement learning, where the environment is fully observable.

Almost all RL problems can be formalized as MDPs. The definition of an MDP model requires a formal framework for modular specification to represent the interaction (actions/rewards) between the environment and the agent of an MDP. The DEVS formalism (Discrete-EVent system Specification) [2] is an appropriate mathematical framework for specifying an MDP. In addition, DEVS offers the automatic simulation of these models and thus the simulation of reinforcement learning algorithms (e.g., Q-learning) [11] applied to MDPs to obtain the optimal solution in the framework of a decision-making problem with a high degree of uncertainty.

This paper presents the benefits of the DEVS formalism in assisting the realization of RL models, more specifically for the Q-learning algorithm. In [18], the authors propose to use DEVS as an experimental framework for implementing time integration in MDPs to define decision dates. This approach is consistent with the need to treat time as a fundamental part of our decision-making process. AI learning techniques have already been used in a DEVS simulation context. Indeed, in [20] the authors propose the integration of some predictive machine learning algorithms in the DEVS simulator to considerably reduce the execution times of the simulation for many applications without compromising their accuracy. In [23], the comparative and concurrent DEVS simulation is used to test all the possible configurations of the hyper-parameters (momentum, learning rate, etc.) of a neural network. In [21], the authors present the formal concepts underlying DEVS Markov models and how


they are implemented in MS4Me software [31]. Markov concepts of states and state transitions are fully compatible with the DEVS characterization of discrete event systems.

The newly proposed approach allows Q-learning to be isolated in a DEVS Agent component that interacts with a dynamic DEVS Environment component. This modular approach permits total control of the Q-learning loop in order to drive the algorithm's convergence. Basically, an agent and an environment communicate so that the agent converges towards the best possible policy. Thanks to the modular and hierarchical aspects of DEVS, the separation between the agent and the environment within the RL algorithms is improved. A validation has been performed on a dynamic and uncertain environment that corresponds to one of the best application domains of the Markov chain property. We decided to test our approach using the historical data of stock exchange markets because stock option management perfectly fits a very volatile environment where the next state of a system depends only on the present state. The Markov property is clearly identified by the fact that past performance is no guarantee of future performance. This validation is a continuation of the research presented in [1], which focused on a modeling and simulation tool aimed at improving the ex-ante evaluation of the leverage effect of the deployment of European programs.

The paper is organized in the following way: a preliminaries section, including the traditional MDP modeling approach with the Q-learning algorithm and the DEVS formalism, is introduced in Sect. 2. Section 3 presents how reinforcement learning and simulation can be integrated and how the DEVS formalism can help in this integration. The motivation and the contribution towards an M&S and ML combination are introduced in Sect. 4. Section 5 is dedicated to validating our approach by introducing the problem of obtaining a certain kind of leverage effect in financial asset optimization processes. This section presents the mathematical model of the MDP in a framework defined by DEVS in the case of a leverage effect of stock price volatility (CAC40, Nasdaq, etc.). The last section is devoted to concluding remarks and future works.

Preliminaries

This section gives some background on MDPs with the Q-learning algorithm and on the DEVS formalism, in order to introduce the proposed modeling approach based on the combination of these two topics.

Markovian Decision‑making Process Modeling and Q‑Learning Algorithm

Markov decision processes (MDPs) are defined as controlled stochastic processes satisfying the Markov property and assigning rewards to state transitions [3, 17]. An MDP can be defined by the tuple (S, A, T, r), where:

• S is the state space in which the process evolves;
• A is the space of actions that controls the dynamics of the state;
• T is the time space or the time axis;
• r is the reward function on transitions between states.

In the case of a large-scale, simulation-based MDP, the Bellman equation [29] allows us to determine a best possible policy without generating a probability transition matrix, as is the case in the traditional approach. The Bellman equation (Eq. 1) states that the expected long-term reward for a given action is equal to the immediate reward of the current action combined with the expected reward of the best future action taken in the next state. This means that the Q value for a state s and an action a should represent the current reward r plus the maximum expected future reward for the next state s′, discounted by γ. The discount factor γ makes it possible to weight the future values of Q over the long term (γ = 1) or the medium term (γ < 1). According to the Temporal Difference learning technique [22], the matrix Q can be updated as follows:

Q(s, a) ← Q(s, a) + α [ r + γ · max_{a′} Q(s′, a′) − Q(s, a) ]    (1)

with s′ ∈ S, a′ ∈ A, γ ∈ [0, 1] the discount factor and α ∈ [0, 1] the learning rate. The learning rate determines how much the newly computed information overrides the old one. The discount factor γ determines the importance of future rewards: a value of 0 makes the agent myopic by considering only the current rewards, while a factor close to 1 also takes the more distant rewards into account.

The Q variable allows us to decide how much future rewards count compared to the current reward. Indeed, it is a matrix composed of a number of rows equal to the number of states and a number of columns equal to the number of actions considered. The Q-learning algorithm is used to determine, by iteration, the optimal value of the variable Q in order to find the best possible policy. Maximization selects, among all possible actions, only the action a for which Q(s, a) has the highest value.

Q-learning [27] is a popular RL algorithm that has been used to solve many MDP problems. Figure 1 shows the Q-learning algorithm from [22]. In line 3, the start state is defined before the loop (line 4) in charge of trying to reach the goal state (line 9). For every step, the Q matrix is updated according to Eq. 1 by considering a new action a, reward r and state s′. When the goal state is reached, a new episode takes place.

As pointed out in [6], the Q-learning algorithm converges in polynomial time, depending on the value of the learning rate. However, if the discount factor is close to or equal to 1, the value of Q may diverge [19].
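To make the update rule of Eq. 1 and the loop of Fig. 1 concrete, the following minimal sketch implements tabular Q-learning on a small, hypothetical goal-reward MDP; the toy environment, its state count and the parameter values are illustrative assumptions, not the financial model of this paper.

import numpy as np

# Hypothetical toy MDP: 6 states on a line, 2 actions (0: left, 1: right),
# goal-reward representation: reward 1 only when the goal state is entered.
N_STATES, N_ACTIONS, GOAL = 6, 2, 5

def env_step(s, a):
    """Deterministic toy environment: return (next_state, reward, done)."""
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.8, gamma=0.95, epsilon=1.0, seed=0):
    """Tabular Q-learning; here epsilon = 1.0 means pure exploration, and the
    off-policy update of Eq. 1 still learns the greedy (optimal) policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))          # the |S| x |A| table
    for _ in range(episodes):                    # line 2 of Fig. 1
        s, done = 0, False                       # line 3: initialize s
        while not done:                          # line 4: repeat for each step
            # line 5: choose a from s using an epsilon-greedy policy derived from Q
            a = int(rng.integers(N_ACTIONS)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env_step(s, a)     # line 6: take a, observe r and s'
            # line 7: temporal-difference update of Eq. 1
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                           # line 8: s <- s'
    return Q

if __name__ == "__main__":
    Q = q_learning()
    print(np.argmax(Q, axis=1))  # greedy policy per state (1 = move right, towards the goal)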

Discrete Event System Specification

The DEVS (Discrete-EVent system Specification) formalism was introduced by Zeigler in the 1970s [30, 32] for modeling discrete-event systems in a hierarchical and modular way. With DEVS, a model of a large system can be decomposed into smaller component models with couplings between them. The DEVS formalism defines two kinds of models: (i) atomic models, which represent the basic models providing specifications for the dynamics of a sub-system using transition functions; (ii) coupled models, which describe how to couple several component models (which can be atomic or coupled models) together to form a new model.

An atomic DEVS model can be considered as an automaton with a set of states and transition functions allowing the state to change whether or not an event occurs. When no event occurs, the state of the atomic model can be changed by an internal transition function noted δint. When an external event occurs, the atomic model can intercept it and change its state by applying an external transition function noted δext. The lifetime of a state is determined by a time advance function called ta. Each state change can produce an output message via an output function called λ. A simulator is associated with the DEVS formalism to exercise the instructions of coupled models and actually generate their behavior. The architecture of a DEVS simulation system is derived from the abstract simulator concepts associated with the hierarchical and modular DEVS formalism.
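As an illustration of these four functions, the sketch below is a bare-bones atomic model written in plain Python. It only mirrors the structure (δext, δint, ta, λ) described above; it is not tied to the DEVSimPy or PyPDEVS APIs, and the class, method and state names are illustrative assumptions.

from dataclasses import dataclass, field

INFINITY = float("inf")

@dataclass
class AtomicModel:
    """Bare-bones atomic DEVS skeleton: a state plus the four characteristic functions."""
    state: dict = field(default_factory=lambda: {"phase": "IDLE", "sigma": INFINITY})

    def delta_ext(self, elapsed, event):
        """External transition function (delta_ext): react to an incoming event."""
        self.state.update(phase="ACTIVE", sigma=0.0)     # schedule an immediate output

    def delta_int(self):
        """Internal transition function (delta_int): fired when ta() expires."""
        self.state.update(phase="IDLE", sigma=INFINITY)  # passivate

    def output(self):
        """Output function (lambda): invoked just before delta_int."""
        return {"out": self.state["phase"]}

    def ta(self):
        """Time advance function: lifetime of the current state."""
        return self.state["sigma"]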

The DEVSimPy (Python Simulator for DEVS models) [5] environment is a user-friendly interface for collaborative M&S of DEVS systems implemented in the Python language. The DEVSimPy project uses the Python programming language to provide a GUI (based on the wxPython graphic library) for the PyDEVS and PyPDEVS [25] (the Parallel DEVS implementation of PyDEVS) APIs. DEVSimPy has been set up to facilitate both the coupling and the re-usability of the PyDEVS classic DEVS models and the PyPDEVS Python Parallel DEVS models.

With DEVSimPy, DEVS models can be stored in a library to be reused and shared (area 1 in Fig. 2). A set of DEVS models constitutes a shared library because all models can be loaded or updated from an external location such as a file server (Dropbox, GoogleDrive, GitHub, etc.), which can also be considered as a kind of "on-line model store". With DEVSimPy, complex systems can be modeled by a coupling of DEVS models (area 2 in Fig. 2) and the simulation is performed in an automatic way.

Reinforcement Learning and DEVS Simulation

Systems that already use artificial intelligence (AI), such as digital supply chains, "smart factories" and other industrial processes in Industry 4.0, will inevitably have to include AI in their simulation models. For example, with simulation analysis systems, AI components can be directly integrated into the simulation model to allow for testing and forecasting. In [13], the authors use a recursive learning algorithm (Q-learning) to combine a dynamic load balancing algorithm and a bounded-window algorithm for the discrete simulation of VLSI (very-large-scale integration) circuits at gate level.

The application of AI to optimization is another opportunity for integration into simulation modeling [8]. Agent-based systems often have many parameters and require significant execution times to explore all their permutations to find the best configuration of the models. AI can accelerate this phase of configuration and provide more efficient optimization. In return, simulation can also accelerate the learning and configuration processes of AI algorithms. Indeed, in [24], comparative and concurrent DEVS simulation is used to test all possible configurations of the hyper-parameters (momentum, learning rate, etc.) of a neural network.

In the case of reinforcement learning, components can be developed to replace rule-based models. This is possible when considering human behavior and decision making. For example, in [7], the authors consider that, by observing a desired behavior in the form of outputs produced in response to inputs, an equivalent behavioral DEVS model can be constructed. These learning components can either be used in simulation models to reflect the real system, or be used to train AI components. By generating the data sets needed to train neural networks, simulation models can be a powerful tool for deploying reinforcement learning algorithms.

AI learning techniques have already been used in a DEVS simulation context. Indeed, in [20] the authors propose the integration of some predictive machine learning algorithms in the DEVS simulator to considerably reduce simulation execution times for many applications without compromising accuracy.

1  Initialize Q(s,a)
2  Repeat (for each episode):
3      Initialize s
4      Repeat (for each step of episode):
5          Choose a from s using policy derived from Q
6          Take action a, observe r and s'
7          Update Q (according to the equation 2)
8          s <- s'
9      until s is goal state

Fig. 1 Q-learning algorithm from [22]



More generally, the DEVS formalism can be used to facilitate the development of the traditional phases involved in the reinforcement learning process of a system, shown in Fig. 3:

1. The data analysis consists of an exploratory analysis of the data which makes it possible to determine the type of learning algorithm (supervised, unsupervised, by reinforcement) to use for a given decision problem. In addition, this phase also allows determining the state variables of the future learning model. This is one of the most important phases in the modeling process. As noted in Fig. 3, the SES (System Entity Structure) formalism [31] can be used to define a family of models of learning algorithms (DQN, DDQN, A3C, Q-learning, and others) [22] based on the results of the interpretation of the data statistics.

2. Simulation learning of an agent consists of simulating input sets of the learning model in order to calibrate it while avoiding over-learning (learning phase). The DEVS formalism makes it possible to simulate the environment as an atomic model. However, the environment interacting with the agent in a traditional learning-by-reinforcement scheme may also be considered as a coupled model composed of several interconnected atomic models. In that context, we may consider the environment as a dynamic multi-agent DEVS model in which the number of agents can vary over time.

3. Real-time simulation involves submitting the model to the actual input data (test phase). The DEVS formalism and its experimental framework are excellent candidates to simulate the decision policies from real simulation data.

Fig. 2 DEVSimPy general interface: (1) DEVS models libraries panel; (2) DEVS diagram composition panel

Fig. 3 Development phases involved in the traditional and DEVS-based approach in a learning process



Motivation and Contribution

Our motivation is to contribute to reducing the lack of transparency of black-box machine learning algorithms. To mitigate the risk of delegating decision-making processes to machines, discrete events offer the opportunity to keep under observation the interactions and the separation between the agent and the environment involved in a traditional reinforcement learning algorithm such as Q-learning.

The newly proposed approach allows the Q-learning algorithm to be formally separated into a DEVS Agent component that interacts with a dynamic DEVS Environment component. This modular approach permits total control of the Q-learning loop in order to drive the algorithm's convergence. Basically, an agent and an environment communicate so that the agent converges towards the best possible policy. Thanks to the modular and hierarchical aspects of DEVS, the separation between the agent and the environment within the RL algorithms is improved.

Our contribution is to propose a formalized way to drive an RL algorithm in a discrete-event system, as mentioned in item 2 of the list of traditional phases involved in a learning process (Fig. 3). The DEVS formalism has been selected due to its universal and unique aspects for discrete-event system models.

Basically, in an RL model, an agent and an environment communicate so that the agent converges towards the best possible policy (Fig. 4). Thanks to the modular and hierarchical aspects of DEVS, the separation between the agent and the environment within the RL algorithms is improved. In this paper, a new generic DEVS modeling of the Q-learning algorithm based on two atomic DEVS models is proposed: the Agent and the Environment DEVS models.

The Environment atomic model responds to the solicitations of the Agent atomic model by assigning it a new state s and a reward r according to the action it has received. In addition, the model indicates whether the final state is reached or not thanks to a Boolean variable d. It is, therefore, through an external transition function that this communication takes place. The output function is activated immediately after the external transition function to send the tuple (s, r, d) to the Agent model. The internal transition function updates the model's state variables without generating an output. Finally, the initial state determines the list of possible states S and actions A and the reward matrix R. The model is responsible for the generation of episodes (when a final state is reached).

The Agent atomic model is intended to respond to events coming from the Environment model. When it receives a new tuple (s, r, d), it returns an action following a policy that depends on the implemented algorithm (ε-greedy, for example). The model has a Q state variable, which is an S × A matrix used to implement the learning algorithm (Q-learning or SARSA [4], for example). When convergence is reached (depending on the values of Q), the model becomes passive and no longer responds to the environment. The update of the Q attribute is done in the external transition function after receiving the tuple (s, r, d). The internal transition function makes the model passive. The output function is activated when the external function is executed.
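A possible ε-greedy action selection for the Agent model is sketched below; it is a minimal numpy helper, assuming only that Q is the S × A matrix held by the Agent, and the function name and default generator are illustrative choices.

import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Return a random action with probability epsilon (exploration),
    otherwise the greedy action argmax_a Q(s, a) (exploitation)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s, :]))

# For example, such a helper could be called from the Agent's external
# transition function: a = epsilon_greedy(self.Q, s, epsilon=1.0)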

Figure 5 shows the Python code of the method in charge of updating the Bellman equation in the Q-learning algorithm implemented in the Agent class. Lines 19 and 20 implement the update of the mean of the matrix Q in order to control the convergence of the Q-learning algorithm. This update allows testing the convergence of the episode (which makes it possible to stop the loop in line 2 of Algorithm 1).

Fig. 4 Learning by reinforcement with the agent and environment DEVS models

1  def UpdateQLearning(self, s0, s1, a0, r):
2      """ 1. use any policy to estimate Q
3         Q(s0,a0) = Q(s0,a0) + alpha[r + gamma.maxQ(s1,:) - Q(s0,a0)]
4      2. Q directly approximates Q* (Bellman optimality eqn)
5      3. independent of the policy being followed
6      4. only requirement: keep updating each (s,a) pair
7
8      """
9
10     ### greedy action and temporal difference
11     ga = np.max(self.Q[s1,:])
12
13     td = r + self.y*ga - self.Q[s0,a0]
14
15     ## Q update
16     self.Q[s0,a0] += self.lr*td
17
18     ### Q mean update
19     self.Q_old_mean = self.Q_new_mean
20     self.Q_new_mean = self.Q.mean()

Fig. 5 Python function for the Bellman equation update in the Agent model


Case Study: Leverage Effects in Financial Asset Optimization Processes

In our case study, we consider the possibility of raising funds by borrowing, from local investment agencies, a capital ten times higher than the simulated Agent's internal financing (IF). The IF covers the risk of losing part of the borrowed capital. The leverage effect is related to the fact that the market volatility is applied to the borrowed capital, which gives hope of an increase of the IF that strengthens the company's investment capacity. The IF's gains or losses should be driven by the machine learning algorithm embedded in the Agent. The final leverage effect is estimated by calculating the difference between the evolution of the value of the initial portfolio (a basket of stock options) with no sell or buy operations and the total value of the portfolio driven by the machine learning algorithm embedded in the Agent.

We propose four scenarios. In simulation number 1, named initial cash = 0, the IF is 705$ and the leverage effect borrows a capital of 10 times the IF, that is 7053$, which is invested in an investment portfolio (equivalent to a state) of 1 to 3 stock indexes (index_a, index_b, index_c) among the main world indexes (CAC40, Nasdaq, DowJones). In the 3 other simulations, the Agent has the same amount in stocks as in simulation number 1 and the possibility to invest 3 different extra cash values. The goal is to have a portfolio holding a multiplicity, from 0 to N − 1, of each of these indexes, which allows obtaining at any moment the maximum possible value by adding together all the values of the N indexes (see Eq. 2):

max Σ_{i = a,b,c} n_i · index_i,  with n_i ∈ {0, …, N − 1}    (2)

Investing the IF in stock market indexes rather than in company-specific shares makes it possible to take into account the evolution of the environment of the financial markets. Indeed, the volatility of the indexes, which represent the trend of the best (or the average of all) shares of the market, reflects the behavior of the major agents who influence market trends and their environment (correction, bullish, bearish, etc.). It is important to note how much the environment (volatility of the indexes or new cash inflow to buy extra indexes) can have an impact on our simulation. In fact, the change in the value of an index or the availability of new cash will influence, for example, the number of states or their lifetime. Our simulation example deals with the policy to be followed in the case of a change in the environment related to the increase of the Agent's cash availability and the volatility of the indexes.

In an RL system, the Agent learns from rules. In our case, the rules are as follows:


• The set of Agent states is finite and it is calculated from Eq. 2;
• The Agent can take one action at a time, among 3 actions (buy, sell, keep), chosen on the basis of the indexes' volatility;
• The Agent uses a goal-reward representation [22] and gets a reward different from 0 only when it reaches the goal state corresponding to the maximum value of the investment portfolio;
• The Agent uses the single-goal approach, which considers only one goal state in the policy search.

Considering our case study, an MDP can be formalized as follows:

– the finite set of states S = {(s_0, …, s_k) | k ∈ ℕ, s_i ∈ [0, N − 1]}. The size of the state space is |S| = N^k, with N the multiplicity of the indexes and k the number of indexes; if N = 8 and k = 3, 8^3 = 512 possible states can be considered. Every state can be reached from every other state (a minimal enumeration sketch is given after this list).
– the non-empty set of goal states G ⊆ S.
– the finite set of actions A = {(a_0, …, a_k) | k ∈ ℕ, a_i ∈ {−1, 0, 1}}. The total number of actions is Σ_{s ∈ S} |A(s)| = 7. All actions are deterministic, and the epsilon-greedy algorithm is used to choose them with an exploitation–exploration approach.
– according to the goal-reward representation [22], where the agent is rewarded for entering a goal state but not rewarded or penalized otherwise, the reward function r : S × A → ℝ is defined as r(s, a) = 1 if s is a goal state, and r(s, a) = 0 otherwise.
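To make this state encoding concrete, the sketch below enumerates the portfolio states and their values for three indexes and N = 8. The index prices are placeholder values, and the goal-state selection (the affordable state of maximum portfolio value) is a simplified assumption of the case-study rules, not the exact definition used in the simulations.

from itertools import product

N = 8                                   # multiplicity of each index: 0 .. N-1
index_prices = {"IXIC": 360.2, "CAC40": 1508.0, "DJI": 2522.77}  # placeholder prices

# Enumerate the N**3 = 512 states: a state is a tuple of index multiplicities.
states = list(product(range(N), repeat=len(index_prices)))

def portfolio_value(state):
    """Value of a state = sum over the indexes of multiplicity * price (Eq. 2)."""
    return sum(n * p for n, p in zip(state, index_prices.values()))

def goal_states(budget):
    """Simplified goal selection: the affordable state(s) of maximum portfolio value."""
    affordable = [s for s in states if portfolio_value(s) <= budget]
    best = max(portfolio_value(s) for s in affordable)
    return [s for s in affordable if portfolio_value(s) == best]

def reward(state, goals):
    """Goal-reward representation: reward 1 only when a goal state is entered."""
    return 1.0 if tuple(state) in goals else 0.0

print(len(states))                # 512 possible states
print(goal_states(budget=17000))  # affordable state(s) of maximum value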

In our case, and due to the goal-reward approach, the convergence of the Q-learning algorithm can be obtained by tracking the Q matrix until a stable value is obtained. Line 2 of Algorithm 1 could be replaced by "Repeat until Q converges".
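A possible form of this stopping test, in the spirit of the Q-mean tracking of lines 19–20 of Fig. 5, is sketched below; the tolerance value is an illustrative assumption.

import numpy as np

def q_converged(Q, q_old_mean, tol=1e-8):
    """Convergence test: the episode loop ('Repeat until Q converges') can stop
    when the mean of the Q matrix no longer changes significantly."""
    q_new_mean = float(np.mean(Q))
    return abs(q_new_mean - q_old_mean) < tol, q_new_mean

# Possible use inside the episode loop:
# done, q_mean = q_converged(Q, q_mean)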

The goal is to implement the case study of an IF invested in 3 stock indexes, CAC40, DJI (Dow Jones Industrial Average) and IXIC (Nasdaq Composite), using the DEVSimPy M&S environment [5]. A DEVSimPy model is proposed using the Agent–Environment RL approach combined with a discrete-event Q-learning algorithm. Simulations have been performed to obtain optimal policies that are able to manage an IF invested in stock market indexes.

DEVSimPy Modeling

Figure 6 details the DEVSimPy model of the case study presented in this section. The model includes 5 atomic DEVS models:



• the CAC40, DJI, IXIC models: they are generator-type models that send on their outputs the index values collected during a period (stored in a csv file) or the real index values from the real stock market. When a control message is received on the input port, the new index value is obtained and triggered on the output.

• the Env model: it is an Environment component according to the Q-learning algorithm. When all the inputs are received, the possible state/action tuples are computed and the reward is set to 1 on the goal state. The model interacts with the Agent model through its port 0 to send the new state/reward tuple (line 4 in Algorithm 1) and through its port 1 to activate the Agent model for a new episode (line 2 in Algorithm 1).

• the Agent model: it is an Agent component according to the Q-learning algorithm. When a message is received on port 0, the agent updates the Bellman equation through its external transition function and sends to the Env model an action depending on the exploration/exploitation configuration defined in the model. When all the steps are computed and the goal is reached (line 9 in Algorithm 1), the Agent gives a best possible policy and sends a message on its port 1 to the index generator models.

As depicted in Fig. 6, the DEVS model presents two loops. The first one is the traditional communication between the Agent and the Env models according to the Q-learning algorithm (line 4 in Algorithm 1). The second, less traditional one makes it possible to consider the evolution of the indexes in a real stock market (line 2 in Algorithm 1). The DEVS generator models (CAC40, DJI and IXIC in Fig. 6) are connected to output port 1 of the Agent model to trigger a new simulation run (new episode) when the Env model generates a policy that solves the current simulation.

The next section is dedicated to the Q-learning DEVSimPy simulations of this case study model. Two cases are considered: the single episode case, where the previous Q-learning algorithm is executed only once, and the multiple episode case, where the Q-learning is updated depending on the evolution of the indexes.

DEVSimPy Simulations

This section presents two kinds of simulation scheme. In the single episode case, only one set of indexes is simulated and only one best possible policy is obtained at the end of the simulation. In the multiple episode case, indexes from 02/01/1991 to 05/07/2018 are simulated and all the optimal policies are stored and analyzed to validate our approach.

Single Episode Case In this paragraph, the model of Fig. 6 has been simulated with the following settings:

• IXIC model index/volatility values: 360.200012∕−0.01906318.

• CAC40 model index/volatility values: 1508.0∕−0.025210084.

• DJI model index/volatility values: 2522.77002∕−0.016881741.

• For the Env model: cash = 10000$; M = 1 ; N = 8 ; init state = (1 IXIC, 1 CAC40, 2 DJI).

• For the Agent model: α = 0.8; ε = 1; discount factor γ = 0.95.

The end of the simulation is obtained with the convergence of the matrix Q (Fig. 7).

Fig. 6 DEVSimPy model


The simulation results give the path to reach the optimal goal state (0 IXIC, 0 CAC40, 6 DJI) among 250 possible states and 7 possible actions:

(1, 1, 2) ⟶ (1, 0, 2) ⟶ (1, 0, 3) ⟶ (1, 0, 4) ⟶ (1, 0, 5) ⟶ (0, 0, 5) ⟶ (0, 0, 6) ⟶ wait

Due to the Q-learning algorithm, the proposed path ensures that the goal state is reached with a minimal number of transitions (ε = 1) and no loss of cash.

Multiple Episode Case In this paragraph, the model of Fig. 6 has been simulated with the following general settings:

• IXIC, CAC40, DJI models index/volatility values from 02/01/1991 to 05/07/2018 (Fig. 8).

• For the Env model: M = 1 ; N = 8 ; init state = (1 IXIC, 1 CAC40, 2 DJI).

• For the Agent model: α = 0.8; ε = 1; discount factor γ = 0.95.


For all values of each index, the Agent gives a best possible policy to obtain a leverage effect depending on the initial cash and the initial state. Scenarios with different values of initial cash have been simulated and analyzed with a new algorithm that consists in determining, for every state/action tuple, the maximum of the Q value during the evolution of the indexes.
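This post-processing step could look like the following sketch, which computes, for every state/action tuple, the maximum Q value and the day at which it is reached; the random array is only a stand-in for the Q matrices stored during the simulations.

import numpy as np

# q_history stores one Q matrix per simulated day/episode, shape (T, |S|, |A|);
# random data is used here as a stand-in for the stored simulation results.
rng = np.random.default_rng(0)
q_history = rng.random((100, 512, 7))

best_q = q_history.max(axis=0)       # maximum Q value reached by each (state, action) tuple
best_day = q_history.argmax(axis=0)  # day/episode at which that maximum is reached

print(best_q.shape, best_day.shape)  # (512, 7) (512, 7)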

The following scenarios are defined with an initial cash equal to 0$, 8000$, 16,000$ and 24,000$, added to the previous general settings.

Figure 9 shows the size of the episodes and the number of steps during the simulations.

One step is equivalent to a message exchanged between the ML Agent and the ML Environment. One episode is composed of the steps needed to find a best possible policy. As shown in Fig. 9, the higher the initial cash amount, the shorter the episodes (and the number of steps). It seems correct that the state (7,7,7) is reached sooner with a higher initial cash amount. The sizes of the episodes are ordered according to the cash availability. Between days 0 and 2000, concerning the number of steps, it seems logical that there is a correspondence between the number of steps and the number of states generated by the amount of cash. In the specific case of an initial cash of 0, the growth of the number of steps to the same level as the others proves that there is a real growth in the value of the state from day 4000 to the last day of the simulation.

Figure 10 depicts the variation of the cash during the simulations.

The simulation validates that our ML Agent behaves correctly with respect to the indexes' volatility and values, as shown in Fig. 11, which also validates that the ML Agent respects the learning rules: indeed, even when the value of all indexes is dropping, the ML Agent keeps trying to reach the best possible state and, at the same time, to invest the maximum quantity of available cash.

Figure 11 also validates that the cash is not a limiting factor to get the best investment policy. Figure 12 clearly validates that the ML Agent correctly invests the cash to try to reach the best investment policy as soon as it can: indeed, the time needed to bring the residual cash close to 0 is shorter with a higher initial cash, and the three other scenarios respect the same order. Figure 12 also validates that the ML Agent achieves a real leverage effect even with the lowest initial cash: indeed, if we compare the evolution over the period of the value of the initial investment portfolio (1,1,2) and the value of the portfolio driven by the ML Agent, the leverage effect is three times higher, and the four scenarios reach this value correctly following the residual cash order. Figure 13 shows that the indexes' multiplicity is reached more rapidly in the initial cash 24,000 scenario, which confirms that the ML Agent behaves so as to reach the best policy in the shortest time, and the three other scenarios respect the rule. Figure 9 validates that the ML Agent behaves so as to maximize the value of the portfolio: indeed, during the simulation period, there is only one best time to take an action for one specific state.

Fig. 7 Mean of the Q matrix in the Agent model for the single simulation case

Fig. 8 Stock market indexes



Related Work

In the literature, a number of ML models have been developed to predict different kinds of risks in a wide variety of financial areas. For instance, since the 1990s, machine-learning techniques have been studied as tools for bankruptcy prediction and credit score modeling [12]. More specifically, in the area of stock market index behaviors, the authors of [9] investigated in 2005 the predictability of the NIKKEI 225, the Japanese stock market index, and evaluated the forecasting ability of the support vector machine (SVM), a very specific type of ML algorithm. Multiple-kernel learning, which refers to a set of machine learning methods, is also frequently used to predict market behavior. In [28], the authors develop a two-stage multiple-kernel learning algorithm to forecast the Taiwan Capitalization Weighted Stock Index. Another forecasting model, based on support vector regression (SVR), chaotic mapping and the firefly algorithm, was proposed in [10] to predict the stock market prices of some Nasdaq companies. Other authors focus their work on the comparison of machine learning techniques such as artificial neural networks (ANN) and SVM applied to two stock price indices, CNX Nifty and S&P Bombay Stock Exchange (BSE) Sensex [16]. Kamran Raza built a model that tries to predict the performance of the KSE-100 index; the results of this work, published in 2017, suggested that the behavior of the market can be predicted using machine learning techniques, in particular with the multi-layer perceptron (MLP), which seemed to be the best model. John M. Mulvey [14] focuses on the difficulty of evaluating the quality of ML market forecasting and on the need for measurements of correctness in supervised learning, due to the levels of uncertainty and the time lags in strategic decisions that complicate the search for ML breakthroughs; it can be difficult to evaluate the quality of the recommendations. However, in the literature, there is no related work that combines a Q-learning technique based on a Markovian approach within the DEVS formalism and tries to demonstrate that an ML Agent can achieve a better financial result than the stock exchange index value (the average value of the market) without necessarily making forecasts.

Fig. 9 Size of the episodes (left) and number of steps (right) during the period, ordered by the initial cash scenario (from top to bottom: 0, 8000, 16000, 24000)

Fig. 10 Total assets (stock + cash) during the simulation


Fig. 11 Residual cash during the simulation for the four scenarios

Fig. 12 Leverage effect (with init cash from top to bottom: 0, 8000, 16000, 24000)



Conclusions and Future Works

This paper presents the benefits of the DEVS formalism in assisting the realization of RL models, more specifically for the Q-learning algorithm. The modeling was carried out using a Markov decision-making process specified with DEVS and solved via simulation based on reinforcement learning, using a Q-learning algorithm that integrates the notion of internal financing (cash) over time.

The continuation of our work will have to integrate other factors constituting the environment that directly influence the construction of an optimization policy for the internal financing, such as, for example, the transaction costs and the behavior of the other agents.

In that sense, the major advantage of the use of DEVS lies in the possibility of building a system composed of a coupled model specifying the environment, connected to an atomic model representing the behavior of the agent. In addition, DEVS should offer us the opportunity to simulate the frequency of decision making, allowing our agent to be faster than a human operator, who makes decisions based on intuition and tags, and smarter than high-frequency trading algorithms.

As a perspective, in order to deploy the tool for the management of the sixty major global stock indexes, we will explore two different tool development tracks and compare the results. These approaches should allow us both to estimate rewards upstream and to respond to the increase in the number of states generated by taking into account new factors that make up the environment. On the one hand, we will couple to our reinforcement learning model a second, supervised learning model and a third, unsupervised one. The reinforcement learning would then be used as the learning phase for the two following models. This combination should drastically reduce the learning and simulation times of the reinforcement learning model. All the results will then be saved to create the labels and classes for the test phase, which will be performed with the other two learning models. On the other hand, we plan to add to Q-learning the estimation of the Bellman equation by a neural network (deep Q-networks). Indeed, the Q-learning algorithm, based on the management of a table, reaches its limits when the number of states increases. The use of neural networks makes it possible to have an approximator of the variable Q and to consider a much larger number of possible states.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Fig. 13 Indexes multiplicity (IXIC on the left, CAC40 in the middle and DJI on the right) during the simulation for the four scenarios depending on the init cash (from top to bottom: 0, 8000, 16000, 24000)


References

1. Barbieri E, Capocchi L, Santucci J. DEVS modeling and simulation based on Markov decision process of financial leverage effect in the EU development programs. In: 2017 Winter Simulation Conference (WSC), 2017; pp 4558–4560. https://doi.org/10.1109/WSC.2017.8248203

2. Zeigler BP, Praehofer H, Kim TG. Theory of modeling and simulation. 2nd ed. Cambridge: Academic Press; 2000.

3. Bertsekas DP. Dynamic programming: deterministic and stochastic models. Upper Saddle River: Prentice-Hall Inc; 1987.

4. Bromuri S. Dynamic heuristic acceleration of linearly approximated SARSA(λ): using ant colony optimization to learn heuristics dynamically. J Heuristics. 2019;25(6):901–32. https://doi.org/10.1007/s10732-019-09408-x

5. Capocchi L, Santucci JF, Poggi B, Nicolai C. DEVSimPy: a collaborative Python software for modeling and simulation of DEVS systems. In: Proceedings of 20th IEEE international workshops on enabling technologies, 2011; pp 170–175. https://doi.org/10.1109/WETICE.2011.31

6. Even-Dar E, Mansour Y. Learning rates for Q-learning. J Mach Learn Res. 2004;5:1–25. http://dl.acm.org/citation.cfm?id=1005332.1005333

7. Floyd MW, Wainer GA. Creation of DEVS models using imitation learning. In: Proceedings of the 2010 Summer Computer Simulation Conference, Society for Computer Simulation International, San Diego, CA, USA, SCSC '10, 2010; pp 334–341. http://dl.acm.org/citation.cfm?id=1999416.1999459

8. Hayashi S, Prasasti N, Kanamori K, Ohwada H. Improving behavior prediction accuracy by using machine learning for agent-based simulation. In: Nguyen NT, Trawiński B, Fujita H, Hong TP, editors. Intelligent information and database systems. Heidelberg: Springer; 2016. p. 280–9.

9. Huang W, Nakamori Y, Wang SY. Forecasting stock market movement direction with support vector machine. Comput Oper Res. 2005;32(10):2513–22. https://doi.org/10.1016/j.cor.2004.03.016. http://www.sciencedirect.com/science/article/pii/S0305054804000681

10. Kazem A, Sharifi E, Hussain FK, Saberi M, Hussain OK. Support vector regression with chaos-based firefly algorithm for stock market price forecasting. Appl Soft Comput. 2013;13(2):947–58. https://doi.org/10.1016/j.asoc.2012.09.024. http://www.sciencedirect.com/science/article/pii/S1568494612004449

11. Lake BM, Ullman TD, Tenenbaum JB, Gershman SJ. Building machines that learn and think like people. 2016; CoRR arXiv:1604.00289

12. Lin W, Hu Y, Tsai C. Machine learning in financial crisis prediction: a survey. IEEE Trans Syst Man Cybern Part C (Appl Rev). 2012;42(4):421–36. https://doi.org/10.1109/TSMCC.2011.2170420

13. Meraji S, Tropper C. A machine learning approach for optimizing parallel logic simulation. In: 2010 39th International Conference on Parallel Processing, 2010; pp 545–554. https://doi.org/10.1109/ICPP.2010.62

14. Mulvey JM. Machine learning and financial planning. IEEE Potentials. 2017;36(6):8–13. https://doi.org/10.1109/MPOT.2017.2737200

15. Nielsen NR. Application of artificial intelligence techniques to simulation. New York: Springer; 1991. pp 1–19. https://doi.org/10.1007/978-1-4612-3040-3_1

16. Patel J, Shah S, Thakkar P, Kotecha K. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Exp Syst Appl. 2015;42(1):259–68. https://doi.org/10.1016/j.eswa.2014.07.040. http://www.sciencedirect.com/science/article/pii/S0957417414004473

17. Puterman ML. Markov decision processes: discrete stochastic dynamic programming. 1st ed. New York: Wiley; 1994.

18. Rachelson E, Quesnel G, Garcia F, Fabiani P. A simulation-based approach for solving generalized semi-Markov decision processes. In: Proceedings of the 2008 Conference on ECAI 2008: 18th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2008; pp 583–587. http://dl.acm.org/citation.cfm?id=1567281.1567408

19. Russell S, Norvig P. Artificial intelligence: a modern approach. 3rd ed. Upper Saddle River: Prentice Hall Press; 2009.

20. Saadawi H, Wainer G, Pliego G. DEVS execution acceleration with machine learning. In: 2016 Symposium on Theory of Modeling and Simulation (TMS-DEVS), 2016; pp 1–6. https://doi.org/10.23919/TMS.2016.7918816

21. Seo C, Zeigler BP, Kim D. DEVS Markov modeling and simulation: formal definition and implementation. In: Proceedings of the Theory of Modeling and Simulation Symposium, Society for Computer Simulation International, San Diego, TMS '18, 2018; pp 1:1–1:12. http://dl.acm.org/citation.cfm?id=3213187.3213188

22. Sutton RS, Barto AG. Introduction to reinforcement learning. 1st ed. Cambridge: MIT Press; 1998.

23. Toma S. Detection and identification methodology for multiple faults in complex systems using discrete-events and neural networks: applied to the wind turbines diagnosis. Thesis, University of Corsica, 2014a. https://hal.archives-ouvertes.fr/tel-01141844

24. Toma S. Detection methodology and identification of multiple faults in complex systems from discrete events and neural networks: applications for wind turbines. Thesis, Université Pascal Paoli, 2014b. https://tel.archives-ouvertes.fr/tel-01127073

25. Van Tendeloo Y, Vangheluwe H. The modular architecture of the Python(P)DEVS simulation kernel (WIP). In: Proceedings of the Symposium on Theory of Modeling & Simulation - DEVS Integrative, Society for Computer Simulation International, San Diego, DEVS '14, 2014; pp 14:1–14:6. http://dl.acm.org/citation.cfm?id=2665008.2665022

26. Wallis L, Paich M. Integrating artificial intelligence with AnyLogic simulation. In: 2017 Winter Simulation Conference (WSC), 2017; pp 4449–4449. https://doi.org/10.1109/WSC.2017.8248156

27. Watkins CJCH. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.

28. Yeh CY, Huang CW, Lee SJ. A multiple-kernel support vector regression approach for stock market price forecasting. Exp Syst Appl. 2011;38(3):2177–86. https://doi.org/10.1016/j.eswa.2010.08.004. http://www.sciencedirect.com/science/article/pii/S0957417410007876

29. Yu H, Mahmood AR, Sutton RS. On generalized Bellman equations and temporal-difference learning. In: Advances in Artificial Intelligence - 30th Canadian Conference on Artificial Intelligence, Canadian AI 2017, Edmonton, AB, Canada, May 16–19, 2017, Proceedings, pp 3–14. https://doi.org/10.1007/978-3-319-57351-9_1

30. Zeigler BP. Theory of modeling and simulation. Cambridge: Academic Press; 1976.

31. Zeigler BP, Sarjoughian HS. System entity structure basics. In: Guide to modeling and simulation of systems of systems, simulation foundations, methods and applications. London: Springer; 2013. p. 27–37.

32. Zeigler BP, Muzy A, Kofman E. Theory of modeling and simulation. 3rd ed. New York: Academic Press; 2019. https://doi.org/10.1016/B978-0-12-813370-5.00003-1

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.