

Master Thesis in Statistics and Data Mining

Reinforcement Learning for 5G handover

by

Maxime Bonneau

2017-06

Department of Computer and Information Science

Division of Statistics

Linköping University, SE-581 83 Linköping, Sweden


Supervisors:

Jose M. Peña (LiU)

Joel Berglund (Ericsson)

Henrik Rydén (Ericsson)

Examiner:

Oleg Sysoev (LiU)


Linköping University Electronic Press


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Maxime Bonneau


Contents

Contents
List of Figures
List of Tables
Abstract
Acknowledgments

1 Introduction
  1.1 Background
  1.2 Objective

2 Data
  2.1 Data sources
  2.2 Data description

3 Method
  3.1 Related works
  3.2 Q-learning algorithm
  3.3 Combining reinforcement learning and artificial neural networks
  3.4 Visualisation
  3.5 Settings

4 Results
  4.1 Time considerations
  4.2 Proof of learning
  4.3 Convergence
  4.4 Penalising handover
  4.5 Two actions
  4.6 Heatmaps
  4.7 Contextual information
  4.8 Artificial neural networks and reinforcement learning

5 Discussion
  5.1 The work in a wider context
  5.2 General observations

6 Conclusion

7 Appendix
  7.1 Reducing the amount of data used for learning
  7.2 Parameters of the different algorithms
  7.3 Results tables
  7.4 Verification of the consistency

Bibliography

List of Figures

1.1 Simplified diagram of the LTE-A network
1.2 Illustration of a handover
1.3 Illustration of interaction between agent and environment in RL
2.1 Map of the simulated network
3.1 Most commonly used activation functions for hidden layers
3.2 Simple ANN with one hidden layer of two nodes
3.3 First ANN approximating action-value function
3.4 Second ANN approximating action-value function
3.5 Expected shape of the cumulative sum of the reward
3.6 Expected shape of the evolution of the mean reward
3.7 Plot to visualise the efficiency of the learning
3.8 Barplots in complement of the graphs in 3.7
3.9 The column on the right shows the action selected
3.10 Two different UEs with identical start and finish states
3.11 Directions simplification for a UE
3.12 Two different UEs with different start and finish points but similar directions
4.1 Histogram of the time needed to run ten Q-Learning algorithms
4.2 Cumulative sum of the reward
4.3 Mean reward per epoch while learning with the first five UEs
4.4 Evolution of the mean squared error through the number of iterations
4.5 Test with a penalty factor of 25
4.6 Test with a penalty factor of 50
4.7 Q-learning algorithm using two actions
4.8 Heatmap representation of a Q-table
4.9 Heatmap representation of the number of visits
4.10 Mean reward for each state-action pair
4.11 Standard deviation of the reward for each action-state pair
4.12 Testing indoor UEs
4.13 Testing car UEs
4.14 Testing walking UEs
4.15 Learning with 1 UE
4.16 Learning with 5 UEs
4.17 Learning with 10 UEs
4.18 Precision of 10
4.19 Precision of 1
4.20 No feature added for learning
4.21 Direction used for learning
4.22 Angle between UE and BS used for learning
4.23 ANNs Q-Learning testing
4.24 log-likelihood of the ANN
4.25 Heatmap of the Q-table after learning with the first three car UEs
4.26 Some output from the ANN trained with the first three car UEs
7.1 Distribution of the mean RSRP difference (walking UEs)
7.2 Distribution of the number of handovers (walking UEs)
7.3 Distribution of the mean RSRP difference (car UEs)
7.4 Distribution of the number of handovers (car UEs)
7.5 Testing on walking UEs
7.6 Testing on car UEs
7.7 Testing indoor UEs
7.8 Testing car UEs
7.9 Testing walking UEs
7.10 Learning with 1 UE
7.11 Learning with 5 UEs
7.12 Learning with 10 UEs
7.13 Precision of 10
7.14 Precision of 1
7.15 No feature added for learning
7.16 Direction used for learning
7.17 Angle between UE and BS used for learning

List of Tables

2.1 Longitude of the first UE
2.2 RSRPs for the first UE
7.1 Table based Q-learning algorithm parameters
7.2 Parameters of Q-learning algorithm using ANNs
7.3 Results for each category with contextual information
7.4 Results while learning with different numbers of UEs
7.5 Results for the seventh to the eleventh UE when learning with 10 UEs
7.6 Results with different precisions for rounding the RSRP
7.7 Results when adding new features

Abbreviations and definitions

Abbreviation   Meaning
3GPP           Third Generation Partnership Project
ANN            Artificial Neural Network
BS             Base Station
CN             Core Network
eNB            eNodeB
E-UTRAN        Evolved UMTS Terrestrial Radio-Access Network
logL           log-likelihood
LTE-A          Long Term Evolution Advanced
MDP            Markov Decision Process
MME            Mobility Management Entity
MSE            Mean Squared Error
P-GW           Packet data network GateWay
RL             Reinforcement Learning
RSRP           Reference Signal Received Power
S-GW           Serving GateWay
UE             User Equipment

Term          Definition

Epoch         An epoch is all the time-steps from the start at a random state to the goal state. An epoch is thus composed of a succession of state-action-reward-state tuples.

Epoch index   The learning process being a succession of epochs, the epoch index is the index given to an epoch according to its chronological order.

Iteration     An iteration is one step of the learning process. It consists of a succession of epochs, one for each UE used for learning. To learn properly, the agent should repeat a certain number of iterations.


Abstract

The development of the 5G network is in progress, and one part of the process that needs to be optimised is the handover. This operation, which consists of changing the base station (BS) providing data to a user equipment (UE), needs to be efficient enough to be a seamless operation. From the BS point of view, this operation should be as economical as possible, while satisfying the UE's needs. In this thesis, the problem of 5G handover has been addressed, and the chosen tool to solve this problem is reinforcement learning. A review of the different methods proposed by reinforcement learning led to the restricted field of model-free, off-policy methods, more specifically the Q-learning algorithm. In its basic form, and used with simulated data, this method provides information on which kind of reward and which kinds of action-space and state-space produce good results. However, despite working on some restricted datasets, this algorithm does not scale well due to lengthy computation times. This means that the trained agent cannot use a lot of data for its learning process, and neither the state-space nor the action-space can be extended much, restricting the use of the basic Q-learning algorithm to discrete variables. Since the strength of the signal (RSRP), which is of high interest for matching the UE's needs, is a continuous variable, a continuous form of Q-learning needs to be used. A function approximation method is then investigated, namely artificial neural networks. In addition to the lengthy computation time, the results obtained are not yet convincing. Thus, despite some interesting results obtained from the basic form of the Q-learning algorithm, the extension to the continuous case has not been successful. Moreover, the computation times make reinforcement learning applicable in our domain only on really powerful computers.


Acknowledgments

First, I would like to thank Ericsson AB, especially LinLab, for giving me the chance to work with them, and for providing me with an interesting thesis topic, along with relevant data and a perfect working frame and atmosphere.

Specifically, I would like to express my gratitude to Henrik Rydén and Joel Berglund, my supervisors at Ericsson, who have always been ready and eager to give me advice and answer my questions. Thanks also for teaching me how to lose at table hockey.

In the same vein, I would like to thank Jose M. Peña, my supervisor at Linköping University, for his opinions and thoughts on my work, and also for his time.

Thanks also to Andrea Bruzzone, my opponent, for his careful and thorough revision work.

I would also like to thank my family, who, despite the distance, has kept my motivation and my will at a high level by providing continuous encouragement.

A final thanks goes to the whole class of the master in Statistics and Data Mining for the friendly and good atmosphere during these two years, especially to Carro, who followed me to Ericsson in order to push me to do my best.


1 Introduction

This chapter gives a first introduction to telecommunication systems and reinforcement learning. Then the objective of this thesis is stated.

1.1 Background

Telecommunication systems

For the purpose of this master thesis, some insight into telecommunications is preferable. A complete description of the network is not necessary, but an overview of telecommunication systems and a description of the part of the network on the customer side might help the understanding.

The biggest companies in the telecommunications domain have pushed, step by step, the development of wireless networks from the second generation, also known as 2G, to the latest 4G generation, called LTE-A (Long Term Evolution Advanced). As explained by Dahlman et al. in [5], this network is mainly composed of the Core Network (CN) and the Evolved UMTS Terrestrial Radio-Access Network (E-UTRAN) (see Figure 1.1). The CN makes the link between the Internet and the E-UTRAN. It is composed of several nodes, e.g. the Mobility Management Entity (MME), the Serving Gateway (S-GW), or the Packet Data Network Gateway (P-GW). The MME constitutes the control-plane node: it manages security keys, checks whether a User Equipment (UE), i.e. a device able to communicate with the LTE-A network, can access the network, and establishes connections. The S-GW node is the user-plane node. There are several Base Stations (BSs) broadcasting information to the UEs, and the role of the S-GW node is to act as a mobility anchor when UEs move between these BSs. This node also collects information and statistics necessary for charging. The P-GW node is the one directly connected to the Internet, relaying traffic and providing IP addresses.


Figure 1.1: Simplified diagram of the LTE-A network

There is only one type of node used by the LTE-A radio-access network, and it is called eNodeB (eNB). An eNB is thus a logical node linked directly and wirelessly, through a beam, to a UE. A beam is a signal transmitted along a specific course, used to carry the data from the eNB to the UE. The connection between BS and UE is established after agreement from both parties. The transmission between a UE and a BS is called uplink when the UE communicates information to an eNB, and the opposite communication, from an eNB to a UE, is called downlink. On the other side, an eNB is linked to the CN: to the MME node by means of the S1 control-plane part, and to the S-GW node by means of the S1 user-plane part. Among other possible implementations, the eNB is commonly implemented as a three-sector site, each sector spreading several beams. An eNB can be implemented as, but is not the same as, a BS. Despite this, eNB and BS will be treated as equivalent in this thesis and called BS, or node. This approximation in language and technical precision does not change the results obtained later.

An important network procedure that needs to be described here is the handover. It should be noted that the following explanation is a simplification of the real process. While a UE can stand in the field of several beams at the same time, only one of these beams provides data to the UE, and it is called the serving beam. The strength of the signal received by a UE is called Reference Signal Received Power (RSRP) and is measured in decibel-milliwatts (dBm). While the UE is moving, the RSRP can fluctuate. When the RSRP becomes so low that the UE no longer gets a satisfactory connection, another beam should replace the serving beam in order to get a better RSRP. The handover is the operation consisting of switching the UE's serving beam from one BS to another (see Figure 1.2). It should be noted that a handover is a costly operation, which is why it should be performed efficiently. In fact, performing a handover is not only about switching the serving beam, but also about finding the best possible beam to switch to.

Only one study has so far broached the problem of optimising handovers for the 5G network. Ekman [7] used supervised learning (more specifically, random forest) to determine how many candidate beams should be activated in order to find the best one to switch to. The result is that around 20 candidate beams are required for the best beam to be selected in 90% of the samples. This thesis is on the same topic, but will not try to answer the same question.


Figure 1.2: Illustration of a handover

5G requirements

The Third Generation Partnership Project (3GPP) is a collaboration between groups of telecommunications associations. Its main scope is to produce globally applicable specifications for all the generations of mobile networks starting from the 3G system.

The main requirements for the 5G system are the following [14]:

• Higher data rate

• Latency in the order of 1ms

• High reliability

• Low-cost devices with very long battery life

• Network energy efficiency

Telecommunication companies need to respect these constraints while developing their network. Nevertheless, it may be complicated to satisfy all these requirements at the same time. A trade-off may be necessary among these requirements, since, for example, it is so far difficult to offer an extremely high data rate together with perfect reliability at a low cost [18].

These requirements can be reached using a lean design policy, since it gives companies more freedom in the way they develop and manage their network, so they can produce more efficient products. This matters, for instance, when one knows that, despite being in idle mode most of the time, a BS is always fully supplied with power [8].

Reinforcement learning

Reinforcement learning (RL) is a subfield of machine learning, alongside supervised and unsupervised learning. Gaskett et al. described reinforcement learning in [9] as: "Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data."

As explained by Sutton in [23], the basic idea of RL is to let an agent learn its environment through a trial-and-error process. At each state of its learning, the agent must choose an action, which is followed by a reward (see Figure 1.3). This reward may be positive, negative, or zero. Depending on the received reward, the agent learns whether taking this action in the current state is advantageous or not. It is therefore essential that the rewards are properly defined, or the agent could learn to behave badly. Taking an action leads the agent to a new state, which may be the same as the previous one. The agent repeats this


process of state-action steps until it reaches a goal state. It can happen that there is no goal state; in this case the agent stops learning after a certain number of steps, defined by the user.

An epoch is the process between a randomly chosen first state and the goal state. In order to learn, the agent may need to repeat a quite massive number of epochs, depending on the size of the environment. The aim of this process is that the agent visits all the possible states in the environment and takes all possible actions at each state several times, so that it eventually knows how to achieve a goal in the best possible way.

Figure 1.3: Illustration of interaction between agent and environment in RL [20]

Since its beginnings, RL has evolved considerably and has been exploited in many fields. Here are some examples of RL applications, among many others:

• Controlling a hovering helicopter [13]

• Playing Atari games and beating human champions [12]

• Swinging up and balancing a pole [16]

• Training robot soccer teams [21]

1.2 Objective

As mentioned earlier, a handover is a costly operation. This is why the handover procedure needs to be improved, both from the UE and the BS point of view, and, furthermore, the number of handovers should be reduced as much as possible. Concretely, the objective is to optimise the trade-off between the signal quality, the number of measurements needed to find a better beam, and the number of handovers. This master thesis proposes a new approach to this problem, using machine learning, and more specifically RL. The questions that will be addressed in this thesis are the following:

• What kind of RL algorithm could address the 5G handover problem?

• Which features are needed to make this algorithm as efficient as possible?


• Is RL a good method in practice to find the optimal trade-off between signal quality, number of measurements and number of handovers?

The structure of this thesis is the following: Chapter 2 provides details on the data used for this work. Chapter 3 presents the methods used, along with the necessary theory. Chapter 4 then presents the main results. Chapter 5 discusses the outputs of the previous chapters. Chapter 6 draws conclusions from this work. Finally, Appendix 7 provides additional results and clarifications.


2 Data

In this chapter, the origin of the data is described, followed by a description of the available data.

2.1 Data sources

While developing a new network, it is not possible to collect real data, since the hardware is not deployed. To compensate, Ericsson has developed a network simulator. BSs are deployed in a city model consisting of streets and buildings (see Figure 2.1) to simulate a network. For this simulation, each BS is divided into three sectors, each of them transmitting eight beams. UEs are simulated over a period of 60 seconds. At the beginning of the simulation, they are placed at a start point and then move on the map. Some of them move along the streets, possibly in a vehicle, while others move inside buildings, in order to model realistic UE movements. At each time step, the simulator provides the RSRP values spread from all the beams to all the UEs.

2.2 Data description

The simulation results contain the position of the UEs (see Table 2.1) in three-dimensional space (latitude, longitude, altitude), the position of the BSs in the same reference space, and the RSRP received by each UE from each beam. It should be noted that the latitude, longitude and altitude are expressed in the reference frame of the map, not in usual units like degrees. The values in Tables 2.1 and 2.2 have been rounded to three decimals for better readability.

Time (s)    0.5     0.6     0.7     0.8     0.9     ...   60
Longitude   90.356  90.424  90.495  90.566  90.636  ...   53.28

Table 2.1: Longitude of the first UE

It must be noted that the UE position provided by the simulator is the exact location. In the real world, however, it is impossible to have perfect information about the position of a UE, because of the precision of the measurement tools (e.g. Global Navigation Satellite System or radio positioning).


Figure 2.1: Map of the simulated network. The map represents streets (white), buildings (light grey), and open areas between buildings (dark grey). The green and red points represent sectors; each group of three sectors is a BS

All the RSRP values (expressed in dBm) are measured every 0.1 second in a time frame of 59.5 seconds, from 0.5 seconds after the beginning of the simulation to one minute after (see Table 2.2). This makes a total of 596 measurements per beam and per UE during the simulation. There are 14 BSs, all of them steering 24 beams, which makes 336 values of RSRP per UE per time step. While many more simulated UEs are available, only 600 have been loaded to start with. Later, these 600 UEs will be referred to as UE1, UE2, ..., UE600, or first UE, second UE, etc. To refer to several UEs, "the first three UEs" means UE1, UE2 and UE3, for example. This order is neither a preference order nor a choice; it is simply the order in which they were simulated.

               Time (s)
Beam (index)   0.5        0.6        0.7        0.8        0.9        ...   60
1              -84.738    -85.235    -85.869    -85.938    -83.753    ...   -80.119
2              -85.844    -86.797    -89.941    -89.816    -91.170    ...   -65.808
3              -98.353    -97.127    -97.082    -96.866    -93.078    ...   -63.796
...            ...        ...        ...        ...        ...        ...   ...
336            -149.320   -146.096   -151.465   -149.321   -149.291   ...   -123.636

Table 2.2: RSRPs for the first UE

From all the simulated data presented in this chapter, other information can be extracted, such as the direction and the speed of a UE, the distance between a UE and a BS, the time since the last measurement was performed, etc.

It can be noted here that, for a given immobile UE receiving a signal from a given BS, the RSRP can change slightly over time. The reason is that interference can modify how the beam propagates, hence a change of RSRP over time.


3 Method

First, this chapter introduces some related works that have inspired this thesis. It then presents the methods used, starting with the Q-learning algorithm. When needed, the theory necessary to understand a method is explained.

3.1 Related works

The three following paragraphs present results published by researchers or groups of researchers working under conditions close to the ones faced in this thesis. These examples have been selected because they present very good results and inspiring methods, or because they are among the rare ones to broach their topic, especially the work of Csáji.

Gaskett et al. [9] did pioneering work on extending the Q-learning algorithm to continuous state and action spaces. They could make a submersible vehicle find its way to a random point, while the controller of the vehicle did not know how to control it or what the goal was. To do this, they used feedforward artificial neural networks combined with wire fitting in order to approximate the action-value function.

Mnih et al. [12] were able to train a single agent to play seven Atari 2600 games using deep RL. They used batch updates of the Q-learning algorithm combined with approximation through deep neural networks to train their agent to play these Atari games without knowing the rules, simply by looking at the pixels of the screen. The results are impressive: the method outperforms the comparative methods on six out of seven games, and achieves better scores than a human expert on three out of seven games.

Csáji showed in his Ph.D. thesis [4] that, using (ε, δ)-MDPs, the Q-learning algorithm can perform well in varying environments. In a scheduling process, an unexpected event appears and two restarts of the scheduling are compared: from scratch, or from the current action-value function. Using the result of the Q-learning algorithm in the varying environment significantly reduces the time needed to reach the same result as before the unexpected event.


3.2 Q-learning algorithm

Markov Decision Process

A Markov Decision Process (MDP) is the framework of any reinforcement learning algorithm, as it describes the environment of the agent. According to Silver [20], an MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ where:

• $\mathcal{S}$ is a finite set of states

• $\mathcal{A}$ is a finite set of actions

• $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}^{a}_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$

• $\mathcal{R}$ is a reward function: $\mathcal{R}^{a}_{s} = \mathbb{E}(R_{t+1} \mid S_t = s, A_t = a)$, with $R_t$ the reward received at time $t$

• $\gamma$ is a discount factor, $\gamma \in [0, 1]$

$\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$ and $\gamma$ are designed by the user. $\mathcal{P}$ can be either defined, estimated, or ignored. If no state transition probability matrix has been defined, fewer RL methods are available. More details are provided in Section 3.2.

The state transition probability matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$: for each action $a$,

$$\mathcal{P}^a = \begin{pmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{pmatrix},$$

where each row of the matrix sums to 1 and $n$ is the number of states.

The discount factor gives the present value of future rewards. It is used in the definition of the return $G_t$, which is the total discounted return from time step $t$:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad (3.1)$$

where $R_t$ is the reward at time $t$. A $\gamma$ close to 1 favours rewards in the long term, while with a $\gamma$ close to 0, only the very next rewards have an effect on the return.

As suggested by its name and the definition of $\mathcal{P}$, the state sequence $S_1, S_2, \dots$ follows the Markov property. This means that, given the present state, the future does not depend on the past. Concretely, the observed state contains all the information needed for the future, even if the sequence leading to this state is lost.
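To make the role of the discount factor concrete, the following minimal Python sketch computes the return $G_t$ of Equation 3.1 for a finite sequence of future rewards; the reward values are made up for illustration and are not taken from the simulator.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite list of future rewards."""
    g = 0.0
    # Work backwards so each reward ends up discounted by the right power of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical RSRP-like rewards (negative values, as in this thesis):
future_rewards = [-80.0, -75.0, -90.0, -70.0]
print(discounted_return(future_rewards, gamma=0.9))  # far-sighted agent
print(discounted_return(future_rewards, gamma=0.1))  # short-sighted agent
```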

Greedy policy

In order to learn the best action to take in any state, the theory relies on the concept of return (see Equation 3.1). As one can see, the return takes into account all the future rewards, and an agent is expected to maximise its return. Since it would be extremely costly to compute the expected return for each state-action pair at a time t, some approximation is needed. This is where greedy policies come in. A greedy policy considers that the action leading to the maximum return is the one giving the highest reward at the current state.

An agent is expected to choose the most rewarding action for a given state. However, to learn what the best action actually is, the agent needs to make a trade-off between exploration and exploitation. One does not want the agent to only exploit, because as soon as it has discovered a state-action pair, it would stick to this same pair, even if it is not the best


one. Conversely, an agent that only explores will take a random action in any state to see what the reward is, which is obviously not a learning behaviour. In order to make the trade-off between exploration and exploitation, the agent uses a greedy policy which leads it to explore and exploit in an appropriate manner.

Some greedy policies, namely the ε-greedy policy and the optimistic initialisation policy, are introduced in the following subsections.

ε-greedy policy

The concept of the ε-greedy policy is very simple. It states that the agent should follow a greedy policy 100(1 − ε)% of the time; otherwise a random action is selected, in order to favour exploration. There are some variations of this ε-greedy policy; for example, ε can change over time. For instance, one can assume that after some time the agent has a good perception of the environment. Then there is no real need to continue exploring, and ε can be set to zero after a certain number of epochs. It is also possible to make it decrease as a function of time, so that its limit at infinity is zero.
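As a minimal illustration, ε-greedy action selection can be sketched as follows in Python. The Q-table is assumed to be stored as a 2-D NumPy array with integer-indexed states and actions; this representation is an assumption for illustration, not something prescribed by the thesis.

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon=0.1, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(q_table[state]))     # exploit

# Example: 238 states (14 nodes x 17 rounded RSRP levels) and 14 actions, as in this thesis.
q = np.zeros((238, 14))
action = epsilon_greedy(q, state=42, epsilon=0.1)
```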

Optimistic initialisation policy

The idea of this policy is to set highly optimistic initial Q-values. In this way, for any state and action, the reward is lower than expected, so the learner will try another action the next time it reaches this state. This encourages the learner to explore all the state-action pairs several times, until the action-values converge.

About the algorithm

To start with, an MDP on which to apply the Q-learning algorithm should be defined. The first MDP that has been set up is the following:

• The state-space S is all the combinations of a serving node and an RSRP.

• The action-space A is performing a handover to every single node.

• The reward is the RSRP of the target node.

• The discount factor is γ = 0.9

These choices will be explained later. Moreover, the state transition probability matrix has not been defined, because this thesis focuses on the Q-learning algorithm. It should be noted that, by definition, the RSRP is always negative. The higher the RSRP, the stronger the signal; consequently, a high RSRP results in a high reward. Moreover, as explained in Section 2.2, the RSRP can change over time for an immobile UE. The reward is therefore stochastic, even if it does not change drastically.

It is important to note that, unless changes are explicitly stated, this MDP is the basis of every Q-learning algorithm used in this thesis.

Model-free and model-based methods

In Section 3.2, the postulate was made that no state transition probability matrix would be defined. This is because the Q-learning algorithm is a model-free method.

The difference between model-based and model-free methods can be compared to the difference between habitual and goal-directed control of learned behavioural patterns [23]. Dickinson [6] showed that habits are the result of antecedent stimuli (model-free), while goal-directed behaviour is driven by its consequences (model-based).


Model-based methods use the knowledge provided by the state transition probability matrix in order to help the agent learn the most rewarding behaviour in the environment (e.g. the forward model [10]). On the contrary, in a model-free method the agent discovers its environment by trial-and-error. It builds habits by facing states and choosing actions several times. Model-free methods are of interest for this thesis, because the high number of states and actions used in this project makes it difficult to know the state transition probability matrix. The most famous model-free methods are probably SARSA [17] and Q-learning [25].

On or off-policy

Among the different model-free methods, two kinds can be distinguished: on-policy methods and off-policy methods.

To learn the best possible behaviour in its environment, a learner may need to optimise its policy. A policy $\pi$ is a distribution over actions given states: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$. How an agent behaves is completely defined by its policy, because it is the rule that the agent follows to select actions.

In order to estimate this policy, two different kinds of models can be used: on-policy or off-policy. The clearest explanation of on-policy and off-policy is given by Sutton in [23]: "On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data."

For a better clarification of these terms, the notion of action-value function needs to be introduced. It is the expected return from a starting state $s$, taking an action $a$ and following a policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi(G_t \mid S_t = s, A_t = a) \qquad (3.2)$$

The goal of many RL algorithms is to learn this action-value function. An action-value is also called a Q-value.

SARSA is an on-policy method because it uses the next state and the action taken by following the policy $\pi$ to update its Q-value. By contrast, Q-learning is an off-policy method because it uses the next state and the action taken by following a greedy policy (see Section 3.2) to update its Q-value.

Description

The Q-learning algorithm is a rather popular reinforcement learning algorithm, developed in 1989 by Watkins [25]. It won its spurs by performing well on a wide range of problems (e.g. [1], [22]), and is thus a highly interesting tool to start with in this project. It is a model-free, off-policy algorithm, whose idea is to update an action-value table, also called Q-table, until it converges to the optimum Q-values [24]. The action-value table is a matrix with the states as rows and the actions as columns. After the learning, this table can be used by the agent to take the best decision depending on the state it faces. The Q-learning algorithm is implemented as presented in Algorithm 1 and is detailed further below.


Algorithm 1 Q-Learning Algorithm

1: Initialise $Q(s, a)$ arbitrarily, $\forall\, s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $Q(\text{terminal state}, \cdot) = 0$
2: for each epoch do
3:   Initialise $S$
4:   for each step of the epoch do
5:     Choose $A$ from $S$ using a policy derived from $Q$ (e.g., the ε-greedy policy)
6:     Take action $A$, observe $R$, $S'$
7:     $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$   (3.3)
8:     $S \leftarrow S'$
9:   until $S$ is terminal

$\alpha$ is called the learning rate, $\alpha \in [0, 1]$. It decides how much weight to give to the newly acquired information compared to the old. An $\alpha$ close to 0 means that the agent learns almost nothing (see Equation 3.3), while an $\alpha$ close to 1 erases almost everything previously learnt. The latter can be useful in the case of a deterministic environment; however, this assumption does not hold in this thesis.

In the case where there is no goal state, a time limit for each epoch should be defined so that the agent does not keep learning on a single epoch forever.

In words, the Q-learning algorithm works as follows: a Q-table is first randomly initialised, with the terminal-state row set to 0. Then, for each epoch, a random initial state is chosen. For each time-step of each epoch, an action is taken according to the policy derived from Q. After this action is taken, the agent reaches a new state and gets the corresponding reward. Then the Q-table is updated according to Equation 3.3.

This equation may need some clarification. It explains how to update the Q-table for the state-action pair (S, A). After taking the action A in the state S, the agent gets a reward R and reaches the state S'. The Q-table is updated using the reward and, because it is an off-policy algorithm, the best possible action-value that can be obtained at the state S'. $\gamma$ indicates how important the future steps are for the current action-value.

If the agent were a robot looking for the exit of a maze, it would repeat the Q-learning for a certain number of epochs. That is, it would start from a random point in the maze, look for the exit, and repeat this operation until it knows the whole maze. In the case of this thesis, the agent is not a UE but the network, which is trying to figure out how to behave with respect to all the UEs. That is why the agent should learn from a lot of UEs in order to know how to choose the best possible BS for each UE. To do so, iterations are run. One iteration contains an epoch for each UE. For example, if the interest is in three UEs, one iteration consists of three epochs run successively, one for each UE. And an epoch is made of 596 steps, due to the way the UEs were simulated.
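The following Python sketch illustrates the tabular update of Algorithm 1 in this setting. The environment interface (`env.reset`, `env.step`) and the constants are assumptions made for illustration; the thesis does not prescribe a particular implementation.

```python
import numpy as np

N_STATES, N_ACTIONS = 238, 14      # 14 nodes x 17 rounded RSRP levels; 14 handover targets
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
EPOCH_LENGTH = 596                 # number of steps in one simulated UE trace

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))

def run_epoch(env, Q):
    """One epoch: walk through one UE trace and update the Q-table with Equation 3.3."""
    s = env.reset()                                # initial state of the UE
    for _ in range(EPOCH_LENGTH):
        if rng.random() < EPSILON:                 # epsilon-greedy policy
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        r, s_next = env.step(a)                    # reward and next state
        td_target = r + GAMMA * np.max(Q[s_next])
        Q[s, a] += ALPHA * (td_target - Q[s, a])   # Equation 3.3
        s = s_next

# One iteration = one epoch per UE used for learning, e.g.:
# for env in ue_environments:
#     run_epoch(env, Q)
```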

Defining the features

First, a greedy policy should be chosen. In this thesis, the ε-greedy policy is mainly used. The value chosen for ε is 0.1, which means that one out of ten actions is selected randomly among the possible actions. This choice seems to be a good trade-off between exploration and exploitation, as confirmed by its common use and the successes it has encountered [23]. The advantage of this method is that it is suitable for any kind of reward.

In the frame of this work, all the rewards are negative. Because of this characteristic, the optimistic initialisation policy is a suitable alternative to the ε-greedy policy. By setting all the initial Q-values to zero, any reward will be lower than the Q-value. Then, for a certain state-action pair, the updated Q-value will decrease according to Equation 3.3. When this state is encountered again, the Q-value of this state-action pair is lower than all the other Q-values for this state. Since the greedy selection picks the action with the maximum Q-value, this specific state-action pair is not visited again until it once more has the highest Q-value. Thus, there is no need for explicit exploration: all the state-action pairs are visited because each of them, at some moment, has the highest Q-value.

Since both methods are suitable for this work, the choice is made to favour the more widespread one, namely the ε-greedy policy.

Then, to implement a working Q-learning algorithm, several steps have been needed. The main issue when updating a table is the time it takes, which mainly depends on the size of the matrix. In fact, the bigger the matrix, the more steps are needed to visit each state-action pair enough times to reach convergence. Thus, the first challenge is to get as much information as possible while keeping the Q-table as small as possible. Since the rows represent the states and the columns represent the actions, the goal is to reduce the state-space and the action-space. After weighing up the pros and cons of many different spaces, the actions were decided to be switching to a beam, and the states to be the combination of the serving beam and the measured RSRP from this beam.

To reduce the state-space and the action-space significantly, instead of choosing a beam, a node is chosen. Since, on the simulated map, each node spreads 24 beams, this greatly reduces both spaces. When a node is picked, the beam providing the best RSRP is chosen. This is an optimistic way of proceeding, but the idea behind this choice is to be able to reproduce it at the beam scale. If the state-space were made of beams, there would be no choice to make. There are 14 nodes on the simulated map, so the action-space is reduced to 14 actions, each of them being a handover to one node. The nodes received an index during the simulation, and this index is used for the actions as well. For example, the first column of the Q-table is: switch to node 1. If the action selected is to switch to the serving node, then no operation is actually made.

Since the Q-learning algorithm uses a table, it is not possible to use a continuous variable like the RSRP. In order to use this essential variable anyway, the measured RSRP is rounded to the closest ten. For example, if the RSRP is -73.02, then it is rounded to -70. After looking at all the simulated values, it appears that all the RSRP values lie between -200 and -40. There are thus 17 possible values of rounded RSRP for each serving node, and there are 14 nodes, so there is a total of 14 x 17 = 238 states.
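As an illustration of this discretisation, a state index can be derived from the serving node and the rounded RSRP as sketched below. The particular indexing scheme is an assumption made for illustration; the thesis does not specify one.

```python
def state_index(serving_node, rsrp_dbm):
    """Map (serving node, measured RSRP) to one of the 14 x 17 = 238 discrete states.

    serving_node: integer in 1..14
    rsrp_dbm: measured RSRP, assumed to lie in [-200, -40] dBm
    """
    rounded = int(round(rsrp_dbm / 10.0)) * 10     # e.g. -73.02 -> -70
    rsrp_level = (rounded + 200) // 10             # 0..16, i.e. 17 levels
    return (serving_node - 1) * 17 + rsrp_level    # 0..237

print(state_index(1, -73.02))    # 13
print(state_index(14, -200.0))   # 221
```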

The reward initially chosen is the difference between the RSRP of the node resulting from the chosen action, also called the target node, and the RSRP received from the serving node. As mentioned earlier, even if the target node and the current serving node are the same, the reward can be different from zero, due to the evolution in time, and probably in space, of the UE.

The parameters of the model are the learning rate $\alpha = 0.1$ and the discount factor $\gamma = 0.9$. These values have been chosen according to the literature, in which they are widespread and work efficiently [23], but not only. A $\gamma$ close to 1 has been chosen because a handover should be performed only if this operation is interesting in the long term. A $\gamma$ close to 0 would make the agent think that only the next step is important, and it would then perform a handover each time a better node is found.

Convergence

If this method is theoretically supposed to converge to the optimum Q-values, how can one be sure that the number of iterations was sufficient to let the Q-table converge? Apart from some visualisation tools that will be presented later, a way to certify the convergence has been devised. To verify that the Q-table no longer evolves, some intermediate values of the Q-table are saved during the learning. To measure the convergence, the mean squared element-wise difference between two successive saved Q-tables (MSE) is computed. At time-step $t$,

$$\mathrm{MSE}(t) = \frac{\sum_{i,j} \left(Q^{t-1}_{i,j} - Q^{t}_{i,j}\right)^2}{I \cdot J}, \qquad (3.4)$$

where $Q^t$ is the Q-table at time $t$, $Q_{i,j}$ is the element at the crossing of row $i$ and column $j$, $I$ is the size of the state-space and $J$ is the size of the action-space.

Moreover, in case the slope is still fluctuating due to changing rewards, a linear regression is applied to the last values of the MSE, with the number of iterations as explanatory variable and the MSE as response variable. If the fitted slope is close to 0, it means that the MSE has converged.
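A minimal sketch of this convergence check is given below; the snapshot interval and the number of MSE values used for the regression are assumptions for illustration.

```python
import numpy as np

def q_table_mse(q_prev, q_curr):
    """Mean squared element-wise difference between two successive saved Q-tables (Eq. 3.4)."""
    return float(np.mean((q_prev - q_curr) ** 2))

def slope_of_last(mse_values, n_last=20):
    """Fit MSE ~ iteration index on the last n_last values; a slope near 0 suggests convergence."""
    y = np.asarray(mse_values[-n_last:])
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

# During learning, assuming Q-table snapshots are kept in a list `snapshots`:
# mse_history = [q_table_mse(a, b) for a, b in zip(snapshots[:-1], snapshots[1:])]
# print(slope_of_last(mse_history))
```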

3.3 Combining reinforcement learning and artificial neural networks

Integrating Artificial Neural Networks (ANNs) into RL has become quite a popular technique. According to the literature, these kinds of methods have met with a lot of success (e.g. [23], [11]). Q-learning is one of the algorithms that have been successfully used with ANNs [23].

Theory of neural networks

Description

ANNs were defined by Dr. Robert Hecht-Nielsen in [3] as "a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs". ANNs, which are deeply inspired by the human neural network, usually aim at inferring functions or at doing classification from provided observations. An ANN is organised as a succession of layers, each layer being composed of one or several so-called nodes.

In order to introduce neural networks in a mathematical way, a specific kind, used in this thesis, will be described: the feed-forward network, also called multilayer perceptron. Consider $x = x_1, \dots, x_D$ as the input variable and $y = y_1, \dots, y_K$ as the output variable. First, $M$ linear combinations of $x$ are constructed:

$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}, \qquad (3.5)$$

where $j = 1, \dots, M$ and the superscript $(1)$ indicates that the parameters $w_{ji}$ (the weights) and $w_{j0}$ (the biases) belong to the first layer of the ANN. The $a_j$ are called activations. Then, an activation function $h$, which should be differentiable and nonlinear, is applied to the activations in order to give the hidden layer:

$$z_j = h(a_j) \qquad (3.6)$$

The hidden layer, composed of the $M$ hidden units, is the layer between the input and the output layers. The activation function is often chosen to be the logistic sigmoid function or the tanh function (see Figure 3.1).


Figure 3.1: Most commonly used activation functions for hidden layers

Then, $K$ linear combinations of the hidden units are constructed:

$$a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0}, \qquad (3.7)$$

where $k = 1, \dots, K$ and the superscript $(2)$ indicates that the parameters $w_{kj}$ and $w_{k0}$ belong to the second layer of the ANN. The $a_k$ are called output unit activations. Last, the outputs $y$ are obtained by applying another activation function to the output unit activations. In this thesis, because ANNs will be used for regression, the activation function of the output layer is the identity: $y_k = a_k$.

A common representation of the ANN described above is given in Figure 3.2. In this particular case, $D = 2$, $M = 2$ and $K = 1$. An ANN with one hidden layer is called a two-layer network, because that is the number of layers having weights and biases to adapt. It is a two-layer feed-forward ANN that has been described above, but a feed-forward ANN can have several hidden layers. It should be noted that a feed-forward network does not contain any closed directed cycles: the links originate from one layer and reach the following layer, with no possible return to a previous layer.

Figure 3.2: Simple ANN with one hidden layer of two nodes [19]. The weights are on the links between two consecutive layers
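To make Equations 3.5 to 3.7 concrete, here is a minimal NumPy sketch of the forward pass of such a two-layer network with a tanh hidden layer and an identity output layer. The layer sizes follow Figure 3.2, while the weight initialisation and the example input are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, K = 2, 2, 1                     # input, hidden and output sizes as in Figure 3.2
W1 = rng.normal(0.0, 0.01, (M, D))    # first-layer weights  w^(1)_ji
b1 = np.zeros(M)                      # first-layer biases   w^(1)_j0
W2 = rng.normal(0.0, 0.01, (K, M))    # second-layer weights w^(2)_kj
b2 = np.zeros(K)                      # second-layer biases  w^(2)_k0

def forward(x):
    """Forward pass: Equations 3.5-3.7 with tanh hidden units and identity outputs."""
    a = W1 @ x + b1        # activations a_j (Eq. 3.5)
    z = np.tanh(a)         # hidden units z_j (Eq. 3.6)
    y = W2 @ z + b2        # output unit activations, here equal to y_k (Eq. 3.7)
    return y, z, a

y, _, _ = forward(np.array([0.3, -1.2]))
print(y)
```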


ANN training

The training will be presented in a more general case, that is, not only for a network with one hidden layer. First, the generalised versions of Equations 3.5 and 3.6 for any layer are:

$$a_j = \sum_i w_{ji} z_i, \qquad (3.8)$$

where $z_i$ is the activation of a unit sending a connection to unit $j$, and

$$z_j = h(a_j) \qquad (3.9)$$

To train an ANN, the idea is to provide inputs $x$ and targets $t = t_1, \dots, t_K$. The targets are the values that one would like to get as an output of the ANN when providing $x$ as inputs. The inputs are placed in the input layer, then the network processes these inputs with the current parameters $w$ following Equations 3.8 and 3.9. The output is denoted $y$. The goal is to minimise the error function $E(w) = \frac{1}{2}\sum_{n=1}^{K} \left(y(x_n, w) - t_n\right)^2$.

To minimise $E(w)$, the idea is to solve the equation $\nabla E(w) = 0$. Since it is extremely complicated to find an analytical solution to this problem [2], it is necessary to proceed by iterative numerical procedures. Error backpropagation (see Algorithm 2) is the most commonly used solution.

Here is a short clarification of this method; more details can be found in [2]. The error function can be written as a sum of terms, one for each data point in the training set: $E(w) = \sum_{n=1}^{K} E_n(w)$. The problem can then be restricted to evaluating $\nabla E_n(w)$ for each $n$.

According to the chain rule for partial derivatives,

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\, z_i = \delta_j z_i, \qquad (3.10)$$

where the second equality uses Equation 3.8 and the last one introduces the notation $\delta_j = \partial E_n / \partial a_j$. Applying the chain rule for partial derivatives again,

$$\delta_j = \sum_k \frac{\partial E_n}{\partial a_k}\,\frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_k w_{kj}\,\delta_k, \qquad (3.11)$$

using Equations 3.8, 3.9 and 3.10.

Algorithm 2 Error backpropagation

1: For an input vector $x$, apply a forward propagation through the network using Equations 3.8 and 3.9 to find the activations of all the hidden and output units.
2: For the output units, evaluate $\delta_k = y_k - t_k$
3: For each hidden unit, backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$
4: Evaluate the required derivatives using Equation 3.10
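A compact sketch of Algorithm 2 for a two-layer regression network (squared error, tanh hidden layer, identity output) is given below. The layer sizes, learning rate and example data are illustrative assumptions, not values prescribed by the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 29, 10, 1                    # e.g. the sizes of the first ANN in this section
W1, b1 = rng.normal(0, 0.01, (M, D)), np.zeros(M)
W2, b2 = rng.normal(0, 0.01, (K, M)), np.zeros(K)

def backprop_step(x, t, lr=0.01):
    """One gradient step of error backpropagation (Algorithm 2) on a single example."""
    global W1, b1, W2, b2
    # Forward pass (Equations 3.8 and 3.9).
    a = W1 @ x + b1
    z = np.tanh(a)
    y = W2 @ z + b2
    # Output deltas: delta_k = y_k - t_k.
    delta_out = y - t
    # Hidden deltas: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2.
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)
    # Required derivatives (Equation 3.10): dE/dw_ji = delta_j * z_i.
    W2 -= lr * np.outer(delta_out, z)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x)
    b1 -= lr * delta_hid
    return float(0.5 * np.sum(delta_out ** 2))   # squared error, for monitoring

loss = backprop_step(rng.normal(size=D), np.array([-70.0]))
```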

Using artificial neural networks in the Q-learning algorithm

The principle of this Q-learning algorithm is still to update an action-value function until convergence to the optimum Q-values, following Algorithm 1. The action-value function is still updated according to Equation 3.3. However, instead of updating a Q-table, an ANN is trained to approximate the action-value function. The state and the action are given as inputs, and the output is the Q-value corresponding to this state-action pair. In the following, the ANNs are presented as they will be used to approximate the action-value function.

Using an ANN should make it possible to move to the continuous setting, so there is no need to approximate the RSRP anymore. This results in a smaller number of input nodes, because there is one continuous variable instead of a categorical variable with, for example, 17 possible values and as many nodes. It should also result in a more accurate action-value function.

The first ANN created is a feed-forward neural network, described previously in this section, with only one hidden layer of ten units. This ANN aims to approximate the action-value function, so a state-action pair is given as input, and the action-value corresponding to this state-action pair is the output. The state-space S and the action-space A are the same as described in Section 3.2. Thus, the state is the combination of the categorical variable serving node, which can take 14 values, and the continuous variable RSRP. There are therefore 15 input nodes for the state. The 14 nodes for the serving node are all set to 0 except the one corresponding to the serving node, which is set to 1. In the same way, the action is represented by 14 nodes, all of them set to 0 except the one corresponding to the target node (see Figure 3.3). There are 15 nodes to represent the state and 14 nodes for the action, which means that there are 29 input nodes. The output being the Q-value of the input state-action pair, there is only one output node.

Figure 3.3: First ANN approximating the action-value function, in the case of a state S: -70 dBm received from serving node 1, and an action A: perform a handover to node 14
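As a sketch of this input encoding, the 29-dimensional input of the first ANN can be built as shown below. The one-hot encoding and index conventions are assumptions made for illustration of Figure 3.3.

```python
import numpy as np

N_NODES = 14

def encode_state_action(serving_node, rsrp_dbm, target_node):
    """Build the 29-dimensional input: 14 one-hot serving-node entries, the RSRP,
    and 14 one-hot target-node (action) entries, as in Figure 3.3."""
    serving = np.zeros(N_NODES)
    serving[serving_node - 1] = 1.0
    action = np.zeros(N_NODES)
    action[target_node - 1] = 1.0
    return np.concatenate([serving, [rsrp_dbm], action])   # shape (29,)

x = encode_state_action(serving_node=1, rsrp_dbm=-70.0, target_node=14)
print(x.shape)   # (29,)
```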

The reward being the RSRP received from the target node, a penalty of 25 is applied to any action leading to a handover. This choice will be explained later, in Sections 3.5 and 4.4.

The weights and biases of the layers of the ANN are initialised as follows:

$$w_{ji} \sim \mathcal{N}(0, 0.01)$$

The chosen activation function is a tanh-function (see Figure 3.1). The values of γ, α and ε arechosen as previously, according to the discussion in Sections 3.2 and 3.2. In order to train theANN, an input dataset and a target dataset should be provided. However, there is no suchdatasets in this case. In fact, the target needs to be computed at each time-step of the learningfollowing the Equation 3.3. So it is not possible to have a complete target dataset, because thetarget is depending on the current action-value given by the ANN. Thus, the ANN must be


adapted after every time-step, by providing the input, the output and the target computed depending on the input.

After a first try, it appears that this method is extremely time consuming. In order to speed up the process, batch training is considered. This means that the weights are not updated after every iteration; instead, the inputs and targets are saved, and the ANN is updated with this set of values from time to time. The size of the batch is set to a quarter of an epoch, so four updates are made for each epoch.

Despite this batch updating of the ANN, the training process remains extremely slow and thus very difficult to exploit within the time of a master thesis. So a new kind of ANN is built. The general architecture, a feed-forward ANN with one hidden layer of ten hidden units and a tanh activation function, is kept. The input layer is the state, and the output layer is the Q-value for each possible action. Thus, there are 15 input nodes, corresponding to the RSRP and the serving node, and 14 output nodes, corresponding to each action (see Figure 3.4).

Figure 3.4: Second ANN approximating the action-value function, in the case of a state S: -70 dBm received from the serving node 1. All the Q-values corresponding to all the actions compose the output
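The sketch below illustrates this second configuration and one online update in which the target is computed following Equation 3.3. It is a minimal example under stated assumptions: the initialisation, dimensions and tanh hidden layer follow the text, while the learning rate and the simple gradient step are illustrative choices, not the thesis implementation.

import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HID, N_OUT = 15, 10, 14          # state inputs, hidden units, one Q-value per action
W1 = rng.normal(0.0, 0.01, (N_HID, N_IN)); b1 = np.zeros(N_HID)
W2 = rng.normal(0.0, 0.01, (N_OUT, N_HID)); b2 = np.zeros(N_OUT)

def q_values(state):
    z = np.tanh(W1 @ state + b1)
    return W2 @ z + b2, z

def train_step(state, action, reward, next_state, alpha=0.1, gamma=0.9, lr=0.01):
    # Target computed following Eq. 3.3: only the Q-value of the taken action is moved.
    global W1, b1, W2, b2
    q, z = q_values(state)
    q_next, _ = q_values(next_state)
    target = q.copy()
    target[action] = q[action] + alpha * (reward + gamma * q_next.max() - q[action])
    # One gradient step on the squared error between output and target
    delta_out = q - target
    delta_hid = (1.0 - z**2) * (W2.T @ delta_out)
    W2 -= lr * np.outer(delta_out, z); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, state); b1 -= lr * delta_hid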

Convergence

Computing the log-likelihood (logL) is one way to assess whether an ANN is converging. Assuming a normal distribution of the output of the ANN, it can be written:

\log L(t) = \log \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left(\mathrm{net}_t(\mathrm{input}_t)[k] - \mathrm{target}_t[k]\right)^2}{2\sigma^2}\right) \qquad (3.12)

where K is the size of the output layer, net_t is the output of the ANN at time-step t, input_t is the input at time-step t and target_t is the target at time-step t. [k] represents the k-th element, for example the output of the k-th node of the ANN. The quantity of interest is not the value of the log-likelihood itself, but its relative evolution. So all the constant terms can be grouped into two constants, X and Y:

\log L(t) = \sum_{k=1}^{K} X \left(\mathrm{net}_t(\mathrm{input}_t)[k] - \mathrm{target}_t[k]\right)^2 + Y \qquad (3.13)


Finally, X and Y are removed from the expression of the log-likelihood. The quantity logL is then a value proportional to the log-likelihood:

\log L(t) = \sum_{k=1}^{K} \left(\mathrm{net}_t(\mathrm{input}_t)[k] - \mathrm{target}_t[k]\right)^2 \qquad (3.14)

This value is supposed to decrease while the output values become closer to the target values.
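A minimal sketch of this convergence measure, assuming net_outputs and targets are the arrays collected at a given time-step (the function name is illustrative):

import numpy as np

def proportional_log_likelihood(net_outputs, targets):
    # Sum of squared differences between network outputs and targets (Eq. 3.14);
    # up to the constants X and Y, this tracks the log-likelihood under a normal model.
    net_outputs = np.asarray(net_outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sum((net_outputs - targets) ** 2))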

3.4 Visualisation

In order to see how well the learning performs, it is common to look at the curve of the cumulative sum of the reward, since the goal of the agent is to maximise its reward [23]. The cumulative sum of the reward is the sum of all the rewards obtained by the agent during the learning:

\mathrm{cumsum}(T) = \sum_{t=1}^{T} R_t, \qquad (3.15)

where R_t is the reward obtained by the agent at time-step t. In fact, the cumulative sum of the mean reward per epoch will be plotted. That is, after an epoch, the rewards obtained during this epoch are summed and divided by the size of an epoch, which is always 596. In practice, the only difference is that the curve looks smoother and the values obtained are lower.
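The following sketch shows how this curve can be computed from the raw rewards; the epoch size of 596 comes from the text, while the function name and the flat reward array are assumptions for illustration:

import numpy as np

def cumulative_mean_reward(rewards, epoch_size=596):
    # rewards: flat array of per-time-step rewards, stored epoch after epoch.
    rewards = np.asarray(rewards, dtype=float)
    n_epochs = len(rewards) // epoch_size
    per_epoch_mean = rewards[: n_epochs * epoch_size].reshape(n_epochs, epoch_size).mean(axis=1)
    return np.cumsum(per_epoch_mean)   # curve to plot against the epoch index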

The expected behaviour is that after taking some actions leading to excessively bad rewards, the agent should avoid taking these actions again, and after some time all the actions to avoid are known. It follows that the cumulative sum of the reward should first include low rewards, so decrease quickly, and after some time these state-action pairs are not chosen anymore, so the slope becomes closer to zero (see Figure 3.5). In one iteration, there is one epoch of learning for each UE. The epoch index is the index given to an epoch according to its chronological order.

Figure 3.5: Expected shape of the cumulative sum of the reward


Another solution to visualise whether the agent learns is to plot the mean reward per epoch. At the end of each epoch, the mean reward obtained by the agent is computed, and the resulting graph shows the evolution of this mean, epoch after epoch. The expected behaviour is that the mean reward starts by being quite low because of badly rewarding states. Then, these states are avoided because the agent knows they are not interesting for it, so the mean reward increases after some time of learning (see Figure 3.6).

Figure 3.6: Expected shape of the evolution of the mean reward

Barplots

If the agent has learnt, it is interesting to visualise how it uses its knowledge. The idea is to use the learnt Q-table and a simulated UE to see how the BSs would provide a signal to this UE. To select an action, the agent picks the one with the highest Q-value for the current state. To see how it performs, two kinds of graphs are plotted. First (see Figure 3.7), the RSRP (on the left) and the serving node (on the right) are shown, both for each time-step during a complete epoch. The serving node is the node spreading the serving beam. On each graph, there are two different lines. The first one, in orange, shows the case where the RSRP is maximum. The second, in blue, shows the situation induced by the use of the Q-table learnt by the agent. That is, for each time-step, the maximum RSRP and the corresponding serving node are displayed in order to see whether the learnt behaviour is close or not. The simulation is made on one or several UEs that have been used to make the agent learn, plus at least one test UE, to see if the learnt Q-table can be used for unknown positions. All the UEs are plotted row by row.

The test UE is always the UE following, in the simulation order, the last UE used to learn. If there are several test UEs, they are the following ones. In case of different categories (see Section 3.5), the test UE is the one following the last UE in the same category.


Figure 3.7: Plot to visualise the efficiency of the learning

For the previous plot, the first five UEs have been used for the learning of the agent. Some UEs seem to perform well, like the first or the fourth UE, because the blue line (predicted values) is close to the orange line (optimal values) on the RSRP graph, and also because the blue line is quite stable on the graph of the serving node, meaning that few handovers are performed. In contrast, the predicted values of RSRP of the test UE are much lower than the optimal ones, and the serving node is constantly changing, proof that a lot of handovers are performed.

To complement the information given by these graphs, the second kind of graph used is bar charts, which show the mean difference in RSRP between the optimal and the learnt choice of node, as well as the number of handovers (see Figure 3.8).


Figure 3.8: Barplots complementing the graphs in Figure 3.7

On the previous graph, the first five UEs have been used for the learning of the agent. The analysis is the same, but more summarised. For example, it is easier to see how many handovers have been performed.

In order to be more precise, the 90% confidence interval of the mean difference of RSRP is added to the graph, for each UE. However, the distribution of the mean difference is unknown. To overcome this obstacle, the confidence intervals are computed using the bootstrap method. The idea of bootstrapping is to use resampling to estimate statistics, such as the confidence interval in this case.

Briefly, here is how the bootstrap works ([15]). Suppose that the goal is to estimate a statistic m from a sample x_1, ..., x_n. A resample x*_1, ..., x*_n of the same size as the original sample is drawn with replacement, and m* is the statistic of interest computed from this resample. This operation is repeated many times. Then, according to the bootstrap principle, the variations of m and m* are approximately equal.

In this thesis, the bootstrap confidence interval is computed after resampling 10,000 times from the differences in RSRP. Then the lowest five and the highest five percentiles are removed from the 10,000 means computed from the resamples. The remaining lowest value is the lower limit of the confidence interval, while the remaining highest value is the upper limit.
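A minimal sketch of this percentile bootstrap, assuming the per-time-step RSRP differences are available in an array (function and argument names are illustrative):

import numpy as np

def bootstrap_ci(differences, n_resamples=10_000, lower_pct=5, upper_pct=95, seed=0):
    # Resample the RSRP differences with replacement, compute the mean of each
    # resample, and keep the 5th and 95th percentiles as the 90% interval limits.
    rng = np.random.default_rng(seed)
    differences = np.asarray(differences, dtype=float)
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        resample = rng.choice(differences, size=differences.size, replace=True)
        means[i] = resample.mean()
    return np.percentile(means, lower_pct), np.percentile(means, upper_pct)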

Heatmaps

A complementary technique can be used in order to visualise and interpret the obtained results, namely heatmaps. Firstly, they are used to visualise the Q-table itself. For more readability, instead of plotting a single heatmap with all the states and all the actions, all the different nodes are plotted individually. Thereby, one can see distinctly where the border between two nodes is and thus understand the map better. The state-action pairs that have never been visited during the learning process are coloured in grey. For the other colours, the darker they are, the smaller the corresponding action-value. After the learning, one can guess from this heatmap which action will be taken at a given state: the one with the palest colour. The heatmap of the action-value table can be used to visualise which node will make the agent perform a handover to which other node. It can also become visually evident


why two nodes are continuously alternating, like for the third UE in Figure 3.7, where there is a continuous alternation between nodes 5 and 12 between the time-steps 1 and 100.

In order to get even more information, the use of heatmaps is extended. For example, one can look at the number of times each state-action pair has been visited during the learning process. This makes it possible to see whether some actions are favoured when the agent is at a certain state, which would mean that this action has probably not been chosen randomly, but because the target node gives a strong signal. For more readability, a log-10 scale is used; otherwise only the few state-action pairs that have been visited extremely often would appear. This means that in order to get the real number of visits, the following operation should be made:

If log10(nb of visits) = x, then nb of visits = 10^x.

Moreover, since the reward can differ for the same state-action pair, it can be interesting to have an idea of the distribution of the rewards for each state-action pair over the complete learning process. Heatmaps can be used here as well, to plot the mean reward and the standard deviation of the reward for each state-action pair.
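As a sketch of how these per-pair statistics can be accumulated from logged transitions, the following assumes a list of (state, action, reward) tuples is available; this dictionary-based bookkeeping is an illustration, not the thesis implementation:

import numpy as np
from collections import defaultdict

def state_action_statistics(transitions):
    # transitions: iterable of (state, action, reward) tuples logged during learning.
    rewards = defaultdict(list)
    for state, action, reward in transitions:
        rewards[(state, action)].append(reward)
    stats = {}
    for pair, values in rewards.items():
        values = np.asarray(values, dtype=float)
        stats[pair] = {
            "visits_log10": np.log10(len(values)),  # log-10 scale of the visit-count heatmap
            "mean_reward": values.mean(),
            "std_reward": values.std(),
        }
    return stats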

3.5 Settings

Penalising handover

As previously said, performing a handover is a costly action. This is why it can be judicious to penalise a handover in the model. Concretely, it means adding a penalty to the reward, the value of which has to be defined, when the action selects a node different from the current serving node. A simple way of choosing the value of this penalty factor is to train the model with different values, all other things being equal, including the use of a seed to avoid differences due to randomness in the initialisation. The tested values range from 5 to 50 in steps of 5.

More precisely, the penalty term is incorporated into the reward, and the update rule of the Q-learning algorithm (Equation 3.3) becomes:

Q(S, A) \leftarrow Q(S, A) + \alpha\left[R - p\cdot\mathbf{1}_{\mathrm{TargetNode} \neq \mathrm{CurrentNode}} + \gamma \max_a Q(S', a) - Q(S, A)\right] \qquad (3.16)

where p is the penalty factor.

The selection of the best penalty factor implies a compromise, because it is expected that the higher the penalty, the lower the number of handovers, and the higher the mean difference. So a trade-off between mean difference and number of handovers has to be made.
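A minimal sketch of the table-based update of Equation 3.16; the values of alpha and gamma below are placeholders, and the indexing of the Q-table by (state, action) is an assumption for illustration:

import numpy as np

def q_update(Q, state, action, reward, next_state, serving_node,
             alpha=0.1, gamma=0.9, penalty=25.0):
    # The penalty p is applied only when the chosen target node differs from the
    # current serving node, i.e. when the action triggers a handover.
    shaped_reward = reward - (penalty if action != serving_node else 0.0)
    td_target = shaped_reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q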

Two actions

Since choosing among 14 actions may be time consuming, it can be pertinent to consider a smaller action-space. It is not possible to restrict it to fewer than two actions. Thus, the action-space described in Section 3.2 is updated as follows:

A = {Measure the RSRP from all the beams and switch to the best one; Do not perform a handover}

The second action could also be called "Do nothing", because not even a measurement is done. In order to see which action is taken at each time step, a new column of graphs has been added to the visualisation tool of Figure 3.7. It is the column on the right (see Figure 3.9). Action 0 corresponds to doing nothing, while action 1 is measuring all


the beams. This makes clear that measuring all the beams may be a costly operation, even if no handover is performed after taking action 1. To account for this, a penalty factor is added when the measuring action is taken. As this action involves measuring the power of many beams and possibly performing a handover, the penalty factor has been set quite high: 50.
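The sketch below illustrates this two-action setting with the measurement penalty of 50; the function signature and the assumption that the reward of "do nothing" is the RSRP of the current serving node are illustrative choices, not the thesis code:

import numpy as np

DO_NOTHING, MEASURE_AND_SWITCH = 0, 1

def step_two_actions(action, rsrp_all_beams, serving_node, penalty=50.0):
    # rsrp_all_beams: array with the RSRP of every candidate node at the current time-step.
    if action == MEASURE_AND_SWITCH:
        new_serving = int(np.argmax(rsrp_all_beams))      # switch to the strongest node
        reward = rsrp_all_beams[new_serving] - penalty    # measuring everything is penalised
    else:
        new_serving = serving_node                        # keep the current serving node
        reward = rsrp_all_beams[serving_node]
    return new_serving, reward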

Figure 3.9: The column on the right shows the action selected

It can be noticed in Figure 3.9, for the second, third and test UEs, that the action taken is often action 1. However, when looking at the column in the middle, it looks like no handover has been performed. This is because action 1 leads to measuring all the beams and switching to the best one, so if the best one is the serving node, there is no handover to perform. Consequently, taking this action was not optimal in these cases.

Contextual information

A new idea is now exploited: the notion of contextual information, introduced by Sutton [23]. It means that some exterior knowledge is used to make the agent learn more efficiently. It would probably be really interesting to know what kind of UE is requesting a signal, for example whether it is a smartphone or a connected car, which brand it is, etc. Unfortunately, it is not possible to recreate different kinds of terminals in the simulator. But there is one type of knowledge which can be of great interest as contextual information: the mobility of the UE. There are three categories of UEs: the ones used indoors, the ones used in a motorised vehicle outdoors, and the ones used otherwise outdoors. The contextual information is used to learn one Q-table for each category. Thereby, it is expected that the results are more accurate. Indeed, without contextual information, if two UEs are at the same state but one is indoor and the other outdoor, it is quite likely that for the same action the following state will not be the same.

For more convenience, and without reducing the understanding, the three categories listed above will be called indoor, car and walking UEs. The simulator first created a walking UE, then a car UE and finally an indoor UE, and then repeated this order. So the walking UEs are UEs 1, 4, 7, etc., the car UEs are UEs 2, 5, 8, etc., and the indoor UEs are UEs 3, 6, 9, etc.


Adding new features

Until now, a quite simple state-space has been used. As explained earlier, a bigger state-space implies longer computation times. However, this additional time is accepted in order to try to discover features giving better precision to the agent. In order to see whether some other features can help the agent learn, two experiments have been done. The first one was to use the direction of the UE, the second was to use the angle between the UE and the serving node.

The idea of adding these features grew because using the state simply as a combination of the RSRP and the serving node was a little restrictive. In fact, if two UEs start at the same place with the same RSRP but move in different directions, their RSRPs can evolve in the same way while the node to switch to will be different (see Figure 3.10).

Figure 3.10: Two different UEs with identical start and finish states

In the previous diagram, a very theoretical case is introduced. Two UEs are considered to start from the same point, so with the same RSRP. Then they move in different directions, but their RSRP stays exactly the same. When the RSRP becomes low, a handover should be performed, and the two UEs will not need the same beam to get a good signal strength, because they are quite far from each other.

Consider the direction in which the UE is moving. It would have been interesting to use the angle directly, but due to the use of the Q-learning algorithm, an approximation has to be made. To do so, the vertical is taken as 0° and the space is divided into a certain number of parts (see Figure 3.11). The UE being in the centre of the figure, it can go in one of the eight directions. Eight has been chosen as a middle ground between segments that are not too wide and a state-space that does not grow too much.
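A minimal sketch of this discretisation, assuming the direction is available as an angle in degrees measured from the vertical (the exact segment boundaries in Figure 3.11 are not specified, so the mapping below is illustrative):

def direction_segment(angle_deg, n_segments=8):
    # Map a direction angle (0 degrees taken as the vertical) to one of
    # n_segments discrete segments, numbered 1 to n_segments.
    width = 360.0 / n_segments
    return int((angle_deg % 360.0) // width) + 1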


Figure 3.11: Directions simplification for a UE

But after testing this feature, it appeared that adding the direction did not solve all the issues caused by similar states, because similar states with a similar direction can still require different nodes to switch to (see Figure 3.12).

Figure 3.12: Two different UEs with different start and finish points but similar directions

In the previous diagram, another theoretical case is introduced. Two UEs start with an identical RSRP provided by the same serving node. The two UEs move in the same direction, and their RSRP stays the same while they are moving. When the RSRP becomes too low, a handover needs to be performed, but the ideal beam will not be the same for both UEs.

To solve this, the angle between the UE and the BS appears to be the best feature to avoid these ambiguous cases. In fact, the previous case is resolved by using this feature. The simulated BSs are each composed of three sectors whose coordinates are known. The same system of segments as for the direction is used (see Figure 3.11); if the three sectors are not in the same segment, then the segment containing two sectors is used.

Adding a feature actually means changing the MDP described in Section 3.2. The state-space S becomes the combination of the serving node, the RSRP received from this node, and the new feature. As described previously, the new feature is represented by an integer between 1 and 8. For example, a state can be an RSRP of -70dBm received from node 8, whose angle with the UE lies in segment 3. Adding one of these new features multiplies the size of the state-space by 8, giving 8 × 238 = 1904 states.
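A sketch of how such an extended state could be flattened into a single Q-table row index, assuming the 238 base states come from 14 serving nodes times 17 rounded RSRP levels (the indexing scheme itself is an illustration, not the thesis code):

def state_index(serving_node, rsrp_level, segment,
                n_nodes=14, n_rsrp_levels=17, n_segments=8):
    # serving_node in 1..14, rsrp_level in 0..16 (rounded RSRP mapped to an index),
    # segment in 1..8; 14 * 17 * 8 = 1904 states, matching the text.
    return ((serving_node - 1) * n_rsrp_levels + rsrp_level) * n_segments + (segment - 1)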


4 Results

In this chapter, the results are presented, first for the Q-learning algorithm in its table-based version, then for the version using action-value function approximation with artificial neural networks.

Note: for all the graphs of this chapter, the details about the parameters used are summarised in Table 7.1. Sometimes, close values can be difficult to differentiate between two different graphs; hence, the tables with the values are available in Appendix 7.3. Moreover, according to Algorithm 1, Q(terminal state, ·) should be initialised to 0. However, there is no goal state in this thesis, so for simplicity's sake the whole Q-table is initialised to 0.

4.1 Time considerations

As expected, running the Q-learning algorithm takes substantial time. After some optimisation, the code became faster, but not impressively so (see Figure 4.1). When learning with the first three UEs for ten thousand iterations, the mean running time over ten runs is over 20 minutes. From this observation, it can be deduced that drastically increasing the number of states and the number of actions will result in extremely long computations, which is not acceptable in the time frame of a master thesis. Therefore the Q-learning algorithm is a good framework to experiment with and to discover which features are of interest, but it will probably not lead to extraordinary results that are easy to exploit for future applications.

One may wonder why it takes so much time to run so few iterations. It can be noticed that the plot was done with the state-space and the action-space described previously, i.e. 238 states and 14 actions. But the sizes of the state-space and the action-space do not actually affect the running time, only the number of iterations needed to reach convergence. So to understand why one iteration needs more than one second to run, this iteration must be decomposed. In one iteration, the agent takes information from each UE used to learn, in this particular case three. And for each UE, a whole epoch is processed, that is 596 time steps. So in one iteration, the Q-table is updated almost 1800 times in this case. In fact, each update of the action-value table takes less than a millisecond, which is a reasonable value.


Figure 4.1: Histogram of the time needed to run ten Q-Learning algorithms

4.2 Proof of learning

As described in Section 3.4, two visualisation tools have been developed in order to observe whether the agent has actually learnt through all the iterations of the process. For the following graphs, the first five UEs have been used to make the agent learn its environment. In Figure 4.2, the cumulative sum of the reward is plotted for the first 2000 epochs of the learning. The expectation was a curve decreasing quickly at the beginning but having a more or less constant slope after a certain number of iterations (see Figure 3.5). The trend is visible, but the slope does not change drastically. The asymptote has been plotted to show that the slope is constant after a certain number of epochs.

Figure 4.2: Cumulative sum of the reward while learning with five UEs


This weak change in the slope can be explained by the values taken by the sum of the reward in one epoch. A badly rewarding state can be for example -150dBm, while a highly rewarding state can be -40dBm. When summed over a whole epoch, that is in this case 596 values of reward, even if several badly rewarding states occur, the difference in cumulative sum is quite low in percentage. For example, suppose that two epochs have a mean reward of -50dBm on all the values but ten. These ten values are equal to -40 for one epoch and equal to -140 for the other epoch. Then the cumulative sum of the reward for the first epoch is −50 · 586 − 40 · 10 = −29700 and the cumulative sum of the reward for the second epoch is −50 · 586 − 140 · 10 = −30700. The difference in percentage is (30700 − 29700)/29700 · 100 ≈ 3.4%, which is not significant.

To confirm that the agent has actually learnt, the second proposed visualisation is to plot the mean reward per epoch (see Figure 4.3).

Figure 4.3: Mean reward per epoch while learning with the first five UEs

The trend is clearer this time. At the beginning of the learning, the mean rewards are low, but they increase and become stable after a certain number of epochs. So the conclusion is that the agent has learnt to avoid badly rewarding states. The reason why there is such a wide fluctuation in the mean reward between two consecutive epochs is that the UEs used to learn do not have the same range of RSRP values, i.e. different ranges of reward values. This is because they are not located in the same place and have different serving beams, so the received signal is not equally strong.

4.3 Convergence

As explained in Section 3.2, the convergence is examined. To analyse whether the algorithm has converged, the MSE is plotted, as in Figure 4.4, for which the learning has been made with the first five UEs. The Q-tables have been saved every 60 iterations, and two consecutive Q-tables are used in Equation 3.4. The graph is split into two parts for clarity. The graph on the left shows the first 99 values of the computed MSE, while the graph on the right shows the last values. In orange on the right-hand graph is the curve obtained from a linear regression on the last values of the MSE. In this case, the slope is very small, which suggests that the MSE is globally not evolving anymore. The MSE is not exactly zero because the reward is stochastic, due to a changing RSRP; for this reason, the MSE will never tend to zero.


Figure 4.4: Evolution of the mean squared error through the number of iterations

4.4 Penalising handover

In order to find the optimal penalty factor, the test has been conducted on the first three UEs, with the fourth UE used to check the efficiency of the model. It appears that the optimal penalty factor is around 25 (see Figure 4.5), because it gives a quite good trade-off between mean difference and number of handovers. Below this value, increasing the penalty factor decreases both the mean difference and the number of handovers. Above it, increasing the penalty factor still decreases the number of handovers, but the mean difference starts to increase, because the agent waits for several consecutive really low RSRPs before performing a handover. With a low penalty factor, a lot of handovers are performed, despite a mean difference which is not always really good, especially on the test UE. With a high penalty factor, the number of handovers is rather low, but the test UE has a mean RSRP difference of almost 15, which is rather high. Thus, the best trade-off seems to be a penalty factor of 25.

Figure 4.5: Test with a penalty factor of 25

Figure 4.6: Test with a penalty factor of 50

This choice is really subjective and is subject to discussion. For instance, the difference with a penalty factor of 50 (see Figure 4.6) is really slight. In the case of the highest penalty


factor, the mean difference of RSRP is a little higher for the learning UEs, and the number of handovers is hardly better. Moreover, the outcome can differ depending on the UEs used to make the agent learn. Anyway, it is always relevant to use a non-zero value for this penalty factor, for all the reasons mentioned in Section 3.5.

4.5 Two actions

As explained in Section 3.5, the action-space has been reduced to two actions. The Q-table resulting from the learning with the first three UEs has been used to test the behaviour of the agent (see Figure 4.7).

Figure 4.7: Q-learning algorithm using two actions

On the previous plot, the performance is rather bad on the testing UE, because a lot of measurements were made without leading to a handover. It is also the case for the second UE. In contrast, the first UE does not perform any measurement, although the RSRP is much lower than the optimum. Given these contrasting observations, it is quite hard to decide whether 50 is too high or too low as a penalty factor.

4.6 Heatmaps

The presented Q-table (see Figure 4.8) is obtained using the first three UEs for learning, as are all the heatmaps in this section.


Figure 4.8: Heatmap representation of a Q-table

One can see that some nodes are more likely to be a target node than others. For example, action 3, representing "perform a handover to node 3", is seldom shown in a light colour, synonymous with a high Q-value. The highest Q-value in a row indicates which action is the best to take at this state. The grey colour indicates that the state has not been visited during the learning. And it appears that a lot of states are never visited (see Figure 4.9). The reason is that the few UEs used for learning did not face all the possible state-action pairs during the learning. In fact, an RSRP of -180dBm is a really extreme and rare case, so it is not surprising to see that this case was not encountered.

Figure 4.9: Heatmap representation of the number of times each state-action pair has been met while learning


The heatmap in Figure 4.9 shows that the state-action pairs are visited a very uneven number of times. If a state-action pair is visited only a small number of times, it probably means that it has been chosen randomly through the ε-greedy policy. Thus, selecting an action shown in a dark colour for a certain state is likely to give a bad reward.

Figure 4.10: Mean reward for each state-action pair

Figure 4.11: Standard deviation of the reward for each state-action pair

The mean reward (see Figure 4.10) gives the same idea as the Q-table: it is a particular case of the Q-learning algorithm with γ = 0. The additional information is that one can see the RSRP that can be obtained directly after taking an action at a certain state. The lighter the colour, the higher the reward. The standard deviation of the reward (see Figure 4.11) tells how


far from the mean the RSRP is likely to be. If the colour is dark, then the reward obtained for this state-action pair is close to the mean reward observed in Figure 4.10.

4.7 Contextual information

Overall, the results are quite satisfying when using five UEs in each category.

Figure 4.12: Testing indoor UEs

Figure 4.13: Testing car UEs

Figure 4.14: Testing walking UEs


The indoor UEs used for learning give rather good results when tested, but UEs 3 and 12 need a large number of handovers to achieve a good mean RSRP difference. Moreover, the test UE, not used for learning, has both a high mean RSRP difference and a high number of handovers (see Figure 4.12). The car UEs give a decent mean difference of RSRP for the UEs used to learn, and none needs a lot of handovers to achieve this result (see Figure 4.13). Finally, the testing on walking UEs performs correctly, even if several UEs need many handovers to get these results (see Figure 4.14). UE13 performs a handover at every time-step, which indicates poor learning.

For every category, the test UE does not perform as well as the other UEs. This is because five UEs moving for one minute do not have enough time to cover the whole map. Some cases are therefore never encountered, and some nodes may never be used during learning. That is why using the learnt Q-table on the test UE may lead to a bad mean RSRP difference and a high number of handovers: it can present new situations that the agent does not know how to react to.

Given the overall satisfying results of the use of contextual information, it has been used in all subsequent experiments. The most significant results follow.

Number of UEs used for learning

Here, the learning has been made for each category using one, five or ten UEs to make the agent learn its environment. What can be expected is that the more UEs are used, the better the prediction will be for UEs that have not been used to make the agent learn. The indoor UEs are considered to see a possible difference. It should be noticed that not all the categories are shown in this section. However, the plots that are presented are representative of all the categories.

Figure 4.15: Learning with 1 UE

Figure 4.16: Learning with 5 UEs


Figure 4.17: Learning with 10 UEs

Figure 4.15 is the result of the test when the first indoor UE is used for learning. The five other UEs are test UEs. Figure 4.16 is the result of the test when the first five indoor UEs are used for learning. The first UE performs worse than when learning with only one UE, but the others perform better, which shows that learning with more UEs helps to encounter new situations. Figure 4.17 is the result of the test when the first ten indoor UEs are used for learning. Some UEs perform really well, like UEs 6, 18 and 30. In general, the mean RSRP differences remain quite low, but some UEs need a lot of handovers to achieve these results. However, among the first five UEs, in comparison with Figure 4.16, some mean differences are not improved (e.g. UE3, UE15).

Precision of the rounding

In order to get closer to the continuous case, the approximation of the RSRP has been sharpened, from rounding to the nearest ten to rounding to the nearest unit. This time, the results of the walking UEs are looked into. In both cases, the first five walking UEs have been used for learning, and the test UE is the same, i.e. UE 16.

Figure 4.18: Precision of 10

Figure 4.19: Precision of 1


As one can see, for the UEs used to learn, the mean difference in RSRP is smaller in the case of a precision of 1 (see Figure 4.19) than in the case of a precision of 10 (see Figure 4.18). The number of handovers is also smaller. However, for the last UE, which was not used for the learning, there is no significant difference. This assessment holds in general, especially when looking at the case where ten UEs were used for learning. Most of the time, there is a noticeable change in the mean difference and the number of handovers in favour of the more precise approximation, but in a few cases the mean difference and/or the number of handovers does not decrease when improving the precision of the RSRP. Overall, using a finer rounding of the RSRP improves the results. This is an incentive to use a continuous method based on action-value function approximation.

Adding features

Finally, as explained in Section 3.5, two features were tested to try to improve the state: the direction of the UE and the angle between the UE and the serving BS. Consider now the tests on the car UEs. In every case the learning has been made with the first five car UEs, and the sixth car UE is used as a test UE. The precision of the RSRP is back to 10 in the following plots.

Figure 4.20: No feature added for learning

Figure 4.21: Direction used for learning

Figure 4.22: Angle between UE and BS used for learning


As one can see on the graphs above, the direction of the UE as a feature completely fails to improve the efficiency of the algorithm (see Figure 4.21). However, the learning using the angle between UE and BS (see Figure 4.22) gives mixed results. Some UEs have a better mean difference, and some a worse one, compared to the case using no additional feature (see Figure 4.20). The same holds for the number of handovers: some UEs are better, some worse. This is a general observation for all the categories, even if in some cases the better performance of the angle is more visible. In any case, the values are quite close to each other.

Against all odds, when the precision of the RSRP rounding is set to 1, the learning without any additional feature performs best, despite an improvement for the agent learning with the angle. Overall, this makes the angle an interesting feature that should at least be tried in the continuous case.

4.8 Artificial neural networks and reinforcement learning

As already discussed in Section 3.3, the ANNs tend to take a lot of time to run. The second configuration of ANN (see Figure 3.4) is trained for 2000 iterations, using the first three car UEs to learn. Thus, there are three epochs of learning per iteration. After the learning, the ANN is tested in the same way as the Q-tables, as explained in Section 3.4. The result is shown in Figure 4.23.

Figure 4.23: ANNs Q-Learning testing

According to the graphs in the right-hand column, it seems that the ANN points to the same node whatever the input, and therefore never performs a handover.

The logL, described in Section 3.3, has been computed during the training of the ANN, leading to the results in Figure 4.24. It decreases really quickly, and then stays low during the rest of the training. For more readability, only the first 1000 iterations have been plotted, but the trend seen between iterations 200 and 1000 continues to the end of the 2000 iterations.


Figure 4.24: log-likelihood of the ANN

These observations are not only valid for the car UEs; they also hold for the other categories of UEs, as well as for different numbers of UEs used for training.

Some inputs are chosen to run the trained ANN, and to compare the outputs with the values obtained using the usual Q-learning algorithm. The two inputs are: an RSRP of -80dBm spread by node 8, and an RSRP of -130dBm received from node 1. The heatmap in Figure 4.25 shows that all the action-values are between -450 and -750. However, the outputs of the ANN for the two proposed inputs are all between -100 and -190 (see Figure 4.26). This huge difference tends to indicate that the learning is not complete, despite the result shown in Figure 4.24.

Figure 4.25: Heatmap of the Q-table after learning with the first three car UEs

Figure 4.26: Some outputs from the ANN trained with the first three car UEs


5 Discussion

The goal of this chapter is to discuss this work, some extensions that have been tried, how it could be continued, and so on.

5.1 The work in a wider context

This work has been conducted in order to give preliminary results, with the hope of being able to extend the method to a whole radio access network broadcasting data to numerous UEs simultaneously. It appears that the computation times did not allow dealing with a lot of UEs. From this observation, it is genuinely hard to say whether the results obtained can be scaled up or not. On the one hand, having decent results with five UEs does not mean that learning with 500 UEs would give the same results. On the other hand, because of the size of the simulated network, using many UEs would guarantee a good coverage of the area, and then the results would probably have been reliable for the whole network. Hence, using the learnt action-value function on test UEs would probably have led to satisfying results.

Moreover, learning with a lot of UEs would have increased the computation time correspondingly. For a real-world problem, this could be an issue, even with much more powerful hardware available. The best way would be to learn during a long period of time from a simulated network reproducing the real one, then use the results on the real network, and improve these results with real data. The drawback is that if the simulated UEs do not perfectly reproduce the real ones, the learnt action-value function would lead to a disastrous use of the network.

Actually, it looks unrealistic to extend the use of the table-based Q-learning algorithm to real-world problems. The number of BSs in a network would induce enormous state and action spaces. A solution in this case could be to divide the network into subnetworks, but the problem raised by this solution is when a UE needs to move from one subnetwork to another. This problem is limited with function approximation using ANNs: while the action-space would be the same in both methods, the state-space is reduced a lot because the rounding of the RSRP is no longer needed.

Unfortunately, the results obtained using function approximation do not even get close to the ones obtained with a submersible vehicle ([9]) or with the Atari games ([12]), even if the same method is used. Gaskett et al. [9] also used feedforward neural networks to


approximate the action-value function in the case of a continuous state-space. Their wire fitting method was only used to handle a continuous action-space, so it was not needed in this work. Mnih et al. [12] used deep neural networks combined with batch training to train their agent. Using batch training saved computation time, but the results are not satisfying either. A possible reason why the action-value function always points to the same node (see Figure 4.23) is that the network evolves really slowly. In fact, the weights change by really small steps because two consecutive datasets used for training are close to each other. An idea for future work would be to use experience replay when training the ANN. That is, some data already used for training, including the target, are stored, and from time to time they are given to the ANN again. The goal is to break series of similar inputs, for example when there is no handover and the RSRP does not change a lot.
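A minimal sketch of such a replay buffer; the capacity, batch size and uniform sampling below are assumptions for illustration, not part of the thesis:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, network_input, target):
        # Keep already-used training pairs so that they can be replayed later.
        self.buffer.append((network_input, target))

    def sample(self, batch_size=32):
        # Return a random mix of old experiences to break series of similar inputs.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))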

Another solution that was considered for training the ANN was to set all the targets to 0, except the one of the selected action, which would be set to 1. The ANN output would then directly give the action to take, without any consideration of the value of the action-value function. This is also an idea that could be experimented with in future work.

5.2 General observations

Some grey areas may remain concerning some choices made during this work, so this section tries to clarify them.

First, consider the choice of the comparative case for the visualisation tools (see Section 3.4). The best possible RSRP is a very hypothetical case that would probably not be the norm in reality. In fact, reaching the best possible RSRP can require a lot of handovers, which is not the goal. However, reaching a difference of RSRP of only some dBs with respect to the optimal RSRP while performing only a few handovers is the hoped-for result. That is why visualising the difference in RSRP is a good indicator, while comparing the number of handovers to the optimal case is not so interesting; the goal is simply to keep this number as small as possible. Another baseline could have been to perform a handover only when the RSRP becomes smaller than a certain threshold. In this case, it would have been possible to perform better than this baseline, both in difference of RSRP and in number of handovers. But since the 5G network is not deployed yet, there is no real case to compare with.

Also, the measure of performance introduced in Section 3.4 and used in Section 4.2 may raise questions. In fact, the agent gets a reward for each action taken, but the visualisation tool uses either the sum of the rewards over an epoch or the mean reward per epoch. There are several reasons for this choice. Since the sum and the mean are proportional, the explanation for both cases is related, so the following comments only concern the mean reward per epoch. The first and main reason is that there is an exploration approach, which makes the agent visit some badly rewarding states randomly. From this observation, how can one know whether a bad reward comes from a wrong choice of the agent or from a random choice of an action? The advantage of the mean reward is that it absorbs the randomly reached states, because it can be considered that the mean of the rewards from the random states is equal in each epoch, according to the Law of Large Numbers. Thus, what appears in Figures 4.2 and 4.3 is similar to the same graphs excluding the rewards obtained after taking a random action. The result is that the badly rewarding states lower the mean reward per epoch at the beginning of the learning, and the fact that the mean reward per epoch increases shows that the states that should be avoided were actually avoided. The second reason for choosing the mean reward per epoch is purely visual. It is already not really practical to distinguish the different epochs in Figure 4.3, so having 596 points instead of one would have resulted in an unreadable graph.


It is also worth mentioning some experiments that have been made but not reported yet. First, the reward used in this report has always been the RSRP from the target node. However, some attempts have been made with a different reward, namely the difference between the RSRP of the target node and the RSRP of the current node. This induces both positive and negative rewards, and thus the slope of the cumulative sum of the reward should tend to 0. But there was no significant difference in the learning efficiency between the two rewards.

Otherwise, the table-based version of the Q-learning algorithm has been run using the first 30 UEs to learn, without taking the category of the UE into account. The results were mixed, some UEs performing really well, while some others had a high RSRP difference and many handovers. So despite a better coverage of the area, the results were not satisfying. In fact, two UEs used to learn can provide contradictory information to the learner. As explained in Section 3.5, with the current MDP, two identical states should not always be followed by the same action. So the learner is probably overwriting what it has already learnt when it faces a different case. That is why, when testing, some UEs have a really good mean RSRP difference and some others have a really poor one.

Many different options are available to make the action and state spaces evolve. The ones used have been chosen for their relatively small size and because they appeared pertinent for answering the problem. Setting size considerations aside, the final goal is to use beams instead of nodes for both actions and states. Moreover, all actions imply measuring all the beams. Since this would not be a viable solution for a concrete application, it would be interesting to incorporate the results found by Ekman [7], namely measuring only the 20 beams that are the most likely to provide the best possible RSRP.

As a last thought, regarding contextual information, it may not be very relevant to create different Q-tables, one for each category, if in the real world the category of the UE is not known. And this is actually the case, especially because the category of a UE can evolve with time. Fortunately, this category is not too complicated to estimate with some simple classification method: a car UE can be detected through its speed, even if it is approximated; the position can help to determine whether a UE is inside or outside; the altitude may indicate that a UE is not on the ground floor, so in a building, etc. This could be done in many different ways, from a simple logistic regression to support vector machines, by way of random forests or naive Bayes classifiers.

Finally, it can be worrying to know that information about UEs, such as the position, can be used, because the privacy of the owner of the UE can be called into question. However, the UE is not supposed to be recognised by the network. That is, when a UE connects to the network, a number is given to this UE in order to communicate with it. But this number does not depend on the UE, and is different each time the UE connects to the network. Moreover, the model of the UE cannot be recognised either. For all these reasons, the network cannot make the link between a UE and its user.


6 Conclusion

The 5G handover problem is really contemporary, since it is part of the whole 5G development project. In this master thesis project, reinforcement learning methods have been used to try to find the best possible trade-off between the strength of the signal received by the user equipment and the energy consumption of the base stations. After using the classical Q-learning algorithm in order to become familiar with it and to find interesting features and behaviours, an attempt was made to combine the Q-learning algorithm with artificial neural networks. The main lessons are that contextual information, such as the category of the user equipment, improves the performance. Moreover, the angle between the user equipment and the serving base station is a feature whose interest needs to be confirmed with the continuous Q-learning algorithm using neural networks. Another indication that a Q-learning algorithm using artificial neural networks should be used is the fact that rounding the strength of the signal to the nearest unit instead of to the nearest ten improves the results. Unfortunately, due to a wide map and lengthy computation times, the Q-learning algorithm was not able to predict accurately which beam would be the best to switch to.

The Q-learning algorithm is quite robust and has proved its value in many practical cases. It is also suitable for the 5G handover problem, even if its classical, table-based version is limited. To be able to use more complex state and action spaces, the Q-learning algorithm associated with artificial neural networks seems able to address the 5G handover problem. Indeed, the strength of the signal being a continuous variable, approximating the action-value function looks necessary.

It has been found that using the rounded strength of the signal, the serving node and a simplification of the angle between the user equipment and the base stations as the state-space performs satisfactorily. Setting time considerations aside, this suggests that using the strength of the signal, the serving beam and the angle would be ideal within the frame of the Q-learning algorithm combined with artificial neural networks.

To conclude, the Q-learning algorithm is a good tool to start with in order to address the 5G handover problem, but in its original form it is not powerful enough to find the optimal trade-off between signal quality, number of measurements and number of handovers. A Q-learning algorithm combined with artificial neural networks could be a solution instead. With powerful hardware, reinforcement learning seems to be able to solve the 5G handover problem.


7 Appendix

This appendix presents late results that are of interest for pursuing this work later.

7.1 Reducing the amount of data used for learning

The data points are quite close in time in the time series of the RSRP for each UE. In 0.1 second, a UE does not have much time to move in space, and there is no huge difference in the RSRP. In order to decrease the learning time, the idea is to use the data collected every second instead of every tenth of a second. Thus, it is possible to use many more UEs. For each category, 33 UEs have been used for learning with these reduced data. Two results are presented: the case of the walking UEs with a precision of 10, and the case of the car UEs with a precision of 1. The following plots show the results when using the learnt Q-table to test the efficiency of the learning. It should be noted that while the learning uses one data point every second, the testing is made with the complete data, that is with one data point every tenth of a second.

Figure 7.1: Distribution of the mean RSRP difference while learning with 33 walking UEs

Figure 7.2: Distribution of the number of handovers while learning with 33 walking UEs


Figures 7.1 and 7.2 show respectively the distribution of the mean difference of RSRP and the distribution of the number of handovers when learning with the first 33 walking UEs. Overall, the mean difference is quite satisfying, because only five UEs used to learn have a mean difference over 10dB. Regarding the handovers, more than half of the UEs need fewer than 50 handovers.

Figure 7.3: Distribution of the mean RSRP difference while learning with 33 car UEs

Figure 7.4: Distribution of the number of handovers while learning with 33 car UEs

Figures 7.3 and 7.4 show the same information as the two previous plots, but this time the first 33 car UEs have been used for learning, and the precision when rounding the RSRP is 1 instead of 10. This time, only one UE has a mean difference of RSRP over 10, and the number of handovers needed to achieve such low mean differences is never higher than 140.

Figure 7.5: Testing on walking UEs

Figure 7.6: Testing on car UEs

Now, the two Q-tables learnt in this chapter are used to test the UEs not used for learning. For the walking UEs (see Figure 7.5), the results are mixed: three UEs behave quite well, both in terms of mean RSRP difference and of number of handovers. However, the two other UEs perform badly in both domains. Regarding the test on the car UEs (see Figure 7.6), the results are more homogeneous. There is no very good performance, but no really bad one either.

The differences can be explained. In the case of the walking UEs, the distance traveled in one minute is quite short, so there are still some parts of the map that are probably not covered with 33 UEs. Moreover, the places that have been visited are well known by the agent, due to the fact that two consecutive states may be relatively close. This is because of the short distance between two measurements: the RSRP is less likely to change a lot, and the serving BS has a higher probability of staying the same. So the states that are known by the agent are handled well, while the unknown states lead to bad results. On the other hand, the car UEs cover most of the possible states on the map. So there is no surprise for the


agent when it comes to using the learnt Q-table. However, the distance traveled in one second can be quite large, and thus the difference between two consecutive states can also be large. That is why all the test UEs give decent results: all the states are known, but the large difference between two consecutive states during the learning leads to approximate results. This large difference also means that the chosen precision in the rounding of the RSRP is too fine, and in this case the results are better with a precision of 10.

To conclude this part, there is probably a good middle ground to be found between the time between two data points and the number of UEs used for learning. This method can lead to decent results, as for the car UEs, while not increasing the computation time.

7.2 Parameters of the different algorithms

Figure | Nb of UEs | Type of UEs | Nb of iterations | Nb of repetitions | Precision | Feature in the state-space | Penalty
4.1 | 3 | All | 10,000 | 10 | 10 | No | 0
4.2, 4.3 | 5 | All | 10,000 | 1 | 10 | No | 0
4.5 | 3 | All | 10,000 | 1 | 10 | No | 25
4.6 | 3 | All | 10,000 | 1 | 10 | No | 50
4.7 | 3 | All | 10,000 | 1 | 10 | No | 0
4.8, 4.9, 4.10, 4.11 | 3 | All | 10,000 | 1 | 10 | No | 0
4.4 | 5 | All | 30,000 | 1 | 10 | No | 0
4.12 | 5 | Indoor | 10,000 | 20 | 10 | No | 0
4.13 | 5 | Car | 10,000 | 20 | 10 | No | 0
4.14 | 5 | Walk | 10,000 | 20 | 10 | No | 0
4.15 | 1 | Indoor | 10,000 | 20 | 10 | No | 0
4.16 | 5 | Indoor | 10,000 | 20 | 10 | No | 0
4.17 | 10 | Indoor | 10,000 | 20 | 10 | No | 0
4.19 | 5 | Walk | 10,000 | 20 | 1 | No | 0
4.18 | 5 | Walk | 10,000 | 20 | 10 | No | 0
4.20 | 5 | Car | 10,000 | 20 | 10 | No | 0
4.21 | 5 | Car | 10,000 | 20 | 10 | Direction | 0
4.22 | 5 | Car | 10,000 | 20 | 10 | Angle | 0
4.25 | 3 | Car | 10,000 | 1 | 10 | No | 25

Table 7.1: Table-based Q-learning algorithm parameters

Figure | Nb of UEs | Type of UEs | Nb of iterations | Nb of repetitions | Feature in the state-space | Penalty
4.23, 4.24, 4.26 | 3 | Car | 2,000 | 1 | No | 25

Table 7.2: Parameters of the Q-learning algorithm using ANNs


7.3 Results tables

Indoor UEs (Figure 4.12)
UE index | 1 | 4 | 7 | 10 | 13 | Test
Confidence interval lower limit | 4,127 | 4,849 | 4,470 | 3,635 | 6,785 | 10,388
Mean | 4,775 | 5,420 | 5,115 | 4,124 | 7,489 | 11,332
Confidence interval upper limit | 5,528 | 6,093 | 5,873 | 4,688 | 8,286 | 12,332
Number of handovers | 109 | 55 | 166 | 40 | 28 | 176

Car UEs (Figure 4.13)
UE index | 2 | 5 | 8 | 11 | 14 | Test
Confidence interval lower limit | 4,576 | 0 | 0,726 | 4,791 | 0,039 | 24,976
Mean | 5,067 | 0,087 | 0,878 | 5,766 | 0,220 | 26,124
Confidence interval upper limit | 5,670 | 0,260 | 1,216 | 6,929 | 0,712 | 27,279
Number of handovers | 365 | 4 | 3 | 212 | 5 | 515

Walking UEs (Figure 4.14)
UE index | 3 | 6 | 9 | 12 | 15 | Test
Confidence interval lower limit | 0,915 | 5,734 | 5,158 | 46,650 | 48,805 | 19,831
Mean | 1,171 | 6,718 | 5,878 | 48,381 | 50,297 | 21,089
Confidence interval upper limit | 1,547 | 7,791 | 6,750 | 50,113 | 51,842 | 22,417
Number of handovers | 104 | 146 | 44 | 559 | 562 | 412

Table 7.3: Results for each category with contextual information

UE | UE3 | | | UE6 | | | UE9 | |
Figure | 4.15 | 4.16 | 4.17 | 4.15 | 4.16 | 4.17 | 4.15 | 4.16 | 4.17
Confidence interval lower limit | 0,915 | 4,576 | 9,494 | 5,734 | 0 | 0 | 5,158 | 0,726 | 0,839
Mean | 1,171 | 5,067 | 9,853 | 7,791 | 0,087 | 0 | 5,878 | 0,878 | 1,056
Confidence interval upper limit | 1,547 | 5,670 | 10,194 | 6,718 | 0,260 | 0 | 6,750 | 1,216 | 1,447
Number of handovers | 104 | 365 | 35 | 146 | 4 | 2 | 44 | 3 | 5

UE | UE12 | | | UE15 | | | Test | |
Figure | 4.15 | 4.16 | 4.17 | 4.15 | 4.16 | 4.17 | 4.15 | 4.16 | 4.17
Confidence interval lower limit | 46,650 | 4,791 | 1,828 | 48,805 | 0,039 | 22,151 | 19,831 | 24,976 | 0,984
Mean | 48,381 | 5,766 | 2,142 | 50,297 | 0,220 | 22,477 | 21,089 | 26,125 | 1,170
Confidence interval upper limit | 50,113 | 6,929 | 2,670 | 51,842 | 0,712 | 22,855 | 22,417 | 27,279 | 1,398
Number of handovers | 559 | 212 | 12 | 562 | 5 | 3 | 412 | 515 | 6

Table 7.4: Results while learning with different numbers of UEs


Figure 4.17
UE | UE21 | UE24 | UE27 | UE30 | UE33
Confidence interval lower limit | 3,399 | 3,722 | 5,568 | 1,112 | 18,425
Mean | 3,952 | 4,259 | 6,022 | 1,262 | 19,283
Confidence interval upper limit | 4,589 | 4,936 | 6,524 | 1,430 | 20,162
Number of handovers | 40 | 449 | 176 | 2 | 138

Table 7.5: Results for the seventh to the eleventh UE when learning with 10 UEs

UE | UE1 | | UE4 | | UE7 |
Figure | 4.18 | 4.19 | 4.18 | 4.19 | 4.18 | 4.19
Confidence interval lower limit | 7,229 | 4,743 | 3,465 | 1,325 | 1,660 | 0,748
Mean | 8,138 | 5,453 | 3,982 | 1,620 | 2,028 | 0,872
Confidence interval upper limit | 9,173 | 6,237 | 4,540 | 1,988 | 2,556 | 1,063
Number of handovers | 261 | 194 | 236 | 123 | 113 | 69

UE | UE10 | | UE13 | | Test |
Figure | 4.18 | 4.19 | 4.18 | 4.19 | 4.18 | 4.19
Confidence interval lower limit | 0 | 0 | 6,505 | 2,196 | 17,019 | 11,566
Mean | 0,082 | 0,077 | 7,139 | 2,634 | 18,222 | 12,648
Confidence interval upper limit | 0,246 | 0,239 | 7,794 | 3,153 | 19,465 | 13,761
Number of handovers | 4 | 4 | 494 | 154 | 337 | 226

Table 7.6: Results with different precisions for rounding the RSRP

UE   | Figure | CI lower limit | Mean   | CI upper limit | Number of handovers
UE2  | 4.20   | 4.127          | 4.775  | 5.528          | 109
UE2  | 4.21   | 24.156         | 25.684 | 27.331         | 179
UE2  | 4.22   | 3.285          | 3.816  | 4.466          | 40
UE5  | 4.20   | 4.849          | 5.420  | 6.093          | 55
UE5  | 4.21   | 28.247         | 30.019 | 31.835         | 197
UE5  | 4.22   | 5.969          | 6.681  | 7.575          | 58
UE8  | 4.20   | 4.470          | 5.115  | 5.873          | 166
UE8  | 4.21   | 39.562         | 41.028 | 42.546         | 184
UE8  | 4.22   | 4.307          | 4.932  | 5.670          | 95
UE11 | 4.20   | 3.635          | 4.124  | 4.688          | 40
UE11 | 4.21   | 29.192         | 30.817 | 32.411         | 142
UE11 | 4.22   | 7.150          | 7.860  | 8.675          | 60
UE14 | 4.20   | 6.785          | 7.489  | 8.286          | 28
UE14 | 4.21   | 43.912         | 45.283 | 46.672         | 107
UE14 | 4.22   | 5.245          | 5.826  | 6.472          | 68
Test | 4.20   | 10.388         | 11.332 | 12.324         | 176
Test | 4.21   | 17.019         | 18.222 | 19.465         | 337
Test | 4.22   | 11.566         | 12.648 | 13.761         | 226

Table 7.7: Results when adding new features


7.4 Verification of the consistency

In order to check whether the agent learns differently when the algorithm is run several times, the experiment has been repeated for several configurations of the table-based Q-learning algorithm. The bootstrap 90% confidence interval (see the explanations in Section 3.4) is computed for the mean of the mean difference of RSRP and for the mean of the number of handovers.
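
A minimal sketch of how such an interval can be obtained with the percentile bootstrap is given below; the number of resamples and the repetition values are made up for the example, only the method follows Section 3.4.

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.10, seed=0):
    """Percentile bootstrap (1 - alpha) confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=values.size, replace=True).mean()
                      for _ in range(n_boot)])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, values.mean(), upper

# e.g. the mean difference of RSRP over 20 repetitions of the learning (made-up values)
repetitions = [4.9, 5.1, 4.6, 5.4, 4.8, 5.0, 5.2, 4.7, 5.3, 4.9,
               5.0, 4.8, 5.1, 4.6, 5.2, 4.9, 5.0, 5.1, 4.7, 5.3]
print(bootstrap_ci(repetitions))   # (lower limit, mean, upper limit)
```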

Contextual information

Figure 7.7: Testing indoor UEs


Figure 7.8: Testing car UEs

Figure 7.9: Testing walking UEs

It appears that the 90% confidence intervals are generally narrow, the first indoor UE being perhaps the exception: its confidence intervals are rather wide compared to the mean values of the mean difference of RSRP and of the number of handovers.


Number of UEs used for learning

Figure 7.10: Learning with 1 UE

Figure 7.11: Learning with 5 UEs


Figure 7.12: Learning with 10 UEs

It appears that when using one or five UEs for learning, the 90% confidence intervals are also narrow. However, when learning with ten UEs, the 90% confidence intervals can become quite wide, especially for the mean of the mean difference of RSRP. One reason can be that, with a simple MDP like the one used here, two different UEs can be in similar states and take the same action, yet end up in different next states. As a consequence, the Q-table does not always store the same information from one run to another.
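
The toy example below (not thesis code; all values are made up) makes this point explicit: when the same state-action pair is followed by different rewards for different UEs, the value finally stored in the Q-table depends on the order in which the transitions are seen, so repeated runs can end up with different tables.

```python
from collections import defaultdict

ALPHA = 0.5
# two UEs produce the same (state, action) but different outcomes
transitions = [("s", "a", -80.0), ("s", "a", -100.0)]

def final_q(order, n_sweeps=20):
    """Apply a simple 1-step update (no bootstrap term) repeatedly in the given order."""
    Q = defaultdict(float)
    for _ in range(n_sweeps):
        for s, a, r in order:
            Q[(s, a)] += ALPHA * (r - Q[(s, a)])
    return Q[("s", "a")]

print(final_q(transitions))        # about -93.3: pulled towards the reward seen last
print(final_q(transitions[::-1]))  # about -86.7: a different table for the same data
```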


Precision of the rounding

Figure 7.13: Precision of 10

Figure 7.14: Precision of 1

The 90% confidence intervals are shorter in the more precise case. With a precision of 1 the states are more specific, so fewer different situations are mapped to the same state and the outcome varies less from one run to another.


Adding new features

Figure 7.15: No feature added for learning

Figure 7.16: Direction used for learning


Figure 7.17: Angle between UE and BS used for learning

Despite the really poor results, the consistency is good when the direction of the UE is used for learning. The 90% confidence intervals are wider when the angle is used, but still satisfactory.
