

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/046--SE

TDNet - A Generative Model for Taxi Demand Prediction
TDNet - En Generativ Modell för att Prediktera Taxiefterfrågan

Gustav Svensk

Supervisor: Suejb Memeti
Examiner: Kristian Sandahl

External supervisor: Eero Piitulainen



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

©Gustav Svensk


Abstract

Supplying the right amount of taxis in the right place at the right time is very important for taxi companies. In this paper, the machine learning model Taxi Demand Net (TDNet) is presented, which predicts short-term taxi demand in different zones of a city. It is based on WaveNet, a causal dilated convolutional neural net for time-series generation. TDNet uses historical demand from the last years and transforms features such as time of day, day of week and day of month into 26-hour taxi demand forecasts for all zones in a city. It has been applied to one city in northern Europe and one in South America. In northern Europe, an error of one taxi or less per hour per zone was achieved in 64% of the cases; in South America the number was 40%. In both cities, it beat the SARIMA and stacked ensemble benchmarks. This performance was achieved by tuning the hyperparameters with a Bayesian optimization algorithm. Additionally, weather and holiday features were added as input features in the northern European city, but they did not improve the accuracy of TDNet.


Sammanfattning

Having the right number of taxis in the right place at the right time is very important for taxi companies. This report presents the machine learning model Taxi Demand Net (TDNet), which accurately predicts the short-term demand for taxis in different city zones. It is based on WaveNet, a convolutional neural network with causal and dilated layers for time-series prediction. TDNet uses historical demand from recent years and transforms information such as time of day, day of week and day of month into demand forecasts reaching 26 hours ahead for all zones in a city. The model has been applied to one city in northern Europe and one in South America, achieving an error of one taxi or less in 64% and 40% of the cases, respectively. In both cities it beat the benchmark models, SARIMA and a weighted ensemble. This precision was reached by finding hyperparameters with a Bayesian optimization method. In addition, it was shown that neither weather nor holiday information improves the performance of the model.


Acknowledgments

I would like to thank my supervisors, Suejb Memeti at the university and Eero Piitulainen at Taxicaller, for providing valuable input and supporting me during this semester. I would also like to thank my examiner Kristian Sandahl for answering questions and taking on my thesis. Lastly, I would like to thank everyone at Taxicaller for making me feel welcome.


Contents

Abstract
Sammanfattning
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations

2 Theory
2.1 Taxi Demand
2.2 Contextualizing Machine Learning
2.3 Basics of Supervised Learning
2.4 Artificial Neural Networks
2.5 Training a Network
2.6 Convolutional Neural Networks
2.7 RNN and Sequence to Sequence Models
2.8 WaveNet
2.9 Evaluation Metric
2.10 Hyperparameter Tuning
2.11 Sequential Model-based Optimization
2.12 Tree Parzen Estimator
2.13 SARIMA
2.14 Stacked ensembles
2.15 Mixed Precision
2.16 Method

3 Literature Review
3.1 WaveNet Architectures
3.2 Alternative approaches
3.3 Taxi Demand

4 Method
4.1 Data Description
4.2 Data Cleaning
4.3 Data Splitting
4.4 Data Preprocessing
4.5 Data Exploration
4.6 Model Implementation
4.7 Hyperparameter Tuning
4.8 Evaluation
4.9 Benchmarks
4.10 Feature Importance
4.11 Rounding
4.12 Models trained

5 Empirical Evaluation
5.1 Experimental Setup
5.2 Results NE
5.3 Results SA
5.4 Hyperparameters and Architecture

6 Discussion
6.1 Results NE
6.2 Results SA
6.3 Comparing the Cities
6.4 Method Criticism
6.5 Comparing the Models
6.6 Comparing TDNet to the Literature
6.7 Improving TDNet
6.8 The work in a wider context

7 Conclusion
7.1 Connection to Research Questions
7.2 Future Research

Bibliography
Glossary


List of Figures

2.1 Taxi Demand Example
2.2 Example problem, supervised learning
2.3 Polynomial Curves
2.4 Error plot polynomial curves
2.5 Activation Functions
2.6 Neural Network
2.7 Convolving an image
2.8 A stack of dilated causal convolutional layers
2.9 MNIST images

4.1 True Demand per Hour in SA
4.2 True Demand per Hour in NE
4.3 Zone Demand Distribution in SA
4.4 Zone Demand Distribution in NE
4.5 Building block of TDNet architecture

5.1 Result NE
5.2 Error Distribution of RMSLE in NE
5.3 Error Distribution of RMSLE in NE of Non-zero Demand
5.4 Error Distribution of RMSE in NE
5.5 Zone Error Distribution NE
5.6 Total Demand in NE over Test Period
5.7 Total Demand Benchmarks NE
5.8 Prediction Error per Hour NE
5.9 Train loss
5.10 Results SA
5.11 Zone Error Distribution SA
5.12 Error Distribution of Root Mean Square Error (RMSE) in SA
5.13 Total Demand RMSE Model SA
5.14 Total Demand Benchmarks SA
5.15 Prediction Error per Hour SA

6.1 Comparison of RMSE and Root Mean Square Logarithmic Error (RMSLE)


List of Tables

4.1 Unprocessed Dataframe
4.2 Processed Dataframe
4.3 Hyperparameter domain

5.1 Comparing loss functions
5.2 Found Hyperparameter Values for NE and SA
5.3 Hyperparameters SA
5.4 Hyperparameters NE


1 Introduction

Taxis are a part of the transportation system of most cities and provide a service that takes individuals from point A to point B. Due to heavy regulation of, and restricted entry to, the taxi market, ride-sharing companies such as Uber and Lyft have appeared and successfully competed with the traditional companies [10]. Traditional taxi companies must improve their competitiveness in order to fight for their share of the global taxi market, which amounts to an estimated $108 billion [6]. Simultaneously, large parts of the taxi fleet are expected to be replaced by autonomous vehicles, which will lead to unprecedented changes in the industry when it comes to e.g. cost and business models, service availability and optimal fleet size [6, 31].

These technological developments paint a picture of a competitive market where the current actors will have to adapt to the new circumstances or perish. However, for those who are able to leverage their technology, an opportunity for growth is presented. Solutions which have an impact on the current industry and won't be made obsolete in the near future, such as being able to accurately predict taxi demand, offer a great advantage.

1.1 Motivation

Predicting taxi demand accurately would lead to numerous benefits on several levels. Customers would experience a lower expected wait time, and taxi companies would use their resources more efficiently by regulating the number of taxis. Lastly, the drivers would receive recommendations on where to look for customers, as well as a reduction in time spent roaming and queuing for customers. The task of predicting taxi demand can be divided into two subproblems: short-term or real-time predictions and long-term predictions. Short-term predictions impact the customers and drivers on a day-to-day basis, while long-term predictions are made on a weekly or monthly basis to help resource management and planning.

The taxi industry is, as many other industries, in an era of digital transformation. With increasing access to cheaper smartphones and better wireless mobile telecommunication, the ability to collect and store the large amounts of data generated by each taxi has emerged. Examples of data collected by taxi companies are GPS locations of where the taxi has been at different times, whether the car was occupied by a customer and, if so, when and where the customer was picked up and dropped off.


Multiple studies have used this kind of data and shown that it is possible to forecast taxi demand using different types of machine learning algorithms [34, 59, 12, 55]. A recent example is the study done in August 2018 by Jun Xu et al., which shows that it is possible to accurately predict taxi demand using a sequential learning model based on a recurrent neural network (RNN) and a mixture density network.

The recurrent neural network architecture used by Jun Xu et al. in their research [55] is called a long short-term memory network (LSTM); it was first described in a paper by S. Hochreiter et al. in 1997 [20]. It has since gained popularity and been shown to deliver state-of-the-art results in a variety of fields and problem domains that use sequential data [47, 60, 33]. Networks of this kind are able to accurately model long-term patterns in data. In a large-scale analysis of different LSTM architectures, the forget gate component and the output activation function were noted to be essential for the performance of the algorithm. The forget gate enables the LSTM to reset its own state, and the output activation function is needed to stabilize learning [17].

Due to the structure of the recurrent neural network, the number of trainable parameters is high, which makes it computationally expensive in comparison to feed-forward networks such as the Convolutional Neural Network (CNN). In 2016, researchers at Google DeepMind proposed a novel, fully probabilistic and autoregressive model called WaveNet [49]. It is based on a stack of convolutional layers, thus making it computationally cheaper than recurrent neural networks and specifically LSTMs. Since the initial publication, multiple papers have been released using WaveNet as a base architecture for sequential forecasting [4, 25]. The state-of-the-art results of WaveNet in multiple domains which require time-series forecasting justify applying it to predicting taxi demand.

1.2 Aim

The purpose of this project is to evaluate the performance of an architecture based on WaveNet, Taxi Demand Net (TDNet), when applied to predicting taxi demand. The results will be compared to other time-series forecasting algorithms. This has been achieved through the following steps:

1. Developing a prediction model based on WaveNet

2. Tuning the hyperparameters of the model

3. Evaluating the final performance

4. Comparing performance with alternative models

1.3 Research questions

Based on the motivation and aim, these are the research questions that will be investigated and answered in this report:

1. How accurately is it possible to predict the short-term taxi demand in predefined zonesof a city using TDNet?

a) How should accuracy be measured in this domain?

2. How do the spatial distribution of the demand in the cities, and features other than demand, affect the performance of TDNet when predicting the short-term taxi demand in predefined zones of a city?

3. How well does TDNet perform, compared to other time-series forecasting models, when predicting the short-term taxi demand in predefined zones of a city?


1.4 Delimitations

This study is focused on applying TDNet to the specific application of predicting taxi demand. The data sets used are supplied by the taxi control company TaxiCaller AB and consist of real taxi trips that have been collected over more than two years. The data comes from two of their largest customers; one has its business in northern Europe and the second in South America. They will be referred to as NE (Northern Europe) and SA (South America). The distributions of demand in these cities are the two spatial distributions considered in research question 2. Apart from historical demand, holiday and weather features will be investigated for city NE. Furthermore, the data sets do not represent all of the taxi demand in a given city. This means that the total taxi demand in a city is unknown, as is the market share of the customer. Figuring out the size of the black market, the competitors and the market share of ride-sharing companies wouldn't result in more than imprecise guesswork.


2 Theory

This section provides an explanation of concepts important to the rest of the thesis. The first section describes the problem domain and is followed by a few sections which serve as building blocks for understanding TDNet and the underlying architecture of WaveNet. Thereafter follow sections on hyperparameter tuning and mixed precision, techniques which have been used to improve the performance of the model. Finally, in the method section, studies are presented which motivate the methodology with which this project has been conducted.

2.1 Taxi Demand

The task of TDNet is to predict short-term taxi demand; what is actually meant by taxi demand, as well as what parameters influence it, will be described in this section.

Taxi demand can come from either street hailing or bookings, which are placed through a phone call or in a mobile application. There are three important differences. The first one is that hailing happens spontaneously, while bookings are planned in advance. The second one is that the location of taxi cars influences whether a hailing occurs or not. The last one is that hailing is influenced by the structure of a city's road network.

Consider figure 2.1, where the y-axis represents the combined number of hailings and bookings and the x-axis is the hour of the day. This is a simple example of what the demand could look like for a zone in a city.

Domain experts and scientific literature have both been sources in investigating the parameters which significantly impact taxi demand for an area [57, 59, 15]. The domain expert, who was interviewed specifically for this thesis, was a taxi driver in one of the cities investigated who had also had a manager role. Primary parameters are historical taxi demand and temporal factors, i.e. hour of day, day of week and day of month. Secondary factors include holidays, promotions, sporting events, special occasions, the schedules of nearby public transport, taxi drop-offs, weather and closing times of pubs and night clubs. What some of these secondary factors have in common is that they are indicators of how many people there are in an area. Working under the assumption that the number of mobile network connections is a good approximation of the number of people in an area, Google researchers managed to produce a model which very accurately predicted taxi demand in Tokyo. Unfortunately, this valuable data isn't widely available. [24]


Figure 2.1: Taxi demand over a 24 hour period.

2.2 Contextualizing Machine Learning

Artificial intelligence (AI) is a field which concerns itself with building intelligent entities. There are many different approaches to accomplishing this task. One example is the symbolic approach, where sets of hard-coded rules, logic and search algorithms are combined to solve problems. Another example is the Bayesian approach, where probability distributions are used to reason with uncertainty. Using optimization algorithms and domain data, the most likely conditional dependencies of the variables in e.g. a probabilistic graph can then be calculated to generate a model which is able to infer the probability functions of unobserved variables. [46]

The approach which has made the greatest progress during the last decade is machine learning. It tries to solve problems without being explicitly programmed to solve them [27]. This is achieved by a model through first learning patterns in the data, in what is known as the training phase. Afterwards, the model is able to determine how to handle new, unseen data by looking for the learned patterns in it.

To successfully make use of machine learning, there are four main factors to take into account, namely data quantity, data quality, computational power and model [16]. Since machine learning requires learning patterns from data, how well a model is able to do that largely depends on the quantity but also the quality of the data. The data is an approximation of reality and it is all that is available to the model; if it doesn't represent reality precisely enough or doesn't cover all possible scenarios, the model won't either. If the model is too complex for the amount of data available, the parameters of the model won't converge, i.e. learn the patterns. If there is enough data available but the quality is too low, i.e. there are outliers, null entries, or incorrect or invalid values, the model might start displaying unwanted behaviour or fail to identify the patterns that actually exist.

For advanced tasks, advanced models are required, and these are often computationally expensive to train. They might need multiple GPUs, or even processing units made specifically for deep learning, tensor processing units (TPU), to run for weeks [48, 23]. Finally, a suitable model has to be chosen for the problem at hand. There are multiple domains which have come to be dominated by machine learning; a few examples are computer vision, machine translation, the game Go, targeted advertising and text-to-speech [16]. For each of these domains there are specific types of model architectures which have proven to deliver results.

How the four main factors of machine learning have evolved over time also explains why machine learning has made such substantial progress lately. The amount of data produced by our society increases rapidly. Moore's law illustrates how computational power has increased exponentially over the last decades, and even though that era is about to come to an end, the already mentioned TPUs as well as data centers stacked with GPUs are ready to step in [52]. Lastly, machine learning as a field has gone through a research boom and has wildly increased in popularity over the last decade as more and more state-of-the-art results have been presented in several domains [16].

A subfield of machine learning known as supervised learning is defined by how it requires labeled data, i.e. a mapping between the input and output data, to learn patterns. Formally, given an input-output pair (x, y), a supervised learning model tries to learn f(x) = y. Classification is one of two types of supervised learning; the task when classifying is to assign a class to the output based on the input. The other type of supervised learning is regression, which is when the output is a numerical value that isn't a class or category. The domain of taxi demand is an example of regression in that the sought output is an integer which represents a quantity.

2.3 Basics of Supervised Learning

To provide an example of a supervised learning problem and to introduce some key concepts, consider the following problem: construct a polynomial function which approximates the function y, where x is a uniformly distributed vector and ε is normally distributed noise, ε ∼ N(0, 1).

y(x) = cos(πx) + ε (2.1)

Figure 2.2: Datapoints sampled from y(x) in equation 2.1. The noise has resulted in one outlier at x = 1.8.

The vector x ranges from 0 to π with N = 11 observations and has the values shown in figure 2.2. A polynomial function of degree M is defined as the following:

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{i=0}^{M} w_i x^i \qquad (2.2)

For the purpose of quantitatively evaluating how good the approximation is, an error function has to be defined. In this case, an appropriate choice is the RMSE, defined in equation 2.3. It is appropriate since it is simple, frequently used and tries to make each residual as small as possible, since it is sensitive to outliers. For example, two errors of r/2 amount to a smaller RMSE than one error of zero and one error of r. That means that errors concentrated to a few points are punished harder than the same error spread out over more points. Calculating the mean error allows comparing the performance of models on value vectors of different lengths, and taking the root of the mean squared error converts the error to the same unit as the predicted values. If the model's approximation of y is ŷ, then the error is defined as:

\mathrm{RMSE}(\mathbf{w}) = \sqrt{\frac{\sum_{n=1}^{N} (\hat{y}_n(x_n, \mathbf{w}) - y_n)^2}{N}} \qquad (2.3)

A lower error is better, and in the case that RMSE(w) is equal to zero, the approximation ŷ perfectly fits the training points. Thus, the objective now is to find which values of w minimize the error.

By differentiating the quadratic error function, setting the derivative to zero and solving for w, it is possible to obtain an optimal solution. However, the degree of the polynomial function must first be decided, and this impacts the complexity of the model.

Changing the value of M has a large impact on how well the approximation fits the data points and how well the approximation is able to predict unseen values. M isn't learned during the training phase and is more akin to a model setting decided by the creator before the training begins. In the domain of machine learning, M is known as a hyperparameter.

To ensure that the approximation is valid, data points from the same range as x are generated. These data points are previously unseen and constitute the test set. A model that is able to estimate y with a low error for input that it hasn't seen before, i.e. the test set, is said to generalize well.
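This toy problem is small enough to reproduce directly. The following is a minimal sketch in NumPy under the assumptions stated above (N = 11 points, noise from N(0, 1)); the sampled values, and therefore the exact errors, will differ from those shown in the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: N = 11 points from y(x) = cos(pi * x) + noise (equation 2.1),
# with x drawn uniformly from [0, pi].
N = 11
x_train = np.sort(rng.uniform(0.0, np.pi, N))
y_train = np.cos(np.pi * x_train) + rng.normal(0.0, 1.0, N)

# Previously unseen points from the same range form the test set.
x_test = np.sort(rng.uniform(0.0, np.pi, N))
y_test = np.cos(np.pi * x_test) + rng.normal(0.0, 1.0, N)

def rmse(y_pred, y_true):
    """Root mean squared error, equation 2.3."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Fit polynomials of degree M by least squares and compare train/test error,
# mirroring figures 2.3 and 2.4 (M = 10 overfits: low train, high test error).
for M in (0, 5, 10):
    w = np.polyfit(x_train, y_train, deg=M)  # closed-form solution for w
    print(f"M={M:2d}  train RMSE={rmse(np.polyval(w, x_train), y_train):.3f}"
          f"  test RMSE={rmse(np.polyval(w, x_test), y_test):.3f}")
```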

Figure 2.3: A line plot showing how well different polynomial curves fit the training points, as well as the true function which they are trying to approximate, f(x) = cos(πx).

In figure 2.3, ŷ(x_test, w) is displayed for the hyperparameter values M ∈ {0, 5, 10}. For M = 0, the degree of the polynomial is too low and the approximation can't fit the training points; this is known as underfitting. The approximation when M = 5 doesn't fit the data points perfectly but stays close to the true value for all values of x, i.e. it generalizes. For M = 10, the curve perfectly fits the training points, but there are regions such as 0 < x < 0.3 where the approximation takes on values far from the true y. This is known as overfitting and negatively impacts generalization. There are several causes of overfitting, but at its core it occurs because the model is too complex for the training data.

How well a model fits the data can be visualized by plotting the error on the train and test sets for the different values of M. To calculate the test error, each of the models has predicted ŷ(x, w) for the last 10 values of x − 0.1.

Figure 2.4: A scatterplot showing the train and test error of the polynomial fit for different degrees of the polynomial.

As can be seen in figure 2.4, the training error decreases as M is increased. However, this is not the case for the test error, which massively increases for M = 10. This further confirms that M = 5 is the appropriate model to select for this task. To avoid problems such as overfitting, it is possible to add a validation step to the training. To do this, a portion of the training data set is held out in a similar way to the test set. However, the validation set is used repeatedly to evaluate how different sets of hyperparameters affect the performance of the model. This enables the developer to spot how the hyperparameters affect the performance before doing the final evaluation. The validation set, as well as the test set, should only be big enough to allow for a fair evaluation of the model, since more training data helps improve the performance of the model [38].

Normally, when working with a real-life problem, the test set can't be generated but must be held out from the original data set. There are several rules of thumb for splitting the data into the different sets; a couple of years ago, the most common choice was 60% train, 20% validation and 20% test. The important part is that there is enough data in the test and validation categories to enable proper evaluation and not overfit the model; the rest should be used for training to improve the model [7]. However, as mentioned in section 2.2, the amount of available data has increased enormously, and for deep learning tasks, splits such as 90%, 5%, 5% or even 99.5%, 0.25%, 0.25% are not unheard of according to an expert in the field, A. Ng [38].
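For time-series data such as taxi demand, one simple way to realize such a split is to cut the data chronologically, so that no future observations leak into training. The sketch below uses the 90/5/5 proportions mentioned above; both the fractions and the chronological ordering are illustrative assumptions, not prescriptions from this section.

```python
import numpy as np

def train_val_test_split(data, val_frac=0.05, test_frac=0.05):
    """Split an array chronologically into train, validation and test sets."""
    n = len(data)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    train = data[: n - n_val - n_test]
    val = data[n - n_val - n_test : n - n_test]
    test = data[n - n_test :]
    return train, val, test

hourly_demand = np.arange(1000)  # placeholder for hourly demand observations
train, val, test = train_val_test_split(hourly_demand)
print(len(train), len(val), len(test))  # 900 50 50
```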

2.4 Artificial Neural Networks

An artificial neural network (ANN) is a framework to approximate a function, given a set of inputs and outputs. The building blocks of these networks are called nodes.

Nodes

Nodes generate a signal, a, the value of which is determined by the internal state of the node as well as the input to the node. The internal state is composed of a vector of weights, w. These weights are multiplied with the input to the node, x, and a bias, b, is added. This value is then passed through a non-linear activation function, g, which transforms the value and constrains the output to be within a certain range.

a = g(wᵀx + b) (2.4)

An example of an activation function is the sigmoid function. It returns output between 0 and 1 and is defined in equation 2.5 below. Similar to the sigmoid function is tanh, defined in equation 2.6, which returns output between -1 and 1.

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2.5)

\tanh(x) = 2\sigma(2x) - 1 \qquad (2.6)

Both functions are non-linear, which gives a network the powerful property of being able to output functions which aren't just linear combinations of the input. See the left graph in figure 2.5 for a visual representation of the sigmoid activation function.

Figure 2.5: Sigmoid activation function to the left and ReLU to the right.

Feedforward Neural Networks

The network is built from a set of nodes, structured in different layers. The nodes in each layer are interconnected via edges to the nodes in the next layer. Depending on the direction of these connections, the network falls into one of several categories. In the case that these nodes and edges can be represented as a directed acyclic graph, the network is called a feedforward neural network and the layers are called dense layers. In a feedforward neural network, the information can only flow from a previous layer to a later one, not in reverse, and also not from one node in a layer to another node in the same layer.

The first layer, which receives the input data, is called the input layer. The last layer, which produces the output of the network, is called the output layer. All layers in between are referred to as hidden layers; in figure 2.6, the number of hidden layers is two. Together, these stacked layers achieve what a single node is able to do, but they also allow for modelling patterns of much higher complexity than a single node.

A feedforward neural network can be modified to work on input data which varies with time. In that case, the input data for each time step is fed to a dense layer which calculates an output. The weights of the layer aren't updated until the input for all time steps has passed through it. A layer of this kind is called a time distributed dense layer.
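As an illustration, the sketch below builds both kinds of layers in Keras: a plain feedforward network like the one in figure 2.6, and a time distributed dense layer that applies the same dense computation at every time step. The layer sizes and input shapes are arbitrary assumptions, and this is not code from the thesis implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A feedforward network with two hidden dense layers, as in figure 2.6.
mlp = tf.keras.Sequential([
    layers.Input(shape=(8,)),               # 8 input features (arbitrary)
    layers.Dense(16, activation="sigmoid"), # hidden layer 1
    layers.Dense(16, activation="sigmoid"), # hidden layer 2
    layers.Dense(1),                        # output layer, one regression value
])

# The same dense computation applied independently to each of 24 time steps:
# a time distributed dense layer.
tdd = tf.keras.Sequential([
    layers.Input(shape=(24, 8)),            # (time steps, features per step)
    layers.TimeDistributed(layers.Dense(16, activation="relu")),
])
print(mlp.output_shape, tdd.output_shape)   # (None, 1) (None, 24, 16)
```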


Figure 2.6: A feedforward neural network with four dense layers.

2.5 Training a Network

In order to measure the performance of a neural network, a performance metric, J(w), has to be defined. Commonly, a loss function is used which indicates how far the approximation of the network deviates from the training set, i.e. the ground truth of the task. After a prediction has been generated by the network for each of the training samples and the loss has been calculated, an algorithm is needed to identify which nodes of the network contributed to the loss. This is exactly what the back-propagation algorithm does. It propagates through the network using the chain rule for differentiation. In combination with gradient descent, the values of the weights w can be calculated which lead to the greatest decrease of the loss. [16]

Gradient descent is an optimization algorithm which attempts to find the minimum value of a function. It does so by calculating the gradient, i.e. the direction in which the value of the function decreases the fastest. The magnitude of the gradient is equal to the steepness of the slope. Thereafter, the values of the parameters of the function are adjusted according to the direction and size of the gradient. This can be likened to taking a step in the steepest direction of the function surface. The step length is not only regulated by the size of the gradient but also by the so-called learning rate.

In addition to vanilla gradient descent, there are extensions such as stochastic gradient descent, RMSProp and the Adam optimization algorithm which attempt to combat common issues encountered when using gradient descent. A technique which all of these have in common is that they divide the training set into batches. A batch is a subset of the training set which serves as a stochastic approximation of the whole set. Using batches decreases the space required to load the input data into memory and speeds up the computation of the output. Stochastic gradient descent is an extreme example where the batch size is reduced to one. In practice, mini-batches are more commonly used and often contain between 32 and 512 training samples. [16]
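The batching idea can be shown in isolation. Below is a minimal sketch of one epoch of mini-batch gradient descent on a linear least-squares problem; the model and data are made up for illustration.

```python
import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    """Yield shuffled mini-batches, each a stochastic approximation of the set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# One epoch of mini-batch gradient descent on a linear model.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])          # true weights to recover
w, lr = np.zeros(4), 0.1
for Xb, yb in minibatches(X, y):
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)    # gradient of the MSE loss
    w -= lr * grad                               # step against the gradient
print(w.round(2))                                # approaches [1, -2, 0.5, 3]
```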

The Adam Optimizer

The Adam optimizer was first introduced in a paper by D. P. Kingma and J. Lei Ba [26]. The name Adam comes from the term adaptive moment estimation. Starting with the term adaptive, this means that each w_i of w has its own learning rate, i.e. each weight has its own step size when performing gradient descent, and the step size is changed throughout the training. This makes sense, since it would be very unlikely that all the weights in a network would improve the fastest by being multiplied by the same scalar. The term moment estimation describes the way Adam uses exponential averaging of the previous gradients to update w. The moment estimation can be divided into two parts: estimation of the first moment, i.e. the mean of the gradient, and estimation of the second moment, i.e. the uncentered variance of the gradient.

S_t = \beta S_{t-1} + (1 - \beta) Y_t, \quad t > 1 \qquad (2.7)


In the above equation 2.7, which is the general definition of the exponential moving average, t denotes the time step, S_t is the exponential moving average, β is the decay rate and Y_t is the value of the series for which the exponential moving average is being calculated. S_1 is normally set to Y_1. The value of β decides the importance of the most recent value of the time series in comparison to older ones.

In the context of Adam, the exponential moving average is applied in the following way. Let g_t denote the gradient at time step t; then the estimation of the first moment is

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad t > 1 \qquad (2.8)

and the estimation of the second moment is similarly:

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \quad t > 1 \qquad (2.9)

What separates m_t and v_t from the definition seen in equation 2.7 is that they are initialized to 0, i.e. m_0 = v_0 = 0. This initialization introduces a bias which luckily can be remedied by a bias correction.

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad (2.10)

\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad (2.11)

This leads up to the definition of the rule for weight updates:

w_t = w_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \qquad (2.12)

where ε is added to avoid dividing by zero, and the whole rightmost expression is multiplied by the hyperparameter η, which controls the step size. The actual step size when updating a weight has two upper bounds defined by the value of η, which makes it more intuitive than the standard learning rate seen in gradient descent. Furthermore, the moment estimations cause the weights to move in the direction of the previous gradients even if the current gradient is close to zero, which helps the optimizer avoid getting stuck in local optima or around saddle points.
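Equations 2.8 to 2.12 translate almost line by line into code. The following is a sketch of the update rule in NumPy, not a reference implementation; the default constants are those suggested by Kingma and Ba.

```python
import numpy as np

class Adam:
    """Minimal Adam optimizer following equations 2.8-2.12."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = self.v = None   # moment estimates, initialized to 0
        self.t = 0               # time step

    def step(self, w, grad):
        if self.m is None:
            self.m, self.v = np.zeros_like(w), np.zeros_like(w)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad       # eq 2.8
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2    # eq 2.9
        m_hat = self.m / (1 - self.beta1**self.t)                    # eq 2.10
        v_hat = self.v / (1 - self.beta2**self.t)                    # eq 2.11
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)     # eq 2.12

# Minimizing f(w) = sum(w^2), whose gradient is 2w.
opt, w = Adam(lr=0.1), np.array([1.0, -3.0])
for _ in range(200):
    w = opt.step(w, 2 * w)
print(w.round(3))  # close to [0, 0]
```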

2.6 Convolutional Neural Networks

Since the structure of the feedforward neural network assigns one weight to each input variable, the number of weights required for handling large inputs is high. An example domain which illustrates this is image classification. The input variables are the color values of each pixel, and while this might be feasible for a small gray-scale image of, for example, the dimensions 28x28x1, the approach quickly becomes unfeasible when e.g. colored images from a modern camera are to be processed, with a dimension of 3872x2592x3, which amounts to almost 10 million pixels with 3 channels each.

Some data is ordered in a way that conveys additional information, and a feedforward neural network fails to capture this relationship. The previous example of images displays this property in that groups of adjacent pixels form objects. Another example is sound waves, where e.g. the most recent frequencies convey information about what the current frequency might be. A CNN attempts to solve these issues by convolving the input, i.e. passing a filter over it. This causes adjacent input to be interpreted together and reduces the number of parameters in the network. [14]


Convolving an Image

To show an example of convolution, a vanilla convolution will be performed on a 2D image with one color channel. The image has the dimensions 4x4 and the filter has the dimensions 2x2. For simplicity's sake, the value of each of the pixels is either 1, which represents the color black, or 0, which represents white. Filters can be used to look for different kinds of features in an image; a simple example of a low-level feature is an edge.

Figure 2.7: Convolving a 4x4 image with a 2x2 filter produces a 3x3 output. The convolved image has a black top half and a white bottom half, and the weights of the filter are chosen to detect edges.

The first step of the convolution depicted in figure 2.7 is to element-wise multiply the filter with the top-left corner of the matrix, which is marked with a dark blue border. The product of this operation is shown in the top-left cell of the matrix to the right, which also has a dark blue border. The product is calculated as follows: 1 · 1 + 1 · 1 + 1 · (−1) + 1 · (−1) = 0. Afterwards, the filter is moved one step to the right and is element-wise multiplied with the four values marked by a red background. Next, the filter slides one more step to the right to produce the 0 in the top-right corner. After that, the filter is multiplied with the two middle rows of the first and second column, one row below the first convolution marked by the dark blue borders. In this position, the element-wise multiplication results in 1 · 1 + 1 · 1 + 0 · (−1) + 0 · (−1) = 2. The 2 can be spotted to the left in the middle row of the result matrix. [14]

The result matrix can be interpreted as indicating that there is a horizontal edge in the middle of the image, which is true. Another property of the result is that its dimensionality has been reduced in comparison to the original image.
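The walkthrough above can be verified with a few lines of NumPy. The sketch below implements the sliding element-wise multiplication (strictly speaking a cross-correlation, which is what most CNN libraries compute under the name convolution) and reproduces the result matrix of figure 2.7.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 'convolution': slide the filter over the image and sum the
    element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The example from figure 2.7: black (1) top half, white (0) bottom half,
# convolved with a 2x2 horizontal edge-detecting filter.
image = np.array([[1, 1, 1, 1],
                  [1, 1, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
kernel = np.array([[ 1,  1],
                   [-1, -1]])
print(convolve2d(image, kernel))
# [[0. 0. 0.]
#  [2. 2. 2.]   <- the horizontal edge in the middle of the image
#  [0. 0. 0.]]
```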

In a CNN, the values of the filter aren't static; instead, the values are trainable parameters which are continuously updated when training the network. The filters are thereby able to improve at capturing the features which matter the most when determining the output. As mentioned at the start of this section, feedforward neural networks don't scale well to large inputs. The filters, however, contain far fewer parameters and are therefore scalable. In the simple example shown in figure 2.7, the number of parameters required for a feedforward neural network would be 4 · 4 = 16, one for each pixel. The filter only contains four parameters. In more advanced cases, a larger number of features would have to be identified in an image. The solution is then to increase the number of filters as well as adding more layers to the CNN. More filters enable detecting more features, and more layers allow detecting more abstract features. [14, 16]

In the given example, the result matrix describes "an edge in the middle of the image" in fewer dimensions than the original image. One can imagine a CNN trained to perform face detection; in that case, the output of a layer deep into the network might be interpreted as "there is probably a pair of eyes in the top-left region of the image", which is more abstract than edge detection and certainly more abstract than the original pixel input.


2.7 RNN and Sequence to Sequence Models

Sequence to sequence models take an input sequence, x, and turn it into an output sequence, y. Examples of sequence data tasks are speech recognition and sentiment classification. To do speech recognition, a model has to be developed which turns audio waves into written language, e.g. a sentence. An example of sentiment classification is turning a movie review such as "This movie shouldn't have been made" into a rating such as one out of five.

If an artificial neural network contains feedback connections, i.e. nodes which connect to themselves, it is referred to as a recurrent neural network (RNN). This feedback connection enables the network to pass on current information to future states, thus making it useful when modeling sequences like time-series data. An RNN is a good choice for processing sequential data, but inherently it does not support turning an input sequence of one length into an output sequence of a different length. To remedy this, two RNNs with different parameters can be combined: the first one encodes the input to a certain size, and the second one takes the encoded input and decodes it to an output of a different size. Even in this improved state, there are a number of issues with the nature of the architecture, such as a large number of parameters, difficulty learning long-term patterns, and vanishing gradients. [16]

The discovery of the rectified linear unit (ReLU) activation function, equation 2.13, is a great example of an attempt to fix an issue with neural networks, namely vanishing gradients [35]. When using the back-propagation algorithm to calculate the gradient of the loss function for a node, a result will always be produced, since all activation functions must be differentiable. But if the left graph in figure 2.5 is examined, it can be seen that the gradient of the sigmoid activation function is very close to zero for extreme values. The gradient is partly responsible for minimizing the loss function and, given the right circumstances, the higher the value of the gradient, the more progress is made. These gradients close to zero are known as the vanishing gradients problem, and they slow down a network's training. Due to computers having limited precision when representing decimals, the problem is even worse in practice; still, not too long ago, the sigmoid function was widely used in deep neural networks.

f(x) = max(0, x) (2.13)

The right graph in figure 2.5 displays the above equation 2.13, which defines ReLU; it produces a gradient value of 1 for x > 0 and 0 for x ≤ 0. This is an improvement over the sigmoid activation function and suits the discrete nature of computers far better. ReLU is now the most used activation function for deep neural networks. [28]

Apart from the attempts made to combat the issues of RNNs, there have also been attempts at using other neural network architectures on sequential data [30, 1]. One example of this is WaveNet, a one-dimensional CNN modified to forecast time series.

2.8 WaveNet

In the paper where the WaveNet architecture was first described, the authors A. van den Oord et al. showed that it was possible to generate raw audio waves from both text and music using a CNN. The technique of generating audio based on text is known as Text-To-Speech (TTS). Compared to other state-of-the-art techniques, e.g. long short-term memory networks (LSTM), the performance of WaveNet on this task was a major improvement. [49]

The WaveNet model generates raw audio waveforms. In mathematical terms, the joint probability of the generated wave x = x_1, x_2, ..., x_T factorizes as a product of conditional probabilities, where for each time step, x_t is conditioned on the previous time steps, i.e. x_1, x_2, ..., x_{t−1}.

p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}) \qquad (2.14)


To model the conditional probability distribution p(x_t | x_1, x_2, ..., x_{t−1}) in equation 2.14, multiple so-called causal dilated convolutional layers are used. Dilation is a technique where a subset of the previous sequential data is used to model the probability of time step t. For example, x_1, x_2, x_4 and x_8 are used to determine p(x_9 | x_1, x_2, x_4, x_8) instead of all eight previous values [56]. The technique reduces the number of computations and increases the speed with which the CNN converges. A causal convolution means that no future samples, e.g. x_{t+1} or x_T, are used to model p(x_t). This is critical, since a model which relies on knowing the future to predict the present would be useless. Figure 2.8 displays an example of what calculations are required to generate an output. In the figure, the filter size is two and the dilations are set to 2^layer starting from the first hidden layer. No node contributes to the input of a node in a deeper layer more than once, due to the dilation values. The dilations also result in a larger receptive field, i.e. inputs from further back in time are used.

Figure 2.8: A figure showing a stack of dilated causal convolutional layers. [49]
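In a modern deep learning library, a stack like the one in figure 2.8 is a few lines of code. The sketch below, assuming Keras, uses filter size two and dilation rates 1, 2, 4 and 8, which gives a receptive field of 1 + (2 − 1)(1 + 2 + 4 + 8) = 16 time steps; the filter counts are arbitrary, and this is not the thesis's TDNet implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A stack of dilated causal 1D convolutions in the spirit of figure 2.8.
# padding="causal" left-pads each layer so no future time step is used.
inputs = layers.Input(shape=(None, 1))     # (time steps, channels)
x = inputs
for dilation in (1, 2, 4, 8):              # dilation doubles per layer
    x = layers.Conv1D(filters=32, kernel_size=2, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)
outputs = layers.Conv1D(filters=1, kernel_size=1)(x)  # one value per step
model = tf.keras.Model(inputs, outputs)
```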

The authors also describe how the joint probability distribution p(x) can be conditioned on another set of input variables, h, which represent local or domain-specific information. An example of domain-specific information in the area of Text-To-Speech is information regarding different speakers, or general information about the text. These conditional input variables can, just like the hyperparameters of the WaveNet architecture, be adapted to fit other problem domains.

To achieve better performance, WaveNet makes use of residual and parameterized skip connections. Skip connections allow gradients to flow from a shallow layer to a deeper one without going through an activation function. Parameterizing them means multiplying the gradient of the connection by a learnable parameter. Residual connections reframe the learning task of a layer: instead of trying to learn what the true output y(x) should be, the layer tries to learn the residual, or difference, r(x) between the true output and the input of the layer, x.

y(x) = x + r(x) (2.15)

A traditional network layer tries to learn y(x) in equation 2.15, while a residual layer tries to learn r(x). This has been shown to empirically increase the performance of deep CNNs, and it is believed that this is because the layer can easily reproduce its input x just by keeping r(x) at 0. In a traditional layer, learning the identity function, i.e. passing the input through as output, is difficult. [19]
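Combining the two ideas gives the WaveNet-style building block sketched below: each block adds its transformation back onto its input (equation 2.15) and also passes a 1x1-convolved branch on as a skip connection that the output stage sums. This is a simplified sketch in Keras under stated assumptions; among other things, the gated activation of the published WaveNet is replaced by a plain tanh.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, dilation, filters=32):
    """Dilated causal convolution with residual and skip connections."""
    conv = layers.Conv1D(filters, kernel_size=2, padding="causal",
                         dilation_rate=dilation, activation="tanh")(x)
    skip = layers.Conv1D(filters, kernel_size=1)(conv)  # parameterized skip
    residual = layers.Add()([x, skip])                  # y(x) = x + r(x)
    return residual, skip

inputs = layers.Input(shape=(None, 32))
x, skips = inputs, []
for dilation in (1, 2, 4, 8):
    x, skip = residual_block(x, dilation)
    skips.append(skip)
outputs = layers.Activation("relu")(layers.Add()(skips))  # sum skip branches
model = tf.keras.Model(inputs, outputs)
```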

2.9 Evaluation Metric

There are numerous ways of evaluating the accuracy of a statistical model. One chosen for a task similar to predicting taxi demand, where the objective was to predict the demand for groceries, was the normalized weighted root mean squared logarithmic error (NWRMSLE) [9]. It is defined as the following:

\mathrm{NWRMSLE} = \sqrt{\frac{\sum_{i=1}^{N} w_i \left( \log(\hat{y}_i + 1) - \log(y_i + 1) \right)^2}{\sum_{i=1}^{N} w_i}} \qquad (2.16)

where N is the number of data points in the data set, i is the index of each individual data point, ŷ_i is the predicted value and y_i is the true value. In this case, expiring groceries motivate a weighting factor: the weights, w_i, were added to penalize errors on expiring groceries (1.25 for perishable groceries and 1.00 for all other items).

The taxi domain doesn't motivate having different weights for different kinds of demand, and therefore the NWRMSLE can be simplified to:

\mathrm{RMSLE} = \sqrt{\frac{\sum_{i=1}^{N} \left( \log(\hat{y}_i + 1) - \log(y_i + 1) \right)^2}{N}} \qquad (2.17)

which is also known as the root mean squared logarithmic error (RMSLE). What separates this from the RMSE, as defined in section 2.3, is that given the same distance between the predicted and the true value, a larger error is given when both values are small, compared to when they are large. The metric is therefore applicable when predicting across a large range and magnitude of values.
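The property is easy to demonstrate numerically. Below is a sketch of equation 2.17, with an example showing that the same absolute error of two taxis is punished far harder at low demand than at high demand.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root mean squared logarithmic error, equation 2.17."""
    return np.sqrt(np.mean((np.log(y_pred + 1) - np.log(y_true + 1)) ** 2))

# An error of 2 taxis on a true demand of 1 vs. on a true demand of 100:
print(rmsle(np.array([3.0]), np.array([1.0])))      # ~0.69
print(rmsle(np.array([102.0]), np.array([100.0])))  # ~0.02
```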

2.10 Hyperparameter Tuning

Choosing appropriate hyperparameters or "model settings" often has a significant impact on the performance of a model. Training a machine learning model may require a lot of computational resources and time but only reveals how well the model does for one set of hyperparameters. A model has to be retrained to find out how well it performs for a different set of hyperparameters. Thus, finding optimal or even satisfying values for these hyperparameters isn't a trivial task, and several strategies have been developed to find sufficient hyperparameter values. The first step in each of the commonly used strategies is for the developer to define the possible values of each hyperparameter that should be tuned; this is known as the domain. Thereafter, either values are chosen manually by the developer or a search algorithm is applied. Manually choosing hyperparameter values is easy to do but comes at the cost of efficiency. For supervised learning tasks, the performance of the chosen hyperparameters can easily be evaluated based on the size of the error metric. [16]

In a study by Bergstra, J. and Bengio, Y., a simple neural network with seven hyperparameters is trained on seven similar data sets [2]. One of the data sets was MNIST, a well-known data set consisting of 70 000 handwritten digits presented in gray-scale; it was first established in a classic paper by LeCun et al. and is now available online1 [29]. Three examples of the 28x28x1 images can be seen in figure 2.9. Three of the other data sets were variations of MNIST and the last three were also image data sets of lower or similar complexity. For each of the data sets, only the learning rate, or the learning rate and a second hyperparameter, had significant relevance for the performance of the neural network. However, the second significant hyperparameter changed from one data set to the next. With this in mind, the authors suggest that in general only a subset of the hyperparameters carry significance, but that determining beforehand which ones these are is a very difficult task.

Gridsearch

Gridsearch is an exhaustive search algorithm where all possible combinations of the hyperparameter domain are tested. The advantage of this is that the optimal set of parameters is found in the domain defined by the developer. This does, however, assume that there is enough time and computational resources to retrain the model for all possible combinations of hyperparameters. The number of possible combinations grows exponentially, which is problematic when training complex models which may take a long time to train and can have more than a dozen hyperparameters. In practice, a combination of grid search and manual search is often used, where certain parameter combinations are skipped in favor of more promising ones a few iterations in. [2]

Figure 2.9: Example images in the MNIST data set; to the right of every image there are three classification probabilities generated by a CNN.

1 http://yann.lecun.com/exdb/mnist
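A minimal sketch of the idea, where train_and_score is a hypothetical function that trains the model once and returns its cross-validation error:

from itertools import product

# Exhaustive grid search: every combination in the domain is evaluated.
def grid_search(domain, train_and_score):
    best_params, best_error = None, float('inf')
    for combo in product(*domain.values()):
        params = dict(zip(domain.keys(), combo))
        error = train_and_score(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

# The number of trials is the product of the domain sizes: 3 * 3 * 2 = 18 here.
domain = {'step_size': [1e-4, 1e-3, 1e-2],
          'channels': [2, 4, 8],
          'filter_width': [2, 3]}
# best, err = grid_search(domain, train_and_score)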

2.11 Sequential Model-based Optimization

A step up from gridsearch, or gridsearch combined with manual search, is sequential model-based optimization (SMBO). In comparison to gridsearch, an SMBO algorithm actually uses the information from previous trials to estimate which hyperparameter values would yield better results. In comparison to manually picking values, a strategy which also takes history into account, SMBO does so in a statistically sound manner and doesn't require changing the code before each new run.

To perform SMBO, five different parts are required. The first part is defining the hyperparameter domain, and the second part is defining an objective function which takes a set of hyperparameters as input and returns an error metric as output. The third part is a probability function which models the belief regarding what results the different possible hyperparameter sets will achieve. The fourth part consists of a criterion under which the next model to train is chosen. Lastly, a log is needed to store the results of the previous trials; this is used to update the probability function. [3]

Concretely, the first part is a set of hyperparameters where each of them gets initialized as a probability distribution; this requires some qualified guesses from the developer. The distributions should preferably be selected according to research, experts or previous experience. The second part could be a machine learning model which gets trained with hyperparameter values sampled from their respective distributions; the error metric would be the model's performance on the cross-validation set. The criterion used by several algorithms to choose which hyperparameter values are tried next is Expected Improvement (EI). In this case, the algorithm calculates how much the error on the cross-validation set is expected to decrease based on the distributions of the hyperparameters. In this example, the maximum decrease is what's sought since the objective function should be minimized. This algorithm, with some slight modifications, would work just as well if the objective function were to be maximized.

\mathrm{EI}_{y^*}(x) = \int_{-\infty}^{\infty} \max(y^* - y, 0)\, p_M(y \mid x)\, dy    (2.18)

In equation 2.18, x is a set of hyperparameters, y* is a threshold of the objective function, y is the value of the objective function and p_M(y|x) is the probability distribution of y given x for the model M. In the case that p_M(y|x) is zero for all y lower than the threshold, no improvement is expected to be gained from this set of hyperparameters. p_M(y|x) is updated after each iteration; the historical results improve the knowledge of the function, which enables picking candidates for x that improve y.


2.12 Tree Parzen Estimator

An example of an SMBO algorithm is the Tree Parzen Estimator (TPE). It has been used to construct models which were able to produce state-of-the-art results by efficiently identifying good sets of hyperparameters. [3]

p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}    (2.19)

To approximate p_M(y|x) in equation 2.18, the TPE uses Bayes' rule, defined in equation 2.19. Furthermore, p(x|y) = l(x) if y < y* and p(x|y) = g(x) if y ≥ y*. l(x) can thus be interpreted as a function which defines a probability distribution over promising values of x, and g(x) as the opposite. Specifically, l(x) is sampled to produce a set of candidates, and these are then evaluated under the criterion min(g(x)/l(x)), i.e. x should be chosen so that the probability of a low error is high and the probability of a high error is low. This means that y* must be chosen so that there exists at least one point where y(x) < y* in order for l(x), and by extension the criterion, to be defined.
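Since hyperopt, the library used for the tuning in this thesis (see section 5.1), implements TPE, the procedure can be sketched as follows. The objective function here is a dummy stand-in for training a model and returning its cross-validation error, and the search space mirrors the step size domain later listed in table 4.3:

import numpy as np
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # Hypothetical stand-in for training the model and returning its
    # cross-validation error; here simply a function minimized at 0.01.
    return (np.log(params['step_size']) - np.log(0.01)) ** 2

space = {'step_size': hp.loguniform('step_size', np.log(1e-4), np.log(1e-1)),
         'channels': hp.choice('channels', [2, 4, 6, 8])}

trials = Trials()  # the log of previous results used to update p_M(y|x)
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)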

2.13 SARIMA

The statistical model ARIMA, or autoregressive integrated moving average, was popularized by Box, G.E.P. and Jenkins, G.M. in their book "Time series analysis: forecasting and control", first released in 1970 [5]. It is a linear model for time-series analysis and forecasting. The model linearly combines previous values of the response variable and its errors to predict what value it will take in the future.

y_t = \theta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \dots - \theta_q \varepsilon_{t-q}    (2.20)

In the above equation, y_t is the value of the response variable for time step t and ε_t is the error. The parameters of the model are β_i for i = 1, ..., p and θ_j for j = 1, ..., q. The values of these parameters impact the model's performance. Finding parameter values is achieved by going through the so-called Box-Jenkins methodology.

The first step, according to the authors, is performing model identification, i.e. a model should be proposed which auto-correlates similarly to the data. Autocorrelation is a measure of how well a time-series correlates with a delayed copy of itself, depending on the delay. A prerequisite for identifying the model is that the time-series is stationary, i.e. that the mean and autocorrelation don't change over time. This is either the original state of the time-series or can be achieved by selecting an appropriate value for the parameter d, which controls the degree of differencing. To check whether the time-series is stationary or not, the augmented Dickey-Fuller test can be performed [13]. The second step of the Box-Jenkins methodology is identifying the parameters p and q which minimize the error; p and q control the time lags and the order of the exponential average respectively. These values can be found by plotting the partial autocorrelation and autocorrelation functions and observing for which time step they approach zero. The last step of the method is to evaluate the accuracy of the proposed model to make sure that the previously made assumptions hold.

An extension of the vanilla ARIMA model is the seasonal ARIMA model, or SARIMA. This model allows for selecting a seasonal factor m, which for example could be three months if the data contains a quarterly pattern. Three parameters are added to the model which correspond to (p, d, q) but on a seasonal basis. [5]
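With statsmodels, the library later used for the SARIMA benchmark (section 5.1), fitting one such model can be sketched as below. The zone_demand series is a random stand-in, and the (p, d, q) and seasonal orders are placeholders to be identified via the Box-Jenkins steps above:

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

zone_demand = np.random.poisson(5, size=24 * 90).astype(float)  # stand-in hourly series

# Augmented Dickey-Fuller test: a small p-value suggests the series is stationary.
p_value = adfuller(zone_demand)[1]

# Seasonal ARIMA with seasonal period m = 24 for a daily pattern in hourly data.
model = sm.tsa.statespace.SARIMAX(zone_demand,
                                  order=(1, 0, 1),               # (p, d, q), placeholders
                                  seasonal_order=(1, 0, 1, 24))  # (P, D, Q, m)
result = model.fit(disp=False)
forecast = result.forecast(steps=26)  # a 26-hour horizon, as used later in section 4.8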


2.14 Stacked ensembles

A possibility when faced with a supervised machine learning problem is to develop several models which can perform the task and then combine and weigh their outputs to generate a prediction stronger than each of their individual outputs. Different algorithms have different strengths and weaknesses, and while some might have issues modelling a certain part of the problem domain, others might successfully model the same part but fail somewhere else. By combining, for example, an ARIMA model, two deep neural networks and a gradient boosting machine, which is another machine learning algorithm, a stacked ensemble can be created which is able to compete with a more complex model. [42]
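The combination step can be as simple as a weighted average of the base models' held-out predictions. The following sketch fits the weights by least squares, one simple stand-in for the metalearner that stacking frameworks use:

import numpy as np

# base_preds: (n_samples, n_models) matrix of held-out base-model predictions.
def fit_stack_weights(base_preds, y_true):
    weights, *_ = np.linalg.lstsq(base_preds, y_true, rcond=None)
    return weights

def stacked_predict(base_preds, weights):
    return base_preds @ weights  # weighted combination of the base models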

2.15 Mixed Precision

Computing matrix products is done very often in a neural network, e.g. when calculating the output of a neuron, see equation 2.4. How the matrices are represented in computer memory affects how expensive the calculation is. Typically, 32-bit floating point precision is used to represent numbers; they can then range from -3.4 · 10^38 to 3.4 · 10^38, with the smallest representable magnitudes around 10^-38. Reducing the precision has the positive effects of reducing the space needed to store the number and the energy required by the processing unit to do computations, and of improving the performance of operations done on the number. The downside is that the numbers can't be as accurately represented, a loss in precision. Specifically, the range of magnitudes which can be represented when switching from 32-bit to 16-bit floating point precision is only about 6 · 10^-5 to 6 · 10^4. In addition, the precision with which a decimal can be represented is reduced. It has been shown that this range and precision loss has a limited negative impact on the performance of deep neural networks while still producing the positive effects. [11, 23]
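The effect is easy to see with numpy:

import numpy as np

info = np.finfo(np.float16)
print(info.tiny, info.max)   # ~6.1e-05 and 65504, i.e. roughly 6 * 10^-5 to 6 * 10^4
print(np.float16(1 / 3))     # 0.3333: fewer significant digits than in float32
print(np.float16(1e5))       # inf: outside the representable fp16 range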

2.16 Method

To ensure that the research is conducted in compliance with the scientific standard, a couple of relevant studies have been reviewed which provide guidelines and best practices.

CRoss-Industry Standard Process for Data Mining (CRISP-DM) is a process used during data mining and machine learning projects. It was first described by Rüdiger Wirth et al. in 2000 and consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. Although primarily created for teams, it provides a framework of common terminology, thus making it easier to communicate the development process to outside stakeholders. It also provides a logical order in which to complete the phases of a data mining or machine learning project. [53]

A concrete example of how to empirically evaluate the performance of supervised learning algorithms is provided by R. Caruana and A. Niculescu-Mizil. They provide guidelines pertaining both to process and to how to properly evaluate algorithms within machine learning. The empirical evaluation guidelines cover, for example, hyperparameter tuning, proper evaluation of error metrics and bootstrap analysis. They also provide a clear overview of their processes, thoroughly describe their technical choices and point out the key principles behind splitting the data into training, testing and validation sets. [7]


3 Literature Review

The purpose of this section is to contextualize this thesis and TDNet in the current field of research. Simply put, the idea is to apply a WaveNet architecture, which has provided state-of-the-art results in other domains, to the domain of taxi demand prediction.

3.1 WaveNet Architectures

The WaveNet architecture has previously been used for problems outside the domains of the original paper. A number of examples can be found on the data science platform Kaggle1, where its members, who range from beginners to industry experts and researchers, can compete. Concretely, Glib Kechyn et al. came in second place in the Corporacion Favorita Grocery Sales Forecasting competition, where they proposed a WaveNet architecture to predict sales for a large grocery chain [9, 25]. As the defined prediction horizon was 16 days, modifications were made to the original WaveNet model to output sequential predictions for the entire period. A problem that occurred due to predictions being conditioned on previous predictions was accumulating errors. To handle this they implemented a sequence-to-sequence learning method using an encoder-decoder. The first 1D CNN encoded a sequence of grocery sales data into a fixed-length vector which represented the original sequence. The second 1D CNN then decoded the output of the first network back to a sequence with a length of 16 days. Glib Kechyn et al. did not share the parameters between the encoding and the decoding network.

The architecture of the network used by the authors is similar in structure to what has been used in this paper. There are many details of their implementation which the authors have chosen not to present in their paper and which surely differ. Furthermore, there are two major differences relating to the task, the obvious one being that the domain is different. The second one is the difference in available data: the data set used by Glib Kechyn et al. contained more than 125 million observations, in comparison to approximately 1.6 million for each of the cities in this paper. Additionally, the model described in this paper should perform well for multiple customers in multiple cities, so it is important for it to generalize well. When predicting the demand of groceries, the authors explicitly state that they used all possibilities to increase the accuracy of competition predictions; this most likely caused them to overfit to that particular dataset, thus reducing generalizability.

1 https://kaggle.com

Another domain where conditional time series forecasting is of great interest is finance. A. Borovykh et al. have successfully used a simplified WaveNet to perform multivariate forecasting on multiple exchange rates and the second largest stock market index in the U.S. [4]. They predicted daily prices, which reduced the size of the training set by a factor of 24 in comparison to if they had measured hourly prices. In order to reduce the training time of their model, ReLUs are used instead of gated activation units. The authors of the original WaveNet paper considered using ReLUs but discarded that idea based on their poor performance when modelling sound [35]. A. Borovykh et al. conclude that their solution offers a viable alternative to RNNs and traditional economic models when it comes to implementation difficulty, training required and performance. They exploit the correlation between the exchange rates by conditioning their WaveNet on multiple exchange rates. This is similar to what is done by the model presented in this paper, but with different city zones instead. A difference in the approaches is that TDNet is fed with taxi demand lagged by a day up to a year; this is done to further increase the receptive field, which is necessary since there are many more data points when measuring by the hour instead of by day.

3.2 Alternative approaches

A paper written by Lv et al. proposed using stacked autoencoders for the purpose of predicting traffic flow in the upcoming hour [32]. An autoencoder is a neural net which ideally outputs an exact reconstruction of its input. It achieves this by first encoding the input to a message with reduced dimensions; it then decodes/decompresses the message to an output which matches the input as closely as possible. The consequence of this procedure is that the autoencoder learns to filter out the noise in the input and to succinctly represent it. When stacking several of these neural nets on top of each other and finishing with a prediction layer, a deep neural network is created which can do e.g. time-series predictions [51]. Predicting traffic flow is similar to predicting taxi demand, and this was one of the first attempts made at utilizing machine learning for such a task. Their model displayed great performance in comparison to the statistical models they used as benchmarks.

Apart from applying deep neural networks, traditional statistical models can be used to predict taxi demand. Examples of these are time-weighted time-varying Poisson models and ARIMA [34], Markov predictors [59] and multi-level clustering [12]. The advantages of using these approaches are that they are well understood and less computationally heavy. The disadvantage is that they have difficulties modelling deep underlying trends, something that exists in the taxi demand and traffic flow domains [32]. To properly evaluate the balance between traditional statistical models and the increasingly popular deep neural networks, a SARIMA model has been chosen to serve as a benchmark in this paper.

3.3 Taxi Demand

K. Zhao et al. investigate the limit of predictability when it comes to taxi demand in NYC. They divide the city into zones based on large buildings and calculate three different kinds of entropy to approximate how well taxi demand can be predicted for each of the zones. They split up the causes for taxi demand into temporal and random correlation. A low random correlation indicates that a pure time-series model which only takes historical demand into account can predict the future demand well, and a high random correlation indicates that further information is needed. They find that the hourly limit of predictability for their small building zones is 83% on average. To predict the taxi demand, they use a hidden Markov model (HMM) and a shallow neural network. They conclude that the HMM, which is a pure time-series model like SARIMA, is faster and performs better than their NN in zones where the predictability is high. The NN is slower but performs better in zones with low predictability, i.e. irregular demand.


4 Method

In this section, the data, the feature engineering, TDNet, the evaluation process, the benchmarks and the hyperparameter tuning will be described.

4.1 Data Description

At the center of each machine learning project is the data. In this section, the data as well as the steps taken to clean it will be described. The data can be divided into company-provided data and data provided by external sources. In the interest of the confidentiality of TaxiCaller and their customers, certain details of the data will be omitted.

Data Provided by the Company

For each taxi trip, the coordinates of the pick-up point are recorded as well as when it occurred. The time is represented as a time stamp which contains the year, month, day, hour and minute of the pick-up. For business purposes, each city is divided into different zones of varying sizes, and based on the coordinates of the pick-up point, a zone is assigned to the trip. Furthermore, all the trips which occur during the same hour in the same zone are totaled and are referred to as the zone demand. The data ranged from the 1st of January 2017 to February 2019. This range consists of approximately 18 500 hours.

External Data

As described in section 2.1 on taxi demand, factors such as the weather, national holidays, connecting traffic as well as special events affect the taxi demand. To model this, the Dark Sky1 API was used to gather information about the temperature in degrees Celsius, wind speed in meters per second and precipitation. The precipitation was further divided into type, i.e. rain or snow, intensity as measured in millimeters per hour, and accumulation as measured in centimeters. If hourly data was available, it was used; otherwise, daily data was used. Information about national holidays was also added to the data set of the customer in city NE.

1 https://darksky.net/dev


time         zone id  demand  holiday  precipitation type  wind speed  temperature
20170101T00  A        10      1        snow                2.1         -1.1

Table 4.1: Dataframe containing an example of data provided by TaxiCaller merged with external data.

4.2 Data Cleaning

As data quality is a determining factor for machine learning results, the data had to be cleaned. To ensure that there were no outliers, negative zone flow, invalid zone ids or times outside the preset range, a script validated the company data before it was fed to the preprocessing step. This didn't eliminate any data points for either of the two cities, which is an indication of high-quality data.

The column precipitation accumulation only contained zero entries, which led to its removal. The column precipitation intensity contained a suspicious amount of zero entries; by manually checking another weather service it was concluded that zeros were added as the default value in the Dark Sky API and that very few of the data points were valid. Additionally, for the city NE, precipitation data was completely missing for almost a full year, which was remedied by adding data from a city close by.

4.3 Data Splitting

The data set initially contained about 24 months of data for each city; 18 were used for training and three for cross-validation. Three months were used for the final evaluation and constituted the test set. Splitting the data was done immediately after the data cleaning was completed, to prevent introducing a bias while exploring and analyzing the data with the test set in it. If this wasn't done, information about the test set would leak and most likely affect future decisions.

4.4 Data Preprocessing

In some zones there were only a couple of pick-ups for the investigated time period of two years. In other zones the number of pick-ups was negligible in comparison to the most active zones. Therefore a decision was made to remove the zones which accounted for less than 1% of the total taxi demand of the city, given that the city didn't contain more than 50 zones. At the end of this operation, about 200 000 data points remained for each of the cities; the zones which made the cut will be referred to as the significant zones.

At this point in the preprocessing, a row in the data set would contain nominal and categorical numerical features, one time stamp and one categorical string feature. In table 4.1, an example dataframe with one row is displayed.

To be able to feed the input to the model, the data was transformed into a 3D matrix, also known as a tensor, where the rows were different zones, the columns were hours and the third dimension varied with hour and zone. Features which didn't vary in all three dimensions were also transformed to this shape but only received a dimension of one in the insignificant dimensions.

Feature Engineering

Numerical features could be fed directly to the model; they have, however, been transformed to formats which best represent the underlying information. For example, the zone id feature uniquely identifies a zone, is an integer and only has as many values as there are zones in a city. A higher id doesn't convey any more information than a lower id, nor any relation to any of the other zones; it just uniquely identifies the zone. Therefore it is a categorical variable and was transformed into a binary tensor. The same went for the precipitation type. Table 4.2 shows what the row in table 4.1 would look like after going through the binary transformation process and removing the holiday, wind speed and temperature features. It also reduces the number of zones to three; all of this is done to improve table readability.

time         Zone A  Zone B  Zone C  snow  rain  NaN  demand
20170101T00  1       0       0       1     0     0    10

Table 4.2: Dataframe containing zones and precipitation type as binary features when the zone is A and the precipitation type is snow.

Whether a certain date is a holiday or not was represented by a binary variable that was 1 for holidays and 0 for normal days. The wind speed and temperature were normalized and standardized so that their mean was 0 and standard deviation 1. The zone demand went through a logarithmic transformation, normalization and standardization. Scaling the features in this manner improves the efficiency of gradient descent and leads to faster training. [29]
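A sketch of these transformations, where wind and demand are stand-in training-set arrays (the statistics would be computed on the training split only):

import numpy as np

def standardize(x):
    return (x - x.mean()) / x.std()  # zero mean, unit standard deviation

wind = np.array([2.1, 3.4, 0.0, 5.2])      # stand-in wind speed values
demand = np.array([10.0, 0.0, 3.0, 25.0])  # stand-in zone demand values

wind_scaled = standardize(wind)
demand_scaled = standardize(np.log1p(demand))  # log(demand + 1) first, then scaling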

This leaves the time stamp, which contains plenty of important information. Initially, the time stamp column was divided into a year, a month, a day-of-month and an hour column. The day of the week was calculated based on the date. The year column was transformed from 2019, 2018, 2017 to 2, 1, 0, which is similar to removing the mean and moves the feature to approximately the same range as the other features. The categorical day-of-week column was one-hot encoded. Finally, the day-of-month and hour-of-day columns were turned into cyclical features by calculating their sine and cosine values.
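A sketch of these time stamp features with pandas, using hypothetical column names:

import numpy as np
import pandas as pd

timestamps = pd.date_range('2017-01-01', periods=48, freq='H')
df = pd.DataFrame({'year': timestamps.year - 2017})       # 2017, 2018, 2019 -> 0, 1, 2
dow = pd.get_dummies(timestamps.dayofweek, prefix='dow')  # one-hot day of week

# Cyclical encoding: hour 23 and hour 0 end up close together.
df['hour_sin'] = np.sin(2 * np.pi * timestamps.hour / 24)
df['hour_cos'] = np.cos(2 * np.pi * timestamps.hour / 24)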

Under the assumption that there were cyclical patterns in the data, which the data exploration indicated, lag features were created. Concretely, for each zone the model was fed the demand of that zone one hour, 24 hours, one week, one month and one year before. Since the demand during the last hour might not be available in a production environment, it is only fed to the encoder during training, not to the decoder when making predictions. However, the other lag features are assumed to be available.

The zone id constitutes what is called a conditional vector in the original WaveNet paper [49]. It isn't explicitly represented by a zone id, but each row always corresponds to the same zone and the distribution of demand in each of the zones is conditioned on the rest of them. This is similar to the original WaveNet being fed with the values of different speakers simultaneously. The model learns during the training phase how the zones, or speakers, relate to each other.

4.5 Data Exploration

Numerous graphs were created to gain a better understanding of the distribution of the taxi demand based on the features. The historical demand was investigated by looking at its rolling means and standard deviations. In this phase, suspicions, such as that the taxi demand on weekdays and weekends should differ, were confirmed. Additionally, findings were made which indicated that holidays would be an interesting feature to add to the data set due to them causing spikes in the data. A check for linear correlation between the features and the response variable was performed without finding any relationships. The decision to remove the insignificant zones was made when examining the distribution of demand between zones.

The demand throughout the day for the two cities can be seen in figures 4.2 and 4.1. The demand has been summed up over all zones for the whole training period, and this total demand is displayed per hour as a fraction of the hour with the highest demand. For both cities, the demand is lowest in the middle of the night but still doesn't go below one quarter of the peak hourly demand. For city SA, the demand starts increasing rapidly at 9:00 up until its peak at 14:00; then it decreases until the morning hours. The demand for NE doesn't follow the same smooth curve; instead the demand increases from 4:00 until a peak at 8:00, after which it decreases slightly and stays level until 16:00, when it goes up and hits its peak at 18:00. Thereafter it decreases until 4:00.

Most zones in both cities didn't contribute significantly to the demand; therefore the zones which didn't contain more than 1% of the demand were removed. This, accidentally, left 14 zones in each of the cities. The distributions of demand between these zones can be seen in figures 4.3 and 4.4 and differ substantially. In SA, one zone stands for almost 35% of the demand and a few zones hover just over the red line which represents the 1% cut-off. In the other city, NE, the distribution is more even, with three zones over the 10% mark, seven between 4% and 10% and four between 1% and 4%.

Figure 4.1: True Demand per Hour in SA as a fraction of the max hourly demand.

Figure 4.2: True Demand per Hour in NE as a fraction of the max hourly demand.


Figure 4.3: Zone Demand Distribution in SA as a percentage of the total demand.

Figure 4.4: Zone Demand Distribution in NE as a percentage of the total demand.


4.6 Model Implementation

The model implementation is based on two open-source implementations which have both been used for Kaggle competitions. They were created by the same author and have claimed 4th and 6th place out of more than 1000 participants in their respective competitions [50]. This is notable since Kaggle is known for having competitions where the winners have used stacked ensembles on top of each other and overfit them to the competition problem to score as high as possible. This is what WaveNet was up against in these competitions as well, and it still outperformed most of these so-called stacked ensembles, see section 2.14. On top of that, the stacked ensembles in these competitions are usually composed of models which the developers don't have to implement themselves, which instead enables spending most of the time performing feature engineering or hyperparameter tuning.

The changes that make TDNet different from the open-source implementations have mainly been made to the feature engineering, architecture, hyperparameters, dependencies, batch generation and precision used. The layers and the algorithms for training and predicting have only been changed for the sake of updating the dependencies, not to fundamentally change the logic.

TDNet Architecture

In figure 4.5, a building block of the network can be seen. TDD stands for time distributed dense layer, Dilated Conv is a dilated convolution and σ represents the sigmoid activation function. In total, two of these blocks comprise TDNet; the first one takes the input features and produces a tensor which serves as part of the input for the second block. The input for each of the k layers is also saved and used as input for the second block. The output of the second block is future predictions of taxi demand. k has been tuned as a hyperparameter and is the same as the number of dilations.

Precision

The frameworks described in section 5.1 are all updated frequently, and the new versions often improve performance, reduce bugs and add new features. A relevant example of this is the addition of full support for 16-bit floating point precision in CUDA 10, cuDNN > 7.4.1 and tensorflow-gpu >= 1.13, given that the hardware supports it. The default precision used in Tensorflow is 32-bit, but as discussed in section 2.15, reducing the floating point precision can yield significant benefits. Therefore the dependencies of the original model were updated and the optimizer was switched to a mixed precision optimizer. Unfortunately, only partial support for mixed precision was achieved due to compatibility issues with a dependency management system.

Batches

Training and validation batches shared dimensions and were generated using the same method, except that they were drawn from different subsets of the data. As an example, if the date on which to predict the zone demand in all zones was randomly selected to be the 1st of February 2018, then a batch contained the zone demand for all zones from 30 days back up until the 31st of January. It also contained 30 days of lagged data from the day, week, month and year before. If information about the weather on the day, or whether it was a holiday, was to be fed to the model, it was added to the batch.
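A sketch of this batch generation, under the assumption that demand is a (zones, hours) array aligned to the start of the data set; the exact tensor layout in TDNet may differ:

import numpy as np

def make_batch(demand, target_hour, history_days=30):
    h0 = target_hour - history_days * 24
    window = demand[:, h0:target_hour]                      # 30 days of hourly demand
    lags = [24, 24 * 7, 24 * 30, 24 * 365]                  # day, week, month, year
    lagged = [demand[:, h0 - lag:target_hour - lag] for lag in lags]
    return np.stack([window] + lagged, axis=-1)             # (zones, 720, 5)

demand = np.zeros((14, 24 * 500))            # stand-in: 14 zones, ~500 days of hours
batch = make_batch(demand, target_hour=24 * 450)  # -> shape (14, 720, 5)
# A random target day must leave room for the one-year lag plus the window.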

4.7 Hyperparameter Tuning

The hyperparameter tuning was performed using the hyperopt implementation of TPE, which is described in section 2.12. Knowing beforehand which hyperparameters to tune is a hard task; based on the recommendations of an industry expert and guesses based on previous implementations of WaveNet, the hyperparameters listed in table 4.3 were tuned [39]. Due to the taxi data sets being smaller than the data sets for which the models were originally built, both the number of layers as well as their width had to be reduced.

Figure 4.5: Building block of the TDNet architecture; connecting two of these blocks together formed TDNet.

Step size is the equivalent of the learning rate for the ADAM optimization algorithm; it determines how big the steps taken on the loss function surface are. Training steps determined for how many iterations TDNet was to be trained; in reality the early stopping conditions, not the limit on training steps, interrupted the training for most attempts. Channels is the number of skip channels and residual channels. The widths of the two time distributed dense layers which made up the input layers, as well as of the layer just before the predictor, were also tuned. The number of filters, their widths and how much they were dilated were all decided by tuning the hyperparameters dilations and filter widths. If the filter widths (2, 2, 2, 2) were tried, only the first four dilation values were used.

The different types of distributions are Log-Uniform, Discrete-Uniform and Choice. Log-Uniform allows for choosing values with probabilities ordered by magnitudes; in the case of the step size, the difference between 0.0001 and 0.001 is of greater interest than the difference between 0.8 and 0.9. Discrete-Uniform defines a uniform distribution between the Min and Max values, separated by the step size. Choice simply leads to a decision between the hard-coded values defined. The distribution choices have mathematical reasons, but the limits and the "Choices" have been selected due to similar values being found in other open-source implementations, albeit of scaled-up values.


Name            Distribution      Min     Max      Step
Step Size       Log-Uniform       0.0001  0.1
Training Steps  Discrete-Uniform  50 000  200 000  1000
Channels        Discrete-Uniform  2       8        2
Encode/Decode   Discrete-Uniform  8       40       16
Dilations       Choice*
Filter Widths   Choice**

Table 4.3: Table describing the probability distributions of the hyperparameters tuned. Choice* was defined as four different values, namely {(1, 4, 16, 64, 1, 4, 16, 64), (1, 2, 4, 8, 16, 32, 64, 128), (1, 2, 4, 8, 1, 2, 4, 8), (1, 8, 1, 8, 1, 8, 1, 8)}. Choice** was defined as {(2, 2, 2, 2), (2, 2, 2, 2)x2, (3, 3, 3, 3)x2}, where x2 means repeating the filter widths.

4.8 Evaluation

The performance of a model was evaluated using either the RMSLE metric, as defined in equation 2.17, or the RMSE from equation 2.3. If a model didn't improve its cross-validation error for 2000 iterations and was out of restarts, it was stopped; otherwise it restarted from the point where it had achieved its lowest error, with a decreased step size. This is the same approach taken by J. Xu et al. in their paper to speed up training [55]. Furthermore, they also used RMSE to measure the performance of their LSTM on the same task. The task was to predict 26 hours ahead; this is due to the fact that if TDNet were to be used in a production environment, an overlap of two hours for each day would make sure that predictions were always available.

A generated prediction of the hourly demand of a zone can be categorized into one of three different buckets: an underestimation, where the true value is two or more above the prediction; an overestimation, where the true value is two or more below the prediction; and an accurate prediction, where the true value is the same as the prediction or less than two away. These values have been decided based on the demand distribution in the two cities investigated. For other cities the same principle applies, but not the same exact values.
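A sketch of this categorization for the threshold of two used in these cities:

import numpy as np

def bucket_counts(y_true, y_pred, threshold=2):
    diff = np.asarray(y_pred) - np.asarray(y_true)
    over = np.sum(diff >= threshold)        # overestimations
    under = np.sum(diff <= -threshold)      # underestimations
    accurate = diff.size - over - under     # within the threshold
    return accurate, over, under

accurate, over, under = bucket_counts([3, 5, 0], [4, 9, 0])  # -> (2, 1, 0)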

Cross-Validation

Validation batches were generated from the cross-validation subset of the data; this subset was made up of the months in the training set closest to the test set. A validation batch had the same dimensions as a training batch, i.e. it saw 30 days of training data for all significant zones, starting at a random point in the validation set range, and then predicted the demand for the next 26 hours. The error was calculated for this randomly chosen 26-hour period in all zones. An issue with this is that some dates will contain less noise than others and be easier to predict. Since the model selection process is based on the cross-validation error, this could lead to an inferior model being selected just because the validation happened to occur on a day that was easy to predict. To combat this, a loss averaging window was applied which calculates the average train and validation error over the last 100 training steps. This averaged error serves as the metric for picking the best model throughout the training process. Calculating a rolling mean of the errors adds robustness and ensures that a model that generalizes well is selected.
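A sketch of such a loss averaging window: the model kept is the one with the lowest rolling mean of the validation error rather than the lowest single value:

import numpy as np

def rolling_mean(errors, window=100):
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(errors, float), kernel, mode='valid')

# Index of the training step ending the best-performing 100-step window:
# best_step = np.argmin(rolling_mean(val_errors)) + 100 - 1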

Final Evaluation

For the final evaluation, the chosen model predicted the demand in all zones between the 1st of October 2018 and the 10th of January 2019. The month of September 2018 is also included in the test set as unseen data but is only used to provide context for predicting the demand in October. The error for this whole period over all zones was calculated, and that constitutes the final result for a certain city. This was compared to the benchmark algorithms. SARIMA went through the same procedure, where 26-hour forecasts were generated, but the stacked ensemble, which isn't dependent on historical demand and therefore can predict the demand for arbitrary time periods at any time, only predicted 24 hours for each day. This means that the exact same task wasn't performed, but the evaluation is fairer in the sense that the models are used in the same way that they would be used in a production environment.

4.9 Benchmarks

In order to compare the results, two benchmarks have been implemented: a SARIMA model and a stacked ensemble of supervised learning models. All benchmarks have been evaluated using the same error metric over the same dates. Since the purpose of the benchmarks is to put the performance of TDNet in context, their details are not described as thoroughly, nor are all terms.

The machine learning models are different supervised regression models. This means that they are fed the same data as TDNet, and the time stamps are converted to features which convey information more clearly than the time stamps alone; these features have also been fed to TDNet. However, the machine learning models don't inherently treat the time stamps as having a temporal order. The machine learning models which have been considered are the following: a random forest, an extremely-randomized forest, XGBoost, a random grid of Gradient Boosting Machines (GBMs) and a random grid of deep neural networks. A stacked ensemble model was then trained using the models which performed best and was used to generate predictions. This stacked ensemble was created using a framework which has implemented all the different models and provides a fairly simple programming interface which handles the details under the hood.

The traditional statistical time-series forecasting model was a SARIMA model. The first step in using this was investigating whether the time-series was stationary or not, which was done with the augmented Dickey-Fuller test. Secondly, the parameters p and q were identified by plotting the partial autocorrelation and autocorrelation functions. Different SARIMA models were created for all the different zones in both cities. Thirdly, the seasonal parameters were chosen so as not to induce non-stationarity, which was necessary for the implementation of this model to work. Given that the seasonal parameters met these conditions, they were set according to rules laid out by R. Nau [36]. The seasonality chosen was 24 hours, meaning that the model used data from 24 hours back to predict the current demand. SARIMA models have previously been used to predict taxi demand [34].

4.10 Feature Importance

To evaluate the impact of the features added by the holiday and weather data, i.e. the external data sources, the model was retrained without them for the city NE. Its performance was then measured and compared to that of the same model with the same hyperparameters but without access to the additional data. Specifically, the input dimensions were changed from (zones, hours, 26) to (zones, hours, 22) due to the holiday, precipitation-type, temperature and wind speed columns being removed from the third dimension. This demand-only model thus had fewer parameters.

4.11 Rounding

If TDNet were to be put in a production environment, the use case would most likely demand delivering predictions in the form of integers. In order to measure the impact this would have on the performance, the final evaluation errors were also calculated for floating point predictions rounded to integers.

4.12 Models trained

Several TDNet models have been trained to enable accurately answering the research questions. For the city NE, four different models were trained and hyperparameter tuning was performed for two of them. For the city SA, two different models were trained and hyperparameter tuning was performed for one of them. The TDNets to predict SA demand were trained using either the RMSE or the RMSLE loss function. The reason that two more models were trained to predict the demand in NE is that the impact of weather and holiday data was evaluated on that city. Hyperparameter tuning was performed using the RMSLE loss function for NE and RMSE for SA. For NE the hyperparameter tuning ran for 100 iterations, for SA only 30 due to time constraints.


5 Empirical Evaluation

The setup for the experiments performed, as well as the results, are presented in this section. For two of the company's customers, the hyperparameters of the architecture have been tuned. The results for these two cities, NE located in northern Europe and SA located in South America, are presented. The predictions are made for the time period 2018-10-01 to 2019-01-10 in 26-hour intervals for all zones in a city. The forecasts are 26 hours long but are made every 24th hour; thus the overlap is removed for graphs and tasks where duplicates should be avoided.

From the predictions and true values, the error metrics RMSE (equation 2.3) and RMSLE (equation 2.17) are calculated for all hours and all zones, which results in a numeric error for the whole city. As the real demand is measured in integers, the error metrics are also calculated for rounded values of the predictions. In addition to the performance of the TDNets, the performance of the SARIMA and stacked ensemble benchmarks is presented.

5.1 Experimental Setup

The hardware specifications for running the model will be listed, as well as the core frameworks and libraries used.

Frameworks and Libraries

The language used throughout this project has been Python 3.6. The model has been written in Tensorflow, an open-source platform for machine learning1. It is written in C++ for performance, but the API is made primarily for Python development. To explore and preprocess the data, the libraries pandas2 and numpy3 have been used. Pandas provides data structures for easily handling large amounts of data and numpy provides efficient array implementations and functions. To enable the use of GPU-accelerated computation, the cuDNN library by NVIDIA was used together with the CUDA platform [40, 41]. The hyperparameter tuning was done using a Python library called hyperopt4, which provides the possibility to use sequential model-based optimization to find a good set of hyperparameters [3]. The supervised learning benchmark was implemented in R using h2o, a machine learning platform which specializes in making AI accessible; it allows for easily building quite advanced models at the cost of customizability5. The SARIMA benchmark was implemented in Python using the open-source statsmodels module, which provides tools and algorithms for, among other things, time-series forecasting6.

1 https://www.tensorflow.org/
2 https://pandas.pydata.org/
3 http://www.numpy.org/
4 http://hyperopt.github.io/hyperopt/
5 https://www.h2o.ai
6 https://www.statsmodels.org/stable/index.html

Hardware Specifications

In order to make use of mixed precision, it's essential to have a GPU which supports it. The CPU and RAM don't have to be as powerful as the GPU to be able to train deep neural nets efficiently, but they mustn't be bottlenecks. A desktop computer with the following relevant specifications was used:

• GPU: NVIDIA GeForce RTX 2070, 8GB

• CPU: AMD Ryzen 2600 3.4 GHz

• RAM: 2x8GB

• Operating System: Ubuntu 18.04 64-bit

5.2 Results NE

For this city, the performance of the model including external data sources in the form of weather and holiday data has been evaluated. This has been done in order to measure the impact of additional data and to estimate the value of spending time adding this data to the model. The errors of TDNet trained on demand only (TDNet_demand), of TDNet trained on demand and external data, and of the two benchmarks, as measured by RMSLE, are displayed in the graph to the right in figure 5.1. The benchmarks are a stacked ensemble model and a SARIMA model. TDNet_demand performed best, followed closely by TDNet. In third place came the stacked ensemble benchmark and fourth the SARIMA model. TDNet is close to being beaten by the stacked ensemble model. In the top right graph, the error of each of the models measured in RMSLE is depicted. In this case, TDNet beats TDNet_demand, which beats the stacked ensemble, which is followed by the SARIMA model.

In the bottom left corner of figure 5.1, the RMSLE of TDNet and of TDNet_demand is displayed. Next to each of them are bars showing their performance when their predictions are rounded to integers. When rounding the predictions, the RMSLE of the model trained on all data sources (TDNet) increased by 3.5%, and that of the model trained on demand only by 2.6%. Notably, the accuracy of the rounded TDNet predictions is lower than that of the unrounded TDNet_demand predictions. For the models trained using RMSE, which aren't displayed in the graph, the difference is the same for both of them, 0.7%.

To get a concrete sense of how far away the predictions are from the true values, the differences between all true values and their corresponding predictions are plotted in figure 5.2. There are more extreme differences, but these are very few in number and have been omitted from this particular graph to increase readability. The prediction buckets, which are defined in section 4.8, have the following sizes: 64% of predictions are accurate, 9% are overestimations and 27% are underestimations. This can be compared to figure 5.4, which shows the difference distribution of the best RMSE model. For that model, 63% of predictions are accurate, 15% are overestimations and 22% are underestimations. Figure 5.3 shows how the error distribution differs when the true demand has to be higher than zero. This slightly changes the performance of the model in that 59% of predictions are accurate, 6% are overestimations and 35% are underestimations.

Figure 5.1: Results of TDNet and benchmarks as measured by RMSE and RMSLE. The bottom left histogram shows the results from rounding the predictions.

Figure 5.2: Difference between prediction and truth in real numbers.

Figure 5.3: Difference between prediction and truth in real numbers where the true demand is greater than zero.

The city has been divided into different zones and demand predictions have been made for all zones which contribute 1% or more of the total demand in the city. An insight into how well TDNet predicts the demand for each of the zones can be gained from figure 5.5. The error is measured by RMSE, and the errors of the zones range from 1 in zone M to just below 3 in zone B. This can be compared to the total error of the model of 2.37, which six zones are above and eight are below.

Figure 5.4: Difference between prediction and truth in real numbers for RMSE model.


Figure 5.5: Distribution of RMSE across all zones for city NE.

The performance of the models summed over 24-hour periods and over all zones is shown in figure 5.6. This is done to display a simple overview of the performance of the models and to highlight differences between the two error metrics. The true demand in all zones for all hours of a day has been summed up and is represented by the green line in the figure; the demand for each day has been divided by the demand of the max day to avoid showing real numbers. The other two lines show the predicted demand of the two best models trained with different loss functions. The graph shows how the true demand varies substantially over time. Especially the month of November contains first a few spikes, then the lowest trough, followed by the peak of the whole period, which is reached on the 25th. Both TDNets seem to follow the general trend pretty well but fail to capture the spikes, though the model trained using RMSLE consistently underestimates the total demand.

Similar to figure 5.6 is figure 5.7, where the total demand of the city is displayed in relation to the total demand as predicted by the SARIMA and stacked ensemble benchmarks. The aggregated predictions of the SARIMA model are represented by a blue line and don't vary much. The orange line represents the predictions of the stacked ensemble; these vary more, and in a seemingly regular way. This regularity doesn't mirror all the changes in the true demand but appears to capture a weekly pattern.

To measure the impact of which loss function was used during training, the cross-comparison in table 5.1 has been created. The predictions of the model trained with the RMSLE loss function were evaluated using both RMSLE and RMSE, and likewise for the model trained with the RMSE loss function. As expected, a model trained using one loss function and then evaluated using the same one outperformed switching loss function between training and evaluation. In percent, the decrease in accuracy when evaluating using RMSE was only 0.4%, and the decrease in accuracy when evaluating using RMSLE was 6.1%.

             RMSLE Eval  RMSE Eval
RMSLE Train  0.6639232   2.4465284
RMSE Train   0.6681081   2.3061376

Table 5.1: Results of comparing the impact of the two different error metrics.

Figure 5.6: Total Demand by day for all zones over the test period.

Figure 5.7: Total Demand by day for all zones over the test period and predictions by benchmarks.

In figure 5.8, the RMSE of all predictions for each hour is shown for TDNet and the benchmarks. The SARIMA model consistently produces the highest error; the stacked ensemble and TDNet intermittently produce the lowest error up until 10:00, from which point TDNet performs worse until it reaches its peak error at 18:00. From then on it reaches its minimum and outperforms the benchmarks by far.

Figure 5.8: Prediction error per hour in NE as measured by RMSE for TDNet and benchmarks.

To capture the process of training a model, the loss for every twentieth iteration was logged. As depicted in figure 5.9, the loss doesn't strictly decrease, due to the random element of the batch generation, but there is a decreasing trend up until the last couple of thousand iterations. During the first 100 steps, which aren't included in the graph, the loss decreased rapidly; these steps have been removed to increase graph readability. A closer look at the figure shows two straight lines close to iterations 10000 and 25000, which are due to the model restarting from a previous iteration with a halved learning rate.

Figure 5.9: Train loss of the best RMSLE model for NE.


5.3 Results SA

Figure 5.10: RMSLE and RMSE of TDNet and benchmarks.

The results for this city, measured in RMSE and RMSLE, can be found in figure 5.10. In both cases, TDNet trained on demand only performed best, followed by the benchmarks. When measuring with RMSLE, the stacked ensemble beats the SARIMA model, but when using RMSE the SARIMA error is just below that of the stacked ensemble. In comparison to the results for the other city, TDNet with extra features is missing since weather and holiday data was only collected for NE.

Figure 5.11: Distribution of RMSE across all zones for city SA.


The distribution of errors for all zones which account for more than 1% of the total demand in the city can be seen in figure 5.11. Zone A accounts for more than twice the error of zone E, which has the second lowest error. All the other zones have an RMSE of similar size.

To measure the actual differences between the predictions and the true demand values, figure 5.12 was created for the model trained with RMSE. The histogram shows the frequency of predictions which differ from the truth by the values found on the x-axis. A zero means that the prediction was exactly right, which was the case for approximately 11% of the predictions, and 40% of the predictions were less than two away from the truth. The most common value in the histogram, a difference between prediction and truth of negative two, is classified as an overestimation; overestimations occurred in about 33% of the cases. Lastly, 27% fell in the category underestimations.

Figure 5.12: Difference between prediction and truth in real numbers for RMSE model.

The performance of TDNet summed over 24-hour periods and over all zones in the city is shown in figure 5.13. The graph shows how well TDNet has been able to capture the aggregated demand in the city. Most predicted highs and lows aren't as extreme as they turned out to be in reality, and the model is thrown off at the beginning of 2019, but in general the demand has been predicted fairly accurately.

The performance of the benchmarks summed over 24-hour periods and over all zones in the city is shown in figure 5.14, again to give a simple overview. The true demand in all zones for all hours of a day has been summed up and is represented by the green line in the figure; the demand for each day has been divided by the demand of the maximum day to avoid showing real numbers. The other two lines show the predicted demand of the SARIMA and stacked ensemble benchmarks. Predicting this city-wide aggregated demand isn't the task of these models, but the figure shows how the true demand varies substantially over time.

Figure 5.15 shows the RMSE of the different models over all hours of the day. TDNet is the best performing model, with the lowest error for all but a couple of hours. Its errors are stable and range from 3 to 4; in contrast, the stacked ensemble achieves errors as low as 3 but also an error close to 6 at 14:00. The SARIMA model performs poorly at night and comparatively well in the afternoon, but is still worse than the other two alternatives. None of these models have been trained to perform this task where all the zones are aggregated. Thus, this graph might be misleading in comparison to figure 5.10, which shows the actual loss achieved on the main task.


Figure 5.13: Total Demand by day for all zones over the test period and predictions by the RMSE model.

Figure 5.14: Total Demand by day for all zones over the test period and predictions by benchmarks.


Figure 5.15: Prediction error per hour in SA as measured by RMSE for TDNet and benchmarks.


5.4 Hyperparameters and Architecture

Table 5.2: Best hyperparameters found from hyperparameter tuning and meta information about the training process.

Table 5.3: Hyperparameters SA

Name            Value
Step Size       0.0014
Training Steps  58000
Channels        8
Encode/Decode   32
Dilations       1, 2, 4, 8, 16, 32, 64, 128
Filter Widths   2, 2, 2, 2, 2, 2, 2, 2
Duration        84 min
Iteration       17 out of 30

Table 5.4: Hyperparameters NE

Name            Value
Step Size       0.003
Training Steps  175000
Channels        8
Encode/Decode   16
Dilations       1, 4, 16, 64, 1, 4, 16, 64
Filter Widths   3, 3, 3, 3, 3, 3, 3, 3
Duration        124 min
Iteration       66 out of 100

The best sets of hyperparameters found during the tuning for NE and SA are shown in table 5.2. For NE, the cross-validation error of the best tuned model was about 13% lower than that of the first model, which was created during development with arbitrarily chosen hyperparameters. For SA, no initial model was trained before the hyperparameter tuning; the first model trained for this city already used tuned hyperparameters.


6 Discussion

This chapter analyzes and discusses the results obtained from the empirical evaluation. It includes a comparison between the cities and between the models. TDNet is compared to the current state of the art, and suggestions for improving its performance are provided. Additionally, the sources as well as the method used to generate the results are critically examined. Lastly, the subject of this thesis is placed in a wider context.

6.1 Results NE

As implemented, TDNet trained with all features as well as TDNet trained with demand only both beat the benchmarks as measured by RMSE and RMSLE. Interestingly, their internal order differs depending on which error metric is used: when using RMSLE, TDNet barely beats TDNet trained on demand only, but when measuring with RMSE the relationship is reversed. Rounding the predictions, as is done in figure 5.1, and measuring the RMSLE leads to the unrounded TDNet trained on demand only beating the rounded TDNet, which further emphasizes how minor the difference between the two approaches is. It can be concluded that adding weather and holiday information doesn't improve the accuracy of TDNet for this city.

Regarding the holiday information, some data analysis has been performed which found that the day with the highest total demand over the two years of data was a national holiday. The day with the lowest total demand was also a national holiday, namely Christmas Day. This suggests that splitting the holiday column depending on whether the holiday positively or negatively impacts the taxi demand could be worthwhile, especially since finding information on holidays and adding it to the data set is much easier than adding weather information.

Regarding the weather information, the conclusion that it doesn't provide predictive power could very likely be extrapolated to at least all of northern Europe, the region in which NE is situated. But since the customers of TaxiCaller AB are located all over the globe and weather varies a lot, it can't be ruled out that weather data would increase accuracy in other areas.

Rounding the results led to a difference of about 3% when using RMSLE and 0.7% when using RMSE. These rounding errors are quite small, which makes it reasonable to simply round the predictions if TDNet were to be used in a production environment, instead of making the model output discrete predictions.



To provide a connection between the abstract loss functions and the reality of having the right number of taxis in the right place at the right time, an accurate prediction has been defined as being within +1/-1 of the true value. This holds in about 64% of the cases, independent of the loss function used. Removing the hours in the zones where the true demand is zero, as is done in histogram 5.3, decreases the accuracy to 59%, increases the share of underestimations from 27% to 35% and decreases the share of overestimations from 9% to 6%. This shows that non-zero demand is slightly harder to predict, and the change in the estimation distribution is logical since a true demand of zero can only be predicted accurately or overestimated.
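For concreteness, a minimal sketch of how such an accuracy figure can be computed is shown below; the function and the toy arrays are hypothetical and not part of the actual implementation.

```python
import numpy as np

def accuracy_report(y_true, y_pred, tolerance=1):
    """Share of predictions within +/- tolerance of the truth, plus the
    shares of over- and underestimations outside that band."""
    diff = np.rint(y_pred) - y_true        # rounded prediction minus truth
    accurate = np.mean(np.abs(diff) <= tolerance)
    over = np.mean(diff > tolerance)       # predicted too much demand
    under = np.mean(diff < -tolerance)     # predicted too little demand
    return accurate, over, under

# hypothetical hourly demand in one zone and the corresponding model output
y_true = np.array([0, 2, 3, 1, 5])
y_pred = np.array([1.2, 2.4, 0.7, 1.1, 7.9])
print(accuracy_report(y_true, y_pred))
```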

Looking at the demand per hour in graph 4.2 and the error of the predictions per hour in figure 5.8, it can be seen that the RMSE is higher than average after midnight, when the true demand is the lowest. At first glance this is surprising, but it can probably be explained by the fact that this demand comes from people taking a taxi home from a night out. The error measurements indicate that this demand is more irregular, and there are a few reasons why that might be. Street hailing is more common at these hours, which means that the location of the taxis matters more. Under the assumption that how many people are out, and where they go, differs from weekend to weekend, it can be hard for drivers to know where to be and whether there are plenty of people out on a certain night. A safer choice to get customers is, e.g., driving between a business district and the main station before and after normal working hours.

Loss Functions

The demand summed over all zones and over 24 hours for the whole prediction period gives the total demand of a city. As can be seen in figure 5.6, the total predicted demand of the best RMSLE model is lower than that of the best RMSE model by quite a large margin. Furthermore, a comparison between the error distributions in figure 5.2 and figure 5.4 shows that TDNet trained with RMSE is much more prone to overestimations than when trained with RMSLE. However, when cross-comparing the performance of the models on the main task in table 5.1, the difference doesn't appear to be as significant. To explain this discrepancy, the behaviour of the two loss functions in the lower number range has been examined. As can be seen to the right in figure 6.1, the RMSE for a prediction of zero when the true value is two is two; the same error can be achieved by predicting four, so an error in either direction is treated equally. Consider instead the curvature of the RMSLE loss function to the left in the same figure. The RMSLE for a prediction of zero when the true value is two is about 1.1. To achieve the same error by predicting too much, the prediction would have to be eight. This is due to the logarithmic nature of the error metric: log(2 + 1) − log(0 + 1) = log(8 + 1) − log(2 + 1). The advantages of using an error metric which over-penalizes underestimates have been discussed in the theory chapter. Concretely, "[RMSLE] ... is applicable when predicting across a large range and magnitude of values". As it turns out, on an hourly basis for each zone, the demand in the investigated cities doesn't span a large range or magnitude of values. On the contrary, the mean hourly demand per zone in the city NE is about 2 with a standard deviation of 3. Consequently, the predictions made by the model trained using RMSLE are conservative and consistently favor staying between zero and two.
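A minimal numerical sketch of this asymmetry, assuming natural logarithms and the standard metric definitions (not the exact code used in the implementation):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def rmsle(y_true, y_pred):
    # log1p(x) = log(x + 1) handles true or predicted demands of zero
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

truth = np.array([2.0])
for p in (0.0, 4.0, 8.0):
    pred = np.array([p])
    print(f"pred={p}: RMSE={rmse(truth, pred):.2f}, RMSLE={rmsle(truth, pred):.2f}")

# RMSE treats 0 and 4 as equally wrong (an error of 2 in either direction),
# whereas under RMSLE the prediction 0 is as wrong as the prediction 8.
```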

The objective of this thesis isn’t to answer to what degree overestimates of the demand aremore expensive than underestimates, but in the interest of evaluating the performance froma wider perspective, both the RMSE as well as the RMSLE of the models have been taken intoconsideration. From a theoretical standpoint, customers with a higher hourly demand perzone would probably prefer using the RMSLE and customers with a lower hourly demandper zone RMSE. Increasing the hourly demand per zone could be achieved by having a higher



Figure 6.1: RMSE and RMSLE for different predictions $p_x$ when the true value of $x$ is 2.

6.2 Results SA

TDNet produces better results than both benchmarks for both error metrics. With the definition that a prediction within +1/-1 of the true value is an accurate prediction, an accuracy of 40% was reached; 33% of the predictions were overestimations and 27% were underestimations. These numbers are low, but as figure 5.13 shows, TDNet has been able to capture the pattern in the city demand over the whole test period fairly well. The same cannot be said for the benchmarks, whose city-wide predictions are shown in figure 5.14. To produce better results, further tuning of the model would be helpful. Furthermore, relating the definition of an accurate prediction to the hourly average number of pick-ups in a zone would be an improvement. As can be seen in figure 4.3 in the method section, the zone demand is unevenly distributed, and the most active zone has an approximately three times larger demand than the second most active one. Also, the city's mean hourly demand is about 4.6 with a standard deviation of 7.5, which makes the definition of an accurate prediction quite strict. Widening an accurate prediction to within +2/-2 of the true value would increase the accuracy by about 30%, mostly because a prediction of 2 above the true value is the most common prediction.

The error of the hourly demand, which can be found in figure 5.15, is relatively even for TDNet, even though the difference in demand between the early morning and the late afternoon is substantial. This relationship is similar to the one discussed in the previous section about the hourly demand in NE.

6.3 Comparing the Cities

The demand distribution will be analyzed to get a better view of what impact it has on a city. First of all, it should be mentioned that the fact that both cities have been divided into 14 significant zones is a coincidence. Originally, they were divided into different numbers of zones, but the requirement that a zone must contribute more than 1% of the total demand of the city to be worth predicting led to this.


The zone error distributions depicted in figure 5.5 and figure 5.11 differ substantially. In NE, the errors are of similar size and hover around the RMSE of the whole city. In SA, the error in zone A overshadows that of all the other zones. This is reasonable since zone A stands for more than a third of the city demand and causes an error approximately three times the size of that of the second worst zone. In other words, the errors are proportional to the hourly demand: the demand of a zone relative to the city demand is a good estimate of how large its prediction error will be relative to the errors of the other zones in the city.

The zone distribution also affects the mindset of the drivers in the city. In SA, a risk-averse driver could simply always wait for customers in zone A and probably do reasonably well. In NE there exists no obvious hot spot, and drivers are encouraged to roam between zones. With access to the locations of all taxis and the predicted demand, a machine learning model could be developed to predict the expected average wait time in a zone. This would benefit both the risk-averse drivers in SA and the roaming drivers in NE.

The actual value of the hourly demand in a zone impacts the size of the error metrics; therefore SA, with its higher mean and standard deviation, is expected to have a numerically higher error than NE. With the current definition of an accurate prediction, it is also expected that the accuracy in NE (64%) is better than in SA (40%). When comparing the total city demand over the whole period, the pattern in NE is not as evident as in SA, and the November peak in NE has no counterpart in SA. The regular temporal pattern in SA, which TDNet has captured, is also the reason why TDNet outperforms the benchmarks more clearly in SA than in NE.

Overfitting

A constant threat when working with supervised machine learning models is overfitting. By rigorously cross-validating the performance of the model during training, the threat can be kept at bay. Although graphs such as the one in figure 5.9 weren't drawn repeatedly during training, a check was in place which interrupted training if the cross-validation error, averaged over 100 iterations, didn't decrease for 2000 iterations. If overfitting had been a bigger issue for any of the two cities, which is usually indicated by the training loss being much lower than the cross-validation error, techniques to combat it would have been used.
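A minimal sketch of that early-stopping check is given below; the training loop is a stand-in, but the window size and patience mirror the values stated above.

```python
import random
from collections import deque

def training_loop(steps=30000):
    """Stand-in for the real training loop: yields a noisy,
    slowly decreasing cross-validation error per iteration."""
    for i in range(steps):
        yield 1.0 / (1 + i / 5000) + random.gauss(0, 0.05)

window = deque(maxlen=100)       # average the CV error over 100 iterations
best_avg = float("inf")
best_iteration = 0
patience = 2000                  # stop if no improvement for 2000 iterations

for iteration, cv_error in enumerate(training_loop()):
    window.append(cv_error)
    avg = sum(window) / len(window)
    if avg < best_avg:
        best_avg, best_iteration = avg, iteration
    elif iteration - best_iteration > patience:
        break                    # interrupt training, as described above
```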

6.4 Method Criticism

The field of machine learning is developing very quickly, and there are general issues of reliability, explainability and replicability which are yet to be tackled. A selection of these will be brought up in the context of this thesis, as well as issues and criticisms unique to it.

Batch Generation

The batch sampling was done with replacement, meaning that the model wasn't guaranteed to see all of the training data during one training epoch. The consequence of this is that the model converges more slowly; fortunately, it doesn't prevent convergence [44].
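A small sketch (assuming numpy and hypothetical sizes) shows the effect: sampling batch indices with replacement leaves a fraction of the training days unseen after one nominal epoch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_days, batch_size = 600, 32          # hypothetical training set size
n_batches = n_days // batch_size      # one nominal "epoch"

seen = set()
for _ in range(n_batches):
    batch = rng.choice(n_days, size=batch_size, replace=True)
    seen.update(batch.tolist())

# a sizeable share of the days is never drawn, so more iterations are
# needed before every day has influenced the model at least once
print(f"{len(seen)} of {n_days} days seen after one epoch")
```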

Even though the loss averaging window technique is applied to increase robustness, it is possible that the same day is used for cross-validation a very high number of times over 100 batches. Then a model which happens to accurately predict the demand of this specific day is chosen over a model which might be better overall. However, as the total size of the validation set is just below 100 days, it is expected that some days are selected a few times and some not at all.


Hyperparameter Tuning

The first step of hyperparameter tuning is defining the domain, i.e. the possible values of the hyperparameters. Although the TPE algorithm is able to find optimal values outside of the initial domain, given that the hyperparameters are defined as continuous distributions, it might not be able to do so in a limited number of iterations [22]. For the hyperparameters defined as discrete distributions or categorical choices, the TPE algorithm is confined to the values defined by the developer, and with a limited range and computational resources it is unlikely that the optimal values are among these. It should also be noted that the sampling of the hyperparameters is determined by the seed, or random state, of the TPE algorithm, which affects the convergence speed.
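As an illustration, a search space for the hyperopt library [3] could be defined as sketched below; the parameter names echo tables 5.3 and 5.4, but the domains and the stand-in objective are assumptions, not the ones used in the thesis.

```python
from hyperopt import fmin, tpe, hp, Trials

# the continuous step size is sampled log-uniformly; the discrete choices
# confine TPE to exactly the listed values, as discussed above
space = {
    "step_size": hp.loguniform("step_size", -8, -4),   # roughly 3e-4 .. 2e-2
    "channels": hp.choice("channels", [4, 8, 16]),
    "encode_decode": hp.choice("encode_decode", [16, 32, 64]),
    "filter_width": hp.choice("filter_width", [2, 3]),
}

def objective(params):
    # would train TDNet with `params` and return its cross-validation error;
    # replaced by a stand-in expression to keep the sketch self-contained
    return params["step_size"] * params["channels"]

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=Trials())
print(best)
```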

In a complex model such as the one described in this thesis, the number of hyperparameters to tune is high, and due to time and computational constraints only a subset of them was selected. At the time of writing, no algorithms or golden rules exist for knowing in advance which hyperparameters have the most significant impact on the performance of a machine learning model. As discussed in section 2.10, even slight variations of the same dataset can have a major impact on the importance of the hyperparameters [2]. With this in mind, it's impossible to determine whether the result achieved is close to or far away from the theoretical limit of predictability. Luckily, the paper cited in that section points out that only a few of all hyperparameters have a fundamental impact on the performance of a model.

Examples of hyperparameters which have been used with their default values and haven't been tuned are the choice of optimization algorithm, the number of restarts, the regularization parameter, the dropout rate, whether parameter averaging should be used on the weights, the number of validation batches, the average loss window size, the activation functions used in the different layers, and so on. From a validity standpoint this is dissatisfying, as it leaves much up to the intuition and estimations of the developer.

Unfortunately, few papers in the machine learning field release complete information regarding the values of their hyperparameters or trained parameters, and even fewer release the full source code of their models. The authors of the original WaveNet have chosen to go down this path of obfuscation as well, and thus the open-source implementations haven't been able to fully reproduce the results of the original paper.

Computational Limits

As mentioned in section 2.2, the cost of training a deep neural network such as TDNet can be considerable. Although the computer used to train the models had a strong consumer-grade GPU and additional hardware to support it, training TDNet with one set of hyperparameters took between 30 and 120 minutes, mostly depending on the number of iterations and the number of restarts. The most exhaustive tuning of TDNet searched for 100 iterations and was done for the city NE with demand, weather and holiday features using the RMSLE loss function; the process took approximately 100 hours. Hyperparameters were also searched for with only the demand as input, as well as for the other city, SA. The RMSE hyperparameter search for NE only ran for 20 iterations, and the cross-validation error of that model didn't beat the RMSE cross-validation error of the best RMSLE model.

For the stacked ensemble benchmark, training and hyperparameter tuning was done for a maximum of 20 minutes; if a longer time was set, the h2o framework threw a memory error before training was completed. The time required to generate predictions was negligible. The SARIMA model, on the other hand, required about a minute to find parameters and investigate whether the time series was non-stationary, and then an additional 90 minutes per zone to train and generate predictions, which led to a total of approximately 20 hours.


Data Handling and Feature Engineering

There were a few indicators that the quality of the data retrieved from the Dark Sky API was dubious. As mentioned in section 4.2, data was missing for almost a full year, hourly data had to be replaced with daily data, and two columns were removed due to them being full of zero entries. The data that was actually used has been analyzed and deemed adequate, but it might contain minor inaccuracies which could add up. The steps taken to alleviate the concerns raised by the data are described in the previously mentioned section.

A mistake was not creating yearly lags for the test set, i.e. that feature was full of zeros. However, the yearly lags for the first year of the training set were also zeros, as was the monthly lag for the first month of training, though this was unavoidable.

Benchmarks

Statistical models such as SARIMA usually enjoy increased performance when the parameters of the model are finely tuned [5]. In this case, squeezing performance out of the model wasn't prioritized, and the parameters found most appropriate when investigating the autocorrelation and partial autocorrelation plots were not updated once originally set for a zone in a city. Furthermore, the seasonal parameters were chosen according to guidelines by a field expert and not quantitatively evaluated and tuned, which would have been the preferred method [36].
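For reference, fitting one zone's SARIMA model follows the usual statsmodels pattern; the (p, d, q)(P, D, Q, s) order below is illustrative only, not the order actually chosen from the autocorrelation plots.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_zone(series: pd.Series):
    """Fit a SARIMA model to the hourly demand series of a single zone.
    The non-seasonal and seasonal orders here are placeholders; the
    seasonal period of 24 matches an hourly series with a daily cycle."""
    model = SARIMAX(series, order=(1, 0, 1), seasonal_order=(1, 1, 1, 24))
    return model.fit(disp=False)

# forecasting the next 26 hours would match the thesis's forecast horizon:
# result = fit_zone(zone_series)
# predictions = result.forecast(steps=26)
```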

Source Criticism

The sources which have been used to provide a theoretical background for the machine learning techniques used in this thesis are mostly well cited and written by experts such as A. Ng, Y. Bengio, J. Bergstra, I. Goodfellow and Y. LeCun. Some sources are classic books, such as "Artificial Intelligence: A Modern Approach" by S. Russell and P. Norvig or "Time Series Analysis: Forecasting and Control" by G. E. P. Box et al., which first came out in 1970. Occasionally there are sections in the theory where concrete examples are brought up, and in those cases the sources might be rare, but then the reasoning is backed by math. Due to the fast growth of the machine learning field, there are techniques which aren't as well understood theoretically as one might hope. As an example, the Adam optimizer, introduced in a paper with more than 20000 citations at the time of this writing, has been shown not to converge to the optimal solution for specific, but quite simple, tasks [45].

The papers which specifically focus on predicting taxi demand using statistical and machine learning methods aren't very well cited, due to the size of the field; however, the ones cited are among the most established. Most sources for taxi demand prediction have been found through Google Scholar or through related work sections in other papers.

The open-source implementation upon which this master thesis is built has been shown to work very well empirically in at least three previous cases, one of which led to a peer-reviewed paper being released [25]. It probably isn't bug-free, but it has been shown to deliver decent predictions in the context of taxi demand as well.

6.5 Comparing the Models

In the experiments conducted in this thesis, TDNet outperforms the benchmarks in terms of accuracy as measured by RMSE and RMSLE. From a theoretical standpoint, this is due to its ability to model non-linear relationships between the demand now and the demand during the previous hours, days, weeks and months [14]. The stacked ensemble, which is the best contender, doesn't inherently model these relationships but is able to non-linearly combine the features fed to it, and based on its performance, the date and time, day of week and day of month seem to contain useful information. It considers the historical demand but doesn't use, e.g., the demand 24 hours ago as an input feature.


SARIMA, on the other hand, is able to model and make use of the historical demand by, e.g., basing its predictions on a rolling average, but it can only produce a linear combination of these inputs. Furthermore, it can't be fed the day of week, date and time or any other feature outside of the pure demand time series.

To explain the results from a practical standpoint, the time invested in finding the best set of hyperparameters and implementing TDNet is far greater than that spent on the benchmarks. It can't be ruled out that the stacked ensemble would beat TDNet if it were fed features such as a rolling mean or lagged demand, or were able to train for a longer period of time. The same might be true for SARIMA regarding parameter tuning, but that is less likely.

The implementation costs, and where they occur, differ substantially between the models. SARIMA would ideally have a specific set of parameters for each zone in each city, and finding good values for these takes a long time. On one hand, it is unfortunate that this hasn't been done, as it weakens statements made about the superiority of the other algorithms; on the other hand, it speaks to the need for an iterative process when working with this type of model.

For SARIMA, the training process as implemented in the library makes use of neither the GPU nor all the CPU cores, which leads to very slow training. Since it makes use of historical data in the same way as TDNet, it would also require access to the taxi demand of the last day, which places sharper constraints on the system. TDNet only requires one set of hyperparameters for all zones in a city, and given that the demand of another city follows approximately the same distribution, those hyperparameters probably deliver decent results there too and don't have to be recalculated. However, the number of hyperparameters is vastly higher than for the benchmarks, and it isn't abstracted away by the framework, as is the case with h2o and the stacked ensemble. H2o relieves the developer of the duty of writing code to tune the hyperparameters of the models; it comes pre-implemented.

An attempt has been made in this thesis to find a sufficiently good set of hyperparameters for TDNet, but not all have been tuned, and it can't be guaranteed that the search has been conducted in the right value range. The hyperparameter tuning is the most expensive step of creating TDNet for a city, but it fully uses the GPU, and the implementation framework, Tensorflow, is state-of-the-art. The implementation is also very explicit, and the developer has almost full control of everything, which is positive but also makes it easier for bugs to sneak in.

When engineering a machine learning model, factors beyond which model produces the lowest error come into play. The cost of implementing a tuning process might easily outweigh the gain of a decreased prediction error. In that sense, the stacked ensemble as implemented by the automl function of the h2o platform has a solid advantage. This advantage comes at the cost of code control: the memory error encountered during the training process, which limited the training time to 20 minutes, lay deep within the framework, and an easy fix was not possible. Consequently, the training and tuning time was vastly reduced, which led to the stacked ensemble not living up to its full potential.
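For reference, the 20-minute budget corresponds to a single argument in h2o's AutoML API; the file and column names below are hypothetical.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")   # hypothetical training frame

# max_runtime_secs is the knob that had to be capped at 20 minutes
# to stay clear of the memory error described above
aml = H2OAutoML(max_runtime_secs=20 * 60, seed=1)
aml.train(x=["hour", "day_of_week", "day_of_month", "zone"],
          y="demand", training_frame=train)

predictions = aml.leader.predict(train)
```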

The advantages of the SARIMA model might not be as obvious as those of the other two, but from an implementation perspective its complexity is low in comparison to TDNet. Much is outsourced to the library used, and there are several online examples of tasks similar to taxi demand prediction following the same steps. When it comes to explainability, exact statements can be made about how many hours back in time are being used, to what extent they contribute to the prediction and what seasonal assumption was made.

6.6 Comparing TDNet to the Literature

A previously brought up alternative to a 1D-CNN such as TDNet is an LSTM. In a paper from 2018, J. Xu et al. predict the taxi demand in different areas of New York using an LSTM [55].


When using the same time step as TDNet, 60 minutes, they achieve an RMSE of about 2.4, which is extremely close to the RMSE of TDNet in the city NE. These two numbers aren't directly comparable since they originate from two different data sets, and the actual number of pick-ups affects the RMSE. Concretely, the maximum demand in a zone in one hour was about 12 times larger in NYC than in NE, and the standard deviation about 4 times larger, despite the fact that NYC was split into 6500 zones in comparison to 14. Their low error can partly be attributed to their forecast horizon of one hour, in comparison to 26 hours.

The authors of the paper go on to investigate what impact historical demand, weather, day of week, date and time, and drop-offs have on the accuracy of their LSTM. Historical demand was found to have the highest impact, followed by day of week, drop-offs, date and time, and lastly weather. Using all features in comparison to just using historical demand didn't significantly improve the performance, which showcases the importance of that feature.

Each trip in the data sets used in this thesis was connected to a timestamp, which made it trivial to add the date and time and day of week features. These features, in addition to historical demand, constituted the baseline TDNet. Adding holiday and weather information required more manual work, and a goal of this thesis was to determine whether the accuracy gained by including these features justifies the effort. The outcome was almost the same as in the paper by J. Xu et al.: in their case, the addition of features beyond pick-ups and temporal features, i.e. day of week and date and time, resulted in a marginally improved accuracy. In the case of TDNet, the accuracy also increased marginally when measured by RMSLE but decreased when measured by RMSE. To benchmark their LSTM, the authors made use of a feedforward neural network and a rolling mean. Similar to what has been found in this thesis, a standard supervised learning algorithm can perform at an accuracy close to that of a model made for time-series forecasting, while simpler models such as a rolling mean or a coarse SARIMA model struggle to perform at the same level, especially when the variance of the demand increases.

In the literature review in section 3.3, a paper has been summarized which provides instructions for calculating the limit of taxi demand predictability. This could be done for all cities where predicting taxi demand is of interest, in order to have a goal to aim for and a way to measure the relative success of a model. Doing this could also provide insight into what kind of model might be most suitable for the specific city, i.e. whether a machine learning model with additional features should be used or a pure time-series model is sufficient. The average limit of hourly predictability calculated for NYC is 83%, which is more than what is achieved by TDNet using a 26-hour forecast horizon [59]. From a validity standpoint, performing tests such as this one to statistically analyze the data, form a hypothesis and try to disprove it is good practice and follows the scientific process closely, especially in comparison to throwing big data at a black-box algorithm, fiddling with its settings and hoping it produces a lower error than before.

6.7 Improving TDNet

The simplest practical improvement would be to make full use of the instructions for mixed-precision GPU training, which will become available as soon as the dependency management system Anaconda receives an update for the cuDNN package. As it stands, some benefits are gained but the feature isn't completely supported. The second simplest improvement would be adjusting the batch generation process so that each batch is drawn without replacement, as discussed in section 6.4 on method criticism.

A theoretical improvement would be implementing the caching algorithm proposed by P. Ramachandran et al., which has made the sequence generation of the original WaveNet 21 times faster [43]. The idea is to cache calculations made in the nodes of the hidden layers that are repeated several times as new output is generated. If the output of a node in the second layer for time step $t$ depends on that of two nodes in the first layer, i.e. $p(a^2_t \mid a^1_t, a^1_{t-2})$, then the output of that node for time step $t+2$ would be $p(a^2_{t+2} \mid a^1_{t+2}, a^1_t)$. These two expressions share the term $a^1_t$, and the naive implementation used in this thesis calculates it twice; the caching algorithm calculates it once, stores the result and retrieves it for time step $t+2$. Doing this for all layers results in a considerable speed-up.
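A minimal sketch of the idea, not the full algorithm from [43]: each dilated layer keeps a queue of its most recent inputs, so the activation from $d$ steps back is retrieved instead of recomputed. The layer functions here are hypothetical placeholders.

```python
from collections import deque
import numpy as np

def make_caches(dilations, channels):
    # one queue per layer, holding that layer's inputs from the last d steps
    return [deque([np.zeros(channels)] * d, maxlen=d) for d in dilations]

def generate_step(x_t, layer_fns, caches):
    """One autoregressive generation step. Each f in `layer_fns` computes
    h = f(current_input, delayed_input); the delayed input comes from the
    cache instead of being recomputed."""
    h = x_t
    for f, cache in zip(layer_fns, caches):
        delayed = cache[0]   # activation stored d steps ago, reused here
        cache.append(h)      # becomes the delayed input d steps from now
        h = f(h, delayed)
    return h
```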

There are potential improvements which are general to CNNs; all of the following are applicable to TDNet, but many of them have been developed and evaluated on image data, which is higher-dimensional [18]. The first and last layers of TDNet are time-distributed dense layers which use the rectified linear activation function from equation 2.13. Although ReLU led to significant performance improvements when initially introduced, issues such as a node constantly producing zero gradients when x < 0 may slow down training. There are several suggested solutions to this issue; most rely on defining a gentle gradient for x < 0, and they have been shown to increase performance [54]. Thus, swapping the activation functions could improve TDNet.
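One such alternative is the leaky ReLU, sketched here with TensorFlow; whether it would actually improve TDNet is an open suggestion, not something evaluated in this thesis.

```python
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.5])

relu = tf.nn.relu(x)                      # gradient is exactly zero for x < 0
leaky = tf.nn.leaky_relu(x, alpha=0.01)   # gentle slope keeps gradients alive
```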

A technique to speed up training and improve accuracy known as batch normalization uses the same principles as the scaling of the input features described in section 4.4. The idea is to normalize the input to all layers after each mini-batch, which scales the parameters to a similar range. This makes it possible to use a higher learning rate without the gradients exploding and adds a slight regularizing effect which prevents overfitting [21].
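The core of the technique is a per-feature normalization over each mini-batch followed by a learned scale and shift [21]; a numpy sketch of the forward pass (gamma and beta would be trained parameters):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: mini-batch of shape (batch, features); gamma, beta: per-feature."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learned scale and shift

batch = np.random.randn(32, 8) * 10 + 5       # badly scaled activations
out = batch_norm_forward(batch, gamma=np.ones(8), beta=np.zeros(8))
```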

The problem statement of short-term demand could be reinterpreted so as not to entail a 26-hour forecast horizon. Forecasting one or two hours ahead with access to the current demand would most likely greatly increase the accuracy of TDNet as well as of the benchmarks. In a paper by L. Moreira-Matias et al., an aggregated error measurement of 26% is achieved for a task very similar to the one undertaken in this thesis, using a stacked ensemble. The ensemble contains, among three other models, an ARIMA, and the forecast horizon is 30 minutes [34]. This shorter horizon allows special irregular events which increase the demand over a day to be taken into account. Furthermore, the systems of TaxiCaller register future bookings, and feeding the bookings placed far ahead of time to the model would help it predict short-term bookings and street-hailing demand.

6.8 The work in a wider context

As with all machine learning tasks, the "ground truth" or data set is only an approximation of reality and might not contain the actual truth. Real-world biases are most likely represented in the data, and this might lead to unwanted consequences. As an example, taxi drivers might refuse to pick up certain customers or avoid certain neighbourhoods due to racial discrimination [37]. This makes the demand data misleading and might force potential customers to walk to another area to get picked up or to use other modes of transportation. A machine learning model could pick up this bias against certain zones as well, and if drivers rely on it in the future, they might continue to avoid these neighbourhoods because the model predicts that there is no demand, even though they are willing to go there. Bookings and unprejudiced taxi drivers luckily work as a natural countermeasure against these issues.

Currently, a more experienced driver is expected to have a better understanding of where the customers are and therefore to be better at predicting the demand for taxis in a city [15]. A consequence of deploying a machine learning model which helps drivers is that the gap between new and experienced drivers would decrease. The senior drivers might be annoyed, since they have worked longer and might feel entitled to an advantage. An experienced driver would still have an advantage over inexperienced ones in interpreting factors other than historical demand, because when it comes to demand data, TDNet's knowledge trumps that of any single driver. Examples of factors unavailable to TDNet are information about events and special occasions, schedules for public transport, city-specific knowledge, competitors and visual input.


In the long term, an autonomous taxi fleet could benefit greatly from an algorithm which accurately predicts taxi demand or, by extension, transportation demand. This would help position the cars efficiently and make sure that the supply matches the demand. T. Litman brings up several possible consequences of society widely adopting autonomous vehicles. Benefits include increased mobility and safety as well as reduced traffic, energy use and need for public transport, but there is also the possibility of increased emissions and less infrastructure to support traveling by bike or on foot. Of course, an autonomous fleet, or at least one with self-driving vehicles, would significantly decrease human involvement, meaning that the algorithm would contribute to people losing their jobs [31]. Whether the positives will outweigh the negatives is yet to be seen, but society will definitely change.


7 Conclusion

The aim of this study was to apply TDNet, a machine learning model with a WaveNet architecture, to the task of predicting short-term taxi demand in cities and to evaluate its performance. This has been done by exploring, cleaning and creating features from two taxi demand data sets and by modifying an open-source implementation of WaveNet. To improve the performance of TDNet, a Bayesian optimization algorithm for hyperparameter tuning has been used. With the best set of hyperparameters found, the taxi demand for the next two months in different zones of a city was forecast. The predictions have been analyzed and discussed, and this chapter presents the final conclusions.

7.1 Connection to Research Questions

The first question, regarding how well short-term taxi demand can be predicted, has been answered by training TDNet on historical demand data as well as additional data which potentially impacts taxi demand. Furthermore, a hyperparameter tuning algorithm known as a Tree Parzen Estimator has been run for 100 iterations to improve the performance of TDNet. With these hyperparameters, TDNet was able to predict the taxi demand within +1/-1 of the true value in 64% of the cases in the city NE and 40% in the city SA. In addition, two different error metrics have been used to provide a wider perspective on how accuracy should be interpreted in this domain. The RMSE is a safe choice which punishes over- and underestimations equally, whereas the RMSLE depends on what the true demand is and can lead to very conservative predictions if the hourly demand for a zone is low. With a high average demand, the RMSLE is expected to predict low demand better; but as has been the case for these two cities, the average demand is only two and four with standard deviations of three and seven, and therefore RMSE-trained models, which are better at predicting peaks in demand, are preferred.

The second question, about how the distribution of taxi demand between zones in a city affects performance and how features other than demand impact performance, has been answered by conducting two experiments. The first consisted of evaluating TDNet in two cities with different demand distributions between zones; the outcome was that the average hourly demand and its standard deviation impact the prediction accuracy more than the distribution between zones. The second experiment consisted of measuring the accuracy of TDNet when trained with access to demand, holiday and weather data and comparing it with the accuracy of a TDNet trained on demand only.


For the city NE, weather and holiday features didn't improve the accuracy of TDNet.

The third question, which puts the performance of TDNet in relation to existing time-series forecasting models, has been answered by using a SARIMA model and a stacked ensemble of supervised machine learning models to predict taxi demand. TDNet beat both benchmarks in both cities, but the margin was small in NE, and the computational resources spent on the best benchmark, the stacked ensemble, were limited in comparison to those spent on TDNet. From a theoretical standpoint, the superior performance of TDNet can be explained by it using both temporal features, such as time of day, day of week, year and day of month, and historical demand. Furthermore, it conditions the demand in one zone on the demand in other zones. This separates it from the stacked ensemble, which relies solely on temporal features, and from SARIMA, which only uses historical demand.

The implementation complexity of TDNet is much higher than that of the benchmarks, but it performed slightly better in NE and significantly better in SA. If the gain in accuracy outweighs the cost of maintaining a complex model, then TDNet should be used. Otherwise, the stacked ensemble as implemented by h2o should be the preferred choice due to its low complexity and the fact that it doesn't rely on lagged demand input features. That means that last week's demand doesn't need to be available for it to predict the demand of the upcoming hour, which separates it from TDNet and SARIMA. This relaxed constraint makes it even easier to use.

This thesis presents another example of a domain where a WaveNet architecture has been applied successfully to generate time series which can be used for prediction. It provides a discussion of the impact of different loss functions tied to a concrete example, and it provides evidence that neither weather nor holiday features improve taxi demand prediction power, given that historical demand and temporal features are available.

7.2 Future Research

As mentioned in section 2.1 on taxi demand, the total taxi demand of any given city is unknown. Furthermore, most taxi demand data sets aren't publicly available. There are a few exceptions, such as the one week of available data for Beijing in 2008, which has been used in multiple studies [58, 57]. A taxi demand data set of larger size, which is updated continuously, is that of New York City. The NYC Taxi and Limousine Commission forces all taxi companies to submit their bookings and then releases them in publicly available data sets, split based on the kind of taxi. The famous yellow cabs are, for example, only allowed to pick up hailing customers in the central parts of the city, while a category called for-hire vehicles accepts only bookings [8]. Unfortunately, the data sets don't contain complete records of the bookings of competing ride-sharing companies such as Uber. Nonetheless, these data sets would ideally serve as standard benchmarks for the task of predicting taxi demand, even for algorithms built specifically for predicting demand in, e.g., Porto, Tokyo or Bengaluru, India [34, 24, 12]. Papers such as the ones by J. Xu et al. and K. Zhao et al. use the NYC data set and are therefore much easier to compare [55, 59]. To determine the viability of TDNet and concretely put it in relation to state-of-the-art alternatives, it should be benchmarked on this larger data set.

Predicting the zone wait time instead of just the demand cuts closer to the heart of the supply and demand problem, and running such a system in real time would be challenging. Moreira-Matias et al. ran their taxi demand prediction system in real time for a few months, but most models never get deployed and used in reality [34]. Including the real-time location of drivers and predicting the wait time would be possible and could yield interesting results.


Bibliography

[1] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencieswith gradient descent is difficult”. In: IEEE transactions on neural networks 5.2 (1994),pp. 157–166.

[2] James Bergstra and Yoshua Bengio. “Random search for hyper-parameter optimiza-tion”. In: Journal of Machine Learning Research 13.Feb (2012), pp. 281–305.

[3] James Bergstra, Dan Yamins, and David D Cox. “Hyperopt: A python library for opti-mizing the hyperparameters of machine learning algorithms”. In: Proceedings of the 12thPython in science conference. Citeseer. 2013, pp. 13–20.

[4] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. “Conditional time seriesforecasting with convolutional neural networks”. In: arXiv preprint arXiv:1703.04691(2017).

[5] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time seriesanalysis: forecasting and control. John Wiley & Sons, 2015.

[6] Stefan Burgstaller, Demian Flowers, Tamberrino David, and Yipeng Terry Heath P.andYang. “Rethinking Mobility: The ’pay as you go’ car: Ride hailing just the start”. In:Venture Capital Horizons (2017).

[7] Rich Caruana and Alexandru Niculescu-Mizil. “An empirical comparison of super-vised learning algorithms”. In: Proceedings of the 23rd international conference on Machinelearning. ACM. 2006, pp. 161–168.

[8] NYC Taxi Limousine Commision. Vehicle Licenses. URL: https://www1.nyc.gov/site/tlc/vehicles/get-a-vehicle-license.page (visited on 03/27/2019).

[9] Corporación Favorita Grocery Sales Forecasting. https : / / www . kaggle . com / c /favorita-grocery-sales-forecasting. Accessed: 2018-11-29.

[10] Judd Cramer and Alan B Krueger. “Disruptive change in the taxi business: The case ofUber”. In: American Economic Review 106.5 (2016), pp. 177–82.

[11] W. Dally. High Performance Hardware for Machine Learning. Dec. 2015. URL: https ://media.nips.cc/Conferences/2015/tutorialslides/Dally- NIPS-Tutorial-2015.pdf.

[12] Neema Davis, Gaurav Raina, and Krishna Jagannathan. “A multi-level clustering ap-proach for forecasting taxi travel demand”. In: Intelligent Transportation Systems (ITSC),2016 IEEE 19th International Conference on. IEEE. 2016, pp. 223–228.

55

Page 65: TDNet - A Generative Model for Taxi Demand Prediction1334506/... · 2019. 7. 2. · Linköpings universitet SE–581 83 Linköping +46 13 28 10 00 , Linköping University | Department

Bibliography

[13] David A Dickey and Wayne A Fuller. “Distribution of the estimators for autoregressivetime series with a unit root”. In: Journal of the American statistical association 74.366a(1979), pp. 427–431.

[14] Vincent Dumoulin and Francesco Visin. “A guide to convolution arithmetic for deeplearning”. In: arXiv preprint arXiv:1603.07285 (2016).

[15] Henry S Farber. “Why you can’t find a taxi in the rain and other labor supply lessonsfrom cab drivers”. In: The Quarterly Journal of Economics 130.4 (2015), pp. 1975–2026.

[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[17] Klaus Greff, Rupesh K Srivastava, Jan Koutnık, Bas R Steunebrink, and Jürgen Schmid-huber. “LSTM: A search space odyssey”. In: IEEE transactions on neural networks andlearning systems 28.10 (2017), pp. 2222–2232.

[18] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai,Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. “Recent advances in convolu-tional neural networks”. In: Pattern Recognition 77 (2018), pp. 354–377.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning forimage recognition”. In: Proceedings of the IEEE conference on computer vision and patternrecognition. 2016, pp. 770–778.

[20] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural com-putation 9.8 (1997), pp. 1735–1780.

[21] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep networktraining by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

[22] Donald R Jones. “A taxonomy of global optimization methods based on response sur-faces”. In: Journal of global optimization 21.4 (2001), pp. 345–383.

[23] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Ra-minder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. “In-datacenterperformance analysis of a tensor processing unit”. In: 2017 ACM/IEEE 44th Annual In-ternational Symposium on Computer Architecture (ISCA). IEEE. 2017, pp. 1–12.

[24] Yuki Oyabu Kaz Sato. Now live in Tokyo: using TensorFlow to predict taxi demand. URL:https://cloud.google.com/blog/products/gcp/now-live-in-tokyo-using-tensorflow-to-predict-taxi-demand (visited on 03/29/2019).

[25] Glib Kechyn, Lucius Yu, Yangguang Zang, and Svyatoslav Kechyn. “Sales forecastingusing WaveNet within the framework of the Kaggle competition”. In: arXiv preprintarXiv:1803.04037 (2018).

[26] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In:arXiv preprint arXiv:1412.6980 (2014).

[27] John R Koza, Forrest H. Bennett, David Andre, and Martin A. Keane. Automated Designof Both the Topology and Sizing of Analog Electrical Circuits Using Genetic Programming.Artificial Intelligence in Design. Springer, Dordrecht, 1996.

[28] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature 521.7553(2015), p. 436.

[29] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. “Gradient-based learn-ing applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[30] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. “Independently recurrentneural network (indrnn): Building A longer and deeper RNN”. In: Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 5457–5466.

56

Page 66: TDNet - A Generative Model for Taxi Demand Prediction1334506/... · 2019. 7. 2. · Linköpings universitet SE–581 83 Linköping +46 13 28 10 00 , Linköping University | Department

Bibliography

[31] Todd Litman. Autonomous vehicle implementation predictions. Victoria Transport PolicyInstitute Victoria, Canada, 2017.

[32] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. “Traffic flowprediction with big data: a deep learning approach”. In: IEEE Transactions on IntelligentTransportation Systems 16.2 (2015), pp. 865–873.

[33] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. “Long short-term memory neural network for traffic speed prediction using remote microwave sen-sor data”. In: Transportation Research Part C: Emerging Technologies 54 (2015), pp. 187–197.

[34] Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and LuisDamas. “Predicting taxi–passenger demand using streaming data”. In: IEEE Transac-tions on Intelligent Transportation Systems 14.3 (2013), pp. 1393–1402.

[35] Vinod Nair and Geoffrey E Hinton. “Rectified linear units improve restricted boltz-mann machines”. In: Proceedings of the 27th international conference on machine learning(ICML-10). 2010, pp. 807–814.

[36] Robert Nau. General seasonal ARIMA models. URL: https://people.duke.edu/~rnau/seasarim.htm (visited on 05/02/2019).

[37] William Neuman. New York Office to Address Discrimination by Taxis and For-Hire Vehicles.URL: https://www.nytimes.com/2018/07/31/nyregion/uber-taxis-minorities-bias-refusal-nyc.html (visited on 05/25/2019).

[38] Andrew Ng. Train Test Data Split - Improving Deep Neural Networks, Hyperparameter tun-ing, Regularization and Optimization. URL: https://www.coursera.org/lecture/deep-neural-network/train-dev-test-sets-cxG1s (visited on 04/07/2019).

[39] Andrew Ng. Tuning Process - Improving Deep Neural Networks, Hyperparameter tuning,Regularization and Optimization. URL: https://www.coursera.org/lecture/deep-neural-network/tuning-process-dknSn (visited on 05/15/2019).

[40] NVIDIA. CUDA. URL: https://developer.nvidia.com/cuda-zone (visited on03/25/2019).

[41] NVIDIA. cuDNN. URL: https://developer.nvidia.com/cudnn (visited on03/25/2019).

[42] Robi Polikar. “Ensemble based systems in decision making”. In: IEEE Circuits and sys-tems magazine 6.3 (2006), pp. 21–45.

[43] Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, ShiyuChang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas SHuang. “Fast generation for convolutional autoregressive models”. In: arXiv preprintarXiv:1704.06001 (2017).

[44] Benjamin Recht and Christopher Ré. “Beneath the valley of the noncommutativearithmetic-geometric mean inequality: conjectures, case-studies, and consequences”.In: 2012.

[45] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. “On the convergence of Adam and beyond”. In: arXiv preprint arXiv:1904.09237 (2019).

[46] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2009. Chap. 1. ISBN: 0136042597.

[47] Hasim Sak, Andrew Senior, and Françoise Beaufays. “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition”. In: arXiv preprint arXiv:1402.1128 (2014).

[48] Jürgen Schmidhuber. “Deep learning in neural networks: An overview”. In: Neural Networks 61 (2015), pp. 85–117.

[49] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. “WaveNet: A generative model for raw audio”. In: SSW. 2016, p. 125.

[50] Sean Vasquez. Web traffic forecasting. URL: https://github.com/sjvasquez/web-traffic-forecasting (visited on 03/25/2019).

[51] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. “Extracting and composing robust features with denoising autoencoders”. In: Proceedings of the 25th international conference on Machine learning. ACM. 2008, pp. 1096–1103.

[52] M Mitchell Waldrop. “The chips are down for Moore’s law”. In: Nature News 530.7589 (2016), p. 144.

[53] Rüdiger Wirth and Jochen Hipp. “CRISP-DM: Towards a standard process model for data mining”. In: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining. Citeseer. 2000, pp. 29–39.

[54] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853 (2015).

[55] Jun Xu, Rouhollah Rahmatizadeh, Ladislau Bölöni, and Damla Turgut. “Real-time prediction of taxi demand using recurrent neural networks”. In: IEEE Transactions on Intelligent Transportation Systems 19.8 (2018), pp. 2572–2581.

[56] Fisher Yu and Vladlen Koltun. “Multi-scale context aggregation by dilated convolutions”. In: arXiv preprint arXiv:1511.07122 (2015).

[57] Jing Yuan, Yu Zheng, Xing Xie, and Guangzhong Sun. “Driving with knowledge from the physical world”. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2011, pp. 316–324.

[58] Jing Yuan, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, and Yan Huang. “T-drive: driving directions based on taxi trajectories”. In: Proceedings of the 18th SIGSPATIAL International conference on advances in geographic information systems. ACM. 2010, pp. 99–108.

[59] Kai Zhao, Denis Khryashchev, Juliana Freire, Cláudio Silva, and Huy Vo. “Predicting taxi demand at high spatial resolution: approaching the limit of predictability”. In: 2016 IEEE International Conference on Big Data (Big Data). IEEE. 2016, pp. 833–842.

[60] Zheng Zhao, Weihai Chen, Xingming Wu, Peter CY Chen, and Jingmeng Liu. “LSTM network: a deep learning approach for short-term traffic forecast”. In: IET Intelligent Transport Systems 11.2 (2017), pp. 68–75.

Glossary

CNN Convolutional Neural Network. 2, 11–14, 16, 19, 49, 51

RMSE Root Mean Square Error; the standard formula is given after this glossary. viii, 6, 15, 29, 31–36, 38–48, 50, 53

RMSLE Root Mean Square Logarithmic Error; the standard formula is given after this glossary. viii, 29, 31–34, 36, 38, 39, 43–45, 47, 48, 50, 53

WaveNet The neural network on which TDNet is based, see section 2.8 for a detailed description. 2, 4, 13, 14, 19, 20, 24, 27, 28, 47, 50, 53, 54
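For quick reference, the formulas below give the two error metrics above in their conventional textbook form; this is a sketch of the standard definitions, and the exact variants used in this thesis are those fixed in its evaluation chapter. Here $y_i$ denotes the observed demand, $\hat{y}_i$ the predicted demand, and $n$ the number of predictions:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\left(\hat{y}_i + 1\right) - \ln\left(y_i + 1\right)\right)^2}
$$

The $+1$ inside the logarithm keeps RMSLE defined when demand is zero, and the logarithm damps the influence of large absolute errors in high-demand zones, which is why the two metrics complement each other.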
