Towards a representative benchmark for time series databases


CONFIDENTIAL UP TO AND INCLUDING 03/01/2017 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY

Towards a representative benchmark for time series databases

Thomas Toye
Student number: 01610806

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: elektronica-ICT

Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck
Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen

Academic year 2018-2019



Preface

I would like to thank my supervisors, Prof. dr. Bruno Volckaert and Prof. dr. ir.

Filip De Turck.

I am very grateful for the help and guidance of my counsellors, Dr. ir. Joachim

Nielandt and Jasper Vaneessen.

I would also like to thank my parents for their support, not only during the writing

of this dissertation, but also during my transition programme and my master’s.

The author gives permission to make this master dissertation available for consul-

tation and to copy parts of this master dissertation for personal use. In all cases

of other use, the copyright terms have to be respected, in particular with regard to

the obligation to state explicitly the source when quoting results from this master

dissertation.

Thomas Toye, June 2019


Towards a representative benchmark

for time series databases

Thomas Toye

Master’s dissertation submitted in order to obtain the academic degree of

Master of Science in de industriële wetenschappen:

elektronica-ICT

Academic year 2018–2019

Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck

Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen

Summary

As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. This dissertation argues that these non-representative benchmarks may not always accurately model real world performance, and instead, representative workloads should be used in TSDB benchmarks. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation). A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not be accurate for real world performance.

Keywords

Time series database, representative benchmarking, load testing


Towards a representative benchmark for time series databases

Thomas Toye

Supervisor(s): Bruno Volckaert, Filip De Turck

Abstract— As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. We argue that these non-representative benchmarks may not always accurately model real world performance, and instead, representative workloads should be used in TSDB benchmarks. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation).

A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not be accurate for real world performance.

Keywords— Time series database, representative benchmarking, load testing

I. INTRODUCTION

TIME SERIES DATABASES provide storage and interfacing for time series. In its simplest form, time series data are just data with an attached timestamp. This subtype of data has seen increasing interest in the last decade, especially with the rise of the Internet of Things, which produces time series for everything from temperature to sea levels. Other areas where time series are used are the financial industry (e.g. historical analysis of stock performance), the DevOps industry (e.g. capture of metrics from a server fleet) and the analytics industry (e.g. tracking ad performance over time).

Finding the best database to use is not an easy task. Eighty-three existing TSDBs were found by Bader et al. [1]. To determine the best one, benchmarks are used. However, these benchmarks may not be representative of the use case or industry the TSDB is needed for, which makes their results difficult to generalize.

In this abstract, we will first analyze existing TSDB benchmarks. Then, a new benchmark is proposed, which compares representative workloads to non-representative workloads. The results of this benchmark are analysed to determine whether non-representative workloads accurately predict performance for representative workloads.

II. EVALUATION OF EXISTING BENCHMARKS

Chen et al. [2] consolidate the properties of a good benchmark as follows: 1. Representative: Benchmarks must simulate real world conditions; both the input to a system and the system itself should be representative of real world usage. 2. Relevant: Benchmarks must measure relevant metrics and technologies. Results should be useful to compare widely-used solutions. 3. Portable: Benchmarks should provide a fair comparison by being easily extensible to competing solutions that solve comparable problems. 4. Scalable: Benchmarks must be able to measure performance in a wide range of scale: not just single-node performance, but also cluster configurations. 5. Verifiable: Benchmarks should be repeatable and independently verifiable. 6. Simple: Benchmarks must be easily understandable, while making choices that do not affect performance.

Existing TSDB benchmarks were evaluated; a summary is shown in Table I. Two gaps in the state of the art are clear: current benchmarks insufficiently test TSDB performance at scale, and current benchmarks are not representative or only representative for a single use case. The data used is either random or synthetic; real world data are not used. This begs the question: are results of a non-representative benchmark generalizable to real world performance?

Benchmark              Representative             Relevant  Portable  Scalable  Verifiable  Simple
TS-Benchmark           For IoT use cases          ✓         ✓         ✗         ✓           ✓
IoTDB-benchmark        ✗                          ✓         ✓         ✗         ✓           ✓
TSDBBench              ✗                          ✓         ✓         ✓         ✓           ✗
FinTime                For financial use cases    ✓         ✓         ✗         ✗           ✗
influxdb-comparisons   For DevOps use cases       ✓         ✓         ✗         ✓           ✓

TABLE I
EVALUATION OF EXISTING TSDB BENCHMARKS

III. BENCHMARK COMPONENTS

A new benchmark is developed to compare benchmark performance between representative and non-representative workloads. Workloads consist of a workload data set that is loaded into the TSDB and a workload query set that executes upon it.

A. Data set

Time series data sets have the following properties in common: data arrives in order, updates are very rare to non-existent, deletion is rare, and data values follow a pattern. They differ on the following characteristics:

Page 7: Towards a representative benchmark for time series databases

• Metrics: Data points are organized in metrics, which can be compared to tables in relational databases.

• Regularity: In regular time series, data points are spaced evenly in time. Irregular time series do not emit data points regularly. Irregular time series are often the result of event triggers.

• Volume: High volume time series may emit hundreds of thousands of data points a second, while low volume time series only emit one event a day.

• Data type: Traditionally, values of data points in a time series have been integers or floating point numbers. But they can also be booleans, strings or even custom data types.

• Tags: A time series data point may have one or more tags associated with the timestamp and value. There may be no tags or a lot of tags. Tags may hold special values, such as geospatial information.

• Tag value cardinality: The number of possible combinations the tag values make. Three tags with two possible values each make a tag value cardinality of eight.

• Variation: While time series data usually follow a pattern, the variation in a series may be very different. One series may describe a flat line, while another may describe seasonal variations with daily spikes.

B. Query set

Bader et al. describe ten distinct TSDB query capabilities in [1]. These building blocks (e.g. update, delete, select from a time range) can form time series queries (e.g. select the mean of temperature values from last year, aggregated by day). Next to the queries themselves, the relative frequency is an important part of the query set.

C. Measurement characteristics

Measurement characteristics describe the performance metrics that are monitored to quantify performance. For TSDB benchmarks, common metrics include response latency (mean, 95th and 99th percentile, etc.), response size, data ingestion speed, and storage efficiency.

IV. A REPRESENTATIVE BENCHMARK

A benchmark was created with representativeness as its design goal. It compares three representative workloads to a non-representative workload to investigate possible performance differences. Three real world data sets, from domains in which TSDBs are prevalent, are used, next to a baseline. The baseline is a non-representative data set, with random values and tags. For every data set, twenty queries are written, relevant to the data set’s domain (e.g. getting the average rating for a movie in the ratings data set), except for the baseline, for which a single query is used. Vegeta [3] was used to capture response latency (mean and 95th percentile), response times, and response size. The http_load program [4] was used for load testing. Standard UNIX tools were used for storage efficiency analysis. Four TSDBs are tested: InfluxDB, OpenTSDB, KairosDB with Cassandra as a backing database, and KairosDB with ScyllaDB. These are modern, open source databases with an HTTP interface.

Table II shows an overview of the data sets used. The baseline is a data set with random values and tags, the financial data set uses historical stock market information, the rating data set uses movie reviews and the IoT data set is produced by power information for a house.

                        Baseline   Financial   Rating     IoT
Metrics                 1          6           1          7
Regularity              Regular    Semi-reg.   Irregular  Regular
Volume                  Low        Low         Low        Low
Tags                    2          1           5          0
Tag value cardinality   10,000     7,164       20M        0
Variation               High       Low         High       Low
Total data points       20M        74.4M       20M        14.5M
License                 NA         CC0         Custom     CC-BY-4

TABLE II
OVERVIEW OF WORKLOAD DATA SETS

V. EVALUATION

A. Storage efficiency

Figure 1 shows relative storage efficiency. The size in bytes per data point was compared to the size per data point in the comma-separated value (CSV) source. The input size was one million data points for every data set. It shows that representative data sets have different storage efficiency than the reference. OpenTSDB is better at storing real world data sets than synthesized data, InfluxDB much worse. Tag value cardinality and data point value variation are thought to have a high impact on storage efficiency.

[Figure 1: grouped bar chart of relative size per data point for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB on the Baseline, IoT, Financial and Ratings data sets, relative to the CSV source (CSV = 1).]

Fig. 1. Relative storage efficiency of different TSDBs per data point compared to the CSV source format.

B. Data ingestion throughput

For every data set, one million data points were loaded into each TSDB and ingestion speed was measured (in data points per second). The results are shown in Figure 2. For the representative ratings workload, performance is degraded, especially for InfluxDB. This is a data set with high tag cardinality and complex tag values.

[Figure 2: grouped bar chart (log scale) of data points ingested per second for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB on the Baseline, IoT, Financial and Ratings data sets.]

Fig. 2. Data points ingested per second. Data sets used were one million data points each.

C. Load testing

Figure 3 shows results of the load test. The results for OpenTSDB are surprising: it performed well for the baseline and IoT query workloads, but not for the financial and ratings query workloads. For the latter two workloads, the time ranges are very broad, so the database has to scan more data. The other TSDBs may be able to optimize this operation better.

[Figure 3: grouped bar chart (log scale) of maximum requests per second for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB under the Baseline, IoT, Financial and Ratings query workloads.]

Fig. 3. Maximum requests per second. Tests were performed on data sets one million data points in size.

D. Response latency

Figure 4 shows the mean response latency when using a representative query set. A performance degradation for OpenTSDB surfaces for the financial and ratings query workloads, which use broad time ranges. Otherwise, the baseline is a good predictor for relative performance in the representative benchmarks. This is attributed to the same cause as in Section V-C.

[Figure 4: grouped bar chart (log scale) of mean response latency in milliseconds for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB under the Baseline, IoT, Financial and Ratings query workloads.]

Fig. 4. Mean latency per request.

E. Response size

Figure 5 shows the mean response size of TSDBs in bytes. The mean response size is correlated with the data set. The size differences for large responses (e.g. financial workload) can be attributed mainly to timestamp encoding.

[Figure 5: grouped bar chart (log scale) of mean response size in bytes for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB under the Baseline, IoT, Financial and Ratings query workloads.]

Fig. 5. Mean size in bytes of the TSDB response.

VI. CONCLUSIONS

Compared to a baseline non-representative workload, representative workloads showed significant performance differences when it came to storage efficiency, data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used). Existing TSDB benchmarks do not use representative workloads, thus their relevance may be called into question.

The fact that not all representative workloads show performance impact highlights the importance of using multiple representative workloads for general TSDB benchmarks - just one representative workload may not be enough to highlight possible deviations or performance degradations.

It is impractical to create a representative workload for every domain, but TSDB workloads can be characterized by workload parameters. Further research is needed to determine if these parameters are enough to accurately describe a TSDB workload and thus generalize results of one workload to another with the same workload parameters.

REFERENCES

[1] Andreas Bader, Oliver Kopp, Michael Falkenthal, Survey and Comparison of Open Source Time Series Databases, Datenbanksysteme für Business, Technologie und Web (BTW 2017) – Workshopband.

[2] Yanpei Chen, Francois Raab, Randy Katz, From TPC-C to Big Data Benchmarks: A Functional Workload Model, Specifying Big Data Benchmarks, WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg.

[3] Tomas Senart, Vegeta – HTTP load testing tool and library,https://github.com/tsenart/vegeta

[4] Jef Poskanzer, http_load, https://acme.com/software/http_load/


Contents

Preface iv

Abstract v

Extended abstract vi

Table of Contents ix

1 Introduction 1

2 Literature review 2

2.1 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1.1 Database Management Systems . . . . . . . . . . . . . . . . 2

2.1.2 Relational databases . . . . . . . . . . . . . . . . . . . . . . 2

2.1.3 Non-relational databases . . . . . . . . . . . . . . . . . . . . 3

2.1.4 NewSQL databases . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.5 Time series databases . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Time series database benchmarks . . . . . . . . . . . . . . . . . . . 4

2.2.1 TS-Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.2 IoTDB-benchmark . . . . . . . . . . . . . . . . . . . . . . . 5

2.2.3 TSDBBench . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.4 FinTime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.5 influxdb-comparisons . . . . . . . . . . . . . . . . . . . . . . 7

2.2.6 STAC-M3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 State of the art 10

3.1 Uses of time series databases . . . . . . . . . . . . . . . . . . . . . . 10

3.1.1 TSDB usage as a data store . . . . . . . . . . . . . . . . . . 10


3.1.2 Inherent time series database functions used . . . . . . . . . 11

3.1.3 Common characteristics of time series data . . . . . . . . . . 12

3.1.4 Differing characteristics of time series data . . . . . . . . . . 12

3.1.5 Industry use cases . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 A “good” benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Existing benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3.1 TS-Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.3.2 IoTDB-benchmark . . . . . . . . . . . . . . . . . . . . . . . 17

3.3.3 TSDBBench/YCSB-TS . . . . . . . . . . . . . . . . . . . . . 18

3.3.4 FinTime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3.5 influxdb-comparisons . . . . . . . . . . . . . . . . . . . . . . 19

3.4 Evaluation of existing benchmarks . . . . . . . . . . . . . . . . . . . 20

3.4.1 On scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.4.2 On representativeness . . . . . . . . . . . . . . . . . . . . . . 21

3.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 A new benchmark 23

4.1 Benchmark components . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1.1 Workload data set characteristics . . . . . . . . . . . . . . . 23

4.1.2 Workload query characteristics . . . . . . . . . . . . . . . . 24

4.1.3 Measurement characteristics . . . . . . . . . . . . . . . . . . 24

4.2 Design of a representative data workload . . . . . . . . . . . . . . . 25

4.2.1 A baseline workload . . . . . . . . . . . . . . . . . . . . . . 25

4.2.2 A financial time series workload . . . . . . . . . . . . . . . . 26

4.2.3 A rating system workload . . . . . . . . . . . . . . . . . . . 27

4.2.4 An IoT workload . . . . . . . . . . . . . . . . . . . . . . . . 28

4.2.5 Workload data set overview . . . . . . . . . . . . . . . . . . 29

4.2.6 Data set pre-processing . . . . . . . . . . . . . . . . . . . . . 29

4.3 Design of a representative query workload . . . . . . . . . . . . . . 30

4.3.1 Queries for the baseline workload . . . . . . . . . . . . . . . 30

4.3.2 Queries for the financial workload . . . . . . . . . . . . . . . 31

4.3.3 Queries for the rating workload . . . . . . . . . . . . . . . . 31

4.3.4 Queries for the IoT workload . . . . . . . . . . . . . . . . . 32

4.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.5 Technical implementation . . . . . . . . . . . . . . . . . . . . . . . 33

4.5.1 Test environment . . . . . . . . . . . . . . . . . . . . . . . . 33

4.5.2 Data ingestion . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.5.3 Load and latency testing . . . . . . . . . . . . . . . . . . . . 34


4.6 Design evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Results 36

5.1 Storage efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2 Data ingestion throughput . . . . . . . . . . . . . . . . . . . . . . . 39

5.3 Load testing with query workload . . . . . . . . . . . . . . . . . . . 40

5.4 Response latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.5 Mean response size . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6 Conclusions and future work 48

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

A Detailed results 51

A.1 Data ingestion throughput . . . . . . . . . . . . . . . . . . . . . . . 51

A.2 Storage efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

A.3 Load testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

A.4 Response latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

A.5 Mean response size . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Bibliography 54

List of Abbreviations 57

List of Figures 59

List of Tables 60


Chapter 1

Introduction

Time series databases provide storage and interfacing for time series. In its simplest

form, time series data are just data with an attached timestamp. This subtype of

data has seen increasing interest in the last decade, especially with the rise of the

Internet of Things, which produces time series for everything from temperature

to sea levels. Other areas where time series are used are the financial industry

(e.g. historical analysis of stock performance), the DevOps industry (e.g. cap-

ture of metrics from a server fleet) and the analytics industry (e.g. tracking ad

performance over time).

Time Series Databases (TSDBs) are the fastest growing type of databases. When

selecting a TSDB, performance is one of the main considerations. Comparing

database performance is done using benchmarks, and for TSDBs, a number of

benchmarks already exist. However, these all use either random data or synthetic

data. Moreover, TSDBs have a wide range of applications, and representative

synthesized data is only valid for one domain. Thus, the data used for benchmarks is

either non-representative, or only representative for one use case or industry. Can

the results of performance tests with random or generated data be generalized to

the real world?

In this dissertation, we will first analyze existing TSDB benchmarks. Then, properties

of time series data sets are analyzed. Finally, a new benchmark is proposed, which

compares representative workloads to non-representative workloads.


Chapter 2

Literature review

2.1 Databases

A database is a set of data, organized in a form that makes it easy to process.

2.1.1 Database Management Systems

A Database Management System (DBMS) is an application for management of

databases. Apart from the creation and deletion of databases, a DBMS allows

create, read, update and delete (CRUD) operations on these databases.

A database is the data itself and how it is organized. The term “database” is often

used instead of “DBMS”. In this dissertation, the two are used interchangeably.

2.1.2 Relational databases

Edgar Codd introduced the relational model in 1970 [1]. Relational databases use this model to store data: records are represented by rows, attributes of the data are organized in columns, and the data itself is stored in tables. A relational DBMS

(RDBMS) will most often use Structured Query Language (SQL) for data retrieval

and manipulation.


2.1.3 Non-relational databases

As applications began to scale, companies started moving away from traditional

RDBMSs for the following reasons [2]:

• In traditional DBMSs, the focus on correctness leads to degraded perfor-

mance.

• The relational model was thought not to be the best way to store data.

• The DBMSs were often used as simple data stores. A full-blown DBMS was

overkill for such use cases.

These factors caused a move to so-called “NoSQL” databases. The term used to

refer to databases that do away with the relational structure of RDBMSs, but has

taken on the meaning of “Not only SQL” [3]. Cattell [4] identifies six key features

of NoSQL DBMSs:

1. Horizontal scalability

2. Replication and partition of data over many machines

3. Simple interface (relative to SQL)

4. Weaker concurrency model (compared to ACID nature of relational DBMSs)

5. Distributed indexes used for data storage

6. Able to add new attributes to existing data

NoSQL databases generally relax the correctness guarantees found in relational databases. For example, transactions may not be available in NoSQL DBMSs, or writes may take a while to propagate and show up in reads.


2.1.4 NewSQL databases

NewSQL databases try to bridge RDBMS and NoSQL DBMS differences by bring-

ing relational semantics to NoSQL DBMSs [3]. The aim is to have the best of both

worlds: the relational model of RDBMSs and the scalability and fault tolerance of

NoSQL DBMSs.

2.1.5 Time series databases

Time series databases (TSDBs) are databases optimised for storing time series.

Time series are represented in these databases as data points with a value, a

timestamp, and metadata, such as a metric name, tags, and geospatial information.
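
As an illustration of such a data point, a sketch using InfluxDB 1.x and its HTTP write endpoint is given below; the measurement name, tag values, timestamp and database name are invented for the example.

    import requests

    # One data point: a metric (measurement) name, two tags, a value and a
    # timestamp in nanoseconds, encoded in InfluxDB's line protocol and sent to
    # the HTTP write endpoint of a local InfluxDB 1.x instance.
    line = "temperature,city=Ghent,sensor=s42 value=21.5 1559980800000000000"
    requests.post("http://localhost:8086/write", params={"db": "example"}, data=line)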

Time series databases can be relational (e.g. Timescale, a NewSQL DBMS) or

non-relational (e.g. InfluxDB, a NoSQL DBMS) databases.

Bader et al. [5] identified 75 TSDBs, of which 42 are open source and 33 are

proprietary.

2.2 Time series database benchmarks

There are a number of existing benchmarks tailored to TSDBs. This is a recent

development: most of these benchmarks were developed less than three years ago.

2.2.1 TS-Benchmark

TS-Benchmark is a benchmark specifically developed for TSDBs by Chen at the

Renmin University of China in December 2018. A new benchmark was modelled

based on a wind farm scenario: sensor data are appended and queried [6].

Databases tested in this benchmark are InfluxDB, IotDB, TimescaleDB, Druid,

and OpenTSDB. The benchmark is written in Java and uses no external depen-

dencies or frameworks.


Apart from a presentation, not much information is available on TS-Benchmark.

Metrics measured by TS-Benchmark:

• Load performance: The ingestion speed of the TSDB, which is measured in points loaded per second

• Throughput performance: new data points appended to an existing time

series (measured in points appended per second)

• Query performance: For both simple aggregation queries and time range queries, reads are performed and two measurements are made: requests per second and average response time.

• Stress test: Two stress tests are performed. In the first, data points are appended while a constant number of queries are run (performance measured in points appended per second). In the second, queries are run while a

constant number of data points are appended (performance measured in

requests per second and average response time).

Load performance is different from throughput performance. The former measures

the importing of a big data set into the database, while the latter measures ap-

pending points in real-time. It is unclear if the benchmark uses special facilities

to test load performance (e.g. bulk or batch functionality from the TSDB) or if

importing is needed to test read queries.

2.2.2 IoTDB-benchmark

In a preprint paper on arXiv, Liu and Yuan describe IoTDB-Benchmark [7]. The

features that set this benchmark apart from basic benchmarks are generation of

out-of-order data, measurement of system resources, next to database performance

metrics, and simulation of real-world conditions by running heterogeneous queries

concurrently. IotDB-benchmark is written in Java.

IotDB-benchmark has 10 types of queries, ranging from “latest data point” to

“time range query with value filter”. InfluxDB, OpenTSDB, KairosDB, and


TimescaleDB are targeted by IoTDB-benchmark. The benchmark also supports

Cloud Time Series Database (CTSDB), a TSDB created by Tencent Cloud1, but

this is not mentioned in the paper.

Metrics measured by IotDB-benchmark:

• Query latency: Statistical metrics, such as average, maximum, 95th per-

centile, etc. are calculated on the time the ten supported query types take.

• Throughput performance: Data points appended to an existing time

series, measured in points appended per second.

• Space consumption: The used disk space is measured.

• System resources: System resources, such as CPU time, network, memory and I/O usage are measured.

2.2.3 TSDBBench

TSDBBench was created by Bader as part of his dissertation in 2016. It extends

the Yahoo! Cloud Serving Benchmark (YCSB) for use with time series databases in

a project called YCSB-TS. TSDBBench includes YCSB-TS, the benchmark itself,

and Overlord, a provisioning system written in Python that sets up databases to

test [5].

In practice, the benchmark seems unmaintained. The documentation is out of

date, necessary files are hosted on a defunct domain, and the database versions

tested are several years old.

Ten types of queries are supported, such as “insert”, “update”, “scan” and “sum”.

TSDBBench supports eighteen databases, which is the most of any TSDB bench-

mark.

Metrics measured by TSDBBench:

1Not much documentation on CTSDB is available, and all of it is in Chinese.


• Query latency: Statistical metrics, such as average, maximum, 95th per-

centile, etc., are calculated on the time the ten supported query types take.

• Space consumption: The used disk space is measured.

2.2.4 FinTime

FinTime was developed in 1999. It is not written in a specific language: FinTime

is merely a description of a benchmark. The benchmark describes two models,

including data model, queries, and operational characteristics [8]. They contain

nine queries run by five clients at once, and six queries run by fifty clients at once,

respectively.

Metrics measured by FinTime:

• Query latency (defined as “Response Time Metric”): The geometric mean

of query latencies.

• Throughput Metric: Average time that a complete set of queries take.

Every set (nine queries for the first model, six for the second) represents a

user.

• Cost metric: Defined as (R × T) / TC, where R is the response time metric, T is the throughput metric, and TC is the total cost of the system in USD. This metric provides insight in the cost-effectiveness of a system.

2.2.5 influxdb-comparisons

The project influxdb-comparisons is created by InfluxData, the company that

develops InfluxDB. It compares the InfluxDB TSDB to other databases. The

project is written in Go and was started in 2016.

At this moment, the benchmark supports InfluxDB, Elasticsearch, Cassandra,

MongoDB and OpenTSDB.

Metrics measured by influxdb-comparisons:


• Space consumption: After batch loading data, disk usage is measured.

• Load performance: Measured in time taken to load the data and average

ingestion rate.

• Query latency: Measured as queries per second.

2.2.6 STAC-M3

STAC-M3 is a closed-source benchmark that measures performance of TSDB

stacks, focused on high-speed applications. The publications, specification, and ap-

plication itself are only accessible to Securities Technology Analysis Center (STAC)

members.

At the moment, only results for the kdb+ database have been published publicly.

The following metrics are measured:

• Storage efficiency: The size of the original data set divided by the size of

the database.

• Mean and maximum response times for a variety of scenarios. For most

scenarios, minimum and median response times are also reported, as well as

the standard deviation.

2.3 Data sets

To study and create benchmarks for TSDBs, it is important to understand the

fields where time series are recorded and analyzed. Six existing repositories of

time series data sets were discovered.

Dau et al. maintain a repository of 128 time series data sets for data mining

and machine learning purposes [9]. The data sets range from electricity usage to

accelerometer data of performed gestures. Every data set is cleaned and docu-

mented.


The Center for Machine Learning and Intelligent Systems at the University of

California maintains a database of data sets for use with machine learning [10].

Ninety-two time series data sets are currently in their repository, with domains

ranging from stress detection and retail to electricity consumption and parking

occupancy rates.

Hyndman created the Time Series Data Library (TSDL), which contains about

eight hundred time series data sets [11]. TSDL spans many domains, from hydrol-

ogy and finance to crime and physics.

A “data catalog start-up” called data.world currently has thirty-four time series

data sets in its repository [12]. The data sets are mostly governmental statistics,

such as crime data and pollution indexes.

On Kaggle, 238 data sets show up when searching for time series databases. These

data sets are contributed by different authors.

Leskovec and Krevl maintain the Stanford Network Analysis Project (SNAP) data

sets [13]. These data sets are often graphs, but the online reviews and online

communities data sets contain time series data.


Chapter 3

State of the art

In this chapter, the various uses of time series databases will be examined. Then,

existing benchmarks are evaluated, and gaps in the state of the art are examined.

3.1 Uses of time series databases

3.1.1 TSDB usage as a data store

Some use cases do not exploit the full potential of time series databases; they

merely use a time series database as a data store for time-coupled data. While

the data could be stored in another data store, using a time series database offers

clear advantages:

• Compression: Since time series data arrives mostly in-order, high compression ratios can be achieved efficiently with delta coding (a minimal sketch follows this list), or more advanced compression algorithms, such as SPRINTZ [14].

• Scalability: Most modern time series databases come with scalability built-

in, removing the need to worry about data migration when applications

become bigger or more data-intensive.

• Usage of inherent time functions when needed: Even if an application

makes no use of time series functions, they could do so at a later time, without


the need for data migration. This also holds true for arbitrary queries: when

engineers want to run time-based arbitrary queries, they can do so without

data transformation or migration.
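
A minimal sketch of delta coding on timestamps, assuming a regular one-minute series (the values are invented for illustration):

    from itertools import accumulate

    # In-order arrival means consecutive timestamp deltas are small and, for a
    # regular series, constant -- which is what makes them compress well.
    timestamps = [1559980800, 1559980860, 1559980920, 1559980980]  # one point per minute
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    # deltas == [1559980800, 60, 60, 60]
    assert list(accumulate(deltas)) == timestamps  # decoding restores the original series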

Anomaly detection, forecasting and prediction are examples that usually use the

time series database as a data store: a separate application provides the processing.

3.1.2 Inherent time series database functions used

Most TSDBs are not simple data stores, but provide specialised functions to handle

times series analysis and aggregation. Bader et al. [5] describe the following time

series database capabilities:

• INS: Insertion of a single data point

• UPDATE: Update of one or more data points with a certain timestamp

• READ: Retrieval of one or more data points with a certain timestamp

• SCAN: Retrieval of rows in a timestamp range

• AVG: Calculates the average value in a time range

• SUM: Calculates the sum of values in a time range

• CNT: Counts the number of data points with a certain timestamp

• DEL: Deletes data points with a certain timestamp

• MAX: Calculates the maximum value in a time range

• MIN: Calculates the minimum value in a time range

Functions that calculate a value, such as SUM, can be aggregated in time peri-

ods. Time series databases provide first-class support for queries like “average of

temperature grouped in blocks of 7 minutes” and “highest CPU usage for every

hour”.
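
As a sketch of such a query (database and measurement names are invented for this example), the "average of temperature grouped in blocks of 7 minutes" could be expressed in InfluxQL and sent to InfluxDB's HTTP query endpoint:

    import requests

    # Mean of the "value" field over the last day, aggregated in 7-minute buckets.
    query = 'SELECT MEAN("value") FROM "temperature" WHERE time >= now() - 1d GROUP BY time(7m)'
    response = requests.get("http://localhost:8086/query",
                            params={"db": "example", "q": query})
    print(response.json())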


Visualisation is an example that relies heavily on these features. To provide users

with flexible visualisation options, the database needs to support, or at least facil-

itate, the above functions.

3.1.3 Common characteristics of time series data

While time series are used in different industries for a variety of use cases, in

general, time series data have the following characteristics:

• In-order data arrival: Data will, with rare exceptions, arrive with ascend-

ing time stamps.

• Updates are non-existent: Changing data points are rare and not part

of normal operations.

• Deletion is rare: It is uncommon for individual data points to be deleted,

but it may be common to retire a large amount of data points at a time, for

example, when data points are being retired as part of a retention policy.

• TSDB-specific functions may be heavily used, depending on the ap-

plication.

• Data values follow a pattern: There might be trends, cycles, seasonal and

non-seasonal cycles. It’s rare for time series data to be completely random.

3.1.4 Differing characteristics of time series data

While time series data have general characteristics, series may diverge on the

following properties:

• Regularity: In regular time series, data points are spaced evenly in time.

Irregular time series do not emit data points regularly. Irregular time series

are often the result of event triggers.

• Volume: High volume time series may emit hundreds of thousands of data

points a seconds, while low volume time series only emit one event a day.


• Data type: Traditionally, values of data points in a time series have been

integers or floating point numbers. But they can also be booleans, strings or

even custom data types.

• Tags: A time series data point may have one or more tags associated with

the timestamp and value. There may be no tags or a lot of tags. Tags may

hold special values, such as geospatial information.

• Tag value cardinality: The number of possible combinations the tag values make. Three tags with two possible values each make a tag value cardinality of eight (see the sketch after this list).

• Variation: While time series data usually follow a pattern, the variation

in a series may be very different. One series may describe a flat line, while

another may describe seasonal variations with daily spikes.
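
A minimal sketch of the tag value cardinality computation; the tag names and values are invented for illustration:

    from math import prod

    # Tag value cardinality: the number of distinct combinations the tag values
    # can form, i.e. the product of the number of possible values per tag.
    tag_values = {
        "city":   {"Ghent", "Antwerp"},
        "sensor": {"s1", "s2"},
        "floor":  {"0", "1"},
    }
    cardinality = prod(len(values) for values in tag_values.values())
    print(cardinality)  # three tags with two possible values each -> 2 * 2 * 2 = 8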

3.1.5 Industry use cases

Internet of Things and sensor data

The Internet of Things revolution has made it possible to connect devices to the

internet that were previously only available as offline systems. These devices can

be split up in two categories: actuators, to which commands can be sent to perform

an action, and sensors, which sense the current environment and translate physical

quantities into digital values.

The values sent from these sensors and the usual analyses performed upon them

are a natural fit for time series databases. Every data point generated by a sensor

is associated with a timestamp (the time at which it was produced). The frequency

of data generation depends on the application domain, common intervals are every

minute, every ten minutes and every hour.

Common operations on sensor data include getting the most recent data points,

averaging data points over time intervals and flexible visualisation. IoT data sets

are usually regular, low volume for small numbers of sensors, and often make use

of geospatial tags.


Financial

Time series have long been a subject of study in financial disciplines. Stock in-

formation, exchange rates and portfolio valuations can all be represented as time

series, thus a time series database is a logical choice to store financial data points.

For example, kdb+, a time series database developed by Kx Systems, is often used

in high-frequency trading. kdb+ also explicitly presents other financial use cases,

such as algorithmic trading, forex trading, and regulatory management.

Financial time series are regular, but differ greatly in volume. Data points may be produced anywhere from once a day (e.g. stock closing prices) to every few milliseconds (e.g. high-frequency trading).

DevOps and machine monitoring applications

In the operations and DevOps industries, TSDBs are used extensively to monitor

computer systems and software applications. Common metrics include processor

load, memory usage and application response times. Metrics are usually aggre-

gated on the device they are collected from in one minute intervals before being

sent to a metrics collector.

The collected data are used for manual analysis (e.g. “What is the slowest com-

ponent in our stack?”), alerting (e.g. sending an alert when the average load is

above 90% for more than 5 minutes) and automatic anomaly detection.

Software monitoring and DevOps use cases produce regular time series that are

low volume for small amounts of machines and applications.

Asset tracking

Apart from software applications, time series databases are also often used to

monitor physical systems. Most time series databases include support for storing

and querying spatial data. This way, it is possible to associate location data with data points.


Use cases include asset tracking (e.g. storing current location of vehicles at a point

in time) and geographical filtering (e.g. average of temperature for sensors within

a range).

Asset tracking use cases produce data points with geospatial information. Time

series produced can be regular (e.g. location is sent every minute), but is of-

ten irregular. Since asset tracking use cases involve tracking entities in a large

geographical area or in rough terrain, connectivity may be limited. This means

accurately determining position and transmitting that position may be impacted,

resulting in irregular time series.

Analytics

In analytics, time series may be used to monitor website visits, advertisement

clicks, or E-commerce orders.

Time series are used to track key performance indicators (KPIs) and infrastruc-

ture costs at Houghton Mifflin Harcourt [15]. KPIs can give an insight in the

performance of the business.

These use cases produce irregular time series, since they are based on events.

The volume may depend on various factors, such as the time (e.g. orders on a

Wednesday night compared to orders on Black Friday), the weather (e.g. umbrellas

sold in a convenience store), or other arbitrary factors (e.g. number of cars per

hour on a day with a train strike).

Physics experiment tracking

Time series databases have been used in physics experiments to capture and pro-

cess high volume data streams. For example, at CERN, the time series database

InfluxDB handles writes at a rate of over 700kHz [16].

Other use cases

Other use cases include game bot detection based on time series classification

[17], telecommunications forecasting based on usage pattern prediction and fraud


detection through pattern analysis.

3.2 A “good” benchmark

Chen et al. [18] consolidate the properties of a good benchmark based on previous

research as follows:

• Representative: Benchmarks must simulate real-world conditions; both

the input to a system and the system itself should be representative and

relevant.

• Relevant: Benchmarks must measure relevant metrics and technologies.

Results should be useful to compare widely-used solutions.

• Portable: Benchmarks should provide a fair comparison by being easily

extensible to competing solutions that solve comparable problems.

• Scalable: Benchmarks must be able to measure performance in a wide range

of scale. Not just single-node performance, but also cluster configurations.

• Verifiable: Benchmarks should be repeatable and independently verifiable.

• Simple: Benchmarks must be easily understandable, while making choices

that do not affect performance.

These properties can be used to put existing benchmarks to the test. Relevance

of individual benchmarks will not be evaluated. All of these benchmarks evaluate

time series databases. Since TSDBs are the fastest growing type of database [19],

we consider all benchmarks relevant.

3.3 Existing benchmarks

Here, existing benchmarks for time series databases are examined in more detail

and properties described in Section 3.2 are discussed.


3.3.1 TS-Benchmark

TS-Benchmark is a benchmark simulating a wind plant monitoring system.

• ✓ Representative: TS-Benchmark uses a data model inspired by real world applications. An ARIMA time series model is trained with real-world wind power data [6].

• ✓ Portable: TS-Benchmark targets InfluxDB, IoTDB, TimescaleDB, Druid

and OpenTSDB.

• ✗ Scalable: Only single-node performance of database systems is tested. The benchmark could be extended to perform on multi-node database systems.

• ✓ Verifiable: The source code for TS-Benchmark was published on GitHub.

• ✓ Simple: The benchmark follows a simple five-stage course, in which each

stage performs a single operation or test.

3.3.2 IoTDB-benchmark

In a recent paper, for now only published on ArXiv, Liu et al. describe IoTDB-

benchmark, a benchmark specifically designed for time series databases [7].

• ✗ Representative: The data generator creates square waves, sine waves and

sawtooth waves with optional noise. Furthermore, constant values and ran-

dom values within a range can be generated. Care needs to be taken when

selecting a data generation function: rarely will real-world data follow a per-

fect sine function. This will have an effect on the compaction of data. To

ensure representativeness of data, the “random values within a range” func-

tion is the best approximation. However, depending on the use case, it will

still not be representative of most real-world data, where subsequent data

points may have a relatively low delta compared to other points close in time

instead of a completely random delta.

IoTDB-benchmark allows configuration of many data generation parameters,

such as the data type of fields, number of tags per device, etc.


• ✓ Portable: IoTDB-Benchmark supports IoTDB, InfluxDB, OpenTSDB,

KairosDB, TimescaleDB, and CTSDB. The focus is on IoTDB, and not all

functions are supported in databases other than IoTDB. For example, gen-

eration and insertion of customized time series is only supported for IoTDB

at the moment.

• ✗ Scalable: Only single-node performance of database systems is tested. The

benchmark could be extended to perform on multi-node database systems.

• ✓ Verifiable: The source code for IoTDB-Benchmark was published on GitHub.

• ✓ Simple: The benchmark follows a simple six-stage course, in which each

stage performs a single operation or test.

3.3.3 TSDBBench/YCSB-TS

YCSB-TS, part of the TSDBBench benchmark, is a fork of YCSB that targets

time series databases, since these databases are not supported in YCSB.

• ✗ Representative: YCSB-TS allows configuration of the workload used. Se-

lecting or creating a good workload is critical in ensuring that the benchmark

is representative. The standard workload is artificial and not based on real-

world data.

• ✓ Portable: YCSB-TS supports InfluxDB, KairosDB, Blueflood, Druid, NewTS, OpenTSDB and Rhombus.

• ✓ Scalable: YCSB-TS has support for benchmarking multi-node set-ups.

Tests were performed with single-node set-ups and five-node set-ups[5].

• ✓ Verifiable: The source code for all components of TSDBBench was published on GitHub, along with instructions on how to replicate the benchmark.

• ✗ Simple:


3.3.4 FinTime

FinTime is an older benchmark (it was proposed in 1999), but it still holds value

as a representative benchmark. It mimics financial industry use cases.

• ✓ Representative: FinTime’s two models are based on real-world financial use cases. Namely, it specifies data generation and queries for historical financial market information and a tick database for financial instruments.

• ✓ Portable: FinTime does not prescribe a query language. Implementations

have been created for SQL databases, but SQL is not required.

• ✗ Scalable: The benchmark was performed on single-node database systems, but could be extended to work on multi-node systems.

• ✗ Verifiable: Only the source code for the data generation was published. It is unclear how latency and throughput are measured.

• ✗ Simple: Since FinTime is only a description of a data schema and queries

to be run, it requires manual implementation.

3.3.5 influxdb-comparisons

The influxdb-comparisons project is a benchmark created by InfluxData, vendor

of InfluxDB.

• ✓ Representative: The influxdb-comparisons benchmark simulates a DevOps

use case, where a lot of different hosts send usage statistics (such as CPU

load, disk IO usage, etc.) to a time series database. This is a representative

benchmark for this scenario.

• ✓ Portable: The benchmark currently supports seven different TSDBs.

• ✗ Scalable: Only single-node performance is tested. The benchmark could be extended to perform on multi-node database systems.

• ✓ Verifiable: The source code for influxdb-comparisons is available under

the MIT licence on GitHub.


• ✓ Simple: The benchmark follows a five-stage course, in which each stage

performs a single operation or test.

3.4 Evaluation of existing benchmarks

Table 3.1 shows the compiled evaluation of existing benchmarks.

Benchmark              Representative             Relevant  Portable  Scalable  Verifiable  Simple
TS-Benchmark           For IoT use cases          ✓         ✓         ✗         ✓           ✓
IoTDB-benchmark        ✗                          ✓         ✓         ✗         ✓           ✓
TSDBBench              ✗                          ✓         ✓         ✓         ✓           ✗
FinTime                For financial use cases    ✓         ✓         ✗         ✗           ✗
influxdb-comparisons   For DevOps use cases       ✓         ✓         ✗         ✓           ✓

Table 3.1: Evaluation of existing TSDB benchmarks

3.4.1 On scalability

Scalability is a gap in the current state of the art. Only one benchmark, TS-

DBBench, tests multi-node performance. Testing multi-node set-ups is often

harder due to either long manual or error-prone automated test set-up provision-

ing.

When TSDBs are actually deployed in the real world, multi-node setups are the

norm. Benchmarks should reflect this. Actually supporting multi-node setups in

a benchmark is usually not hard, but configuring, setting up, and comparing these

setups takes a lot of time.

Most benchmarks are able to test multi-node setups, due to the fact that most

distributed TSDBs present a single interface: the client application does not need

to be aware of the clustered nature of the TSDB.


3.4.2 On representativeness

As mentioned in Section 3.2, representativeness means that benchmarks must sim-

ulate real-world conditions, both the input to and the system itself. For the system

itself, this means no configuration tuning that would not be used in real produc-

tion systems, running benchmarks on system configurations that reflect systems

on which production databases would run, etc. For the input to the system, that

means real world data and real world queries, or data and queries comparable

to real world usage of them. Representativeness is important for generalisation pur-

poses: we can not generalize the results of a benchmark to real world usage if the

benchmark is not representative of real world usage.

TS-Benchmark, FinTime and influxdb-comparisons seem to be representative

benchmarks, but this is only true for specific domains. The results of FinTime are

only valid in financial contexts, for influxdb-comparisons only in specific DevOps

contexts. This leads to false generalisations: we can not make conclusions on the

performance of a database as a whole when a benchmark simulating a single use

case is used.

Tay [20] and Zhang et al. [21] have made the case for application-specific bench-

marking: instead of using generic micro-benchmarks, real world data are either

used directly to benchmark a system or used to construct a representative bench-

mark.

Since the use cases of time series databases are broad, it is necessary to develop

benchmarks that test a variety of representative scenarios. At the moment, no

such benchmarks exist.

3.5 Contribution

This dissertation discusses the design, technical implementation and results of a

representative benchmark. It compares three representative workloads to a base-

line. The representative workloads use existing real world time series data sets


and are chosen to simulate environments and use cases in which TSDBs are often

used.

Evaluation of the results of the benchmark will determine if representative bench-

marks are a necessity, or if non-representative benchmarks accurately predict perfor-

mance for representative workloads. If non-representative benchmarks can predict

real world performance, then representative workloads are not needed, which may

lead to simpler benchmarks. If non-representative benchmarks can not accurately

predict real world performance, validity of non-representative benchmarks can be

called into question.


Chapter 4

A new benchmark

In the previous chapter, current benchmarks have been examined, and their insuf-

ficient representativeness has been noted. This may present a problem for general-

isation of their results: do they accurately model real world performance? In this

chapter, a new benchmark will be described, with a focus on representativeness.

This benchmark will be used to test both representative and non-representative

workloads to examine differences in performance.

4.1 Benchmark components

A benchmark consists of multiple separate components. The workload data set

characteristics are the time series data characteristics described in Section 3.1.4.

Workload query characteristics consist of the characteristics of the queries them-

selves, and the spread between query types. Finally, the metrics measurement

component will be considered.

4.1.1 Workload data set characteristics

Apart from the time series data characteristics discussed in Section 3.1.4, time

series data sets can be categorised as synthetic or real world, and as having high

or low existing volume.


Synthetic workload data sets use tunable synthesizers that can generate workload

data sets [22]. These workloads may trade configurability for representativeness,

and care should be taken in their configuration. Real world data will be used as

the workload for this benchmark.

High existing data volumes may influence database performance. For big data

sets, a DBMS may need to scan large amounts of data.

4.1.2 Workload query characteristics

In Section 3.1.2, the functions of TSDBs were defined. These lead to possible

queries, such as reading single data points, averaging data point values within a

time range, and summation of all data point values with a certain tag.

Not only is the type of query important, but also the relative frequency of each query

type compared to all query types. For example, an application may frequently

insert new data, while calculating the maximum data point value is done infre-

quently.

Concurrency may play an important role when benchmarking queries. When mul-

tiple queries are run, performance may degrade, especially when read and write

queries are mixed. In this dissertation, mixed read and write queries are not con-

sidered. Write queries will be considered in an ingestion benchmark, and read

queries will be considered in load testing and latency testing benchmarks.

4.1.3 Measurement characteristics

The last benchmark factor is the measurement component. This component mea-

sures the effective performance of operations performed. The metrics surveyed

may be latencies, network usage, storage requirements, etc. Care must be taken

that the measurement component minimally influences the benchmark results. For

example, an ingestion client could monitor the number of data points per second

sent to the database: this requires no instrumentation on the database server and

thus minimally disturbs it.


4.2 Design of a representative data workload

In Section 3.4.2, it was argued that representativeness is dependent on industry

and use cases. Therefore, as the workload data set for the time series database

benchmark, four different data sets will be considered. These are selected to be in

different domains, with different time series characteristics. To ensure representa-

tiveness, data sets with real data are used. These are selected to model real world

use cases for time series databases.

Of course, four different data sets do not cover every industry or use case. How-

ever, analysis of the results of benchmarks using these workload data sets will

allow comparisons that indicate if the considered use case has an influence on

performance.

4.2.1 A baseline workload

This is a non-representative workload, to be used as a baseline for comparison

with representative workloads. Data points are written to one metric with random

values and random tags.

• Metrics: Only one metric is tracked: “benchmark”. All data points belong

to this metric.

• Regularity: The time series is fully regular, with one data point being

produced every second.

• Volume: Low volume. There is only one metric where a data point is

produced every second. There are no spikes of traffic.

• Data type: For this benchmark, floating point numbers will be used to

represent the data point values.

• Tags: Every data point is tagged with two random tags. The possible values

of the first tag are TAG 1 00 to TAG 1 99 and the possible values of the second

tag are TAG 2 00 to TAG 2 99.


• Tag value cardinality: High. There are 10,000 (two tags with 100 possible

values each) possible tag combinations.

• Variation: High. The values are randomly generated for every data point.

Data point values bear no relationship to previous values. The values are

floating point numbers between 0 and 100 inclusive.
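To make the baseline workload concrete, the following minimal sketch generates such a stream of data points. It assumes nothing beyond the characteristics listed above; the exact tag value format and the dictionary layout are illustrative choices, not part of any existing benchmark code.

    import random
    import time

    def baseline_points(start_ts, count):
        """Baseline workload: one point per second on a single metric,
        a random float value in [0, 100] and two random tags."""
        for i in range(count):
            yield {
                "metric": "benchmark",
                "timestamp": start_ts + i,  # fully regular: one data point every second
                "value": random.uniform(0.0, 100.0),  # high variation, independent of previous values
                "tags": {
                    # two tags with 100 possible values each: 10,000 combinations
                    "tag1": "TAG_1_%02d" % random.randint(0, 99),
                    "tag2": "TAG_2_%02d" % random.randint(0, 99),
                },
            }

    if __name__ == "__main__":
        for point in baseline_points(int(time.time()), 5):
            print(point)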

4.2.2 A financial time series workload

Time series data are often used in financial analysis. Prices of commodities, fu-

tures, assets, and other financial instruments produce time series [23]. This his-

torical data can then be used in performance calculations, price prediction, and

financial ratio calculation.

The data set used for this benchmark was created by Boris Marjanovic and pub-

lished on Kaggle [24]. It is licenced under CC01. The data set contains historical

data for 1344 Exchange-Traded Funds (ETFs) and 7195 stocks. For each stock

and ETF, it lists the open, high, low and closing prices, next to the volume2 and

open interest3 for every day the ETF or stock was trading.

• Metrics: Six different metrics are tracked: the opening, high, low and clos-

ing prices for the stock, and the volume and open interest for the stock.

• Regularity: Semi-regular. Every day, an update is published, except on

weekends and market closings (such as holidays). It is rare for new stocks to

be published or for existing stocks to be removed from the exchange.

• Volume: Low volume, with short bursts. Data are published at market

closing, which is the same time every day. This may lead to short spikes of

high traffic when a lot of stocks are tracked.

• Data type: Prices are represented by numbers with five digits past the

decimal point. Floating point numbers are sometimes not used to store these

1. Creative Commons 1.0 Public Domain Dedication, which dedicates this work to the public domain.
2. The total number of shares traded during a day.
3. The number of outstanding contracts that have not been fulfilled.


prices, due to the possible inaccuracies and high cost of processing floating

point operations. Instead, the prices are multiplied by 10^5 and saved as

integers. This does place a burden on client applications if the database does

not perform this conversion itself, therefore, they will be saved as floating

point numbers for this benchmark.

• Tags: Only a single tag is saved: the ticker symbol. Ticker symbols are

strings, for which no general format is specified: every exchange specifies

its own rules. In general, symbols are short (nine characters is the maximum

length in the data set), alphanumeric (and additionally contain no digits in

this data set) and case-insensitive. As an example, Apple's stock ticker

symbol is AAPL.

• Tag value cardinality: Medium. There are 7,164 possible tag values. For

the first one million data points, the tag cardinality is 143.

• Variation: Low. While stock prices are volatile, it is rare for stocks to have

large changes in the span of a day.

4.2.3 A rating system workload

Rating systems allow customers and consumers to rate their experiences of goods

and services. Users can like or dislike products, leave comments about a restaurant

visit, or leave a rating for sellers on online marketplaces. Commonly, this feedback

is represented as a five-star system, where half a star represents the lowest score,

and five stars represents the maximum score.

GroupLens Research created data sets of varying sizes from the MovieLens website,

which allows users to rate movies with a five-star system [25]. The MovieLens 20M

data set contains twenty million ratings and is the basis for this workload. The

data set comes with a custom license, allowing non-commercial use, but forbidding

redistribution.

• Metrics: Only one metric is tracked: ratings. The value of the data point

is the rating the user gave a movie, and the timestamp is when this rating

was published.


• Regularity: The time series is irregular. The data points are events, pro-

duced when a user leaves a review.

• Volume: Approximately one review was left every thirty seconds. This is

not a high level of activity; we can therefore qualify this time series as low

volume.

• Data type: The ratings are floating point numbers, between 0.5 and 5.0 in

0.5 increments. This leads to ten possible values.

• Tags: Five tags are associated with every data point: userId (integer, the

identifier of the user who left the review), title (string, the title of the

movie being reviewed), imdbId (integer, the identifier of the movie on the

Internet Movie Database4), tmdbId (integer, the identifier of the movie on

The Movie Database5), genres (string, a list of genres the movie belongs to

encoded as a string).

• Tag value cardinality: High. There are 138,493 different users, and 26,212

different movie titles. The rest of the tags are dependent on the movie title

(the title directly implies the genre and external identifiers). Since not every

user has rated every movie, the tag cardinality is not the product of

these two figures. The tag cardinality of the complete data set was deter-

mined to be 20,000,262 and the tag cardinality of the first one million data

points was determined to be 1,000,000.

• Variation: Subsequent points do not relate to each other, since they are

ordered by timestamp and not the movie reviewed. This leads to a high

variation. However, the absolute variation is still small, since the maximum

absolute variation is 4.5.

4.2.4 An IoT workload

IoT applications, in particular sensor applications, produce a lot of data. This can

be temperature data, power consumption, location data, etc. IoT data are almost

4. https://www.imdb.com/
5. https://www.themoviedb.org/


always temporally indexed, thus a time series database is a natural fit.

The UCI6 Machine Learning Repository [10] contains the Individual household

electric power consumption Data Set, a data set which records power information

for a house every minute. It was created by Georges Hebrail and Alice Berard and

released under the CC BY 4.07 license.

• Metrics: Seven metrics are tracked for the household: active and reac-

tive power, voltage, intensity (current), and three power meters for different

rooms.

• Regularity: The data set is regular. Every minute, a new data point is

emitted. Data are missing for a small period of time, and for these missing

data points, the values were filled in with zeroes.

• Volume: Only seven data points are emitted every minute. This makes the

data set low volume.

• Tags: The data contains no tags.

• Variation: Variation between subsequent data point values is low due to

the small sampling interval.

4.2.5 Workload data set overview

Table 4.1 shows an overview of all used workload data sets.

4.2.6 Data set pre-processing

The data sets were pre-processed using Python. All data sets are denormalized

so as to provide one data point per line in the resulting file. Every line provides a

complete data point, including the timestamp, metric name, data point value, and

(potentially) tags.
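As an illustration of this denormalisation step, the sketch below turns a wide per-stock row from the financial data set into one self-contained data point per output line. The input file name, column names and output layout are assumptions made for the purpose of the example; they are not the exact pre-processing scripts used.

    import csv

    # One metric per price/volume column of the source file (Section 4.2.2)
    METRICS = ["Open", "High", "Low", "Close", "Volume", "OpenInt"]

    def denormalise(in_path, ticker, out_path):
        """Write one complete data point per line: timestamp, metric name,
        value, and the ticker symbol as the single tag."""
        with open(in_path, newline="") as src, open(out_path, "a", newline="") as dst:
            reader = csv.DictReader(src)  # assumed columns: Date, Open, High, Low, Close, Volume, OpenInt
            writer = csv.writer(dst)
            for row in reader:
                for metric in METRICS:
                    if row.get(metric, "") != "":
                        writer.writerow([row["Date"], metric.lower(), row[metric], "symbol=" + ticker])

    if __name__ == "__main__":
        denormalise("aapl.us.txt", "AAPL", "financial_denormalised.csv")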

6. University of California, Irvine
7. Creative Commons Attribution 4.0 International


                        Baseline      Financial     Rating        IoT
Metrics                 1             6             1             7
Regularity              Regular       Semi-regular  Irregular     Regular
Volume                  Low volume    Low volume    Low volume    Low volume
Tags                    2             1             5             0
Tag value cardinality   10,000        7,164         20,000,262    0
Variation               High          Low           High          Low
Total data points       20,000,000    74,418,459    20,000,262    14,526,812
License                 N/A           CC0           Custom        CC BY 4.0

Table 4.1: Overview of workload data sets
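The rows of Table 4.1 can be captured in a small data structure, which makes explicit which characteristics a new workload would need to specify before its results can be compared to these four. The field names below simply follow the table; they are not taken from any existing benchmark implementation.

    from dataclasses import dataclass

    @dataclass
    class WorkloadDataSet:
        """Characterisation of a workload data set, following Table 4.1."""
        name: str
        metrics: int                 # number of distinct metrics
        regularity: str              # "regular", "semi-regular" or "irregular"
        volume: str                  # e.g. "low"
        data_type: str               # type of the data point values
        tags: int                    # number of tags per data point
        tag_value_cardinality: int   # possible tag value combinations
        variation: str               # "low" or "high"
        total_data_points: int

    BASELINE = WorkloadDataSet(
        name="Baseline", metrics=1, regularity="regular", volume="low",
        data_type="float", tags=2, tag_value_cardinality=10_000,
        variation="high", total_data_points=20_000_000,
    )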

4.3 Design of a representative query workload

A representative data set is only part of the workload. Representative queries on

these data sets are the other. While real world data sets are readily available, in-

formation on data usage or queries performed on these data sets is not. Therefore,

for every data set, logical queries and patterns will be created. For a truly rep-

resentative query workload, existing TSDB systems should be surveyed and their

usage patterns monitored.

The implementation of query workloads was complicated by the fact that every

database uses a custom query language. These may have different semantics. For

example, when grouping a time range by week, some TSDBs will start the grouping

block on the start timestamp of the given range, while others will align the groups by

the calendar (so the first block may not be a full week). Query results have to

be compared to ensure correctness. A standardized query language, as SQL is for

RDBMS, would speed up development of benchmarks and TSDB applications.

4.3.1 Queries for the baseline workload

The baseline workload is a non-representative workload to which others will be

compared. The query workload reflects this: there is only one query, requesting a

single data point between two timestamps with two specific tags.


4.3.2 Queries for the financial workload

The financial query workload simulates a stock information application which in-

forms stock traders of historical statistics. The following queries are run:

• Get all opening prices for a stock in a time range (relative frequency: 0.20)

• Get the minimum closing price for a stock (relative frequency: 0.25)

• Get the maximum opening price for a stock (relative frequency: 0.15)

• Get the mean high price for a stock grouped by week (relative frequency:

0.25)

• Get the total volume for a stock grouped by four weeks (relative frequency:

0.15)

4.3.3 Queries for the rating workload

The ratings query benchmark simulates the backing database of a movie website.

The queries get the average rating for a movie with a title or IMDb identifier, get

ratings for a particular user and group average ratings for a movie by year:

• Get the mean rating for a movie with a specific title (relative frequency:

0.70)

• Get the mean rating for a movie with a specific IMDb identifier (relative

frequency: 0.10)

• Get all ratings by a specific user (relative frequency: 0.05)

• Get mean rating per year for a movie with a specific title (relative frequency:

0.15)


4.3.4 Queries for the IoT workload

The IoT query workload mimics a power consumption application. Mean active

power is requested for a one-week range, for a two-week range grouped by day,

and for a twelve-week range grouped by week:

• Get mean active power for a one week time range (relative frequency: 0.4)

• Get mean active power for a two week time range grouped by day (relative

frequency: 0.4)

• Get mean active power for a twelve week time range grouped by week (rela-

tive frequency: 0.2)

4.4 Metrics

Ingestion throughput is the number of data points per second that can be

inserted into the database, possibly using a bulk loading mechanism. This metric

is especially important for OLAP applications, where data from a master database

is loaded into a TSDB for time series analytics processing.

Space consumption is the amount of storage required to store the database.

Storage efficiency is space consumption divided by the number of data points

stored. This metric shows how efficient the database engine is at compressing data

points. The measurement is taken after loading the database with a predefined

set of data points, and is expressed in bytes per data point.

Latency, expressed in mean, 95th, and 99th percentile response times, shows how

fast the database can answer queries. For user-facing applications, this is especially

important: applications need to render quickly, or users leave.
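As a sketch of how these figures can be derived from raw per-request timings (the nearest-rank percentile convention used here is one common choice; the measurement tools used in Chapter 5 report their own percentiles):

    import math
    from statistics import mean

    def latency_summary(latencies_ms):
        """Mean, 95th and 99th percentile (nearest-rank method) of raw latencies."""
        ordered = sorted(latencies_ms)

        def percentile(p):
            # nearest-rank: smallest sample such that at least p% of samples are <= it
            rank = math.ceil(p / 100 * len(ordered))
            return ordered[max(rank - 1, 0)]

        return {"mean": mean(ordered), "p95": percentile(95), "p99": percentile(99)}

    if __name__ == "__main__":
        print(latency_summary([1.2, 1.4, 1.3, 2.1, 9.8, 1.5, 1.6, 1.4, 1.3, 50.0]))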

Load testing gives us the maximum number of requests per second a TSDB

can handle.

The mean response size is the average size in bytes of the returned TSDB

response body. This response body may contain, next to the requested data,


metadata, such as the number of data points used in calculation, aggregated tags,

etc. While this information may be useful to some applications, in general a

small TSDB response is preferred. This leads to faster responses, lower

network load and lower memory requirements, though the effects may be small.

4.5 Technical implementation

4.5.1 Test environment

The tests were run on homogeneous machines containing two quad-core Intel E5520

(2.2GHz) CPUs, 12GB RAM and a 160GB hard disk. The devices were connected

via Gigabit Ethernet.

The versions of the databases used are as follows: OpenTSDB 2.3.1 (with HBase

1.4.4), InfluxDB 1.5.4, KairosDB 1.2.1 with either ScyllaDB 3.0.6 or Cassandra

3.11. These databases were minimally changed from their stock configuration. For

InfluxDB, the maximum number of series was increased, for OpenTSDB, chunked

requests were enabled, and for KairosDB (with ScyllaDB) the maximum batch

size was decreased to one hundred for the financial workload.

The databases under test were run in Docker containers (one container each for OpenTSDB

and InfluxDB, two containers for KairosDB with either underlying DBMS in a

docker-compose setup). When not under test, containers were stopped. Only

one container was under test at a given time and no other applications were active

on the database host, apart from basic monitoring software.

During tests, one machine acted as the database host, while the other loaded the

data or performed queries.

4.5.2 Data ingestion

Data loaders from the influxdb-comparisons project [26] were used. These load

the data sets, converted for use with a specific database, into that specific database.

Since no data loader was available for KairosDB, its Telnet API was used.
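For illustration only, a minimal ingestion loop over such a telnet-style line protocol could look like the sketch below. The line format, default port and tag syntax shown here are assumptions modelled on the OpenTSDB-style put command; they should be checked against the KairosDB documentation rather than taken from this sketch.

    import socket

    def send_points(points, host="localhost", port=4242):
        """Push data points over a telnet-style line protocol.

        Assumed line format (to be verified against the target TSDB):
            put <metric> <timestamp_ms> <value> <tag>=<value> ...
        """
        with socket.create_connection((host, port)) as conn:
            for p in points:
                tags = " ".join("%s=%s" % (k, v) for k, v in p["tags"].items())
                line = "put %s %d %s %s\n" % (p["metric"], p["timestamp"], p["value"], tags)
                conn.sendall(line.encode("ascii"))

    if __name__ == "__main__":
        send_points([{"metric": "benchmark", "timestamp": 1560000000000,
                      "value": 42.0, "tags": {"tag1": "TAG_1_01", "tag2": "TAG_2_07"}}])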


4.5.3 Load and latency testing

Vegeta [27], a load testing tool, is used to test latencies. Every second, a data

set-specific number of requests is made to the TSDB. There are twenty queries in

every query workload, and each one is translated to the query language of every

TSDB. The queries are cycled in a round-robin pattern to ensure determinism.
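A sketch of how such a deterministic query list can be derived from the relative frequencies in Section 4.3 (the query names and the twenty-query total are illustrative; the actual per-database query URLs are generated separately):

    import itertools

    def build_query_list(frequencies, total=20):
        """Expand relative query frequencies into a fixed, ordered list of queries,
        so that every benchmark run issues exactly the same sequence."""
        expanded = []
        for query, freq in sorted(frequencies.items()):
            expanded.extend([query] * round(freq * total))
        return expanded

    def round_robin(queries, n):
        """Cycle the query list deterministically for n requests."""
        return list(itertools.islice(itertools.cycle(queries), n))

    if __name__ == "__main__":
        financial = {
            "opening_prices_in_range": 0.20, "min_closing_price": 0.25,
            "max_opening_price": 0.15, "mean_high_by_week": 0.25,
            "total_volume_by_four_weeks": 0.15,
        }
        print(round_robin(build_query_list(financial), 10))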

http_load [28] is used to conduct load testing. The program is configured with

a thirty second timeout, ten parallel requests and a thirty second run time. The

same URLs as the latency testing measurement are loaded and ten requests are

made in parallel. When one finishes, another one starts. Afterwards, the number

of requests per second is reported.
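The measurement itself can be approximated with a small script. The sketch below keeps ten requests in flight for thirty seconds and reports the achieved request rate; the endpoint URL is a placeholder, and the results in Chapter 5 were obtained with http_load itself.

    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def load_test(url, parallel=10, duration_s=30, timeout_s=30):
        """Keep `parallel` requests in flight for `duration_s` seconds and
        return the achieved number of requests per second."""
        deadline = time.monotonic() + duration_s

        def worker():
            done = 0
            while time.monotonic() < deadline:
                with urlopen(url, timeout=timeout_s) as resp:
                    resp.read()
                done += 1
            return done

        with ThreadPoolExecutor(max_workers=parallel) as pool:
            futures = [pool.submit(worker) for _ in range(parallel)]
            completed = sum(f.result() for f in futures)
        return completed / duration_s

    if __name__ == "__main__":
        print(load_test("http://localhost:8086/ping"))  # placeholder endpoint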

4.6 Design evaluation

In Section 3.2, properties of a good benchmark were discussed. Now, these will be

applied to the benchmark described in this chapter.

• Representative: Through the use of multiple use cases, real world, non-

synthetic data sets, and balanced query workloads, this is a very representa-

tive benchmark.

• Relevant: This benchmark evaluates TSDBs. As the fastest growing type

of database [19], this can be considered a relevant benchmark. The met-

rics measured are based on other database and TSDB benchmarks, and are

comparable with them.

• Portable: To add a new database to the benchmark, the following com-

ponents are necessary: a Docker container containing the database, a data

formatter, a data ingestion loader, and a set of queries as HTTP requests.

Most open source databases have existing Docker containers, and creating a

data formatter is a few hours of work. A data ingestion loader is more time-

consuming, but many databases have existing ingestion loaders. The last

component presents a challenge: not every database has an HTTP interface.

For example, TSDBs that rely on an existing RDBMS, such as Timescale


(built on PostgreSQL), do not include an HTTP API. To benchmark this

kind of database, the benchmark would need to be extended to include

other measurement tools. This does make comparison of results harder.

• Scalable: Both the ingestion and querying component of the benchmark

are able to accept a list of different URLs to spread the load. This makes

the ingestion and measuring component of the benchmark scalable. However,

tests were only conducted on single-node TSDB setups. Multi-node database

setups are hard to set up right, and it is especially hard to fairly compare

heterogeneous DBMSs, such as TSDBs.

• Verifiable: The data sets used are available under open licenses (Section

4.2.5), the tools to ingest are available under the MIT licence [26], and the

tool to test latencies and response size is available under the MIT license.

The components to denormalize the data sets, to transform them to specific

database formats, and the database setup components will be made available

as open source when the embargo on this master’s dissertation ends.

• Simple: The benchmark was kept as simple as possible, with distinct parts

doing a single thing. This leads to an architecture where one component can

easily be swapped with another, e.g. the data generator could be switched

with a generator from another benchmark.


Chapter 5

Results

The results of the data ingestion of the data set workloads described in Section

4.2 and the query workload upon those data sets described in Section 4.3 are

presented here. The metrics reported are described in Section 4.4. The results are

analysed to examine possible performance differences between non-representative

and representative workloads.

5.1 Storage efficiency

One million data points were inserted into each TSDB. Afterwards, the database was

shut down, and the size of the data directory of every TSDB was measured. This

includes raw database files and write-ahead logs. As a comparison, the storage

efficiency of CSV files is included. Figure 5.1 shows the results graphically.
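A sketch of how this metric is computed: after shutting the database down, the size of its data directory is summed and divided by the number of ingested points (the directory path below is only an example).

    import os

    def dir_size_bytes(path):
        """Total size in bytes of all files below `path`."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    def storage_efficiency(path, n_points):
        """On-disk bytes used per ingested data point."""
        return dir_size_bytes(path) / n_points

    if __name__ == "__main__":
        print(storage_efficiency("/var/lib/influxdb", 1_000_000))  # example data directory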

InfluxDB performs nearly as well as CSV for the baseline data set workload, but

is much less efficient for the representative data set workloads.

OpenTSDB and KairosDB require at least one tag to be present on the data points.

Therefore, the tag notags with the string value "true" was added on the IoT data

set, which contains no tags otherwise. This may influence storage consumption,

but both TSDBs perform very well for this data set workload nonetheless.

OpenTSDB outperforms every other TSDB for all representative data set work-


[Figure 5.1 is a bar chart comparing InfluxDB, OpenTSDB, KairosDB-Cassandra, KairosDB-ScyllaDB and CSV across the Baseline, IoT, Financial and Ratings data sets.]

Figure 5.1: Storage efficiency of different TSDBs in bytes per data point. Data points contain a timestamp, a value, and may contain tags, depending on the data set.

[Figure 5.2 is a bar chart comparing InfluxDB, OpenTSDB, KairosDB-Cassandra, KairosDB-ScyllaDB and CSV across the same data sets; the underlying values are listed in Table A.2.]

Figure 5.2: Relative storage efficiency of different TSDBs per data point compared to the CSV source format.


loads. It shows exceptional performance for the IoT dataset, where it is able to

store data nearly four times as efficiently as the CSV input data set. This is likely

a result of the low tag value cardinality: there is only one tag and one tag value1.

KairosDB (with Cassandra) performs well for the IoT data set workload, but does

not do better than the CSV source data set. It always uses at least twice as much

storage space as the source data set.

KairosDB (with ScyllaDB) was unable to complete data loading for the ratings

data set. For the other data sets, it used exactly 1074.73 bytes per data point

to store all data, regardless of the data size. To ensure these three measurements

were correct, they were repeated, and the same values were found. The fact that

the persisted data size is so large is remarkable, since ScyllaDB uses the same

storage format as Cassandra [29].

When examining the storage efficiency relative to CSV (graphically

displayed in Figure 5.2), the impact of high tag value cardinality becomes clear.

Tag value cardinality is the number of possible combinations tag values can make.

InfluxDB in particular requires relatively more storage space for data with higher tag

value cardinality in the representative data set workloads. Other TSDBs display no

such dependency on tag value cardinality. Variation may also play an important

role. The representative workloads have lower data point value variation than

the baseline, especially the IoT data set. This may enable OpenTSDB to more

efficiently store the time series.

It is clear that representative data set workloads allow us to see patterns not uncov-

ered by a traditional data set workload. A non-representative benchmark might

appoint InfluxDB the winner of a storage efficiency test, while it is clear that, on

the given representative domains, OpenTSDB has much better storage efficiency.

1. The original data set does not contain any tags, but since OpenTSDB requires at least one tag for data points, the tag notags with the string value "true" was used.


5.2 Data ingestion throughput

The data ingestion throughput or data ingestion rate is the number of data points

a TSDB can ingest per second in a bulk loading pattern. Ingestion rate tests were

performed with data sets with one million data points and the results are shown

in Figure 5.3.

InfluxDB outperforms the other TSDBs in all data set workload ingestion tests except

for the ratings data set workload,

where its ingestion is seven times slower than KairosDB (with Cassandra) and

nearly five times slower than OpenTSDB. This is likely due to the high series

cardinality.

OpenTSDB performs better than KairosDB, but is still significantly slower than

InfluxDB. For the non-representative baseline data set workload, OpenTSDB is

nearly five times slower than InfluxDB. For representative data set workloads, this

gap shrinks. InfluxDB is just over twice as fast as OpenTSDB for the IoT and

financial data set workloads. For the ratings data set workload ingestion test,

OpenTSDB is nearly five times as fast as InfluxDB.

KairosDB (with ScyllaDB) was unable to complete ingestion of the ratings data set work-

load. The ingestion speed was 33,340 data points per second, but since not all

data points were successfully saved, this result is excluded.

The differences between KairosDB with Cassandra and KairosDB with ScyllaDB

are not huge, but ScyllaDB consistently outperforms Cassandra. For the baseline data

set workload, ScyllaDB performs 8.10% better, for the IoT and financial workloads

5.55% and 12.95% respectively.

For the IoT and financial data set workloads, relative performance is comparable

to the baseline. InfluxDB comes in first, OpenTSDB second, followed by KairosDB

with ScyllaDB and KairosDB with Cassandra, respectively. However, for the rat-

ings data set workload, we see a different pattern. Here, InfluxDB has the slowest

ingestion speed, and KairosDB with Cassandra the highest, with OpenTSDB in the

middle. The reason for this is unclear. High tag value cardinality has been known


to slow down InfluxDB performance through high memory usage, but InfluxDB

performed well on the baseline, which also has high tag value cardinality. This

performance degradation may be caused by the large size of the data points and the large

number of tags.

The use of real world, representative data sets revealed a performance degradation

of InfluxDB compared to the other TSDBs for the ratings data set.

5.3 Load testing with query workload

The maximum number of queries per second was determined for every TSDB-data

set tuple. The results are shown in Figure 5.4.

InfluxDB significantly outperforms all other TSDBs for every query workload. In

the non-representative query workload, it outperforms the next runner-up

(OpenTSDB) by a factor of 18. In the representative query workloads, this factor

is different. For the IoT query workload, InfluxDB performs 8.5 times better than

OpenTSDB, for the financial query workload 15 times better, and for the ratings

query workload nearly 37 times better. Clearly, the query workload has a big

impact on performance.

KairosDB with Cassandra was not able to complete the ratings workload due

to memory constraints. KairosDB with ScyllaDB was not able to complete this

query workload because not all the data could be loaded (see Section 5.2). For

the other query workloads, ScyllaDB outperforms Cassandra every time. In the

baseline query workload, it outperforms by just over 20%. For the representative

IoT query workload, it achieves 36.49% more requests per second, and for the

financial query workload a 22% improvement.

It is remarkable how KairosDB performs much better on the representative work-

loads. The baseline workload requests just one data point, a very simple query

which is easily cacheable, and yet, KairosDB performs two times better on the more

representative, but much more complex IoT and financial benchmarks. There re-

ally is no clear explanation why KairosDB would perform so much worse for a


[Figure 5.3 is a bar chart comparing InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB across the Baseline, IoT, Financial and Ratings data sets; the underlying values are listed in Table A.1.]

Figure 5.3: Data points ingested per second. Data sets used were one million data points each.

much simpler workload. If anything, KairosDB would be expected to perform a lot faster on

the baseline than on the IoT workload, since the baseline only requests a single data point (which

can be cached) and requires no aggregation or calculations.

OpenTSDB performance is good in the baseline and the IoT query workload, but

is degraded in the financial and ratings query workload. This may have to do

with the fact that the data ranges to scan are much bigger in these last two query

workloads, while the first two only require data from relatively narrow time ranges.

5.4 Response latency

The mean latency, shown graphically in Figure 5.5, is the mean time it takes to

receive a response from the TSDB. The 95th percentile response time is displayed

graphically in Figure 5.6. This metric displays what the maximum latency for 95%

of requests is. One in twenty requests will have a longer latency than this.


[Figure 5.4 is a bar chart comparing InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB across the four query workloads; the underlying values are listed in Table A.3.]

Figure 5.4: Maximum requests per second. Tests were performed on data sets one million data points in size.

The tests were performed with a constant rate of requests. This rate was de-

termined by choosing the lowest maximum requests per second for every query

workload. Empirically, the request rate was increased until timeouts were ob-

served; this request rate was then rounded down. It was found that, when the number

of parallel requests was increased, some TSDBs were able to handle more requests per

second than the load testing showed, with little increase in latency.

Ultimately, the rates used were rounded to 10 requests per second for the

baseline query workload, 20 requests per second for the IoT query workload, 30

requests per second for the financial query workload, and 2 requests per second

for the ratings query workload.

KairosDB with Cassandra and KairosDB with ScyllaDB were not able to complete

the ratings workload due to memory constraints and because not

all data could be loaded (see Section 5.3).

InfluxDB is the clear winner when it comes to latency. The TSDB is able to

handle requests and send a response in less than 2ms for the baseline, and queries


for the complex ratings query workload take on average just over 100ms. InfluxDB

outperforms all other TSDBs tested when it comes to latency, both mean latency

and 95th percentile.

OpenTSDB shows good performance for the baseline and IoT query workloads, but

like the load testing, has trouble with the financial and ratings query workloads. As

mentioned in Section 5.3, this may have to do with the big time ranges the TSDB

has to scan to aggregate data points. The latencies for the last two workloads are

high: the average latency is over two and a half seconds.

KairosDB with ScyllaDB shows greater performance than KairosDB with Cassan-

dra for every query workload. For the first two workloads, it performs nearly twice

as fast when comparing mean latency. For the financial workload, the difference

(ScyllaDB 4.63% faster) is small.

5.5 Mean response size

In Figure 5.7, the mean response size is shown graphically. This mean is clearly

coupled to the data set. Overall, InfluxDB has the most verbose responses.

After inspecting a few responses, the main reason for this seems to be that

InfluxDB encodes timestamps as strings in responses, while KairosDB

uses numbers, and OpenTSDB uses numbers encoded as strings. Compare these

encodings:

• KairosDB encodes the time as 1189641600000, representing the number of

milliseconds since January 1, 1970. This takes 14 bytes to encode in JSON.

• InfluxDB encodes the time as "2007-09-13T00:00:00Z", which takes 23

bytes to encode. However, this format is able to express more precision, down to

milliseconds and even nanoseconds.

• OpenTSDB encodes the time as "1189641600", representing the number of

seconds since January 1, 1970, as a string. This takes 13 bytes to encode in

JSON, but is not as precise as the other encodings.
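The three representations describe the same instant and are straightforward to convert between, as the sketch below shows (the JSON fragments it prints are illustrative, not copied from actual TSDB responses):

    import json
    from datetime import datetime, timezone

    epoch_ms = 1189641600000  # the instant used in the comparison above

    as_kairosdb = epoch_ms                               # JSON number: milliseconds since the epoch
    as_opentsdb = str(epoch_ms // 1000)                  # string of seconds since the epoch
    as_influxdb = datetime.fromtimestamp(
        epoch_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")  # RFC 3339 string

    for label, value in [("KairosDB", as_kairosdb),
                         ("OpenTSDB", as_opentsdb),
                         ("InfluxDB", as_influxdb)]:
        print(label, json.dumps(value))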


[Figure 5.5 is a bar chart of latency in milliseconds for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB across the four query workloads; the underlying values are listed in Table A.4.]

Figure 5.5: Mean latency per request.

[Figure 5.6 is a bar chart of latency in milliseconds for the same TSDBs and query workloads; the underlying values are listed in Table A.5.]

Figure 5.6: 95th percentile of latency per request.


Other factors influence the response size. For example, OpenTSDB and KairosDB

will return a list of tags used on data points. For large responses, such as for

the representative query workloads, the timestamp encoding is the deciding factor.

Both KairosDB TSDBs experienced timeouts for the baseline data set. For

KairosDB on Cassandra, 27 timeouts were encountered, and KairosDB on Scyl-

laDB encountered 4 timeouts. These were ignored when calculating the mean

response size.

KairosDB on Cassandra and on ScyllaDB both return the same number of bytes for

the IoT and financial workload since the underlying databases are interchangeable.

Given the same data, KairosDB should deliver the same response, and this result

gives confidence that it does2.

5.6 Evaluation

When comparing storage efficiency (Section 5.1), representative data sets showed

that storage efficiency varies heavily between use cases, and so does relative stor-

age efficiency. It showed that the results of the non-representative benchmark can

not be generalised to relative storage efficiency in representative workloads. Tag

value cardinality and data point value variation were identified as possible param-

eters that have a high impact on storage efficiency. Real world data usually has

low variation, while non-representative benchmarks often use random values (high

variation). These non-representative benchmarks may become more representative

of real world use cases through the use of random walks instead, which have lower

variation and more closely model real world data.
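The difference is easy to illustrate: instead of drawing every value independently, a random walk derives each value from the previous one, which keeps the point-to-point variation low. The step size and bounds below are arbitrary illustrative choices.

    import random

    def random_values(n, low=0.0, high=100.0):
        """Non-representative: every value is drawn independently (high variation)."""
        return [random.uniform(low, high) for _ in range(n)]

    def random_walk(n, start=50.0, step=1.0, low=0.0, high=100.0):
        """More representative: each value is a small step away from the previous one
        (low variation), loosely mimicking sensor readings or prices."""
        values, current = [], start
        for _ in range(n):
            current = min(high, max(low, current + random.uniform(-step, step)))
            values.append(current)
        return values

    if __name__ == "__main__":
        print(random_values(5))
        print(random_walk(5))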

The use of representative data sets and query workloads for ingestion speed testing

(Section 5.2) showed performance problems when ingesting the complex ratings

data set, especially for InfluxDB.

In the load testing benchmark (Section 5.3), it was discovered that OpenTSDB

2. Some individual queries were compared to further confirm that KairosDB with either Cassandra or ScyllaDB gives the same response for the same query.


[Figure 5.7 is a bar chart of response size in bytes for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB across the four query workloads; the underlying values are listed in Table A.7.]

Figure 5.7: Mean size in bytes of the TSDB response.

performed well for the baseline and IoT query workloads, but not for the financial

and ratings query workloads.

For the response latency (Section 5.4), the use of representative benchmarks again

showed a performance degradation for OpenTSDB for the financial and ratings

query workloads, which use broad time ranges. Otherwise, the baseline is a good

predictor for relative performance in the representative benchmarks.

When testing the mean response size (Section 5.5), the encoding of timestamps was

shown to be the deciding factor when it comes to query workloads which return a

large response.

These results make it clear that representative data set workloads and query work-

loads may lead to important differences in benchmark results. They shed doubt

on the real world applicability of benchmarks using random or synthetic data sets

and/or non-representative query workloads.

The fact that not all representative workloads show performance impact (e.g. only


the ratings workload showed the performance degradation for InfluxDB in the data

ingestion test) highlights the importance of using multiple representative workloads

- just one representative workload may not be enough to highlight possible devia-

tions or performance degradations. It is impractical to create a workload for every

use case, but it is possible to generalize workloads into categories (e.g. volume, tag

value cardinality, data type, ...). Further testing is needed to confirm that data

sets with the same workload parameters will yield comparable results.


Chapter 6

Conclusions and future work

6.1 Conclusions

Compared to a baseline non-representative workload, representative workloads

showed significant performance differences when it came to storage efficiency, data

ingestion speed for complex data, latency and maximum request rate (when broad

time ranges are used). Storage consumption per data point is lower for data sets with low tag value

cardinality and low variation. Non-representative benchmarks using random data

will have high variation, while real world data often displays low variation. Using

random walks instead of random values may make a benchmark more represen-

tative. Data ingestion throughput testing highlighted performance problems for

data sets with large data points and high tag cardinality. Latency and load testing

showed that some databases perform significantly worse when they need to scan

a large amount of data. This illustrates the importance of using representative

workloads.

A number of TSDB benchmarks have been studied, but none of them use repre-

sentative workloads. Three existing TSDB benchmarks use nearly representative

workloads, but none of them use real world data sets. Instead, they use random

or synthesised data. Considering that the benchmark presented here, which uses representative,

real world workloads, sheds a different light on TSDB performance, the relevance

of these existing benchmarks may be called into question.


While representative workloads uncovered significant performance differences com-

pared to non-representative workloads, it is impractical to create or test represen-

tative workloads for every use case imaginable. TSDB workloads can, however, be cat-

egorized with workload parameters (number of metrics, regularity, volume, data

type, number of tags, tag value data type, tag value cardinality, variation). Fur-

ther research is needed to determine if these parameters are enough to accurately

describe a TSDB workload and thus generalize results of one workload to another

with the same workload parameters.

Benchmarking TSDBs is a complex endeavour due to the absence of standardized

query languages, data models, or capabilities (such as aggregators or functions).

The proliferation of TSDB models has the advantage of specialisation: instead

of optimizing for the general case, individual TSDBs may seek to specialise in a

niche, e.g. geo-spatial data querying, nanosecond timestamp resolution, or real-

time streaming queries. The disadvantage is that it is much harder to compare

different TSDBs. The varying support for operations means that not all

TSDBs can be compared to each other, semantic differences in query languages

require careful comparison of results to ensure they are valid, and different database

interfacing methods may lead to more difficult interpretation of benchmark results.

6.2 Future work

This dissertation has proven the relevance of representative benchmarks. The

experiments and tests that were run for this dissertation took a lot of time to

prepare and execute, and therefore, a lot of extensions have been left for the

future. Several possible lines of research could be pursued:

• The hypothesis that workloads with the same data set characteristics yield

comparable benchmark results could be tested. Analysis might produce an-

other, non-obvious workload parameter.

• The benchmark described in this dissertation can be extended to use more

TSDBs. Currently, four TSDBs are tested, but more can be added. Another

approach would be to extend another existing TSDB benchmark to be more

representative.


• The query workload could be extended to include data mutations (such as

create, update and delete queries). Benchmarks using this query workload

might produce even more representative results. However, query spread

should be carefully studied: for most query workloads, create queries will

heavily outnumber update and delete queries.

A comparison of TSDB query languages might yield interesting results on

their construction and capabilities. Perhaps a unifying query language could

be constructed, which would facilitate research into different TSDB families.

• In production environments, TSDBs are often used in multi-node setups.

This scalability aspect is only addressed in one existing benchmark. The

benchmark in this dissertation could be extended to test clustered TSDBs.

• This dissertation has focused on TSDBs, a specialized type of database.

Representative benchmarking could be studied in different domains as well,

such as relational databases and specialized non-relational databases (such

as graph, triple or document stores).


Appendix A

Detailed results

This appendix lists detailed results discussed and displayed graphically in Chap-

ter 5.

A.1 Data ingestion throughput

Table A.1 lists the detailed results for Section 5.2.

Data set     InfluxDB   OpenTSDB   KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     481818     89360      54792                59231
IoT          317999     162473     87413                98736
Financial    156498     86578      78198                82535
Ratings      4342       21196      29913                NA

Table A.1: Data ingestion speed in points per second.

A.2 Storage efficiency

Table A.2 lists the detailed results for Section 5.1.


Data set     CSV   InfluxDB   OpenTSDB   KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     1.0   1.4443     2.4213     2.6983               18.7948
IoT          1.0   3.3308     0.2585     2.2353               31.1704
Financial    1.0   5.8211     1.5668     6.2073               36.5115
Ratings      1.0   7.2624     0.891      3.2897               NA

Table A.2: Storage efficiency relative to the CSV source format (CSV = 1.0).

A.3 Load testing

Table A.3 lists the detailed results for Section 5.3. Tests were performed using ten

requests in parallel, with a thirty second timeout.

Data set     InfluxDB   OpenTSDB   KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     6400.36    347.367    12.8                 15.4667
IoT          997.567    117.3      29.4333              40.1667
Financial    235.733    15.5664    26.8332              28.5667
Ratings      78.3333    2.13333    NA                   NA

Table A.3: Maximum requests per second performed using representative queries.

A.4 Response latency

Table A.4 shows the mean latency and Table A.5 the 95th percentile latency of TSDB

responses. Table A.6 shows the number of timeouts that occurred during the latency

and response size tests. These results are discussed in Section 5.4.

A.5 Mean response size

Table A.7 lists the detailed results for Section 5.5.


Data set     InfluxDB   OpenTSDB    KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     1.266      12.559      862.643              230.125
IoT          7.636      18.293      155.91               74.49
Financial    57.88      2681.441    106.85               122.25
Ratings      104.41     2563.02     NA                   NA

Table A.4: TSDB mean request latency in milliseconds for representative queries.

Data set     InfluxDB     OpenTSDB      KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     1.399121     12.8529       124.640305           70.532287
IoT          22.049411    44.926242     333.99535            193.673539
Financial    87.277173    3991.057059   133.174898           127.224758
Ratings      462.786983   2786.607644   NA                   NA

Table A.5: TSDB 95th percentile request latency in milliseconds for representative queries.

Data set     InfluxDB   OpenTSDB   KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     0          0          27                   4
IoT          0          0          0                    0
Financial    0          0          0                    0
Ratings      0          0          NA                   NA

Table A.6: Number of timeouts during the latency and response size tests.

Data set     InfluxDB   OpenTSDB   KairosDB-Cassandra   KairosDB-ScyllaDB
Baseline     185.0      126.0      202.0                202.0
IoT          507.35     350.45     459.4                459.4
Financial    33186.1    28854.35   23117.75             23117.75
Ratings      1250.35    390.65     NA                   NA

Table A.7: TSDB mean response size in bytes for representative queries.


Bibliography

[1] E. F. Codd. A Relational Model of Data for Large Shared Data Banks.

Commun. ACM, 13(6):377–387, June 1970.

[2] Andrew Pavlo and Matthew Aslett. What’s Really New with NewSQL? SIG-

MOD Rec., 45(2):45–55, September 2016.

[3] Katarina Grolinger, Wilson A. Higashino, Abhinav Tiwari, and Miriam AM

Capretz. Data management in cloud environments: NoSQL and NewSQL data

stores. Journal of Cloud Computing: Advances, Systems and Applications,

2(1):22, December 2013.

[4] Rick Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record,

39(4):12, May 2011.

[5] Andreas Bader, Oliver Kopp, and Michael Falkenthal. Survey and Comparison

of Open Source Time Series Databases. Gesellschaft fur Informatik e.V., 2017.

[6] Yueguo Chen. TS-Benchmark: A benchmark for time series databases. http:

//prof.ict.ac.cn/Bench18/chenyueguo.pdf, June 2018.

[7] Rui Liu and Jun Yuan. Benchmark Time Series Database with IoTDB-

Benchmark for IoT Scenarios. arXiv:1901.08304 [cs], January 2019.

[8] Kaippallimalil J. Jacob and Dennis Shasha. FinTime: A financial time series

benchmark. SIGMOD Record, 28:42–48, 1999.

[9] Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh,

Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping,

Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo


Batista. The UCR Time Series Classification Archive. October 2018.

https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.

[10] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. http://

archive.ics.uci.edu/ml, 2017.

[11] R.J. Hyndman. Time Series Data Library. https://datamarket.com/data/list/

?q=provider:tsdl.

[12] Time-series data on data.world: 34 datasets. https://data.world/datasets/

time-series.

[13] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network

Dataset Collection. June 2014.

[14] Davis Blalock, Samuel Madden, and John Guttag. Sprintz: Time Series Com-

pression for the Internet of Things. Proc. ACM Interact. Mob. Wearable Ubiq-

uitous Technol., 2(3):93:1–93:23, September 2018.

[15] Robert Allen. Case Study: How Houghton Mifflin Harcourt gets real-time

views into their AWS spend with InfluxData, October 2017.

[16] Adam Wegrzynek. Towards the integrated ALICE Online-Offline monitor-

ing subsystem. https://indico.cern.ch/event/587955/contributions/2937431/

attachments/1678739/2706702/CHEP-2018.pdf, September 2018.

[17] Mario Luca Bernardi, Marta Cimitile, Fabio Martinelli, and Francesco Mer-

caldo. A Time Series Classification Approach to Game Bot Detection. In

Proceedings of the 7th International Conference on Web Intelligence, Mining

and Semantics, WIMS ’17, pages 6:1–6:11, New York, NY, USA, 2017. ACM.

[18] Yanpei Chen, Francois Raab, and Randy Katz. From TPC-C to Big Data

Benchmarks: A Functional Workload Model. In David Hutchison, Takeo

Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C.

Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen,

Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard

Weikum, Tilmann Rabl, Meikel Poess, Chaitanya Baru, and Hans-Arno Ja-

cobsen, editors, Specifying Big Data Benchmarks, volume 8163, pages 28–43.

Springer Berlin Heidelberg, Berlin, Heidelberg, 2014.


[19] DB-Engines Ranking per database model category. https://db-engines.com/

en/ranking_categories.

[20] Y C Tay. Data Generation for Application-Specific Benchmarking. VLDB,

Challenges and Visions, 7:4, 2011.

[21] Zhang, Xiaolan and Seltzer, and Margo. Application-Specific Benchmarking.

Harvard University, 2001.

[22] Ajay Joshi, Lieven Eeckhout, and Lizy John. The Return of Synthetic Bench-

marks. In 2008 SPEC Benchmark Workshop, pages 1–11, 2008.

[23] A. Chakraborti, M. Patriarca, and M. S. Santhanam. Financial time-series

analysis: A brief overview. arXiv:0704.1738 [physics, q-fin], pages 51–67,

2007.

[24] Boris Marjanovic. Huge Stock Market Dataset. https://kaggle.com/

borismarjanovic/price-volume-data-for-all-us-stocks-etfs.

[25] F. Maxwell Harper and Joseph A. Konstan. The MovieLens Datasets: History

and Context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, December

2015.

[26] Code for comparison write ups of InfluxDB and other solutions:

Influxdata/influxdb-comparisons. InfluxData, May 2019.

[27] Tomas Senart. HTTP load testing tool and library. tsenart/vegeta. https:

//github.com/tsenart/vegeta, May 2019.

[28] Jef Poskanzer. http_load. https://acme.com/software/http_load/.

[29] NoSQL data store using the seastar framework, compatible with Apache Cas-

sandra: Scylladb/scylla. https://github.com/scylladb/scylla, May 2019.


List of Abbreviations

ACID Atomicity, Consistency, Isolation, Durability

API Application Programming Interface

ARIMA AutoRegressive Integrated Moving Average

CAP Consistency, Availability and Partition Tolerance

CERN European Organization for Nuclear Research

CPU Central Processing Unit

CRUD Create, Read, Update and Delete

CSV Comma-separated values

CTSDB Cloud Time Series Database

DBMS Database management system

ETF Exchange-Traded Fund

HTTP HyperText Transfer Protocol

IMDb Internet Movie Database

IEEE Institute of Electrical and Electronics Engineers

IoT Internet of Things

JSON JavaScript Object Notation

KPI Key Performance Indicator

MIT Massachusetts Institute of Technology

NoSQL Not Only SQL

OLAP Online Analytical Processing


RAM Random Access Memory

RDBMS Relational Database Management System

REST Representational State Transfer

SNAP Stanford Network Analysis Project

SQL Structured Query Language

STAC Securities Technology Analysis Center

TPC Transaction Processing Performance Council

TS Time Series

TSDB Time Series Database

TSDL Time Series Data Library

UCI University of California, Irvine

UDP User Datagram Protocol

URL Uniform Resource Locator

USD United States Dollar

YCSB Yahoo! Cloud Serving Benchmark


List of Figures

5.1 Storage efficiency of different TSDBs in bytes per data point. Data points contain a timestamp, a value, and may contain tags, depending on the data set.

5.2 Relative storage efficiency of different TSDBs per data point compared to the CSV source format.

5.3 Data points ingested per second. Data sets used were one million data points each.

5.4 Maximum requests per second. Tests were performed on data sets one million data points in size.

5.5 Mean latency per request.

5.6 95th percentile of latency per request.

5.7 Mean size in bytes of the TSDB response.


List of Tables

3.1 Evaluation of existing TSDB benchmarks

4.1 Overview of workload data sets

A.1 Data ingestion speed in points per second.

A.2 Storage efficiency relative to the CSV source format.

A.3 Maximum requests per second performed using representative queries.

A.4 TSDB mean request latency in milliseconds for representative queries.

A.5 TSDB 95th percentile request latency in milliseconds for representative queries.

A.6 Number of timeouts during the latency and response size tests.

A.7 TSDB mean response size in bytes for representative queries.