

Building a scalable time-series database on PostgreSQL

Erik Nordström, Ph.D. Senior Software Engineer

[email protected] · github.com/timescale

@timescale

[Chart: 5 months of rapid growth, 4/1/17 – 9/1/17]

Time-series Data is Emerging Everywhere

• Industrial Machines
• Transportation & Logistics
• Datacenter & DevOps
• Financial & Marketing


Fastest growing database category

Source: DB-Engines.com

Why time-series data?

• Past: historical analysis
• Present: real-time monitoring, troubleshooting
• Future: predict & avoid problems

…and there’s a lot of time-series data

• 44ZB of data collected from IoT devices by 2020 (IDC)
• 25GB of data collected per hour by connected cars (McKinsey)
• 71% of global businesses now collecting IoT data (451 Research)

What is time-series data?

  time            | device_id | device_type
  ----------------+-----------+-------------
  1/1/17 01:01:00 | 0s9djf    | floor_sensor
  1/1/17 01:01:01 | al8fk2    | roof_sensor
  1/1/17 01:02:00 | 5k2ga     | floor_sensor
  1/1/17 01:02:05 | fj8va2    | roof_sensor
  1/1/17 01:03:02 | 0s9djf    | floor_sensor
  1/1/17 01:03:08 | jw385z    | roof_sensor
  1/1/17 01:04:00 | 1j86aq    | floor_sensor
  1/1/17 01:04:09 | 98aat1    | roof_sensor
  1/1/17 01:05:04 | 0s9djf    | floor_sensor
  1/1/17 01:05:05 | al8fk2    | roof_sensor

Rows are ordered by time, older rows first: time stamps the measurements, while device_id and device_type identify the metadata.

Many common scenarios

measurements:

  time            | device_id | temperature | cpu_avg | humidity
  ----------------+-----------+-------------+---------+---------
  1/1/17 01:01:00 | 0s9djf    |  70         | 100     | 0.6
  1/1/17 01:01:01 | al8fk2    |  35         |  51     | 0.3
  1/1/17 01:02:00 | 5k2ga     |  71         |  85     | 0.6
  1/1/17 01:02:05 | fj8va2    |  60         |  40     | 0.3
  1/1/17 01:03:02 | 0s9djf    |  72         |  88     | 0.6
  1/1/17 01:03:08 | jw385z    |  60         |  36     | 0.3
  1/1/17 01:04:00 | 1j86aq    |  73         |  82     | 0.6
  1/1/17 01:04:09 | 98aat1    |  -5         |  10     | 0.3
  1/1/17 01:05:04 | 0s9djf    |  70         |  80     | 0.6
  1/1/17 01:05:05 | al8fk2    |  60         |  30     | 0.3

metadata:

  device_id | device_type  | location_id
  ----------+--------------+------------
  0s9djf    | floor_sensor | 439
  al8fk2    | roof_sensor  | 523
  5k2ga     | floor_sensor | 124
  fj8va2    | roof_sensor  | 124
  jw385z    | roof_sensor  | 632
  1j86aq    | floor_sensor | 370
  98aat1    | roof_sensor  | 621

Common scenario: Measurements + Metadata

The measurements (time, container, location_id, humidity, …) reference a metadata table keyed on container (the primary key), here describing each shipping container's cargo and type, shown per measurement row as on the slide:

  container | cargo       | type
  ----------+-------------+-----
  0s9djf    | electronics | 40S
  al8fk2    | flowers     | 40R
  5k2ga     | toys        | 40S
  fj8va2    | food        | 40S
  0s9djf    | furniture   | 40H
  jw385z    | toys        | 20S
  1j86aq    | car parts   | 45H
  98aat1    | frozen food | 40R
  0s9djf    | electronics | 40S
  al8fk2    | food        | 40S

What database for time-series data?

• Relational: 32%
• NoSQL: 68%

Source: https://www.percona.com/blog/2017/02/10/percona-blog-poll-database-engine-using-store-time-series-data/

Why so much NoSQL?

Why not use a simple relational database table?

Benchmark setup: Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage). Each row has 12 columns (1 timestamp, 1 indexed host ID, 10 metrics).

Challenge in scaling up

• Time is the primary dimension:
  – B-tree index on the time column for time-oriented queries (rollups, group by time)
  – Separate B-tree for each secondary index
• As the table grows large:
  – Data and indexes no longer fit in memory
  – Reads/writes go to random locations in the B-tree

NoSQL champion: Log-Structured Merge Trees

• Key-value store with indexed key lookup at high write rates
• Range queries over sorted keys on disk
• Compressed data storage
• Common approach for time series: use key <name,tags,field,time>
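To make that key layout concrete, here is a toy sketch (mine, not from the talk; the store and helper names are invented) of why keying points as <name,tags,field,time> works: one series' time range becomes a contiguous slice of the sorted key space.

```python
from bisect import bisect_left, bisect_right

def make_key(name, tags, field, t):
    # Canonical key: <name, tags, field, time>, with tags serialized in
    # sorted order so identical tag sets always compare equal.
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return (name, tag_str, field, t)

store = {}  # stand-in for the LSM tree's sorted on-disk tables

def put(name, tags, field, t, value):
    store[make_key(name, tags, field, t)] = value

def range_query(name, tags, field, t0, t1):
    # Keys sort by (name, tags, field, time), so a series' time range
    # [t0, t1] is one contiguous run of sorted keys.
    keys = sorted(store)
    lo = bisect_left(keys, make_key(name, tags, field, t0))
    hi = bisect_right(keys, make_key(name, tags, field, t1))
    return [store[k] for k in keys[lo:hi]]

put("cpu", {"host": "a"}, "avg", 1, 100)
put("cpu", {"host": "a"}, "avg", 2, 51)
put("cpu", {"host": "b"}, "avg", 1, 85)
print(range_query("cpu", {"host": "a"}, "avg", 1, 2))  # [100, 51]
```

The flip side: finding which keys match a given tag value requires a separate tag index, which is where the high-cardinality issue below comes from.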

NoSQL + LSMTs Come at a Cost

• Requires a separate (in-memory) index to efficiently map tags to keys ⇒ high-cardinality issue
• Poor/no secondary index support
• Less powerful queries
• No JOINs
• Loss of the SQL ecosystem

Is there a better way?

TimescaleDB: a scalable SQL database for time-series, packaged as a PostgreSQL extension.

How?

Time-series workloads are different.

OLTP:
• Primarily UPDATEs
• Writes randomly distributed
• Transactions to multiple primary keys

Time-series:
• Primarily INSERTs
• Writes to a recent time interval
• Writes associated with a timestamp and primary key

How it works

Multi-dimensional Partitioning (for both scaling up & out)
• Partitioned along a time dimension (older chunks age out) and a space dimension
• Each partition is a chunk (sub-table)

The Hypertable Abstraction
• A hypertable is the collection of chunks, exposed as a single table
• Indexes • Triggers • Constraints • UPSERTs • Table mgmt
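A minimal sketch of the two-dimensional chunk routing described above (illustrative only: the interval, partition count, and hash function are my stand-ins, not TimescaleDB's internals):

```python
import zlib
from collections import defaultdict

TIME_INTERVAL = 3600       # chunk width: one hour (illustrative)
NUM_SPACE_PARTITIONS = 4   # space dimension: device_id hashed into 4 partitions

chunks = defaultdict(list)  # (time_bucket, space_partition) -> rows

def chunk_key(time_s, device_id):
    # Time dimension: align to the start of the interval.
    # Space dimension: stable hash of the device id.
    return (time_s - time_s % TIME_INTERVAL,
            zlib.crc32(device_id.encode()) % NUM_SPACE_PARTITIONS)

def insert(time_s, device_id, row):
    chunks[chunk_key(time_s, device_id)].append(row)

insert(10, "dev-a", "r1")
insert(20, "dev-a", "r2")    # same hour, same partition -> same chunk
insert(5000, "dev-a", "r3")  # next hour -> a new chunk
print(len(chunks))  # 2
```

Because every row lands in exactly one (time, space) cell, inserts to the recent interval stay concentrated in a few hot chunks.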

Familiar Management

$ psql
psql (9.6.3)
Type "help" for help.

tsdb=# CREATE TABLE data (
         time        TIMESTAMP WITH TIME ZONE NOT NULL,
         device_id   TEXT NOT NULL,
         temperature NUMERIC NULL,
         humidity    NUMERIC NULL
       );

tsdb=# SELECT create_hypertable('data', 'time', 'device_id', 8);

tsdb=# INSERT INTO data (SELECT * FROM old_data);

Creating/migrating is easy

Automatic Multi-dimensional Partitioning

How exactly do we partition by time?

Static, fixed duration?
• Challenge: data volumes can change

Fixed target size?
• Early data can create too-long intervals
• Bulk inserts expensive

Adaptive chunking

• New approach: adaptive intervals
• Chunks are created with a fixed time interval, but the interval adapts to changes in data volumes
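One way to picture adaptive intervals (a sketch of the idea only, not TimescaleDB's actual algorithm or tuning): scale the next chunk's interval by how far the previous chunk's size missed a target size.

```python
def next_chunk_interval(prev_interval_s, prev_chunk_bytes, target_bytes,
                        min_s=3600, max_s=7 * 86400):
    """Pick the next chunk's time interval so that, at the fill rate
    observed for the previous chunk, it lands near target_bytes."""
    if prev_chunk_bytes == 0:
        return prev_interval_s  # no data observed yet; keep the interval
    scaled = prev_interval_s * target_bytes / prev_chunk_bytes
    return max(min_s, min(max_s, scaled))  # clamp to sane bounds

# A 1-day chunk that filled to 2 GB against a 1 GB target gets halved.
print(next_chunk_interval(86400, 2 * 2**30, 2**30))  # 43200.0
```

The clamp handles both failure modes above: sparse early data cannot stretch intervals without bound, and a burst of bulk inserts cannot shrink them to slivers.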

Chunking benefits

• Chunks are “right-sized”: recent (hot) chunks fit in memory
• Efficient retention policies: drop chunks, don’t delete rows ⇒ avoids vacuuming
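The retention point can be sketched in a few lines (an illustrative model, not the real drop_chunks implementation): when data lives in per-interval chunks, enforcing retention is a handful of whole-chunk drops instead of row-by-row DELETEs that later need vacuuming.

```python
from collections import defaultdict

CHUNK_S = 86400  # one chunk per day (illustrative)
chunks = defaultdict(list)  # chunk start time -> rows

def insert(ts, row):
    chunks[ts - ts % CHUNK_S].append((ts, row))

def drop_chunks(older_than):
    # Drop whole chunks whose entire [start, start + CHUNK_S) range is
    # older than the cutoff; no per-row DELETE, nothing to vacuum.
    expired = [start for start in chunks if start + CHUNK_S <= older_than]
    for start in expired:
        del chunks[start]
    return len(expired)

insert(10, "a")       # day 0
insert(90000, "b")    # day 1
insert(200000, "c")   # day 2
print(drop_chunks(older_than=172800))  # 2 (days 0 and 1 dropped whole)
```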

Common mechanisms for scaling up & out

Scaling up (single node):
• Chunks spread across many disks
• Faster inserts
• Parallel queries

Scaling out (multiple nodes):
• Chunks spread across servers
• Chunk-aware query optimizations (push-down LIMITs and aggregates, server-wide merge appends, etc.)

Chunk-aware query optimizations

Avoid querying chunks via constraint exclusion:

SELECT time, temp FROM data
  WHERE time > now() - interval '7 days'
  AND device_id = '12345'

SELECT time, device_id, temp FROM data
  WHERE time > '2017-08-22 18:18:00+00'

SELECT time, device_id, temp FROM data
  WHERE time > now() - interval '24 hours'

For the last query, plain Postgres won’t exclude chunks: its plan-time constraint exclusion only works against constants, not expressions like now().

Benefit at any size
• 5 MILLION ROWS
• 5 BILLION ROWS: 1.13M metrics/s

Why run two databases when you can run one?

Simplify your stack: instead of the application talking to both an RDBMS and a NoSQL store, it talks to TimescaleDB alone.

Democratize your data
• Empower your entire org to analyze data: it’s just psql
• Geospatial + temporal analysis

“As a bit of an update, we’ve scaled to beyond 500 billion rows now and things are still working very smoothly. The performance we’re seeing is monstrous, almost 70% faster queries!”

- Software Engineer, Large Enterprise Computer Company

Performance benefits

INSERT performance
• Plain PostgreSQL: 144K metrics/s (14.4K inserts/s)
• TimescaleDB: 1.11M metrics/s (>20x at large table sizes)

Benchmark setup: Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage). Each row has 12 columns (1 timestamp, 1 indexed host ID, 10 metrics).

Query Performance

Speedup vs. plain PostgreSQL (source: https://blog.timescale.com/timescaledb-vs-6a696248104e):

• Table scans, simple column rollups: 0-20%
• GROUP BYs: 20-200%
• Time-ordered GROUP BYs: 400-10000x
• DELETEs: 2000x

High-Level Differences from Plain Postgres

• Insert performance

• Automatic partitioning

• Easier management

• Functions for time-based analysis

• Time-aware query optimizations

• Much faster deletes

High-Level Differences from NoSQL

• Secondary indexes

• Lower memory requirements

• No high cardinality problems

• Un-siloed data (JOINs!)

• SQL

We’re open-source! Apache 2.0 license

github.com/timescale/timescaledb

Scalable
• High write rates
• Time-oriented optimizations
• Fast complex queries
• Tens of billions of rows per node

Reliable
• Inherits 20+ years of PostgreSQL reliability and ecosystem
• Streaming replication, backups, HA clustering

Easy to Use
• Supports full SQL
• Time-oriented features
• Connectors: just speaks PostgreSQL
• One DB for relational & time-series data