Building a Custom Data Warehouse Using PostgreSQL
TOASTing an Elephant
David Kohn
Chief Elephant Toaster and Data Engineer at Moat
We measure attention online, for both advertisers and publishers.
We don't track cookies or IP addresses.
Rather we process billions of events per day that allow us to measure how many people saw an ad or interacted with it.
We are a neutral third party and our metrics are used by both advertisers and publishers to measure their performance and agree on a fair price.
Those billions of events are aggregated in our realtime system and end up as millions of rows per day added to our stats databases.
[Moat Interface screenshots]
Basic Row Structure
[Row diagram: client | date | filter1 … filterN | metric1 … metricN]
• Partition keys: client, date
• Filters: ~10 text columns
• Metrics: ~170 int8 columns
• Production queries have a single client.
• Production queries sum all of the metrics; filter subsets are hierarchical.
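As a rough DDL sketch of this layout (names are the illustrative ones used in the queries below, with far fewer columns than the real ~10 filters and ~170 metrics):

-- Hedged sketch only, not the production schema.
CREATE TABLE rollup_table_name (
    client   text   NOT NULL,   -- partition key
    date     date   NOT NULL,   -- partition key
    filter1  text,              -- … ~10 text filter columns
    filter2  text,
    metric1  bigint,            -- … ~170 int8 metric columns
    metric2  bigint
);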
Typical Query

SELECT filter1, filter2, … SUM(metric1), SUM(metric2), … SUM(metricN)
FROM rollup_table_name
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
GROUP BY filter1, filter2, …
[Moat Interface screenshots: pick a client, filters, a date range, and metrics (and there are a lot more metrics you can choose)]

Lots of Data
Requirements
• Sum large amounts of data quickly (but only a small fraction of the total data, which is easily partition-able)
• Sum all columns of very wide rows
• Compress data (for storage and I/O reasons)
• Support medium read concurrency (or at least degrade predictably), i.e. 4-12 requests/second, some of which can take minutes to finish
• Data is derivative and structured to meet the needs of the client-facing app: high read/aggregation throughput for clients
• ETL quickly; some bulk delete/redo operations, once per day

Should we choose a row store or a column store?
Old Systems

Postgres:
• 2 masters + 2 replicas each
• Handled last 7 days
• High concurrency
• Highly disk bound
• Heavily partitioned
• Shield for the column stores

Vertica:
• ~3 months/cluster (30 TB license, 8 nodes, $$$)
• Fast, but slowed down under concurrency
• Performance degradation unpredictable
• Projections can lead to slow ETL

Redshift:
• 1 cluster (8 nodes, spinning disk)
• 2012-Present
• No rollup tables, too big
• Incredibly slow for client-facing queries (many columns)
• Bulk-insert ETL; delete/update hard
Row Store

[Diagram: a table on disk is a collection of pages; each page holds tuples, and each tuple is a header plus its attributes (attrs)]

A table is a collection of rows, each row split into columns/attrs.
Each row must fit into a page.
Row Store
• Accesses small subsets of rows quickly
• Little penalty for selecting many columns
• Great for individual inserts, updates, and deletes
• Often a normalized data structure
• OLTP workloads
• High concurrency, less throughput per user
• Data stored uncompressed, unless too large for a block
Column Store

[Diagram: a table on disk is a collection of columns; each column's pages hold compressed values (possibly with surrogate keys)]

A table is a collection of columns.
Each column is split into values; position corresponds to row.
Values in columns are often compressed.
Column Store
• Scans and aggregates large numbers of rows quickly
• Best when selecting a subset of columns
• Great for bulk inserts; harder to delete or update
• Often a denormalized data structure
• OLAP workloads
• Lower concurrency, much higher throughput per user
• Data can be compressed
[Diagram: a tuple whose attribute no longer fits within a page]

What happens when an attr is too big to fit in a page?
TOAST: The Oversized Attribute Storage Technique

[Diagram: the oversized attr in the main tuple is replaced by a pointer; the value is compressed (pglz, an LZ-family scheme), split into segments, and each segment is stored as a (tuple id, segment, compressed attr) row in pages of a separate TOAST table]
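TOAST behavior is controllable per column; a minimal sketch, using a hypothetical table and column (some_table / big_attr), since the exact names don't matter here:

-- Hedged sketch. EXTENDED (the default for varlena types) allows both compression
-- and out-of-line storage; EXTERNAL stores out of line without compressing;
-- MAIN prefers compression and keeps the value in the main table where possible.
ALTER TABLE some_table ALTER COLUMN big_attr SET STORAGE EXTENDED;

-- The TOAST table backing a given table can be found through the catalog:
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'some_table';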
Project Marjory
Original Row
[client | date | filter1 … filterN | metric1 … metricN]
Partition keys: client, date. Filters: ~10 text. Metrics: ~170 int8.

Subtype
subtype = (filter1 … filterN, metric1 … metricN), a composite type holding everything but the partition keys.

MegaRow
[client | date | segment | array of subtype]
Partition keys plus an array of the composite type (~5,000 subtype rows per array).
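A minimal DDL sketch of this layout, with hypothetical column names (rollup_array in particular) and far fewer columns than the real type:

-- Hedged sketch only; the real composite type has ~10 filters and ~170 metrics.
CREATE TYPE subtype AS (
    filter1  text,
    filter2  text,
    metric1  bigint,
    metric2  bigint
);

CREATE TABLE array_table_name (
    date          date  NOT NULL,
    client        text  NOT NULL,
    segment       int   NOT NULL,   -- keeps each array to roughly 5,000 elements
    rollup_array  subtype[]         -- the MegaRow payload; large arrays get TOASTed
);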
Typical ETL Query

INSERT INTO array_table_name
SELECT date, client, segment,
       ARRAY_AGG((filter1, filter2, … metric1, metric2, … metricN)::subtype)
FROM temp_table_for_etl
GROUP BY date, client, segment;
Reporting Query

SELECT a.date, a.client,
       s.filter1, … s.filterN,
       SUM(s.metric1), … SUM(s.metricN)
FROM array_table_name a,
     LATERAL UNNEST(subtype[]) s (filter1, filter2, … metricN)
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
GROUP BY a.date, a.client, s.filter1, … s.filterN;
Benchmarks: Marjory vs. Redshift
[Query-time comparison charts]
• 1 client, 10 days, ~150,000 rows/day (~1.5M rows total)
• 1 client, 10 days, ~3,000,000 rows/day (~30M rows total)
• 1 client, 4 months, ~150,000 rows/day (~18M rows total)
The Good
• Performs quite well on our typical queries (lots of columns, large subset of rows)
• Sort order matters less than in column stores
• Query time scales with the number of rows unpacked and aggregated, and depends only lightly on the number of columns
• Uses resources efficiently under concurrency (Postgres' stinginess can serve us well)
• 8-10x compression for our data (with a bit of extra tuning of our composite type)
• All done in PL/pgSQL etc.; no C code required

The Not-So-Good
• Doesn't do as well on general SQL queries; you have to unpack all of the rows
• Not getting you much compared to a column store if you're accessing only a few columns (one might be able to design it differently, though)
• Doesn't dynamically scale the number of workers to the size of the query (Postgres' stinginess doesn't serve us well for more typical BI cases, but that wasn't what we optimized for)
• Isn't going to do as well when scanning very large numbers of rows (i.e. more typical BI)
Trade generality for fit to our use case.
I’ll Drink to That!
Rollups

SELECT filter1, filter2, … filterN,
       SUM(metric1), … SUM(metricN)
FROM …
GROUP BY GROUPING SETS (
    (filter1, filter2, … filterN-1, filterN),
    (filter1, … filterN-1),
    …
    (filter1, filter2),
    (filter1)
);

INSERT INTO byfilter1 …
INSERT INTO byfilter2 …
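As a small, self-contained illustration of how GROUPING SETS produces these hierarchical rollups in one pass (just two filters and one metric, using the illustrative names from above):

-- Hedged sketch: rows from the coarser grouping set have NULL in the rolled-away
-- filter; GROUPING(filter2) = 1 distinguishes them from genuinely NULL filter values.
SELECT filter1,
       filter2,
       GROUPING(filter2) AS filter2_rolled_up,
       SUM(metric1)      AS metric1
FROM temp_table_for_etl
GROUP BY GROUPING SETS ((filter1, filter2), (filter1));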
MegaRow with Rollup Arrays
[client | date | segment | byfilter1[] | byfilter2[] | byfilter3[] | byfilter4[] | …]
Each byfilterN[] column is an array of the composite subtype holding that level's rollup; filters that have been rolled away are stored as NULLs.
Summary Statistics
[client | date | segment | rollup arrays | total_rows | metadata]
Alongside the partition keys and rollup arrays, each MegaRow carries summary-statistics columns such as total_rows and metadata.
Count Rows/Day by Client

SELECT date, client, SUM(total_rows) AS rows_per_day
FROM array_table_name
GROUP BY date, client;
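One plausible way to fill total_rows is during the same ETL aggregation; a hedged sketch continuing the illustrative schema above (assuming the table has been extended with the total_rows column from this slide, and rollup_array remains a hypothetical column name):

-- Hedged sketch: total_rows is just the count of subtype rows packed into the array.
INSERT INTO array_table_name (date, client, segment, rollup_array, total_rows)
SELECT date, client, segment,
       ARRAY_AGG((filter1, filter2, metric1, metric2)::subtype),
       COUNT(*)
FROM temp_table_for_etl
GROUP BY date, client, segment;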
Distinct Filter Values
[client | date | segment | rollup arrays | total_rows | metadata | arrays of distinct filter values]
Each MegaRow also stores, per filter, an array of the distinct values appearing inside its rollup arrays.
Targeted Reporting Query

Filtering on s.filter1 alone still unpacks every array; adding a containment check against the stored distinct-value list lets Postgres skip whole MegaRows that cannot contain 'fizz' before unpacking them:

SELECT a.date, a.client, s.filter1, s.filter2, … s.metricN
FROM array_table_name a,
     LATERAL UNNEST(subtype[]) s (filter1, filter2, … metricN)
WHERE client = 'foo' AND date >= 'bar' AND date <= 'baz'
  AND s.filter1 = 'fizz'
  AND a.distinct_filter1 @> '{fizz}'::text[];
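(The distinct lists themselves could be collected in the same ETL pass with something like ARRAY_AGG(DISTINCT filter1).) For readers unfamiliar with @>, it is Postgres' array-containment operator; a tiny self-contained check of its semantics:

-- Hedged examples of array containment: true only when every element of the
-- right-hand array appears somewhere in the left-hand array.
SELECT '{buzz,fizz}'::text[] @> '{fizz}'::text[];   -- true
SELECT '{buzz}'::text[]      @> '{fizz}'::text[];   -- false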
Stats
• Marjory (all data since 2012) has about the same on-disk footprint as Elmo (the last ~33 days)
• ~20x compression compared to normal-format Postgres (~10x from TOAST + ~2x from avoided storage of rollups)
• 5 Marjory instances, each with all of the data for all time (on local spinning-disk drives), have basically taken over what we had on our Vertica and Redshift instances (at least 16 instances)
• The overall tradeoff is I/O for CPU, so we had to do some tuning to get parallel plans chosen and executed properly
Useful Tuning Tips

Only Do Meaningful Statistics (But Make Them Good)

ALTER TABLE array_table_name ALTER client SET STATISTICS 10000;
ALTER TABLE array_table_name ALTER byfilter1 SET STATISTICS 0;
ALTER TABLE array_table_name ALTER byfilter2 SET STATISTICS 0;
...
ALTER TABLE array_table_name ALTER byfilterN SET STATISTICS 0;

Make Data-Type-Specific Functions for Unnest, With Proper Stats

CREATE FUNCTION unnest(byfilter4) RETURNS SETOF array_subtype AS $func$
    ...
$func$ LANGUAGE plpgsql ROWS 5000 COST 5000;

Futz With Parallelization Parameters Until They Work

min_parallel_relation_size, parallel_setup_cost, parallel_tuple_cost,
max_worker_processes, max_parallel_workers_per_gather, cpu_operator_cost?
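These are all standard GUCs; a hedged sketch of adjusting them (values are purely illustrative, not our production settings; note that min_parallel_relation_size was renamed min_parallel_table_scan_size in PostgreSQL 10):

-- Hedged sketch only; tune against your own workload.
ALTER SYSTEM SET max_worker_processes = 16;              -- takes effect after a restart
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;
ALTER SYSTEM SET parallel_setup_cost = 100;              -- default 1000; lower favors parallel plans
ALTER SYSTEM SET parallel_tuple_cost = 0.05;             -- default 0.1
ALTER SYSTEM SET min_parallel_relation_size = '8MB';
SELECT pg_reload_conf();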
Yep. CPU Bound
We’re hiring!
http://grnh.se/os4er71