memsql db class, ankur goyal

15-415/615 [1]

Ankur Goyal
3/17/2016

[1] Based on a lecture given at Carnegie Mellon University.

(c) Ankur Goyal

Questions We Will Answer

• What is an in-memory database?

• Why do they matter?

• How do you build one?

• How do people use MemSQL?

Topics

• In-Memory Databases

• In-Memory Architecture

• MemSQL in the Wild

• Q/A

Ankur Goyal

• CMU SCS (2008-2011), PDL (2010-2011)

• Microsoft (2010)

• VP of Engineering @ MemSQL (2011-)

• I ❤ databases

Live Demo

What is an in-memory database?

In-Memory Databases...

• Use memory instead of disk

• Do not (need to) save data on disk

• Put the whole dataset in memory

Well, sometimes...

Wikipedia says...

In-memory databases primarily rely on main-memory for storage.

In-Memory Databases

• Are durable to disk (and respect ACID)

• Can spill to disk or pin data in memory (and take advantage of it)

• Tradeoffs are suited to systems with lots of memory

• Tend to be distributed systems

• Have a different set of bottlenecks

Bold Claim

All database workloads will be running on in-memory databases

Why?

• Memory is getting cheaper (about 40% every year)

• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc.)

• In-memory databases leverage SSD (no random writes)

• NVRAM is coming (and could be cheaper than SSD)

In-memory databases are tuned to modern hardware and modern workloads

In-Memory Architecture

Architecture Topics

• In-Memory Storage

• Transactions and Concurrency Control

• Crash Recovery and Replication

• Code Generation

• Distributed Execution

In-Memory Storage Motivation

• Insanely fast random reads & writes

• Atomic writes as granular as a byte

• Working space is precious (RAM)

• Very different for rowstores and columnstores

In-Memory Rowstore

• Rowstores have lots of random reads/writes

• Datasets are usually small (< 10 TB)

Solution: keep the whole dataset in memory

• Use memory-optimized data structures (skip list)

What is a Skip List

• Invented in 1989 by William Pugh

• Expected O(log(n)) lookup, insert, delete

• No pages
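
The properties above (towers of forward pointers, expected O(log(n)) operations, no pages) can be sketched in a few lines of Python. This is a single-threaded teaching sketch, not MemSQL's lock-free implementation; the names `SkipList`, `MAX_LEVEL`, and `P` are assumptions made for illustration.

```python
import random

MAX_LEVEL = 16  # assumed cap on tower height
P = 0.5         # assumed probability of growing a tower by one level


class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        # The "tower": one forward pointer per level this node takes part in.
        self.forward = [None] * level


class SkipList:
    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = Node(None, None, MAX_LEVEL)
        self._level = 1  # highest level currently in use

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and self._rng.random() < P:
            level += 1
        return level

    def _find_predecessors(self, key):
        # For every level, find the last node whose key is < key.
        update = [self._head] * MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key, value):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # overwrite an existing key
            return
        level = self._random_level()
        self._level = max(self._level, level)
        node = Node(key, value, level)
        for i in range(level):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key, default=None):
        node = self._find_predecessors(key)[0].forward[0]
        if node is not None and node.key == key:
            return node.value
        return default

    def delete(self, key):
        update = self._find_predecessors(key)
        node = update[0].forward[0]
        if node is None or node.key != key:
            return False
        for i in range(len(node.forward)):
            update[i].forward[i] = node.forward[i]
        return True

    def items(self):
        # An ordered scan is a walk along the bottom-level linked list.
        node = self._head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]
```

Forward scans follow the level-0 pointers only, so they need no tree traversal. Reverse iteration is not supported by this layout, which is the concern raised on a later slide.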

Common Concerns

• Memory overhead

Skip List Struct Layout

    struct Table_Row {
        int col_a;
        char* col_b;
        …
        Tower* idx_1_ptrs;
        Tower* idx_2_ptrs;
    };

• Scan performance

Inefficient Skip List

Efficient Skip List

• Reverse Iteration (HW Assignment)

Concurrency Control

• No pages => No latches

• Skip list in MemSQL is lock-free

• Every node is a lock-free linked list

• Row locks are implemented with futexes (4 bytes)

• Read-committed and snapshot isolation
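
The slides name the two isolation levels but do not show how they are implemented. As a rough illustration of the difference, here is a minimal multi-version sketch: every commit appends a timestamped version, a read-committed read sees the latest committed version, and a snapshot read sees the state as of a fixed timestamp. `VersionedStore` and its methods are hypothetical names, and this is not a description of MemSQL's internals.

```python
import itertools


class VersionedStore:
    def __init__(self):
        self._clock = itertools.count(1)  # assumed global commit counter
        self._versions = {}               # key -> [(commit_ts, value)], oldest first
        self._last_commit = 0

    def commit(self, writes):
        """Atomically publish a batch of writes under one commit timestamp."""
        ts = next(self._clock)
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def snapshot(self):
        """Timestamp a snapshot-isolation transaction would read at."""
        return self._last_commit

    def read(self, key, as_of=None):
        # as_of=None behaves like read-committed: latest committed version.
        if as_of is None:
            as_of = self._last_commit
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= as_of:
                return value
        return None


store = VersionedStore()
store.commit({"x": 1})
snap = store.snapshot()
store.commit({"x": 2})
print(store.read("x", as_of=snap), store.read("x"))
```

The snapshot reader keeps seeing the old value of `x` while the read-committed reader sees the new one, and neither blocks the writer.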

In-Memory Columnstore

Columnstore Review

• Big sequential scans and writes

• Huge immutable vectors of data

Solution: Cache dataset in memory

How do columnstores benefit from in-memory?

Have a lock-free skip list handy?

• Keep metadata in-memory

• Use sidecar rowstore for fast small-batch writes

Columnstore LSM

• Log-Structured Merge of sorted runs

• Tunable tradeoffs for read/write amplification

• Enables fast writes to a sorted columnstore

• Smallest sorted run is a skip list
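
The log-structured merge idea can be sketched as follows: small writes land in an in-memory run, full runs are frozen as immutable sorted lists, and runs are merged once there are too many of them. This is a toy model under stated assumptions: a plain dict stands in for the skip list, and `TinyLSM`, `memtable_limit`, and `max_runs` are made-up names.

```python
from bisect import bisect_left


class TinyLSM:
    def __init__(self, memtable_limit=4, max_runs=3):
        self.memtable_limit = memtable_limit  # rows buffered before a flush
        self.max_runs = max_runs              # runs tolerated before a merge
        self.memtable = {}                    # stand-in for the in-memory skip list
        self.runs = []                        # immutable sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the in-memory run as an immutable sorted run.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.runs) > self.max_runs:
            self._merge_all()

    def _merge_all(self):
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run first
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Raising `max_runs` means fewer merges (less write amplification) but more runs to probe per read (more read amplification), which is the tunable tradeoff named above.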

Crash Recovery

Durability in an In-Memory System?

• Memory is not a reliable medium (yet)

• There is always a hierarchy

• E.g. EBS -> S3 -> Glacier

• To operate at in-memory speed, all disk I/O must be sequential

Durability in the Rowstore

• Indexes are not materialized on disk

• Reconstruct indexes on the fly during recovery

• Only need to log PK data

• Take full database snapshots periodically

• Tunable to be sync/async

Durability in the Columnstore

• Metadata uses ordinary rowstore mechanism

• Segments are huge (several KB or even MB)

• Read/written sequentially

• Columnstore segments synchronously written to disk

• Memory-speed writes go to sidecar rowstore

Crash Recovery

• Replay latest snapshot, and then every log file since

• No partially written state on disk, so no undos

• Columnstore just replays metadata

• Replication == Continuous replay over the network
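
The recovery steps above can be sketched as a redo-only replay: load the latest snapshot, apply every later log record in order, then rebuild secondary indexes in memory. The JSON record format, the `lsn` field, and the function names are assumptions for illustration, not MemSQL's on-disk format.

```python
import json


def apply_record(state, record):
    """Redo one committed log record against the primary-key data."""
    if record["op"] == "put":
        state[record["key"]] = record["value"]
    elif record["op"] == "delete":
        state.pop(record["key"], None)
    else:
        raise ValueError("unknown op: %r" % record["op"])


def recover(snapshot_text, log_lines):
    """Load the snapshot, then replay every log record written after it."""
    snapshot = json.loads(snapshot_text)
    state = dict(snapshot["rows"])
    for line in log_lines:
        record = json.loads(line)
        if record["lsn"] <= snapshot["lsn"]:
            continue  # already reflected in the snapshot
        apply_record(state, record)  # only committed data is logged: no undo pass
    return state


def rebuild_index(state, column):
    """Secondary indexes are never stored on disk; rebuild them from the rows."""
    index = {}
    for pk, row in state.items():
        index.setdefault(row[column], []).append(pk)
    return index


snapshot = json.dumps({"lsn": 2, "rows": {"1": {"city": "SF"}, "2": {"city": "NYC"}}})
log = [
    json.dumps({"lsn": 3, "op": "put", "key": "3", "value": {"city": "SF"}}),
    json.dumps({"lsn": 4, "op": "delete", "key": "1"}),
]
state = recover(snapshot, log)
print(state, rebuild_index(state, "city"))
```

A replica could run the same `apply_record` loop over records streamed from the primary, which is one way to read "replication == continuous replay over the network".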

Code Generation

    import time

    class Row(object):
        def __init__(self, a):
            self.a = a

    # One million rows to aggregate over.
    t = [Row(x) for x in range(1000000)]

    class State(object):
        def __init__(self):
            self.agg_sum = 0

    def loop(state, row):
        state.agg_sum += row.a + 1

    def query():
        state = State()
        for r in t:
            loop(state, r)
        return state

    if __name__ == '__main__':
        start = time.time()
        state = query()
        end = time.time()
        print("Answer: %d, Time (s): %g" % (state.agg_sum, (end - start)))

    #include <cstdint>
    #include <cstdio>
    #include <ctime>
    #include <vector>

    struct Row
    {
        Row(int a_arg) : a(a_arg) { }
        int a;
    };

    struct State
    {
        State() : agg_sum(0) { }
        int64_t agg_sum;
    };

    inline void loop(State& state, const Row& row)
    {
        state.agg_sum += row.a + 1;
    }

    inline State query(std::vector<Row>& rows)
    {
        State s;
        for (Row& r : rows)
        {
            loop(s, r);
        }
        return s;
    }

    int main(void)
    {
        std::vector<Row> rows;
        for (int i = 0; i < 1000000; i++)
        {
            rows.emplace_back(i);
        }

        clock_t start = clock();
        State state = query(rows);
        clock_t end = clock();

        // Cast so that %lld is correct wherever int64_t is not long long.
        printf("Answer: %lld, Time (s): %g\n",
               (long long) state.agg_sum, (end - start) * 1.0 / CLOCKS_PER_SEC);
    }

Comparison

    $ python test.py
    Answer: 500000500000, Time (s): 0.251049

    $ time g++ test.cpp -o test-cpp -std=c++0x
    real 0m0.176s
    user 0m0.150s
    sys 0m0.023s
    $ ./test-cpp
    Answer: 500000500000, Time (s): 0.006745

37x difference in execution

1.37x even with compilation time
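
Both ratios follow directly from the timings printed above. A quick check, taking the `real` time reported by `time g++` as the compilation cost:

```python
python_run_s = 0.251049   # python test.py
cpp_run_s = 0.006745      # ./test-cpp
cpp_compile_s = 0.176     # "real" time of the g++ invocation

# Execution only: interpreted vs. compiled.
execution_speedup = python_run_s / cpp_run_s

# End to end: the compiled path must pay for compilation first.
end_to_end_speedup = python_run_s / (cpp_compile_s + cpp_run_s)

print("%.1fx execution, %.2fx including compilation"
      % (execution_speedup, end_to_end_speedup))
```

The second ratio is why a plan cache matters: compilation dominates the cost of a single run, so the payoff comes from reusing the compiled plan.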

Code Generation

• Expression execution

• Inline scans

• Need a powerful plan cache

• OLTP vs. data exploration

Plancache Example (1)

    SELECT * FROM users WHERE id = 5
    SELECT * FROM users WHERE id = 8

=>

    SELECT * FROM users WHERE id = @

Plancache Example (2)

    SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
    SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)

=>

    SELECT * FROM users WHERE id IN (@) OR a IN (@)
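
The two examples suggest a cache keyed on the query text with literals stripped out. A minimal sketch of that idea, using regular expressions rather than a real SQL parser: numeric literals become `@`, numeric IN-lists collapse to a single `@`, and string literals become `^` (the marker used on the drill-down slide). `parameterize` and `PlanCache` are hypothetical names, and the regexes are an assumption that only covers queries as simple as the ones shown.

```python
import re

_NUM = r"\d+(?:\.\d+)?"
_IN_LIST = re.compile(r"\bIN\s*\(\s*%s(?:\s*,\s*%s)*\s*\)" % (_NUM, _NUM), re.IGNORECASE)
_STRING = re.compile(r"'[^']*'|\"[^\"]*\"")
_NUMBER = re.compile(r"\b%s\b" % _NUM)


def parameterize(sql):
    """Replace literals with placeholders so similar queries share one key."""
    sql = _IN_LIST.sub("IN (@)", sql)  # any list length maps to the same plan
    sql = _STRING.sub("^", sql)
    sql = _NUMBER.sub("@", sql)
    return " ".join(sql.split())       # normalize whitespace


class PlanCache:
    def __init__(self, compile_fn):
        self._compile = compile_fn     # the expensive code-generation step
        self._plans = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, sql):
        key = parameterize(sql)
        if key in self._plans:
            self.hits += 1
        else:
            self.misses += 1
            self._plans[key] = self._compile(key)
        return self._plans[key]


cache = PlanCache(lambda key: ("compiled", key))
cache.lookup("SELECT * FROM users WHERE id = 5")
cache.lookup("SELECT * FROM users WHERE id = 8")
print(cache.hits, cache.misses)
```

Queries that differ only in literals hit the cache. Queries that differ in shape, such as grouping by a different column, still produce different keys and therefore a fresh compilation.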

Drill Down Example

    SELECT region, SUM(price)
    FROM sales
    GROUP BY region

=>

    SELECT rep, SUM(price)
    FROM sales
    WHERE region="northeast"
    GROUP BY rep;

=>

    SELECT rep, SUM(price)
    FROM sales
    WHERE region=^
    GROUP BY rep;

=>

    SELECT product, SUM(price)
    FROM sales
    WHERE region="northwest"
    GROUP BY product;

=>

    SELECT product, SUM(price)
    FROM sales
    WHERE region=^
    GROUP BY product;

No plancache match!

Let's look at some generated code

Old Code Generation

    memsql> select concat("foo", "bar");
    +----------------------+
    | concat("foo", "bar") |
    +----------------------+
    | foobar               |
    +----------------------+
    1 row in set (0.81 sec)

    bool overflow = false;
    VarCharTemp result1("foo", 3, threadId);
    VarCharTemp result2("bar", 3, threadId);
    opt<TemporaryImmutableString> result3;
    op_Concat(result3, result1, result2, overflow, threadId);

Code Generation is Hard

• Old compilers adage: Pick 2 of 3

• Fast execution time

• Fast compile time

• Fast development time

• E.g. Assembly, C++, Python

• JIT compilers turned this on its head

MemSQL Compiler Pipeline

Expression Snippet (MPL)

    memsql> select concat("foo", "bar");
    +----------------------+
    | concat("foo", "bar") |
    +----------------------+
    | foobar               |
    +----------------------+
    1 row in set (0.81 sec)

    declare outRow3 <- OutRowInit()
    OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))
    OutRowSend(&outRow3)

MBC Snippet

    OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))

    0x0048 OutRowInit local=&outRow
    0x0050 InitString local=&local_2 data=0 i64=3 coll=unspecified
    0x0068 UpdateCollation local=&local_2 coll=utf8_general_ci
    0x0074 InitString local=&local_3 data=1 i64=3 coll=unspecified
    0x008c UpdateCollation local=&local_3 coll=utf8_general_ci
    0x0098 Concat local=&local local=&local_2 local=&local_3
    0x00a8 OutRowString local=&outRow local=&local target=0x01ac
    0x00b8 OptStringFree local=&local
    0x00c0 OptStringFree local=&local_3
    0x00c8 OptStringFree local=&local_2
    0x00d0 InitString local=&local_5 data=2 i64=3 coll=unspecified
    0x00e8 UpdateCollation local=&local_5 coll=utf8_general_ci
    0x00f4 InitString local=&local_6 data=3 i64=3 coll=unspecified
    0x010c UpdateCollation local=&local_6 coll=utf8_general_ci
    0x0118 Concat local=&local_4 local=&local_5 local=&local_6
    0x0128 OutRowString local=&outRow local=&local_4 target=0x018c
    0x0138 OptStringFree local=&local_4
    0x0140 OptStringFree local=&local_6
    0x0148 OptStringFree local=&local_5

Distributed Query Execution

First, some terminology

Much easier to reason in terms of shipping SQL

    SELECT supp_nation, cust_nation, l_year, Sum(volume) AS revenue
    FROM (SELECT n1.n_name AS supp_nation,
                 n2.n_name AS cust_nation,
                 Extract(year FROM l_shipdate) AS l_year,
                 l_extendedprice * ( 1 - l_discount ) AS volume
          FROM supplier, lineitem, orders, customer, nation n1, nation n2
          WHERE s_suppkey = l_suppkey
            AND o_orderkey = l_orderkey
            AND c_custkey = o_custkey
            AND s_nationkey = n1.n_nationkey
            AND c_nationkey = n2.n_nationkey
            AND ( ( n1.n_name = 'CANADA' AND n2.n_name = 'UNITED STATES' )
               OR ( n1.n_name = 'RUSSIA' AND n2.n_name = 'UNITED STATES' ) )
            AND l_shipdate BETWEEN Date('1995-01-01') AND Date('1996-12-31')) AS shipping
    GROUP BY supp_nation, cust_nation, l_year
    ORDER BY supp_nation, cust_nation, l_year;

Abstractions

• Distributed Query Plan created on aggregator

• Layers of primitive operations glued together

• Full SQL on leaves

• REMOTE tables

• RESULT tables

Primitives (SQL)

• Queries over physical indexes

• Hook into global transactional state

• Full SQL on a single partition

• Access to rowstores and columnstores

Example query the aggregator can send to the leaf:

    SELECT t.a, t.b, SUM(t.price)
    FROM t             -- This will scan a physical table on the leaf
    WHERE t.c = 1000   -- This will use a local index
    GROUP BY t.a, t.b  -- This will produce 1 row per group
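
One way to picture what happens around that leaf query: each leaf returns one partial SUM per group for its partition, and the aggregator adds the partials together. The sketch below models rows as dicts and uses made-up function names; it illustrates the two-phase aggregation pattern, not MemSQL's executor.

```python
from collections import defaultdict


def leaf_query(rows):
    """SELECT a, b, SUM(price) FROM t WHERE c = 1000 GROUP BY a, b, on one partition."""
    partial = defaultdict(int)
    for row in rows:
        if row["c"] == 1000:
            partial[(row["a"], row["b"])] += row["price"]
    return dict(partial)  # one entry per group on this leaf


def aggregator_merge(partials):
    """Combine per-leaf partial sums; SUM of SUMs is the global SUM."""
    total = defaultdict(int)
    for partial in partials:
        for group, subtotal in partial.items():
            total[group] += subtotal
    return dict(total)


leaf_1 = [
    {"a": 1, "b": "x", "c": 1000, "price": 5},
    {"a": 1, "b": "x", "c": 1000, "price": 7},
    {"a": 2, "b": "y", "c": 1, "price": 100},
]
leaf_2 = [
    {"a": 1, "b": "x", "c": 1000, "price": 3},
    {"a": 2, "b": "y", "c": 1000, "price": 4},
]
print(aggregator_merge([leaf_query(leaf_1), leaf_query(leaf_2)]))
```

Because each leaf sends at most one row per group, the data crossing the network is bounded by the number of groups rather than the number of rows scanned.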

Primitives (Remote Tables)

• Address data across leaves

• SQL interface + custom shard key

• Parallel execution primitives

• Reshuffling

• Merging on group keys

• Merging data from joins (e.g. left joins)

    SELECT t.a, SUM(s_net.c)
    FROM
        -- The row in s where s_net.b = t.a may not
        -- be on the same node as the local t. REMOTE(s)
        -- addresses the table across the cluster.
        t, REMOTE(s) AS s_net
    WHERE t.a = s_net.b
    GROUP BY t.a

    SELECT t.a, SUM(s_net.c)
    FROM
        -- This is a reshuffle operation. It relies on t
        -- being sharded on (t.a) and type(t.a) == type(s.b).
        -- It will only pull rows in s.b that match the
        -- shard key's local values of (t.a).
        t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
    WHERE t.a = s_net.b
    GROUP BY t.a
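
The reshuffle described in the second query can be simulated in memory: rows of `s` are redistributed by hashing `s.b` with the same function that placed `t` by `t.a`, after which every leaf can join locally. This is a sketch under assumptions: hash-modulo placement, integer keys, and the helper names are all illustrative, and no network is involved.

```python
from collections import Counter


def shard_of(value, num_leaves):
    # Assumed placement rule: hash of the shard key modulo the leaf count.
    return hash(value) % num_leaves


def reshuffle(partitions, key, num_leaves):
    """Redistribute rows so that equal values of `key` land on the same leaf."""
    out = [[] for _ in range(num_leaves)]
    for part in partitions:
        for row in part:
            out[shard_of(row[key], num_leaves)].append(row)
    return out


def local_join_sum(t_rows, s_rows):
    """SELECT t.a, SUM(s.c) FROM t, s WHERE t.a = s.b GROUP BY t.a, on one leaf."""
    t_counts = Counter(row["a"] for row in t_rows)
    sums = {}
    for row in s_rows:
        matches = t_counts.get(row["b"], 0)
        if matches:
            sums[row["b"]] = sums.get(row["b"], 0) + row["c"] * matches
    return sums


def distributed_join_sum(t_parts, s_parts):
    num_leaves = len(t_parts)
    s_net = reshuffle(s_parts, "b", num_leaves)  # plays the role of REMOTE(s) WITH (shard_key=(s.b))
    result = {}
    for t_rows, s_rows in zip(t_parts, s_net):
        # Each t.a value lives on exactly one leaf, so the groups are disjoint.
        result.update(local_join_sum(t_rows, s_rows))
    return result


t = [{"a": k} for k in [1, 2, 3, 4, 2]]
s = [{"b": 1, "c": 10}, {"b": 2, "c": 5}, {"b": 2, "c": 7}, {"b": 9, "c": 100}]
t_parts = reshuffle([t], "a", 3)   # t is sharded on a
s_parts = [s[:2], s[2:], []]       # s is distributed arbitrarily
print(distributed_join_sum(t_parts, s_parts))
```

The local join is only correct because both sides agree on the placement function, which mirrors the stated requirement that `t` is sharded on `(t.a)` and the key types match.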

Primitives (Result Tables)

• Shared, cached results of SQL queries

• Shares scans/computations across readers

• Supports streaming semantics

• Technically an optimization

• Similar to an RDD in Spark

    CREATE RESULT TABLE t_reshuffled AS
    SELECT t.a, t.b, SUM(t.price)
    FROM t
    GROUP BY t.a, t.b
    SHARD BY t.a, t.b

Optimizations

• Single-machine optimizations

• Index selection, Sorting/Grouping

• SQL -> SQL rewrites

• Cost-based distributed optimizer

• Broadcast vs. Reshuffling

• and many, many more
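
To make the broadcast-versus-reshuffle choice concrete, here is a toy cost model that counts only rows sent over the network, assuming rows are spread evenly across leaves. The formulas and names are illustrative assumptions; a real cost-based optimizer weighs much more (row widths, skew, CPU, existing shard keys).

```python
def broadcast_cost(rows, num_leaves):
    # Every row must reach every leaf other than the one already holding it.
    return rows * (num_leaves - 1)


def reshuffle_cost(rows, num_leaves):
    # With even hashing, a row stays on its current leaf with probability 1/n.
    return rows * (num_leaves - 1) / num_leaves


def choose_join_strategy(left_rows, right_rows, num_leaves):
    options = {
        "broadcast_smaller": broadcast_cost(min(left_rows, right_rows), num_leaves),
        "reshuffle_both": reshuffle_cost(left_rows + right_rows, num_leaves),
    }
    return min(options, key=options.get)


print(choose_join_strategy(1000000, 100, 10))
print(choose_join_strategy(1000000, 1000000, 10))
```

Under this model a tiny dimension table is cheaper to broadcast, while two large tables are cheaper to reshuffle on the join key.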

MemSQL in the Wild

Horizontals and Verticals

• Real-time data processing is everywhere

• Top use-cases: Real-Time Analytics and Large-Scale Applications

• Top verticals: Financial Services, Webscale, Telco, Federal, Media

Real-time Analytics

• High volumes of data, processed in real-time

• Fast updates in the rowstore

• INSERT ... ON DUPLICATE KEY UPDATE

• E.g. 2M update transactions/sec on 10 nodes

• Fast appends, even one row at a time, in the columnstore

• E.g. 1 GB/s on 16 EC2 nodes
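
The upsert pattern named above keeps a running aggregate per key: insert the row if the key is new, otherwise update it in place. A minimal in-memory sketch of those semantics, with a dict standing in for a rowstore table; the `counters` table and its columns in the comment are hypothetical.

```python
def upsert_counter(table, key, delta):
    # Same effect as (hypothetical table and columns):
    #   INSERT INTO counters (k, n) VALUES (key, delta)
    #   ON DUPLICATE KEY UPDATE n = n + VALUES(n)
    table[key] = table.get(key, 0) + delta


events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]
counters = {}
for key, delta in events:
    upsert_counter(counters, key, delta)
print(counters)
```

Each event is a single statement with no read-then-write round trip, which is what makes the pattern a good fit for high-volume real-time counters.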

Real-time Analytics

• Converging with mainline analytics

• No compromises, e.g. limited SQL, limited windows

• Real-time means fast reads as well

• Subsecond queries for dashboards

• Millisecond queries for applications

Large-Scale Applications

• Large-scale operational analytics and applications

• Hundreds of nodes for perf and HA

• True "production" workloads

• Existing OLTP databases lack scalability and SQL perf

• Existing OLAP databases lack operational features

Logos

Take-Aways

• In-memory Database != All-memory Database

• In-memory Databases are databases built to modern tradeoffs

• Old problems with new solutions

• Real-time analytics and Large-scale applications == New projects

• We are hiring and ❤ Waterloo.

• Come visit us in SF: email [email protected]

Questions