MemSQL DB Class, Ankur Goyal

15-415/615, Ankur Goyal, 3/17/2016. Based on a lecture given at Carnegie Mellon University. (c) Ankur Goyal

Post on 15-Apr-2017

Page 1: MemSQL DB Class, Ankur Goyal

15-415/615 [1]

Ankur Goyal, 3/17/2016

[1] Based on a lecture given at Carnegie Mellon University.

(c) Ankur Goyal

Page 2: MemSQL DB Class, Ankur Goyal

Questions We Will Answer

• What is an in-memory database?

• Why do they matter?

• How do you build one?

• How do people use MemSQL?

Page 3: MemSQL DB Class, Ankur Goyal

Topics

• In-Memory Databases

• In-Memory Architecture

• MemSQL in the Wild

• Q/A

Page 4: MemSQL DB Class, Ankur Goyal

Ankur Goyal

• CMU SCS (2008-2011), PDL (2010-2011)

• Microsoft (2010)

• VP of Engineering @ MemSQL (2011-)

• I ❤ databases

Page 5: MemSQL DB Class, Ankur Goyal

Live Demo

Page 6: MemSQL DB Class, Ankur Goyal

What is an in-memory database?

Page 13: MemSQL DB Class, Ankur Goyal

In-Memory Databases...

• Use memory instead of disk

• Do not (need to) save data on disk

• Put the whole dataset in memory

Well, sometimes...

Page 14: MemSQL DB Class, Ankur Goyal

Wikipedia says...

In-memory databases primarily rely on main-memory for storage.

Page 19: MemSQL DB Class, Ankur Goyal

In-Memory Databases

• Are durable to disk (and respect ACID)

• Can spill on disk or pin data in-memory (and take advantage of it)

• Tradeoffs are suited to systems with lots of memory

• Tend to be distributed systems

• Have a different set of bottlenecks

Page 20: MemSQL DB Class, Ankur Goyal

Bold Claim

Page 21: MemSQL DB Class, Ankur Goyal

All database workloads will be running on in-memory databases

Page 26: MemSQL DB Class, Ankur Goyal

Why?

• Memory is getting cheaper (about 40% every year)

• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc)

• In-memory databases leverage SSD (no random writes)

• NVRAM is coming (and could be cheaper than SSD)

In-memory databases are tuned to modern hardware and modern workloads
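To get a feel for what "about 40% every year" compounds to, here is a small worked calculation. It assumes the figure means the price per byte falls to 60% of the previous year's price; the function name and the starting price of 1.0 are illustrative, not from the lecture.

```python
def price_after(years, start_price=1.0, yearly_drop=0.40):
    """Price after compounding a fixed fractional drop every year.

    Assumption: "40% cheaper every year" means each year's price is
    (1 - 0.40) times the previous year's price.
    """
    return start_price * (1.0 - yearly_drop) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print("after %2d years: %.4f of today's price" % (years, price_after(years)))
```

Under that assumption, the same amount of memory costs under a tenth of today's price after five years.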

Page 27: MemSQL DB Class, Ankur Goyal

In-Memory Architecture

Page 28: MemSQL DB Class, Ankur Goyal

Architecture Topics

• In-Memory Storage

• Transactions and Concurrency Control

• Crash Recovery and Replication

• Code Generation

• Distributed Execution

Page 32: MemSQL DB Class, Ankur Goyal

In-Memory Storage Motivation

• Insanely fast random reads & writes

• Atomic writes as granular as a byte

• Working space is precious (RAM)

• Very different for rowstores and columnstores

Page 36: MemSQL DB Class, Ankur Goyal

In-Memory Rowstore

• Rowstores have lots of random reads/writes

• Datasets are usually small (< 10 TB)

Solution: keep the whole dataset in memory

• Use memory optimized data structures (skip list)

Page 39: MemSQL DB Class, Ankur Goyal

What is a Skip List

• Invented in 1989 by William Pugh

• Expected O(log(n)) lookup, insert, delete

• No pages
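The properties above can be illustrated with a minimal, single-threaded skip list. This is a textbook-style sketch, not MemSQL's implementation: it is not lock-free, and the class names, the MAX_HEIGHT of 16, and the coin-flip probability of 0.5 are all assumptions made for the example.

```python
import random


class Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        # forward[i] is the next node at level i (this node's "tower").
        self.forward = [None] * height


class SkipList:
    MAX_HEIGHT = 16  # illustrative cap on tower height

    def __init__(self, seed=0):
        self.head = Node(None, None, self.MAX_HEIGHT)
        self.height = 1
        self.rng = random.Random(seed)

    def _random_height(self):
        # Each extra level is kept with probability 1/2, which gives
        # expected O(log n) search paths.
        h = 1
        while h < self.MAX_HEIGHT and self.rng.random() < 0.5:
            h += 1
        return h

    def _find_predecessors(self, key):
        # For every level, the last node whose key is < key.
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            preds[level] = node
        return preds

    def insert(self, key, value):
        preds = self._find_predecessors(key)
        nxt = preds[0].forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value  # overwrite an existing key
            return
        h = self._random_height()
        self.height = max(self.height, h)
        node = Node(key, value, h)
        for level in range(h):
            node.forward[level] = preds[level].forward[level]
            preds[level].forward[level] = node

    def lookup(self, key):
        node = self._find_predecessors(key)[0].forward[0]
        return node.value if node is not None and node.key == key else None

    def delete(self, key):
        preds = self._find_predecessors(key)
        node = preds[0].forward[0]
        if node is None or node.key != key:
            return False
        for level in range(len(node.forward)):
            if preds[level].forward[level] is node:
                preds[level].forward[level] = node.forward[level]
        return True

    def scan(self):
        # An ordered scan only follows the level-0 pointers.
        node = self.head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]
```

Note that every node is an individually allocated object linked by pointers, which is the sense in which there are "no pages" to latch.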

Page 41: MemSQL DB Class, Ankur Goyal

Common Concerns

• Memory overhead

Page 43: MemSQL DB Class, Ankur Goyal

Skip List Struct Layout

struct Table_Row {
    int col_a;
    char* col_b;
    …
    Tower* idx_1_ptrs;
    Tower* idx_2_ptrs;
};

Page 44: MemSQL DB Class, Ankur Goyal

Common Concerns

• Memory overhead

• Scan performance

Page 46: MemSQL DB Class, Ankur Goyal

Inefficient Skip List

Page 47: MemSQL DB Class, Ankur Goyal

Efficient Skip List

Page 49: MemSQL DB Class, Ankur Goyal

Common Concerns

• Memory overhead

• Scan performance

• Reverse Iteration (HW Assignment)

Page 50: MemSQL DB Class, Ankur Goyal

Concurrency Control

Page 55: MemSQL DB Class, Ankur Goyal

Concurrency Control

• No pages => No latches

• Skip list in MemSQL is lock-free

• Every node is a lock-free linked list

• Row locks are implemented with futexes (4 bytes)

• Read-committed and snapshot isolation

Page 56: MemSQL DB Class, Ankur Goyal

In-Memory Columnstore

Page 60: MemSQL DB Class, Ankur Goyal

Columnstore Review

• Big sequential scans and writes

• Huge immutable vectors of data

Solution: Cache dataset in memory

Page 61: MemSQL DB Class, Ankur Goyal

How do columnstores benefit from in-memory?

Page 63: MemSQL DB Class, Ankur Goyal

Have a lock-free skip list handy?

• Keep metadata in-memory

• Use sidecar rowstore for fast small-batch writes

Page 69: MemSQL DB Class, Ankur Goyal

Columnstore LSM

• Log-Structured Merge of sorted runs

• Tunable tradeoffs for read/write amplification

• Enables fast writes to a sorted columnstore

• Smallest sorted run is a skip list
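The idea of merging sorted runs can be sketched in a few lines. This is a toy model under stated assumptions, not the MemSQL columnstore: a plain dict stands in for the in-memory skip list that holds the smallest run, Python lists stand in for on-disk segments, and `flush_threshold` and `max_runs` are hypothetical knobs illustrating the read/write amplification tradeoff.

```python
import heapq
from bisect import bisect_left


class SortedRuns:
    def __init__(self, flush_threshold=4, max_runs=3):
        self.flush_threshold = flush_threshold  # rows buffered before a flush
        self.max_runs = max_runs                # runs tolerated before merging
        self.buffer = {}   # stand-in for the in-memory rowstore (skip list)
        self.runs = []     # immutable sorted runs, newest first

    def write(self, key, value):
        # Small writes only touch memory.
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Turn the buffer into one immutable sorted run (a sequential write).
        self.runs.insert(0, sorted(self.buffer.items()))
        self.buffer = {}
        if len(self.runs) > self.max_runs:
            self._merge_all()

    def _merge_all(self):
        # Tag rows with the age of their run so the newest version of a key
        # sorts first, then keep only the first row seen for each key.
        tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(self.runs)]
        merged = []
        for k, _age, v in heapq.merge(*tagged):
            if not merged or merged[-1][0] != k:
                merged.append((k, v))
        self.runs = [merged]

    def read(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in self.runs:  # newest run first
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

A larger `max_runs` means fewer merges (less write amplification) but more runs to probe per read; a smaller value trades the other way.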

Page 71: MemSQL DB Class, Ankur Goyal

Crash Recovery

Page 75: MemSQL DB Class, Ankur Goyal

Durability in an In-Memory System?

• Memory is not a reliable medium (yet)

• There is always a hierarchy

• E.g. EBS -> S3 -> Glacier

• To operate at in-memory speed, all disk I/O must be sequential

Page 80: MemSQL DB Class, Ankur Goyal

Durability in the Rowstore

• Indexes are not materialized on disk

• Reconstruct indexes on the fly during recovery

• Only need to log PK data

• Take full database snapshots periodically

• Tunable to be sync/async

Page 86: MemSQL DB Class, Ankur Goyal

Durability in the Columnstore

• Metadata uses ordinary rowstore mechanism

• Segments are huge (several KB or even MB)

• Read/written sequentially

• Columnstore segments synchronously written to disk

• Memory-speed writes go to sidecar rowstore

Page 90: MemSQL DB Class, Ankur Goyal

Crash Recovery

• Replay latest snapshot, and then every log file since

• No partially written state on disk, so no undos

• Columnstore just replays metadata

• Replication == Continuous replay over the network
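The recovery rule above (load the latest snapshot, then replay every log record written since) can be sketched as follows. This is a simplified model with made-up names: Python lists and dicts stand in for the log and snapshot files, and a log record is just a (primary key, row) pair where a row of None means delete. Only committed changes are appended to the log, which is why the replay needs no undo step.

```python
import copy


class TinyStore:
    def __init__(self):
        self.rows = {}         # primary key -> row (the in-memory state)
        self.log = []          # stand-in for an append-only log file
        self.snapshot = {}     # stand-in for the latest snapshot file
        self.snapshot_lsn = 0  # number of log records the snapshot covers

    def commit(self, pk, row):
        # Append to the log first (sequential I/O), then apply in memory.
        self.log.append((pk, row))
        apply_record(self.rows, pk, row)

    def take_snapshot(self):
        self.snapshot = copy.deepcopy(self.rows)
        self.snapshot_lsn = len(self.log)


def apply_record(rows, pk, row):
    if row is None:
        rows.pop(pk, None)  # delete
    else:
        rows[pk] = row      # insert or update


def recover(snapshot, snapshot_lsn, log):
    """Rebuild in-memory state: snapshot first, then every later log record."""
    rows = copy.deepcopy(snapshot)
    for pk, row in log[snapshot_lsn:]:
        apply_record(rows, pk, row)
    return rows
```

Replication fits the same model: a replica that keeps calling `apply_record` on records as they arrive over the network is performing a continuous replay.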

Page 91: MemSQL DB Class, Ankur Goyal

Code Generation

Page 92: MemSQL DB Class, Ankur Goyal

import time

class Row(object):
    def __init__(self, a):
        self.a = a

# One million rows with a single integer column.
t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

# Per-row work: equivalent to SUM(a + 1).
def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print("Answer: %d, Time (s): %g" % (state.agg_sum, (end-start)))

Page 93: MemSQL DB Class, Ankur Goyal

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

// Per-row work: equivalent to SUM(a + 1).
inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows)
    {
        loop(s, r);
    }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }

    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();

    // Cast so that %lld matches on platforms where int64_t is long.
    printf("Answer: %lld, Time (s): %g\n",
           (long long)state.agg_sum, (end-start) * 1.0 / CLOCKS_PER_SEC);
}

Page 96: MemSQL DB Class, Ankur Goyal

Comparison

$ python test.py
Answer: 500000500000, Time (s): 0.251049

$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745

37x difference in execution
1.37x even with compilation time
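The two ratios on this slide follow directly from the timings shown. The short calculation below reproduces them, assuming "with compilation time" means adding the `real` time reported for g++ to the C++ run time; the variable names are illustrative.

```python
python_run = 0.251049   # seconds, Python execution time from the slide
cpp_compile = 0.176     # seconds, "real" time reported by `time g++ ...`
cpp_run = 0.006745      # seconds, C++ execution time from the slide

# Execution only: Python time divided by C++ time.
execution_speedup = python_run / cpp_run

# Compile + execute: the C++ side pays for compilation as well.
end_to_end_speedup = python_run / (cpp_compile + cpp_run)

print("execution only:   %.1fx" % execution_speedup)
print("with compilation: %.2fx" % end_to_end_speedup)
```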

Page 100: MemSQL DB Class, Ankur Goyal

Code Generation

• Expression execution

• Inline scans

• Need a powerful plan cache

• OLTP vs. data exploration

Page 101: MemSQL DB Class, Ankur Goyal

Plancache Example (1)

SELECT * FROM users WHERE id = 5
SELECT * FROM users WHERE id = 8

=>

SELECT * FROM users WHERE id = @

Page 102: MemSQL DB Class, Ankur Goyal

Plancache Example (2)

SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)

=>

SELECT * FROM users WHERE id IN (@) OR a IN (@)
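The two plancache examples amount to stripping literals out of the query text so that queries with the same shape share one compiled plan. A rough sketch of that idea with regular expressions follows. It is only an illustration: a real system would parameterize the parsed query rather than the raw text, and the function names, the `^` marker for string literals (borrowed from the drill-down slide), and the cache structure are assumptions.

```python
import re

# Integer IN-lists of any length collapse to a single marker.
_IN_LIST = re.compile(r"\bIN\s*\(\s*\d+(?:\s*,\s*\d+)*\s*\)", re.IGNORECASE)
# Quoted string literals.
_STRING = re.compile(r"\"[^\"]*\"|'[^']*'")
# Stand-alone integer literals (digits inside identifiers such as t1 are kept).
_NUMBER = re.compile(r"\b\d+\b")


def parameterize(sql):
    """Replace literals with markers so same-shaped queries compare equal."""
    sql = _STRING.sub("^", sql)
    sql = _IN_LIST.sub("IN (@)", sql)
    sql = _NUMBER.sub("@", sql)
    return sql


plan_cache = {}


def get_plan(sql):
    """Return (plan, cache_hit). The 'plan' is a placeholder string here."""
    key = parameterize(sql)
    if key in plan_cache:
        return plan_cache[key], True
    plan_cache[key] = "compiled plan for: " + key  # stand-in for code generation
    return plan_cache[key], False
```

With this scheme, the OLTP-style lookups above hit the cache after the first query, whereas an exploratory query with a new shape (different columns or grouping) always produces a new key and pays for compilation.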

Page 104: MemSQL DB Class, Ankur Goyal

Drill Down Example

SELECT region, SUM(price) FROM sales GROUP BY region

=>

SELECT rep, SUM(price) FROM sales WHERE region="northeast" GROUP BY rep;

=>

SELECT rep, SUM(price) FROM sales WHERE region=^ GROUP BY rep;

=>

SELECT product, SUM(price) FROM sales WHERE region="northwest" GROUP BY product;

=>

SELECT product, SUM(price) FROM sales WHERE region=^ GROUP BY product;

No plancache match!

Page 105: MemSQL DB Class, Ankur Goyal

Let's look at some generated code

Page 106: MemSQL DB Class, Ankur Goyal

Expression Snippet

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.00 sec)

Page 107: MemSQL DB Class, Ankur Goyal

Old Code Generation

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

bool overflow = false;
VarCharTemp result1("foo", 3, threadId);
VarCharTemp result2("bar", 3, threadId);
opt<TemporaryImmutableString> result3;
op_Concat(result3, result1, result2, overflow, threadId);

Page 111: MemSQL DB Class, Ankur Goyal

Code Generation is Hard

• Old compilers adage: Pick 2 of 3

• Fast execution time

• Fast compile time

• Fast development time

• E.g. Assembly, C++, Python

• JIT compilers turned this on its head

Page 112: MemSQL DB Class, Ankur Goyal

MemSQL Compiler Pipeline

Page 113: MemSQL DB Class, Ankur Goyal

Expression Snippet (MPL)

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

declare outRow3 <- OutRowInit()
OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))
OutRowSend(&outRow3)

Page 114: MemSQL DB Class, Ankur Goyal

MBC Snippet

OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))

0x0048 OutRowInit local=&outRow
0x0050 InitString local=&local_2 data=0 i64=3 coll=unspecified
0x0068 UpdateCollation local=&local_2 coll=utf8_general_ci
0x0074 InitString local=&local_3 data=1 i64=3 coll=unspecified
0x008c UpdateCollation local=&local_3 coll=utf8_general_ci
0x0098 Concat local=&local local=&local_2 local=&local_3
0x00a8 OutRowString local=&outRow local=&local target=0x01ac
0x00b8 OptStringFree local=&local
0x00c0 OptStringFree local=&local_3
0x00c8 OptStringFree local=&local_2
0x00d0 InitString local=&local_5 data=2 i64=3 coll=unspecified
0x00e8 UpdateCollation local=&local_5 coll=utf8_general_ci
0x00f4 InitString local=&local_6 data=3 i64=3 coll=unspecified
0x010c UpdateCollation local=&local_6 coll=utf8_general_ci
0x0118 Concat local=&local_4 local=&local_5 local=&local_6
0x0128 OutRowString local=&outRow local=&local_4 target=0x018c
0x0138 OptStringFree local=&local_4
0x0140 OptStringFree local=&local_6
0x0148 OptStringFree local=&local_5

Page 116: MemSQL DB Class, Ankur Goyal

Distributed Query Execution

Page 118: MemSQL DB Class, Ankur Goyal

First, some terminology

Page 122: MemSQL DB Class, Ankur Goyal

Much easier to reason in terms of shipping SQL

Page 125: MemSQL DB Class, Ankur Goyal

SELECT supp_nation,
       cust_nation,
       l_year,
       Sum(volume) AS revenue
FROM   (SELECT n1.n_name AS supp_nation,
               n2.n_name AS cust_nation,
               Extract(year FROM l_shipdate) AS l_year,
               l_extendedprice * ( 1 - l_discount ) AS volume
        FROM   supplier,
               lineitem,
               orders,
               customer,
               nation n1,
               nation n2
        WHERE  s_suppkey = l_suppkey
               AND o_orderkey = l_orderkey
               AND c_custkey = o_custkey
               AND s_nationkey = n1.n_nationkey
               AND c_nationkey = n2.n_nationkey
               AND ( ( n1.n_name = 'CANADA'
                       AND n2.n_name = 'UNITED STATES' )
                      OR ( n1.n_name = 'RUSSIA'
                           AND n2.n_name = 'UNITED STATES' ) )
               AND l_shipdate BETWEEN Date('1995-01-01') AND Date('1996-12-31'))
       AS shipping
GROUP  BY supp_nation,
          cust_nation,
          l_year
ORDER  BY supp_nation,
          cust_nation,
          l_year;

Page 128: MemSQL DB Class, Ankur Goyal

Abstractions

• Distributed Query Plan created on aggregator

• Layers of primitive operations glued together

• Full SQL on leaves

• REMOTE tables

• RESULT tables

Page 132: MemSQL DB Class, Ankur Goyal

Primitives (SQL)

• Queries over physical indexes

• Hook into global transactional state

• Full SQL on a single partition

• Access to rowstores and columnstores

Page 133: MemSQL DB Class, Ankur Goyal

Primitives (SQL)

Example query the aggregator can send to the leaf:

SELECT t.a, t.b, SUM(t.price)
FROM t             -- This will scan a physical table on the leaf
WHERE t.c = 1000   -- This will use a local index
GROUP BY t.a, t.b  -- This will produce 1 row per group
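Because each leaf returns one row per group, the aggregator still has to combine the partial sums coming from different leaves. A minimal sketch of that two-step flow is below. It assumes rows are plain dicts with the columns `a`, `b`, `c`, and `price` from the query above; the function names and the in-process "leaves" are illustrative only.

```python
from collections import defaultdict


def leaf_query(partition):
    """What one leaf computes: SUM(price) per (a, b) for rows with c = 1000."""
    groups = defaultdict(int)
    for row in partition:
        if row["c"] == 1000:
            groups[(row["a"], row["b"])] += row["price"]
    return dict(groups)


def aggregator_merge(leaf_results):
    """Combine per-leaf partial sums into one final row per group."""
    merged = defaultdict(int)
    for result in leaf_results:
        for key, partial_sum in result.items():
            merged[key] += partial_sum
    return dict(merged)


if __name__ == "__main__":
    leaf_1 = [
        {"a": 1, "b": "x", "c": 1000, "price": 10},
        {"a": 1, "b": "x", "c": 1000, "price": 5},
        {"a": 2, "b": "y", "c": 7, "price": 99},
    ]
    leaf_2 = [
        {"a": 1, "b": "x", "c": 1000, "price": 1},
        {"a": 2, "b": "y", "c": 1000, "price": 4},
    ]
    print(aggregator_merge([leaf_query(leaf_1), leaf_query(leaf_2)]))
```

SUM merges by simple addition; an aggregate such as AVG would need each leaf to return both a sum and a count.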

Page 136: MemSQL DB Class, Ankur Goyal

Primitives (Remote Tables)

• Address data across leaves

• SQL interface + custom shard key

• Parallel execution primitives

• Reshuffling

• Merging on group keys

• Merging data from joins (e.g. left joins)

Page 137: MemSQL DB Class, Ankur Goyal

Primitives (Remote Tables)

SELECT t.a, SUM(s_net.c)
FROM
  -- The row in s where s_net.b = t.a may not
  -- be on the same node as the local t. REMOTE(s)
  -- addresses the table across the cluster.
  t, REMOTE(s) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a

Page 138: MemSQL DB Class, Ankur Goyal

Primitives (Remote Tables)

SELECT t.a, SUM(s_net.c)
FROM
  -- This is a reshuffle operation. It relies on t
  -- being sharded on (t.a) and type(t.a) == type(s.b).
  -- It will only pull rows in s.b that match the
  -- shard key's local values of (t.a).
  t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a
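The reshuffle described in the comment can be modelled in a few lines: rows of `s` are redistributed by the same shard function that places `t`, after which every partition can join purely locally. This is a sketch under assumptions, not MemSQL's execution engine: the modulo shard function, the three partitions, and the function names are all invented for the example, and everything runs in one process.

```python
NUM_PARTITIONS = 3


def partition_of(value):
    # Hypothetical shard function; integer keys keep the example deterministic.
    return value % NUM_PARTITIONS


def reshuffle(s_partitions, key):
    """Send each row of s to the partition that owns the matching t.a values."""
    out = [[] for _ in range(NUM_PARTITIONS)]
    for partition in s_partitions:
        for row in partition:
            out[partition_of(row[key])].append(row)
    return out


def local_join_sum(t_rows, s_rows):
    """SELECT t.a, SUM(s.c) FROM t, s WHERE t.a = s.b GROUP BY t.a, on one node."""
    sums = {}
    for t in t_rows:
        for s in s_rows:
            if t["a"] == s["b"]:
                sums[t["a"]] = sums.get(t["a"], 0) + s["c"]
    return sums


# t is sharded on a; s starts out distributed without regard to b.
t_rows = [{"a": a} for a in range(6)]
t_parts = [[r for r in t_rows if partition_of(r["a"]) == p] for p in range(NUM_PARTITIONS)]
s_rows = [{"b": 0, "c": 5}, {"b": 4, "c": 1}, {"b": 4, "c": 2}, {"b": 5, "c": 7}, {"b": 9, "c": 3}]
s_parts = [s_rows[0::3], s_rows[1::3], s_rows[2::3]]

s_shuffled = reshuffle(s_parts, "b")
result = {}
for t_part, s_part in zip(t_parts, s_shuffled):
    # Groups are disjoint across partitions because t is sharded on a.
    result.update(local_join_sum(t_part, s_part))
print(result)
```

The per-partition results can simply be concatenated because the group key is also the shard key, so no group spans two partitions.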

Page 143: MemSQL DB Class, Ankur Goyal

Primitives (Result Tables)

• Shared, cached results of SQL queries

• Shares scans/computations across readers

• Supports streaming semantics

• Technically an optimization

• Similar to an RDD in Spark

Page 144: MemSQL DB Class, Ankur Goyal

Primitives (Result Tables)

CREATE RESULT TABLE t_reshuffled AS
SELECT t.a, t.b, SUM(t.price)
FROM t
GROUP BY t.a, t.b
SHARD BY t.a, t.b

Page 150: MemSQL DB Class, Ankur Goyal

Optimizations

• Single-machine optimizations

• Index selection, Sorting/Grouping

• SQL -> SQL rewrites

• Cost-based distributed optimizer

• Broadcast vs. Reshuffling

• and many, many more

Page 151: MemSQL DB Class, Ankur Goyal

MemSQL in the Wild

Page 154: MemSQL DB Class, Ankur Goyal

Horizontals and Verticals

• Real-time data processing is everywhere

• Top use-cases: Real-Time Analytics and Large-Scale Applications

• Top verticals: Financial Services, Webscale, Telco, Federal, Media

Page 157: MemSQL DB Class, Ankur Goyal

Real-time Analytics

• High volumes of data, processed in real-time

• Fast updates in the rowstore

• INSERT ... ON DUPLICATE KEY UPDATE

• E.g. 2M update transactions/sec on 10 nodes

• Fast appends, even one row at a time, in the columnstore

• E.g. 1 GB/s on 16 EC2 nodes

Page 162: MemSQL DB Class, Ankur Goyal

Real-time Analytics

• Converging with mainline analytics

• No compromises, e.g. limited SQL, limited windows

• Real-time means fast reads as well

• Subsecond queries for dashboards

• Millisecond queries for applications

Page 167: MemSQL DB Class, Ankur Goyal

Large-Scale Applications

• Large-scale operational analytics and applications

• Hundreds of nodes for perf and HA

• True "production" workloads

• Existing OLTP databases lack scalability and SQL perf

• Existing OLAP databases lack operational features

Page 168: MemSQL DB Class, Ankur Goyal

Logos

Page 173: MemSQL DB Class, Ankur Goyal

Take-Aways

• In-memory Database != All-memory Database

• In-memory Databases are databases built to modern tradeoffs

• Old problems with new solutions

• Real-time analytics and Large-scale applications == New projects

• We are hiring and ❤ Waterloo.

• Come visit us in SF: email [email protected]

Page 174: MemSQL DB Class, Ankur Goyal

Questions