MemSQL DB Class, Ankur Goyal

Posted on 15-Apr-2017

15-415/615¹

Ankur Goyal, 3/17/2016

¹ Based on a lecture given at Carnegie Mellon University.

(c) Ankur Goyal

Questions We Will Answer

• What is an in-memory database?

• Why do they matter?

• How do you build one?

• How do people use MemSQL?

Topics

• In-Memory Databases

• In-Memory Architecture

• MemSQL in the Wild

• Q/A

Ankur Goyal

• CMU SCS (2008-2011), PDL (2010-2011)

• Microsoft (2010)

• VP of Engineering @ MemSQL (2011-)

• I ❤ databases

Live Demo

What is an in-memory database?

In-Memory Databases...

• Use memory instead of disk

• Do not (need to) save data on disk

• Put the whole dataset in memory

Well, sometimes...

Wikipedia says...

In-memory databases primarily rely on main-memory for storage.

In-Memory Databases

• Are durable to disk (and respect ACID)

• Can spill to disk or pin data in memory (and take advantage of it)

• Make tradeoffs suited to systems with lots of memory

• Tend to be distributed systems

• Have a different set of bottlenecks

Bold Claim

All database workloads will be running on in-memory databases

Why?

• Memory is getting cheaper (about 40% every year)

• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc.)

• In-memory databases leverage SSD (no random writes)

• NVRAM is coming (and could be cheaper than SSD)

In-memory databases are tuned to modern hardware and modern workloads

In-Memory Architecture

Architecture Topics

• In-Memory Storage

• Transactions and Concurrency Control

• Crash Recovery and Replication

• Code Generation

• Distributed Execution

In-Memory Storage Motivation

• Insanely fast random reads & writes

• Atomic writes as granular as a byte

• Working space is precious (RAM)

• Very different for rowstores and columnstores

In-Memory Rowstore

• Rowstores have lots of random reads/writes

• Datasets are usually small (< 10 TB)

Solution: keep the whole dataset in memory

• Use memory-optimized data structures (skip list)

What is a Skip List?

• Invented in 1989 by William Pugh

• Expected O(log(n)) lookup, insert, delete

• No pages
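The properties listed above can be seen in a minimal single-threaded skip list. This is a Python sketch, not MemSQL's lock-free C++ implementation. The promotion probability of 1/2 and the 16-level height cap are assumed values. Each node owns a tower of forward pointers instead of living in a page. Level 0 links every node in key order, so a range scan is simply a walk along that level.

```python
import random


class Node:
    """A skip list node: key, value, and a tower of forward pointers."""

    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.next = [None] * height  # next[i] is the successor at level i


class SkipList:
    MAX_HEIGHT = 16  # assumed cap on tower height
    P = 0.5          # assumed probability of promoting a node one more level

    def __init__(self, seed=None):
        self.head = Node(None, None, self.MAX_HEIGHT)  # sentinel, holds no data
        self.height = 1
        self.rng = random.Random(seed)

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and self.rng.random() < self.P:
            h += 1
        return h

    def _find_predecessors(self, key):
        """For every level, return the last node whose key is < key."""
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        return preds

    def get(self, key, default=None):
        candidate = self._find_predecessors(key)[0].next[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return default

    def insert(self, key, value):
        preds = self._find_predecessors(key)
        candidate = preds[0].next[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # key exists: overwrite in place
            return
        h = self._random_height()
        self.height = max(self.height, h)
        node = Node(key, value, h)
        for level in range(h):  # splice the new tower in, level by level
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def delete(self, key):
        preds = self._find_predecessors(key)
        target = preds[0].next[0]
        if target is None or target.key != key:
            return False
        for level in range(len(target.next)):
            if preds[level].next[level] is target:
                preds[level].next[level] = target.next[level]
        return True

    def items(self):
        """Forward scan along level 0, which links every node in key order."""
        node = self.head.next[0]
        while node is not None:
            yield node.key, node.value
            node = node.next[0]
```

For example, after inserting keys 5, 1, 9, 3, `list(sl.items())` yields them in the order 1, 3, 5, 9. This sketch keeps only forward pointers, so iterating in reverse is not free. That is the concern the lecture leaves as a homework assignment.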

Common Concerns

• Memory overhead

Skip List Struct Layout

struct Table_Row
{
    int col_a;
    char* col_b;
    …
    Tower* idx_1_ptrs;
    Tower* idx_2_ptrs;
};
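The memory overhead of embedding one tower per index in every row can be estimated with the standard skip-list analysis. Assume each node is promoted one more level with independent probability p; MemSQL's actual value is not given in this lecture. Then a tower's height H is geometric, and the expected number of forward pointers per row, per index, ignoring the height cap, is:

```latex
\mathbb{E}[H] \;=\; \sum_{k \ge 1} \Pr[H \ge k] \;=\; \sum_{k \ge 1} p^{\,k-1} \;=\; \frac{1}{1-p}
```

With p = 1/2 that is 2 pointers (16 bytes with 8-byte pointers) per row per index. With p = 1/4 it drops to 4/3 pointers, at the cost of somewhat longer searches.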

Common Concerns

• Memory overhead

• Scan performance

Inefficient Skip List

Efficient Skip List

Common Concerns

• Memory overhead

• Scan performance

• Reverse Iteration (HW Assignment)

Concurrency Control

Concurrency Control

• No pages => No latches

• Skip list in MemSQL is lock-free

• Every node is a lock-free linked list

• Row locks are implemented with futexes (4 bytes)

• Read-committed and snapshot isolation

In-Memory Columnstore

Columnstore Review

• Big sequential scans and writes

• Huge immutable vectors of data

Solution: Cache dataset in memory

How do columnstores benefit from in-memory?

Have a lock-free skip list handy?

• Keep metadata in memory

• Use a sidecar rowstore for fast small-batch writes

Columnstore LSM

• Log-Structured Merge of sorted runs

• Tunable tradeoffs for read/write amplification

• Enables fast writes to a sorted columnstore

• Smallest sorted run is a skip list
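Here is a toy model of the LSM idea on this slide. A Python dict stands in for the in-memory skip list (the smallest sorted run), and each flushed run is an immutable sorted list of (key, value) pairs. The flush threshold and the "merge everything once there are more than N runs" policy are hypothetical knobs chosen to illustrate the tradeoff. They are not MemSQL's parameters.

```python
import heapq
from bisect import bisect_left


class ToyLSM:
    """Toy log-structured merge store: one mutable memtable plus sorted runs."""

    def __init__(self, memtable_limit=4, max_runs=3):
        self.memtable = {}                    # stands in for the skip list
        self.runs = []                        # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.max_runs = max_runs              # hypothetical merge trigger

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted run (one sequential write on disk).
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.runs) > self.max_runs:
            self._merge_all()

    def _merge_all(self):
        # Tag entries with their run's age (0 = newest) so that, after the
        # k-way merge, the first entry seen for each key is the newest one.
        tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(self.runs)]
        merged = []
        for k, _age, v in heapq.merge(*tagged):
            if not merged or merged[-1][0] != k:
                merged.append((k, v))
        self.runs = [merged]

    def get(self, key, default=None):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run first, so the first hit wins
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return default
```

Raising `max_runs` means fewer merges and therefore less write amplification, but each read may have to probe more runs, which raises read amplification. Lowering it does the opposite.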

Crash Recovery

Durability in an In-Memory System?

• Memory is not a reliable medium (yet)

• There is always a hierarchy

• E.g. EBS -> S3 -> Glacier

• To operate at in-memory speed, all disk I/O must be sequential

Durability in the Rowstore

• Indexes are not materialized on disk

• Reconstruct indexes on the fly during recovery

• Only need to log PK data

• Take full database snapshots periodically

• Tunable to be sync/async
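This sketch shows why logging only primary-key data is enough when indexes exist only in memory. Each log record carries a full row image keyed by its primary key, and after replay the secondary index is rebuilt from the recovered rows. The JSON-lines record format, the `append_log`/`recover` helpers and the index on a `city` column are all hypothetical; they are not MemSQL's on-disk format. Snapshots are left out here.

```python
import json
import os
import tempfile


def append_log(log_file, op, pk, row=None):
    """Append one log record. The log is only ever appended to (sequential I/O)."""
    log_file.write(json.dumps({"op": op, "pk": pk, "row": row}) + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())  # "sync" durability; an async mode would defer this


def recover(log_path):
    """Replay the PK-keyed row log, then rebuild the secondary index from scratch."""
    rows = {}
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "put":
                rows[rec["pk"]] = rec["row"]
            elif rec["op"] == "delete":
                rows.pop(rec["pk"], None)
    # The secondary index (city -> set of PKs) was never logged; derive it now.
    by_city = {}
    for pk, row in rows.items():
        by_city.setdefault(row["city"], set()).add(pk)
    return rows, by_city


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "rowstore.log")
    with open(path, "w") as log:
        append_log(log, "put", 1, {"name": "a", "city": "SF"})
        append_log(log, "put", 2, {"name": "b", "city": "NYC"})
        append_log(log, "put", 3, {"name": "c", "city": "SF"})
        append_log(log, "delete", 2)
    rows, by_city = recover(path)
    print(sorted(rows), by_city)  # [1, 3] {'SF': {1, 3}}
```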

Durability in the Columnstore

• Metadata uses the ordinary rowstore mechanism

• Segments are huge (several KB or even MB)

• Read/written sequentially

• Columnstore segments are synchronously written to disk

• Memory-speed writes go to the sidecar rowstore

Crash Recovery

• Replay the latest snapshot, and then every log file written since

• No partially written state on disk, so no undos

• Columnstore just replays metadata

• Replication == Continuous replay over the network
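This sketch illustrates the recovery rule on this slide: load the most recent snapshot, then redo every change logged after it, in order. Nothing partially written ever reaches disk, so there is no undo pass. The same redo loop, fed from a network stream instead of a file, is what "replication == continuous replay" means. Representing the snapshot as a dict plus a log sequence number (LSN), and log records as (lsn, op, key, value) tuples, are assumptions made for illustration.

```python
def redo(table, record):
    """Redo one logged change. Records are (lsn, op, key, value) tuples."""
    _lsn, op, key, value = record
    if op == "put":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)


def recover(snapshot, snapshot_lsn, log_records):
    """Load the latest snapshot, then redo every record logged after it."""
    table = dict(snapshot)
    for record in log_records:
        if record[0] > snapshot_lsn:  # older records are already in the snapshot
            redo(table, record)
    return table  # redo only: no undo pass is needed


class Replica:
    """Replication as continuous replay: apply records as they arrive."""

    def __init__(self, snapshot, snapshot_lsn):
        self.table = dict(snapshot)
        self.applied_lsn = snapshot_lsn

    def on_record(self, record):
        if record[0] > self.applied_lsn:  # ignore records already applied
            redo(self.table, record)
            self.applied_lsn = record[0]
```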

Code Generation

import time

class Row(object):
    def __init__(self, a):
        self.a = a

t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print("Answer: %d, Time (s): %g" % (state.agg_sum, end - start))

#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows)
    {
        loop(s, r);
    }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }
    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();
    printf("Answer: %lld, Time (s): %g\n", (long long)state.agg_sum,
           (end - start) * 1.0 / CLOCKS_PER_SEC);
    return 0;
}

Comparison

$ python test.py
Answer: 500000500000, Time (s): 0.251049

$ time g++ test.cpp -o test-cpp -std=c++0x
real    0m0.176s
user    0m0.150s
sys     0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745

37x difference in execution

1.37x even with compilation time
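The two ratios on this slide can be checked directly from the printed timings. The 1.37x figure assumes the compile cost is the `real` (wall-clock) time reported by `time g++`.

```python
python_s = 0.251049     # Python run
cpp_exec_s = 0.006745   # compiled C++ run
cpp_compile_s = 0.176   # `real` time of the g++ invocation

exec_speedup = python_s / cpp_exec_s
end_to_end_speedup = python_s / (cpp_exec_s + cpp_compile_s)

print("execution only: %.1fx" % exec_speedup)                  # execution only: 37.2x
print("including compilation: %.2fx" % end_to_end_speedup)    # including compilation: 1.37x
```

Compilation is paid once, while the speedup is collected on every execution. Reusing compiled plans is therefore one reason a plan cache matters.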

Code Generation

• Expression execution

• Inline scans

• Need a powerful plan cache

• OLTP vs. data exploration
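Here is a small Python analogy for expression execution through code generation. The query SELECT SUM(a + 1) ... WHERE a > 10 runs two ways:

- by walking an expression tree for every row (interpretation);
- by generating source for a specialized loop once, compiling it with Python's built-in `compile`/`exec`, and running the result.

The tuple-based expression format and the `interpret`/`to_source`/`generate` helpers are hypothetical. MemSQL generates its own intermediate languages, not Python.

```python
# Expression trees: ("col", name) | ("const", value) | (op, left, right)
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b, ">": lambda a, b: a > b}


def interpret(expr, row):
    """Evaluate the tree node by node, for every row (what codegen avoids)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate the tree into a Python expression string, once."""
    kind = expr[0]
    if kind == "col":
        return "row[%r]" % expr[1]
    if kind == "const":
        return repr(expr[1])
    return "(%s %s %s)" % (to_source(expr[1]), kind, to_source(expr[2]))


def generate(select_expr, where_expr):
    """Generate and compile a specialized SUM(select) ... WHERE where loop."""
    src = (
        "def query(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        "        if %s:\n"
        "            total += %s\n"
        "    return total\n"
    ) % (to_source(where_expr), to_source(select_expr))
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["query"]
```

The function returned by `generate` can be kept and reused for later queries of the same shape; the plan cache is what makes that reuse possible.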

Plancache Example (1)

SELECT * FROM users WHERE id = 5
SELECT * FROM users WHERE id = 8

=>

SELECT * FROM users WHERE id = @

Plancache Example (2)

SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)

=>

SELECT * FROM users WHERE id IN (@) OR a IN (@)
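The two examples above show a parameterization step, sketched roughly here. Numeric and quoted literals become @, and an IN list of any length collapses to IN (@), so queries that differ only in constants share one cache key. The regular expressions and the `PlanCache` class are simplifying assumptions: this works on raw text, negative numbers are not handled, and a real implementation would work on the parsed query.

```python
import re

# Literal patterns: single-quoted string, double-quoted string, or number.
_SQ = r"'[^']*'"
_DQ = r'"[^"]*"'
_NUM = r"\b\d+(?:\.\d+)?\b"
_LIT = "(?:%s|%s|%s)" % (_SQ, _DQ, _NUM)

_LITERAL = re.compile(_LIT)
# IN (...) lists made only of literals collapse to one placeholder, whatever their length.
_IN_LIST = re.compile(r"\bIN\s*\(\s*%s(?:\s*,\s*%s)*\s*\)" % (_LIT, _LIT), re.IGNORECASE)


def parameterize(sql):
    """Map a query to its plan-cache key by replacing constants with @."""
    shape = _IN_LIST.sub("IN (@)", sql)
    return _LITERAL.sub("@", shape)


class PlanCache:
    """Compile each distinct query shape once, then reuse the plan."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # hypothetical: turns a shape into a plan
        self.plans = {}
        self.compilations = 0

    def lookup(self, sql):
        key = parameterize(sql)
        if key not in self.plans:
            self.compilations += 1
            self.plans[key] = self.compile_fn(key)
        return self.plans[key]
```

Only constants are parameterized here. As a result, the drill-down queries on the next slide still produce different keys for `GROUP BY rep` and `GROUP BY product`.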

Drill Down Example

SELECT region, SUM(price) FROM sales GROUP BY region;
=> SELECT rep, SUM(price) FROM sales WHERE region="northeast" GROUP BY rep;
=> SELECT rep, SUM(price) FROM sales WHERE region=^ GROUP BY rep;

=> SELECT product, SUM(price) FROM sales WHERE region="northwest" GROUP BY product;
=> SELECT product, SUM(price) FROM sales WHERE region=^ GROUP BY product;

No plancache match!

Let's look at some generated code

Expression Snippet

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)

memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.00 sec)

Old Code Generation

memsql> select concat("foo", "bar");

bool overflow = false;
VarCharTemp result1("foo", 3, threadId);
VarCharTemp result2("bar", 3, threadId);
opt<TemporaryImmutableString> result3;
op_Concat(result3, result1, result2, overflow, threadId);

Code Generation is Hard

• Old compiler adage: pick 2 of 3

• Fast execution time

• Fast compile time

• Fast development time

• E.g. Assembly, C++, Python

• JIT compilers turned this on its head

MemSQL Compiler Pipeline

Expression Snippet (MPL)

memsql> select concat("foo", "bar");

declare outRow3 <- OutRowInit()
OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))
OutRowSend(&outRow3)

MBC Snippet

OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))

0x0048 OutRowInit local=&outRow
0x0050 InitString local=&local_2 data=0 i64=3 coll=unspecified
0x0068 UpdateCollation local=&local_2 coll=utf8_general_ci
0x0074 InitString local=&local_3 data=1 i64=3 coll=unspecified
0x008c UpdateCollation local=&local_3 coll=utf8_general_ci
0x0098 Concat local=&local local=&local_2 local=&local_3
0x00a8 OutRowString local=&outRow local=&local target=0x01ac
0x00b8 OptStringFree local=&local
0x00c0 OptStringFree local=&local_3
0x00c8 OptStringFree local=&local_2
0x00d0 InitString local=&local_5 data=2 i64=3 coll=unspecified
0x00e8 UpdateCollation local=&local_5 coll=utf8_general_ci
0x00f4 InitString local=&local_6 data=3 i64=3 coll=unspecified
0x010c UpdateCollation local=&local_6 coll=utf8_general_ci
0x0118 Concat local=&local_4 local=&local_5 local=&local_6
0x0128 OutRowString local=&outRow local=&local_4 target=0x018c
0x0138 OptStringFree local=&local_4
0x0140 OptStringFree local=&local_6
0x0148 OptStringFree local=&local_5

Distributed Query Execution

First, some terminology

Much easier to reason in terms of shipping SQL

SELECT supp_nation,
       cust_nation,
       l_year,
       Sum(volume) AS revenue
FROM   (SELECT n1.n_name AS supp_nation,
               n2.n_name AS cust_nation,
               Extract(year FROM l_shipdate) AS l_year,
               l_extendedprice * ( 1 - l_discount ) AS volume
        FROM   supplier,
               lineitem,
               orders,
               customer,
               nation n1,
               nation n2
        WHERE  s_suppkey = l_suppkey
               AND o_orderkey = l_orderkey
               AND c_custkey = o_custkey
               AND s_nationkey = n1.n_nationkey
               AND c_nationkey = n2.n_nationkey
               AND ( ( n1.n_name = 'CANADA'
                       AND n2.n_name = 'UNITED STATES' )
                      OR ( n1.n_name = 'RUSSIA'
                           AND n2.n_name = 'UNITED STATES' ) )
               AND l_shipdate BETWEEN Date('1995-01-01') AND Date('1996-12-31')) AS shipping
GROUP  BY supp_nation,
          cust_nation,
          l_year
ORDER  BY supp_nation,
          cust_nation,
          l_year;

Abstractions

• Distributed Query Plan created on the aggregator

• Layers of primitive operations glued together

• Full SQL on leaves

• REMOTE tables

• RESULT tables

Primitives (SQL)

• Queries over physical indexes

• Hook into global transactional state

• Full SQL on a single partition

• Access to rowstores and columnstores

Primitives (SQL)

Example query the aggregator can send to the leaf:

SELECT t.a, t.b, SUM(t.price)
FROM t             -- This will scan a physical table on the leaf
WHERE t.c = 1000   -- This will use a local index
GROUP BY t.a, t.b  -- This will produce 1 row per group
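The leaf query above returns one partial row per group from each leaf. The same (a, b) group can appear on several leaves, so the aggregator still has to merge those partials. Below is a minimal sketch of that two-phase SUM; each leaf's partition is modeled as a Python list of dicts, which is an assumption for illustration.

```python
from collections import defaultdict


def leaf_query(partition):
    """On one leaf: SELECT a, b, SUM(price) FROM t WHERE c = 1000 GROUP BY a, b."""
    partial = defaultdict(int)
    for row in partition:
        if row["c"] == 1000:
            partial[(row["a"], row["b"])] += row["price"]
    return dict(partial)


def aggregator_merge(partials):
    """On the aggregator: a SUM of the leaves' SUMs, merged on the group key."""
    total = defaultdict(int)
    for partial in partials:
        for group, s in partial.items():
            total[group] += s
    return dict(total)
```

SUM merges by simply adding partial sums. An aggregate like AVG would instead need each leaf to ship SUM and COUNT separately.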

Primitives (Remote Tables)

• Address data across leaves

• SQL interface + custom shard key

• Parallel execution primitives

• Reshuffling

• Merging on group keys

• Merging data from joins (e.g. left joins)

Primitives (Remote Tables)

SELECT t.a, SUM(s_net.c)
FROM
    -- The row in s where s_net.b = t.a may not
    -- be on the same node as the local t. REMOTE(s)
    -- addresses the table across the cluster.
    t, REMOTE(s) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a

Primitives (Remote Tables)

SELECT t.a, SUM(s_net.c)
FROM
    -- This is a reshuffle operation. It relies on t
    -- being sharded on (t.a) and type(t.a) == type(s.b).
    -- It will only pull the rows of s whose s.b matches
    -- the local shard-key values of (t.a).
    t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a
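Here is what the reshuffle above does, in miniature. Every node routes each of its local s rows to the node that owns that row's s.b value under t's shard key. Afterwards, each node joins its local t rows with only the s rows it received. The shard function (Python's built-in `hash` modulo the node count) and the list-of-dicts data layout are assumptions for illustration. This lecture does not describe MemSQL's actual shard-key hashing.

```python
from collections import Counter, defaultdict


def owner(key, num_nodes):
    """Hypothetical shard function, shared by t's shard key and the reshuffle."""
    return hash(key) % num_nodes


def reshuffle(s_partitions, num_nodes, key_col):
    """Send every row of s to the node that owns its key_col value."""
    received = [[] for _ in range(num_nodes)]
    for partition in s_partitions:  # each node's local slice of s
        for row in partition:
            received[owner(row[key_col], num_nodes)].append(row)
    return received


def local_join_sum(t_rows, s_rows):
    """SELECT t.a, SUM(s.c) FROM t, s WHERE t.a = s.b GROUP BY t.a, on one node."""
    t_count = Counter(row["a"] for row in t_rows)  # join multiplicity per key
    sums = defaultdict(int)
    for row in s_rows:
        n = t_count.get(row["b"], 0)
        if n:
            sums[row["b"]] += row["c"] * n
    return dict(sums)
```

Because t is already sharded on t.a, only s moves across the network. Each node's join output is complete for the keys it owns, so the per-node results can simply be concatenated.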

Primitives (Result Tables)

• Shared, cached results of SQL queries

• Shares scans/computations across readers

• Supports streaming semantics

• Technically an optimization

• Similar to an RDD in Spark

Primitives (Result Tables)

CREATE RESULT TABLE t_reshuffled AS
SELECT t.a, t.b, SUM(t.price)
FROM t
GROUP BY t.a, t.b
SHARD BY t.a, t.b

Optimizations

• Single-machine optimizations

• Index selection, Sorting/Grouping

• SQL -> SQL rewrites

• Cost-based distributed optimizer

• Broadcast vs. Reshuffling

• and many, many more

MemSQL in the Wild

Horizontals and Verticals

• Real-time data processing is everywhere

• Top use cases: Real-Time Analytics and Large-Scale Applications

• Top verticals: Financial Services, Webscale, Telco, Federal, Media

Real-time Analytics

• High volumes of data, processed in real time

• Fast updates in the rowstore

• INSERT ... ON DUPLICATE KEY UPDATE

• E.g. 2M update transactions/sec on 10 nodes

• Fast appends, even one row at a time, in the columnstore

• E.g. 1 GB/s on 16 EC2 nodes
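The INSERT ... ON DUPLICATE KEY UPDATE pattern on this slide makes per-key counters cheap: one statement either creates the row or updates it in place. Below is a Python model of those semantics, using a dict keyed by primary key. It models a hypothetical table `page_views(page PRIMARY KEY, views)` and the statement `INSERT INTO page_views (page, views) VALUES (?, 1) ON DUPLICATE KEY UPDATE views = views + 1`.

```python
def upsert_increment(table, key, delta=1):
    """Model of INSERT ... VALUES (key, delta)
    ON DUPLICATE KEY UPDATE views = views + delta, on a dict keyed by PK."""
    if key in table:
        table[key] += delta   # duplicate key: update the existing row
    else:
        table[key] = delta    # new key: insert a row
    return table[key]


page_views = {}
for page in ["/home", "/pricing", "/home", "/home"]:
    upsert_increment(page_views, page)
print(page_views)  # {'/home': 3, '/pricing': 1}
```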

Real-time Analytics

• Converging with mainline analytics

• No compromises such as limited SQL or limited windows

• Real-time means fast reads as well

• Subsecond queries for dashboards

• Millisecond queries for applications

Large-Scale Applications

• Large-scale operational analytics and applications

• Hundreds of nodes for performance and HA

• True "production" workloads

• Existing OLTP databases lack scalability and SQL performance

• Existing OLAP databases lack operational features

Logos

Take-Aways

• In-memory database != all-memory database

• In-memory databases are databases built for modern tradeoffs

• Old problems with new solutions

• Real-time analytics and large-scale applications == new projects

• We are hiring and ❤ Waterloo.

• Come visit us in SF: email ankur@memsql.com

Questions
