MemSQL DB Class, Ankur Goyal
15-415/615
Ankur Goyal, 3/17/2016
Based on a lecture given at Carnegie Mellon University.
(c) Ankur Goyal
Questions We Will Answer
• What is an in-memory database?
• Why do they matter?
• How do you build one?
• How do people use MemSQL?
Topics
• In-Memory Databases
• In-Memory Architecture
• MemSQL in the Wild
• Q/A
Ankur Goyal
• CMU SCS (2008-2011), PDL (2010-2011)
• Microsoft (2010)
• VP of Engineering @ MemSQL (2011-)
• I ❤ databases
Live Demo
What is an in-memory database?
In-Memory Databases...
• Use memory instead of disk
• Do not (need to) save data on disk
• Put the whole dataset in memory
Well, sometimes...
Wikipedia says...
In-memory databases primarily rely on main-memory for storage.
In-Memory Databases
• Are durable to disk (and respect ACID)
• Can spill on disk or pin data in-memory (and take advantage of it)
• Tradeoffs are suited to systems with lots of memory
• Tend to be distributed systems
• Have a different set of bottlenecks
Bold Claim
All database workloads will be running on in-memory databases
Why?
• Memory is getting cheaper (about 40% every year)
• Cache is the new RAM (RAM is the new disk, disk is the new tape, etc)
• In-memory databases leverage SSD (no random writes)
• NVRAM is coming (and could be cheaper than SSD)
In-memory databases are tuned to modern hardware and modern workloads
In-Memory Architecture
Architecture Topics
• In-Memory Storage
• Transactions and Concurrency Control
• Crash Recovery and Replication
• Code Generation
• Distributed Execution
In-Memory Storage Motivation
• Insanely fast random reads & writes
• Atomic writes as granular as a byte
• Working space is precious (RAM)
• Very different for rowstores and columnstores
In-Memory Rowstore
• Rowstores have lots of random reads/writes
• Datasets are usually small (< 10 TB)
Solution: keep the whole dataset in memory
• Use memory-optimized data structures (skip list)
What is a Skip List
• Invented in 1989 by William Pugh
• Expected O(log(n)) lookup, insert, delete
• No pages
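To make the bullets above concrete, here is a minimal single-threaded skip list sketch in Python. It is not MemSQL's implementation (which the slides describe as lock-free); the class names, the maximum height of 16, and the coin-flip probability of 0.5 are assumptions chosen for illustration.

```python
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        # One forward pointer per level: the node's "tower".
        self.next = [None] * height


class SkipList:
    MAX_HEIGHT = 16  # assumed cap, enough for a small demo

    def __init__(self, seed=0):
        self.head = Node(None, self.MAX_HEIGHT)  # sentinel, holds no key
        self.rng = random.Random(seed)

    def _random_height(self):
        # Each extra level is kept with probability 1/2, which is what
        # gives the expected O(log n) search cost.
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        # For every level, find the last node whose key is < key,
        # walking from the top level down.
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        return preds

    def insert(self, key):
        preds = self._predecessors(key)
        node = Node(key, self._random_height())
        for level in range(len(node.next)):
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def contains(self, key):
        candidate = self._predecessors(key)[0].next[0]
        return candidate is not None and candidate.key == key

    def delete(self, key):
        preds = self._predecessors(key)
        target = preds[0].next[0]
        if target is None or target.key != key:
            return False
        for level in range(len(target.next)):
            preds[level].next[level] = target.next[level]
        return True

    def __iter__(self):
        # Level 0 is an ordinary sorted linked list, so a scan is a walk.
        node = self.head.next[0]
        while node is not None:
            yield node.key
            node = node.next[0]
```

Note that nothing here is organized into pages: every node is an individually allocated object reached through pointers, which is the property the "No pages" bullet refers to.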
Common Concerns
• Memory overhead
Skip List Struct Layout
struct Table_Row {
    int col_a;
    char* col_b;
    …
    Tower* idx_1_ptrs;
    Tower* idx_2_ptrs;
};
Common Concerns
• Memory overhead
• Scan performance
Inefficient Skip List
Efficient Skip List
Common Concerns
• Memory overhead
• Scan performance
• Reverse Iteration (HW Assignment)
Concurrency Control
• No pages => No latches
• Skip list in MemSQL is lock-free
• Every node is a lock-free linked list
• Row locks are implemented with futexes (4 bytes)
• Read-committed and snapshot isolation
In-Memory Columnstore
Columnstore Review
• Big sequential scans and writes
• Huge immutable vectors of data
Solution: Cache dataset in memory
How do columnstores benefit from in-memory?
Have a lock-free skip list handy?
• Keep metadata in-memory
• Use sidecar rowstore for fast small-batch writes
Columnstore LSM
• Log-Structured Merge of sorted runs
• Tunable tradeoffs for read/write amplification
• Enables fast writes to a sorted columnstore
• Smallest sorted run is a skip list
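The LSM idea above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MemSQL's columnstore: a plain dict stands in for the in-memory skip list, sorted lists stand in for on-disk sorted runs, and the names `TinyLSM`, `flush_threshold`, and `max_runs` are hypothetical.

```python
import bisect
import heapq


class TinyLSM:
    def __init__(self, flush_threshold=4, max_runs=3):
        self.memtable = {}  # stand-in for the smallest run (a skip list)
        self.runs = []      # immutable sorted runs of (key, value), newest first
        self.flush_threshold = flush_threshold
        # Fewer allowed runs means cheaper reads but more rewriting on
        # merge; more allowed runs means the opposite.
        self.max_runs = max_runs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            # Written once, sequentially, and never modified afterwards.
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}
        if len(self.runs) > self.max_runs:
            self.compact()

    def compact(self):
        # Tag each entry with its run's age so that, for equal keys,
        # the entry from the newest run sorts first in the merge.
        tagged = [[(k, age, v) for k, v in run]
                  for age, run in enumerate(self.runs)]
        merged = []
        for k, _age, v in heapq.merge(*tagged):
            if not merged or merged[-1][0] != k:
                merged.append((k, v))
        self.runs = [merged]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

In this sketch `max_runs` plays the role of the tunable tradeoff: a read may have to probe every run, while each compaction rewrites all the data it merges.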
Crash Recovery
Durability in an In-Memory System?
• Memory is not a reliable medium (yet)
• There is always a hierarchy
• E.g. EBS -> S3 -> Glacier
• To operate at in-memory speed, all disk I/O must be sequential
Durability in the Rowstore
• Indexes are not materialized on disk
• Reconstruct indexes on the fly during recovery
• Only need to log PK data
• Take full database snapshots periodically
• Tunable to be sync/async
Durability in the Columnstore
• Metadata uses ordinary rowstore mechanism
• Segments are huge (several KB or even MB)
• Read/written sequentially
• Columnstore segments synchronously written to disk
• Memory-speed writes go to sidecar rowstore
Crash Recovery
• Replay latest snapshot, and then every log file since
• No partially written state on disk, so no undos
• Columnstore just replays metadata
• Replication == Continuous replay over the network
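A rough sketch of the recovery procedure described above, in Python. The record shapes are assumptions made for illustration (a snapshot is a dict with an `lsn` and primary-key `rows`; a log record is an `(lsn, op, pk, value)` tuple), and the log is assumed to contain only committed changes, which is why there is no undo pass.

```python
def recover(snapshot, log_records):
    """Rebuild in-memory state from the latest snapshot plus the log."""
    # 1. Load the latest full snapshot of primary-key data.
    rows = dict(snapshot["rows"])

    # 2. Redo every log record written after the snapshot was taken.
    for lsn, op, pk, value in log_records:
        if lsn <= snapshot["lsn"]:
            continue  # already reflected in the snapshot
        if op == "put":
            rows[pk] = value
        elif op == "delete":
            rows.pop(pk, None)

    # 3. Secondary indexes were never written to disk, so rebuild them
    #    from the recovered rows (here: a simple value -> set of PKs map).
    index_by_value = {}
    for pk, value in rows.items():
        index_by_value.setdefault(value, set()).add(pk)

    return rows, index_by_value
```

Under the same assumptions, a replica could call the redo step continuously on records streamed from the master, which is one way to read the "Replication == Continuous replay over the network" bullet.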
Code Generation
import time

class Row(object):
    def __init__(self, a):
        self.a = a

# One million rows with a single integer column.
t = [Row(x) for x in range(1000000)]

class State(object):
    def __init__(self):
        self.agg_sum = 0

# Per-row work: the equivalent of SUM(a + 1).
def loop(state, row):
    state.agg_sum += row.a + 1

def query():
    state = State()
    for r in t:
        loop(state, r)
    return state

if __name__ == '__main__':
    start = time.time()
    state = query()
    end = time.time()
    print("Answer: %d, Time (s): %g" % (state.agg_sum, (end - start)))
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <vector>

struct Row
{
    Row(int a_arg) : a(a_arg) { }
    int a;
};

struct State
{
    State() : agg_sum(0) { }
    int64_t agg_sum;
};

// Per-row work: the equivalent of SUM(a + 1).
inline void loop(State& state, const Row& row)
{
    state.agg_sum += row.a + 1;
}

inline State query(std::vector<Row>& rows)
{
    State s;
    for (Row& r : rows)
    {
        loop(s, r);
    }
    return s;
}

int main(void)
{
    std::vector<Row> rows;
    for (int i = 0; i < 1000000; i++)
    {
        rows.emplace_back(i);
    }
    clock_t start = clock();
    State state = query(rows);
    clock_t end = clock();
    printf("Answer: %lld, Time (s): %g\n",
           (long long) state.agg_sum, (end - start) * 1.0 / CLOCKS_PER_SEC);
}
Comparison
$ python test.py
Answer: 500000500000, Time (s): 0.251049
$ time g++ test.cpp -o test-cpp -std=c++0x
real 0m0.176s
user 0m0.150s
sys 0m0.023s
$ ./test-cpp
Answer: 500000500000, Time (s): 0.006745
37x difference in execution
1.37x even with compilation time
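The two ratios on this slide follow directly from the timings shown. A short check, assuming the compilation cost is the `real` time reported for `g++`:

```python
python_run = 0.251049   # seconds, from "python test.py"
cpp_run = 0.006745      # seconds, from "./test-cpp"
cpp_compile = 0.176     # seconds, "real" time of the g++ invocation

# Execution only: interpreted loop vs. compiled loop.
speedup_run = python_run / cpp_run

# Compile-then-run vs. interpreting once.
speedup_with_compile = python_run / (cpp_compile + cpp_run)

print("execution only:   %.2fx" % speedup_run)
print("with compilation: %.2fx" % speedup_with_compile)
```

The second ratio is the relevant one for a query that is compiled and then run exactly once; a query that is compiled once and run many times approaches the first.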
Code Generation
• Expression execution
• Inline scans
• Need a powerful plan cache
• OLTP vs. data exploration
Plancache Example (1)
SELECT * FROM users WHERE id = 5
SELECT * FROM users WHERE id = 8
=>
SELECT * FROM users WHERE id = @
Plancache Example (2)
SELECT * FROM users WHERE id IN (1,2,3,4,5) OR a IN (3,5,7)
SELECT * FROM users WHERE id IN (20) OR a IN (1,2,3,4)
=>
SELECT * FROM users WHERE id IN (@) OR a IN (@)
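One way to picture the two plancache examples is as a text normalization step that runs before the cache lookup. The sketch below uses regular expressions for brevity; a real system would more plausibly parameterize the parsed query, and the names `parameterize`, `get_plan`, and `plan_cache` are hypothetical.

```python
import re

# A literal is an integer or a quoted string (assumed subset of SQL).
_LIT = r"""(?:\d+|'[^']*'|"[^"]*")"""
_IN_LIST = re.compile(
    r"\bIN\s*\(\s*" + _LIT + r"(?:\s*,\s*" + _LIT + r")*\s*\)",
    re.IGNORECASE,
)
_LITERAL = re.compile(r"\b\d+\b|'[^']*'|" + r'"[^"]*"')


def parameterize(sql):
    """Replace constants with '@' so similar queries share one cache key."""
    # Collapse whole IN-lists first so that lists of different lengths
    # map to the same key, as in Plancache Example (2).
    sql = _IN_LIST.sub("IN (@)", sql)
    return _LITERAL.sub("@", sql)


plan_cache = {}


def get_plan(sql, compile_fn):
    """Compile only on a cache miss; compile_fn stands in for codegen."""
    key = parameterize(sql)
    if key not in plan_cache:
        plan_cache[key] = compile_fn(key)
    return plan_cache[key]
```

With this scheme the expensive compilation step runs once per query shape rather than once per query text.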
Drill Down Example
SELECT region, SUM(price) FROM sales GROUP BY region
=> SELECT rep, SUM(price) FROM sales WHERE region="northeast" GROUP BY rep;
   => SELECT rep, SUM(price) FROM sales WHERE region=^ GROUP BY rep;
=> SELECT product, SUM(price) FROM sales WHERE region="northwest" GROUP BY product;
   => SELECT product, SUM(price) FROM sales WHERE region=^ GROUP BY product;
No plancache match!
Let's look at some generated code
Expression Snippet
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.00 sec)
Old Code Generation
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)
bool overflow = false;
VarCharTemp result1("foo", 3, threadId);
VarCharTemp result2("bar", 3, threadId);
opt<TemporaryImmutableString> result3;
op_Concat(result3, result1, result2, overflow, threadId);
Code Generation is Hard
• Old compilers adage: Pick 2 of 3
• Fast execution time
• Fast compile time
• Fast development time
• E.g. Assembly, C++, Python
• JIT compilers turned this on its head
MemSQL Compiler Pipeline
Expression Snippet (MPL)
memsql> select concat("foo", "bar");
+----------------------+
| concat("foo", "bar") |
+----------------------+
| foobar               |
+----------------------+
1 row in set (0.81 sec)
declare outRow3 <- OutRowInit()
OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))
OutRowSend(&outRow3)
MBC Snippet
OutRowString(&outRow3, &Concat(UpdateCollation(OptString("foo"),2), UpdateCollation(OptString("bar"),2)))
0x0048 OutRowInit local=&outRow
0x0050 InitString local=&local_2 data=0 i64=3 coll=unspecified
0x0068 UpdateCollation local=&local_2 coll=utf8_general_ci
0x0074 InitString local=&local_3 data=1 i64=3 coll=unspecified
0x008c UpdateCollation local=&local_3 coll=utf8_general_ci
0x0098 Concat local=&local local=&local_2 local=&local_3
0x00a8 OutRowString local=&outRow local=&local target=0x01ac
0x00b8 OptStringFree local=&local
0x00c0 OptStringFree local=&local_3
0x00c8 OptStringFree local=&local_2
0x00d0 InitString local=&local_5 data=2 i64=3 coll=unspecified
0x00e8 UpdateCollation local=&local_5 coll=utf8_general_ci
0x00f4 InitString local=&local_6 data=3 i64=3 coll=unspecified
0x010c UpdateCollation local=&local_6 coll=utf8_general_ci
0x0118 Concat local=&local_4 local=&local_5 local=&local_6
0x0128 OutRowString local=&outRow local=&local_4 target=0x018c
0x0138 OptStringFree local=&local_4
0x0140 OptStringFree local=&local_6
0x0148 OptStringFree local=&local_5
Distributed Query Execution
First, some terminology
Much easier to reason in terms of shipping SQL
SELECT supp_nation, cust_nation, l_year, Sum(volume) AS revenue
FROM (SELECT n1.n_name AS supp_nation,
             n2.n_name AS cust_nation,
             Extract(year FROM l_shipdate) AS l_year,
             l_extendedprice * ( 1 - l_discount ) AS volume
      FROM supplier, lineitem, orders, customer, nation n1, nation n2
      WHERE s_suppkey = l_suppkey
        AND o_orderkey = l_orderkey
        AND c_custkey = o_custkey
        AND s_nationkey = n1.n_nationkey
        AND c_nationkey = n2.n_nationkey
        AND ( ( n1.n_name = 'CANADA' AND n2.n_name = 'UNITED STATES' )
           OR ( n1.n_name = 'RUSSIA' AND n2.n_name = 'UNITED STATES' ) )
        AND l_shipdate BETWEEN Date('1995-01-01') AND Date('1996-12-31')) AS shipping
GROUP BY supp_nation, cust_nation, l_year
ORDER BY supp_nation, cust_nation, l_year;
Abstractions
• Distributed Query Plan created on aggregator
• Layers of primitive operations glued together
• Full SQL on leaves
• REMOTE tables
• RESULT tables
Primitives (SQL)
• Queries over physical indexes
• Hook into global transactional state
• Full SQL on a single partition
• Access to rowstores and columnstores
Primitives (SQL)
Example query the aggregator can send to the leaf:
SELECT t.a, t.b, SUM(t.price)
FROM t              -- This will scan a physical table on the leaf
WHERE t.c = 1000    -- This will use a local index
GROUP BY t.a, t.b   -- This will produce 1 row per group
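The division of labor implied by this example can be sketched as follows: every leaf runs the query over its own partition and returns one row per group, and the aggregator combines the partial sums. This is an illustrative model only; the function names and the row-as-dict representation are assumptions, not MemSQL internals.

```python
from collections import defaultdict


def leaf_query(rows):
    """What one leaf computes for:
    SELECT a, b, SUM(price) FROM t WHERE c = 1000 GROUP BY a, b
    """
    groups = defaultdict(int)
    for row in rows:
        if row["c"] == 1000:
            groups[(row["a"], row["b"])] += row["price"]
    return dict(groups)


def aggregator_merge(partials):
    """Combine per-leaf results. SUM is decomposable, so partial sums
    for the same group key can simply be added together."""
    total = defaultdict(int)
    for partial in partials:
        for group, partial_sum in partial.items():
            total[group] += partial_sum
    return dict(total)
```

Because each leaf emits at most one row per group, the data crossing the network is proportional to the number of groups rather than the number of rows scanned.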
Primitives (Remote Tables)
• Address data across leaves
• SQL interface + custom shard key
• Parallel execution primitives
• Reshuffling
• Merging on group keys
• Merging data from joins (e.g. left joins)
Primitives (Remote Tables)
SELECT t.a, SUM(s_net.c)
FROM
  -- The row in s where s_net.b = t.a may not
  -- be on the same node as the local t. REMOTE(s)
  -- addresses the table across the cluster.
  t, REMOTE(s) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a
Primitives (Remote Tables)
SELECT t.a, SUM(s_net.c)
FROM
  -- This is a reshuffle operation. It relies on t
  -- being sharded on (t.a) and type(t.a) == type(s.b).
  -- It will only pull rows in s.b that match the
  -- shard key's local values of (t.a).
  t, REMOTE(s) WITH (shard_key=(s.b)) AS s_net
WHERE t.a = s_net.b
GROUP BY t.a
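A small Python model of the reshuffle described in the comment above. It assumes `t` is already partitioned by a hash of `t.a`, repartitions `s` by the same hash of `s.b`, and then joins locally on each leaf. The hash function (`zlib.crc32` over `repr`) and all function names are stand-ins chosen for the sketch.

```python
import zlib
from collections import defaultdict


def shard_of(key, num_leaves):
    # Deterministic stand-in for the cluster's shard function.
    return zlib.crc32(repr(key).encode()) % num_leaves


def reshuffle(s_partitions, num_leaves):
    """Send every row of s to the leaf that owns shard_of(s.b)."""
    out = [[] for _ in range(num_leaves)]
    for partition in s_partitions:
        for row in partition:
            out[shard_of(row["b"], num_leaves)].append(row)
    return out


def local_join_sum(t_rows, s_rows):
    """SELECT t.a, SUM(s.c) FROM t, s WHERE t.a = s.b GROUP BY t.a,
    evaluated on one leaf over co-located rows."""
    c_by_b = defaultdict(int)
    for s in s_rows:
        c_by_b[s["b"]] += s["c"]
    out = defaultdict(int)
    for t in t_rows:
        if t["a"] in c_by_b:
            out[t["a"]] += c_by_b[t["a"]]
    return dict(out)
```

After the reshuffle, any `s` row that can match a local `t` row lives on the same leaf, so each leaf produces final groups and the aggregator only has to concatenate them.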
Primitives (Result Tables)
• Shared, cached results of SQL queries
• Shares scans/computations across readers
• Supports streaming semantics
• Technically an optimization
• Similar to an RDD in Spark
Primitives (Result Tables)
CREATE RESULT TABLE t_reshuffled AS
SELECT t.a, t.b, SUM(t.price)
FROM t
GROUP BY t.a, t.b
SHARD BY t.a, t.b
Optimizations
• Single-machine optimizations
• Index selection, Sorting/Grouping
• SQL -> SQL rewrites
• Cost-based distributed optimizer
• Broadcast vs. Reshuffling
• and many, many more
MemSQL in the Wild
Horizontals and Verticals
• Real-time data processing is everywhere
• Top use-cases: Real-Time Analytics and Large-Scale Applications
• Top verticals: Financial Services, Webscale, Telco, Federal, Media
Real-time Analytics
• High volumes of data, processed in real-time
• Fast updates in the rowstore
• INSERT ... ON DUPLICATE KEY UPDATE
• E.g. 2M update transactions/sec on 10 nodes
• Fast appends, even one row at a time, in the columnstore
• E.g. 1 GB/s on 16 EC2 nodes
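For readers unfamiliar with the statement named above: `INSERT ... ON DUPLICATE KEY UPDATE` inserts a row if its key is absent and otherwise applies an update to the existing row, in a single statement. A minimal Python model of that behavior, using a dict as a stand-in for a table with a unique key (the function name and the page-view example are hypothetical):

```python
def insert_on_duplicate_key_update(table, key, new_row, update):
    """Insert new_row under key, or apply update() to the existing row."""
    if key in table:
        table[key] = update(table[key])
    else:
        table[key] = new_row


# Example: a per-page view counter maintained one event at a time.
page_views = {}
for page in ["/home", "/docs", "/home", "/home"]:
    insert_on_duplicate_key_update(
        page_views,
        page,
        {"views": 1},
        lambda row: {"views": row["views"] + 1},
    )
```

This pattern is what lets a real-time pipeline keep running aggregates current without a separate read-then-write round trip per event.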
Real-time Analytics
• Converging with mainline analytics
• No compromises, e.g. limited SQL, limited windows
• Real-time means fast reads as well
• Subsecond queries for dashboards
• Millisecond queries for applications
Large-Scale Applications
• Large-scale operational analytics and applications
• Hundreds of nodes for perf and HA
• True "production" workloads
• Existing OLTP databases lack scalability and SQL perf
• Existing OLAP databases lack operational features
Logos
Take-Aways
• In-memory Database != All-memory Database
• In-memory Databases are databases built to modern tradeoffs
• Old problems with new solutions
• Real-time analytics and Large-scale applications == New projects
• We are hiring and ❤ Waterloo.
• Come visit us in SF: email [email protected]
Questions