class 5 column stores 2 - harvard university
column stores 2.0
prof. Stratos Idreos
HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/
class 5
CS165, Fall 2017 Stratos Idreos /282
what just happened?
where is my data?
email, cloud, social media, …
can we design systems that let us know what is going on?
worth thinking about…
cool papers 2.0
The Case for RodentStore: An Adaptive, Declarative Storage System
Philippe Cudré-Mauroux, Eugene Wu, Samuel Madden
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2009
Abstraction Without Regret in Database Systems Building: a Manifesto
Christoph Koch
IEEE Data Eng. Bull. 37(1): 70-79 (2014)
dbTouch: Analytics at your Fingertips
Stratos Idreos and Erietta Liarou
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2013
design doc
think, design, create
1-2 page PDF doc and ask for feedback
mandatory M1-M3, optional afterwards
submit through Canvas
do not worry about perfection: fail fast
wrong ideas ok if you eventually find out they are wrong :)
(holds for midterms as well)
Jim Gray, IBM, Tandem, DEC, Microsoft
ACM Turing award
ACM SIGMOD Edgar F. Codd Innovations Award
disk: 100Kx, Pluto, 2 years
memory: 100x, New York, 1.5 hours
on board cache: 10x, this building, 10 min
on chip cache: 2x, this room, 1 min
registers: my head, ~0
the way we store data defines the possible (efficient) access methods
free_offset, N, offset1-length1, offset2-length2, …
free space
slotted page
scan null
update var length
…
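The slotted page header above can be sketched in C as follows. This is a minimal sketch under assumptions, not the layout of any particular system: the 4096-byte page size, the 2-byte fields, the choice to grow records from the end of the page, and the names page_init, page_insert and page_get are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096u  /* assumed page size */
#define HEADER_SIZE 4u     /* free_offset and N, 2 bytes each */
#define SLOT_SIZE   4u     /* offset and length, 2 bytes each */

/* Page layout: [free_offset][N][slot 0][slot 1]... free space ...[records] */
typedef struct { unsigned char bytes[PAGE_SIZE]; } page_t;

static uint16_t get_u16(const page_t *p, size_t at) {
    uint16_t v;
    memcpy(&v, p->bytes + at, sizeof v);
    return v;
}

static void put_u16(page_t *p, size_t at, uint16_t v) {
    memcpy(p->bytes + at, &v, sizeof v);
}

/* Empty page: no slots, records will be written from the end backwards. */
void page_init(page_t *p) {
    memset(p->bytes, 0, PAGE_SIZE);
    put_u16(p, 0, (uint16_t)PAGE_SIZE);  /* free_offset */
    put_u16(p, 2, 0);                    /* N */
}

/* Store a variable-length record; returns its slot number, or -1 if full. */
int page_insert(page_t *p, const void *rec, uint16_t len) {
    uint16_t free_offset = get_u16(p, 0);
    uint16_t n = get_u16(p, 2);
    size_t dir_end = HEADER_SIZE + (size_t)(n + 1) * SLOT_SIZE;
    if (dir_end + len > free_offset)
        return -1;                       /* not enough free space */
    free_offset = (uint16_t)(free_offset - len);
    memcpy(p->bytes + free_offset, rec, len);
    put_u16(p, HEADER_SIZE + n * SLOT_SIZE, free_offset);  /* offset */
    put_u16(p, HEADER_SIZE + n * SLOT_SIZE + 2, len);      /* length */
    put_u16(p, 0, free_offset);
    put_u16(p, 2, (uint16_t)(n + 1));
    return n;
}

/* Read a record: one indirection through the slot directory. */
const unsigned char *page_get(const page_t *p, uint16_t slot, uint16_t *len) {
    assert(slot < get_u16(p, 2));
    *len = get_u16(p, HEADER_SIZE + slot * SLOT_SIZE + 2);
    return p->bytes + get_u16(p, HEADER_SIZE + slot * SLOT_SIZE);
}
```

Every read goes through the slot directory first, which is the extra indirection that fixed-width dense columns avoid.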
row-store: ABCD
column-store: A B C D
a1 a2 a3 a4 a5 a6
b1 b2 b3 b4 b5 b6
c1 c2 c3 c4 c5 c6
virtual ids/ positional alignment
positional lookups/joins
A(i) = A + i * width(A)
tuple 1, tuple 2, tuple 3, tuple 4, tuple 5, tuple 6
A B C
fixed-width + dense
columns do not need to have the same width
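The positional lookup A(i) = A + i * width(A) can be written directly as pointer arithmetic. The sketch below is illustrative only: the column_t descriptor, the 4-byte and 8-byte widths, and the function names are assumptions, chosen to show two columns of different widths aligned by position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A fixed-width, dense column: value i lives at data + i * width. */
typedef struct {
    const unsigned char *data;  /* start address of the column (A) */
    size_t width;               /* bytes per value (width(A)) */
    size_t count;               /* number of values */
} column_t;

/* A(i) = A + i * width(A): address of the value at position i. */
const void *column_at(const column_t *col, size_t i) {
    assert(i < col->count);
    return col->data + i * col->width;
}

/* Tuple i is simply the i-th value of every column (virtual ids). */
typedef struct { int32_t a; int64_t b; } tuple_ab_t;

tuple_ab_t tuple_at(const column_t *a, const column_t *b, size_t i) {
    tuple_ab_t t;
    memcpy(&t.a, column_at(a, i), sizeof t.a);  /* 4-byte column */
    memcpy(&t.b, column_at(b, i), sizeof t.b);  /* 8-byte column */
    return t;
}
```

No tuple id is stored anywhere: the position in the array is the id, which only works because every column is fixed-width and dense.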
today: column-stores 2.0
select min(C) from R where A<10 & B<20
[figure: columns A, B, C, D on disk; in memory: A<10 -> IDs -> B -> B<20 -> IDs -> C -> min(C)]
query plan = select -> fetch -> select -> fetch -> min
sequential access patterns, max 1 if
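The plan select -> fetch -> select -> fetch -> min can be sketched end to end in C. This is a minimal sketch assuming int columns and position lists of type size_t; the function names and the convention of returning -1 when no row qualifies are hypothetical, not part of the course API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* select: write the positions i with col[i] < bound; returns how many. */
size_t select_lt(const int *col, size_t n, int bound, size_t *ids_out) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] < bound)
            ids_out[j++] = i;
    return j;
}

/* fetch: gather the values of col at the given positions. */
void fetch(const int *col, const size_t *ids, size_t n_ids, int *vals_out) {
    for (size_t i = 0; i < n_ids; i++)
        vals_out[i] = col[ids[i]];
}

/* select over fetched values: keep the ids whose value is < bound. */
size_t refine_lt(const int *vals, const size_t *ids, size_t n_ids,
                 int bound, size_t *ids_out) {
    size_t j = 0;
    for (size_t i = 0; i < n_ids; i++)
        if (vals[i] < bound)
            ids_out[j++] = ids[i];
    return j;
}

/* min over a non-empty array. */
int min_of(const int *vals, size_t n) {
    assert(n > 0);
    int m = vals[0];
    for (size_t i = 1; i < n; i++)
        if (vals[i] < m)
            m = vals[i];
    return m;
}

/* select min(C) from R where A<10 & B<20.
   Returns 0 and sets *result, or -1 if no row qualifies (or on failure). */
int query_min_c(const int *a, const int *b, const int *c, size_t n,
                int *result) {
    if (n == 0)
        return -1;
    size_t *ids_a = malloc(n * sizeof *ids_a);
    size_t *ids_ab = malloc(n * sizeof *ids_ab);
    int *vals = malloc(n * sizeof *vals);
    int status = -1;
    if (ids_a && ids_ab && vals) {
        size_t n_a = select_lt(a, n, 10, ids_a);                /* select A<10 */
        fetch(b, ids_a, n_a, vals);                             /* fetch B */
        size_t n_ab = refine_lt(vals, ids_a, n_a, 20, ids_ab);  /* select B<20 */
        fetch(c, ids_ab, n_ab, vals);                           /* fetch C */
        if (n_ab > 0) {
            *result = min_of(vals, n_ab);                       /* min */
            status = 0;
        }
    }
    free(ids_a);
    free(ids_ab);
    free(vals);
    return status;
}
```

Only column A is scanned in full; B and C are touched only at qualifying positions, and column D is never read.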
working over fixed width & dense columns
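select:
for (i = 0, j = 0; i < size; i++)
    if (column[i] > v) inter1[j++] = i;   /* keep qualifying positions */
fetch:
for (i = 0, j = 0; i < size; i++)         /* here size = number of positions in inter1 */
    inter2[j++] = column[inter1[i]];
no function calls, no indirections, no auxiliary data, min ifs
easy to prefetch next data values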
with data being memory resident these become significant cost components
[figure: query plan A<10 -> IDs -> B -> B<20 -> IDs -> C -> min(C)]
alt1) start with B
alt2) scan A & B independently and merge
alt3) store intermediates as bit vectors - not positions
…
project: basic one + more if you decide to invest in this area
midterm: basic one + 2-3 alternatives
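A possible shape for alt2 combined with alt3: scan A and B independently, keep each result as a bit vector, and merge with a bitwise AND. This is a sketch under assumptions (one bit per row packed into 64-bit words, int columns); the names select_lt_bits, and_bits and min_where are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 64-bit words needed to hold one bit per row. */
#define WORDS_FOR(n) (((n) + 63) / 64)

/* Scan a column and set bit i when col[i] < bound.
   bits must have room for WORDS_FOR(n) words. */
void select_lt_bits(const int *col, size_t n, int bound, uint64_t *bits) {
    for (size_t w = 0; w < WORDS_FOR(n); w++)
        bits[w] = 0;
    for (size_t i = 0; i < n; i++)
        bits[i / 64] |= (uint64_t)(col[i] < bound) << (i % 64);  /* no if */
}

/* Merge two independent selections: a row qualifies if set in both. */
void and_bits(const uint64_t *x, const uint64_t *y, size_t n, uint64_t *out) {
    for (size_t w = 0; w < WORDS_FOR(n); w++)
        out[w] = x[w] & y[w];
}

/* min over the rows whose bit is set.
   Returns 0 and sets *result, or -1 if no bit is set. */
int min_where(const int *col, size_t n, const uint64_t *bits, int *result) {
    assert(col && bits && result);
    int status = -1;
    for (size_t i = 0; i < n; i++) {
        if ((bits[i / 64] >> (i % 64)) & 1) {
            if (status != 0 || col[i] < *result)
                *result = col[i];
            status = 0;
        }
    }
    return status;
}
```

The trade-off compared with position lists: a bit vector always costs one bit per row regardless of selectivity, while a position list costs one id per qualifying row, so which one is smaller depends on how many rows qualify.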
[figure: query plan A<10 -> IDs -> B -> B<20 -> IDs -> C -> min(C)]
late tuple reconstruction/materialization
only reconstruct to present results
no need to assemble tuples
minimize memory footprint
minimize data we are moving up the memory hierarchy
but requires new processing engine
[figure: columns A, B, C, D move from disk to memory]
option1: early tuple reconstruction/materialization into ABC rows, then row-store engine
option2: column-store engine
possible data flow patterns
tuple at a time
block/vector at a time
column at a time
[figure: query plan A<10 -> IDs -> B -> B<20 -> IDs -> C -> min(C)]
select min(C) from R where A<10 & B<20
[figure: the same plan drawn twice over columns A, B, C, D: column-at-a-time vs. vector-at-a-time]
Marcin Zukowski, PhD
CEO/Co-founder of Vectorwise (now Actian)
now: co-founder of Snowflake
“changing the world, one terabyte at a time”
the beer analogy
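[figure: memory hierarchy from the CPU outwards: registers, on chip cache, on board cache, memory, disk; faster towards the CPU, cheaper towards disk]
[figure: query plan with operators op1, op2, op3 passing vectors of columns A and B; label: size of vector]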
tuple at a time - good for minimizing memory footprint
bulk processing - good for minimizing function call overhead
vectorized processing - somewhere in between
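To make "somewhere in between" concrete, here is a sketch of the same query (select min(C) from R where A<10 & B<20) processed one vector at a time. The vector size of 1024 values is an assumption, as are the function name and the -1 return convention; the point is only that the intermediate position list never grows beyond one vector.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_SIZE 1024  /* assumed; small enough for intermediates to stay in cache */

/* select min(C) from R where A<10 & B<20, one vector at a time.
   Returns 0 and sets *result, or -1 if no row qualifies. */
int vectorized_min_c(const int *a, const int *b, const int *c, size_t n,
                     int *result) {
    size_t ids[VECTOR_SIZE];  /* intermediate is bounded by the vector size */
    int status = -1;
    assert(result != NULL);
    for (size_t start = 0; start < n; start += VECTOR_SIZE) {
        size_t len = n - start < VECTOR_SIZE ? n - start : VECTOR_SIZE;

        /* select A<10 over this vector only */
        size_t k = 0;
        for (size_t i = 0; i < len; i++)
            if (a[start + i] < 10)
                ids[k++] = start + i;

        /* fetch B and select B<20, compacting the position list in place */
        size_t m = 0;
        for (size_t i = 0; i < k; i++)
            if (b[ids[i]] < 20)
                ids[m++] = ids[i];

        /* fetch C and fold into the running min */
        for (size_t i = 0; i < m; i++) {
            int v = c[ids[i]];
            if (status != 0 || v < *result) {
                *result = v;
                status = 0;
            }
        }
    }
    return status;
}
```

Each inner loop is still a tight loop over an array, as in column-at-a-time processing, but the memory footprint of intermediates is fixed, as in tuple-at-a-time processing.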
history/timeline
~1960s: tuple at a time
1980s: ideas about block processing
2005: vectorwise
>2010: industry adoption
project: column-at-a-time
bonus: vectorized processing
update row7=(A=a,B=b,C=c,D=d)
row-store: ABCD
vs
column-store: A B C D
which is better to update and why?
how much does it cost to update a single row? (think about pages, data movement)
how to update in column-stores? (query plan + algorithms)
[figure: base data (columns A, B, C, D) next to a separate structure of pending updates; labels: update, query, periodically]
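One way to read the base data / pending updates picture, sketched for a single column and for inserts only. Everything here is an assumption made for illustration: the tiny fixed-size pending buffer, the merge-when-full policy, and the names updatable_column_t, col_insert, col_merge and col_min. Deletes and in-place updates are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PENDING_CAP 4  /* assumed, tiny on purpose */

typedef struct {
    int *base;                 /* dense base data */
    size_t n_base;
    int pending[PENDING_CAP];  /* recent inserts, not yet merged */
    size_t n_pending;
} updatable_column_t;

/* The "periodically" step: append pending values to the base data.
   Returns 0 on success, -1 if memory could not be allocated. */
int col_merge(updatable_column_t *col) {
    if (col->n_pending == 0)
        return 0;
    int *grown = realloc(col->base,
                         (col->n_base + col->n_pending) * sizeof *grown);
    if (!grown)
        return -1;
    memcpy(grown + col->n_base, col->pending,
           col->n_pending * sizeof *grown);
    col->base = grown;
    col->n_base += col->n_pending;
    col->n_pending = 0;
    return 0;
}

/* An insert only touches the pending buffer; merge first if it is full. */
int col_insert(updatable_column_t *col, int v) {
    if (col->n_pending == PENDING_CAP && col_merge(col) != 0)
        return -1;
    col->pending[col->n_pending++] = v;
    return 0;
}

/* A query has to look at both the base data and the pending inserts.
   Returns 0 and sets *result, or -1 if the column is empty. */
int col_min(const updatable_column_t *col, int *result) {
    int status = -1;
    assert(result != NULL);
    for (size_t i = 0; i < col->n_base; i++)
        if (status != 0 || col->base[i] < *result) {
            *result = col->base[i];
            status = 0;
        }
    for (size_t i = 0; i < col->n_pending; i++)
        if (status != 0 || col->pending[i] < *result) {
            *result = col->pending[i];
            status = 0;
        }
    return status;
}
```

A zero-initialized updatable_column_t is a valid empty column; the caller frees base when done.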
fractured mirrors
[figure: two copies of the data, a columns copy (A B C D) and a rows copy (ABCD); a query goes through the optimizer]
A case for fractured mirrors
Ravishankar Ramamurthy, David J. DeWitt, Qi Su
Very Large Databases Journal, 12(2): 89-101, 2003
column-stores great for analytics
row-stores great for transactions
still basic concepts are the same
hybrids possible
keep access patterns sequential
and simple (min ifs)
Notes to remember
reading
Read: The Design and Implementation of Modern Column-store Database Systems (Sections: all -4.6 & 4.8)
by D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden
Read: IEEE Data Engineering Bulletin, 35(1), March 2012 Special Issue on Column-stores (9 short overview papers)
research papers
Read: Database Architecture Optimized for the New Bottleneck: Memory Access
Peter Boncz, Stefan Manegold, Martin Kersten
In Proc. of the Very Large Databases Conference (VLDB), 1999
Browse: MonetDB/X100: Hyper-Pipelining Query Execution
Peter A. Boncz, Marcin Zukowski, Niels Nes
In Proc. of the Inter. Conference on Innovative Data Systems Research (CIDR), 2005
Browse: Materialization Strategies in a Column-Oriented DBMS
Daniel Abadi, Daniel Myers, David DeWitt, Samuel Madden
In Proc. of the Inter. Conference on Data Engineering (ICDE), 2007
Browse: Self-organizing tuple reconstruction in column-stores
Stratos Idreos, Martin Kersten, Stefan Manegold
In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009
DATA SYSTEMS
prof. Stratos Idreos
class 5
column-stores 2.0