faster column-oriented indexes

Faster Column-Oriented Indexes Daniel Lemire http://www.professeurs.uqam.ca/pages/lemire.daniel.htm blog: http://www.daniel-lemire.com/ Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc). February 10, 2010 Daniel Lemire Faster Column-Oriented Indexes

Upload: daniel-lemire

Post on 06-May-2015

Category:

Technology


DESCRIPTION

Recent research results in optimizing column-oriented indexes for faster data warehousing. This talk aims to answer the following question: when is sorting the table a sufficiently good optimization?

TRANSCRIPT

Page 1: Faster Column-Oriented Indexes

Faster Column-Oriented Indexes

Daniel Lemire

http://www.professeurs.uqam.ca/pages/lemire.daniel.htm

blog: http://www.daniel-lemire.com/

Joint work with Owen Kaser (UNB) and Kamel Aouiche (post-doc).

February 10, 2010

Daniel Lemire Faster Column-Oriented Indexes

Page 2: Faster Column-Oriented Indexes

Some trends in business intelligence (BI)

source: gooddata.com

Low-latency BI, Complex Event Processing [Hyde, 2010]

Commoditization, open source software: Pentaho, LucidDB (http://www.luciddb.org/)

Column-oriented databases ←

Page 3: Faster Column-Oriented Indexes

Row Stores

name, date, age, sex, salary

name, date, age, sex, salary

name, date, age, sex, salary

name, date, age, sex, salary

name, date, age, sex, salary

Dominant paradigm

Transactional: Quick append and delete

Page 4: Faster Column-Oriented Indexes

Column Stores

name date age sex salary

Goes back to StatCan in the seventies [Turner et al., 1979]

Made fashionable again in data warehousing by Stonebraker [Stonebraker et al., 2005]

New: Oracle Exadata hybrid columnar compression

Page 5: Faster Column-Oriented Indexes

Vectorization

    const int N = 2048;
    int a[N], b[N];
    int i = 0;
    for (; i < N; i++)
        a[i] += b[i];

Modern superscalar CPUs support vectorization (SSE)

This code is four times faster with -ftree-vectorize (GNU GCC)

Need long streams, same data type, and no branching.

Columns are good candidates!

Page 6: Faster Column-Oriented Indexes

Main column-oriented indexes

(1) Bitmap indexes [O’Neil, 1989]

(2) Projection indexes [O’Neil and Quass, 1997]

Both are compressible.

Page 7: Faster Column-Oriented Indexes

Bitmap indexes

SELECT * FROM T WHERE x=a AND y=b;

Vectors of booleans

Above, compute

{r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b}
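
The intersection above can be sketched in C as a word-by-word AND. This is only an illustration under assumptions not stated in the talk: each bitmap is taken to be an array of 64-bit words in which bit (r % 64) of word (r / 64) is set when row r matches, and the names `bitmap_and` and `bitmap_to_rows` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: bit (r % 64) of word (r / 64) is set when row r matches. */

/* out = x_eq_a AND y_eq_b, one 64-bit word at a time. */
void bitmap_and(const uint64_t *x_eq_a, const uint64_t *y_eq_b,
                uint64_t *out, size_t n_words) {
    assert(x_eq_a != NULL && y_eq_b != NULL && out != NULL);
    for (size_t i = 0; i < n_words; i++)
        out[i] = x_eq_a[i] & y_eq_b[i];
}

/* Write the row ids of the set bits into rows[] (room for 64 * n_words
   entries is assumed); returns how many row ids were written. */
size_t bitmap_to_rows(const uint64_t *bits, size_t n_words, size_t *rows) {
    size_t count = 0;
    for (size_t i = 0; i < n_words; i++)
        for (unsigned b = 0; b < 64; b++)
            if ((bits[i] >> b) & 1u)
                rows[count++] = i * 64 + b;
    return count;
}
```

The AND itself never looks at individual rows; row ids are only materialized at the end, for the rows that satisfy both predicates.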

Page 8: Faster Column-Oriented Indexes

Other applications of the bitmaps/bitsets

The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.)

Search engines use bitmaps to filter queries, e.g. Apache Lucene: org.apache.lucene.util.OpenBitSet.java.

Page 9: Faster Column-Oriented Indexes

Bitmaps and fast AND/OR operations

Computing the union of two sets of integers between 1 and 64 (e.g., row ids, trivial table)...

E.g., {1, 5, 8} ∪ {1, 3, 5}?

Can be done in one operation by a CPU: BitwiseOR(10001001, 10101000)

Extend to sets from 1..N using ⌈N/64⌉ operations.

To compute [a0, ..., aN−1] ∨ [b0, b1, ..., bN−1]:

a0, ..., a63 BitwiseOR b0, ..., b63;
a64, ..., a127 BitwiseOR b64, ..., b127;
a128, ..., a191 BitwiseOR b128, ..., b191;
...

It is a form of vectorization.
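
A minimal C sketch of this word-wise union follows. It assumes elements are numbered from 1 as on the slide and that element 1 maps to the least significant bit of the first word (the slide writes bits left to right, so the printed bit order differs). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 64-bit words needed for elements 1..n, i.e. ceil(n / 64). */
size_t words_needed(size_t n) {
    return (n + 63) / 64;
}

/* Add element v (1-based) to the set. */
void set_add(uint64_t *words, size_t v) {
    assert(v >= 1);
    words[(v - 1) / 64] |= (uint64_t)1 << ((v - 1) % 64);
}

/* Return 1 if element v (1-based) is in the set, 0 otherwise. */
int set_contains(const uint64_t *words, size_t v) {
    assert(v >= 1);
    return (int)((words[(v - 1) / 64] >> ((v - 1) % 64)) & 1u);
}

/* a = a OR b for sets over 1..n: one bitwise OR per 64-bit word. */
void set_union(uint64_t *a, const uint64_t *b, size_t n) {
    size_t n_words = words_needed(n);
    for (size_t i = 0; i < n_words; i++)
        a[i] |= b[i];
}
```

For the slide's example, the sets {1, 5, 8} and {1, 3, 5} both fit in one word, so the union costs a single OR.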

Page 10: Faster Column-Oriented Indexes

What are bitmap indexes for?

Myth: bitmap indexes are for low cardinality columns (e.g., SEX).

the Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality [Zaker et al., 2008].

Page 11: Faster Column-Oriented Indexes

Projection indexes

[Figure: table columns labelled date, name, city]

Write out the (normalized) column values sequentially.

It is a projection of the table on a single column.

Best for low selectivity queries on few columns:

SELECT sum(number*price) FROM T;
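
As a rough illustration of why projection indexes suit such queries, the sum can be computed by scanning just two arrays. The sketch assumes, hypothetically, that the `number` column is stored as a contiguous array of `int` and the `price` column as a contiguous array of `double`; the function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical projection indexes: one contiguous array per column.
   Computes sum(number * price) over all rows. */
double sum_number_times_price(const int *number, const double *price,
                              size_t n_rows) {
    assert(number != NULL && price != NULL);
    double total = 0.0;
    /* Sequential scan over two columns only; other columns are never read. */
    for (size_t r = 0; r < n_rows; r++)
        total += number[r] * price[r];
    return total;
}
```

The loop is a long stream over uniform data types with no branching, which is the pattern the vectorization slide describes.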

Page 12: Faster Column-Oriented Indexes

How to compress column indexes?

Must handle long streams of identical values efficiently ⇒ Run-length encoding? (RLE)

Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, ...

So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1, 1, 3

It is a bit more complicated (more on this another day)
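
The simple run-length example above can be sketched as follows. This is a toy encoder, not one of the word-aligned schemes used in practice. It assumes the bitmap is given as a string of '0' and '1' characters and that the first run always counts 0s (so a bitmap starting with 1 gets a leading run of length 0). The name `rle_encode` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode a string of '0'/'1' characters as alternating run lengths,
   starting with a run of 0s. Any character other than '0' is treated
   as a 1. runs[] must have room for strlen(bits) + 1 entries.
   Returns the number of runs written. */
size_t rle_encode(const char *bits, size_t *runs) {
    assert(bits != NULL && runs != NULL);
    size_t n = strlen(bits), count = 0, i = 0;
    int current = 0; /* bit value of the run being measured */
    while (i < n) {
        size_t len = 0;
        while (i < n && (bits[i] != '0') == current) {
            len++;
            i++;
        }
        runs[count++] = len;
        current = !current;
    }
    return count;
}
```

Calling `rle_encode("0001111100010111", runs)` fills `runs` with 3, 5, 3, 1, 1, 3 and returns 6, matching the slide.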

Page 13: Faster Column-Oriented Indexes

What about other compression types?

With RLE, we can often process the data in compressed form

Hence, with RLE, compression saves both storage and CPU cycles!

Not always true with other techniques such as Huffman, LZ77, Arithmetic Coding, ...
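
To illustrate processing in compressed form, here is a small sketch that sums a run-length encoded column without decompressing it. The `struct run` layout (a value and a repeat count) is an assumption made for this example, not a format described in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* One run of a run-length encoded column: `length` copies of `value`. */
struct run {
    long value;
    size_t length;
};

/* Sum of all column values, computed on the compressed form:
   one multiply-add per run instead of one addition per row. */
long rle_sum(const struct run *runs, size_t n_runs) {
    assert(runs != NULL || n_runs == 0);
    long total = 0;
    for (size_t i = 0; i < n_runs; i++)
        total += runs[i].value * (long)runs[i].length;
    return total;
}
```

The work is proportional to the number of runs rather than the number of rows, which is why fewer runs tends to mean both a smaller and a faster index.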

Page 14: Faster Column-Oriented Indexes

How do we improve performance?

Smaller indexes are faster.

In data warehousing: data is often updated in batches.

So spend time at construction time optimizing the index.

Page 15: Faster Column-Oriented Indexes

Modelling the size of an index

Any formal result?

Tricky: There are many variations on RLE.

Use: number of runs of identical value in a column

AAABBBCCAA has 4 runs
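
Counting runs is a single pass over the column. A minimal sketch, assuming the column is an array of single-character values (the function name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Number of maximal runs of identical values in a column of n values. */
size_t count_runs(const char *column, size_t n) {
    assert(column != NULL || n == 0);
    if (n == 0)
        return 0;
    size_t runs = 1;
    for (size_t i = 1; i < n; i++)
        if (column[i] != column[i - 1])
            runs++;
    return runs;
}
```

With the slide's example, `count_runs("AAABBBCCAA", 10)` returns 4.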

Page 16: Faster Column-Oriented Indexes

Improving compression by reordering the rows

RLE-compressed indexes are order-sensitive: they compress sorted tables better.

But finding the best row ordering is NP-hard [Lemire et al., 2010].

Actually an instance of the Traveling Salesman Problem (TSP)

So we use heuristics:

lexicographic order
Gray codes
Hilbert, ...

Page 17: Faster Column-Oriented Indexes

How many ways to sort? (1)

Lexicographic row sorting is

fast, even for very large tables.
easy: sort is a Unix staple.

Substantial index-size reductions (often 2.5 times, benefits grow with table size)

a a
a b
a c
b a
b b
b c
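
For an in-memory table, lexicographic row sorting can be sketched with the standard `qsort`. The sketch assumes a hypothetical fixed number of columns (`N_COLS`) and that values have already been normalized to integer codes; very large tables would instead use an external sort.

```c
#include <assert.h>
#include <stdlib.h>

#define N_COLS 2 /* hypothetical fixed number of columns */

/* Compare two rows column by column, first column most significant. */
static int compare_rows_lex(const void *pa, const void *pb) {
    const int *a = pa, *b = pb;
    for (int c = 0; c < N_COLS; c++) {
        if (a[c] < b[c]) return -1;
        if (a[c] > b[c]) return 1;
    }
    return 0;
}

/* Sort a table stored row by row (n_rows rows of N_COLS ints). */
void sort_rows_lex(int (*table)[N_COLS], size_t n_rows) {
    assert(table != NULL || n_rows == 0);
    qsort(table, n_rows, sizeof table[0], compare_rows_lex);
}
```

With a, b, c coded as 0, 1, 2, sorting any permutation of the six rows above yields the order shown on the slide.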

Page 18: Faster Column-Oriented Indexes

How many ways to sort? (2)

Gray codes are lists of tuples with successive (Hamming) distance of 1 [Knuth, 2005].

Reflected Gray code order is sometimes slightly better than lexicographical...

a a
a b
a c
b c
b b
b a
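
One way to sort rows in reflected Gray-code order is a comparator that flips direction according to the parity of the values already matched. This is a sketch under assumptions: values are normalized to codes 0, 1, 2, ... within each column, the number of columns is a hypothetical constant `N_COLS`, and the function names are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define N_COLS 2 /* hypothetical fixed number of columns */

/* Reflected Gray-code comparison of two rows of non-negative integer
   codes. A column is compared in descending order when the sum of the
   (equal) values in the preceding columns is odd, ascending otherwise. */
static int compare_rows_gray(const void *pa, const void *pb) {
    const int *a = pa, *b = pb;
    int descending = 0;
    for (int c = 0; c < N_COLS; c++) {
        if (a[c] != b[c]) {
            int cmp = (a[c] < b[c]) ? -1 : 1;
            return descending ? -cmp : cmp;
        }
        descending ^= a[c] & 1; /* odd value: reverse the next column */
    }
    return 0;
}

/* Sort a table stored row by row (n_rows rows of N_COLS ints). */
void sort_rows_gray(int (*table)[N_COLS], size_t n_rows) {
    assert(table != NULL || n_rows == 0);
    qsort(table, n_rows, sizeof table[0], compare_rows_gray);
}
```

With a, b, c coded as 0, 1, 2, the six rows come out as (a,a), (a,b), (a,c), (b,c), (b,b), (b,a): the second column reverses when the first column moves to b, so only one column changes between consecutive rows.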

Page 19: Faster Column-Oriented Indexes

How many ways to sort? (3)

Reflected Gray code order is not the only Gray code.

Knuth also presents modular Gray code.

a a
a b
a c
b c
b a
b b

Page 20: Faster Column-Oriented Indexes

How many ways to sort? (4)

Hilbert Index [Hamilton and Rau-Chaplin, 2007].

Also a Gray code (conditionally)

Gives very bad results for column-oriented indexes.

Page 21: Faster Column-Oriented Indexes

Recursive orders

Lexicographical, reflected Gray code and modular Gray code belong to a larger class: recursive orders.

They sort on the first column, then the second and so on.

Not all Gray codes are recursive orders: Hilbert is not.

Page 22: Faster Column-Oriented Indexes

Best column order?

Column order is important for recursive orders. We almost have this result [Lemire and Kaser, 2009]:

Take any recursive order

Order the columns by increasing cardinality (small to LARGE)

Proposition

The expected number of runs is minimized (among all possible column orders).

Page 23: Faster Column-Oriented Indexes

How do you know when the lexicographical order is good enough?

Even though row reordering is NP-hard, we find it hard to improve over recursive orders.

Sometimes, fancier alternatives (to be discussed another day) work better, but not always.

Page 24: Faster Column-Oriented Indexes

Thankfully, we can detect cases where recursive orders are good enough

We can bound the suboptimality of all recursive orders.

Proposition

Consider a table with $n$ distinct rows and column cardinalities $N_i$ for $i = 1, \ldots, c$. Recursive ordering is $\mu$-optimal for the problem of minimizing the runs where

$$\mu = \frac{\min(N_1, n) + \min(N_1 N_2, n) + \cdots + \min(N_1 N_2 \cdots N_c, n)}{n}.$$
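
The bound is cheap to evaluate once n and the column cardinalities are known. The following is a sketch only: the function name `mu_bound` is hypothetical, cardinalities are assumed to be at least 1 and n at least 1, and the running product is capped at n so that it cannot overflow for realistic table sizes.

```c
#include <assert.h>
#include <stddef.h>

/* mu = (min(N1, n) + min(N1*N2, n) + ... + min(N1*...*Nc, n)) / n
   for column cardinalities card[0..c-1] and n distinct rows. */
double mu_bound(const size_t *card, size_t c, size_t n) {
    assert(n > 0);
    size_t product = 1; /* min(N1 * ... * Ni, n), kept capped at n */
    size_t sum = 0;
    for (size_t i = 0; i < c; i++) {
        assert(card[i] >= 1);
        if (product < n) {
            /* product * card[i] >= n exactly when card[i] >= ceil(n / product) */
            size_t needed = (n + product - 1) / product;
            product = (card[i] >= needed) ? n : product * card[i];
        }
        sum += product;
    }
    return (double)sum / (double)n;
}
```

For example, cardinalities 2, 3, 10 with n = 20 give (2 + 6 + 20) / 20 = 1.4. Since the proposition bounds how far a recursive order can be from the minimum number of runs, values of µ close to 1 suggest that sorting is close to the best possible.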

Page 25: Faster Column-Oriented Indexes

Bounding the optimality of sorting: the computation

How do you compute µ very fast so you know lexicographical sort is good enough?

Trick is to determine n, the number of distinct rows, without sorting the table.

Thankfully: n can be estimated quickly with probabilistic methods [Aouiche and Lemire, 2007].

Page 26: Faster Column-Oriented Indexes

Bounding the optimality of sorting: actual numbers

data set             columns   µ
Census-Income 4-D    4         2.63
DBGEN 4-D            4         1.02
Netflix              4         2.00
Census1881           7         5.09

Page 27: Faster Column-Oriented Indexes

Take away message

Column stores are good because of vectorization and RLE/sorting

Sorting is sometimes nearly optimal, but not always; however, we can sometimes tell when sorting is nearly optimal

Page 28: Faster Column-Oriented Indexes

Future direction?

Minimizing the number of runs is the wrong problem! We want to maximize long runs!

Must study fancier row-reordering heuristics.

Page 29: Faster Column-Oriented Indexes

Questions?

Page 30: Faster Column-Oriented Indexes

Aouiche, K. and Lemire, D. (2007). A comparison of five probabilistic view-size estimation techniques in OLAP. In DOLAP’07, pages 17–24.

Hamilton, C. H. and Rau-Chaplin, A. (2007). Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. Information Processing Letters, 105(5):155–163.

Hyde, J. (2010). Data in flight. Commun. ACM, 53(1):48–52.

Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley.

Lemire, D. and Kaser, O. (2009). Reordering columns for smaller indexes. In preparation, available from http://arxiv.org/abs/0909.1346.

Page 31: Faster Column-Oriented Indexes

Lemire, D., Kaser, O., and Aouiche, K. (2010). Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 69(1):3–28.

O’Neil, P. and Quass, D. (1997). Improved query performance with variant indexes. In SIGMOD ’97, pages 38–49.

O’Neil, P. E. (1989). Model 204 architecture and performance. In 2nd International Workshop on High Performance Transaction Systems, pages 40–59.

Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. (2005). C-store: a column-oriented DBMS. In VLDB’05, pages 553–564.

Page 32: Faster Column-Oriented Indexes

Turner, M. J., Hammond, R., and Cotton, P. (1979). A DBMS for large statistical databases. In VLDB’79, pages 319–327.

Zaker, M., Phon-Amnuaisuk, S., and Haw, S. (2008). An adequate design for large data warehouse systems: Bitmap index versus B-Tree index. IJCC, 2(2).
