quasi-succinct indices - nii shonan meeting...• store the lower ℓ= log(u / n) bits explicitly...

26
Quasi-Succinct Indices or The Revenge of Elias and Fano Sebastiano Vigna Dipartimento di Informatica Università degli Studi di Milano Italia

Upload: others

Post on 25-Jul-2020

3 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Quasi-Succinct Indicesor

The Revenge of Elias and Fano

Sebastiano VignaDipartimento di Informatica

Università degli Studi di MilanoItalia

Page 2: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Inverted Indices• The backbone of search engines (and more)

• Main problem: store a sequence of increasing integers in little space so to be able to enumerate the list / pick the k-th integer / skip to the first integer larger than or equal to b quickly

• Maps to rank/selection/predecessor search

• For positions the problem is a bit more articulated (and complicated)

Page 3: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

The Classical Solution

• Middle 80s/start of 90s (apparently depends on who you talk to)

• Turn the sequence x0, x1, x2, ... into gaps x0, x1 - x0, x2 - x1,...

• Hope that the numbers will be small and well (predictably) distributed

• Use some instantaneous code to store the gaps

Page 4: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Lot Of Research

• Zillions of different codes and kinds of codes

• Problem: sequential decoding easy, rank/selection/predecessor very inefficient

• Solution: various kind of skip tables that make it possible to “jump” in the middle of the gap sequence

• In retrospective, it looks a little bit contrived, doesn’t it?

Page 5: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Why Gaps?

• Maybe we can approach the problem in a completely different way

• Maybe gaps were not a good idea in the first place

• Maybe there are nice, efficient ways of store sequences of integers that do not require gaps

• So, back (1975!) to the future (now)!

Page 6: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Elias–Fano Representation

• Elias developed in 1975 a quasi-succinct representation for monotone sequences (JACM); Fano discusses it in a report

• At that time, probably no more than a curiosity

• (My 2€¢: should be taught in the first year of any CS curriculum)

• Inspired several modern succinct data structures

Page 7: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

High Bits/Low Bits• Given n and u we have a monotone sequence

0 ≤ x0, x1, x2, ... , xn-1≤u

• Store the lower ℓ= log(u / n) bits explicitly

• Store the upper bits as a sequence of unary coded gaps (0k1 represents k)

• We use at most 2 + log(u / n) bits per element

• Close to the succinct bound: quasi-succinct!

• (Less than half a bit away, as Elias proves)

Page 8: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

5 8 8 15 32

1 10 10 11 100001 00 00 11 00

01 00 00 11 00

1 2 2 3 8

01 01 1 01 000001

1 ! 0 2 ! 1 2 ! 2 3 ! 2 8 ! 3

5, 8, 8, 15, 32 ≤ u = 36, ℓ = 2

Page 9: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Advantages• Almost optimal space usage

• Distribution-free

• Reading sequentially requires very few logical operations (you might be surprised)

• Restrict the rank/selection problem to a nice ~2n bits array with half zeroes, half ones

• It’s beautiful :-)

• So, what about rank/select?

Page 10: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Looking up (Selection)

• Suppose you want to get the k-th element quickly

• Just scan the upper bits, one word at a time, doing population counting (one clock)

• Cost of searching: 100ps/element (yes, that’s picoseconds) per element on an i7 @ 3.4GHz

• When you get to the right word, complete sequentially and pick the lower bits

Page 11: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Searching (Successor)

• It’s exactly the same: only, you count zeroes

• Zeroes tells you how much the upper bits are increasing, which is the important thing

• Just skip b >> ℓupper zeroes and complete

sequentially

• Due to the balance between ones and zeroes, on average always 100ps per element (this must be made more precise, see the paper)

Page 12: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

“Complete Sequentially”?• Not really

• There are broadword algorithms for selection (I wrote the first one in 2005; improved later by Simon Gog)

• Fixed number of operations to skip k unary codes

• Final phase at ~500ps/element

int select_in_word( const uint64_t x, const int k ) { uint64_t byte_sums = x - ((x & 0xaaaaaaaaaaaaaaaaULL) >> 1); byte_sums = (byte_sums & 0x3333333333333333ULL) + ((byte_sums >> 2)

& 0x3333333333333333ULL); byte_sums = (byte_sums + (byte_sums >> 4)) & 0x0f0f0f0f0f0f0f0fULL; byte_sums *= 0x0101010101010101ULL; const uint64_t k_step_8 = k * 0x0101010101010101ULL; const int place = ((((k_step_8 | 0x8080808080808080ULL) - byte_sums)

& 0x8080808080808080ULL) >> 7) * 0x0101010101010101ULL >> 53 & ~0x7; return place + select_in_byte[x >> place & 0xFF |

k - ((byte_sums << 8) >> place & 0xFF) << 8];}

Page 13: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Not Fast enough?• Fix a quantum q (I use 256)

• Store in a table the position of each q-th zero, or q-th one

• Go there in constant time and search from there

• On average, again constant time because of the balance between zeroes and ones

• Extreme locality: one memory access per skip

Page 14: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

0 1 0 1 1 0 1 0 0 0 0 0 1 01 00 00 11 00

0 1 2 3 4 5 6 7 8 9 10 11 12 13

• 5, 8, 8, 15, 32 ≤ u = 36, ℓ = 2

• We to skip to 22, so we skip 22 >> ℓ = 5 zeroes

• We getting to position 9, so we are in the middle of the unary code associated with the element of index 9 - 5 = 4

• A unary-code read (the dashed arrow) returns 3

• We now know that the upper bits of the current element (of index 4) are 3 + 5 = 8

• Since the block of lower bits of index 4 is zero, we return 32

• If we have skip pointers with q=4, we can start from the dotted arrow

Page 15: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Enough of Fun with Bits

• We want to store an inverted index

• There are document pointers, counts and positions

• For pointers we obviously use a quasi-succinct list with skips

• Counts? Positions?

• Important: we can store strictly monotone sequences quasi-succinctly by storing xi - i !

Page 16: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Using Duality Perversely• Instead of storing counts c0, c1, c2, ... , we store their

prefix sums (a.k.a. cumulative function) c0, c0 + c1, c0 + c1 + c2, ...

• Instead of storing positions, we store the prefix sums of their gaps

• Positions pi0, pi

1, pi2, ... , pi

ci – 1 are first turned into pi0 +

1, pi1 – pi

0, pi2 – pi

1, . . . , pici – 1 – pi

ci – 2

• All such sequences are concatenated and stored as a prefix sum

• Key observation: the counts cumulative function is the indexing function for positions

Page 17: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Fast & Compact• Decoding speed faster than other approaches

(but not for counts/positions!)

• Compression definitely better than other approaches, even for the smallest lists, except for very slow stuff like Golomb

• Locality of access definitely better than other approaches

• Note that reading sequentially and skipping mix well together

Page 18: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

It Works

• In April 2012 I visited Facebook

• They were working on their new feature—GraphSearch

• Problem: how do you find very quickly which of your friends like to cook?

• Hard intersection problem—small vs. big

Page 19: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Interaction

• I handed to Mike Curtiss (formerly at Google) my preprint

• I had the gut feeling that the Elias–Fano representation was exactly what they were looking for: fast intersection of small and big lists

• Note that Shuai Ding (PForDelta) works with Mike

Page 20: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Ten Months Later

• On January 27 I got an email from Mike

• “We wanted to let you know that we have open-sourced a C++ implementation of your index representation.  We are currently using this in production, because it is faster than any other approaches.”

• “Any other” includes Google’s GroupVarint and variants of PForDelta

• I guess they benchmarked the thing thoroughly...

Page 21: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Fantastic Feedback

• My code was completely rewritten by a competent engineer using stuff I’ve never seen for unaligned access

• Some more speed improvements

• Very clever idea: read directly cumulative unary codes by bit cancellation

• That stuff found its way back into MG4J (our Java search engine)

Page 22: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

What Now?• Let’s improve this, e.g., better implementations

• There’s decades of engineering and optimization on gaps, very little on this, yet it is faster and compresses better!

• Beautiful code by Philip Pronin (Facebook) on GitHub:

int64_t get_next_upper_bits() { while( word == 0 ) word = upper_bits[ ++curr ]; const int64_t upper_bits = curr * 64 +

__builtin_ctzll( word ) - index++; word &= word - 1; return upper_bits; }

Page 23: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Benchmarks• Benchmarks in the WSDM 2012 paper are

obsolete (code is now much better thanks to Philip)

• New benchmarks soon using Haswell, which has a single-instruction x &= x –1(thanks to Giuseppe Ottaviano for making me notice this)

• Difficult comparison with the literature, as the authors of the main paper about compression of positions refuse to give their code (the URL in the paper is fake)

Page 24: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

When Does It Shine?

• Heavy skipping: “Romeo and Juliet”

• Proximity-based search, like...

• find phrases

• find documents containing a bag of words within k positions

• In general, when skipping is more important than pure enumeration

• Particularly efficient when coupled with optimally lazy proximity algorithms [Boldi & Vigna 2006]

Page 25: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

When Does It Crawl?

• Enumeration oriented tasks (no skipping)

• TF-IDF-like scoring (e.g., BM25) on Boolean disjunctions (but you can have a separate fast count index)

• Everywhere appearing phrases: “home page”

Page 26: Quasi-Succinct Indices - NII Shonan Meeting...• Store the lower ℓ= log(u / n) bits explicitly • Store the upper bits as a sequence of unary coded gaps (0k1 represents k) •

Try It!

• On MG4J: http://mg4j.di.unimi.it/

• On WebGraph: http://webgraph.di.unimi.it/

• Facebook: https://github.com/facebook/folly/

• Lucene codec?

• Questions?