«Performance of Compressed Inverted List Caching in Search Engines», Proceedings of the International World Wide Web Conference Committee, Beijing, 2008. J. Zhang, X. Long, T. Suel. Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH


Page 1:

«Performance of Compressed Inverted List Caching in Search Engines»
Proceedings of the International World Wide Web Conference Committee, Beijing, 2008

J. Zhang, X. Long, T. Suel

Paper presentation:

Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Page 2:

Paper Outline

• Introduction
• Technical background: compressed index
• Technical background: caching in SE
• Paper Contributions
• Inverted index compression
• List caching policies
• Compression plus caching
• Conclusion

Page 3:

Introduction

• Observation: in today's large search engines (SE) it is crucial to improve query throughput and response time

• The authors comparatively study two important techniques for achieving that goal: inverted index compression and inverted index caching (or list caching)

Page 4:

Compressed index organization

Two cases are considered: (i) we have docIDs and frequencies, i.e., each posting is of the form (di, fi), and (ii) we also store positions, i.e., each posting is of the form (di, fi, pi,0, …, pi,freq-1). We use word-oriented positions, i.e., pi,j = k if a word is the k-th word in the document.
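The difference (delta) representation referred to below in Figure 1 can be sketched as follows; this is an illustrative example, not the authors' implementation, and the function names are hypothetical:

```python
# Store each docID as the gap to the preceding docID; on a sorted
# posting list this yields small integers, which compress much better.
def delta_encode(doc_ids):
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

# Recover the original docIDs with a running sum over the gaps.
def delta_decode(gaps):
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids
```

For example, the sorted docIDs [3, 7, 18, 19] become the gaps [3, 4, 11, 1].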

Figure 1: Disk-based inverted index structure with blocks for caching and chunks for skipping. DocIDs and positions are shown after taking the differences to the preceding values.

Page 5:

Caching in Search Engines

Result caching means that if a query is identical to another query (possibly from a different user) that was recently processed, then we can simply return the previous result. Thus, a result cache stores the top-k results of all queries that were recently computed by the engine.
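A result cache of this kind can be sketched as a bounded map from query string to its top-k results; this is a minimal illustration with hypothetical names, not the engine's actual implementation, and it uses LRU eviction simply as an example policy:

```python
from collections import OrderedDict

# Minimal result cache: maps a query string to its previously
# computed top-k results, evicting the least recently used entry
# once the capacity is exceeded.
class ResultCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            return self.entries[query]
        return None                           # miss: engine must compute

    def put(self, query, top_k_results):
        self.entries[query] = top_k_results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```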

List caching (or inverted index caching) keeps in main memory those parts of the inverted index that are frequently accessed. Here, list caching is based on blocks: if a particular term occurs frequently in queries, then the blocks containing its inverted list are most likely already in cache and do not have to be fetched from disk. Cached data is kept in compressed form in memory (uncompressed data would typically be about 3 to 8 times larger, depending on index format and compression method), so decompression speed must also be considered.

Figure 2: Two-level architecture with result caching at the query integrator and inverted index caching in the memory of each server.

Page 6:

Paper Contributions

(1) Detailed experimental evaluation of fast state-of-the-art index compression techniques, including Variable-Byte coding, Simple9 coding, Rice coding, and PForDelta coding

(2) Introduce new variations of these techniques, including an extension of Simple9 called Simple16, which we discovered during implementation and experimental evaluation

(3) Comparison of the performance of several caching policies for inverted list caching on real search engine query traces (AOL and Excite). LRU turns out not to be a good policy in this case

(4) Study of the benefits of combining index compression and index caching. The conclusion is that for almost the entire range of system parameters, PForDelta compression with LFU caching achieves the best performance, except for small cache sizes and fairly slow disks, where the authors' optimized Rice code is slightly better

Page 7:

Inverted index compression

Algorithms for list compression (limited to those that allow fast decompression):

1. Variable-byte encoding

2. Simple9 (S9) coding

3. Simple16 (S16) coding

4. PForDelta coding

5. Rice Coding

Figure 4: Total index size under different compression schemes. Results are shown for docIDs, frequencies, positions, an index with docIDs and positions only, and an index with all three fields.

Page 8:

Inverted index compression evaluation

Figure 5: Comparison of algorithms for compressing docIDs on inverted lists with different lengths.

Figure 6: Times for decompressing the entire inverted index.

Figure 7: Average compressed size of the data required for processing a query from the Excite trace.

Figure 8: Average decompression time (CPU only) per query, based on 100,000 queries from the Excite trace.

Page 9:

Experimental setup

Data set: ~7.4M web pages (selected randomly from a crawl of 120M pages in October 2002)

Total compressed size: ~36GB

Total # of distinct words: ~20M (269 distinct words per doc)

Total # of postings: ~2B

Query logs: ~1.2M queries issued to the Excite SE, ~36M issued to AOL

Page 10:

List caching policies

• the inverted index consists of 64KB blocks of compressed data (the block is the basic caching unit)

• the objective is to maximize the cache hit ratio under one of the following metrics:
(a) query-oriented, where we count a cache hit whenever all inverted lists for a query are in cache, and a miss otherwise;
(b) list-oriented, where we count a cache hit for each inverted list needed for a query that is found in cache;
(c) block- or data size-oriented, where we count the number of blocks or the amount of data served from cache versus fetched from disk during query processing (this is more appropriate for very large collections)
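The three metrics can be sketched for a single cache snapshot as follows; the helper and its names are hypothetical, assuming we know which blocks each term's inverted list occupies:

```python
# For each query, map each term to the list of block IDs its inverted
# list occupies, and count hits under the three metrics above given
# the set of blocks currently in cache.
def measure(queries, cached_blocks):
    query_hits = list_hits = list_total = block_hits = block_total = 0
    for term_blocks in queries:               # one dict per query: term -> blocks
        all_lists_cached = True
        for blocks in term_blocks.values():
            if all(b in cached_blocks for b in blocks):
                list_hits += 1                # (b) whole list served from cache
            else:
                all_lists_cached = False
            list_total += 1
            block_hits += sum(b in cached_blocks for b in blocks)
            block_total += len(blocks)        # (c) per-block accounting
        query_hits += all_lists_cached        # (a) every list of the query cached
    return query_hits, list_hits / list_total, block_hits / block_total
```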

Page 11:

1) Least Recently Used (LRU): baseline approach

2) Least Frequently Used (LFU): depends on the length of history list that we keep

3) Optimized Landlord: when an object is inserted into the cache, it is assigned a deadline given by the ratio between its benefit and its size. When we evict an object, we choose the one with the smallest deadline. Whenever an object already in cache is accessed, its deadline is reset to some suitable value. (In our case, every object has the same size and benefit, since we have an unweighted caching problem; so if we reset each deadline back to the same original value upon access, Landlord becomes identical to LRU.)

4) Multi-Queue (MQ): we use m = 8 different LRU queues, depending on the number of accesses to a block

5) Adaptive Replacement Cache: tries to balance between LRU and LFU
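Policy 2) can be sketched as follows; this is a minimal illustration with hypothetical names, approximating LFU with an unbounded access history (the slides note that real behavior depends on how much history is kept):

```python
from collections import defaultdict

# Minimal LFU sketch: evicts the cached block with the lowest access
# count. Counts are retained even for evicted blocks, i.e., this
# approximates LFU with unlimited history.
class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = defaultdict(int)   # access counts (history)
        self.cache = set()

    def access(self, block):
        self.freq[block] += 1
        if block in self.cache:
            return True                # hit
        if len(self.cache) >= self.capacity:
            victim = min(self.cache, key=lambda b: self.freq[b])
            self.cache.remove(victim)  # evict least frequently used
        self.cache.add(block)
        return False                   # miss
```

Unlike LRU, a block accessed many times in the past survives a burst of one-off accesses, which matches the skewed term frequencies in query traces.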


Page 12:

Comparison of list caching policies

Figure 9: Impact of history size on the performance of LFU, using an index compressed with PForDelta.

Figure 10: Cache hit rates for different caching policies and relative cache sizes. The graphs for LFU and MQ are basically on top of each other and largely indistinguishable. Optimal is the clairvoyant algorithm that knows all future accesses and thus provides an upper bound on the hit ratio. Each data point is based on a complete run over our query trace, where we measure the hit ratio for the last 100,000 queries.

Page 13:

Compression plus caching

Figure 12: Comparison of compression algorithms with different memory sizes, using 100,000 queries and LFU for caching. Total index size varied between 4.3 GB for variable-byte and 2.8 GB for Entropy.

Figure 13: Comparison of PForDelta, S16, and Rice coding, using LFU caching and assuming 10MB/s disk speed.

Figure 14: Comparison of PForDelta, S16, and Rice coding, using LFU caching and assuming 50MB/s disk speed.

Figure 15: Comparison of query processing costs for different disk speeds, using LFU and a fixed 128MB cache.

Page 14:

Conclusions

• detailed experiments were conducted for inverted index compression and caching (more attention is given to compression methods)

• all figure numbers correspond to those in the original paper

• open question 1: improve the compression speed of PForDelta

• open question 2: compression of position data (applied to web-based collections)

Page 15:

• Any questions?

Thank you for your attention!