Jinru He, Junyuan Zeng, and Torsten Suel Computer Science & Engineering Polytechnic Institute of NYU Improved Index Compression Techniques for Versioned Document Collections

Upload: leonardo-ping

Post on 11-Dec-2015


Jinru He, Junyuan Zeng, and Torsten Suel

Computer Science & Engineering
Polytechnic Institute of NYU

Improved Index Compression Techniques for Versioned Document Collections


Content of this Talk

• Introduction

• Related work

• Our improved approaches

• Conclusion and future work


Introduction

• What is a versioned document collection?


Versioned Document Collections


Versioned Document Collections

• Challenges
  • Index representation and compression
  • Index traversal techniques
  • Support for temporal range queries
  • Aggregated query processing (e.g. stable top-k)

Our focus is here


Content of this Talk

• Introduction

• Related work

• Our improved approaches

• Conclusion and future work


Related Work: Inverted Index

• An inverted index consists of inverted lists
• Each term has an inverted list
• Each inverted list is a sequence of postings
• Each posting contains a docID and a frequency value
• Inverted lists are sorted by docID and compressed
• Usually, to improve compressibility, we store the differences between consecutive docIDs instead of the docIDs themselves
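
The gap encoding described on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the function names `to_gaps` and `from_gaps` are hypothetical, and the docIDs 57 and 58 are made up to extend the 10, 30, 34 example used later in the slides.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Turn a strictly increasing docID list into the first ID followed by differences."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def from_gaps(gaps):
    """Invert to_gaps by taking a running sum."""
    return list(accumulate(gaps))

doc_ids = [10, 30, 34, 57, 58]   # 57 and 58 are illustrative values
gaps = to_gaps(doc_ids)          # small numbers, cheaper to encode than raw docIDs
restored = from_gaps(gaps)
```

The gaps are never larger than the docIDs they replace, which is what makes the subsequent integer compression step effective.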


Related Work: Index Compression Scheme

• Interpolative coding: Stuiver and Moffat 1997
  • Works very well for clustered data (but slow)
  • Thus, great for reordered collections
• OPT-PFD: Yan, Ding and Suel 2009
  • Works well for clustered data
  • Based on S. Heman (2005)
  • Decompression is fast
• Binary arithmetic coding
  • Binary coder driven by the probability of symbols
  • Works well if the prediction is good
  • Really slow, not practical (only used to achieve the theoretical bound)


Related Work on Archive Indexing

• One-level indexing of versioned collections
  • DIFF by P. Anick and R. Flynn 1992
    • Index the symmetric difference between versions.
  • Posting coalescing by Berberich 2007
    • Lossy compression method
  • MSA by Herscovici et al. 2007
    • A virtual document represents a range of versions.
  • He, Yan and Suel, CIKM 2009
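
The DIFF idea of indexing the symmetric difference between versions can be sketched as follows. This is only an illustration of the principle under simple assumptions (each version is a set of terms, and the first version is diffed against an empty document); the function names and the sample terms are hypothetical and do not come from the cited work.

```python
def diff_virtual_docs(versions):
    """versions: list of term sets, oldest first.
    Returns one term set per version: the symmetric difference with the
    previous version (the first version is compared with the empty set)."""
    prev, out = set(), []
    for v in versions:
        cur = set(v)
        out.append(cur ^ prev)   # terms that were added or removed
        prev = cur
    return out

def term_in_version(diffs, term, i):
    """A term is present in version i iff it toggled an odd number of times up to i."""
    return sum(term in d for d in diffs[: i + 1]) % 2 == 1

versions = [
    {"index", "compression"},
    {"index", "compression", "archive"},
    {"index", "archive"},
]
diffs = diff_virtual_docs(versions)
```

Only the changed terms are posted for each version, so a document whose versions differ by a few terms produces a few postings per version instead of one posting per term per version.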


Related Work on Archive Indexing

• Two-level indexes by Altingovde et al. 2008
  • The top level indexes the union of all versions
  • The lower level uses bit vectors (He et al. 2009)
  • The length of each bit vector is the number of versions of the document.
  • For the bit vector of term t, if t appears in the ith version, the ith position of the bit vector is set to 1; otherwise it is set to 0

First level (docIDs): 10, 30, 34 …
Second level (bit vector for one docID): 0 1 1 0 0 1 0 0 1 0
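
A minimal sketch of how such a two-level structure could be built, assuming each version is given as a set of terms. The function name `build_two_level`, the in-memory layout, and the sample collection are assumptions made for illustration; they are not taken from the cited papers.

```python
from collections import defaultdict

def build_two_level(collection):
    """collection: {doc_id: [term_set_of_version_0, term_set_of_version_1, ...]}
    Returns {term: [(doc_id, bit_vector), ...]} with postings sorted by doc_id.
    First level: a document is listed for a term if ANY version contains it.
    Second level: bit i is 1 iff version i contains the term."""
    index = defaultdict(list)
    for doc_id in sorted(collection):
        versions = collection[doc_id]
        union = set().union(*versions)
        for term in union:
            bits = [1 if term in v else 0 for v in versions]
            index[term].append((doc_id, bits))
    return dict(index)

collection = {                      # illustrative data
    10: [{"a"}, {"a", "b"}, {"b"}],
    30: [{"b"}, {"b"}],
}
index = build_two_level(collection)
```

In this layout the first level has one posting per (term, document) pair regardless of how many versions the document has, and all version-level detail lives in the bit vectors.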


Data Set

• 10% of Wikipedia from Jan. 2001 to Jan. 2008: 0.24 million documents with 35 versions on average per document.
• 1.06 million web pages from the Ireland domain, collected between 1996 and 2006 from the Internet Archive, with 15 versions on average per document.


Data Analysis

• Most changes are small
  • More than 50% of the changes between two consecutive versions involve fewer than 5 terms.
• Term changes are bursty
  • Terms that have just appeared are likely to disappear again shortly
• Change size is bursty
  • Fewer than 10% of the versions account for more than 50% of the changes in Wikipedia and more than 70% in Ireland
• Terms are dependent
  • 48.8% of terms disappear together if they appeared together
  • 30.5% of terms disappear together otherwise


Content of this Talk

• Introduction

• Related work

• Our improved approaches

  • Combinatorial approach

  • Better practical methods

  • Query processing

• Conclusion and future work


Combinatorial Approach

• For simplicity, each document version is a set (docIDs only)
• The first level stores the IDs of all documents in which the term has occurred
• The second level models the version bit vector
• Given the information about the known versions of a document, we derive models to predict what the next versions look like.

0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1


Combinatorial Lower Bound

• Based on communication complexity (compare Orlitsky 1990)
• Basic model:

For a document with m versions, let s be the total number of terms in the document and c_j the number of changes in version j. The total number of possible versions is:

It can be proved that the minimum number of bits required to encode these versions without ambiguity is


Combinatorial Upper Bound

• Idea: assign a probability to the next bit
• Using information in a document
• Can then use arithmetic coding to compress the bit vector
• Given the number of changes between two versions as ch() and the total number of terms in document d as s(d)
• Versions for more complicated models
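
The link between "assign a probability to the next bit" and the compressed size can be illustrated with a small sketch. An ideal arithmetic coder spends about -log2(p) bits on a symbol it predicted with probability p, so summing that quantity gives the size a model would achieve. The model below is a stand-in chosen for brevity (adaptive counts conditioned on the previous bit, with add-one smoothing); it is an assumption for illustration and not the ch()/s(d) model from the talk.

```python
import math

def ideal_code_length(bits, alpha=1.0):
    """Bits an ideal arithmetic coder would need for `bits` under an
    adaptive model that predicts each bit from the previous one.
    alpha is the smoothing count (hypothetical parameter)."""
    counts = {0: [alpha, alpha], 1: [alpha, alpha]}  # counts[prev_bit][next_bit]
    prev, total = 0, 0.0                             # assume an absent term before version 0
    for b in bits:
        c = counts[prev]
        p = c[b] / (c[0] + c[1])   # predicted probability of the bit that occurred
        total += -math.log2(p)     # ideal cost of coding it
        c[b] += 1                  # update the model after coding
        prev = b
    return total

# Bit vector shown on the "Combinatorial Approach" slide.
vector = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
cost = ideal_code_length(vector)
```

A better predictor puts more probability on the bit that actually occurs, which lowers the sum; that is why richer models of version changes translate directly into smaller indexes.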


Feature-Based Prediction

• In addition, what if we exploit more features in the versions?
• K-D tree: partition the 8-dimensional space.
  • Each bit is a point in the 8-dimensional space
  • Recursively partition each dimension of the k-D tree to decrease the overall entropy.
• Trade-off: the smaller the index size, the larger the size of the tree itself.
• Note: like C4.5
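
One step of such entropy-driven partitioning might look like the sketch below: among all axis-aligned splits, pick the one that minimizes the total number of bits needed to code the bit values on each side. The talk does not list the eight features, so the example uses two made-up feature dimensions; `entropy_bits` and `best_split` are hypothetical names, and a full tree would apply `best_split` recursively to each side.

```python
import math

def entropy_bits(labels):
    """Total bits needed to code 0/1 `labels` at their empirical probability."""
    n, ones = len(labels), sum(labels)
    if ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return -n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(points, labels):
    """Return (cost, dim, threshold) for the split that lowers the total
    entropy the most, or None if no split improves on the unsplit cost."""
    base, best = entropy_bits(labels), None
    for dim in range(len(points[0])):
        # Every distinct value except the largest is a candidate threshold.
        for thr in sorted({p[dim] for p in points})[:-1]:
            left = [l for p, l in zip(points, labels) if p[dim] <= thr]
            right = [l for p, l in zip(points, labels) if p[dim] > thr]
            cost = entropy_bits(left) + entropy_bits(right)
            if cost < base and (best is None or cost < best[0]):
                best = (cost, dim, thr)
    return best

points = [(0, 5), (1, 5), (0, 9), (1, 9)]   # illustrative 2-D feature vectors
labels = [0, 0, 1, 1]                       # the bit value observed at each point
split = best_split(points, labels)
```

Each extra split has to be stored with the tree, which is the trade-off the slide mentions: a deeper tree lowers the coded size of the bits but grows the model.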


Experiment Result

Lower-level index size in MB:

                     Wiki    Ireland
Previous result      140     304
Model lower bound    88      249
Feature-based        77.5    209

• Information about changes between versions has a huge impact.
• Additional features only result in moderate further gains.


Better Practical Methods

• Combinatorial methods achieve good compression
• Decompression is slow (arithmetic coding)
• Changes play a role
• Here we want to engineer existing compression methods to use change information
  • Two levels applied to DIFF, MSA
  • Bit vector reordering
  • Hybrid DIFF and MSA


Two-level DIFF and MSA

• Leverage change information and apply it to known index compression methods
• First level index:
  • The union of all versions
• Second level:
  • Each bit in the bit vector is a virtual document
  • Reorder the bits in the bit vector.
  • Apply standard compression techniques (OPT-PFD, IPC) to compress


Bit Vector Reordering on DIFF

• The bit vector is transformed into the DIFF bit vector
• Reorder the bit vector by the size of the changes between versions
• Index the gaps between '1' bits

Original bit vector:           1 1 1 1 1 1 0 1 0 1
DIFF bit vector:               1 0 0 0 0 0 1 1 1 1
Changes per version:           10 5 25 6 7 3 18 12 10 13
Changes, sorted descending:    25 18 13 12 10 10 7 6 5 3
Reordered DIFF bit vector:     0 1 1 1 1 1 0 0 0 0
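
The three steps on this slide can be reproduced on its own numbers with a short sketch. The function names are hypothetical, and two details are assumptions: the version before the first one is treated as not containing the term, and ties in change size keep their original version order.

```python
def diff_bits(bits):
    """Bit i is 1 iff the term's presence differs from the previous version."""
    prev, out = 0, []
    for b in bits:
        out.append(b ^ prev)
        prev = b
    return out

def reorder_by_changes(bits, changes):
    """Place the versions with the largest change size first (stable on ties)."""
    order = sorted(range(len(bits)), key=lambda i: -changes[i])
    return [bits[i] for i in order], order

def gaps_between_ones(bits):
    """Distances between consecutive 1 bits, counted from position -1."""
    positions = [i for i, b in enumerate(bits) if b]
    return [cur - prev for prev, cur in zip([-1] + positions[:-1], positions)]

original = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
changes = [10, 5, 25, 6, 7, 3, 18, 12, 10, 13]

diff = diff_bits(original)
reordered, order = reorder_by_changes(diff, changes)
gaps_before = gaps_between_ones(diff)
gaps_after = gaps_between_ones(reordered)
```

Because a term is more likely to toggle in a version with many changes, sorting by change size tends to pull the 1 bits together, which shortens the gaps that the integer coder has to store.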


Hybrid DIFF and MSA

• MSA
  • Works well when a virtual document contains many terms.
  • Works less well if there are too many small non-empty virtual documents
• Idea
  • Pick the large virtual documents in MSA
  • DIFF finishes up the rest of the postings


Experiment Result (docID only)

• Reordering yields a 30% improvement on Wikipedia
• Improvements on the Ireland data set are limited.


Experiment Result (docID and Frequency)


Query Processing

• We have achieved good compression…• What about the query processing?

• Actually, we can do it even better than previous work!


Query Processing

• Intersect first on the first level

Polytechnic:   10, 30, 34 …    0 1 1 0 0 1 0 0 1 0
Institute:     30, 33, 37 …    0 1 1 0 0 1 0 0 1 0


Query Processing

Polytechnic:   30…    0 0 1 0
Institute:     30…    0 1 1 0
AND:                  0 0 1 0

• The bit vectors corresponding to the result docIDs are fetched
• AND the bit vectors
• The first-level index is small, and many bit vectors can be skipped to speed up second-level query processing!
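
The two-step evaluation described on these slides (intersect the first-level docID lists, then AND only the bit vectors of the surviving documents) can be sketched as follows. The in-memory layout `{term: {doc_id: bit_vector}}`, the function names, and the bit vectors of every document other than 30 are assumptions made for illustration.

```python
def intersect_sorted(a, b):
    """Standard merge-style intersection of two sorted docID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def conjunctive_query(index, terms):
    """Return {doc_id: [version numbers containing all terms]}."""
    doc_lists = [sorted(index[t]) for t in terms]
    docs = doc_lists[0]
    for lst in doc_lists[1:]:
        docs = intersect_sorted(docs, lst)       # first level
    result = {}
    for d in docs:                               # second level, survivors only
        vectors = [index[t][d] for t in terms]
        anded = [int(all(col)) for col in zip(*vectors)]
        versions = [i for i, bit in enumerate(anded) if bit]
        if versions:
            result[d] = versions
    return result

index = {
    "polytechnic": {10: [1, 0, 0, 1], 30: [0, 0, 1, 0], 34: [1, 1, 1, 1]},
    "institute": {30: [0, 1, 1, 0], 33: [1, 0, 0, 0], 37: [0, 0, 0, 1]},
}
hits = conjunctive_query(index, ["polytechnic", "institute"])
```

In this example only document 30 survives the first-level intersection, so the bit vectors of documents 10, 33, 34 and 37 are never touched.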


Query Processing Result

• The 2-DIFF query processing is about 32% faster.
• Mix is good.
• 2R-DIFF gains a moderate improvement while achieving the best compressibility


Conclusion and Future Work

• New index organization
  • Reduced index size
  • Faster query processing
• Even a simple model that exploits change information makes a difference.
• Future work:
  • Generative models with user behavior
  • Different classes of query processing (stable top-k, temporal query processing)


Q & A

Thank You!