
Faceted Search with Lucene

Shai Erera
Researcher, IBM

Who Am I

• Working at IBM – Information Retrieval Research
• Lucene/Solr committer and PMC member
• http://shaierera.blogspot.com
• [email protected]

Lucene Facets 101

Faceted Search

• Technique for accessing documents that were classified into a taxonomy of categories
  – Flat: Author/John Doe, Tags/Lucene, Popularity/High
  – Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)
• Quick overview of the breakdown of the search results
  – How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet
• Simplifies interaction with the search application
  – Drill down to issues that were updated in the Past 2 days by clicking a link
  – No knowledge of the search syntax or index schema is required

http://jirasearch.mikemccandless.com


Lucene Facets

• Contributed by IBM in 2011, released in 3.4.0
• Major changes since 4.1.0
  – NRT support
  – Nearly 400% search speedups
  – Complete API revamp
  – New features (SortedSet, range faceting, drill-sideways)
• Two main indexing-time modes
  – Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
  – SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost
• Runtime modes
  – Range facets (on NumericDocValues fields)
• Other implementations: Solr, ElasticSearch, Bobo Browse

Lucene Facet Components

• TaxonomyWriter/Reader
  – Manage the taxonomy information
• FacetFields
  – Adds facet information to documents (DocValues fields, drilldown terms)
• FacetRequest
  – Defines which facets to aggregate and the FacetsAggregator (aggregation function) to use
• FacetsCollector
  – Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
• DrillDownQuery / DrillSideways
  – Execute drilldown and drill-sideways requests

Sample Code – Indexing

// Builds the taxonomy as documents are indexed; safe for multi-threaded use, keep a single instance
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

// Adds facets information to a document; can be initialized once per thread
FacetFields facetFields = new FacetFields(taxoWriter);

// List of categories to add to the document
List<CategoryPath> cats = new ArrayList<CategoryPath>();
cats.add(new CategoryPath("Author", "Erik Hatcher"));
cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));

Document bookDoc = new Document();
bookDoc.add(new TextField("title", "lucene in action", Field.Store.YES));

// Add the category fields (DocValues, postings)
facetFields.addFields(bookDoc, cats);

// Index the document
indexWriter.addDocument(bookDoc);

Sample Code – Search

// Open an NRT TaxonomyReader (requires a DirectoryTaxonomyWriter)
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);

// Define the facets to aggregate (top-10 categories for each)
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));

// Collect both the top-K facets and the top-N matching documents
TopDocsCollector tdc = TopScoreDocCollector.create(10, true);
FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
Query q = new TermQuery(new Term("title", "lucene"));
searcher.search(q, MultiCollector.wrap(tdc, fc));

// Traverse the top facets
for (FacetResult fres : fc.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  System.out.println(String.format("%s (%d)", root.label, (int) root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    // Print the last path component of the child, e.g. the author's name
    System.out.println("  " + cat.label.components[cat.label.length - 1] + " (" + (int) cat.value + ")");
  }
}

Drilldown and Drill-Sideways

• Drilldown adds a filter to the search
  – Multiple categories can be OR'd

// Drilldown: filter results to "Component/core/index";
// all other "Component/*" and "Component/core/*" categories get count 0
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
ddq.add(new CategoryPath("Component/core/index", '/'));

• Drill-sideways applies the drilldown, yet still aggregates the "sideways" categories

// Drill-sideways: drill down on "Component/core/index";
// other "Component/*" and "Component/core/*" categories are counted too
DrillSideways ds = new DrillSideways(searcher, taxoReader);
DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);

http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
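The difference between the two counting modes can be sketched without Lucene at all. The following is a minimal in-memory illustration, assuming flat (non-hierarchical) values and one value per dimension; the class DrillSidewaysSketch, its doc helper and the Status dimension are hypothetical names for this example only, not part of the Lucene API. Unlike the slide, categories with a zero count are simply omitted from the result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: flat in-memory documents, not the Lucene DrillSideways API.
class DrillSidewaysSketch {
    /** Hypothetical helper: a document with one value per dimension. */
    static Map<String, String> doc(String component, String status) {
        Map<String, String> d = new HashMap<>();
        d.put("Component", component);
        d.put("Status", status);
        return d;
    }

    /** True if the doc satisfies every drilldown selection, optionally ignoring one dimension. */
    static boolean matches(Map<String, String> doc, Map<String, String> selections, String ignoredDim) {
        for (Map.Entry<String, String> sel : selections.entrySet()) {
            if (sel.getKey().equals(ignoredDim)) continue;
            if (!sel.getValue().equals(doc.get(sel.getKey()))) return false;
        }
        return true;
    }

    /**
     * Counts the values of one dimension. In sideways mode the selection on that same
     * dimension is ignored, so its sibling values keep their counts.
     */
    static Map<String, Integer> count(List<Map<String, String>> docs, Map<String, String> selections,
                                      String dim, boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            String value = d.get(dim);
            if (value != null && matches(d, selections, sideways ? dim : null)) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("core/index", "Open"));
        docs.add(doc("core/index", "Closed"));
        docs.add(doc("core/search", "Open"));
        docs.add(doc("facet", "Open"));

        Map<String, String> selections = new HashMap<>();
        selections.put("Component", "core/index");

        System.out.println("drilldown:      " + count(docs, selections, "Component", false));
        System.out.println("drill-sideways: " + count(docs, selections, "Component", true));
    }
}
```

Dimensions that carry no selection (Status here) get the same counts in both modes, since only the drilled-down dimension is relaxed.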


Dynamic Facets

• Range facets on NumericDocValues fields
  – Define the buckets of interest at query time
  – Supports any arbitrary ValueSource (Lucene 4.6.0)

// Aggregate matching documents into buckets
RangeAccumulator a = new RangeAccumulator(
    new RangeFacetRequest<LongRange>("field",
        new LongRange("1-5", 1L, true, 5L, true),
        new LongRange("6-20", 6L, true, 20L, true),
        new LongRange("21-100", 21L, true, 100L, true), // inclusive on both ends, matching the label
        new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));
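How the inclusive/exclusive flags decide which bucket a value falls into can be illustrated with a small standalone sketch. RangeFacetSketch and its Range class are hypothetical stand-ins written for this example (the constructor argument order merely mirrors the LongRange calls on the slide); this is not the Lucene implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch only: counts numeric values into labeled ranges, independent of Lucene.
class RangeFacetSketch {
    static final class Range {
        final String label;
        final long min, max;
        final boolean minInclusive, maxInclusive;

        Range(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accept(long v) {
            boolean aboveMin = minInclusive ? v >= min : v > min;
            boolean belowMax = maxInclusive ? v <= max : v < max;
            return aboveMin && belowMax;
        }
    }

    /** Counts how many values fall into each range, keeping the ranges in request order. */
    static Map<String, Integer> count(long[] values, Range... ranges) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Range r : ranges) counts.put(r.label, 0);
        for (long v : values) {
            for (Range r : ranges) {
                if (r.accept(v)) counts.merge(r.label, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] values = {1, 5, 6, 21, 100, 101, 500};
        System.out.println(count(values,
            new Range("1-5", 1L, true, 5L, true),
            new Range("6-20", 6L, true, 20L, true),
            new Range("21-100", 21L, true, 100L, true),
            new Range("over 100", 100L, false, Long.MAX_VALUE, true)));
    }
}
```

A value that no range accepts is silently dropped, which is why a bucket labeled 21-100 needs inclusive bounds at both ends if 21 and 100 are meant to be counted.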

Facet Associations

• Not all facets are created equal
  – Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
  – Important metadata about the facet, e.g. Contracts/US ($5M) (total $$$ generated from contracts)
  – Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)
• Categories can have values associated with them per document
  – They are later aggregated by these values
  – NOTE: ≠ NumericDocValuesFields!
• Facet associations are completely customizable – encoded as a byte[] per document

http://shaierera.blogspot.com/2013/01/facet-associations.html
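Aggregating by association value rather than by document count can be sketched as follows. This is only an illustration of the idea, assuming each matching document carries a category-to-value map with plain long values; the byte[] encoding mentioned above is left out, and AssociationSketch is a hypothetical name, not a Lucene class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: sums per-document association values for each category.
class AssociationSketch {
    /** Each map holds the categories of one matching document and their associated values. */
    static Map<String, Long> sumAssociations(List<Map<String, Long>> matchingDocs) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> doc : matchingDocs) {
            for (Map.Entry<String, Long> e : doc.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> doc1 = new HashMap<>();
        doc1.put("Contracts/US", 2_000_000L);
        Map<String, Long> doc2 = new HashMap<>();
        doc2.put("Contracts/US", 3_000_000L);
        doc2.put("Contracts/EU", 1_000_000L);

        List<Map<String, Long>> docs = new ArrayList<>();
        docs.add(doc1);
        docs.add(doc2);
        System.out.println(sumAssociations(docs));
    }
}
```

With plain counting both documents would contribute 1 to Contracts/US; here they contribute their contract values instead.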


More Features

• Complements
  – Holds the count of each category in memory, per IndexReader
  – When the number of search results is >50% of the index, count the "complement set" (the non-matching documents) and subtract it from the totals
  – Useful for "overview" queries, e.g. MatchAllDocsQuery
• Sampling
  – Aggregate a sampled set of the search results
  – Optionally re-count the top-K facets for accurate values
• Partitions
  – Partition the taxonomy space to control memory usage during faceted search
  – Useful for very big taxonomies (10s of millions of categories)
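The complements idea can be sketched in a few lines. The example below assumes documents are represented as arrays of category ordinals and that the per-category totals over the whole index were computed once up front; ComplementSketch and its method names are hypothetical, and the 50% threshold is taken from the slide.

```java
// Conceptual sketch only: counts facets either directly or via the complement set.
class ComplementSketch {
    /** Per-category counts over the entire index, computed once and kept in memory. */
    static int[] totalCounts(int[][] docOrdinals, int numOrdinals) {
        int[] totals = new int[numOrdinals];
        for (int[] ords : docOrdinals) {
            for (int ord : ords) totals[ord]++;
        }
        return totals;
    }

    /**
     * Counts categories of the matching documents. When more than half of the index
     * matches, it visits the (smaller) non-matching set and subtracts from the totals.
     */
    static int[] count(int[][] docOrdinals, boolean[] matches, int[] totals) {
        int numMatches = 0;
        for (boolean m : matches) if (m) numMatches++;
        boolean useComplement = numMatches * 2 > matches.length;

        int[] counts = useComplement ? totals.clone() : new int[totals.length];
        for (int doc = 0; doc < matches.length; doc++) {
            // Visit non-matching docs when complementing, matching docs otherwise.
            if (useComplement != matches[doc]) {
                for (int ord : docOrdinals[doc]) counts[ord] += useComplement ? -1 : 1;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0, 1}, {0}, {1, 2}, {0, 2}};
        int[] totals = totalCounts(docOrdinals, 3);
        boolean[] mostlyMatching = {true, true, true, false};
        System.out.println(java.util.Arrays.toString(count(docOrdinals, mostlyMatching, totals)));
    }
}
```

Both paths return the same counts; the complement path just touches fewer documents when most of the index matches.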

Lucene Facets Under the Hood

The Taxonomy Index

• The taxonomy maps categories to integer codes (referred to as ordinals)
  – Kind of like a Map<CategoryPath,Integer>, with hierarchy support
  – Provides taxonomy browsing services
  – DirectoryTaxonomyWriter manages it as a sidecar Lucene index
• Categories are broken down into their path components, e.g. Date/2012/March/20 becomes:
  – Date, with ordinal=1
  – Date/2012, with ordinal=2
  – Date/2012/March, with ordinal=3
  – Date/2012/March/20, with ordinal=4
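The "Map with hierarchy support" analogy can be made concrete with a small in-memory stand-in. TaxonomySketch below is a hypothetical class for illustration only (the real taxonomy is a Lucene index); it assumes ordinal 0 is reserved for the root and that a '/' separates path components.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: an in-memory stand-in for the taxonomy, not Lucene's implementation.
class TaxonomySketch {
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();
    private final List<Integer> parents = new ArrayList<>();

    TaxonomySketch() {
        ordinals.put("", 0); // assumption: ordinal 0 is the root of the taxonomy
        parents.add(-1);     // the root has no parent
    }

    /** Adds a category and all of its ancestors; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) path.append('/');
            path.append(component);
            String key = path.toString();
            Integer ord = ordinals.get(key);
            if (ord == null) {
                ord = ordinals.size(); // next free ordinal
                ordinals.put(key, ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    /** Returns the ordinal of a path, or -1 if it is unknown. */
    int getOrdinal(String path) {
        Integer ord = ordinals.get(path);
        return ord == null ? -1 : ord;
    }

    int getParent(int ordinal) {
        return parents.get(ordinal);
    }

    public static void main(String[] args) {
        TaxonomySketch taxo = new TaxonomySketch();
        int leaf = taxo.addCategory("Date", "2012", "March", "20");
        System.out.println("Date/2012/March/20 -> " + leaf);
        System.out.println("parent of " + leaf + " -> " + taxo.getParent(leaf));
    }
}
```

Adding Date/2012/March/20 to an empty instance assigns ordinals 1 to 4 to the four path prefixes, as in the breakdown above; adding an already known path returns its existing ordinal.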

The Search Index

• Categories are added as drilldown terms, e.g. for Date/2012/March/20:
  – $facets:Date
  – $facets:Date/2012
  – …
• All category ordinals associated with the document are added as a BinaryDocValuesField
  – The ordinals of all path components are added, not just the leaf's
  – Encoded as VInt + gap for efficient compression and speed
    • Other compression methods were attempted, but were slower to decode (LUCENE-4609)
  – Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)

https://issues.apache.org/jira/browse/LUCENE-4609
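The "VInt + gap" encoding can be illustrated with a short sketch: ordinals are sorted, each one is replaced by its difference from the previous one, and every gap is written as a variable-length integer with 7 payload bits per byte. The byte order used here (low bits first, high bit as continuation flag) is an assumption made for this example and is not necessarily the layout Lucene's encoder uses; DGapVIntSketch is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: delta-gap + variable-length int encoding of non-negative ordinals.
class DGapVIntSketch {
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted); // gaps are only small and non-negative if the ordinals are sorted
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev;
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // 7 payload bits, high bit = more bytes follow
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        List<Integer> result = new ArrayList<>();
        int prev = 0, value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;      // continuation bit set: keep reading this gap
            } else {
                prev += value;   // gap complete: add it to the previous ordinal
                result.add(prev);
                value = 0;
                shift = 0;
            }
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new int[] {4, 1, 3, 2, 300});
        System.out.println(encoded.length + " bytes -> " + Arrays.toString(decode(encoded)));
    }
}
```

Because a path's ordinals tend to be close together, most gaps fit in a single byte, which is where the compression comes from.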

SortedSet Facets

• SortedSetFacetFields adds SortedSetDocValuesFields and drilldown terms to documents
• Segment-local SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
• Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
• Advantages:
  – Taxonomy representation requires less RAM (flat taxonomy)
  – No sidecar index
  – Tie-breaks by label-sort order
• Disadvantages:
  – Not a full taxonomy (flat facets only)
  – Overall uses more RAM (local-to-global ordinal mapping)
  – Adds NRT reopen cost
  – Slower than taxonomy-based facets
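The local-to-global ordinal mapping can be pictured with a toy example. The sketch assumes each segment numbers its own labels in sorted order and that the global numbering is the sorted union of all labels; OrdinalMapSketch and its methods are hypothetical and unrelated to the SortedSetDocValuesReaderState internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Conceptual sketch only: maps per-segment label ordinals to index-wide ordinals.
class OrdinalMapSketch {
    /** Global ordinals: position of each label in the sorted union of all segments' labels. */
    static List<String> globalLabels(String[][] segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] labels : segmentLabels) all.addAll(Arrays.asList(labels));
        return new ArrayList<>(all);
    }

    /** For one segment: result[localOrd] is the global ordinal of that segment's label. */
    static int[] localToGlobal(String[] segmentLabels, List<String> global) {
        int[] map = new int[segmentLabels.length];
        for (int local = 0; local < segmentLabels.length; local++) {
            map[local] = Collections.binarySearch(global, segmentLabels[local]);
        }
        return map;
    }

    public static void main(String[] args) {
        String[][] segments = {{"Lucene", "Solr"}, {"Bobo", "Solr"}};
        List<String> global = globalLabels(segments);
        System.out.println("global labels: " + global);
        for (String[] segment : segments) {
            System.out.println(Arrays.toString(segment) + " -> " + Arrays.toString(localToGlobal(segment, global)));
        }
    }
}
```

The mapping arrays are the extra RAM listed under the disadvantages, and they have to be rebuilt whenever the set of segments changes, which is one way to read the added NRT reopen cost.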

Global Ordinals

• Per-segment integer codes (as used by the SortedSet approach) are less efficient
  – Different ordinals for the same categories across segments
  – Hold an in-memory code map (e.g. local-to-global): more RAM and less scalable
  – Resolve top-K on the String representation of categories: more CPU
• Global ordinals allow efficient per-segment faceting and aggregation
  – No translation maps required (no extra RAM, highly scalable)
  – Aggregation and top-K computation are done on integer codes
• But they do not play well with IndexWriter.addIndexes(Directory…)
  – Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination's
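Computing the top-K purely on integer codes can be sketched with a bounded min-heap over a counts array indexed by global ordinal. TopKSketch is a hypothetical helper; the tie-break rule (lower ordinal wins) is an assumption made for this example.

```java
import java.util.PriorityQueue;

// Conceptual sketch only: top-K selection over an int[] of counts indexed by ordinal.
class TopKSketch {
    /** Returns the ordinals of the k highest counts, best first; zero counts are skipped. */
    static int[] topK(int[] counts, int k) {
        // Min-heap: the weakest candidate sits on top. On equal counts the larger
        // ordinal is considered weaker, so lower ordinals win ties.
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) ->
            counts[a] != counts[b] ? Integer.compare(counts[a], counts[b]) : Integer.compare(b, a));
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] == 0) continue;
            heap.add(ord);
            if (heap.size() > k) heap.poll(); // evict the weakest candidate
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll(); // weakest comes out first, so fill from the end
        }
        return result;
    }

    public static void main(String[] args) {
        int[] counts = {0, 5, 2, 5, 1};
        System.out.println(java.util.Arrays.toString(topK(counts, 2)));
    }
}
```

Labels are only needed for the K winners, so no String comparison happens during aggregation or selection.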

Two-Phase Aggregation

• FacetsCollector works in two steps:
  – Collects matching documents (and optionally their scores)
  – Invokes FacetsAccumulator to accumulate the top-K facets
• Performance tests show that this improves faceted search performance (LUCENE-4600)
  – Locality of reference?
• Useful for Sampling and Complements
  – Hard to do otherwise

https://issues.apache.org/jira/browse/LUCENE-4600
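The two steps can be sketched as a collector that only remembers matching document IDs and a separate pass that aggregates afterwards. TwoPhaseSketch is a hypothetical class for illustration, assuming documents are represented as arrays of category ordinals; it does not reproduce the FacetsCollector or FacetsAccumulator APIs.

```java
import java.util.BitSet;

// Conceptual sketch only: phase 1 records matches, phase 2 aggregates them in doc-ID order.
class TwoPhaseSketch {
    private final BitSet matching = new BitSet();

    /** Phase 1: called once per matching document while the query runs. */
    void collect(int docId) {
        matching.set(docId);
    }

    int numMatches() {
        return matching.cardinality();
    }

    /** Phase 2: after the search, walk the matches and count their category ordinals. */
    int[] accumulate(int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{1, 2}, {1}, {2, 3}, {1, 3}};
        TwoPhaseSketch collector = new TwoPhaseSketch();
        collector.collect(3);
        collector.collect(0);
        System.out.println(java.util.Arrays.toString(collector.accumulate(docOrdinals, 4)));
    }
}
```

Since the number of matches is known before phase 2 starts, this is also the natural point to decide whether to sample the matches or to count the complement set instead.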

FacetIndexingParams

• Determine how facets are encoded
  – Partition size
  – Facet delimiter character (for drilldown terms, default \u001F)
  – CategoryListParams
• CategoryListParams holds parameters for a category list
  – Encoder/Decoder (default DGapVInt)
  – OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
• CategoryListParams can be used to group facets together
  – Default: all facets are put in the same "category list" (i.e. one BinaryDocValues field)
  – Expert: separate categories by dimension into different category lists
    • Useful when sets of categories are always aggregated together, but not with other categories
• FacetIndexingParams are currently not recorded per-segment, so be careful if you change them on an existing index!
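One way to read the three OrdinalPolicy values is as a choice of which ordinals along a category's path get written for a document. The sketch below follows that reading; the enum constants reuse the names from the slide, but OrdinalPolicySketch, its method, and the handling of a single-component category are assumptions for illustration rather than Lucene's actual code.

```java
import java.util.Arrays;

// Conceptual sketch only: selects which path ordinals to encode under each policy.
class OrdinalPolicySketch {
    enum Policy { ALL_PARENTS, NO_PARENTS, ALL_BUT_DIMENSION }

    /**
     * pathOrdinals holds the ordinals from the dimension (index 0) down to the
     * category itself (last index), e.g. {1, 2, 3, 4} for Date/2012/March/20.
     */
    static int[] ordinalsToEncode(int[] pathOrdinals, Policy policy) {
        switch (policy) {
            case NO_PARENTS:
                // Only the category itself.
                return new int[] {pathOrdinals[pathOrdinals.length - 1]};
            case ALL_BUT_DIMENSION:
                // Everything except the top-level dimension; assumption: a category
                // that is itself a dimension is still encoded.
                return pathOrdinals.length == 1
                    ? pathOrdinals.clone()
                    : Arrays.copyOfRange(pathOrdinals, 1, pathOrdinals.length);
            default:
                // ALL_PARENTS: the category and every ancestor.
                return pathOrdinals.clone();
        }
    }

    public static void main(String[] args) {
        int[] path = {1, 2, 3, 4}; // Date, Date/2012, Date/2012/March, Date/2012/March/20
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + Arrays.toString(ordinalsToEncode(path, policy)));
        }
    }
}
```

Fewer encoded ordinals mean smaller per-document data, at the price of having to roll counts up to the parents at search time.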

Questions?
