Faceted Search with Lucene

Upload: lucenerevolution

Post on 11-May-2015

DESCRIPTION

Faceted search is a powerful technique that lets users easily navigate search results. It can also be used to build rich user interfaces that give an analyst quick insight into the document space. In this session I will introduce the Facets module and how to use it, and cover under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.

TRANSCRIPT

Page 1: Faceted Search with Lucene
Page 2: Faceted Search with Lucene

Faceted Search with Lucene

Shai Erera
Researcher, IBM

Page 3: Faceted Search with Lucene

• Working at IBM – Information Retrieval Research
• Lucene/Solr committer and PMC member
• http://shaierera.blogspot.com
• [email protected]

Who Am I

Page 4: Faceted Search with Lucene

Lucene Facets 101

Page 5: Faceted Search with Lucene

• Technique for accessing documents that were classified into a taxonomy of categories
– Flat: Author/John Doe, Tags/Lucene, Popularity/High
– Hierarchical: Computers/Software/Information Retrieval/Fulltext/Apache Lucene (ODP)
• Quick overview of the breakdown of the search results
– How many documents are in category Committed Paths/lucene/core vs. Committed Paths/lucene/facet
• Simplifies interaction with the search application
– Drill down to issues that were updated in the past 2 days by clicking a link
– No knowledge required about search syntax and index schema

Faceted Search

http://jirasearch.mikemccandless.com

Page 6: Faceted Search with Lucene

• Contributed by IBM in 2011, released in 3.4.0
• Major changes since 4.1.0+
– NRT support
– Nearly 400% search speedups
– Complete API revamp
– New features (SortedSet, range faceting, drill-sideways)
• Two main indexing-time modes
– Taxonomy-based: hierarchical facets, managed by a sidecar index, low NRT reopen cost
– SortedSetDocValues: flat facets only, no sidecar index, higher NRT reopen cost
• Runtime modes
– Range facets (on NumericDocValues fields)
• Other implementations: Solr, ElasticSearch, Bobo Browse

Lucene Facets

Page 7: Faceted Search with Lucene

• TaxonomyWriter/Reader
– Manage the taxonomy information
• FacetFields
– Add facets information to documents (DocValues fields, drilldown terms)
• FacetRequest
– Defines which facets to aggregate and the FacetsAggregator (aggregation function)
• FacetsCollector
– Collects matching documents and computes the top-K categories for each facet request (invokes FacetsAccumulator)
• DrillDownQuery / DrillSideways
– Execute drilldown and drill-sideways requests

Lucene Facet Components

Page 8: Faceted Search with Lucene

// Builds the taxonomy as documents are indexed; thread-safe, use a single instance
TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

// Adds facets information to a document; can be initialized once per thread
FacetFields facetFields = new FacetFields(taxoWriter);

// List of categories to add to the document
List<CategoryPath> cats = new ArrayList<CategoryPath>();
cats.add(new CategoryPath("Author", "Erik Hatcher"));
cats.add(new CategoryPath("Author/Otis Gospodnetić", '/'));
cats.add(new CategoryPath("Pub Date", "2004", "December", "1"));

Document bookDoc = new Document();
bookDoc.add(new TextField("title", "lucene in action", Store.YES));

// Add the category fields (DocValues, postings)
facetFields.addFields(bookDoc, cats);

// Index the document
indexWriter.addDocument(bookDoc);

Sample Code – Indexing

Page 9: Faceted Search with Lucene

// Open an NRT TaxonomyReader
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);

// Define the facets to aggregate (top-10 categories for each)
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));

// Collect both top-K facets and top-N matching documents
TopDocsCollector tdc = TopScoreDocCollector.create(10, true);
FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
Query q = new TermQuery(new Term("title", "lucene"));
searcher.search(q, MultiCollector.wrap(tdc, fc));

// Traverse the top facets
for (FacetResult fres : fc.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  System.out.println(String.format("%s (%d)", root.label, root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    System.out.println("  " + cat.label.components[0] + " (" + cat.value + ")");
  }
}

Sample Code – Search

Page 10: Faceted Search with Lucene

• Drilldown adds a filter to the search
– Multiple categories can be OR'd

// Drilldown: filter results to "Component/core/index";
// all other "Component/*" and "Component/core/*" get count 0
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(facetIndexingParams, base);
ddq.add(new CategoryPath("Component/core/index", '/'));

• Drill-sideways allows drilldown, yet still aggregates the "sideways" categories

// Drill-sideways: drill down on "Component/core/index";
// other "Component/*" and "Component/core/*" are counted too
DrillSideways ds = new DrillSideways(searcher, taxoReader);
DrillSidewaysResult sidewaysRes = ds.search(null, ddq, 10, fsp);

Drilldown and Drill-Sideways

http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
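The difference between the two counting modes can be sketched in plain Java. This is only an illustration of the semantics described on the slide, not the Lucene DrillSideways implementation; the class `DrillSidewaysSketch`, its `doc` helper and the flat dimension/value document model are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the counting semantics only; not the Lucene API.
final class DrillSidewaysSketch {

    /** Builds a document from alternating dimension/value strings. */
    static Map<String, String> doc(String... dimValuePairs) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < dimValuePairs.length; i += 2) {
            d.put(dimValuePairs[i], dimValuePairs[i + 1]);
        }
        return d;
    }

    /**
     * Counts the values of countDim over the documents that pass the drill-downs.
     * With sideways=true the drill-down on countDim itself is ignored, so the
     * sibling values of that dimension keep their counts.
     */
    static Map<String, Integer> count(List<Map<String, String>> docs,
                                      Map<String, String> drillDowns,
                                      String countDim, boolean sideways) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            boolean pass = true;
            for (Map.Entry<String, String> dd : drillDowns.entrySet()) {
                if (sideways && dd.getKey().equals(countDim)) {
                    continue; // skip the filter on the dimension being counted
                }
                if (!dd.getValue().equals(d.get(dd.getKey()))) {
                    pass = false;
                    break;
                }
            }
            String value = d.get(countDim);
            if (pass && value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a drill-down on Component=core, the plain mode reports only `core` for the Component dimension, while the sideways mode also reports the other components that matched the base query.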

Page 11: Faceted Search with Lucene

• Range facets on NumericDocValues fields
– Define the buckets of interest at query time
– Supports any arbitrary ValueSource (Lucene 4.6.0)

// Aggregate matching documents into buckets
RangeAccumulator a = new RangeAccumulator(new RangeFacetRequest<LongRange>("field",
    new LongRange("1-5", 1L, true, 5L, true),
    new LongRange("6-20", 6L, true, 20L, true),
    new LongRange("21-100", 21L, true, 100L, true),
    new LongRange("over 100", 100L, false, Long.MAX_VALUE, true)));

Dynamic Facets
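A minimal sketch of what bucketing by range amounts to, assuming one numeric value per matching document. `RangeSketch` is a hypothetical stand-in that mirrors the (label, min, minInclusive, max, maxInclusive) shape of the ranges above; it is not the Lucene LongRange class.

```java
// Hypothetical stand-in for a labelled numeric range; not the Lucene LongRange class.
final class RangeSketch {
    final String label;
    final long min; // inclusive after normalization
    final long max; // inclusive after normalization

    RangeSketch(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
        this.label = label;
        // Assumption: an exclusive bound is never Long.MIN_VALUE or Long.MAX_VALUE,
        // otherwise the adjustment below would overflow.
        this.min = minInclusive ? min : min + 1;
        this.max = maxInclusive ? max : max - 1;
    }

    boolean accept(long value) {
        return value >= min && value <= max;
    }

    /** Returns one count per range; a value may fall into zero or several ranges. */
    static int[] count(long[] values, RangeSketch... ranges) {
        int[] counts = new int[ranges.length];
        for (long v : values) {
            for (int i = 0; i < ranges.length; i++) {
                if (ranges[i].accept(v)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }
}
```

Because the buckets are plain predicates over the stored value, they can be chosen per query without re-indexing, which is what makes this a runtime mode.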

Page 12: Faceted Search with Lucene

• Not all facets are created equal
– Categories added by an automatic categorization system, e.g. Category/Apache Lucene (0.74) (confidence level is 0.74)
– Important metadata about the facet, e.g. Contracts/US ($5M) (total $$$ generated from contracts)
– Complex structures, e.g. Users/Shai Erera (lastAccess=YYYY/MM/DD, numUpdates=8…)
• Categories can have values associated with them per document
– They are later aggregated by these values
– NOTE: ≠ NumericDocValuesFields!
• Facet associations are completely customizable – encoded as a byte[] per document

Facet Associations

http://shaierera.blogspot.com/2013/01/facet-associations.html
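One way to picture a per-document association payload is as (ordinal, value) pairs packed into a byte[] and summed per category at search time. The sketch below assumes a fixed 4-byte ordinal followed by a 4-byte float; that layout and the class `AssociationSketch` are illustrative assumptions, not the encoding Lucene uses.

```java
import java.nio.ByteBuffer;

// Hypothetical (ordinal, float) payload layout; the real Lucene encoding may differ.
final class AssociationSketch {

    /** Packs each category ordinal with its associated value into one byte[] per document. */
    static byte[] encode(int[] ordinals, float[] values) {
        ByteBuffer buf = ByteBuffer.allocate(ordinals.length * 8);
        for (int i = 0; i < ordinals.length; i++) {
            buf.putInt(ordinals[i]);
            buf.putFloat(values[i]);
        }
        return buf.array();
    }

    /** Aggregates by summing the associated values per ordinal over all matching documents. */
    static float[] sumByOrdinal(byte[][] matchingDocs, int numOrdinals) {
        float[] sums = new float[numOrdinals];
        for (byte[] doc : matchingDocs) {
            ByteBuffer buf = ByteBuffer.wrap(doc);
            while (buf.remaining() >= 8) {
                int ordinal = buf.getInt();
                float value = buf.getFloat();
                sums[ordinal] += value;
            }
        }
        return sums;
    }
}
```

Summing is only one possible aggregation; since the payload is opaque bytes, the same structure could carry a confidence level, a contract amount, or a more complex record.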

Page 13: Faceted Search with Lucene

• Complements
– Holds the count of each category in-memory, per IndexReader
– When number of search results is >50% of the index, count the "complement set"
– Useful for "overview" queries, e.g. MatchAllDocsQuery
• Sampling
– Aggregate a sampled set of the search results
– Optionally re-count top-K facets for accurate values
• Partitions
– Partition the taxonomy space to control memory usage during faceted search
– Useful for very big taxonomies (10s of millions of categories)

More Features
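The complements idea can be sketched as follows, assuming the total count of every category over the whole index is already cached. `ComplementSketch` and its array-based document model are hypothetical; the 50% threshold is the one quoted on the slide.

```java
// Hypothetical sketch of complement counting; not the Lucene implementation.
final class ComplementSketch {

    /**
     * matches[doc]     : whether the document matched the query
     * docOrdinals[doc] : category ordinals attached to the document
     * totalCounts[ord] : cached count of each category over the whole index
     */
    static int[] count(boolean[] matches, int[][] docOrdinals, int[] totalCounts) {
        int numMatches = 0;
        for (boolean m : matches) {
            if (m) {
                numMatches++;
            }
        }
        // If more than half of the index matched, visit the (smaller) set of
        // non-matching documents and subtract them from the cached totals.
        boolean useComplement = numMatches * 2 > matches.length;
        int[] counts = useComplement ? totalCounts.clone() : new int[totalCounts.length];
        for (int doc = 0; doc < matches.length; doc++) {
            if (matches[doc] != useComplement) {
                for (int ord : docOrdinals[doc]) {
                    counts[ord] += useComplement ? -1 : 1;
                }
            }
        }
        return counts;
    }
}
```

Either path yields the same counts; the complement path simply touches fewer documents when the result set is large, which is why it helps for queries such as MatchAllDocsQuery.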

Page 14: Faceted Search with Lucene

Lucene Facets Under the Hood

Page 15: Faceted Search with Lucene

• The taxonomy maps categories to integer codes (referred to as ordinals)
– Kind of like a Map<CategoryPath,Integer>, with hierarchy support
– Provides taxonomy browsing services
– DirectoryTaxonomyWriter is managed as a sidecar Lucene index
• Categories are broken down to their path components, e.g. Date/2012/March/20 becomes:
– Date, with ordinal=1
– Date/2012, with ordinal=2
– Date/2012/March, with ordinal=3
– Date/2012/March/20, with ordinal=4

The Taxonomy Index
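The "Map<CategoryPath,Integer> with hierarchy support" analogy can be sketched with an in-memory map plus a parent array. `TaxonomySketch` is a hypothetical stand-in, not the DirectoryTaxonomyWriter API, and reserving ordinal 0 for the root is an assumption made so that the numbering matches the Date/2012/March/20 example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a taxonomy; not the Lucene TaxonomyWriter API.
final class TaxonomySketch {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<Integer> parents = new ArrayList<>();

    TaxonomySketch() {
        ordinals.put("", 0); // assumption: ordinal 0 is reserved for the root
        parents.add(-1);
    }

    /** Adds a category and every missing ancestor; returns the ordinal of the full path. */
    int addCategory(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(component);
            Integer ord = ordinals.get(path.toString());
            if (ord == null) {
                ord = parents.size(); // next free ordinal
                ordinals.put(path.toString(), ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    /** Returns the ordinal of a '/'-separated path, or -1 if it is unknown. */
    int ordinalOf(String path) {
        Integer ord = ordinals.get(path);
        return ord == null ? -1 : ord;
    }

    /** Supports browsing upwards: the parent ordinal of a category. */
    int parentOf(int ordinal) {
        return parents.get(ordinal);
    }
}
```

Adding Date/2012/March/20 to an empty instance assigns ordinals 1 through 4 to the four path components, and a later Date/2012/April reuses the ordinals of Date and Date/2012.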

Page 16: Faceted Search with Lucene

• Categories are added as drilldown terms, e.g. for Date/2012/March/20:
– $facets:Date
– $facets:Date/2012
– …
• All category ordinals associated with the document are added as a BinaryDocValuesField
– The ordinals of all path components are added, not just the leaf's
– Encoded as VInt + gap for efficient compression and speed
  • Other compression methods attempted, but were slower to decode (LUCENE-4609)
– Used during faceted search to read all the associated ordinals and aggregate accordingly (e.g. count)

The Search Index
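"VInt + gap" means the ordinals are sorted, each one is replaced by its difference from the previous one, and every difference is written as a variable-length integer. The sketch below uses a low-order-byte-first varint with the high bit as a continuation flag; that byte layout and the class `DGapVIntSketch` are assumptions for illustration and may not match the byte order of Lucene's own encoder.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical d-gap + varint codec; the byte layout is an assumption, not Lucene's.
final class DGapVIntSketch {

    /** Sorts the ordinals, then writes each gap to the previous ordinal as a varint. */
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int gap = ord - prev;
            prev = ord;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // 7 payload bits, high bit = more bytes follow
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads varint gaps and accumulates them back into ordinals. */
    static int[] decode(byte[] bytes) {
        List<Integer> result = new ArrayList<>();
        int prev = 0;
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;
            } else {
                prev += value;
                result.add(prev);
                value = 0;
                shift = 0;
            }
        }
        int[] ordinals = new int[result.size()];
        for (int i = 0; i < ordinals.length; i++) {
            ordinals[i] = result.get(i);
        }
        return ordinals;
    }
}
```

The ordinals 1, 2, 3, 4 of a single hierarchical path all become a gap of 1, so each fits in one byte; this is why adding the ordinals of every path component stays cheap.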

Page 17: Faceted Search with Lucene

• SortedSetFacetFields add SortedSetDocValuesFields and drilldown terms to documents
• Local-segment SortedSet ordinals are mapped to global ones through SortedSetDocValuesReaderState
• Use SortedSetDocValuesAccumulator to accumulate SortedSet facets
• Advantages:
– Taxonomy representation requires less RAM (flat taxonomy)
– No sidecar index
– Tie-breaks by label-sort order
• Disadvantages:
– Not full taxonomy
– Overall uses more RAM (local-to-global ordinal mapping)
– Adds NRT reopen cost
– Slower than taxonomy-based facets

SortedSet Facets

Page 18: Faceted Search with Lucene

• Per-segment integer codes (as used by the SortedSet approach) are less efficient
– Different ordinals for same categories across segments
– Hold in-memory codes map (e.g. local-to-global) – more RAM and less scalable
– Resolve top-K on the String representation of categories – more CPU
• Global ordinals allow efficient per-segment faceting and aggregation
– No translation maps required (no extra RAM, highly scalable)
– Aggregation, top-K computation done on integer codes
• But, do not play well with IndexWriter.addIndexes(Directory…)
– Must use IndexWriter.addIndexes(IndexReader…), so that the ordinals in the input search index are mapped to the destination's

Global Ordinals
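The cost of per-segment codes can be made concrete with a small sketch: every segment numbers its own sorted labels, so a translation array per segment is needed before counts from different segments can be combined. `GlobalOrdinalSketch` and its list-of-labels model are hypothetical and only illustrate the local-to-global map mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of a local-to-global ordinal map; not the Lucene implementation.
final class GlobalOrdinalSketch {

    /** Sorted union of all segment labels; the list index is the global ordinal. */
    static List<String> globalLabels(List<List<String>> segmentLabels) {
        TreeSet<String> all = new TreeSet<>();
        for (List<String> segment : segmentLabels) {
            all.addAll(segment);
        }
        return new ArrayList<>(all);
    }

    /**
     * map[localOrdinal] = globalOrdinal for one segment. This array is the extra
     * RAM per segment, and it has to be rebuilt whenever the segments change.
     */
    static int[] localToGlobal(List<String> segmentLabels, List<String> global) {
        int[] map = new int[segmentLabels.size()];
        for (int local = 0; local < map.length; local++) {
            map[local] = Collections.binarySearch(global, segmentLabels.get(local));
        }
        return map;
    }
}
```

With a shared taxonomy the same category has the same ordinal in every segment, so per-segment counts can be added together directly and no such arrays are needed.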

Page 19: Faceted Search with Lucene

• FacetsCollector works in two steps:
– Collects matching documents (and optionally their scores)
– Invokes FacetsAccumulator to accumulate the top-K facets
• Performance tests show that this improves faceted search performance (LUCENE-4600)
– Locality of reference?
• Useful for Sampling and Complements
– Hard to do otherwise

Two-Phase Aggregation
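The two steps can be sketched as three small functions: collect the matching document IDs first, then walk that list to count ordinals, then pick the top-K. `TwoPhaseSketch`, the substring match standing in for a query, and the tie-break by lower ordinal are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of collect-then-aggregate; not the Lucene FacetsCollector.
final class TwoPhaseSketch {

    /** Phase 1: only remember which documents matched (a substring test stands in for the query). */
    static List<Integer> collect(String[] titles, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < titles.length; doc++) {
            if (titles[doc].contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Phase 2a: read the ordinals of the collected documents and count them. */
    static int[] accumulate(List<Integer> hits, int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc : hits) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Phase 2b: top-K ordinals by count; ties go to the lower ordinal (an assumption). */
    static List<Integer> topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        return new ArrayList<>(ords.subList(0, Math.min(k, ords.size())));
    }
}
```

Because the full hit list exists before any counting starts, the second phase can decide to sample it or to count its complement instead, which would be hard to do while documents are still streaming in.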

Page 20: Faceted Search with Lucene

• Determine how facets are encoded
– Partition size
– Facet delimiter character (for drilldown terms, default \u001F)
– CategoryListParams
• CategoryListParams holds parameters for a category list
– Encoder/Decoder (default DGapVInt)
– OrdinalPolicy (how path components are encoded): ALL_PARENTS, NO_PARENTS and ALL_BUT_DIMENSION (default)
• CategoryListParams can be used to group facets together
– Default: all facets are put in the same "category list" (i.e. one BinaryDocValues field)
– Expert: separate categories by dimension into different category lists
  • Useful when sets of categories are always aggregated together, but not with other categories
• FacetIndexingParams are currently not recorded per-segment and therefore you should be careful if you suddenly change them!

FacetIndexingParams
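The three OrdinalPolicy values differ only in which ordinals of a category path end up in the document's category list. The enum below is a hypothetical sketch of that selection, taking the ordinals of a path ordered from dimension to leaf; it assumes the leaf ordinal is always kept, and it is not the Lucene CategoryListParams.OrdinalPolicy class.

```java
import java.util.Arrays;

// Hypothetical sketch of the ordinal selection; not the Lucene OrdinalPolicy enum.
enum OrdinalPolicySketch {
    ALL_PARENTS,        // encode every path component
    NO_PARENTS,         // encode only the leaf
    ALL_BUT_DIMENSION;  // encode every component except the top-level dimension

    /** pathOrdinals is ordered from the dimension (index 0) down to the leaf (last index). */
    int[] select(int[] pathOrdinals) {
        int n = pathOrdinals.length;
        if (n == 0) {
            return new int[0];
        }
        int from;
        switch (this) {
            case ALL_PARENTS:
                from = 0;
                break;
            case ALL_BUT_DIMENSION:
                from = Math.min(1, n - 1); // assumption: the leaf itself is always kept
                break;
            default:
                from = n - 1;
                break;
        }
        return Arrays.copyOfRange(pathOrdinals, from, n);
    }
}
```

For Date/2012/March/20 with ordinals 1 to 4, the three policies keep {1, 2, 3, 4}, {4} and {2, 3, 4} respectively. Fewer encoded ordinals means less to decode per document, at the price of having to roll counts up to the parents at search time.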

Page 21: Faceted Search with Lucene

Questions?