Grouping and Joining in Lucene/Solr
DESCRIPTION
Presented by Martijn van Groningen, SearchWorkings. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
In the real world data isn’t flat. Data is often modelled as complex structures with relations. Lucene is document oriented and doesn’t support relations natively. The only way you could index this data was by de-normalizing the relations into a document with many fields and executing subsequent queries. Subsequent queries can be expensive and data gets duplicated, so this isn’t always ideal. Recent versions of Solr and Lucene provide features that allow you to join and group on fields across documents and still have the power of Lucene’s free text search. In this presentation, we’ll look at these new alternatives, their advantages and disadvantages, how these features can be used, and how they impact the design of Solr-based search applications, primarily from infrastructure and operational perspectives.
TRANSCRIPT
Searchworkings.org - The online search community
Overview
Grouping & Joining
‣ Background
‣ Joining
‣ Result grouping
‣ Conclusion
Thursday, May 17, 2012
Lucene’s model
Background
‣ Lucene is document based.
‣ Lucene doesn’t store information about relations between documents.
‣ Data often holds relations.
‣ Good free text search over relational data.
Example
Background
‣ Product
  ‣ Name
  ‣ Description
‣ Product-item
  ‣ Color
  ‣ Size
  ‣ Price
‣ Goal: Show the most applicable product based on product-item criteria.
Common Lucene solutions
Background
‣ Compound documents.
  ‣ May result in documents with many fields.
‣ Subsequent searches.
  ‣ May cause a lot of network overhead.
‣ Non-Lucene-based approach:
  ‣ If free text search isn’t very important, use a relational database.
Example domain
Background
‣ Compound Product & Product-items document.
‣ Each product-item has its own field prefix.
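The compound layout can be sketched with plain Java maps. This is only a model, not Lucene's document API: the class name `CompoundDocSketch` and the `item0_`, `item1_` prefix scheme are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: a "document" is an ordered map of field name -> value.
class CompoundDocSketch {

    // Flattens one product and its product-items into a single field map.
    // Each product-item gets its own field prefix: item0_, item1_, ...
    static Map<String, String> flatten(String name, String description,
                                       List<Map<String, String>> items) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", name);
        doc.put("description", description);
        for (int i = 0; i < items.size(); i++) {
            for (Map.Entry<String, String> field : items.get(i).entrySet()) {
                doc.put("item" + i + "_" + field.getKey(), field.getValue());
            }
        }
        return doc;
    }
}
```

The number of fields grows with the number of product-items, and a query on "color and size of the same item" has to be repeated for every prefix.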
Different solutions
Background
‣ Lucene offers solutions for a 'relational'-like search.
‣ Parent child
‣ Grouping & joining aren't natively supported.
‣ All of these solutions increase the search time.
‣ In some scenarios, grouping and joining aren't the right solution.
Joining
Modelling relations
Introduction
Joining
‣ Support for parent-child-like search since Lucene 3.4.
‣ Not a SQL join.
‣ The parent and each of its children are stored as documents.
‣ Two types:
  ‣ Index time join
  ‣ Query time join
Index time join
Joining
‣ Two block join queries:
  ‣ ToParentBlockJoinQuery
  ‣ ToChildBlockJoinQuery
‣ One Lucene collector:
  ‣ ToParentBlockJoinCollector
‣ Index time join requires block indexing.
Block indexing
Joining
‣ Atomically adding documents.
‣ A block of documents.
‣ Each document gets a sequentially assigned Lucene document id.
‣ IndexWriter#addDocuments(docs);
Block indexing
Joining
‣ Index doesn't record blocks.
‣ Segment merging doesn’t re-order documents in a segment.
‣ App is responsible for identifying block documents.
  ‣ Marking the last document in a block.
‣ Adding a document to a block requires you to reindex the whole block.
‣ Removing a document from a block doesn’t require reindexing the block.
Example domain
Joining
‣ Parent is the last document in a block.
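A minimal sketch of the block layout, using an in-memory list as a stand-in for the index. It is not Lucene's `IndexWriter`: the `BlockIndexSketch` class, the `addBlock` method and the `type` marker field are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an index; the list position is the doc id.
class BlockIndexSketch {
    final List<Map<String, String>> docs = new ArrayList<>();

    // Adds the children followed by their parent as one contiguous block and
    // returns the doc id of the parent (the last document in the block).
    int addBlock(List<Map<String, String>> children, Map<String, String> parent) {
        docs.addAll(children);
        docs.add(parent);
        return docs.size() - 1;
    }

    // The index does not record blocks, so the application marks parents itself
    // (here with a "type" field) and derives the block boundaries from the marker.
    BitSet parentBits() {
        BitSet bits = new BitSet(docs.size());
        for (int id = 0; id < docs.size(); id++) {
            if ("product".equals(docs.get(id).get("type"))) {
                bits.set(id);
            }
        }
        return bits;
    }
}
```

Because ids are assigned sequentially within a block, every set bit in `parentBits()` closes the block that started after the previous set bit.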
Block indexing
Joining
Marking parent documents
Block indexing
Joining
Add block
ToParentBlockJoinQuery
Joining
‣ Parent filter marks the parent documents.
‣ Child query is executed in the parent space.
‣ ToChildBlockJoinQuery works in the opposite direction.
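The idea behind the to-parent join can be modelled with a bit set of parent doc ids: because the parent is the last document of its block, the parent of a matching child is the next set bit at or after the child's doc id. The sketch below is a plain-JDK model under that assumption, not the actual query implementation; `ToParentJoinSketch` is a hypothetical name, and keeping the maximum child score is just one possible way to score a parent.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ToParentJoinSketch {

    // childMatches: doc id -> score for every document matched by the child query.
    // parentBits:   one bit per doc id, set for parent documents (the parent filter).
    // Returns parent doc id -> highest score among its matching children.
    static SortedMap<Integer, Float> joinToParents(SortedMap<Integer, Float> childMatches,
                                                   BitSet parentBits) {
        SortedMap<Integer, Float> parents = new TreeMap<>();
        for (Map.Entry<Integer, Float> match : childMatches.entrySet()) {
            // The parent closes the block, so it is the next set bit after the child.
            int parentId = parentBits.nextSetBit(match.getKey());
            if (parentId < 0) {
                continue; // no parent follows: the child is not in a complete block
            }
            parents.merge(parentId, match.getValue(), Float::max);
        }
        return parents;
    }
}
```

The sketch assumes the child query never matches a parent document itself.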
Query time joining
Joining
‣ Query time joining is executed in two phases and is field based:
  ‣ fromField
  ‣ toField
‣ Doesn’t require block indexing.
Query time joining
Joining
‣ The first phase collects all the terms in the fromField for the documents that match the original query.
  ‣ Currently doesn’t take the score from the original query into account.
‣ The second phase returns the documents whose toField matches one of the terms collected in the first phase.
‣ Two different implementations:
  ‣ JoinUtil - Lucene (≥ 3.6)
  ‣ Join query parser - Solr (trunk)
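The two phases can be sketched over an in-memory list of field maps. This models the technique only and does not use JoinUtil or the Solr join query parser; `QueryTimeJoinSketch`, the `Predicate` standing in for the original query, and the field names in the usage note are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class QueryTimeJoinSketch {

    // docs: list position = doc id, each document is a field name -> value map.
    // Returns the ids of the documents on the "to" side of the join.
    static List<Integer> join(List<Map<String, String>> docs,
                              Predicate<Map<String, String>> fromQuery,
                              String fromField, String toField) {
        // Phase 1: collect the fromField terms of all documents matching the query.
        Set<String> terms = new HashSet<>();
        for (Map<String, String> doc : docs) {
            String term = doc.get(fromField);
            if (term != null && fromQuery.test(doc)) {
                terms.add(term);
            }
        }
        // Phase 2: return the documents whose toField holds a collected term.
        // No score from phase 1 is carried over.
        List<Integer> result = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            String value = docs.get(id).get(toField);
            if (value != null && terms.contains(value)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

For example, joining from a hypothetical `product_id` field on product-items to an `id` field on products, with a query for red items, returns only the products that have at least one red item.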
Query time joining - Indexing
Joining
Refers to the product id.
Query time joining - Indexing
Joining
Query time joining
Joining
‣ Result will contain one product.
‣ Possible to join over two indices.
Final thoughts
Joining
‣ The joining module has good solutions for modelling parent-child relations.
‣ Use block join if you care about scoring.
  ‣ Frequent updates can be problematic.
‣ Use query time join for parent-child filtering.
  ‣ Query time join is slower than index time join.
‣ Mostly a Lucene-only feature.
‣ All code is annotated as experimental.
Result grouping
Previously known as Field Collapsing.
Introduction
Result grouping
‣ Group matching documents that share a common property.
‣ Search hit represents a group.
‣ Facet counts & total hit count represent groups.
‣ Collect information per group:
  ‣ Most relevant document.
  ‣ Top three documents.
  ‣ Aggregated counts.
Usages
Result grouping
‣ Group documents by a shared property
  ‣ Product-item by product id (Parent child)
‣ Collapse similar looking documents
  ‣ E.g. all results from the Wikipedia domains.
‣ Remove duplicates from the search result.
  ‣ Based on a field that contains a hash
Example domain
Result grouping
‣ Each Product-item is a document, but includes the product data.
Implementation
Result grouping
‣ Result grouping is implemented with Lucene collectors.
‣ A module in trunk and a contrib in the 3.x versions.
‣ Two pass result grouping.
  ‣ Grouping by indexed field, function or doc values.
‣ Single pass result grouping.
  ‣ Requires block indexing.
Two pass implementation
Result grouping
‣ First pass collects the top N groups.
  ‣ Per group: group value + sort value
‣ Second pass collects data for each top group.
  ‣ The top N documents per group.
  ‣ Possibly other aggregated information.
‣ Second pass search ignores all documents outside the top N groups.
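The two passes can be sketched over a list of scored hits. This is a simplified model rather than the grouping collectors themselves: `TwoPassGroupingSketch` and `Hit` are hypothetical names, the sort value is assumed to be the score, and full sorts are used where a real collector would more likely keep bounded priority queues.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassGroupingSketch {

    static final class Hit {
        final int docId;
        final String group;
        final float score;

        Hit(int docId, String group, float score) {
            this.docId = docId;
            this.group = group;
            this.score = score;
        }
    }

    // First pass: keep the best sort value (here: score) per group value and
    // return the topN group values, best group first.
    static List<String> topGroups(List<Hit> hits, int topN) {
        Map<String, Float> best = new HashMap<>();
        for (Hit hit : hits) {
            best.merge(hit.group, hit.score, Float::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> {
            int byScore = Float.compare(best.get(b), best.get(a)); // higher score first
            return byScore != 0 ? byScore : a.compareTo(b);        // tie-break on group value
        });
        return new ArrayList<>(groups.subList(0, Math.min(topN, groups.size())));
    }

    // Second pass: collect the best docsPerGroup hits for each top group.
    static Map<String, List<Hit>> topDocsPerGroup(List<Hit> hits, List<String> topGroups,
                                                  int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String group : topGroups) {
            result.put(group, new ArrayList<>());
        }
        for (Hit hit : hits) {
            List<Hit> bucket = result.get(hit.group);
            if (bucket != null) { // hits outside the top groups are ignored
                bucket.add(hit);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort((a, b) -> Float.compare(b.score, a.score));
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }
}
```

The second pass needs the group values from the first pass, which is why the query is effectively executed twice.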
Result grouping - Indexing
Result grouping
Result grouping - Searching
Result grouping
Result grouping made easier
Result grouping
‣ GroupingSearch
‣ Solr
  ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id
‣ Many more options:
  ‣ http://wiki.apache.org/solr/FieldCollapsing
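As a small illustration of the Solr request, the sketch below builds the select URL from a parameter map. The helper class `GroupedSelectUrl` is hypothetical; `group.limit` (documents returned per group) and `group.ngroups` (include the number of matching groups) are two of the additional options, and the values chosen here are arbitrary.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that only assembles the request URL; it sends nothing.
class GroupedSelectUrl {

    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&", baseUrl + "?", "");
        for (Map.Entry<String, String> param : params.entrySet()) {
            // The Charset overload of URLEncoder.encode needs Java 10 or later.
            query.add(URLEncoder.encode(param.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return query.toString();
    }
}
```

Passing an insertion-ordered map with `q=shirt`, `group=true`, `group.field=product_id`, `group.limit=3` and `group.ngroups=true` yields the URL from the slide with the two extra options appended.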
Parent child result
Result grouping
‣ TopGroups - Equivalent to TopDocs.
  ‣ Hit count
  ‣ Group count
  ‣ Groups
  ‣ Top documents
‣ Facet and total counts can represent groups instead of documents.
  ‣ But this requires more query time.
Conclusion
Compare...
Compare the parent child solutions
Conclusion
‣ Result grouping
  ‣ + Distributed support & parent-child relation as hit.
  ‣ - Parent data duplication.
  ‣ - Impact on query time.
‣ Joining
  ‣ + Fast & no data duplication.
  ‣ - Index time join not optimal for updates.
  ‣ - Query time join is limited.
Compare the parent child solutions
Conclusion
‣ Compound documents.
  ‣ + Fast and works out of the box with all features.
  ‣ - Not flexible when it comes to updates.
  ‣ - Document granularity is set in stone.
Any questions?
Extra slides
We have time left!
Future work
Conclusion
‣ Higher level parent-child API.
  ‣ Needs to cover search & indexing.
‣ Joining
  ‣ Distributed support.
  ‣ Represent a hit as a parent child relation in the search result.
‣ Result grouping
  ‣ Aggregated grouped information like: sum, avg, min, max etc.
ToParentBlockJoinCollector
Joining
‣ TopGroups contains one group per top N parent document.
‣ Each group contains a parent and its child documents.
Groups & facet counts
Result grouping
‣ Faceting and result grouping are different features.
  ‣ But they are often used together!
‣ Facet counts can be based on:
  ‣ Found documents.
  ‣ Found groups.
  ‣ Combination of facet value and group.
‣ All options are supported in Solr.
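The three ways of counting can be compared on a small in-memory result set. This is a sketch with hypothetical names (`GroupedFacetSketch`, `Hit`), and it assumes that "found groups" means counting only the most relevant document of each group. In Solr these variants appear to correspond roughly to plain faceting, `group.truncate` and `group.facet`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class GroupedFacetSketch {

    static final class Hit {
        final String group;
        final String facet;
        final float score;

        Hit(String group, String facet, float score) {
            this.group = group;
            this.facet = facet;
            this.score = score;
        }
    }

    // Found documents: every matching document counts once for its facet value.
    static Map<String, Integer> byDocuments(List<Hit> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Hit hit : hits) {
            counts.merge(hit.facet, 1, Integer::sum);
        }
        return counts;
    }

    // Found groups: only the most relevant document of each group is counted.
    static Map<String, Integer> byGroupHeads(List<Hit> hits) {
        Map<String, Hit> heads = new HashMap<>();
        for (Hit hit : hits) {
            heads.merge(hit.group, hit, (a, b) -> b.score > a.score ? b : a);
        }
        return byDocuments(new ArrayList<>(heads.values()));
    }

    // Facet value + group: each group counts once per facet value it contains.
    static Map<String, Integer> byFacetAndGroup(List<Hit> hits) {
        Set<String> seen = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (Hit hit : hits) {
            if (seen.add(hit.facet + "\u0000" + hit.group)) {
                counts.merge(hit.facet, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a product that has two red items and one blue item, the three variants give different counts for "red", which is why the choice has to be explicit.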
Doc values
Result grouping
‣ Doc values / Column Stride values
‣ Prevents the creation of expensive data structures in FieldCache.
‣ Inverted index is meant for free text search.
‣ All grouping collectors have doc values based implementations!
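The difference can be illustrated with a toy model. Grouping needs a doc id to value lookup, while an inverted index maps terms to doc ids; a field cache therefore has to rebuild the per-document array at search time, whereas a column-stride layout stores it directly. The class and method names below are hypothetical and no Lucene API is used.

```java
import java.util.Map;

class DocValuesSketch {

    // What a field cache has to do at search time: walk the postings of every
    // term in the inverted index and rebuild a doc id -> value array in memory.
    // Assumes a single value per document.
    static String[] uninvert(Map<String, int[]> invertedIndex, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> posting : invertedIndex.entrySet()) {
            for (int docId : posting.getValue()) {
                valueByDoc[docId] = posting.getKey();
            }
        }
        return valueByDoc;
    }

    // With a column-stride layout that array is written at index time, so a
    // grouping collector can read the group value for a doc id directly.
    static String groupValue(String[] column, int docId) {
        return column[docId];
    }
}
```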