Grouping and Joining in Lucene/Solr

Grouping & Joining

Martijn van [email protected] Committer & PMC Member

Thursday, May 17, 2012


Searchworkings.org - The online search community

Overview

Grouping & Joining

‣ Background

‣ Joining

‣ Result grouping

‣ Conclusion




Lucene’s model

Background

‣ Lucene is document based.

‣ Lucene doesn’t store information about relations between documents.

‣ Data often holds relations.

‣ Goal: good free text search over relational data.




Example

Background

‣ Product

‣ Name

‣ Description

‣ Product-item

‣ Color

‣ Size

‣ Price

‣ Goal: Show the most applicable product based on product-item criteria.




Common Lucene solutions

Background

‣ Compound documents.

‣ May result in documents with many fields.

‣ Subsequent searches.

‣ May cause a lot of network overhead.

‣ Non-Lucene-based approach:

‣ If free text search isn’t very important, use a relational database.



Example domain

Background

‣ Compound Product & Product-items document.

‣ Each product-item has its own field prefix.
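
A compound document with per-item field prefixes can be sketched as follows. This is a minimal Python model, not Lucene code; the `item0_`, `item1_` prefix scheme and the sample values are assumptions made for illustration.

```python
def to_compound_document(product, items):
    """Flatten a product and its product-items into one flat document."""
    doc = {"name": product["name"], "description": product["description"]}
    for i, item in enumerate(items):
        prefix = f"item{i}_"  # hypothetical prefix scheme, one prefix per product-item
        for field, value in item.items():
            doc[prefix + field] = value
    return doc


product = {"name": "Polo shirt", "description": "Short-sleeved cotton shirt"}
items = [
    {"color": "red", "size": "M", "price": 25},
    {"color": "blue", "size": "L", "price": 27},
]
doc = to_compound_document(product, items)
print(sorted(doc))
```

The number of fields grows with the number of product-items, which is the "many fields" drawback mentioned for compound documents.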




Different solutions

Background

‣ Lucene offers solutions for 'relational'-like search.

‣ Parent-child

‣ Grouping & joining aren't natively supported.

‣ All of these solutions increase the search time.

‣ In some scenarios, grouping and joining aren't the right solution.


Joining

Modelling relations



Introduction

Joining

‣ Support for parent-child-like search since Lucene 3.4.

‣ Not a SQL join.

‣ The parent and each of its children are stored as separate documents.

‣ Two types:

‣ Index time join

‣ Query time join




Index time join

Joining

‣ Two block join queries:

‣ ToParentBlockJoinQuery

‣ ToChildBlockJoinQuery

‣ One Lucene collector:

‣ ToParentBlockJoinCollector

‣ Index time join requires block indexing.




Block indexing

Joining

‣ Atomically adding documents.

‣ A block of documents.

‣ Each document in the block is assigned a sequential Lucene document id.

‣ IndexWriter#addDocuments(docs);




Block indexing

Joining

‣ Index doesn't record blocks.

‣ Segment merging doesn’t re-order documents in a segment.

‣ App is responsible for identifying block documents.

‣ Marking the last document in a block.

‣ Adding a document to a block requires you to reindex the whole block.

‣ Removing a document from a block doesn’t require reindexing the block.



Example domain

Joining

‣ Parent is the last document in a block.
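
The block layout can be modelled in a few lines of Python. This is a toy stand-in for a Lucene segment, not the Lucene API: `ToyIndex`, `product_block` and the `type` marker field are hypothetical names chosen for this sketch.

```python
class ToyIndex:
    """A toy stand-in for a segment: the list position plays the role of the doc id."""

    def __init__(self):
        self.docs = []

    def add_documents(self, block):
        """Append a whole block at once so its doc ids are contiguous."""
        first_id = len(self.docs)
        self.docs.extend(block)
        return list(range(first_id, len(self.docs)))


def product_block(product, items):
    """Children first, parent last; the 'type' field marks the parent document."""
    children = [dict(item, type="item") for item in items]
    parent = dict(product, type="product")
    return children + [parent]


index = ToyIndex()
ids_a = index.add_documents(product_block({"name": "shirt"}, [{"color": "red"}, {"color": "blue"}]))
ids_b = index.add_documents(product_block({"name": "jeans"}, [{"color": "black"}]))

# The application, not the index, knows where blocks end: it looks for the marker.
parent_ids = [doc_id for doc_id, doc in enumerate(index.docs) if doc["type"] == "product"]
print(ids_a, ids_b, parent_ids)
```

Because the index itself does not record blocks, the marker on the last document is the only way to recover block boundaries at search time.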




Block indexing

Joining


Marking parent documents



Block indexing

Joining

Add block



ToParentBlockJoinQuery

Joining

‣ Parent filter marks the parent documents.

‣ Child query matches are joined up to the parent document space.

‣ ToChildBlockJoinQuery works in the opposite direction.
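
The idea behind the two block join directions can be sketched in Python. This is a conceptual model under the assumption that the parent is the last document of each block; `join_to_parents` and `join_to_children` are hypothetical helper names, and scoring is left out entirely.

```python
from bisect import bisect_left


def join_to_parents(child_hits, parent_ids):
    """Map matching child doc ids to the parent that closes their block.

    parent_ids must be sorted; it plays the role of the parent filter.
    """
    parents = []
    for child_id in sorted(child_hits):
        pos = bisect_left(parent_ids, child_id)
        if pos == len(parent_ids):
            continue  # no parent follows this doc, so it is not in a complete block
        parent_id = parent_ids[pos]
        if not parents or parents[-1] != parent_id:
            parents.append(parent_id)
    return parents


def join_to_children(parent_hits, parent_ids):
    """Opposite direction: expand each matching parent to the child ids of its block."""
    children = []
    for parent_id in sorted(parent_hits):
        pos = bisect_left(parent_ids, parent_id)
        start = parent_ids[pos - 1] + 1 if pos > 0 else 0
        children.extend(range(start, parent_id))
    return children


# Doc ids 0 and 1 are children of parent 2; doc id 3 is the child of parent 4.
parent_ids = [2, 4]
print(join_to_parents({1, 3}, parent_ids))
print(join_to_children({2}, parent_ids))
```

Both directions rely only on doc id order and the parent markers, which is why the blocks must be indexed contiguously.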



Query time joining

Joining

‣ Query time joining is executed in two phases and is field based:

‣ fromField

‣ toField

‣ Doesn’t require block indexing.




Query time joining

Joining

‣ The first phase collects all the terms in the fromField for the documents that match the original query.

‣ Currently doesn’t take the score from the original query into account.

‣ The second phase returns the documents whose toField matches the terms collected in the first phase.

‣ Two different implementations:

‣ JoinUtil - Lucene (≥ 3.6)

‣ Join query parser - Solr (trunk)
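
The two phases can be sketched in plain Python. This is a conceptual model of the technique rather than the JoinUtil or Solr API; `query_time_join`, the field names and the sample documents are assumptions made for illustration.

```python
def query_time_join(from_docs, to_docs, from_field, to_field, matches_query):
    """Two-phase, field-based join without scoring."""
    # Phase 1: collect the fromField terms of every document matching the original query.
    terms = {
        doc[from_field]
        for doc in from_docs
        if from_field in doc and matches_query(doc)
    }
    # Phase 2: return the documents whose toField value is one of the collected terms.
    return [doc for doc in to_docs if doc.get(to_field) in terms]


items = [
    {"id": "i1", "product_id": "p1", "color": "red"},
    {"id": "i2", "product_id": "p1", "color": "blue"},
    {"id": "i3", "product_id": "p2", "color": "blue"},
]
products = [
    {"id": "p1", "name": "shirt"},
    {"id": "p2", "name": "jeans"},
]

# Find the products that have a red product-item.
result = query_time_join(items, products, "product_id", "id", lambda d: d.get("color") == "red")
print(result)
```

The from side and the to side are separate collections here, which mirrors the point that a query time join can run over two indices.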



Query time joining - Indexing

Joining

Refers to the product id.



Query time joining - Indexing

Joining




Query time joining

Joining


‣ Result will contain one product.

‣ Possible to join over two indices.



Final thoughts

Joining

‣ The joining module has good solutions for modelling parent-child relations.

‣ Use block join if you care about scoring.

‣ Frequent updates can be problematic.

‣ Use query time join for parent-child filtering.

‣ Query time join is slower than index time join.

‣ Mostly a Lucene-only feature.

‣ All code is annotated as experimental.


Result grouping

Previously known as Field Collapsing.



Introduction

Result grouping

‣ Group matching documents that share a common property.

‣ Search hit represents a group.

‣ Facet counts & total hit count represent groups.

‣ Collect information per group:

‣ Most relevant document.

‣ Top three documents.

‣ Aggregated counts.



Usages

Result grouping

‣ Group documents by a shared property

‣ Product-item by product id (Parent child)

‣ Collapse similar looking documents

‣ E.g. all results from the Wikipedia domains.

‣ Remove duplicates from the search result.

‣ Based on a field that contains a hash
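
Duplicate removal through a hash field can be sketched as follows. This is a minimal Python model, assuming the hash is computed over a normalized `body` field; the normalization, the SHA-1 choice and the field name are all assumptions for this sketch.

```python
import hashlib


def signature(text):
    """Hash of the normalized text; this is the value that would be stored in the hash field."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(ranked_docs):
    """Keep only the first (most relevant) document per signature, like one hit per group."""
    seen = set()
    unique = []
    for doc in ranked_docs:
        sig = signature(doc["body"])
        if sig not in seen:
            seen.add(sig)
            unique.append(doc)
    return unique


ranked = [
    {"id": "a", "body": "Apache  Lucene"},
    {"id": "b", "body": "apache lucene"},
    {"id": "c", "body": "Apache Solr"},
]
print([doc["id"] for doc in remove_duplicates(ranked)])
```

Grouping on such a field and keeping one document per group has the same effect as this loop.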




Example domain

Result grouping

‣ Each Product-item is a document, but includes the product data.




Implementation

Result grouping

‣ Result grouping implemented with Lucene collectors.

‣ Module in trunk and a contrib in 3.x versions.

‣ Two pass result grouping.

‣ Grouping by indexed field, function or doc values.

‣ Single pass result grouping.

‣ Requires block indexing.




Two pass implementation

Result grouping

‣ First pass collects the top N groups.

‣ Per group: group value + sort value

‣ Second pass collects data for each top group.

‣ The top N documents per group.

‣ Possibly other aggregated information.

‣ The second pass search ignores all documents outside the top N groups.
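
The two passes can be sketched in Python. This is a conceptual model of the approach, not the Lucene collectors; it assumes groups are sorted by their best score, and `first_pass`, `second_pass` and the sample hits are hypothetical.

```python
def first_pass(hits, group_field, top_n):
    """Collect the top N groups: group value plus its sort value (here the best score)."""
    best = {}
    for doc, score in hits:
        group = doc[group_field]
        if group not in best or score > best[group]:
            best[group] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


def second_pass(hits, group_field, top_groups, docs_per_group):
    """Collect the top documents per group, ignoring documents outside the top groups."""
    per_group = {group: [] for group, _ in top_groups}
    for doc, score in hits:
        group = doc[group_field]
        if group in per_group:
            per_group[group].append((score, doc["id"]))
    return [
        (group, [doc_id for _, doc_id in sorted(per_group[group], reverse=True)[:docs_per_group]])
        for group, _ in top_groups
    ]


hits = [
    ({"id": "i1", "product_id": "p1"}, 0.9),
    ({"id": "i2", "product_id": "p1"}, 0.4),
    ({"id": "i3", "product_id": "p2"}, 0.7),
    ({"id": "i4", "product_id": "p3"}, 0.2),
    ({"id": "i5", "product_id": "p2"}, 0.8),
]
top_groups = first_pass(hits, "product_id", top_n=2)
grouped = second_pass(hits, "product_id", top_groups, docs_per_group=2)
print(grouped)
```

The second pass visits the matching documents again, which is one reason grouping adds to query time.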




Result grouping - Indexing

Result grouping




Result grouping - Searching

Result grouping




Result grouping made easier

Result grouping


‣ GroupingSearch

‣ Solr

‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

‣ Many more options:

‣ http://wiki.apache.org/solr/FieldCollapsing
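
A grouped Solr request like the one above can be assembled programmatically. The sketch below only builds the URL; `grouped_select_url` is a hypothetical helper, `myhost` is the placeholder host from the slide, and `group.limit` (documents returned per group) is a Solr grouping parameter that does not appear in the slide.

```python
from urllib.parse import urlencode


def grouped_select_url(base_url, query, group_field, extra=None):
    """Build a /select URL with result grouping switched on."""
    params = {"q": query, "group": "true", "group.field": group_field}
    if extra:
        params.update(extra)  # optional additional grouping parameters
    return base_url + "/select?" + urlencode(params)


url = grouped_select_url("http://myhost/solr", "shirt", "product_id")
print(url)

# Assumption: ask for up to three documents per group.
url_top3 = grouped_select_url("http://myhost/solr", "shirt", "product_id", {"group.limit": 3})
print(url_top3)
```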




Parent child result

Result grouping

‣ TopGroups - Equivalent to TopDocs.

‣ Hit count

‣ Group count

‣ Groups

‣ Top documents

‣ Facet and total count can represent groups instead of documents.

‣ But this requires more query time.


Conclusion

Compare...



Compare the parent child solutions

Conclusion

‣ Result grouping

‣ + Distributed support & parent-child relation as a hit.

‣ - Parent data duplication

‣ - Impact on query time

‣ Joining

‣ + Fast & no data duplication

‣ - Index time join not optimal for updates

‣ - Query time join is limited.



Compare the parent child solutions

Conclusion

‣ Compound documents.

‣ + Fast and works out-of-the box with all features.

‣ - Not flexible when it comes to updates.

‣ - Document granularity is set in stone.


Any questions?


Extra slides

We have time left!



Future work

Conclusion

‣ Higher level parent-child API.

‣ Needs to cover search & indexing.

‣ Joining

‣ Distributed support.

‣ Represent a hit as a parent child relation in the search result.

‣ Result grouping

‣ Aggregated grouped information like: sum, avg, min, max etc.




ToParentBlockJoinCollector

Joining

‣ TopGroups contains one group for each of the top N parent documents.

‣ Each group contains a parent and its child documents.



Groups & facet counts

Result grouping

‣ Faceting and result grouping are different features.

‣ But are often used together!

‣ Facet counts can be based on:

‣ Found documents.

‣ Found groups.

‣ Combination of facet value and group.

‣ All options are supported in Solr.
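
The three ways of counting can be compared on a small example. This is a Python sketch of one reading of the options above, not Solr's implementation: "found groups" is interpreted as counting only the most relevant document of each group, and all function and field names are hypothetical.

```python
from collections import Counter


def facet_by_documents(hits, facet_field):
    """Count every matching document."""
    return Counter(doc[facet_field] for doc in hits)


def facet_by_top_group_document(hits, group_field, facet_field):
    """Count only the most relevant document of each group; hits are in relevance order."""
    seen = set()
    counts = Counter()
    for doc in hits:
        if doc[group_field] not in seen:
            seen.add(doc[group_field])
            counts[doc[facet_field]] += 1
    return counts


def facet_by_value_and_group(hits, group_field, facet_field):
    """Count each distinct (facet value, group) combination once."""
    pairs = {(doc[facet_field], doc[group_field]) for doc in hits}
    return Counter(value for value, _ in pairs)


hits = [
    {"id": "i1", "product_id": "p1", "color": "red"},
    {"id": "i2", "product_id": "p1", "color": "red"},
    {"id": "i3", "product_id": "p1", "color": "blue"},
    {"id": "i4", "product_id": "p2", "color": "blue"},
]
print(facet_by_documents(hits, "color"))
print(facet_by_top_group_document(hits, "product_id", "color"))
print(facet_by_value_and_group(hits, "product_id", "color"))
```

The same four hits give three different sets of counts, so the choice determines what a facet number means to the user.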



Doc values

Result grouping

‣ Doc values / Column Stride values

‣ Prevents the creation of expensive data structures in FieldCache.

‣ Inverted index is meant for free text search.

‣ All grouping collectors have doc values based implementations!
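
The difference between the two storage layouts can be sketched in Python. This is a simplified model and not Lucene's data structures: the inverted index is a plain dict, and `uninvert` stands in for the work a field cache does at search time.

```python
def build_inverted_index(docs, field):
    """Inverted index: term -> list of doc ids. Good for search, awkward for per-doc lookups."""
    inverted = {}
    for doc_id, doc in enumerate(docs):
        inverted.setdefault(doc[field], []).append(doc_id)
    return inverted


def uninvert(inverted, doc_count):
    """What a field cache has to do at search time: rebuild doc id -> value from the terms."""
    column = [None] * doc_count
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


def build_column(docs, field):
    """Doc values style: the doc id -> value column is written at index time."""
    return [doc[field] for doc in docs]


docs = [{"product_id": "p1"}, {"product_id": "p2"}, {"product_id": "p1"}]
inverted = build_inverted_index(docs, "product_id")
print(uninvert(inverted, len(docs)))
print(build_column(docs, "product_id"))
```

Both routes end with the same per-document column that a grouping collector reads; the column-stride layout simply avoids rebuilding it in memory on first use.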


