working with deeply nested documents in apache solr: presented by anshum gupta & alisa zhila,...

OCTOBER 11-14, 2016 • BOSTON, MA

Working with deeply nested documents in Apache Solr

Anshum Gupta, Alisa Zhila

IBM Watson

Anshum Gupta

• Apache Lucene/Solr committer and PMC member

• Search guy @ IBM Watson.

• Interested in search and related stuff.

• Apache Lucene since 2006 and Solr since 2010.

Alisa Zhila

• Apache Lucene/Solr supporter :)

• Natural Language Processing technologies @ IBM Watson

• Interested in search and related stuff

Agenda

• Hierarchical Data/Nested Documents

• Indexing Nested Documents

• Querying Nested Documents

• Faceting on Nested Documents

Hierarchical Documents

• Social media comments, Email threads, Annotated data - AI

• Relationship between documents

• Possibility to flatten

Need for nested data

EXAMPLE: Blog Post with Comments

Peter Navarro outlines the Trump economic plan
Tyler Cowen, September 27, 2016 at 3:07am
Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports.

1 Ray Lopez, September 27, 2016 at 3:21 am
I’ll be the first to say this, but the analysis is flawed. {negative}

2 Brian Donohue, September 27, 2016 at 9:20 am
The math checks out. Solid. {positive}

examples from http://marginalrevolution.com

• Cannot flatten, need to retain context


• Get all 'positive comments' to 'posts about Trump' -- IMPOSSIBLE!!!

Nested Documents

EXAMPLE: Data Flattening

Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports.
Comment_authors: [Ray Lopez, Brian Donohue]
Comment_dates: [September 27, 2016 at 3:21 am, September 27, 2016 at 9:20 am]
Comment_texts: ["I’ll be the first to say this, but the analysis is flawed.", "The math checks out. Solid."]
Comment_sentiments: [negative, positive]

• Cannot flatten, need to retain context


• Get all 'positive comments' to 'posts about Trump' -- POSSIBLE!!! (stay tuned)

Nested Documents

EXAMPLE: Hierarchical Documents

Type: Post
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports.

Type: Comment
Author: Ray Lopez
Date: September 27, 2016 at 3:21 am
Text: I’ll be the first to say this, but the analysis is flawed.
Sentiment: negative

Type: Comment
Author: Brian Donohue
Date: September 27, 2016 at 9:20 am
Text: The math checks out. Solid.
Sentiment: positive
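The difference between the flattened and the hierarchical form can be illustrated with a small toy model. This is only a sketch in plain Python, not Solr itself: the field names and the two matching functions are assumptions made for illustration. It shows how parallel multi-valued fields lose the link between a comment's author and its sentiment.

```python
# Flattened post: comment fields become parallel multi-valued fields.
flat_post = {
    "title": "Peter Navarro outlines the Trump economic plan",
    "comment_authors": ["Ray Lopez", "Brian Donohue"],
    "comment_sentiments": ["negative", "positive"],
}

# Hierarchical post: every comment is a document of its own.
nested_post = {
    "type": "post",
    "title": "Peter Navarro outlines the Trump economic plan",
    "comments": [
        {"type": "comment", "author": "Ray Lopez", "sentiment": "negative"},
        {"type": "comment", "author": "Brian Donohue", "sentiment": "positive"},
    ],
}


def flat_match(post, author, sentiment):
    """Multi-valued matching: each field is tested independently."""
    return (author in post["comment_authors"]
            and sentiment in post["comment_sentiments"])


def nested_match(post, author, sentiment):
    """Per-document matching: both conditions must hold for the same comment."""
    return [c for c in post["comments"]
            if c["author"] == author and c["sentiment"] == sentiment]


# Ray Lopez's comment was negative, yet the flattened post still matches.
print(flat_match(flat_post, "Ray Lopez", "positive"))      # True (false positive)
print(nested_match(nested_post, "Ray Lopez", "positive"))  # []
```
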

• Blog Post Data with Comments and Replies from http://marginalrevolution.com (curated)

• 2 posts, 2-3 comments per post, 0-3 replies per comment

• Extracted keywords & sentiment data

• 4 levels of "nesting"

• Too big to show on slides

• Data + Scripts + Demo Queries:

• https://github.com/alisa-ipn/solr-revolution-2016-nested-demo

Running Example


Indexing Nested Documents

• Nested XML

• JSON Documents

    • Add _childDocuments_ tags for all children

    • Pre-process field names to FQNs (fully qualified names)

    • Lose information, or add that as meta-data during pre-processing

• JSON Document endpoint (6.x only) - /update/json/docs

    • Field name mappings

    • Child Document splitting - Enhanced support coming soon.

Sending Documents to Solr

solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml

Sending Documents to Solr: Nested XML

<add>
  <doc>
    <field name="type">post</field>
    <field name="author">"Alex Tabarrok"</field>
    <field name="title">"The Irony of Hillary Clinton’s Data Analytics"</field>
    <field name="body">"Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth."</field>
    <field name="id">"12015-24204"</field>
    <doc>
      <field name="type">comment</field>
      <field name="author">"Todd"</field>
      <field name="text">"Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."</field>
      <field name="sentiment">"positive"</field>
      <field name="id">"29798-24171"</field>
      <doc>
        <field name="type">reply</field>
        <field name="author">"The Other Jim"</field>
        <field name="text">"No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."</field>
        <field name="sentiment">"negative"</field>
        <field name="id">"29798-21232"</field>
      </doc>
    </doc>
  </doc>
</add>

• Add _childDocuments_ tags for all children

• Pre-process field names to FQNs

• Lose information, or add that as meta-data during pre-processing

solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr

Sending Documents to Solr: JSON Documents

[{
  "path": "1.posts",
  "id": "28711",
  "author": "Alex Tabarrok",
  "title": "The Irony of Hillary Clinton’s Data Analytics",
  "body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth.",
  "_childDocuments_": [
    {
      "path": "2.posts.comments",
      "id": "28711-19237",
      "author": "Todd",
      "text": "Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world.",
      "sentiment": "positive",
      "_childDocuments_": [
        {
          "path": "3.posts.comments.replies",
          "author": "The Other Jim",
          "id": "28711-12444",
          "sentiment": "negative",
          "text": "No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."
        }]
    }]
}]
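The pre-processing step that produces this shape could look roughly like the sketch below. It assumes, hypothetically, that the raw data nests children as lists of objects under keys such as "comments" and "replies", and it derives the "path" value as the nesting depth followed by the dotted key path. The function name and the sample record are made up for illustration and are not taken from the demo scripts.

```python
import json


def to_solr_doc(record, path):
    """Convert one nested record into a Solr document with _childDocuments_.

    `path` is the dotted location of the record, e.g. "posts.comments".
    """
    depth = path.count(".") + 1
    doc = {"path": f"{depth}.{path}"}
    children = []
    for key, value in record.items():
        # A non-empty list of objects is treated as a list of child documents.
        if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            children.extend(to_solr_doc(child, f"{path}.{key}") for child in value)
        else:
            doc[key] = value
    if children:
        doc["_childDocuments_"] = children
    return doc


# Hypothetical raw record, shortened from the running example.
raw_post = {
    "id": "28711",
    "author": "Alex Tabarrok",
    "comments": [
        {"id": "28711-19237", "author": "Todd", "sentiment": "positive",
         "replies": [{"id": "28711-12444", "author": "The Other Jim",
                      "sentiment": "negative"}]},
    ],
}

solr_doc = to_solr_doc(raw_post, "posts")
print(json.dumps([solr_doc], indent=2))
```
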

• JSON Document endpoint (6.x only) - /update/json/docs

• Field name mappings

• Child Document splitting - Enhanced support coming soon.

solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data-binary @small-example-data.json -H 'Content-type:application/json'

NOTE: All documents must contain a unique ID.

Sending Documents to Solr: JSON Endpoint

• Update Request Processors don’t work with nested documents

• Example:

• UUID update processor does not auto-add an id for a child document.

• Workaround:

• Take responsibility at the client layer to handle the computation for nested documents.

• Change the update processor in Solr to handle nested documents.
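The client-side workaround for ids can be sketched as follows. This is a minimal example, assuming documents are already in the `_childDocuments_` format shown earlier; the function name `assign_ids` and the sample document are illustrative only.

```python
import uuid


def assign_ids(doc, id_field="id"):
    """Give `doc` and every nested child a unique id if it lacks one (in place)."""
    if id_field not in doc:
        doc[id_field] = str(uuid.uuid4())
    for child in doc.get("_childDocuments_", []):
        assign_ids(child, id_field)
    return doc


post = {
    "id": "28711",  # existing ids are kept
    "path": "1.posts",
    "_childDocuments_": [
        {"path": "2.posts.comments",
         "_childDocuments_": [{"path": "3.posts.comments.replies"}]},
    ],
}
assign_ids(post)
```

Running this before sending the block to Solr means every child already carries an id, so nothing depends on an update processor reaching the nested documents.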

Update Processors and Nested Documents

• The entire block needs reindexing

• Forgot to add a meta-data field that might be useful? Complete reindex

• Store everything in Solr if:

    • it’s too expensive to reconstruct the document from the original data source

    • the original data is no longer accessible, e.g. streaming data

Re-Indexing Your Documents

• Various ways to index nested documents

• Need to re-index entire block

Nested Document Indexing Summary

Let’s ask some interesting questions

Easy question first

Find all documents that mention Trump

q=text:Trump

{ "path":["4.posts.comments.replies.keywords"], "text":["Trump"]},
{ "path":["3.posts.comments.keywords"], "text":["Trump"]},
{ "path":["2.posts.keywords"], "text":["Trump"]},
{ "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]},
{ "text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."], "path":["1.posts"]},
{ "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}

Returning certain types of documents

Find all comments and replies that mention Trump

q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump

{ "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]},
{ "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]},
{ "text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."], "path":["2.posts.comments"]}

Recipe: At the data pre-processing stage, add a field that indicates the document type and its path in the hierarchy (stay tuned).

Returning similar type from different level

Find all keywords that are Hillary

q=path:*.keywords AND text:Hillary

{ "path":["3.posts.comments.keywords"], "sentiment":["positive"], "text":["Hillary"]},
{ "path":["4.posts.comments.replies.keywords"], "sentiment":["negative"], "text":["Hillary"]},
{ "path":["2.posts.keywords"], "text":["Hillary"]}

Recipe: Use wild-cards in the field that stores the hierarchy path

Cross-Level Querying

Recap so far...

Find all keywords that are Hillary

q=path:*.keywords AND text:Hillary

So far, each query returns exactly the documents that the search condition is applied to: a query on level 2, 3 or 4 returns results from that same level.

Returning parents by querying children: Block Join Parent Query

Find all comments whose keywords detected positive sentiment towards Hillary

q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive

Query Level 3, Result Level 2

{ "author":["Brian Donohue"], "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]},
{ "author":["Todd"], "text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."], "path":["2.posts.comments"]}

Returning children by querying parents: Block Join Child Query

Find replies to negative comments

q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies

Query Level 2, Result Level 3

{ "sentiment":["negative"], "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]},
{ "sentiment":["neutral"], "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]},
{ "sentiment":["positive"], "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"]}
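When such requests are built in client code, the local-params prefix has to be assembled and URL-encoded. The helpers below are a small sketch using only the Python standard library; the function names `bj_parent` and `bj_child` are made up here, and the /select path is an assumption about the handler being used.

```python
from urllib.parse import urlencode


def bj_parent(all_parents, child_condition):
    """Block Join Parent Query: return parents whose children match."""
    return f'{{!parent which="{all_parents}"}}{child_condition}'


def bj_child(all_parents, parent_condition):
    """Block Join Child Query: return children of the matching parents."""
    return f'{{!child of="{all_parents}"}}{parent_condition}'


parent_q = bj_parent(
    "path:2.posts.comments",
    "path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive",
)
child_q = bj_child(
    "path:2.posts.comments",
    "path:2.posts.comments AND sentiment:negative",
)

# urlencode escapes braces, quotes, colons and spaces for the HTTP request.
parent_url = "/select?" + urlencode({"q": parent_q})
child_url = "/select?" + urlencode(
    {"q": child_q, "fq": "path:3.posts.comments.replies"}
)
print(parent_url)
print(child_url)
```
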

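Block Join Child Query + Filtering Query

A bit counterintuitive and asymmetrical to the Block Join Parent Query (BJPQ): the condition on the parents goes into q, while the restriction on the returned child level goes into fq.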
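Why the extra filter is needed can be shown with a toy model. The sketch below does not call Solr; it only mimics the behaviour described on these slides (the child query returns every descendant of the matching parents, and a filter then narrows the result to one level). The sample documents and function names are invented for illustration.

```python
def descendants(doc):
    """Yield every document below `doc`, at any depth."""
    for child in doc.get("_childDocuments_", []):
        yield child
        yield from descendants(child)


def child_query(parents, parent_condition, fq=None):
    """All descendants of the matching parents, then an optional filter."""
    hits = [d for p in parents if parent_condition(p) for d in descendants(p)]
    return [d for d in hits if fq(d)] if fq else hits


comments = [
    {"path": "2.posts.comments", "sentiment": "negative", "_childDocuments_": [
        {"path": "3.posts.comments.replies", "text": "reply A",
         "_childDocuments_": [
             {"path": "4.posts.comments.replies.keywords", "text": "Trump"}]},
        {"path": "3.posts.comments.keywords", "text": "Hillary"}]},
    {"path": "2.posts.comments", "sentiment": "positive", "_childDocuments_": [
        {"path": "3.posts.comments.replies", "text": "reply B"}]},
]


def is_negative(comment):
    return comment["sentiment"] == "negative"


def is_reply(doc):
    return doc["path"] == "3.posts.comments.replies"


unfiltered = child_query(comments, is_negative)
replies_only = child_query(comments, is_negative, fq=is_reply)
print(len(unfiltered))    # 3: a reply, its keyword, and a comment keyword
print(len(replies_only))  # 1: only the reply
```
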

Returning all document's descendants: Block Join Child Query

Find all descendants of negative comments

q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative

Query Level 2, Results Level 3 and Level 4

{ "path":["4.posts.comments.replies.keywords"], "id":"17413-13550", "text":["Trump"]},
{ "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"], "id":"17413-66188"},
{ "path":["3.posts.comments.keywords"], "id":"12413-12487", "text":["Hillary"]},
{ "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"], "id":"12413-10998"}

Issue: no grouping by parent. What if we want to bring back the whole sub-structure?

Returning document with all descendants: ChildDocTransformer

Find all negative comments and return them with all their descendants

q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*]

Query Level 2, Result Level 2 + sub-hierarchy

{ "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"],
  "_childDocuments_":[
    { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]},
    { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]},
    { "path":["4.posts.comments.replies.keywords"], "text":["U.S."]},
    { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ...

Issue: the "sub-hierarchy" is flat

• Returns all descendant documents along with the queried document

• flattens the sub-hierarchy

• Workarounds:

• Reconstruct the document using path ("path":["3.posts.comments.replies"]) information in case you want the entire subtree (result post-processing)

• use childFilter in case you want a specific level
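The first workaround (rebuilding the subtree from the path information) might be sketched as below. This relies on two loudly stated assumptions: every document carries a "path" value that starts with its level number, and the flat list puts descendants before their ancestors, as in the sample output above. If the ordering in a real response differs, this approach would need parent ids instead. All function names are illustrative.

```python
def depth_of(doc):
    """Read the level prefix from a path such as "3.posts.comments.replies"."""
    path = doc["path"]
    if isinstance(path, list):  # multi-valued fields come back as lists
        path = path[0]
    return int(path.split(".", 1)[0])


def rebuild_subtree(parent):
    """Return a copy of `parent` whose flat _childDocuments_ list is re-nested."""
    pending = {}  # depth -> documents still waiting for their parent
    for doc in parent.get("_childDocuments_", []):
        node = {k: v for k, v in doc.items() if k != "_childDocuments_"}
        depth = depth_of(node)
        # Documents one level deeper that are still pending are attached to
        # this node (assumes descendants are listed before their ancestors).
        children = pending.pop(depth + 1, [])
        if children:
            node["_childDocuments_"] = children
        pending.setdefault(depth, []).append(node)
    root = {k: v for k, v in parent.items() if k != "_childDocuments_"}
    direct = pending.pop(depth_of(root) + 1, [])
    if direct:
        root["_childDocuments_"] = direct
    return root


flat_result = {
    "path": ["2.posts.comments"], "sentiment": ["negative"],
    "_childDocuments_": [
        {"path": ["4.posts.comments.replies.keywords"], "text": ["Trump"]},
        {"path": ["3.posts.comments.replies"], "text": ["LOL. ..."]},
        {"path": ["4.posts.comments.replies.keywords"], "text": ["U.S."]},
        {"path": ["3.posts.comments.replies"], "text": ["So then ..."]},
    ],
}
tree = rebuild_subtree(flat_result)
```

In this example `tree` holds two replies directly under the comment, each with its own keyword document nested below it.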

"This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document." (ChildDocTransformer cwiki)

Returning document with all descendants: ChildDocTransformer

Returning document with specific descendants: ChildDocTransformer + childFilter

Find all negative comments and return them with all replies to them

q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=path:3.posts.comments.replies]

Query Level 2:comments, Result Level 2:comments + Level 3:replies

{ "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"],
  "_childDocuments_":[
    { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]},
    { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ...

Returning document with queried descendants: ChildDocTransformer + childFilter

Find all negative comments and return them with all their descendants that mention Trump

q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump]

Query Level 2:comments, Result Level 2:comments + sub-levels

{ "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"],
  "_childDocuments_":[
    { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]},
    { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]} ] }, ...

Issue: cannot use boolean expressions in childFilter query

Cross-Level Querying Mechanisms:

• Block Join Parent Query

• Block Join Child Query

• ChildDocTransformer

Good points:

• overlapping & complementary features

• good capabilities of querying direct ancestors/descendants

• possible to query on siblings of different type

Drawbacks:

• need for data pre-processing for better querying flexibility

• limited support of querying over non-directly related branches (overcome with graphs?)

• flattening nested data (additional post-processing is needed for reconstruction)

Nested Document Querying Summary

Faceting on Nested Documents

• Solr allows faceting on nested documents!

• Two mechanisms for faceting:

• Faceting with JSON Facet API (since Solr 5.3)

• Block Join Faceting (since Solr 5.5)

Faceting on Nested Documents

Faceting on parents by descendants
JSON Facet API: Parent Domain

Count authors of the posts that received positive comments

q=path:2.posts.comments AND sentiment:positive&json.facet={ most_liked_authors : { type: terms, field: author, domain: { blockParent : "path:1.posts"}}}

Query Level 2, Facet Level 1

"most_liked_authors":{
  "buckets":[
    { "val":"Alex Tabarrok", "count":1},
    { "val":"Tyler Cowen", "count":1} ] }

Faceting on descendants by ancestors
JSON Facet API: Child Domain

Distribution of keywords that appear in comments and replies by the top-level posts

Query Level 1, Facet Descendant Levels

"top_keywords":{
  "buckets":[
    { "val":"hillary", "count":4, "counts_by_posts":2},
    { "val":"trump", "count":3, "counts_by_posts":2},
    { "val":"dnc", "count":1, "counts_by_posts":1},
    { "val":"obama", "count":2, "counts_by_posts":1},
    { "val":"u.s", "count":1, "counts_by_posts":1} ]}

q=path:1.posts&rows=0&json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}}
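Because the JSON Facet syntax gets bulky, it can help to build the facet as a data structure and serialize it. The sketch below uses only the Python standard library and reproduces the structure of the request above in strict JSON; the helper name `keyword_facet` and its parameters are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def keyword_facet(parent_filter, count_field):
    """Build the nested json.facet structure for keywords below `parent_filter`."""
    return {
        "filter_by_child_type": {
            "type": "query",
            "q": "path:*comments*keywords",
            "domain": {"blockChildren": parent_filter},
            "facet": {
                "top_keywords": {
                    "type": "terms",
                    "field": "text",
                    "sort": "counts_by_posts desc",
                    "facet": {"counts_by_posts": f"unique({count_field})"},
                }
            },
        }
    }


params = {
    "q": "path:1.posts",
    "rows": 0,
    "json.facet": json.dumps(keyword_facet("path:1.posts", "_root_")),
}
query_string = urlencode(params)  # ready to append to the request URL
print(query_string)
```
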

Faceting on descendants by top-level ancestor
JSON Facet API: Child Domain

Issue: by default, only the top-level ancestor is identified by a unique field ("_root_")

Faceting on descendants by intermediate ancestors
JSON Facet API: Child Domain + unique fields

Distribution of keywords that appear in comments and replies by the comments

q=path:2.posts.comments&rows=0&json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:2.posts.comments" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_comments desc", facet: { counts_by_comments: "unique(2.posts.comments-id)" }}}}}

Query Level 2, Facet Descendant Levels

At pre-processing, introduce unique fields for each level
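One way to introduce such per-level fields during pre-processing is sketched below. It assumes each document already has plain-string "path" and "id" fields and copies every ancestor's id (and the document's own) into a field named "<path>-id", matching the "2.posts.comments-id" field used in the query above. Whether the demo scripts do exactly this is not stated here, so treat the function as a hypothetical illustration.

```python
def add_level_ids(doc, inherited=None):
    """Copy each ancestor's id into its descendants under "<path>-id" (in place)."""
    inherited = dict(inherited or {})
    inherited[f"{doc['path']}-id"] = doc["id"]
    doc.update(inherited)
    for child in doc.get("_childDocuments_", []):
        add_level_ids(child, inherited)
    return doc


post = {
    "id": "p1", "path": "1.posts",
    "_childDocuments_": [{
        "id": "c1", "path": "2.posts.comments",
        "_childDocuments_": [{
            "id": "r1", "path": "3.posts.comments.replies",
            "_childDocuments_": [{
                "id": "k1", "path": "4.posts.comments.replies.keywords",
                "text": "Trump",
            }],
        }],
    }],
}
add_level_ids(post)
```

After this step the keyword document carries "2.posts.comments-id": "c1", so a unique() aggregation over that field counts distinct comments rather than distinct posts.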

"top_keywords":{
  "buckets":[
    { "val":"Hillary", "count":4, "counts_by_comments":3},
    { "val":"Trump", "count":3, "counts_by_comments":3},
    { "val":"DNC", "count":1, "counts_by_comments":1},
    { "val":"Obama", "count":2, "counts_by_comments":1},
    { "val":"U.S.", "count":1, "counts_by_comments":1} ]}

Now let's try the same using Block Join Faceting

• Experimental Feature

• Needs to be turned on explicitly in solrconfig.xml

More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting

Block Join Faceting

Faceting on descendants by ancestors #2: Block Join Faceting on Children Domain

bjqfacet?q={!parent which=path:2.posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text

Query Level 2, Facet Descendant Levels

"facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] }

Note: the request goes to the bjqfacet request handler instead of the standard query handler

Output Comparison

Block Join Facet vs. JSON Facet API


"top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]}


Block Join Facet: output is sorted in alphabetical order, and this cannot be changed.

JSON Facet API: the sort order is configurable:

facet:{ top_keywords : { ... sort: "counts_by_comments desc" }}}
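Since the Block Join Facet output order is fixed, one option is to re-sort it on the client. The sketch below takes the flat "term, count, term, count" list from the "facet_fields" output shown earlier and orders it by count; the function name `sort_facet_counts` and its `limit` parameter are illustrative.

```python
def sort_facet_counts(flat, limit=None):
    """Turn ["term", count, "term", count, ...] into pairs sorted by count, descending."""
    pairs = list(zip(flat[0::2], flat[1::2]))
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))  # ties stay alphabetical
    return pairs[:limit]


text_counts = ["dnc", 1, "hillary", 3, "obama", 1, "trump", 3, "u.s", 1]
top_three = sort_facet_counts(text_counts, limit=3)
print(top_three)
# [('hillary', 3), ('trump', 3), ('dnc', 1)]
```
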

JSON Facet API:

• Experimental, but the more mature and established of the two

• bulky JSON syntax

• faceting on children by non-top level ancestors requires introducing unique branch identifiers similar to "_root_" on each level

Block Join Facet:

• Experimental feature

• Lacks controls: sorting, limit...

• traditional query-style syntax

• proper handling of faceting on children by non-top level ancestors

Hierarchical Faceting Summary

• Returning hierarchical structure

• JSON facet rollups are in the works - SOLR-8998

• Graph querying might replace much of the cross-level querying functionality - no distributed support right now.

• There’s more but the community would love to have more people involved!

Community Roadmap

Thank you!

Anshum Gupta | @anshumgupta
Alisa Zhila
https://github.com/alisa-ipn/solr-revolution-2016-nested-demo
