nested and parent/child docs in elasticsearch

Nested & Parent/Child Docs

hidden gems in ElasticSearch

Anne Veling | ElasticSearch NL Meetup | February 26, 2013

agenda

Refworks Flow: Reference Manager for Researchers

Use of ElasticSearch in Flow

Use Case 1: Nested documents

Use Case 2: Parent/Child relations

Lessons Learned

introduction

Anne Veling, @anneveling

Self-employed contractor:

Software Architect

Agile process management

Performance optimization

Lucene/SOLR/ElasticSearch implementations & training

tech stack

architecture (diagram): Flow, PDF Pipeline, Mongo, ElasticSearch, Citation Authority

Citation Canonicalization

Use Case 1

Reference Canonicalization

We built a large Citation Authority index in ElasticSearch

With full, deduped metadata for a large portion of English scientific research

In the Reference Edit screen:

Try to find high-quality matches against a large index of canonical references of scientific articles

Based on known fields:

Title, possibly partial and incorrect

Author(s)

Other identifying fields: journal, year, …

{
  "query": {
    "bool": {
      "must": [
        { "text": { "title": "market elasticity" } },
        { "text": { "authors.lastName": "Russell" } },
        { "text": { "authors.firstNames": "G" } }
      ]
    }
  }
}

problem

Searching on a sub-document

Searching for all documents where

authors.lastName: “Russell”

authors.firstNames: “G”

also matches documents by “Jack Russell and Frederickson, G”
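The false match above can be reproduced without a search engine. The following is a minimal sketch, assuming that a plain (non-nested) array of author objects is indexed as one multi-valued field per sub-field on the parent document. The function names and the sample document are hypothetical and only serve the illustration.

```python
# Toy model (not ElasticSearch code): a plain object array is flattened
# into one multi-valued field per sub-field, losing the pairing between
# a given author's lastName and firstNames.

def flatten_authors(doc):
    """Collapse the authors array into per-field sets of lowercase values."""
    flat = {"authors.lastName": set(), "authors.firstNames": set()}
    for author in doc["authors"]:
        flat["authors.lastName"].add(author["lastName"].lower())
        flat["authors.firstNames"].add(author["firstNames"].lower())
    return flat


def flat_match(doc, last_name, first_names):
    """Match each field independently, as a flat bool/must query would."""
    flat = flatten_authors(doc)
    return (last_name.lower() in flat["authors.lastName"]
            and first_names.lower() in flat["authors.firstNames"])


def per_author_match(doc, last_name, first_names):
    """Match both fields on the SAME author, which is what we actually want."""
    return any(
        a["lastName"].lower() == last_name.lower()
        and a["firstNames"].lower() == first_names.lower()
        for a in doc["authors"]
    )


# Hypothetical document written by "Jack Russell and Frederickson, G".
doc = {
    "title": "hypothetical title",
    "authors": [
        {"lastName": "Russell", "firstNames": "Jack"},
        {"lastName": "Frederickson", "firstNames": "G"},
    ],
}

flat_result = flat_match(doc, "Russell", "G")              # unwanted hit
per_author_result = per_author_match(doc, "Russell", "G")  # correct miss
print(flat_result, per_author_result)
```

The flat match succeeds because “Russell” and “G” each occur somewhere in the document, while the per-author match correctly fails.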

We need a sub-document JOIN query…

Combined with other information on the parent document (title)

Oh noes!

We’re using a NoSQL database, so we can’t…

Can’t we?

lucene documents (diagram labels: term, query, term)

Lucene block indexing

Save “children” documents always right before their “parent” document

Requires you to write a BlockJoinQuery:

ParentsFilter

ChildQuery

ToParentBlockJoinQuery

This means: all children (and the parent!) need to be reindexed upon any change in any of them…
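The block layout can be pictured with a small simulation. This is a sketch of the idea only, not Lucene's API: rows are stored children-first with the parent closing each block, and a single forward pass joins child matches to the parent that follows them. All names and sample articles are hypothetical.

```python
# Toy simulation of block indexing (not Lucene code): every article is
# written as a contiguous block, children first, parent last.

def build_block_index(articles):
    """Return a flat list of (kind, fields) rows in block order."""
    rows = []
    for article in articles:
        for author in article["authors"]:
            rows.append(("child", author))
        rows.append(("parent", {"title": article["title"]}))
    return rows


def block_join(rows, child_pred, parent_pred):
    """Titles of parents matching parent_pred with a matching child in their block."""
    hits = []
    child_hit = False
    for kind, fields in rows:
        if kind == "child":
            child_hit = child_hit or child_pred(fields)
        else:
            if child_hit and parent_pred(fields):
                hits.append(fields["title"])
            child_hit = False  # the next row starts a new block
    return hits


articles = [
    {"title": "Market elasticity",
     "authors": [{"lastName": "Russell", "firstNames": "G"}]},
    {"title": "Elasticity of demand",
     "authors": [{"lastName": "Russell", "firstNames": "Jack"},
                 {"lastName": "Frederickson", "firstNames": "G"}]},
]

rows = build_block_index(articles)
hits = block_join(
    rows,
    child_pred=lambda a: a["lastName"] == "Russell" and a["firstNames"].startswith("G"),
    parent_pred=lambda p: "elasticity" in p["title"].lower(),
)
print(hits)
```

Because the join relies purely on row order, changing one child means rewriting the whole block, which is the reindexing cost mentioned above.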

"authors": {
  "properties": {
    "rawName": { "analyzer": "caName", "type": "string" },
    "lastName": { "analyzer": "caName", "type": "string" },
    "firstNames": { "null_value": "__NONAME", "analyzer": "caName", "type": "string" }
  },
  "type": "nested"
},
"title": { "analyzer": "caText", "type": "string" }

mapping

query

{
  "bool": {
    "must": [
      { "text": { "title": { "query": "market elasticity", "type": "phrase", "slop": 2 } } },
      { "bool": { "must": { "nested": {
        "query": { "bool": { "should": [
          { "bool": { "must": [
            { "text": { "lastName": { "query": "Russell", "type": "boolean" } } },
            { "bool": { "must": { "bool": { "should": [
              { "text": { "firstNames": { "query": "G", "type": "boolean" } } },
              { "prefix": { "firstNames": "g" } }
            ] } } } }
          ] } },
          { "filtered": {
            "query": { "text": { "lastName": { "query": "Russell", "type": "boolean", "operator": "AND" } } },
            "filter": { "missing": { "field": "firstNames" } }
          } }
        ] } },
        "path": "authors"
      } } } }
    ]
  }
}

(title:"market elasticity") AND ( authors: ( (lastName:"Russell") AND ( (firstNames:"G") OR (firstNames:"g*") OR (lastName:"Russell" AND NOT(firstNames)) ) ))

“nested”

Just setting the subdocument type to “nested” in mapping

Combine parent-query with “nested” query that specifies the path

Complex subcombination JOIN operations

Automatic hiding of “nested” subdocuments

This will increase your index size
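Combining a parent query with a “nested” query that specifies the path can be sketched as a small query builder. This is a simplified version of the query shown earlier, using the same `text` and `nested` clauses and the same bare field names inside the nested part. The helper name `nested_author_query` is hypothetical and not part of any client library.

```python
import json


def nested_author_query(title, last_name, first_names, path="authors"):
    """Build a bool query: title on the parent, both name fields on ONE author."""
    author_clause = {
        "bool": {
            "must": [
                {"text": {"lastName": last_name}},
                {"text": {"firstNames": first_names}},
            ]
        }
    }
    return {
        "bool": {
            "must": [
                {"text": {"title": title}},                          # parent field
                {"nested": {"path": path, "query": author_clause}},  # sub-document join
            ]
        }
    }


query = nested_author_query("market elasticity", "Russell", "G")
print(json.dumps(query, indent=2))
```

Both author conditions sit inside the single `nested` clause, so they have to hold for the same author sub-document.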

“nested”

Efficient!

ElasticSearch handles document updates

Child where-clauses are handled INSIDE the parent query’s docEnum

Children are sharded with their parents => locality!

Facet counts (on parent) still correct!

Limitations:

Combinations of nested subdocuments with other queries

Like “dis_max”, or “text”

“authors.lastName” in other queries is not automatically rewritten to a “nested” subquery

Multipage Indexing

Use Case 2

architecture (diagram): Flow, PDF Pipeline, Mongo, ElasticSearch, Citation Authority; a doc with its pages stored on S3

problem

How to index both Doc metadata and Pages text

Doc in Flow app

Pages only in PDF pipeline and on S3

Docs updated frequently, on the Flow app

Reindexing a Page would require downloading its text content from S3…

Nested Docs?

No; too slow for updates here…

solution

Parent/Child documents in ElasticSearch!

Store parent type on children type mapping

To index a child, specify the parent ID

Stored as “_parent” field on the child

Query

Combine parent query with “has_child” child-query

"itemtext": {
  "properties": {
    "text": { "analyzer": "pqdText", "type": "string" }
  },
  "_parent": { "type": "item" }
}
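The parent/child arrangement can be modelled in a few lines to show why it suits this use case. This is a toy in-memory sketch, not ElasticSearch code: parents and children live in separate stores, each child carries a `_parent` reference, and a `has_child`-style lookup returns parent IDs. The class and method names are hypothetical.

```python
# Toy in-memory model of parent/child documents (not ElasticSearch code).

class ToyParentChildIndex:
    def __init__(self):
        self.items = {}      # parent id -> metadata fields
        self.itemtexts = {}  # child id -> {"_parent": parent id, "text": page text}

    def index_item(self, item_id, fields):
        """(Re)index a parent; no child document is touched."""
        self.items[item_id] = fields

    def index_itemtext(self, child_id, parent_id, text):
        """Index a child, specifying the ID of its parent."""
        self.itemtexts[child_id] = {"_parent": parent_id, "text": text}

    def has_child(self, term):
        """Return sorted IDs of parents having at least one child containing term."""
        parent_ids = {
            child["_parent"]
            for child in self.itemtexts.values()
            if term.lower() in child["text"].lower().split()
        }
        return sorted(pid for pid in parent_ids if pid in self.items)


index = ToyParentChildIndex()
index.index_item("item-1", {"title": "Draft"})
index.index_itemtext("page-1", "item-1", "Price elasticity of demand")
index.index_itemtext("page-2", "item-1", "References")

# Frequent metadata update on the parent: the page texts stay as they are.
index.index_item("item-1", {"title": "Final title"})

matches = index.has_child("elasticity")
print(matches)
```

Updating the parent never requires the page text, which is the property that nested documents lack in this scenario.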

{
  "bool": {
    "must": [
      { "bool": { "should": [
        { "query_string": {
          "query": "elasticity",
          "fields": [
            "item.reference.title^2.0",
            "item.reference.authors.lastName^1.5",
            "item.reference.authors.firstNames",
            "item.reference.authors.rawName",
            "item.reference.contributors.lastName",
            "item.reference.contributors.firstNames",
            "item.reference.contributors.rawName",
            "item.reference.abstr",
            "item.reference.publication.title^1.5",
            "item.reference.publication.issn",
            "item.reference.publication.isbn",
            "item.reference.publication.abbrev",
            "item.reference.series.editors.lastName",
            "item.reference.series.editors.firstNames",
            "item.reference.series.rawName",
            "item.reference.series.title",
            "item.reference.publisher.name",
            "item.reference.publisher.location",
            "item.reference.publisher.department",
            "item.reference.userNotes",
            "item.annotations.note^0.5"
          ],
          "use_dis_max": true,
          "default_operator": "and"
        } },
        { "has_child": {
          "query": { "text": { "text": { "query": "elasticity", "type": "boolean", "operator": "AND" } } },
          "type": "itemtext",
          "boost": 0.1
        } }
      ] } },
      { "term": { "userId": "user:50a3bd090364f635f24c713c" } }
    ]
  }
}

NOT SO SURE WHO IS PARENT, WHO IS CHILD

IN PARENT-CHILD RELATIONSHIP

conclusions

Parent/Child ‘remote key’ solution in ElasticSearch

Easy connection of two types of documents with separate update cycles

Complex JOIN queries possible, combining parent & child fields

Slower than “nested”

Locality principle: children are always sharded with their parent

Limitations:

has_child filter returns only parents, cannot return child data

But: there is a has_parent filter

ElasticSearch caches the parent-child ID table in heap…
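The locality principle can be illustrated with a routing sketch. It assumes only that a document's shard is derived from a hash of its routing value and that a child is routed by its parent's ID. The hash below (`zlib.crc32`) is a stand-in chosen for the example, not the function ElasticSearch actually uses, and the helper names are hypothetical.

```python
import zlib


def shard_for(routing_value, num_shards):
    """Pick a shard from a routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shard_for_parent(parent_id, num_shards):
    # A parent is routed by its own ID.
    return shard_for(parent_id, num_shards)


def shard_for_child(parent_id, num_shards):
    # A child is routed by its PARENT's ID, not by its own ID,
    # so it always lands on the same shard as its parent.
    return shard_for(parent_id, num_shards)


num_shards = 5
parent_shard = shard_for_parent("item-1", num_shards)
child_shards = [shard_for_child("item-1", num_shards) for _ in ("page-1", "page-2", "page-3")]
print(parent_shard, child_shards)
```

With parent and children on one shard, the join can be resolved locally without cross-shard traffic.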

conclusions

Complex join-style queries can be done with ElasticSearch

Easily

Efficiently

Use “nested” types

If data can be duplicated

Very efficient

Use “parent/child” types

For truly independently updatable documents

SELECT * FROM ARTICLES
LEFT JOIN AUTHORS ON AUTHORS.ARTICLEID = ARTICLES.ID
WHERE ARTICLES.TITLE MATCHES "market elasticity"
  AND AUTHORS.LASTNAME MATCHES "Russell"
  AND AUTHORS.FIRSTNAME MATCHES "G"

conclusions

ElasticSearch rocks

Hides the complex mapping from JSON documents to Lucene’s key/value model

Allows you to easily use more of Lucene greatness

So you can focus on actual queries and use cases

NoSql does not mean NoJoins

It just forces you to model in such a way that joins will be efficient


@anneveling

ElasticSearch “nested” types: the best thing since sliced bread

thank you
