nested and parent/child docs in elasticsearch
DESCRIPTION
A key part of the architecture of RefWorks Flow, a new document workflow tool for researchers, is an ElasticSearch cluster used for citation canonicalization. We will present our findings on how to use the "nested" type and parent-child relations in ElasticSearch to run complex where-clause queries efficiently.
TRANSCRIPT
Nested & Parent/Child Docs
hidden gems in ElasticSearch
Anne Veling | ElasticSearch NL Meetup | February 26, 2013
agenda
RefWorks Flow: Reference Manager for Researchers
Use of ElasticSearch in Flow
Use Case 1: Nested documents
Use Case 2: Parent/Child relations
Lessons Learned
introduction
Anne Veling, @anneveling
Self-employed contractor
Software Architect
Agile process management
Performance optimization
Lucene/SOLR/ElasticSearch implementations & training
tech stack
architecture
Flow
PDF Pipeline
Mongo
ElasticSearch
Citation Authority
Citation Canonicalization
Use Case 1
Reference Canonicalization
We built a large Citation Authority index in ElasticSearch
With full, deduped metadata for a large portion of English scientific research
In the Reference Edit screen
Try to find high quality matches to a large index of canonical references of scientific articles
Based on known fields
Title, possibly partial and incorrect
Author(s)
Other identifying fields: journal, year, …
{
  "query": {
    "bool": {
      "must": [
        { "text": { "title": "market elasticity" } },
        { "text": { "authors.lastName": "Russell" } },
        { "text": { "authors.firstNames": "G" } }
      ]
    }
  }
}
problem
Searching on a sub-document
Searching for all documents where
authors.lastName: “Russell”
authors.firstNames: “G”
Also matches documents by “Jack Russell and Frederickson, G”
We need a sub-document JOIN query…
Combined with other information on the parent document (title)
Oh noes!
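The false positive above can be reproduced with a small toy model. This is only a sketch of the idea, assuming that a plain (non-nested) array of author objects is flattened into one multi-valued field per sub-field; the helper names (`flatten`, `flat_match`, `per_author_match`) are made up for illustration and are not ElasticSearch APIs.

```python
# Toy model of how an array of author objects behaves when it is indexed
# as a plain (non-nested) object field: every sub-field becomes one
# multi-valued field, and the link between values is lost.

def flatten(doc):
    """Collapse the authors list into per-field sets of lowercase tokens."""
    flat = {"authors.lastName": set(), "authors.firstNames": set()}
    for author in doc["authors"]:
        flat["authors.lastName"].update(author["lastName"].lower().split())
        flat["authors.firstNames"].update(author["firstNames"].lower().split())
    return flat

def flat_match(doc, last_name, first_name):
    """Match the way the flat bool/must query does: field by field."""
    flat = flatten(doc)
    return (last_name.lower() in flat["authors.lastName"]
            and first_name.lower() in flat["authors.firstNames"])

def per_author_match(doc, last_name, first_name):
    """Match only when one single author satisfies both conditions."""
    return any(
        last_name.lower() in a["lastName"].lower().split()
        and first_name.lower() in a["firstNames"].lower().split()
        for a in doc["authors"]
    )

doc = {
    "title": "market elasticity",
    "authors": [
        {"lastName": "Russell", "firstNames": "Jack"},
        {"lastName": "Frederickson", "firstNames": "G"},
    ],
}

print(flat_match(doc, "Russell", "G"))        # True: the false positive
print(per_author_match(doc, "Russell", "G"))  # False: the behaviour we want
```

The second function is the "sub-document JOIN" semantics the rest of the talk is after.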
We’re using a NoSQL database, so we can’t…
Can’t we?
[Diagram labels: lucene documents; term; query; term]
Lucene block indexing
Save “children” documents always right before their “parent” document
Requires you to write
BlockJoinQuery
ParentsFilter
ChildQuery
ToParentBlockJoinQuery
This means: all children (and the parent!) need to be reindexed upon any change in any of them…
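The block layout can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Lucene's implementation: the `is_parent` flag stands in for the parents filter, and `build_block` and `to_parent_join` are hypothetical helpers.

```python
# Toy simulation of block indexing: children are stored directly before
# their parent, so a child hit can be joined to its parent by scanning
# forward to the next parent document. The document layout is illustrative.

def build_block(parent, children):
    """Return the docs of one block: all children first, parent last."""
    block = [dict(child, is_parent=False) for child in children]
    block.append(dict(parent, is_parent=True))
    return block

def to_parent_join(index, child_predicate):
    """Return parents that have at least one child matching the predicate."""
    hits = []
    child_matched = False
    for doc in index:
        if doc["is_parent"]:
            if child_matched:
                hits.append(doc)
            child_matched = False  # a new block starts after each parent
        elif child_predicate(doc):
            child_matched = True
    return hits

index = []
index += build_block(
    {"title": "market elasticity"},
    [{"lastName": "Russell", "firstNames": "Jack"},
     {"lastName": "Frederickson", "firstNames": "G"}],
)
index += build_block(
    {"title": "price elasticity"},
    [{"lastName": "Russell", "firstNames": "G"}],
)

def is_g_russell(author):
    """Both conditions must hold on the same child document."""
    return author["lastName"] == "Russell" and author["firstNames"].startswith("G")

matches = to_parent_join(index, is_g_russell)
print([doc["title"] for doc in matches])
```

Because the join relies on physical adjacency, changing one child means rewriting the whole block, which is the reindexing cost noted above.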
mapping
authors: {
  type: "nested",
  properties: {
    rawName: { type: "string", analyzer: "caName" },
    lastName: { type: "string", analyzer: "caName" },
    firstNames: { type: "string", analyzer: "caName", null_value: "__NONAME" }
  }
},
title: { type: "string", analyzer: "caText" }
query
{ "bool" : { "must" : [ { "text" : { "title" : { "query" : "market elasticity", "type" : "phrase", "slop" : 2 } } }, { "bool" : { "must" : { "nested" : { "query" : { "bool" : { "should" : [ { "bool" : { "must" : [ { "text" : { "lastName" : { "query" : "Russell", "type" : "boolean" } } }, { "bool" : { "must" : { "bool" : { "should" : [ { "text" : { "firstNames" : { "query" : "G", "type" : "boolean" } } }, { "prefix" : { "firstNames" : "g" } } ] } } } } ] } },
{ "filtered" : { "query" : { "text" : { "lastName" : { "query" : "Russell", "type" : "boolean", "operator" : "AND" } } }, "filter" : { "missing" : { "field" : "firstNames" } } } } ] } }, "path" : "authors" } } } } ] }}
(title:"market elasticity") AND ( authors: ( (lastName:"Russell") AND ( (firstNames:"G") OR (firstNames:"g*") OR (lastName:"Russell" AND NOT(firstNames)) ) ))
“nested”
Just setting the subdocument type to “nested” in mapping
Combine parent-query with “nested” query that specifies the path
Complex subcombination JOIN operations
Automatic hiding of “nested” subdocuments
This will increase your index size
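A much-reduced version of the query above can be assembled as a plain dictionary. This is a minimal sketch: `nested_author_query` is a hypothetical helper, it keeps only the title clause and one author clause, and it reuses the query types shown on the slides ("text", "prefix", "nested"), which may be named differently in other ElasticSearch versions.

```python
import json

def nested_author_query(title, last_name, first_initial):
    """Build a simplified parent + nested query as a plain dict."""
    # Both conditions sit inside one nested query, so they must hold
    # for the same author sub-document.
    author_clause = {
        "bool": {
            "must": [
                {"text": {"lastName": last_name}},
                {"prefix": {"firstNames": first_initial.lower()}},
            ]
        }
    }
    return {
        "bool": {
            "must": [
                # Condition on the parent document.
                {"text": {"title": title}},
                # Condition on the nested sub-documents under "authors".
                {"nested": {"path": "authors", "query": author_clause}},
            ]
        }
    }

query = nested_author_query("market elasticity", "Russell", "G")
print(json.dumps(query, indent=2))
```

The parent clause and the nested clause are siblings in the outer `bool`, which is the "combine parent-query with nested query" pattern from the slide.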
“nested”
Efficient!
ElasticSearch handles document updates
Child-whereclauses handled INSIDE parent query docEnum
Children are sharded with their parents => locality!
Facet counts (on parent) still correct!
Limitations
Combinations of nested subdocuments with other queries
Like “dis_max”, or “text”
No automatic recognition of “authors.lastName” in other queries to a “nested” subquery
Multipage Indexing
Use Case 2
architecture
Flow
PDF Pipeline
Mongo
ElasticSearch
Citation Authority
doc
page
page
page
S3
problem
How to index both Doc metadata and Pages text
Doc in Flow app
Pages only in PDF pipeline and on S3
Docs updated frequently, on the Flow app
Reindex Page would require download of text content from S3…
Nested Docs?
No; too slow for updates here…
solution
Parent/Child documents in ElasticSearch!
Store parent type on children type mapping
To index a child, specify the parent ID
Stored as “_parent” field on the child
Query
Combine parent query with “has_child” child-query
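The two steps above (index a child with its parent ID, then query parents through "has_child") can be sketched as follows. The host, the index name `flow`, the document IDs and both helper functions are assumptions made up for illustration; the sketch only builds the URL and the query body and does not talk to a cluster. It assumes the parent ID is passed as a `parent` URL parameter at index time.

```python
from urllib.parse import urlencode, quote

def child_index_url(base_url, index, child_type, child_id, parent_id):
    """URL for indexing one child document under a given parent ID."""
    path = "/".join(quote(part, safe="") for part in (index, child_type, child_id))
    return f"{base_url}/{path}?{urlencode({'parent': parent_id})}"

def has_child_query(parent_query, child_type, child_query):
    """Match parents via their own fields OR via a matching child."""
    return {
        "bool": {
            "should": [
                parent_query,
                {"has_child": {"type": child_type, "query": child_query}},
            ]
        }
    }

# Hypothetical host, index and IDs.
url = child_index_url("http://localhost:9200", "flow", "itemtext",
                      "item-1-page-3", "item-1")
print(url)

query = has_child_query(
    {"query_string": {"query": "elasticity",
                      "fields": ["item.reference.title^2.0"]}},
    "itemtext",
    {"text": {"text": "elasticity"}},
)
print(query)
```

Page text can then be indexed from the PDF pipeline without touching the parent document that the Flow app updates.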
itemtext: {
  properties: {
    text: { analyzer: "pqdText", type: "string" }
  },
  _parent: { type: "item" }
}
{ "bool" : { "must" : [ { "bool" : { "should" : [ { "query_string" : { "query" : "elasticity", "fields" : [ "item.reference.title^2.0", "item.reference.authors.lastName^1.5", "item.reference.authors.firstNames", "item.reference.authors.rawName", "item.reference.contributors.lastName", "item.reference.contributors.firstNames", "item.reference.contributors.rawName", "item.reference.abstr", "item.reference.publication.title^1.5", "item.reference.publication.issn", "item.reference.publication.isbn", "item.reference.publication.abbrev", "item.reference.series.editors.lastName", "item.reference.series.editors.firstNames", "item.reference.series.rawName", "item.reference.series.title", "item.reference.publisher.name", "item.reference.publisher.location", "item.reference.publisher.department", "item.reference.userNotes", "item.annotations.note^0.5" ], "use_dis_max" : true, "default_operator" : "and" } }, { "has_child" : { "query" : { "text" : { "text" : { "query" : "elasticity", "type" : "boolean", "operator" : "AND" } } }, "type" : "itemtext", "boost" : 0.1 } } ] } }, { "term" : { "userId" : "user:50a3bd090364f635f24c713c" } } ] }}
NOT SO SURE WHO IS PARENT, WHO IS CHILD
IN PARENT-CHILD RELATIONSHIP
conclusions
Parent/Child ‘remote key’ solution in ElasticSearch
Easy connection of two types of documents with separate update cycles
Complex JOIN queries possible, combining parent & child fields
Slower than “nested”
Locality principle: Children always sharded with parent
Limitations
Has_child filter returns only parents, cannot return child data
But: has_parent filter
ElasticSearch caches the parent-child ID table in heap…
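The locality principle can be pictured with a toy routing function. This is only an illustration: CRC32 stands in for whatever hash ElasticSearch actually uses, and the shard count and IDs are made up. The point is that a child is placed by its parent's ID, not by its own.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count, for illustration only

def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value (CRC32 as a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

def shard_for_parent(parent_id):
    """A parent is routed by its own ID."""
    return shard_for(parent_id)

def shard_for_child(child_id, parent_id):
    """A child is routed by its parent's ID; its own ID is not used."""
    return shard_for(parent_id)

parent_id = "item-1"
pages = [f"{parent_id}-page-{n}" for n in range(1, 4)]
print({page: shard_for_child(page, parent_id) for page in pages})
```

Since every page lands on the shard of its item, the has_child join never has to cross shards.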
conclusions
Complex join-style queries can be done with ElasticSearch
Easily
Efficiently
Use “nested” types
If data can be duplicated
Very efficient
Use “parent/child” types
For truly independently updatable documents
SELECT * FROM ARTICLES
LEFT JOIN AUTHORS ON AUTHORS.ARTICLEID = ARTICLES.ID
WHERE ARTICLES.TITLE MATCHES "market elasticity"
  AND AUTHORS.LASTNAME MATCHES "Russell"
  AND AUTHORS.FIRSTNAME MATCHES "G"
conclusions
ElasticSearch rocks
Hides the complex mapping from JSON documents to the Lucene key/value model
Allows you to easily use more of Lucene greatness
So you can focus on actual queries and use cases
NoSql does not mean NoJoins
It just forces you to model your data in such a way that joins will be efficient
@anneveling
ElasticSearch “nested” types: the best thing since sliced bread
thank you