nested and parent/child docs in elasticsearch

Nested & Parent/Child Docs hidden gems in ElasticSearch Anne Veling | ElasticSearch NL Meetup | February 26, 2013

Upload: beyondtrees

Post on 15-Jan-2015

DESCRIPTION

A key part of the architecture of RefWorks Flow, a new document workflow tool for researchers, is an ElasticSearch cluster used for citation canonicalization. We will present our findings on how to use the "nested" type and parent-child relations in ElasticSearch to run complex where-clause queries efficiently.

TRANSCRIPT

Page 1: Nested and Parent/Child Docs in ElasticSearch

Nested & Parent/Child Docs

hidden gems in ElasticSearch

Anne Veling | ElasticSearch NL Meetup | February 26, 2013

Page 2: Nested and Parent/Child Docs in ElasticSearch

agenda

RefWorks Flow: Reference Manager for Researchers

Use of ElasticSearch in Flow

Use Case 1: Nested documents

Use Case 2: Parent/Child relations

Lessons Learned

Page 3: Nested and Parent/Child Docs in ElasticSearch

introduction

Anne Veling, @anneveling

Self-employed contractor

Software Architect

Agile process management

Performance optimization

Lucene/SOLR/ElasticSearch implementations & training

Page 4: Nested and Parent/Child Docs in ElasticSearch
Page 5: Nested and Parent/Child Docs in ElasticSearch
Page 6: Nested and Parent/Child Docs in ElasticSearch

tech stack

Page 7: Nested and Parent/Child Docs in ElasticSearch

architecture

Flow

PDF Pipeline

Mongo

ElasticSearch

Citation Authority

Page 8: Nested and Parent/Child Docs in ElasticSearch

Citation Canonicalization

Use Case 1

Page 9: Nested and Parent/Child Docs in ElasticSearch

Reference Canonicalization

We built a large Citation Authority index in ElasticSearch

With full, deduped metadata for a large portion of English scientific research

In the Reference Edit screen

Try to find high-quality matches in a large index of canonical references of scientific articles

Based on known fields

Title, possibly partial and incorrect

Author(s)

Other identifying fields: journal, year, …

Page 10: Nested and Parent/Child Docs in ElasticSearch

{
  "query": {
    "bool": {
      "must": [
        { "text": { "title": "market elasticity" } },
        { "text": { "authors.lastName": "Russell" } },
        { "text": { "authors.firstNames": "G" } }
      ]
    }
  }
}

Page 11: Nested and Parent/Child Docs in ElasticSearch

problem

Searching on a sub-document

Searching for all documents where

authors.lastName: “Russell”

authors.firstNames: “G”

Also matches documents by “Jack Russell and Frederickson, G”

We need a sub-document JOIN query…

Combined with other information on the parent document (title)

Oh noes!

We’re using a NoSQL database, so we can’t…

Can’t we?
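The false match on this slide can be reproduced with a small in-memory model. This is only a sketch, assuming that a plain (non-nested) array of author objects ends up as one flat set of values per field, and simplifying analysis to lowercase whitespace tokens. The function names `flatten`, `naive_match` and `nested_match` are illustrative and not part of ElasticSearch.

```python
from collections import defaultdict

def flatten(doc):
    """Flatten an array of author objects into field -> set of tokens."""
    fields = defaultdict(set)
    for author in doc["authors"]:
        for key, value in author.items():
            for token in value.lower().replace(",", " ").split():
                fields["authors." + key].add(token)
    return fields

def naive_match(doc, last_name, first_name):
    """Match each field independently, as the flat bool/must query does."""
    fields = flatten(doc)
    return (last_name.lower() in fields["authors.lastName"]
            and first_name.lower() in fields["authors.firstNames"])

def nested_match(doc, last_name, first_name):
    """Require both conditions to hold within the same author object."""
    return any(
        last_name.lower() in a["lastName"].lower().split()
        and first_name.lower() in a["firstNames"].lower().split()
        for a in doc["authors"]
    )

doc = {
    "title": "market elasticity",
    "authors": [
        {"lastName": "Russell", "firstNames": "Jack"},
        {"lastName": "Frederickson", "firstNames": "G"},
    ],
}

print(naive_match(doc, "Russell", "G"))   # True: a false positive
print(nested_match(doc, "Russell", "G"))  # False
```

The flat match succeeds because "Russell" and "G" both occur somewhere in the document, even though no single author carries both values.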

Page 12: Nested and Parent/Child Docs in ElasticSearch

[Diagram: Lucene documents, term query]

Lucene block indexing

Save “children” documents always right before their “parent” document

Requires you to write

BlockJoinQuery

ParentsFilter

ChildQuery

ToParentBlockJoinQuery

This means: all children (and the parent!) need to be reindexed upon any change in them…
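The block layout can be illustrated with a toy model. This is a sketch of the idea only, not Lucene's implementation (the real classes are Java): children are appended right before their parent, and a single forward pass over the segment reports each parent whose block contains a matching child. The helper names `index_block` and `to_parent_block_join` are hypothetical.

```python
def index_block(segment, parent, children):
    """Append the children first, then the parent, as one contiguous block."""
    for child in children:
        segment.append({"kind": "child", **child})
    segment.append({"kind": "parent", **parent})

def to_parent_block_join(segment, child_predicate):
    """Return parents whose block contains at least one matching child."""
    hits = []
    block_matched = False
    for doc in segment:
        if doc["kind"] == "child":
            block_matched = block_matched or child_predicate(doc)
        else:
            # A parent closes the block of children stored right before it.
            if block_matched:
                hits.append(doc)
            block_matched = False
    return hits

segment = []
index_block(segment, {"title": "market elasticity"},
            [{"lastName": "Russell", "firstNames": "Jack"},
             {"lastName": "Frederickson", "firstNames": "G"}])
index_block(segment, {"title": "price elasticity"},
            [{"lastName": "Russell", "firstNames": "G"}])

hits = to_parent_block_join(
    segment,
    lambda c: c["lastName"] == "Russell" and c["firstNames"] == "G")
print([h["title"] for h in hits])  # ['price elasticity']
```

Because the join relies purely on document order, changing one child means rewriting the whole block, which is the reindexing cost the slide points out.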

Page 13: Nested and Parent/Child Docs in ElasticSearch
Page 14: Nested and Parent/Child Docs in ElasticSearch

mapping

authors: {
  properties: {
    rawName: {
      analyzer: "caName",
      type: "string"
    },
    lastName: {
      analyzer: "caName",
      type: "string"
    },
    firstNames: {
      null_value: "__NONAME",
      analyzer: "caName",
      type: "string"
    }
  },
  type: "nested"
},
title: {
  analyzer: "caText",
  type: "string"
}

Page 15: Nested and Parent/Child Docs in ElasticSearch

query

{
  "bool" : {
    "must" : [
      { "text" : { "title" : { "query" : "market elasticity", "type" : "phrase", "slop" : 2 } } },
      { "bool" : { "must" : { "nested" : {
        "query" : { "bool" : { "should" : [
          { "bool" : { "must" : [
            { "text" : { "lastName" : { "query" : "Russell", "type" : "boolean" } } },
            { "bool" : { "must" : { "bool" : { "should" : [
              { "text" : { "firstNames" : { "query" : "G", "type" : "boolean" } } },
              { "prefix" : { "firstNames" : "g" } }
            ] } } } }
          ] } },
          { "filtered" : {
            "query" : { "text" : { "lastName" : { "query" : "Russell", "type" : "boolean", "operator" : "AND" } } },
            "filter" : { "missing" : { "field" : "firstNames" } }
          } }
        ] } },
        "path" : "authors"
      } } } }
    ]
  }
}

(title:"market elasticity") AND (
  authors: (
    (lastName:"Russell") AND (
      (firstNames:"G") OR
      (firstNames:"g*") OR
      (lastName:"Russell" AND NOT(firstNames))
    )
  )
)
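A query of this shape is easier to maintain when it is assembled from small parts. The sketch below builds a simplified version as Python dictionaries, using the same query types as the slide ("text", "prefix", "nested"). It omits the branch for authors without first names, and the helper names `author_clause` and `citation_query` are hypothetical.

```python
import json

def author_clause(last_name, first_initial, path="authors"):
    """Build a nested clause: last name AND (first name OR prefix)."""
    first_name_options = {"bool": {"should": [
        {"text": {"firstNames": {"query": first_initial, "type": "boolean"}}},
        {"prefix": {"firstNames": first_initial.lower()}},
    ]}}
    same_author = {"bool": {"must": [
        {"text": {"lastName": {"query": last_name, "type": "boolean"}}},
        first_name_options,
    ]}}
    # Everything under "query" is evaluated against one author at a time.
    return {"nested": {"path": path, "query": same_author}}

def citation_query(title, last_name, first_initial):
    """Combine a phrase query on the parent title with the nested clause."""
    return {"bool": {"must": [
        {"text": {"title": {"query": title, "type": "phrase", "slop": 2}}},
        author_clause(last_name, first_initial),
    ]}}

query = citation_query("market elasticity", "Russell", "G")
print(json.dumps(query, indent=2))
```

The title condition stays on the parent document, while both author conditions sit inside the single "nested" clause so they must be satisfied by the same author.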

Page 16: Nested and Parent/Child Docs in ElasticSearch
Page 17: Nested and Parent/Child Docs in ElasticSearch

“nested”

Just setting the subdocument type to “nested” in mapping

Combine parent-query with “nested” query that specifies the path

Complex subcombination JOIN operations

Automatic hiding of “nested” subdocuments

This will increase your index size

Page 18: Nested and Parent/Child Docs in ElasticSearch

“nested”

Efficient!

ElasticSearch handles document updates

Child where-clauses handled INSIDE the parent query docEnum

Children are sharded with their parents => locality!

Facet counts (on parent) still correct!

Limitations

Combinations of nested subdocuments with other queries

Like “dis_max”, or “text”

No automatic rewriting of “authors.lastName” in other queries into a “nested” subquery

Page 19: Nested and Parent/Child Docs in ElasticSearch
Page 20: Nested and Parent/Child Docs in ElasticSearch

Multipage Indexing

Use Case 2

Page 21: Nested and Parent/Child Docs in ElasticSearch

architecture

Flow

PDF Pipeline

Mongo

ElasticSearch

Citation Authority

doc

page

page

page

S3

Page 22: Nested and Parent/Child Docs in ElasticSearch

problem

How to index both Doc metadata and Pages text

Doc in Flow app

Pages only in PDF pipeline and on S3

Docs updated frequently, on the Flow app

Reindexing a Page would require downloading its text content from S3…

Nested Docs?

No; too slow for updates here…

Page 23: Nested and Parent/Child Docs in ElasticSearch

solution

Parent/Child documents in ElasticSearch!

Store parent type on children type mapping

To index a child, specify the parent ID

Stored as “_parent” field on the child

Query

Combine parent query with “has_child” child-query
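Specifying the parent ID when indexing a child could look roughly like the sketch below. It assumes the `parent` request parameter that the index API accepted in ElasticSearch versions of that period. The host, the index name "flow" and the document IDs are made up for illustration, and nothing is sent over the network.

```python
import json
from urllib.parse import quote, urlencode

def child_index_request(base_url, index, child_type, child_id, parent_id, source):
    """Build the URL and JSON body for indexing a child under parent_id."""
    path = "/".join(quote(part, safe="") for part in (index, child_type, child_id))
    # The parent ID travels as a request parameter, not in the document body.
    url = f"{base_url}/{path}?{urlencode({'parent': parent_id})}"
    return url, json.dumps(source)

url, body = child_index_request(
    "http://localhost:9200", "flow", "itemtext", "item42-page1",
    parent_id="item42", source={"text": "market elasticity ..."})
print(url)
```

The page text can be indexed this way once by the PDF pipeline, while the parent "item" document is reindexed independently whenever its metadata changes.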

Page 24: Nested and Parent/Child Docs in ElasticSearch

itemtext: {
  properties: {
    text: {
      analyzer: "pqdText",
      type: "string"
    }
  },
  _parent: {
    type: "item"
  }
}

Page 25: Nested and Parent/Child Docs in ElasticSearch

{
  "bool" : {
    "must" : [
      { "bool" : { "should" : [
        { "query_string" : {
          "query" : "elasticity",
          "fields" : [
            "item.reference.title^2.0",
            "item.reference.authors.lastName^1.5",
            "item.reference.authors.firstNames",
            "item.reference.authors.rawName",
            "item.reference.contributors.lastName",
            "item.reference.contributors.firstNames",
            "item.reference.contributors.rawName",
            "item.reference.abstr",
            "item.reference.publication.title^1.5",
            "item.reference.publication.issn",
            "item.reference.publication.isbn",
            "item.reference.publication.abbrev",
            "item.reference.series.editors.lastName",
            "item.reference.series.editors.firstNames",
            "item.reference.series.rawName",
            "item.reference.series.title",
            "item.reference.publisher.name",
            "item.reference.publisher.location",
            "item.reference.publisher.department",
            "item.reference.userNotes",
            "item.annotations.note^0.5"
          ],
          "use_dis_max" : true,
          "default_operator" : "and"
        } },
        { "has_child" : {
          "query" : { "text" : { "text" : { "query" : "elasticity", "type" : "boolean", "operator" : "AND" } } },
          "type" : "itemtext",
          "boost" : 0.1
        } }
      ] } },
      { "term" : { "userId" : "user:50a3bd090364f635f24c713c" } }
    ]
  }
}

Page 26: Nested and Parent/Child Docs in ElasticSearch

NOT SO SURE WHO IS PARENT, WHO IS CHILD

IN PARENT-CHILD RELATIONSHIP

Page 27: Nested and Parent/Child Docs in ElasticSearch

conclusions

Parent/Child ‘remote key’ solution in ElasticSearch

Easy connection of two types of documents with separate update cycles

Complex JOIN queries possible, combining parent & child fields

Slower than “nested”

Locality principle: Children always sharded with parent

Limitations

has_child filter returns only parents, cannot return child data

But: has_parent filter

ElasticSearch caches parent-child ID table in heap…
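The locality principle and the has_child behaviour can be modelled in a few lines. This is a toy simulation, not ElasticSearch code: `shard_for` uses CRC32 as a stand-in for the real routing hash, the shard count is arbitrary, and all function names are hypothetical.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(routing_key):
    """Pick a shard from a routing key (stand-in for the real hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % NUM_SHARDS

shards = [{"item": {}, "itemtext": []} for _ in range(NUM_SHARDS)]

def index_parent(item_id, source):
    shards[shard_for(item_id)]["item"][item_id] = source

def index_child(parent_id, source):
    # Children are routed by the parent ID, so they land on the parent's shard.
    shards[shard_for(parent_id)]["itemtext"].append({"_parent": parent_id, **source})

def has_child(term):
    """Return parent IDs with at least one child containing term; no child data."""
    hits = set()
    for shard in shards:  # each shard answers from local data only
        for child in shard["itemtext"]:
            if term in child["text"].split() and child["_parent"] in shard["item"]:
                hits.add(child["_parent"])
    return sorted(hits)

index_parent("item1", {"title": "Pricing"})
index_parent("item2", {"title": "Demand"})
index_child("item1", {"text": "page about market elasticity"})
index_child("item2", {"text": "page about supply"})

print(has_child("elasticity"))  # ['item1']
```

Each shard can resolve the join from its own data because a child never lives on a different shard than its parent, and the result contains parent IDs only, matching the limitation listed above.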

Page 28: Nested and Parent/Child Docs in ElasticSearch

conclusions

Complex join-style queries can be done with ElasticSearch

Easily

Efficiently

Use “nested” types

If data can be duplicated

Very efficient

Use “parent/child” types

For truly independently updatable documents

SELECT * FROM ARTICLES
LEFT JOIN AUTHORS ON AUTHORS.ARTICLEID = ARTICLES.ID
WHERE ARTICLES.TITLE MATCHES "market elasticity"
  AND AUTHORS.LASTNAME MATCHES "Russell"
  AND AUTHORS.FIRSTNAME MATCHES "G"

Page 29: Nested and Parent/Child Docs in ElasticSearch

conclusions

ElasticSearch rocks

Hides complex JSON document to Lucene key/value model mapping

Allows you to easily use more of Lucene greatness

So you can focus on actual queries and use cases

NoSql does not mean NoJoins

Just forcing you to model in such a way that joins will be efficient

Page 30: Nested and Parent/Child Docs in ElasticSearch

@anneveling

ElasticSearch “nested” types: the best thing since sliced bread

thank you