Workshop ELASTICSEARCH Felipe Dornelas

Upload: felipe-dornelas

Post on 12-Feb-2017


Page 1: Elasticsearch Workshop

Workshop

ELASTICSEARCH

Felipe Dornelas

AGENDA

▫︎Part 1

▫︎ Introduction

▫︎Document Store

▫︎ Search Examples

▫︎Data Resiliency

▫︎ Comparison with Solr

▫︎Part 2

▫︎ Search

▫︎ Analytics

AGENDA

▫︎Part 3

▫︎ Inverted Index

▫︎ Analyzers

▫︎Mapping

▫︎ Proximity Matching

▫︎ Fuzzy Matching

▫︎Part 4

▫︎ Inside a Cluster

▫︎Data Modeling

→ github.com/felipead/elasticsearch-workshop


PRE-REQUISITES

▫︎Vagrant

▫︎VirtualBox

▫︎Git


ENVIRONMENT SETUP

▫︎git clone https://github.com/felipead/elasticsearch-workshop.git

▫︎vagrant up

▫︎vagrant ssh

▫︎cd /vagrant


VERIFY EVERYTHING IS WORKING

▫︎curl http://localhost:9200


PART 1

Core concepts

1-1 INTRODUCTION

You know, for search

WHAT IS ELASTICSEARCH?

A real-time distributed search and analytics engine


IT CAN BE USED FOR

▫︎Full-text search

▫︎Structured search

▫︎Real-time analytics

▫︎…or any combination of the above


FEATURES

▫︎Distributed document store:

▫︎RESTful API

▫︎Automatic scale

▫︎Plug & Play ™


FEATURES

▫︎Handles the human language:

▫︎Score results by relevance

▫︎Synonyms

▫︎Typos and misspellings

▫︎Internationalization


FEATURES

▫︎Powerful analytics:

▫︎Comprehensive aggregations

▫︎Geolocations

▫︎Can be combined with search

▫︎Real-time (no batch-processing)


FEATURES

▫︎Free and open source

▫︎Community support

▫︎Backed by Elastic


MOTIVATION

Most databases are inept at extracting knowledge from your data


SQL DATABASES

SQL = Structured Query Language


SQL DATABASES

▫︎Built mainly for filtering by exact values

▫︎Limited support for full-text search

▫︎Queries can be complex and inefficient

▫︎Often requires big-batch processing


APACHE LUCENE

▫︎Arguably, the best search engine

▫︎High performance

▫︎Near real-time indexing

▫︎Open source


APACHE LUCENE

▫︎But…

▫︎It’s just a Java Library

▫︎Hard to use


ELASTICSEARCH

▫︎Document Store

▫︎Distributed

▫︎Scalable

▫︎Real Time

▫︎Analytics

▫︎RESTful API

▫︎Easy to Use


DOCUMENT ORIENTED

▫︎Documents instead of rows / columns

▫︎Every field is indexed and searchable

▫︎Serialized to JSON

▫︎Schemaless


WHO USES

▫︎GitHub

▫︎Wikipedia

▫︎Stack Overflow

▫︎The Guardian


TALKING TO ELASTICSEARCH

▫︎Java API

▫︎Port 9300

▫︎Native transport protocol

▫︎Node client (joins the cluster)

▫︎Transport client (doesn't join the cluster)


TALKING TO ELASTICSEARCH

▫︎RESTful API

▫︎Port 9200

▫︎JSON over HTTP


TALKING TO ELASTICSEARCH

We will only cover the RESTful API


USING CURL

curl -X <VERB> <URL> -d <BODY>

or

curl -X <VERB> <URL> -d @<FILE>


THE EMPTY QUERY

curl -X GET -d @part-1/empty-query.json localhost:9200/_count?pretty

REQUEST

{ "query": { "match_all": {} } }


RESPONSE

{ "count": 0, "_shards": { "total": 0, "successful": 0, "failed": 0 } }


1-2 DOCUMENT STORE


THE PROBLEM WITH RELATIONAL DATABASES

▫︎Stores data in columns and rows

▫︎Equivalent of using a spreadsheet

▫︎Inflexible storage medium

▫︎Not suitable for rich objects


DOCUMENTS

{ "name": "John Smith", "age": 42, "confirmed": true, "join_date": "2015-06-01", "home": {"lat": 51.5, "lon": 0.1}, "accounts": [ {"type": "facebook", "id": "johnsmith"}, {"type": "twitter", "id": "johnsmith"} ] }


DOCUMENT METADATA

▫︎Index - Where the document lives

▫︎Type - Class of object that the document represents

▫︎Id - Unique identifier for the document


DOCUMENT METADATA

Relational DB   Databases   Tables   Rows        Columns
Elasticsearch   Indices     Types    Documents   Fields

RESTFUL API

[VERB] /{index}/{type}/{id}?pretty

GET | POST | PUT | DELETE | HEAD
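The URL pattern above can also be assembled programmatically. The following is a minimal Python sketch, not part of the workshop material: the helper name `build_request` and the `Content-Type` header are assumptions made for illustration, and the request is only built, never sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: the address used throughout the workshop


def build_request(verb, index, doc_type=None, doc_id=None, body=None):
    """Build (but do not send) a request for [VERB] /{index}/{type}/{id}?pretty."""
    # Skip the path segments that were not supplied.
    path = "/".join(part for part in (index, doc_type, doc_id) if part)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        f"{BASE_URL}/{path}?pretty",
        data=data,
        method=verb,
        headers={"Content-Type": "application/json"},  # assumption: harmless for JSON bodies
    )


request = build_request("PUT", "blog", "post", "123", {"title": "My first blog post"})
print(request.get_method(), request.full_url)
# Sending it against a running node would be: urllib.request.urlopen(request).read()
```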


RESTFUL API

▫︎JSON-only

▫︎Adding pretty to the query-string parameters pretty-prints the response


INDEXING A DOCUMENT WITH YOUR OWN ID

PUT /{index}/{type}/{id}


INDEXING A DOCUMENT WITH YOUR OWN ID

curl -X PUT -d @part-1/first-blog-post.json localhost:9200/blog/post/123?pretty

REQUEST

{ "title": "My first blog post", "text": "Just trying this out...", "date": "2014-01-01" }


RESPONSE

{ "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true }


INDEXING A DOCUMENT WITH AUTOGENERATED ID

POST /{index}/{type}

* Autogenerated IDs are Base64-encoded UUIDs


INDEXING A DOCUMENT WITH AUTOGENERATED ID

curl -X POST -d @part-1/second-blog-post.json localhost:9200/blog/post?pretty

REQUEST

{ "title": "Second blog post", "text": "Still trying this out...", "date": "2014-01-01" }


RESPONSE

{ "_index" : "blog", "_type" : "post", "_id" : "AVFWIbMf7YZ6Se7RwMws", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true }


RETRIEVING A DOCUMENT WITH METADATA

GET /{index}/{type}/{id}


RETRIEVING A DOCUMENT WITH METADATA

curl -X GET localhost:9200/blog/post/123?pretty

RESPONSE

{ "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 1, "found" : true, "_source": { "title": "My first blog post", "text": "Just trying this out...", "date": "2014-01-01" } }

RETRIEVING A DOCUMENT WITHOUT METADATA

GET /{index}/{type}/{id}/_source


RETRIEVING A DOCUMENT WITHOUT METADATA

curl -X GET localhost:9200/blog/post/123/_source?pretty

RESPONSE

{ "title": "My first blog post", "text": "Just trying this out...", "date": "2014-01-01" }

RETRIEVING PART OF A DOCUMENT

GET /{index}/{type}/{id}?_source={fields}


RETRIEVING PART OF A DOCUMENT

curl -X GET 'localhost:9200/blog/post/123?_source=title,date&pretty'

RESPONSE

{ "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 1, "found" : true, "_source": { "title": "My first blog post", "date": "2014-01-01" } }

CHECKING WHETHER A DOCUMENT EXISTS

HEAD /{index}/{type}/{id}


CHECKING WHETHER A DOCUMENT EXISTS

curl -i -X HEAD localhost:9200/blog/post/123

RESPONSE

HTTP/1.1 200 OK
Content-Length: 0

CHECKING WHETHER A DOCUMENT EXISTS

curl -i -X HEAD localhost:9200/blog/post/666

RESPONSE

HTTP/1.1 404 Not Found
Content-Length: 0

UPDATING A WHOLE DOCUMENT

PUT /{index}/{type}/{id}


UPDATING A WHOLE DOCUMENT

curl -X PUT -d @part-1/updated-blog-post.json localhost:9200/blog/post/123?pretty

REQUEST

{ "title": "My first blog post", "text": "I am starting to get the hang of this...", "date": "2014-01-02" }


RESPONSE

{ "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 2, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : false }


DELETING A DOCUMENT

DELETE /{index}/{type}/{id}


DELETING A DOCUMENT

curl -X DELETE localhost:9200/blog/post/123?pretty

RESPONSE

{ "found" : true, "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 3, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 } }


DEALING WITH CONFLICTS


PESSIMISTIC CONCURRENCY CONTROL

▫︎Used by relational databases

▫︎Assumes conflicts are likely to happen (pessimist)

▫︎Blocks access to resources


OPTIMISTIC CONCURRENCY CONTROL

▫︎Assumes conflicts are unlikely to happen (optimist)

▫︎Does not block operations

▫︎If conflict happens, update fails


HOW ELASTICSEARCH DEALS WITH CONFLICTS

▫︎Locking distributed resources would be very inefficient

▫︎Uses Optimistic Concurrency Control

▫︎Auto-increments _version number


HOW ELASTICSEARCH DEALS WITH CONFLICTS

▫︎PUT /blog/post/123?version=1

▫︎If version is outdated returns 409 Conflict
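The version check can be illustrated with a small in-memory model. This is only a sketch of optimistic concurrency control in general, not Elasticsearch's implementation; `VersionedStore` and `VersionConflict` are hypothetical names, with the exception standing in for the 409 Conflict response.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class VersionedStore:
    """Toy document store that auto-increments a version on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # When the caller names a version, it must match the stored one.
        if version is not None and version != current_version:
            raise VersionConflict(f"expected {version}, current is {current_version}")
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


store = VersionedStore()
store.put("123", {"title": "My first blog post"})      # version becomes 1
store.put("123", {"title": "Edited once"}, version=1)  # succeeds, version becomes 2
try:
    store.put("123", {"title": "Stale edit"}, version=1)  # outdated version
except VersionConflict as error:
    print("409 Conflict:", error)
```

Nothing is locked: the stale writer simply fails and has to re-read the document before retrying.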


1-3 SEARCH EXAMPLES


EMPLOYEE DIRECTORY EXAMPLE

▫︎Index: megacorp

▫︎Type: employee

▫︎Ex: John Smith, Jane Smith, Douglas Fir


EMPLOYEE DIRECTORY EXAMPLE

curl -X PUT -d @part-1/john-smith.json localhost:9200/megacorp/employee/1


REQUEST

{ "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": ["sports", "music"] }


EMPLOYEE DIRECTORY EXAMPLE

curl -X PUT -d @part-1/jane-smith.json localhost:9200/megacorp/employee/2

REQUEST

{ "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": ["music"] }


EMPLOYEE DIRECTORY EXAMPLE

curl -X PUT -d @part-1/douglas-fir.json localhost:9200/megacorp/employee/3

REQUEST

{ "first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I like to build cabinets", "interests": ["forestry"] }


SEARCHES ALL EMPLOYEES

GET /megacorp/employee/_search


SEARCHES ALL EMPLOYEES

curl -X GET localhost:9200/megacorp/employee/_search?pretty

SEARCH WITH QUERY-STRING

GET /megacorp/employee/_search?q=last_name:Smith


SEARCH WITH QUERY-STRING

curl -X GET 'localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'

RESPONSE

"hits" : { "total" : 2, "max_score" : 0.30685282, "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", … } }, { … "_score" : 0.30685282, "_source": { "first_name": "John", "last_name": "Smith", … } } ] }


SEARCH WITH QUERY DSL

curl -X GET -d @part-1/last-name-query.json localhost:9200/megacorp/employee/_search?pretty

REQUEST

{ "query": { "match": { "last_name": "Smith" } } }


RESPONSE

"hits" : { "total" : 2, "max_score" : 0.30685282, "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", … } }, { … "_score" : 0.30685282, "_source": { "first_name": "John", "last_name": "Smith", … } } ] }


SEARCH WITH QUERY DSL AND FILTER

curl -X GET -d @part-1/last-name-age-query.json localhost:9200/megacorp/employee/_search?pretty

REQUEST

{ "query": { "filtered": { "filter": { "range": { "age": { "gt": 30 } } }, "query": { "match": { "last_name": "Smith" } } } } }

RESPONSE

"hits" : { "total" : 1, "max_score" : 0.30685282, "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, … } } ] }

FULL-TEXT SEARCH

curl -X GET -d @part-1/full-text-search.json localhost:9200/megacorp/employee/_search?pretty

REQUEST

{ "query": { "match": { "about": "rock climbing" } } }


RESPONSE

"hits" : [
  { … "_score" : 0.16273327, "_source": { "first_name": "John", "last_name": "Smith", "about": "I love to go rock climbing", … } },
  { … "_score" : 0.016878016, "_source": { "first_name": "Jane", "last_name": "Smith", "about": "I like to collect rock albums", … } }
]

RELEVANCE SCORES

▫︎The _score field ranks search results

▫︎The higher the score, the better


PHRASE SEARCH

curl -X GET -d @part-1/phrase-search.json localhost:9200/megacorp/employee/_search?pretty

REQUEST

{ "query": { "match_phrase": { "about": "rock climbing" } } }


RESPONSE

"hits" : { "total" : 1, "max_score" : 0.23013961, "hits" : [ { … "_score" : 0.23013961, "_source": { "first_name": "John", "last_name": "Smith", "about": "I love to go rock climbing" … } } ] }
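The difference between the two responses (both Smiths for match, only John for match_phrase) can be reproduced with a toy matcher. This is a rough Python sketch and not how Elasticsearch evaluates these queries internally; both function names are hypothetical.

```python
def tokenize(text):
    return text.lower().split()


def matches_any_term(text, query):
    """Loosely like `match`: at least one query term occurs in the text."""
    terms = set(tokenize(text))
    return any(term in terms for term in tokenize(query))


def matches_phrase(text, query):
    """Loosely like `match_phrase`: all query terms occur consecutively, in order."""
    tokens, phrase = tokenize(text), tokenize(query)
    return any(
        tokens[start:start + len(phrase)] == phrase
        for start in range(len(tokens) - len(phrase) + 1)
    )


abouts = {
    "John": "I love to go rock climbing",
    "Jane": "I like to collect rock albums",
}
for name, about in abouts.items():
    print(name, matches_any_term(about, "rock climbing"), matches_phrase(about, "rock climbing"))
```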


1-4 DATA RESILIENCY


CALL ME MAYBE

▫︎Jepsen Tests

▫︎Simulates network partition scenarios

▫︎Run several operations against a distributed system

▫︎Verify that the history of those operations makes sense


NETWORK PARTITION


ELASTICSEARCH STATUS

▫︎Risk of data loss on network partition and split-brain scenarios


IT IS NOT SO BAD…

▫︎Still much more resilient than MongoDB

▫︎Elastic is working hard to improve it

▫︎Two-phase commits are planned


IF YOU REALLY CARE ABOUT YOUR DATA

▫︎Use a more reliable primary data store:

▫︎Cassandra

▫︎Postgres

▫︎Synchronize it to Elasticsearch

▫︎…or set-up comprehensive back-up


There’s no such thing as a 100% reliable distributed system

1-5 SOLR COMPARISON


SOLR

▫︎SolrCloud

▫︎Both:

▫︎Are open-source and mature

▫︎Are based on Apache Lucene

▫︎Have more or less similar features


SOLR API

▫︎HTTP GET

▫︎Query parameters passed in as URL parameters

▫︎Is not RESTful

▫︎Multiple formats (JSON, XML…)

SOLR API

▫︎Version 4.4 added Schemaless API

▫︎Older versions require up-front Schema


ELASTICSEARCH API

▫︎RESTful

▫︎Schemaless

▫︎CRUD document operations

▫︎Manage indices, read metrics, etc…


ELASTICSEARCH API

▫︎Query DSL

▫︎Better readability

▫︎JSON-only


SEARCH

▫︎Both are very good with text search

▫︎Both based on Apache Lucene


EASE OF USE

▫︎Elasticsearch is simpler:

▫︎Just a single process

▫︎Easier API

▫︎SolrCloud requires Apache ZooKeeper


SOLRCLOUD DATA RESILIENCY

▫︎SolrCloud uses Apache ZooKeeper to discover nodes

▫︎Better at preventing split-brain conditions

▫︎Jepsen Tests pass


ANALYTICS

▫︎Elasticsearch is the choice for analytics:

▫︎Comprehensive aggregations

▫︎Thousands of metrics

▫︎SolrCloud is not even close


PART 2

Search and Analytics

2-1 SEARCH

Finding the needle in the haystack

TWEETS EXAMPLE

▫︎/<country_code>/user

▫︎/<country_code>/tweet


TWEETS EXAMPLE

/us/user/1

{ "email": "[email protected]", "name": "John Smith", "username": "@john" }


TWEETS EXAMPLE

/gb/user/2

{ "email": "[email protected]", "name": "Mary Jones", "username": "@mary" }


TWEET EXAMPLE

/gb/tweet/3

{ "date": "2014-09-13", "name": "Mary Jones", "tweet": "Elasticsearch means full text search has never been so easy", "user_id": 2 }


TWEETS EXAMPLE

./part-2/load-tweet-data.sh


THE EMPTY SEARCH

GET /_search

▫︎Returns all documents on all indices

THE EMPTY SEARCH

curl -X GET localhost:9200/_search?pretty


THE EMPTY SEARCH

"hits" : { "total" : 14, "hits" : [ { "_index": "us", "_type": "tweet", "_id": "7", "_score": 1, "_source": { "date": "2014-09-17", "name": "John Smith", "tweet": "The Query DSL is really powerful and flexible", "user_id": 2 } }, … 9 RESULTS REMOVED … ] }


MULTI-INDEX, MULTITYPE SEARCH

▫︎/_search

▫︎/gb/_search

▫︎/gb,us/_search

▫︎/gb/user/_search

▫︎/_all/user,tweet/_search


PAGINATION

▫︎Returns 10 results per request (default)

▫︎Control parameters:

▫︎size: number of results to return

▫︎from: number of results to skip


PAGINATION

▫︎GET /_search?size=5

▫︎GET /_search?size=5&from=5

▫︎GET /_search?size=5&from=10
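The three requests above are pages 1, 2 and 3 with five results each. As a small Python sketch (the helper names are hypothetical), the size/from pair for any page can be derived like this:

```python
def page_params(page, size=10):
    """Return the size/from pair for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"size": size, "from": (page - 1) * size}


def paginate(hits, page, size=10):
    """Apply the same window to an in-memory list of hits."""
    params = page_params(page, size)
    return hits[params["from"]:params["from"] + params["size"]]


hits = list(range(1, 15))  # 14 placeholder hits
print(page_params(3, size=5))     # {'size': 5, 'from': 10}
print(paginate(hits, 3, size=5))  # [11, 12, 13, 14]
```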


TYPES OF SEARCH

▫︎Structured query on concrete fields (similar to SQL)

▫︎Full-text query (sorts results by relevance)

▫︎Combination of the two


SEARCH BY EXACT VALUES

▫︎Examples:

▫︎date

▫︎user ID

▫︎username

▫︎“Does this document match the query?”


SEARCH BY EXACT VALUES

▫︎SQL queries:

SELECT * FROM user WHERE name = 'John Smith' AND user_id = 2 AND date > '2014-09-15'

FULL-TEXT SEARCH

▫︎Examples:

▫︎the text of a tweet

▫︎body of an email

▫︎“How well does this document match the query?”


FULL-TEXT SEARCH

▫︎UK should also match United Kingdom

▫︎jump should also match jumped, jumps, jumping and leap


FULL-TEXT SEARCH

▫︎fox news hunting should return stories about hunting on Fox News

▫︎fox hunting news should return news stories about fox hunting


HOW ELASTICSEARCH PERFORMS TEXT SEARCH

▫︎Analyzes the text

▫︎Tokenizes into terms

▫︎Normalizes the terms

▫︎Builds an inverted index


LIST OF INDEXED DOCUMENTS

ID   Text
1    Baseball is played during summer months.
2    Summer is the time for picnics here.
3    Months later we found out why.
4    Why is summer so hot here.

INVERTED INDEX

Term       Frequency   Document IDs
baseball   1           1
during     1           1
found      1           3
here       2           2, 4
hot        1           4
is         3           1, 2, 4
months     2           1, 3
summer     3           1, 2, 4
the        1           2
why        2           3, 4

QUERY DSL

GET /_search

{ "query": YOUR_QUERY_HERE }

QUERY BY FIELD

{ "match": { "tweet": "elasticsearch" } }

QUERY BY FIELD

curl -X GET -d @part-2/elasticsearch-tweets-query.json localhost:9200/_all/tweet/_search

QUERY WITH MULTIPLE CLAUSES

{ "bool": { "must": { "match": { "tweet": "elasticsearch" } }, "must_not": { "match": { "name": "mary" } }, "should": { "match": { "tweet": "full text" } } } }

QUERY WITH MULTIPLE CLAUSES

curl -X GET -d @part-2/combining-tweet-queries.json localhost:9200/_all/tweet/_search

QUERY WITH MULTIPLE CLAUSES

"_score": 0.07082729, "_source": { … "name": "John Smith", "tweet": "The Elasticsearch API is really easy to use" }, …

"_score": 0.049890988, "_source": { … "name": "John Smith", "tweet": "Elasticsearch surely is one of the hottest new NoSQL products" }, …

"_score": 0.03991279, "_source": { … "name": "John Smith", "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her." }

MOST IMPORTANT QUERIES

▫︎match

▫︎match_all

▫︎multi_match

▫︎bool


QUERIES VS. FILTERS

▫︎Queries:

▫︎full-text

▫︎“how well does the document match?”

▫︎Filters:

▫︎exact values

▫︎yes-no questions


QUERIES VS. FILTERS

▫︎The goal of filters is to reduce the number of documents that have to be examined by a query


PERFORMANCE COMPARISON

▫︎Filters are easy to cache and can be reused efficiently

▫︎Queries are heavier and non-cacheable


WHEN TO USE WHICH

▫︎Use queries only for full-text search

▫︎Use filters for anything else
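The division of labour between filters and queries can be sketched in a few lines of Python. This is an illustration of the idea only: the tweets are made up, the scorer is a bare term count rather than Elasticsearch's relevance formula, and every name here is hypothetical.

```python
import re

tweets = [  # made-up sample data
    {"user_id": 1, "tweet": "Elasticsearch is easy, Elasticsearch is fast"},
    {"user_id": 1, "tweet": "However did I manage before Elasticsearch?"},
    {"user_id": 2, "tweet": "Elasticsearch means full text search has never been so easy"},
    {"user_id": 1, "tweet": "Just had lunch"},
]


def term_filter(field, value):
    """Yes/no question on an exact value; no scoring involved."""
    return lambda doc: doc[field] == value


def match_scorer(field, term):
    """How well does the document match? Here: how often the term occurs."""
    return lambda doc: re.findall(r"\w+", doc[field].lower()).count(term)


def filtered_search(docs, doc_filter, scorer):
    # The cheap filter runs first, so the scorer sees fewer documents.
    candidates = [doc for doc in docs if doc_filter(doc)]
    scored = [(scorer(doc), doc) for doc in candidates]
    matching = [pair for pair in scored if pair[0] > 0]
    return sorted(matching, key=lambda pair: pair[0], reverse=True)


results = filtered_search(tweets, term_filter("user_id", 1), match_scorer("tweet", "elasticsearch"))
for score, doc in results:
    print(score, doc["tweet"])
```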


FILTER BY EXACT FIELD VALUES

"filtered": { "filter": { "term": { "user_id": 1 } } }

FILTER BY EXACT FIELD VALUES

curl -X GET -d @part-2/user-id-filter.json localhost:9200/_search

FILTER BY EXACT FIELD VALUES

"filtered": { "filter": { "range": { "date": { "gte": "2014-09-20" } } } }

FILTER BY EXACT FIELD VALUES

curl -X GET -d @part-2/date-filter.json localhost:9200/_search

MOST IMPORTANT FILTERS

▫︎term

▫︎terms

▫︎range

▫︎exists and missing

▫︎bool


COMBINING QUERIES WITH FILTERS

"filtered": { "query": { "match": { "tweet": "elasticsearch" } }, "filter": { "term": { "user_id": 1 } } }

COMBINING QUERIES WITH FILTERS

curl -X GET -d @part-2/filtered-tweet-query.json localhost:9200/_search

SORTING

▫︎Relevance score

▫︎The higher the score, the better

▫︎By default, results are returned in descending order of relevance

▫︎You can sort by any field


RELEVANCE SCORE

▫︎Similarity algorithm

▫︎Term Frequency / Inverse Document Frequency (TF/IDF)


RELEVANCE SCORE

▫︎Term frequency

▫︎How often does the term appear in the field?

▫︎The more often, the more relevant


RELEVANCE SCORE

▫︎Inverse document frequency

▫︎How often does each term appear in the index?

▫︎The more often, the less relevant


RELEVANCE SCORE

▫︎Field-length norm

▫︎How long is the field?

▫︎The longer it is, the less likely it is that words in the field will be relevant
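The three factors can be combined into a toy scorer. The Python sketch below uses the commonly cited classic Lucene definitions (square root of the term count, 1 + ln(docs / (docFreq + 1)), and 1 / sqrt(field length)) and multiplies them per term. That combination is a simplification assumed for illustration, not the full scoring formula Elasticsearch applies, and the three documents are made up.

```python
import math

docs = {  # made-up documents
    1: "the quick brown fox",
    2: "the quick brown fox jumped over the lazy dog",
    3: "summer is hot",
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}


def term_frequency(term, tokens):
    """The more often the term appears in the field, the higher."""
    return math.sqrt(tokens.count(term))


def inverse_document_frequency(term):
    """The more documents contain the term, the lower."""
    doc_freq = sum(1 for tokens in tokenized.values() if term in tokens)
    return 1 + math.log(len(tokenized) / (doc_freq + 1))


def field_length_norm(tokens):
    """The longer the field, the lower."""
    return 1 / math.sqrt(len(tokens))


def score(query, doc_id):
    tokens = tokenized[doc_id]
    return sum(
        term_frequency(term, tokens) * inverse_document_frequency(term) * field_length_norm(tokens)
        for term in query.split()
    )


ranking = sorted(tokenized, key=lambda doc_id: score("quick fox", doc_id), reverse=True)
print(ranking)
```

Documents 1 and 2 contain both terms equally often, so the shorter document 1 ranks first purely because of the field-length norm.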


2-2 ANALYTICS

How many needles are in the haystack?

SEARCH

▫︎Just looks for the needle in the haystack


BUSINESS QUESTIONS

▫︎How many needles are in the haystack?

▫︎What is the average length of the needles?

▫︎What is the median length of the needles, by manufacturer?

▫︎How many needles were added to the haystack each month?


BUSINESS QUESTIONS

▫︎What are your most popular needle manufacturers?

▫︎Are there any anomalous clumps of needles?


AGGREGATIONS

▫︎Answer Analytics questions

▫︎Can be combined with Search

▫︎Near real-time in Elasticsearch

▫︎SQL queries can take days


AGGREGATIONS

Buckets + Metrics


BUCKETS

▫︎Collection of documents that meet a given criterion

▫︎Can be nested inside other buckets


BUCKETS

▫︎Employee ⇒ male or female bucket

▫︎San Francisco ⇒ California bucket

▫︎2014-10-28 ⇒ October bucket


METRICS

▫︎Calculations on top of buckets

▫︎Answer the questions

▫︎Ex: min, max, mean, sum…


EXAMPLE

▫︎Partition by country (bucket)

▫︎…then partition by gender (bucket)

▫︎…then partition by age ranges (bucket)

▫︎…calculate the average salary for each age range (metric)
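Nested buckets with a metric on the innermost level can be mimicked with plain Python grouping. This sketch uses made-up employee records and only two bucket levels (country, then gender) to stay short; `bucket_by` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import defaultdict
from statistics import mean

employees = [  # made-up records for illustration
    {"country": "BR", "gender": "female", "salary": 5000},
    {"country": "BR", "gender": "female", "salary": 7000},
    {"country": "BR", "gender": "male", "salary": 6000},
    {"country": "US", "gender": "male", "salary": 9000},
]


def bucket_by(docs, field):
    """Bucket: group documents that share the same value for a field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


# Buckets nested inside buckets, with a metric (average) computed on the innermost ones.
report = {
    country: {
        gender: mean(doc["salary"] for doc in in_gender)
        for gender, in_gender in bucket_by(in_country, "gender").items()
    }
    for country, in_country in bucket_by(employees, "country").items()
}
print(report)
```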


CAR TRANSACTIONS EXAMPLE

▫︎/cars/transactions


CAR TRANSACTIONS EXAMPLE

/cars/transactions/AVFr1xbVmdUYWpF46Ps4

{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }


CAR TRANSACTIONS EXAMPLE

./part-2/load-car-data.sh


BEST SELLING CAR COLOR

{ "aggs": { "colors": { "terms": { "field": "color" } } } }

BEST SELLING CAR COLOR

curl -X GET -d @part-2/best-selling-car-color.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

BEST SELLING CAR COLOR

"colors" : { "buckets" : [{ "key" : "red", "doc_count" : 16 }, { "key" : "blue", "doc_count" : 8 }, { "key" : "green", "doc_count" : 8 }] }

AVERAGE CAR COLOR PRICE

{ "aggs": { "colors": { "terms": { "field": "color" }, "aggs": { "avg_price": { "avg": { "field": "price" } } } } } }

AVERAGE CAR COLOR PRICE

curl -X GET -d @part-2/average-car-color-price.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

AVERAGE CAR COLOR PRICE

"colors" : { "buckets": [{ "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } }, { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } }, { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }] }

BUILDING BAR CHARTS

▫︎Very easy to convert aggregations to charts and graphs

▫︎Ex: histograms and time-series


CAR SALES REVENUE HISTOGRAM

{ "aggs": { "price": { "histogram": { "field": "price", "interval": 20000 }, "aggs": { "revenue": {"sum": {"field" : "price"}} } } } }

CAR SALES REVENUE HISTOGRAM

curl -X GET -d @part-2/car-revenue-histogram.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

CAR SALES REVENUE HISTOGRAM

"price" : { "buckets": [ { "key": 0, "doc_count": 12, "revenue": {"value": 148000.0} }, { "key": 20000, "doc_count": 16, "revenue": {"value": 380000.0} }, { "key": 40000, "doc_count": 0, "revenue": {"value": 0.0} }, { "key": 60000, "doc_count": 0, "revenue": {"value": 0.0} }, { "key": 80000, "doc_count": 4, "revenue": {"value" : 320000.0} } ]}

CAR SALES REVENUE HISTOGRAM
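How a price ends up in a histogram bucket, and how the buckets turn into a bar chart, can be sketched as follows. This is an illustrative Python approximation with made-up prices, not the workshop's data set or Elasticsearch's implementation: each value is rounded down to a multiple of the interval, and empty buckets in between are filled in.

```python
from collections import Counter


def histogram(values, interval):
    """Return {bucket_key: (doc_count, sum_of_values)}, including empty buckets."""
    counts, sums = Counter(), Counter()
    for value in values:
        key = (value // interval) * interval  # round down to the bucket's lower bound
        counts[key] += 1
        sums[key] += value
    keys = range(min(counts), max(counts) + interval, interval)
    return {key: (counts[key], sums[key]) for key in keys}


prices = [10000, 12000, 20000, 25000, 30000, 80000]  # made-up prices
buckets = histogram(prices, 20000)
for key, (doc_count, revenue) in buckets.items():
    print(f"{key:>6} {'#' * doc_count:<4} revenue={revenue}")
```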


TIME-SERIES DATA

▫︎Data with a timestamp:

▫︎How many cars were sold each month this year?

▫︎What was the price of this stock for the last 12 hours?

▫︎What was the average latency of our website every hour in the last week?


HOW MANY CARS SOLD PER MONTH?

{ "aggs": { "sales": { "date_histogram": { "field": "sold", "interval": "month", "format": "yyyy-MM-dd" } } } }

HOW MANY CARS SOLD PER MONTH?

curl -X GET -d @part-2/car-sales-per-month.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

HOW MANY CARS SOLD PER MONTH?

"sales" : { "buckets" : [
  {"key_as_string": "2014-01-01", "doc_count": 4},
  {"key_as_string": "2014-02-01", "doc_count": 4},
  {"key_as_string": "2014-03-01", "doc_count": 0},
  {"key_as_string": "2014-04-01", "doc_count": 0},
  {"key_as_string": "2014-05-01", "doc_count": 4},
  {"key_as_string": "2014-06-01", "doc_count": 0},
  {"key_as_string": "2014-07-01", "doc_count": 4},
  {"key_as_string": "2014-08-01", "doc_count": 4},
  {"key_as_string": "2014-09-01", "doc_count": 0},
  {"key_as_string": "2014-10-01", "doc_count": 4},
  {"key_as_string": "2014-11-01", "doc_count": 8}
] }

HOW MANY CARS SOLD PER MONTH?


PART 3

Dealing with human language

3-1 INVERTED INDEX


INVERTED INDEX

▫︎Data structure

▫︎Efficient full-text search


EXAMPLE

Document 1: The quick brown fox jumped over the lazy dog

Document 2: Quick brown foxes leap over lazy dogs in summer

TOKENIZATION

Document 1: ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

Document 2: ["Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"]

Term     Document 1   Document 2
Quick                 𐄂
The      𐄂
brown    𐄂            𐄂
dog      𐄂
dogs                  𐄂
fox      𐄂
foxes                 𐄂
in                    𐄂
jumped   𐄂
lazy     𐄂            𐄂
leap                  𐄂
over     𐄂            𐄂
quick    𐄂
summer                𐄂
the      𐄂

EXAMPLE

▫︎Searching for “quick brown”

▫︎Naive similarity algorithm:

▫︎Document 1 is a better match

Term    Document 1   Document 2
brown   𐄂            𐄂
quick   𐄂
Total   2            1

A FEW PROBLEMS

▫︎Quick and quick are the same word

▫︎fox and foxes are pretty similar

▫︎jumped and leap are synonyms


NORMALIZATION

▫︎Quick lowercased to quick

▫︎foxes stemmed to fox

▫︎jumped and leap replaced by jump


BETTER INVERTED INDEX

Term     Document 1   Document 2
brown    𐄂            𐄂
dog      𐄂            𐄂
fox      𐄂            𐄂
in                    𐄂
jump     𐄂            𐄂
lazy     𐄂            𐄂
over     𐄂            𐄂
quick    𐄂            𐄂
summer                𐄂
the      𐄂

SEARCH INPUT

▫︎You can only find terms that exist in the inverted index

▫︎The query string is also normalized
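Tokenization, normalization and index construction fit in a few lines of Python. The sketch below is a toy: the `NORMAL_FORMS` table is a hand-written stand-in for real stemming and synonym handling, and all names are hypothetical. It applies the same `analyze` step to documents and to the query string.

```python
from collections import defaultdict

# Toy normalization rules covering only the two example documents.
NORMAL_FORMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def analyze(text):
    """Tokenize on whitespace, lowercase, then map each token to its normal form."""
    return [NORMAL_FORMS.get(token, token) for token in text.lower().split()]


def build_index(docs):
    """Inverted index: term -> set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Naive similarity: count how many query terms each document contains."""
    totals = defaultdict(int)
    for term in analyze(query):  # the query goes through the same analysis
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_index(docs)
print(sorted(index["jump"]))         # both documents now share the term
print(search(index, "Quick FOXES"))  # both documents match both terms
```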


3-2 ANALYZERS


ANALYSIS

▫︎Tokenizes a block of text into terms

▫︎Normalizes terms to standard form

▫︎Improves searchability


ANALYZERS

▫︎Pipeline:

▫︎Character filters

▫︎Tokenizer

▫︎Token filters
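The three pipeline stages can be modelled as plain functions chained in order. This Python sketch only illustrates the shape of the pipeline; the individual filters, the stopword list and the function names are assumptions and do not correspond to Elasticsearch's built-in components.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list


def strip_html(text):
    """Character filter: works on the raw string before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text):
    """Tokenizer: splits the string into individual terms."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: changes tokens."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Token filter: removes tokens."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The QUICK brown fox</p>",
    char_filters=[strip_html],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, remove_stopwords],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

The order of the token filters matters: lowercasing has to run before stopword removal, otherwise "The" would survive.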


BUILT-IN ANALYZERS

▫︎Standard analyzer

▫︎Language-specific analyzers

▫︎30+ languages supported


TESTING THE STANDARD ANALYZER

GET /_analyze?analyzer=standard

The quick brown fox jumped over the lazy dog.

TESTING THE STANDARD ANALYZER

curl -X GET -d @part-3/quick-brown-fox.txt 'localhost:9200/_analyze?analyzer=standard&pretty'

TESTING THE STANDARD ANALYZER

"tokens" : [ {"token": "the", …}, {"token": "quick", …}, {"token": "brown", …}, {"token": "fox", …}, {"token": "jumped", …}, {"token": "over", …}, {"token": "the", …}, {"token": "lazy", …}, {"token": "dog", …} ]

TESTING THE ENGLISH ANALYZER

GET /_analyze?analyzer=english

The quick brown fox jumped over the lazy dog.

TESTING THE ENGLISH ANALYZER

curl -X GET -d @part-3/quick-brown-fox.txt 'localhost:9200/_analyze?analyzer=english&pretty'

208

Page 209: Elasticsearch Workshop

"tokens" : [ {"token": "quick", …}, {"token": "brown", …}, {"token": "fox", …}, {"token": "jump", …}, {"token": "over", …}, {"token": "lazi", …}, {"token": "dog", …} ]

TESTING THE ENGLISH ANALYZER

209

Page 210: Elasticsearch Workshop

GET /_analyze?analyzer=brazilian

A rápida raposa marrom pulou sobre o cachorro preguiçoso.

TESTING THE BRAZILIAN ANALYZER

210

Page 211: Elasticsearch Workshop

TESTING THE BRAZILIAN ANALYZER

curl -X GET -d @part-3/raposa-rapida.txt 'localhost:9200/_analyze?analyzer=brazilian&pretty'

211

Page 212: Elasticsearch Workshop

"tokens" : [ {"token": "rap", …}, {"token": "rapos", …}, {"token": "marrom", …}, {"token": "pul", …}, {"token": "cachorr", …}, {"token": "preguic", …} ]

TESTING THE BRAZILIAN ANALYZER

212

Page 213: Elasticsearch Workshop

STEMMERS

▫︎Algorithmic stemmers:

▫︎Faster

▫︎Less precise

▫︎Dictionary stemmers:

▫︎Slower

▫︎More precise

213

Page 214: Elasticsearch Workshop

3-3 MAPPING

214

Page 215: Elasticsearch Workshop

MAPPING

▫︎Every document has a type

▫︎Every type has its own mapping

▫︎A mapping defines:

▫︎The fields

▫︎The datatype for each field

215

Page 216: Elasticsearch Workshop

MAPPING

▫︎Elasticsearch guesses the mapping when a new field is added

▫︎Should customize the mapping for improved search and performance

▫︎Customize a field's mapping before the field is first indexed, ideally when the type is created

216

Page 217: Elasticsearch Workshop

MAPPING

▫︎An existing field's mapping cannot be changed

▫︎You can still add new fields

▫︎To change an existing field, the only option is to reindex all documents

▫︎Reindexing with zero-downtime:

▫︎index aliases

217
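The alias trick works because applications talk to an alias instead of a concrete index, so the alias can be repointed once the documents have been reindexed. As a hedged sketch, the snippet below builds the body for a `POST /_aliases` request that moves an alias in a single call; the names `my_index`, `my_index_v1` and `my_index_v2` are hypothetical, and actually sending the request is left out.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the request body that moves `alias` from old_index to new_index.

    Both actions travel in one request, so the alias never points nowhere.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application only ever queries "my_index".
body = alias_swap_body("my_index", "my_index_v1", "my_index_v2")
print(json.dumps(body, indent=2))
```

The assumed workflow is: create `my_index_v2` with the new mapping, copy the documents over, then send this body so that searches on `my_index` switch to the new index.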

Page 218: Elasticsearch Workshop

CORE FIELD TYPES

▫︎String

▫︎Integer

▫︎Floating-point

▫︎Boolean

▫︎Date

▫︎Inner Objects

218

Page 219: Elasticsearch Workshop

GET /{index}/_mapping/{type}

VIEWING THE MAPPING

219

Page 220: Elasticsearch Workshop

VIEWING THE MAPPING

curl -X GET 'localhost:9200/gb/_mapping/tweet?pretty'

220

Page 221: Elasticsearch Workshop

"date": { "type": "date", "format": "strict_date_optional_time…" }, "name": { "type": "string" }, "tweet": { "type": "string" }, "user_id": { "type": "long" }

VIEWING THE MAPPING

221

Page 222: Elasticsearch Workshop

CUSTOMIZING FIELD MAPPINGS

▫︎Distinguish between:

▫︎Full-text string fields

▫︎Exact value string fields

▫︎Use language-specific analyzers

222

Page 223: Elasticsearch Workshop

STRING MAPPING ATTRIBUTES

▫︎index:

▫︎analyzed (full-text search, default)

▫︎not_analyzed (exact value)

▫︎analyzer:

▫︎standard (default)

▫︎english

▫︎…

223

Page 224: Elasticsearch Workshop

PUT /gb,us/_mapping/tweet

{ "properties": { "description": { "type": "string", "index": "analyzed", "analyzer": "english" } } }

ADDING NEW SEARCHABLE FIELD

224

Page 225: Elasticsearch Workshop

ADDING NEW SEARCHABLE FIELD

curl -X PUT -d @part-3/add-new-mapping.json 'localhost:9200/gb,us/_mapping/tweet?pretty'

225

Page 226: Elasticsearch Workshop

ADDING NEW SEARCHABLE FIELD

curl -X GET 'localhost:9200/us,gb/_mapping/tweet?pretty'

226

Page 227: Elasticsearch Workshop

… "description": { "type": "string", "analyzer": "english" }…

ADDING NEW SEARCHABLE FIELD

227

Page 228: Elasticsearch Workshop

3-4 PROXIMITY MATCHING

228

Page 229: Elasticsearch Workshop

THE PROBLEM

▫︎Sue ate the alligator

▫︎The alligator ate Sue

▫︎Sue never goes anywhere without her alligator-skin purse

229

Page 230: Elasticsearch Workshop

THE PROBLEM

▫︎Search for “sue alligator” would match all three

▫︎Sue and alligator may be separated by paragraphs of other text

230

Page 231: Elasticsearch Workshop

HEURISTIC

▫︎Words that appear near each other are probably related

▫︎Give documents in which the words are close together a higher relevance score

231

Page 232: Elasticsearch Workshop

GET /_analyze?analyzer=standard

Quick brown fox.

TERM POSITIONS

232

Page 233: Elasticsearch Workshop

"tokens": [ { "token": "quick", … "position": 1 }, { "token": "brown", … "position": 2 }, { "token": "fox", … "position": 3 } ]

TERM POSITIONS

233

Page 234: Elasticsearch Workshop

GET /{index}/{type}/_search

{ "query": { "match_phrase": { "title": "quick brown fox" } } }

EXACT PHRASE MATCHING

234

Page 235: Elasticsearch Workshop

EXACT PHRASE MATCHING

▫︎quick, brown and fox must all appear

▫︎The position of brown must be 1 greater than the position of quick

▫︎The position of fox must be 2 greater than the position of quick

235

quick brown fox

Page 236: Elasticsearch Workshop

FLEXIBLE PHRASE MATCHING

▫︎Exact phrase matching is too strict

▫︎“quick fox” should also match

▫︎Slop matching

236

quick brown fox

Page 237: Elasticsearch Workshop

"query": { "match_phrase": { "title": { "query": "quick fox", "slop": 1 } } }

FLEXIBLE PHRASE MATCHING

237

Page 238: Elasticsearch Workshop

SLOP MATCHING

▫︎How many times are you allowed to move a term in order to make the query and the document match?

▫︎Slop(n)

238

Page 239: Elasticsearch Workshop

SLOP MATCHING

239

Document: quick brown fox

Query: quick fox

Slop(1): quick ↳ fox

Page 240: Elasticsearch Workshop

SLOP MATCHING

240

Document: quick brown fox

Query: fox quick

Slop(1): fox|quick ↵ (both terms share position 1)

Slop(2): quick ↳ fox (fox moves to position 2)

Slop(3): quick ↳ fox (fox moves to position 3)
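Exact and sloppy phrase matching can both be expressed with term positions. The Python sketch below is a simplified model that reproduces the slide examples, not Lucene's actual phrase scorer: for each query term it compares the position in the document with the position in the query, and the slop required is the spread between the largest and smallest of those offsets. Function names are made up for the illustration, and positions are 0-based here.

```python
from itertools import product


def term_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = {}
    for position, token in enumerate(tokens):
        positions.setdefault(token, []).append(position)
    return positions


def required_slop(doc_tokens, query_tokens):
    """Smallest slop that lets the query match, or None if a term is missing."""
    positions = term_positions(doc_tokens)
    if any(term not in positions for term in query_tokens):
        return None
    best = None
    # Try every combination of occurrences when a term appears more than once.
    for combo in product(*(positions[term] for term in query_tokens)):
        offsets = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(offsets) - min(offsets)
        if best is None or spread < best:
            best = spread
    return best


def match_phrase(doc_tokens, query_tokens, slop=0):
    """slop=0 is exact phrase matching; larger values allow terms to move."""
    needed = required_slop(doc_tokens, query_tokens)
    return needed is not None and needed <= slop


doc = ["quick", "brown", "fox"]
print(required_slop(doc, ["quick", "brown", "fox"]))
print(required_slop(doc, ["quick", "fox"]))
print(required_slop(doc, ["fox", "quick"]))
```

With a slop of 0 the check reduces to the exact-phrase rules: every term must appear, and each one must sit exactly as far from the first term as it does in the query.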

Page 241: Elasticsearch Workshop

3-5 FUZZY MATCHING

241

Page 242: Elasticsearch Workshop

FUZZY MATCHING

▫︎quick brown fox → fast brown foxes

▫︎Johnny Walker → Johnnie Walker

▫︎Shcwarzenneger → Schwarzenegger

242

Page 243: Elasticsearch Workshop

DAMERAU-LEVENSHTEIN EDIT DISTANCE

▫︎One-character edits:

▫︎Substitution

▫︎Insertion

▫︎Deletion

▫︎Transposition of two adjacent characters

243

Page 244: Elasticsearch Workshop

DAMERAU-LEVENSHTEIN EDIT DISTANCE

▫︎One-character substitution:

▫︎ fox → box

244

Page 245: Elasticsearch Workshop

DAMERAU-LEVENSHTEIN EDIT DISTANCE

▫︎Insertion of a new character:

▫︎sic → sick

245

Page 246: Elasticsearch Workshop

DAMERAU-LEVENSHTEIN EDIT DISTANCE

▫︎Deletion of a character:

▫︎black → back

246

Page 247: Elasticsearch Workshop

DAMERAU-LEVENSHTEIN EDIT DISTANCE

▫︎Transposition of two adjacent characters:

▫︎star → tsar

247

Page 248: Elasticsearch Workshop

DAMERAU-LEVENSHTEIN EDIT DISTANCE

▫︎Converting bieber into beaver:

1. Substitute: bieber → biever

2. Substitute: biever → baever

3. Transpose: baever → beaver

▫︎Edit distance of 3

248
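The four one-character edits translate directly into a dynamic-programming table. The following Python sketch implements the "optimal string alignment" variant of the Damerau-Levenshtein distance, which is a common textbook formulation; whether it matches Elasticsearch's internal implementation in every corner case is an assumption not covered by these slides.

```python
def edit_distance(a, b):
    """Count substitutions, insertions, deletions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for source, target in [("fox", "box"), ("sic", "sick"), ("black", "back"),
                       ("star", "tsar"), ("bieber", "beaver")]:
    print(source, "->", target, edit_distance(source, target))
```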

Page 249: Elasticsearch Workshop

FUZZINESS

▫︎80% of human misspellings have an Edit Distance of 1

▫︎Elasticsearch supports a maximum Edit Distance of 2

▫︎fuzziness parameter

249

Page 250: Elasticsearch Workshop

FUZZINESS EXAMPLE

./part-3/load-surprise-data.sh

250

Page 251: Elasticsearch Workshop

GET /example/surprise/_search

{ "query": { "match": { "text": { "query": "surprize" } } } }

QUERY WITHOUT FUZZINESS

251

Page 252: Elasticsearch Workshop

QUERY WITHOUT FUZZINESS

curl -X GET -d @part-3/surprize-query.json 'localhost:9200/example/surprise/_search?pretty'

252

Page 253: Elasticsearch Workshop

"hits": { "total": 0, "max_score": null, "hits": [ ] }

QUERY WITHOUT FUZZINESS

253

Page 254: Elasticsearch Workshop

GET /example/surprise/_search

{ "query": { "match": { "text": { "query": "surprize", "fuzziness": "1" } } } }

QUERY WITH FUZZINESS

254

Page 255: Elasticsearch Workshop

QUERY WITH FUZZINESS

curl -X GET -d @part-3/surprize-fuzzy-query.json 'localhost:9200/example/surprise/_search?pretty'

255

Page 256: Elasticsearch Workshop

"hits": [ { "_index": "example", "_type": "surprise", "_id": "1", "_score": 0.19178301, "_source":{ "text": "Surprise me!"} }]

QUERY WITH FUZZINESS

256

Page 257: Elasticsearch Workshop

AUTO-FUZZINESS

▫︎0 for strings of one or two characters

▫︎1 for strings of three, four or five characters

▫︎2 for strings of more than five characters

257
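The three length brackets above amount to a small lookup. A minimal Python sketch, where the function name `auto_fuzziness` is made up for the illustration:

```python
def auto_fuzziness(term):
    """Return the maximum edit distance allowed for a term of this length."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: exact match only
    if length <= 5:
        return 1  # three to five characters: one edit
    return 2      # more than five characters: two edits


for term in ["ox", "fox", "quick", "surprize"]:
    print(term, auto_fuzziness(term))
```

Short terms get no tolerance because a single edit would turn them into a completely different word.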

Page 258: Elasticsearch Workshop

PART 4

Data modeling

258

Page 259: Elasticsearch Workshop

4-1 INSIDE A CLUSTER

259

Page 260: Elasticsearch Workshop

NODES AND CLUSTERS

▫︎A node is a running instance of Elasticsearch

▫︎A cluster is a set of nodes in the same network and with the same cluster name

260

Page 261: Elasticsearch Workshop

SHARDS

▫︎A node stores data inside its shards

▫︎Shards are the smallest unit of scale and replication

▫︎Each shard is a completely independent Lucene index

261

Page 262: Elasticsearch Workshop

AN EMPTY CLUSTER

262

Page 263: Elasticsearch Workshop

GET /_cluster/health

CLUSTER HEALTH

263

Page 264: Elasticsearch Workshop

"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 0, "active_shards": 0, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0

CLUSTER HEALTH

264

Page 265: Elasticsearch Workshop

PUT /blogs

"settings": { "number_of_shards": 3, "number_of_replicas": 1 }

ADD AN INDEX

265

Page 266: Elasticsearch Workshop

ADD AN INDEX

266

Page 267: Elasticsearch Workshop

GET /_cluster/health

CLUSTER HEALTH

267

Page 268: Elasticsearch Workshop

"cluster_name": "elasticsearch", "status": "yellow", "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 3, "active_shards": 3, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 3

CLUSTER HEALTH

268

Page 269: Elasticsearch Workshop

ADD A BACKUP NODE

269

Page 270: Elasticsearch Workshop

GET /_cluster/health

CLUSTER HEALTH

270

Page 271: Elasticsearch Workshop

"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 2, "number_of_data_nodes": 2, "active_primary_shards": 3, "active_shards": 6, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0

CLUSTER HEALTH

271
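The shard counts in these health reports follow from one rule: two copies of the same shard are never placed on the same node. The Python sketch below is a deliberately simplified model of that rule for a single index, assuming at least one node and that all primaries are assigned; the function name and the returned dictionary are assumptions for the illustration, not the real health API.

```python
def cluster_health(nodes, primaries, replicas):
    """Estimate health fields for one index on a cluster with `nodes` nodes."""
    copies_wanted = 1 + replicas             # the primary plus its replicas
    copies_placed = min(copies_wanted, nodes)  # at most one copy per node
    unassigned = primaries * (copies_wanted - copies_placed)
    return {
        "status": "yellow" if unassigned else "green",
        "active_primary_shards": primaries,
        "active_shards": primaries * copies_placed,
        "unassigned_shards": unassigned,
    }


print(cluster_health(nodes=1, primaries=3, replicas=1))  # one node, replicas wait
print(cluster_health(nodes=2, primaries=3, replicas=1))  # second node takes replicas
print(cluster_health(nodes=3, primaries=3, replicas=2))  # three nodes, two replicas
```

With one node the three replicas have nowhere to go, which is why the status is yellow until a second node joins.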

Page 272: Elasticsearch Workshop

THREE NODES

272

Page 273: Elasticsearch Workshop

PUT /blogs/_settings

{ "number_of_replicas": 2 }

INCREASING THE NUMBER OF REPLICAS

273

Page 274: Elasticsearch Workshop

INCREASING THE NUMBER OF REPLICAS

274

Page 275: Elasticsearch Workshop

NODE 1 FAILS

275

Page 276: Elasticsearch Workshop

CREATING, INDEXING AND DELETING A DOCUMENT

276

Page 277: Elasticsearch Workshop

RETRIEVING A DOCUMENT

277

Page 278: Elasticsearch Workshop

4-2 RELATIONSHIPS

278

Page 279: Elasticsearch Workshop

RELATIONSHIPS MATTER

▫︎Blog Posts ↔ Comments

▫︎Bank Accounts ↔ Transactions

▫︎Orders ↔ Items

▫︎Directories ↔ Files

▫︎…

279

Page 280: Elasticsearch Workshop

SQL DATABASES

▫︎Entities have a unique primary key

▫︎Normalization:

▫︎Entity data is stored only once

▫︎Entities are referenced by primary key

▫︎Updates happen in only one place

280

Page 281: Elasticsearch Workshop

▫︎Entities are joined at query time

SQL DATABASES

SELECT Customer.name, Order.status
FROM Order, Customer
WHERE Order.customer_id = Customer.id

281

Page 282: Elasticsearch Workshop

SQL DATABASES

▫︎Changes are ACID

▫︎Atomicity

▫︎Consistency

▫︎Isolation

▫︎Durability

282

Page 283: Elasticsearch Workshop

ATOMICITY

▫︎If one part of the transaction fails, the entire transaction fails

▫︎…even in the event of power failure, crashes or errors

▫︎“all or nothing”

283

Page 284: Elasticsearch Workshop

CONSISTENCY

▫︎Any transaction will bring the database from one valid state to another

▫︎State must be valid according to all defined rules:

▫︎Constraints

▫︎Cascades

▫︎Triggers

284

Page 285: Elasticsearch Workshop

ISOLATION

▫︎The concurrent execution of transactions results in the same state that would be obtained if transactions were executed serially

▫︎Concurrency Control

285

Page 286: Elasticsearch Workshop

DURABILITY

▫︎A transaction will remain committed

▫︎…even in the event of power failure, crashes or errors

▫︎Non-volatile memory

286

Page 287: Elasticsearch Workshop

SQL DATABASES

▫︎Joining entities at query time is expensive

▫︎Impractical with multiple nodes

287

Page 288: Elasticsearch Workshop

ELASTICSEARCH

▫︎Treats the world as flat

▫︎An index is a flat collection of independent documents

▫︎A single document should contain all information to match a search request

288

Page 289: Elasticsearch Workshop

ELASTICSEARCH

▫︎ACID support for changes on single documents

▫︎No ACID transactions on multiple documents

289

Page 290: Elasticsearch Workshop

ELASTICSEARCH

▫︎Indexing and searching are fast and lock-free

▫︎Massive amounts of data can be spread across multiple nodes

290

Page 291: Elasticsearch Workshop

ELASTICSEARCH

▫︎But we need relationships!

291

Page 292: Elasticsearch Workshop

ELASTICSEARCH

▫︎Application-side joins

▫︎Data denormalization

▫︎Nested objects

▫︎Parent/child relationships

292

Page 293: Elasticsearch Workshop

4-3 APPLICATION-SIDE JOINS

293

Page 294: Elasticsearch Workshop

APPLICATION-SIDE JOINS

▫︎Emulates a relational database

▫︎Joins at application level

▫︎(index, type, id) = primary key

294

Page 295: Elasticsearch Workshop

PUT /example/user/1

{ "name": "John Smith", "email": "[email protected]", "born": "1970-10-24" }

EXAMPLE

295

Page 296: Elasticsearch Workshop

PUT /example/blogpost/2

{ "title": "Relationships", "body": "It's complicated", "user": 1 }

EXAMPLE

296

Page 297: Elasticsearch Workshop

EXAMPLE

▫︎(example, user, 1) = primary key

▫︎Store only the id

▫︎Index and type are hard-coded into the application logic

297

Page 298: Elasticsearch Workshop

GET /example/blogpost/_search

"query": { "filtered": { "filter": { "term": { "user": 1 } } } }

EXAMPLE

298

Page 299: Elasticsearch Workshop

EXAMPLE

▫︎Blogposts written by “John”:

▫︎Find ids of users with name “John”

▫︎Find blogposts that match the user ids

299

Page 300: Elasticsearch Workshop

GET /example/user/_search

"query": { "match": { "name": "John" } }

EXAMPLE

300

Page 301: Elasticsearch Workshop

▫︎For each user id from the first query:

GET /example/blogpost/_search

"query": { "filtered": { "filter": { "term": { "user": <ID> } } } }

EXAMPLE

301
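The two-step lookup can be sketched without a cluster. In the Python example below, plain dictionaries stand in for the `user` and `blogpost` types and two helper functions stand in for the two search requests; all data beyond the John Smith user and the "Relationships" post is invented for the illustration.

```python
# In-memory stand-ins for the two types (extra rows are hypothetical).
users = {
    1: {"name": "John Smith"},
    2: {"name": "Alice White"},
    3: {"name": "John Doe"},
}
blogposts = {
    2: {"title": "Relationships", "body": "It's complicated", "user": 1},
    3: {"title": "Hiking", "body": "Up the hill", "user": 2},
    4: {"title": "Cooking", "body": "Down the hatch", "user": 3},
}


def find_user_ids_by_name(name):
    """First query: stands in for a match query on the user's name."""
    wanted = name.lower()
    return [uid for uid, user in users.items()
            if wanted in user["name"].lower().split()]


def find_blogposts_by_user(user_id):
    """Second query: stands in for a term filter on the blogpost's user field."""
    return [pid for pid, post in blogposts.items() if post["user"] == user_id]


def blogposts_written_by(name):
    """The join itself happens here, in application code."""
    post_ids = []
    for user_id in find_user_ids_by_name(name):
        post_ids.extend(find_blogposts_by_user(user_id))
    return sorted(post_ids)


print(blogposts_written_by("John"))
```

The loop issues one blogpost lookup per matching user, so the cost grows with the number of users returned by the first query.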

Page 302: Elasticsearch Workshop

ADVANTAGES

▫︎Data is normalized

▫︎Change user data in just one place

302

Page 303: Elasticsearch Workshop

DISADVANTAGES

▫︎Run extra queries to join documents

▫︎We could have millions of users named “John”

▫︎Less efficient than SQL joins:

▫︎Several API requests

▫︎Harder to optimize

303

Page 304: Elasticsearch Workshop

WHEN TO USE

▫︎First entity has a small number of documents and they hardly change

▫︎First query results can be cached

304

Page 305: Elasticsearch Workshop

4-4 DATA DENORMALIZATION

305

Page 306: Elasticsearch Workshop

DATA DENORMALIZATION

▫︎No joins

▫︎Store redundant copies of the data you need to query

306

Page 307: Elasticsearch Workshop

PUT /example/user/1

{ "name": "John Smith", "email": "[email protected]", "born": "1970-10-24" }

EXAMPLE

307

Page 308: Elasticsearch Workshop

PUT /example/blogpost/2

{ "title": "Relationships", "body": "It's complicated", "user": { "id": 1, "name": "John Smith" } }

EXAMPLE

308

Page 309: Elasticsearch Workshop

GET /example/blogpost/_search

"query": { "bool": { "must": [ { "match": { "title": "relationships" }}, { "match": { "user.name": "John" }} ]}}

EXAMPLE

309

Page 310: Elasticsearch Workshop

ADVANTAGES

▫︎Speed

▫︎No need for expensive joins

310

Page 311: Elasticsearch Workshop

DISADVANTAGES

▫︎Uses more disk space (cheap)

▫︎Update the same data in several places

▫︎scroll and bulk APIs can help

▫︎Concurrency issues

▫︎Locking can help

311
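The cost of redundant copies shows up when the copied data changes. The Python sketch below uses in-memory dictionaries as stand-ins for the indexed documents to show that renaming a user means touching every blogpost that embeds the name; the second and third posts and the new name are hypothetical.

```python
users = {1: {"name": "John Smith"}, 2: {"name": "Alice White"}}
blogposts = {
    2: {"title": "Relationships", "user": {"id": 1, "name": "John Smith"}},
    3: {"title": "Nest eggs", "user": {"id": 1, "name": "John Smith"}},
    4: {"title": "Hiking", "user": {"id": 2, "name": "Alice White"}},
}


def rename_user(user_id, new_name):
    """Update the user document and every denormalized copy of the name."""
    users[user_id]["name"] = new_name
    updated = []
    for post_id, post in blogposts.items():  # a scan over all matching posts
        if post["user"]["id"] == user_id:
            post["user"]["name"] = new_name
            updated.append(post_id)
    return updated


touched = rename_user(1, "John Doe")
print(touched)
```

On a real cluster the scan and the rewrite would be separate requests, which is where the scroll and bulk APIs come in, and a concurrent writer could slip in between them.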

Page 312: Elasticsearch Workshop

WHEN TO USE

▫︎Need for fast search

▫︎Denormalized data does not change very often

312

Page 313: Elasticsearch Workshop

4-5 NESTED OBJECTS

313

Page 314: Elasticsearch Workshop

MOTIVATION

▫︎Elasticsearch supports ACID when updating single documents

▫︎Querying related data in the same document is faster (no joins)

▫︎We want to avoid denormalization

314

Page 315: Elasticsearch Workshop

PUT /example/blogpost/1

{ "title": "Nest eggs", "body": "Making money...", "tags": [ "cash", "shares" ], "comments": […] }

THE PROBLEM WITH MULTILEVEL OBJECTS

315

Page 316: Elasticsearch Workshop

[{ "name": "John Smith", "comment": "Great article", "age": 28, "stars": 4, "date": "2014-09-01" }, { "name": "Alice White", "comment": "More like this", "age": 31,"stars": 5, "date": "2014-10-22" }]

THE PROBLEM WITH MULTILEVEL OBJECTS

316

Page 317: Elasticsearch Workshop

GET /example/blogpost/_search

"query": { "bool": { "must": [ {"match": {"comments.name": "Alice"}}, {"match": {"comments.age": "28"}} ]}}

THE PROBLEM WITH MULTILEVEL OBJECTS

317

Page 318: Elasticsearch Workshop

[{ "name": "John Smith", "comment": "Great article", "age": 28, "stars": 4, "date": "2014-09-01" }, { "name": "Alice White", "comment": "More like this", "age": 31,"stars": 5, "date": "2014-10-22" }]

THE PROBLEM WITH MULTILEVEL OBJECTS

318

Page 319: Elasticsearch Workshop

THE PROBLEM WITH MULTILEVEL OBJECTS

▫︎Alice is 31, not 28!

▫︎It matched the age of John

▫︎This is because indexed documents are stored as a flattened dictionary

▫︎The correlation between Alice and 31 is irretrievably lost

319

Page 320: Elasticsearch Workshop

{"title": [eggs, nest], "body": [making, money], "tags": [cash, shares], "comments.name": [alice, john, smith, white], "comments.comment": [article, great, like, more, this], "comments.age": [28, 31], "comments.stars": [4, 5], "comments.date": [2014-09-01, 2014-10-22]}

THE PROBLEM WITH MULTILEVEL OBJECTS

320

Page 321: Elasticsearch Workshop

NESTED OBJECTS

▫︎Nested objects are indexed as hidden separate documents

▫︎Relationships are preserved

▫︎Joining nested documents is very fast

321

Page 322: Elasticsearch Workshop

{"comments.name": [john, smith], "comments.comment": [article, great], "comments.age": [28], "comments.stars": [4], "comments.date": [2014-09-01]}

{"comments.name": [alice, white], "comments.comment": [like, more, this], "comments.age": [31], "comments.stars": [5], "comments.date": [2014-10-22]}

NESTED OBJECTS

322

Page 323: Elasticsearch Workshop

{ "title": [eggs, nest], "body": [making, money], "tags": [cash, shares] }

NESTED OBJECTS

323
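The difference between the flattened and the nested representation can be reproduced in a few lines. This Python sketch is a toy model, with made-up function names, of how an array of objects is flattened into per-field term sets, and of what changes when each object keeps its own set.

```python
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]


def terms(value):
    """Very rough analysis: lowercase and split on whitespace."""
    return [t.lower() for t in str(value).split()]


def flatten(objects, prefix):
    """Merge the values of all objects into one term set per field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{prefix}.{key}", set()).update(terms(value))
    return flat


def matches(flat, criteria):
    """True when every (field, term) pair is present in the flattened fields."""
    return all(term in flat.get(field, set()) for field, term in criteria.items())


criteria = {"comments.name": "alice", "comments.age": "28"}

# Multilevel object: one flattened dictionary for the whole array.
flat_match = matches(flatten(comments, "comments"), criteria)

# Nested objects: every comment is flattened and checked on its own.
nested_match = any(matches(flatten([c], "comments"), criteria) for c in comments)

print(flat_match, nested_match)
```

The flattened form reports a match because "alice" and "28" both exist somewhere in the array; the nested form does not, because no single comment contains both.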

Page 324: Elasticsearch Workshop

NESTED OBJECTS

▫︎Need to be enabled by updating the mapping of the index

324

Page 325: Elasticsearch Workshop

PUT /example

"mappings": { "blogpost": { "properties": { "comments": { "type": "nested", "properties": { "name": {"type": "string"}, "comment": {"type": "string"}, "age": {"type": "short"}, "stars": {"type":"short"}, "date": {"type": "date"} }}}}}

MAPPING A NESTED OBJECT

325

Page 326: Elasticsearch Workshop

GET /example/blogpost/_search

"query": { "bool": { "must": [ {"match": {"title": "eggs"}}, {"nested": <NESTED QUERY>} ] } }

QUERYING A NESTED OBJECT

326

Page 327: Elasticsearch Workshop

"nested": { "path": "comments", "query": { "bool": { "must": [ {"match": {"comments.name": "john"}}, {"match": {"comments.age": 28}} ]}}}

NESTED QUERY

327

Page 328: Elasticsearch Workshop

THERE’S MORE

▫︎Nested filters

▫︎Nested aggregations

▫︎Sorting by nested fields

328

Page 329: Elasticsearch Workshop

ADVANTAGES

▫︎Very fast query-time joins

▫︎ACID support (single documents)

▫︎Convenient search using nested queries

329

Page 330: Elasticsearch Workshop

DISADVANTAGES

▫︎To add, change or delete a nested object, the whole document must be reindexed

▫︎Search requests return the whole document

330

Page 331: Elasticsearch Workshop

WHEN TO USE

▫︎When there is one main entity with a limited number of closely related entities

▫︎Ex: blogposts and comments

▫︎Inefficient if there are too many nested objects

331

Page 332: Elasticsearch Workshop

4-6 PARENT-CHILD RELATIONSHIP

332

Page 333: Elasticsearch Workshop

PARENT-CHILD RELATIONSHIP

▫︎One-to-many relationship

▫︎Similar to the nested model

▫︎Nested objects live in the same document

▫︎Parent and children are completely separate documents

333

Page 334: Elasticsearch Workshop

EXAMPLE

▫︎Company with branches and employees

▫︎Branch is the parent

▫︎Employees are the children

334

Page 335: Elasticsearch Workshop

PUT /company

"mappings": { "branch": {}, "employee": { "_parent": { "type": "branch" } } }

EXAMPLE

335

Page 336: Elasticsearch Workshop

PUT /company/branch/london

{ "name": "London Westminster", "city": "London", "country": "UK" }

EXAMPLE

336

Page 337: Elasticsearch Workshop

PUT /company/employee/1?parent=london

{ "name": "Alice Smith", "born": "1970-10-24", "hobby": "hiking" }

EXAMPLE

337

Page 338: Elasticsearch Workshop

GET /company/branch/_search

"query": { "has_child": { "type": "employee", "query": { "range": { "born": { "gte": "1980-01-01" } }}}}

FINDING PARENTS BY THEIR CHILDREN

338

Page 339: Elasticsearch Workshop

GET /company/employee/_search

"query": { "has_parent": { "type": "branch", "query": { "match": { "country": "UK" } }}}

FINDING CHILDREN BY THEIR PARENTS

339
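The semantics of the two queries can be mimicked with ordinary dictionaries. In this Python sketch the `parent` key plays the role of the `?parent=` parameter, and the helper names `has_child` and `has_parent` merely echo the query names; the Paris branch and the second employee are hypothetical additions so that both queries return something.

```python
branches = {
    "london": {"name": "London Westminster", "city": "London", "country": "UK"},
    "paris": {"name": "Paris Centre", "city": "Paris", "country": "France"},
}
employees = {
    1: {"name": "Alice Smith", "born": "1970-10-24", "hobby": "hiking",
        "parent": "london"},
    2: {"name": "Mark Thomas", "born": "1982-05-16", "hobby": "chess",
        "parent": "paris"},
}


def has_child(predicate):
    """Return ids of parents that have at least one child matching predicate."""
    return sorted({e["parent"] for e in employees.values() if predicate(e)})


def has_parent(predicate):
    """Return ids of children whose parent matches predicate."""
    return sorted(eid for eid, e in employees.items()
                  if predicate(branches[e["parent"]]))


# Branches with an employee born on or after 1980-01-01 (ISO dates sort as text).
print(has_child(lambda e: e["born"] >= "1980-01-01"))

# Employees who work in a UK branch.
print(has_parent(lambda b: b["country"] == "UK"))
```

Parents and children stay separate records here, so either side can be changed without rewriting the other.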

Page 340: Elasticsearch Workshop

THERE’S MORE

▫︎min_children and max_children

▫︎Children aggregations

▫︎Grandparents and grandchildren

340

Page 341: Elasticsearch Workshop

ADVANTAGES

▫︎Parent document can be updated without reindexing the children

▫︎Child documents can be updated without affecting the parent

▫︎Child documents can be returned in search results without the parent

341

Page 342: Elasticsearch Workshop

ADVANTAGES

▫︎Parent and children live on the same shard

▫︎Faster than application-side joins

342

Page 343: Elasticsearch Workshop

DISADVANTAGES

▫︎Parent document and all of its children must live on the same shard

▫︎5 to 10 times slower than nested queries

343

Page 344: Elasticsearch Workshop

WHEN TO USE

▫︎One-to-many relationships

▫︎When index-time is more important than search-time performance

▫︎Otherwise, use nested objects

344

Page 345: Elasticsearch Workshop

REFERENCES

345

Page 346: Elasticsearch Workshop

MAIN REFERENCE

▫︎Elasticsearch: The Definitive Guide

▫︎Gormley & Tong

▫︎O'Reilly

346

Page 347: Elasticsearch Workshop

OTHER REFERENCES

▫︎"Jepsen: simulating network partitions in DBs", http://github.com/aphyr/jepsen

▫︎"Call me maybe: Elasticsearch 1.5.0", http://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0

▫︎"Call me maybe: MongoDB stale reads", http://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

347

Page 348: Elasticsearch Workshop

OTHER REFERENCES

▫︎"Elasticsearch Data Resiliency Status", http://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

▫︎"Solr vs. Elasticsearch — How to Decide?", http://blog.sematext.com/2015/01/30/solr-elasticsearch-comparison/

348

Page 349: Elasticsearch Workshop

OTHER REFERENCES

▫︎"Changing Mapping with Zero Downtime", http://www.elastic.co/blog/changing-mapping-with-zero-downtime

349

Page 350: Elasticsearch Workshop

Felipe Dornelas felipedornelas.com

@felipead

THANK YOU