99 Problems, But The Search Ain’t One
Andrei Zmievski • ConFoo • Mar 9, 2011
who am i?
curl http://localhost:9200/speaker/info/andrei
{"name": "Andrei Zmievski", "works": "Analog Co-op", "projects": ["PHP", "PHP-GTK", "Smarty", "Unicode/i18n"], "likes": ["coding", "beer", "brewing", "photography"], "twitter": "@a", "email": "[email protected]"}
what is elasticsearch?
a search engine for the NoSQL generation
domain-driven
distributed
RESTful
Hitchhiker’s Guide to the Galaxy (no, really)
document model
document-oriented
JSON-based
schema-free
based on Lucene
multi-tenancy
distributed, out of the box
engine
3 easy steps
1. index
request:
curl -XPOST http://localhost:9200/conf/speaker/1 -d '
{
    "name": "Andrei Zmievski",
    "talk": "99 Problems, but the Search Ain'\''t One",
    "likes": ["coding", "beer", "photography"],
    "twitter": "a",
    "height": 185
}'
response:
{
    "ok": true,
    "_index": "conf",
    "_type": "speaker",
    "_id": "1"
}
2. search
request:
curl http://localhost:9200/conf/speaker/_search?q=beer
response:
{ "took" : 3,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.?908509,
    "hits" : [ {
      "_index" : "conf",
      "_type" : "speaker",
      "_id" : "1",
      "_score" : 0.?908509,
      "_source" :
{
    "name": "Andrei Zmievski",
    "talk": "99 Problems, but the Search Ain't One",
    "likes": ["coding", "beer", "photography"],
    "twitter": "a",
    "height": 185
} } ] } }
reading the search response:
hits.total: total number of hits
_index: the index of the doc
_type: the type of the doc
_id: the id of the doc
_score: the hit score
_source: the original doc contents
took: the execution time
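As a rough client-side sketch of reading those fields, the Python below parses a response shaped like the one in the search step. The JSON literal is a hand-written stand-in (the score value in particular is made up), and summarize is a hypothetical helper name, not part of any client library.

```python
import json

# Hypothetical response body shaped like the search response in step 2.
# The _score / max_score value is a placeholder, not a real result.
raw = """
{ "took": 3,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.5,
    "hits": [ {
      "_index": "conf", "_type": "speaker", "_id": "1", "_score": 0.5,
      "_source": { "name": "Andrei Zmievski",
                   "likes": ["coding", "beer", "photography"] }
    } ]
  }
}
"""

def summarize(body):
    """Pull the annotated fields out of a search response string."""
    resp = json.loads(body)
    hits = resp["hits"]
    return {
        "took": resp["took"],          # execution time
        "total": hits["total"],        # total number of hits
        "docs": [(h["_index"], h["_type"], h["_id"], h["_score"], h["_source"])
                 for h in hits["hits"]],
    }

summary = summarize(raw)
```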
3. profit
that’s up to you
demo
distributed model
provides:
performance
resiliency (high-availability)
shards
a portion of the document space
each one is a separate Lucene index
thus, many per-index settings are available
document is sharded by its _id value
but can be assigned (routed) to a shard deterministically
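One way to picture the sharding rule is a stable hash of the routing value, taken modulo the number of shards. This is only a sketch: the slides do not say which hash function elasticsearch uses, so zlib.crc32 is a stand-in, and pick_shard is a hypothetical name.

```python
import zlib

def pick_shard(routing_value, number_of_shards):
    """Map a routing value (the _id by default) to a shard number."""
    # crc32 is a stand-in; any stable hash illustrates the idea.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_shards

# The same _id always lands on the same shard.
shard_a = pick_shard("1", 2)
shard_b = pick_shard("1", 2)

# An explicit routing value (hypothetical: a user name) keeps all of
# that user's documents on one shard, whatever their _id values are.
user_shard = pick_shard("derick", 2)
```

Because the result depends on the shard count, changing number_of_shards would move documents, which is one reason to pick that number up front.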
zero-conf discovery
zen (multicast and unicast)
cloud (EC2 via API)
auto-routing
master node:
maintains cluster state
reassigns shards if nodes leave/join cluster
any node can process the search request
the query is handled via scatter-gather mechanism
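Scatter-gather can be sketched in a few lines: the node that receives the request asks every shard for its best hits, then merges them into one ranked list. The scoring here (term count) and the sequential loop are simplifications for illustration; a real cluster queries shards in parallel and scores with Lucene.

```python
import heapq

def search_shard(shard_docs, term):
    """Per-shard search with toy scoring: how often the term occurs."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits

def scatter_gather(shards, term, size=10):
    """Query every shard (scatter), then merge the top hits (gather)."""
    partial = [search_shard(docs, term) for docs in shards]
    merged = heapq.nlargest(size, (hit for hits in partial for hit in hits))
    return [doc_id for _score, doc_id in merged]

# Two hypothetical shards of the person index.
shards = [
    {"1": "thomas likes beer", "3": "andrei likes beer beer"},
    {"2": "thomas thomas"},
]
top = scatter_gather(shards, "thomas")
```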
replicas
each shard can have 1 or more replicas
# of replicas can be updated dynamically after index creation
replicas can be used for querying in parallel
shard allocation
start with a single node:
  node 1: (empty)
create the index:
  PUT /person { "index": { "number_of_shards": 2, "number_of_replicas": 1 } }
  node 1: person1, person2
start the second node:
  node 1: person1, person2
  node 2: person1, person2 (replicas)
start 2 more nodes:
  nodes 1-4: person1, person2, and their replicas are spread across all four nodes
document sharding
(diagram: nodes 1-4 holding person1, person2, and their replicas)
PUT /person/info/1 { … }
  hashed to shard 1
  replicated
PUT /person/info/2 { … }
  hashed to shard 2
  replicated
scatter-gather
(diagram: nodes 1-4 holding person1, person2, and their replicas)
GET /person/_search?q=name:thomas
demo
transactional model
per-document consistency
no need to commit/flush
uses write-ahead transaction log
write consistency (W) can be controlled
one, quorum, or all
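The three levels can be read as "how many copies of the shard must be available before the write goes ahead". The sketch below assumes quorum means a simple majority of primary plus replicas; the exact rule elasticsearch applies (especially for small replica counts) is not given in the slides, and required_copies is a hypothetical helper.

```python
def required_copies(consistency, number_of_replicas):
    """How many shard copies must be active before a write proceeds."""
    total = 1 + number_of_replicas      # primary + replicas
    if consistency == "one":
        return 1
    if consistency == "quorum":
        return total // 2 + 1           # simple majority (assumption)
    if consistency == "all":
        return total
    raise ValueError("unknown consistency level: %r" % consistency)
```

With 2 replicas (3 copies in total), one needs 1 copy, quorum needs 2, and all needs 3.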
(near) real-time search
1 second refresh rate by default
_refresh API also
index storage
node data considered transient
can be stored in local file system, JVM heap, native OS memory, or FS & memory combination
persistent storage requires a gateway
gateways
persistent store for cluster state and indices
asynchronous, translog-based write strategy
allows full recovery if a cluster restart is needed
supported gateways:
local
shared FS
Hadoop via HDFS
S3
mapping
describes document structure to the search engine
automatically created with sensible defaults
explicit mapping can be provided (generally, a good idea)
can run into merge conflicts
mapping
important meta fields:
_source
_all
there are more
mapping types
simple:
string, integer/long, float/double, boolean, and null
complex:
array, object
sample mapping
document:
{"user":      "derick",
 "title":     "Don’t Panic",
 "tags":      ["profiling", "debugging", "php"],
 "postDate":  "2010-12-22T15:1?:12",
 "priority":  2}
mapping:
{"post": {
  "properties" : {
    "user":      {"type": "string", "index": "not_analyzed"},
    "message":   {"type": "string", "boost": 1.?},
    "tags":      {"type": "string", "include_in_all": "no"},
    "postDate" : {"type" : "date", "store": "no"},
    "priority" : {"type" : "integer"}
}}}
analyzers
break down (tokenize) and normalize fields during indexing and query strings at search time
analyzer = tokenizer + token filters (0 or more)
Standard Analyzer =
   Standard Tokenizer +
       Standard Token Filter +
       Lowercase Token Filter +
       Stop Token Filter
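The "tokenizer + token filters" composition is easy to mimic outside Lucene. The sketch below is a toy: the regex tokenizer and the six-word stop list are assumptions made for the example and do not reproduce what the Standard Analyzer actually does.

```python
import re

STOP_WORDS = {"a", "an", "and", "but", "the", "to"}   # tiny illustrative list

def tokenize(text):
    """Tokenizer: split text into word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)

def lowercase(tokens):
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]

def stop(tokens):
    """Token filter: drop stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text, filters=(lowercase, stop)):
    """analyzer = tokenizer + token filters, applied in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

terms = analyze("99 Problems, but the Search Ain't One")
```

The same analyze function would run on field values at index time and on query strings at search time, which is why both sides end up comparing the same normalized terms.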
analyzers
analyzers, tokenizers, and filters can be customized
elasticsearch.yml:
index:
  analysis:
    analyzer:
      eulang:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop,
                 asciifolding, porterStem]
mapping:
…
"title": {"type": "string", "analyzer": "eulang"},
…
API
API conventions
append ?pretty=true to get readable JSON
boolean values: false/0/off = false, rest is true
JSONP support via callback parameter
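The boolean convention amounts to a three-value deny list. A minimal sketch, assuming the value arrives as a query-string string and that matching is exact (whether the server also accepts other spellings is not stated here); parse_bool is a hypothetical name.

```python
FALSE_VALUES = {"false", "0", "off"}

def parse_bool(raw):
    """false/0/off mean False; any other string means True."""
    return raw not in FALSE_VALUES
```

So pretty=true, pretty=1, and pretty=yes would all switch readable JSON on, while pretty=off switches it off.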
API structure
http://host:port/[index]/[type]/[_action/id]
GET http://es:9200/_status
GET http://es:9200/twitter/_status
POST http://es:9200/twitter/tweet/1
GET http://es:9200/twitter/tweet/1
API structure
http://host:port/[index]/[type]/[_action/id]
GET http://es:9200/twitter/tweet/_search
GET http://es:9200/twitter/user/_search
GET http://es:9200/twitter/tweet,user/_search
GET http://es:9200/twitter,facebook/_search
GET http://es:9200/_search
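The URL pattern is regular enough to generate. The helper below is a sketch with a hypothetical name and signature (es_url is not part of any client library); it only joins the optional index, type, action, and id segments, using commas for multiple indices or types.

```python
def es_url(host, indices=(), types=(), action=None, doc_id=None, port=9200):
    """Build http://host:port/[index]/[type]/[_action/id]."""
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    if action:
        parts.append(action)
    if doc_id is not None:
        parts.append(str(doc_id))
    return "http://%s:%d/%s" % (host, port, "/".join(parts))

one_doc = es_url("es", ["twitter"], ["tweet"], doc_id=1)
two_types = es_url("es", ["twitter"], ["tweet", "user"], "_search")
everything = es_url("es", action="_search")
```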
API query example
{
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "foo bar",
                    "default_operator": "AND",
                    "fields": ["title", "description"],
                    "boost": 2.0
                }
            },
            "filter": {
                "range": {"date": {"gt": "2011-03-09"}}
            }
        }
    },
    "from": 10,
    "size": 10
}
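In client code a request body like this is usually assembled as a nested structure and serialized at the last moment. The sketch below builds the same shape as the query example; filtered_query and its page parameter are hypothetical conveniences, with page converted to the from offset.

```python
import json

def filtered_query(text, fields, after_date, page=1, size=10):
    """Build a query_string query restricted by a date range filter."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "query_string": {
                        "query": text,
                        "default_operator": "AND",
                        "fields": fields,
                        "boost": 2.0,
                    }
                },
                "filter": {"range": {"date": {"gt": after_date}}},
            }
        },
        "from": (page - 1) * size,   # offset of the first hit to return
        "size": size,                # number of hits per page
    }

body = filtered_query("foo bar", ["title", "description"], "2011-03-09", page=2)
payload = json.dumps(body)           # string to send as the request body
```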
API {core}
index
bulk
delete
delete by query
get
count
search
query
from/size paging
sort
highlighting
selective fields
API {indices}
create
delete
open/close
get/put/delete mapping
refresh
optimize
snapshot
update settings
analyze
status
flush
Query DSL
term / terms
range
prefix
bool
fuzzy
wildcard
query_string
default_operator
analyzer
phrase_slop
etc
filters
share some similar features with queries (term, range, etc)
why use a filter?
filters
faster than queries
cached (depends on the filter)
the cache is used for different queries against the same filter
no scoring
more useful ones: term, terms, range, prefix, and, or, not, exists, missing, query
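Why caching helps can be shown with a toy filter: it answers only yes or no per document, so its result is a plain set of ids that later searches can reuse. The class below is an illustration under that assumption, not how elasticsearch stores its filter cache.

```python
class CachedTermFilter:
    """Toy term filter: no scoring, result cached per (field, value)."""

    def __init__(self, docs):
        self.docs = docs        # doc_id -> dict of fields
        self.cache = {}         # (field, value) -> frozenset of matching ids
        self.computed = 0       # how many times the docs were actually scanned

    def matching(self, field, value):
        key = (field, value)
        if key not in self.cache:
            self.computed += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, doc in self.docs.items()
                if doc.get(field) == value
            )
        return self.cache[key]

docs = {1: {"user": "derick", "title": "Don't Panic"},
        2: {"user": "andrei", "title": "Search"},
        3: {"user": "derick", "title": "Profiling"}}
term_filter = CachedTermFilter(docs)

# Two different searches restricted by the same filter share one cached set.
panic = [i for i in term_filter.matching("user", "derick") if "Panic" in docs[i]["title"]]
prof = [i for i in term_filter.matching("user", "derick") if "Profiling" in docs[i]["title"]]
```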
facets
provide aggregated data based on the search request
terms, histogram, date histogram, range, statistical, and more
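A terms facet is essentially a value count over the documents that matched the search. A small sketch with hypothetical data (terms_facet is an illustrative helper, and real facets are computed inside the shards rather than on the client):

```python
from collections import Counter

def terms_facet(hits, field, size=10):
    """Count the values of one field across the matching documents."""
    counts = Counter()
    for doc in hits:
        value = doc.get(field, [])
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts.most_common(size)

hits = [
    {"tags": ["profiling", "php"]},
    {"tags": ["php", "debugging"]},
    {"tags": "php"},
]
facet = terms_facet(hits, "tags", size=1)
```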
geo search
implemented as filters (and a facet)
geo_distance
geo_bounding_box
geo_polygon
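A geo_distance style filter boils down to "keep the documents within N km of a point". The sketch below uses the haversine great-circle formula with a mean Earth radius; that choice is an assumption for the example, since the slides do not say how elasticsearch computes distance, and the lat/lon field names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0    # mean Earth radius

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geo_distance_filter(docs, center, max_km):
    """Keep the docs whose (lat, lon) lies within max_km of center."""
    lat, lon = center
    return [d for d in docs if distance_km(lat, lon, d["lat"], d["lon"]) <= max_km]

places = [{"name": "near", "lat": 0.0, "lon": 0.5},
          {"name": "far", "lat": 0.0, "lon": 2.0}]
nearby = geo_distance_filter(places, (0.0, 0.0), 100)
```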
interfaces
REST
Java / Groovy
Language clients (REST/Thrift):
pyes, PHP (standalone and symfony), Ruby, Perl
Flume sink implementation
data import
ES is not the primary data store (usually)
to import/synchronize data:
write an agent (Gearman, message queues, etc)
use rivers (CouchDB, RabbitMQ, Twitter)
10 more features
versioning
index aliases
parent/child docs
scripting
dynamic mapping templates
load balancing nodes
plugins
more_like_this
multi_field mapping
percolation
References
http://github.com/elasticsearch/elasticsearch
http://groups.google.com/a/elasticsearch.com/group/users/
IRC: #elasticsearch on irc.freenode.net
twitter: @elasticsearch