Real-time search in Drupal with Elasticsearch @moldcamp
Real-time search in Drupal. Meet Elasticsearch
By Alexei Gorobets (asgorobets)
Elasticsearch
Flexible and powerful open source, distributed real-time search and analytics engine for the cloud
Why use Elasticsearch?
● RESTful API
● Open Source
● JSON over HTTP
● based on Lucene
● distributed
● highly available
● schema free
● massively scalable
Setup in 2 steps:
1. Extract the archive
2. > bin/elasticsearch
How to use it?
> curl -XGET localhost:9200/?pretty
{
  "ok" : true,
  "status" : 200,
  "name" : "Infinity",
  "version" : {
    "number" : "0.90.1",
    "snapshot_build" : false,
    "lucene_version" : "4.3"
  },
  "tagline" : "You Know, for Search"
}
> curl -XGET localhost:9200/?pretty
-XGET = action (verb)
localhost:9200 = node + port
/ = path
?pretty = query string
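The same four parts can be pulled out of a URL programmatically. A minimal sketch using only the Python standard library; the helper name `describe_request` is made up for illustration and nothing here talks to a real node.

```python
from urllib.parse import urlsplit, parse_qs

def describe_request(verb, url):
    """Split a REST call into action, node + port, path and query string."""
    parts = urlsplit(url)
    return {
        "action": verb,
        "node": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        # keep_blank_values=True keeps a bare flag such as ?pretty
        "query": parse_qs(parts.query, keep_blank_values=True),
    }

info = describe_request("GET", "http://localhost:9200/?pretty")
print(info)
```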
Let's index some data
> PUT /index/type/id
index: Where? It's very similar to a database in SQL
type: What? Table, content type, entity type, any kind of type you decide
id: Which? Node ID, entity ID, any kind of serial ID
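A sketch of how the index/type/id path could be assembled from code. It only builds the request object and does not send it; `BASE_URL` and `index_request` are assumptions for illustration, not part of any Drupal module.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a single local node

def index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT /index/type/id request."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT")

req = index_request("mysite", "node", 1, {"nid": "1", "title": "Hello elasticsearch"})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running node.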
> PUT /mysite/node/1 -d
{
  "nid": "1",
  "status": "1",
  "title": "Hello elasticsearch",
  "body": "First elasticsearch document"
}

{
  "ok": true,
  "_index": "mysite",
  "_type": "node",
  "_id": "1",
  "_version": 1
}
Let's GET some data
> GET /mysite/node/1
{
  "_index" : "mysite",
  "_type" : "node",
  "_id" : "1",
  "_version" : 1,
  "exists" : true,
  "_source" : {
    "nid": "1",
    "status": "1",
    "title": "Hello elasticsearch",
    "body": "First elasticsearch document"
  }
}
> GET /mysite/node/1?fields=title,body
Get specific fields
> GET /mysite/node/1/_source
Get source only
Let's UPDATE some data
> PUT /mysite/node/1 -d
{
  "status": "0"
}

{
  "ok": true,
  "_index": "mysite",
  "_type": "node",
  "_id": "1",
  "_version": 2
}
UPDATE = DELETE + PUT
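What "UPDATE = DELETE + PUT" means in practice: a PUT to an existing ID replaces the whole document rather than merging fields. A toy in-memory sketch of that behaviour (the `ToyIndex` class is hypothetical and is not how Elasticsearch is implemented):

```python
class ToyIndex:
    """In-memory stand-in for one index, for illustration only."""

    def __init__(self):
        self.docs = {}      # id -> source
        self.versions = {}  # id -> version counter

    def put(self, doc_id, source):
        # The old document is dropped as a whole and the new one stored.
        self.docs[doc_id] = dict(source)
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1
        return self.versions[doc_id]

index = ToyIndex()
index.put("1", {"nid": "1", "status": "1", "title": "Hello elasticsearch"})
version = index.put("1", {"status": "0"})
print(version, index.docs["1"])
```

After the second PUT only `status` is left, so a client that wants to keep the other fields has to send the full document again.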
Let's DELETE some data
> DELETE /mysite/node/1
{
  "ok": true,
  "found": true,
  "_index": "mysite",
  "_type": "node",
  "_id": "1",
  "_version": 3
}
Distributed, Highly Available
> PUT /new_index -d '{ "settings" : { "number_of_shards" : 3, "number_of_replicas" : 2 }}'
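With these settings every primary shard gets two replica copies. A small sketch of the arithmetic; the function name is made up for illustration.

```python
def shard_copies(number_of_shards, number_of_replicas):
    """Count shard copies; number_of_replicas applies to each primary."""
    primaries = number_of_shards
    replicas = number_of_shards * number_of_replicas
    return {"primaries": primaries, "replicas": replicas, "total": primaries + replicas}

copies = shard_copies(3, 2)
print(copies)
```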
Concurrency, Version control
> PUT /myapp/node/1?version=1
{ "title": "hi girl" }

{ "_index": "myapp", "_type": "node", "_id": "1", "_version": 1, "created": false }

> PUT /myapp/node/1?version=1
{ "title": "hey boy" }
# 200

> PUT /myapp/node/1?version=1
{ "title": "hey boy" }
# 409
> version conflict, current [2], provided [1]
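The version check behaves like optimistic locking: a write is accepted only if the version the client provides is still the current one. A toy sketch of that rule; `VersionedStore` and `VersionConflict` are hypothetical names, not Elasticsearch classes.

```python
class VersionConflict(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self.docs = {}  # id -> (version, source)

    def put(self, doc_id, source, version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller's view of the document is stale.
        if version is not None and version != current:
            raise VersionConflict(
                f"version conflict, current [{current}], provided [{version}]"
            )
        self.docs[doc_id] = (current + 1, dict(source))
        return current + 1

store = VersionedStore()
store.put("1", {"title": "hello"})               # creates version 1
store.put("1", {"title": "hey boy"}, version=1)  # accepted, now version 2
message = None
try:
    store.put("1", {"title": "hey boy"}, version=1)  # stale version
except VersionConflict as error:
    message = str(error)
print(message)
```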
Let's SEARCH for something
> GET /_search
{
  "took" : 32,
  "timed_out" : false,
  "_shards" : {
    "total" : 20,
    "successful" : 20,
    "failed" : 0
  },
  "hits" : { results... }
}
Let's SEARCH in multiple indices and types
> GET /index/_search
> GET /index/type/_search
> GET /index1,index2/_search
> GET /myapp_*/type,entity_*/_search
Let's PAGINATE results
> GET /_search?size=10&from=20
size = results per page
from = starting from
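A sketch of turning a page number into these two parameters, assuming a zero-based page counter; `page_params` is a made-up helper.

```python
from urllib.parse import urlencode

def page_params(page, per_page=10):
    """Translate a zero-based page number into size/from."""
    return {"size": per_page, "from": page * per_page}

query_string = urlencode(page_params(2))
print("/_search?" + query_string)
```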
Let's search oldschool
> GET /_search?q=title:elasticsearch
> GET /_search?q=nid:60
+title:awesome +status:1 +created:[1369917354 TO *]
The ugly encoding =)
?q=%2Btitle:awesome%20%2Bstatus:1%20%2Bcreated:[1369917354%20TO%20*]
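The encoding does not have to be done by hand. A minimal sketch with the standard library; which characters to leave unescaped (`:[]*`) is a choice made here for readability, not a requirement.

```python
from urllib.parse import quote, unquote

lucene_query = "+title:awesome +status:1 +created:[1369917354 TO *]"
# "+" and spaces must be escaped; ":", "[", "]" and "*" are left readable.
encoded = quote(lucene_query, safe=":[]*")
print("/_search?q=" + encoded)
```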
Query DSL style
> GET /_search -d
{
  "query": {
    "match": { "title": "awesome" }
  }
}
> GET /_search -d
{
  "query": {
    "match": {
      "title": {
        "query": "+awesome -poor",
        "boost": 2.0
      }
    }
  }
}
Mappings and types
Core types
* string
* number
* date
* boolean
Complex types
* array type
* object type
* nested type
Others:
* ip type
* geo point
* geo shape
* attachments
Define type mapping
> PUT /myapp/node/_mapping -d
{
  "node" : {
    "properties" : {
      "message" : {
        "type" : "string",
        "store" : true
      }
    }
  }
}
Indexed fields
Full text: analyzed == is split into terms
Term: not analyzed == is stored as is
> PUT /myapp/node/_mapping -d
{
  "node" : {
    "properties" : {
      "name" : {
        "type" : "string",
        "store" : true,
        "index" : "not_analyzed"
      }
    }
  }
}
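The difference between an analyzed and a not_analyzed string can be shown with a toy function. This is only an approximation: the regex-plus-lowercase step stands in for a real analyzer, and `index_terms` is a hypothetical helper.

```python
import re

def index_terms(value, index="analyzed"):
    """Return the terms that would end up in the index for one field value."""
    if index == "not_analyzed":
        return [value]  # stored as is, as a single term
    # Rough stand-in for analysis: split into words and lowercase them.
    return [token.lower() for token in re.findall(r"\w+", value)]

analyzed = index_terms("Alexei Gorobets")
exact = index_terms("Alexei Gorobets", index="not_analyzed")
print(analyzed, exact)
```

A search for "gorobets" can match the analyzed field, while the not_analyzed field only matches the exact value "Alexei Gorobets".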
Dynamic mapping
Analysis and indexing
Inverted index
1. “The quick brown fox jumped over the lazy dog”
2. “Quick brown foxes leap over lazy dogs in summer”
Term Doc_1 Doc_2
-------------------------
Quick | | X
The | X |
brown | X | X
dog | X |
dogs | | X
fox | X |
foxes | | X
in | | X
jumped | X |
lazy | X | X
leap | | X
over | X | X
quick | X |
summer | | X
the | X |
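The table above can be rebuilt in a few lines. A sketch that splits on whitespace only and does no lowercasing or stemming, which is why "Quick" and "quick" stay separate terms, exactly as in the table.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(documents):
    """Map every term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

inverted = build_inverted_index(docs)
for term in sorted(inverted):
    print(f"{term:<8}{sorted(inverted[term])}")
```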
Analyzer
Tokenizers
● standard
● keyword
● whitespace
● ngram
TokenFilters
● standard
● lowercase
● stop
● truncate
● snowball
> GET /_analyze?analyzer=standard -d 'this is a test baby'
{
  "tokens" : [
    { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "baby", "start_offset" : 15, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 5 }
  ]
}
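Why only "test" and "baby" come back, at positions 4 and 5: the other three words are stop words, and a removed stop word still uses up its position. A toy analyzer that reproduces those numbers; the three-word stop list is an assumption cut down to this example, and the regex tokenizer only approximates the real one.

```python
import re

STOP_WORDS = {"this", "is", "a"}  # assumption: tiny subset of an English stop list

def analyze(text):
    """Tokenize, lowercase and drop stop words, keeping offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # dropped, but its position is still consumed
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

tokens = analyze("this is a test baby")
print(tokens)
```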
Autocomplete fields
Queries & Filters
Queries: full text search, relevance score, heavy, not cacheable
Filters: exact match, show or hide, lightning fast, cacheable
Combine Filters & Queries
> GET /_search -d
{
  "query": {
    "filtered": {
      "query": {
        "match": { "title": "awesome" }
      },
      "filter": {
        "term": { "type": "article" }
      }
    }
  }
}
and Sorting
> GET /_search -d
{
  "query": {
    "filtered": {
      "query": {
        "match": { "title": "awesome" }
      },
      "filter": {
        "term": { "type": "article" }
      }
    }
  },
  "sort": { "date": "desc" }
}
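Nested bodies like this are easy to get wrong by hand (a missing comma is enough). A sketch that assembles the same structure from plain dictionaries and lets `json.dumps` do the serialization; `filtered_search` is a hypothetical helper, not an API of any client library.

```python
import json

def filtered_search(query, filter_, sort=None):
    """Wrap a query and a filter in a "filtered" query, with optional sort."""
    body = {"query": {"filtered": {"query": query, "filter": filter_}}}
    if sort is not None:
        body["sort"] = sort
    return body

body = filtered_search(
    {"match": {"title": "awesome"}},
    {"term": {"type": "article"}},
    sort={"date": "desc"},
)
print(json.dumps(body, indent=2))
```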
Relevance. Explain API
Term frequency
How often does the term appear in the field? The more often, the more relevant.
Inverse document frequency
How often does each term appear in the index? The more often, the less relevant.
Field norm
How long is the field? The longer it is, the less likely it is that words in the field will be relevant.
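The three factors can be written down as small formulas. The sketch below follows the classic Lucene TF/IDF similarity as I understand it for the Lucene 4.x line; treat the exact formulas as an assumption, and note that a real score combines them with further factors such as boosts.

```python
import math

def tf(term_freq):
    """Term frequency: grows with the number of occurrences in the field."""
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: shrinks as more documents contain the term."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field norm: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(num_terms)

print(tf(4), idf(100, 9), field_norm(4))
```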
and Facets
Facets on Amazon
> GET /_search -d
{
  "facets": {
    "home_team": {
      "terms": {
        "field": "field_home_team"
      }
    }
  }
}
Give your facet a name (here: "home_team")
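What a terms facet counts can be imitated on a plain list of documents. This is a toy sketch under my reading of the response fields (missing = documents without the field, total = values counted, other = values not covered by the returned terms); the sample data and the `terms_facet` helper are made up.

```python
from collections import Counter

def terms_facet(docs, field, size=2):
    """Count field values and report the top `size` terms."""
    values = [doc[field] for doc in docs if field in doc]
    top = Counter(values).most_common(size)
    shown = sum(count for _, count in top)
    return {
        "_type": "terms",
        "missing": len(docs) - len(values),
        "total": len(values),
        "other": len(values) - shown,
        "terms": [{"term": term, "count": count} for term, count in top],
    }

games = (
    [{"field_home_team": "hou"}] * 3
    + [{"field_home_team": "sln"}] * 2
    + [{"field_home_team": "nya"}, {"title": "no team set"}]
)
facet = terms_facet(games, "field_home_team", size=2)
print(facet)
```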
Your facet filter can be:
● Terms
● Range
● Histogram
● Date Histogram
● Filter
● Query
● Statistical
● Terms Stats
● Geo Distance
"facets" : {
  "home_team" : {
    "_type" : "terms",
    "missing" : 203,
    "total" : 100,
    "other" : 42,
    "terms" : [
      { "term" : "hou", "count" : 8 },
      { "term" : "sln", "count" : 6 },
      ...
STOP! I want this in Drupal?
Available modules:
● Elasticsearch
● Elasticsearch Connector
● Search API elasticsearch
Development directions:
1. Search API implementation
2. Field Storage API
3. Alternative backends
Field Storage API implementation
Elasticsearch field storage sandbox by Damien Tournoud
Started in July 2011
Elasticsearch EntityFieldQuery sandbox https://drupal.org/sandbox/asgorobets/2073151
Let's DEMO
Let the Search be with you