elastic{search} blueprint - pycon.it
Elastic{Search} Blueprint
PyCon7 - Firenze - 2016-04-16
Speaker: Christian “Strap” Strappazzon
$ whoami
★ GS1 Italy - IT Specialist
★ Passionate programmer
★ From Codementor: “You’re not the dev every team needs, but you’re the dev every team deserves.”
★ Spends time reading technical books
★ Python Milano Organizer
★ BBQ Master
★ Dad, family addicted.
Objective
Image from: http://ibaldi.blog.tiscali.it/lavori/
Overview of the “ELK-B Stack” with a focus on Elasticsearch and Python/Django integration.
Your homework: take some information from this presentation, then go deeper if you want to use these tools in your current or next projects.
Why am I here?
● Google Site Search service was ending
○ we exceeded the allocated yearly query quota
○ service downgrade with ads
○ possible service suspension
● In the past we (they) used Solr, but the current hype was on Elasticsearch
● It was a good time to try a new tool
● Performance: Elasticsearch is fast
○ indexed ~350 web pages and ~150 PDFs in less than 4 minutes, index ~55 MB
○ searches return in milliseconds and provide the limit for you
● Last but not least… the community voted for my talk - THANKS!!! - so I’ll do my best! :-)
Let’s begin!
Image from: http://aragec.org/bip+bip.html
Agenda
➢ The Open Source “ELK-B Stack” and commercial products
○ logstash, beats, kibana, sense, elasticsearch
➢ Python/Django Tools
○ haystack install/configure and some other related projects
➢ Final Thoughts
➢ Q & A
The Open Source “ELK-B Stack” and commercial products
Images from: http://elastic.co
Goodbye ELK-B(ee)
Say “Heya” to Elastic Stack and X-Pack
From Elastic{ON} 16
https://www.elastic.co/blog/heya-elastic-stack-and-x-pack
Marvel - Monitoring
Watcher - Alerting
Shield - Security
Hadoop Connector
Sense - Console
Other plugins...
Graph, Reporting
Collect, parse and enrich data
Collect, parse and ship
Store, search, analyze
Visualize and explore data
Images from: http://elastic.co
Logstash - Collect, Enrich and Transport
Logstash is a data pipeline that helps you process logs and other event data from a variety of systems. With 200 plugins and counting, Logstash can connect to a variety of sources and stream data at scale to a central analytics system.
Apache License 2.0
What is a log:
➔ a log is a record of activity by a system, application, etc.
➔ a timestamp and some data
What kind of problems it tries to solve:
➔ every application and device logs in its own special way
➔ each log can be analyzed separately
➔ searching across logs is difficult due to different formats
➔ logs are spread around your servers
➔ many servers and different kinds of logs
➔ ssh + grep aren’t scalable
➔ an expert is required to read the logs
Image from: http://elastic.co
Logstash - Install and Configure
➔ Install
◆ requires JVM 1.7+
◆ download and unzip
◆ prepare a logstash.conf config file
◆ run ./bin/logstash agent -f logstash.conf
➔ Configure
◆ create one or more config files
◆ grok is regex powered, over 80 patterns, custom patterns
◆ http://grokdebug.herokuapp.com
input {
  # Apache log, mail log, app log ...
}
filter {
  # Grok, GeoIP, Date ...
}
output {
  # Elasticsearch, Graphite ...
}
https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html
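Grok patterns are essentially named regular expressions. As a rough illustration of the idea (not Logstash's actual implementation), the Python sketch below parses one Apache-style access log line with named groups. The pattern, the function name and the sample line are assumptions made up for this example.

```python
import re

# Hypothetical, simplified pattern for an Apache common-log line.
# Real grok patterns such as COMMONAPACHELOG are more thorough.
APACHE_LINE = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = APACHE_LINE.match(line)
    return match.groupdict() if match else None


# Made-up sample line, only for illustration.
sample = (
    '127.0.0.1 - - [16/Apr/2016:10:00:00 +0200] '
    '"GET /search/?q=pycon HTTP/1.1" 200 5123'
)
event = parse_line(sample)
print(event)
```

Each named group plays the role of a field that a grok filter would add to the event before it is shipped to the output.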
Image from: http://elastic.co
Beats - Collect, Parse and Ship
Beats is the platform for building lightweight, open source data shippers for many types of operational data you want to enrich with Logstash, search and analyze in Elasticsearch, and visualize in Kibana.
Written in Go, simple to deploy: download and install/unzip, configure the yaml file and run the daemon with sudo.
Apache License, Version 2.0
Types of beats:
➔ libbeat: for building more beats
➔ Packetbeat: tap into your wire data
➔ Topbeat: gather infrastructure metrics
➔ Filebeat: analyze log files in real time
➔ Winlogbeat: gather insight from Windows event logs
➔ {Future}beats: there's oh-so-much more to come
Image from: http://elastic.co
Kibana - Visualize and Explore Data
➔ Flexible analytics and visualization platform
➔ Real-time summary and charting of streaming data
➔ Intuitive interface for a variety of users
➔ Instant sharing and embedding of dashboards
➔ Apache License, Version 2.0
➔ Easy to install:
◆ requires a modern browser
◆ download and unzip
◆ set elasticsearch.url to point to the ES instance
◆ run the binary
Image from: http://elastic.co
Image from: https://michael.bouvy.net
Sense - Visually Interact with Elasticsearch REST APIs
Sense is a visual console that provides auto-complete, auto-indentation, and syntax checking all through a Kibana plugin.
Some features:
➔ multiple requests
➔ auto formatting
➔ keyboard shortcuts
➔ history (500 requests)
Apache License, Version 2.0
Image from: http://elastic.co
WARNING! A lot of information incoming...
Elasticsearch - Search, store and analyze
Elasticsearch is a search server based on Lucene.
It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
Apache License, Version 2.0
Features
❖ Distributed, scalable and resilient
➢ designed for scale-out, high availability
❖ Developer friendly
➢ API first, schemaless, native JSON, client libraries for any language
❖ Real-time search & analytics
➢ real-time aggregations, geospatial, full-text search, query structured and unstructured data
Image from: http://elastic.co
Elasticsearch - Node and Cluster
Node
➔ A running instance of elasticsearch (JVM process)
Cluster
➔ Multiple nodes working together
Image from: http://elastic.co
Node Types
Default node
➔ master eligible
➔ holds data
➔ indexing, aggregations, query…
Dedicated master node
➔ master eligible
➔ no data
Data node
➔ holds data
➔ indexing, aggregations, query...
Client node
➔ no data
➔ knows the state of the cluster
➔ routing
Elasticsearch - Index & Shards
Index
➔ An index is a lightweight container for data
Shard
➔ A single piece of an Elasticsearch index
➔ Indexes are partitioned into shards so they can be distributed across multiple nodes
➔ Each shard is a standalone Lucene index
➔ A shard is either a primary or a replica
➔ By default shards are copied for high availability
➔ Replica shards are always on different nodes from each other and their primary shard
➔ Searches may be performed against primary or replica
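To make the partitioning concrete: Elasticsearch picks the primary shard for a document from a hash of its routing value (the document id by default) modulo the number of primary shards. A minimal Python sketch of that rule follows. Here `zlib.crc32` is only a stand-in for the hash Elasticsearch really uses, and the shard count, function name and ids are made-up values.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # assumed value for this sketch


def pick_shard(routing, shards=NUMBER_OF_PRIMARY_SHARDS):
    """shard = hash(routing) % number_of_primary_shards (stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % shards


# The same id always lands on the same shard, whichever node holds it.
for doc_id in ("1", "2", "42"):
    print(doc_id, "-> shard", pick_shard(doc_id))
```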
Image from: http://elastic.co
Elasticsearch - Inverted Index
In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents (named in contrast to a Forward Index, which maps from documents to content)
from wikipedia https://en.wikipedia.org/wiki/Inverted_index
Image from: http://elastic.co
id | content
-----------------------------------------------------
1  | The quick brown fox jumped over the lazy dog
2  | Quick brown foxes leap over lazy dogs in summer
-----------------------------------------------------
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
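A minimal Python sketch of how such a term-to-document table can be built from the two sample documents. It uses a naive whitespace split with no analysis, so "Quick" and "quick" stay separate terms. The function name is made up for this example.

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, content in documents.items():
        for term in content.split():  # naive whitespace tokenizer, no analysis
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


inverted = build_inverted_index(docs)
print(inverted["brown"])  # [1, 2]
print(inverted["Quick"])  # [2]
```

Searching for a term is then a dictionary lookup instead of a scan over every document.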
Elasticsearch - Let’s talk about Search! :-)
❖ Different types of search
➢ suggestions, synonyms, autocomplete, filters, aggregations
❖ Iterative process
➢ relevance tuning, accuracy, classification
❖ No downtime
➢ depends on the cluster
Image from: http://elastic.co
Elasticsearch - Mapping
When you insert a JSON document into ES, ES automatically creates a mapping by detecting the data types.
A mapping is composed of fields:
➔ each field requires a type
➔ a field type cannot be changed once added
➔ new fields can be added
➔ changing a field type requires re-indexing
➔ fields can have a boost
Field types: analyzed string, float, boolean, double, date, integer, not analyzed string, long, binary
Image from: http://elastic.co
{
  "myidx" : {
    "mappings" : {
      "meetup" : {
        "properties" : {
          "message" : { "type" : "string" },
          "post_date" : { "type" : "date", "format" : "dateOptionalTime" }
        }
      }
    }
  }
}
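As a rough illustration of data detection, the Python sketch below guesses a field type from a JSON value, loosely modelled on Elasticsearch 1.x/2.x dynamic mapping (strings that look like ISO dates become `date`, whole numbers `long`, decimals `double`). The function names and the date check are simplifications assumed for this example, not Elasticsearch code.

```python
import re

# Very rough stand-in for date detection: ISO-like dates only.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?$")


def guess_type(value):
    """Guess a mapping type for a single JSON value (toy rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str) and ISO_DATE.match(value):
        return "date"
    return "string"


def guess_mapping(document):
    """Build a 'properties' dict shaped like the one in a mapping."""
    return {field: {"type": guess_type(value)} for field, value in document.items()}


doc = {"message": "Hello PyCon", "post_date": "2016-04-16T10:00:00"}
print(guess_mapping(doc))
```

Because the first document decides the type, it is usually safer to define the mapping explicitly for fields you care about.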
Elasticsearch - Analyzing Text
❖ Tokenizer
➢ breaks up text into tokens
❖ Filters
➢ applied to tokens in sequence
❖ Analyzers
➢ associated with fields in mapping, can be customized, applied at index and query time
Image from: http://elastic.co
...
"analyzer": {
  "italian": {
    "tokenizer": "standard",
    "filter": [
      "italian_elision",
      "lowercase",
      "italian_stop",
      "italian_keywords",
      "italian_stemmer"
    ]
  }
}
...
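The tokenizer-then-filters pipeline can be sketched in plain Python. This is a toy version under stated assumptions: the stopword list is a tiny made-up sample, the elision filter strips any elided prefix rather than a fixed list of articles, and the keyword and stemmer filters are left out.

```python
import re

ITALIAN_STOPWORDS = {"il", "la", "di", "e", "un", "per"}  # tiny hypothetical list


def tokenize(text):
    """Tokenizer: break text into word tokens (apostrophes kept for elision)."""
    return re.findall(r"[\w'’]+", text)


def elision(tokens):
    """Strip an elided prefix such as l' or dell' from the start of a token."""
    return [re.sub(r"^\w+['’](?=\w)", "", token) for token in tokens]


def lowercase(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Drop tokens found in the stopword list."""
    return [token for token in tokens if token not in ITALIAN_STOPWORDS]


def analyze(text, filters=(elision, lowercase, remove_stopwords)):
    """Run the tokenizer, then apply each token filter in sequence."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("L'albero di Natale e la neve"))  # ['albero', 'natale', 'neve']
```

The order of the filters matters: removing stopwords before lowercasing would miss capitalised stopwords such as "La".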
Elasticsearch - Index Alias
➔ An alias is a view of one or more indexes
➔ Can be filtered
➔ Decouples the application from indexes
➔ Lightweight
➔ Used to re-index with no downtime, as an atomic operation
Image from: http://elastic.co
POST /_aliases
{
  "actions": [
    { "remove": { "index": "pyconit_v1", "alias": "pyconit" }},
    { "add": { "index": "pyconit_v2", "alias": "pyconit" }}
  ]
}
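From Python the same swap could look roughly like this. It is a sketch, assuming `client` is an `elasticsearch.Elasticsearch` instance from an elasticsearch-py release contemporary with this talk, where `indices.update_aliases` accepts a `body` argument. The helper names are made up for this example.

```python
def build_alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(client, alias, old_index, new_index):
    """Send both actions in one request so the switch is atomic.

    `client` is assumed to be an elasticsearch.Elasticsearch instance.
    """
    return client.indices.update_aliases(
        body=build_alias_swap(alias, old_index, new_index)
    )


print(build_alias_swap("pyconit", "pyconit_v1", "pyconit_v2"))
```

The application keeps querying `pyconit` and never needs to know which versioned index sits behind it.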
Elasticsearch - Querying the data
Elasticsearch provides a full Query DSL based on JSON to define queries.
Some options are:
➔ boost on fields at query time
➔ full-text and term queries
➔ score on results
➔ filter results
➔ aggregate
The documentation rocks! You’ll find everything you need. Trust me! :-)
Image from: http://elastic.co
{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "subject^3", "message" ]
  }
}

{
  "regexp": {
    "name.first": { "value": "s.*y", "boost": 1.2 }
  }
}
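A hedged sketch of sending the multi_match query from Python. The helper names are made up, `client` is assumed to be an `elasticsearch.Elasticsearch` instance whose `search` method accepts `index` and `body`, and the search API expects the clause wrapped in a top-level "query" key.

```python
def multi_match_query(text, boosts):
    """Build a search body; `boosts` is a list of (field, boost) pairs, 1 = no boost."""
    fields = [
        name if boost == 1 else "{}^{}".format(name, boost)
        for name, boost in boosts
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


def search(client, index, text, boosts):
    """Run the query; `client` is assumed to be an elasticsearch.Elasticsearch."""
    return client.search(index=index, body=multi_match_query(text, boosts))


body = multi_match_query("this is a test", [("subject", 3), ("message", 1)])
print(body)
```

Here a match in `subject` counts three times as much as a match in `message` when the score is computed.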
Elasticsearch - Install and Run
➔ requires JVM 1.7+
➔ download and unzip
➔ run ./bin/elasticsearch
➔ edit the config file; some options can be given on the command line
Pretty simple! :-)
Image from: http://elastic.co
https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html
OK, but... Let’s talk about Django… :-)
Haystack - Modular search for Django
Haystack provides modular search for Django with an abstraction layer for different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.)
➔ It's a Django app
➔ The Elasticsearch backend depends on elasticsearch-py
➔ Provides signals, multiple routing, a search query API similar to the Django ORM
➔ Documentation is lacking, but there is enough to start
➔ You get your hands dirty if you want more
➔ Currently only supports ElasticSearch 1.x. ElasticSearch 2.x is not supported yet, if you would like to help, please see #1247.
➔ BSD License
Image from: http://haystacksearch.org/
Haystack - Install and Settings
(env) $ pip install django-haystack
# add 'haystack' to INSTALLED_APPS
# add in settings.py HAYSTACK_CONNECTIONS which backend to use, e.g.:
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
Image from: http://haystacksearch.org/
Haystack - Handling Data
(env) $ ./manage.py startapp search
(env) $ cd search
(env) $ touch search_indexes.py
# Edit with your editor of choice
# ... Vim or Emacs? Fight! # @raymondh
import datetime
from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        now = datetime.datetime.now()
        return self.get_model().objects.filter(pub_date__lte=now)

(env) $ ./manage.py rebuild_index
Image from: http://haystacksearch.org/
Haystack - Setup Search View and URL
# inside urls.py
(r'^search/', include('haystack.urls')),
# Override the search/search.html default template
{# search.html #}
...
<form method="get" action=".">
    {{ form.as_p }}
    <p><input type="submit" value="Search"></p>
    ...
    {% for result in page.object_list %}
        <p>
            <a href="{{ result.object.get_absolute_url }}">{{ result.object.title }}</a>
        </p>
    {% empty %}
        <p>No results found.</p>
    {% endfor %}
Pay attention! Don't use result.object.something; use the fields on your index instead, e.g. result.title, because result.object.title hits the database!
Image from: http://haystacksearch.org/
That’s it!
Ok… Let’s talk a bit about customizations...
Haystack - Customization - The Hard Part
Custom Backend
https://github.com/bennylope/elasticstack
https://github.com/wingify/superelasticsearch
https://github.com/Jiydam/haystack-elasticsearch-raw-query
https://wellfire.co/learn/custom-haystack-elasticsearch-backend/
http://www.stamkracht.com/extending-haystacks-elasticsearch-backend/
http://stackoverflow.com/questions/27802628/search-for-multiple-words-elasticsearch-haystack
http://cstrap.blogspot.it/2015/06/dealing-with-elasticsearch-reindex-and.html
Attachment
https://gist.github.com/frague59/aab071f0bdce5b010ce4
http://cstrap.blogspot.it/2015/06/django-haystack-elasticsearch-index-pdf.html
I told you so… Here’s your homework... ;-)
Image from: http://haystacksearch.org/
Final Thoughts
❖ Use haystack if you want to be up and running in (almost) no time
❖ Take some time on the elasticsearch API
❖ Learn to use the elasticsearch-py client provided by elastic
❖ Avoid hitting the database by preparing a good mapping
❖ Tuning takes time, not on the bare metal but on search contents
❖ Index aliases are your friend
❖ Good search needs good content
❖ You learn a lot of things about text processing
❖ Have Fun! :-)
Image from: http://www.focusonanimation.com/les-trois-courts-bip-bip-et-le-coyote-en-3d-6399/
Links Summary
https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html
https://www.elastic.co/products/beats
https://www.elastic.co/guide/en/kibana/current/index.html
https://www.elastic.co/guide/en/sense/current/index.html
https://www.elastic.co/learn
https://www.elastic.co/use-cases/green-man-gaming
https://www.elastic.co/v5
https://info.elastic.co/cloud-enterprise.html
https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-is-broken.html
https://www.elastic.co/blog/changing-mapping-with-zero-downtime
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
http://haystacksearch.org/
http://django-haystack.readthedocs.org/en/latest/
https://github.com/elastic/elasticsearch-py
https://qbox.io/blog/series/elasticsearch-python-django-series
Join Us on Slack! :-) https://pythonmilano.herokuapp.com
Image from: http://xmastime.blogspot.it/
Thanks!
Answers?
Credits: Valentino Volonghi… some PyCon Italy ago…
Keep in touch! @cstrap on Twitter, Github, Bitbucket, LinkedIn