elastic{search} blueprint - pycon.it

Elastic{Search} Blueprint

PyCon7 - Firenze - 2016-04-16

Speaker: Christian “Strap” Strappazzon

$ whoami

★ GS1 Italy - IT Specialist

★ Passionate programmer

★ From Codementor: “You’re not the dev every team needs, but you’re the dev every team deserves.”

★ Spends time reading technical books

★ Python Milano Organizer

★ BBQ Master

★ Dad, family addicted.

Objective

Image from: http://ibaldi.blog.tiscali.it/lavori/

Overview on “ELK-B Stack” with a focus on Elasticsearch and Python/Django integration.

Your homework: take some information from this presentation, then go deeper if you want to use these tools in your current or next projects.

Why am I here?

● Google Site Search service was ending

○ we exceeded the allocated yearly query quota

○ service downgrade with ads

○ possible service suspension

● In the past we (they) used Solr, but the current hype was on Elasticsearch

● It was a good time to try a new tool

● Performance: Elasticsearch is fast

○ indexed ~350 web pages and ~150 PDFs in less than 4 minutes, index size ~55 MB

○ search results come back in milliseconds and provide the limit for you

● Last but not least… the community voted for my talk - THANKS!!! - so I'll do my best! :-)

Let’s begin!

Image from: http://aragec.org/bip+bip.html

Agenda

➢ The Open Source “ELK-B Stack” and commercial products

○ logstash, beats, kibana, sense, elasticsearch

➢ Python/Django Tools

○ haystack install/configure and some other related projects

➢ Final Thoughts

➢ Q & A

The Open Source “ELK-B Stack” and commercial products

Images from: http://elastic.co

Goodbye ELK-B(ee)

Say “Heya” to Elastic Stack and X-Pack

From Elastic{ON} 16

https://www.elastic.co/blog/heya-elastic-stack-and-x-pack



Marvel: Monitoring

Watcher: Alerting

Shield: Security

Hadoop Connector

Sense: Console

Other plugins...

Graph, Reporting

Logstash: collect, parse and enrich data

Beats: collect, parse and ship

Elasticsearch: store, search, analyze

Kibana: visualize and explore data

Images from: http://elastic.co

Logstash: Collect, Enrich and Transport

Logstash is a data pipeline that helps you process logs and other event data from a variety of systems. With 200 plugins and counting, Logstash can connect to a variety of sources and stream data at scale to a central analytics system.

Apache License 2.0

What is a log:

➔ a log is a record of activity by a system, application, etc.

➔ a timestamp and some data

What kind of problems it tries to solve:

➔ every application and device logs in its own special way

➔ each log can be analyzed separately

➔ searching across logs is difficult due to different formats

➔ logs are spread around your servers

➔ many servers and different kinds of logs

➔ ssh + grep aren’t scalable

➔ an expert is required to read the logs

Image from: http://elastic.co

Logstash: Install and Configure

➔ Install

◆ requires JVM 1.7+

◆ download and unzip

◆ prepare a logstash.conf config file

◆ run ./bin/logstash agent -f logstash.conf

➔ Configure

◆ create one or more config files

◆ grok is regex powered, with over 80 patterns plus custom patterns

◆ http://grokdebug.herokuapp.com

input {
  # Apache log, mail log, app log ...
}

filter {
  # Grok, GeoIP, Date ...
}

output {
  # Elasticsearch, Graphite ...
}
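Grok patterns are, at heart, named regular expressions. As a rough illustration in Python (this is not Logstash syntax; the regex is a simplified, hand-written stand-in for a grok Apache pattern, and the sample log line is made up), turning a raw access-log line into a structured event might look like this:

```python
import re
from datetime import datetime

# Simplified stand-in for a grok pattern: a regex with named groups.
# Illustrative assumption only, not the pattern that Logstash ships.
ACCESS_LOG = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) \S+" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Similar in spirit to a "date" filter: turn the timestamp text
    # into a real datetime object.
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    event["response"] = int(event["response"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    return event


# Made-up sample line in Apache common log format.
sample = '127.0.0.1 - - [16/Apr/2016:10:30:00 +0200] "GET /talks/ HTTP/1.1" 200 5120'
event = parse_line(sample)
```

Once every line is reduced to the same set of named fields, searching across logs from different sources becomes a query on those fields rather than an ssh + grep session.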

https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html


Beats: Collect, Parse and Ship

Beats is the platform for building lightweight, open source data shippers for many types of operational data you want to enrich with Logstash, search and analyze in Elasticsearch, and visualize in Kibana.

Written in Go, simple to deploy: download and install/unzip, configure the yaml file and run the daemon with sudo.

Apache License, Version 2.0

Types of beats:

➔ libbeat: for building more beats

➔ Packetbeat: tap into your wire data

➔ Topbeat: gather infrastructure metrics

➔ Filebeat: analyze log files in real time

➔ Winlogbeat: gather insight from Windows event logs

➔ {Future}beats: there's oh-so-much more to come


Kibana: Visualize and Explore Data

➔ Flexible analytics and visualization platform

➔ Real-time summary and charting of streaming data

➔ Intuitive interface for a variety of users

➔ Instant sharing and embedding of dashboards

➔ Apache License, Version 2.0

➔ Easy to install:

◆ requires a modern browser

◆ download and unzip

◆ set elasticsearch.url to point to your ES instance

◆ run the binary


Image from: https://michael.bouvy.net

Sense - Visually Interact with Elasticsearch REST APIs

Sense is a visual console that provides auto-complete, auto-indentation, and syntax checking, all through a Kibana plugin.

Some features:

➔ multiple requests

➔ auto formatting

➔ keyboard shortcuts

➔ history (500 requests)

Apache License, Version 2.0

Image from: http://elastic.co

WARNING! A lot of information incoming...

Elasticsearch: Search, store and analyze

Elasticsearch is a search server based on Lucene.

It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.

Apache License, Version 2.0

Features

❖ Distributed, scalable and resilient

➢ designed for scale-out and high availability

❖ Developer friendly

➢ API first, schemaless, native JSON, client libraries for any language

❖ Real-time search & analytics

➢ real-time aggregations, geospatial, full-text search, queries on structured and unstructured data


Elasticsearch: Node and Cluster

Node

➔ A running instance of elasticsearch (JVM process)

Cluster

➔ Multiple nodes working together


Node Types

Default node

➔ master eligible

➔ holds data

➔ indexing, aggregations, query…

Dedicated master node

➔ master eligible

➔ no data

Data node

➔ holds data

➔ indexing, aggregations, query...

Client node

➔ no data

➔ knows the state of the cluster

➔ routing

Elasticsearch: Index & Shards

Index

➔ An index is a lightweight container for data

Shard

➔ A single piece of an Elasticsearch index

➔ Indexes are partitioned into shards so they can be distributed across multiple nodes

➔ Each shard is a standalone Lucene index

➔ A shard is either a primary or a replica

➔ By default shards are copied for high availability

➔ Replica shards are always on different nodes from each other and their primary shard

➔ Searches may be performed against a primary or a replica
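To make the partitioning concrete, here is a minimal Python sketch of how a document could be assigned to a primary shard. Elasticsearch documents its routing rule as hash(routing) % number_of_primary_shards; the CRC32 hash, the document ids and the shard count below are assumptions chosen only for illustration.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards):
    """Map a document id to a primary shard number.

    Follows the documented rule hash(routing) % number_of_primary_shards.
    CRC32 is only a stand-in hash for this sketch; it is not the hash
    function Elasticsearch uses internally.
    """
    digest = zlib.crc32(doc_id.encode("utf-8"))
    return digest % number_of_primary_shards


shards = 5  # assumed number of primary shards
placement = {
    doc_id: pick_shard(doc_id, shards)
    for doc_id in ("talk-1", "talk-2", "talk-3")  # made-up document ids
}
```

Because the shard number depends on the modulo, changing the number of primary shards would send existing ids to different shards, which is why that number is chosen when the index is created.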


Elasticsearch: Let’s talk about Search! :-)

❖ Different types of search

➢ suggestions, synonyms, autocomplete, filters, aggregations

❖ Iterative process

➢ relevance tuning, accuracy, classification

❖ No downtime

➢ depends on the cluster


Elasticsearch: Mapping

When you insert a JSON document into ES, ES automatically creates a mapping by detecting the data types.

A mapping is composed of fields:

➔ each field requires a type

➔ a field's type cannot be changed once added

➔ new fields can be added

➔ changing a field type requires re-indexing

➔ fields can have a boost

Field types: analyzed string, float, boolean, double, date, integer, not analyzed string, long, binary

{
  "myidx" : {
    "mappings" : {
      "meetup" : {
        "properties" : {
          "message" : { "type" : "string" },
          "post_date" : { "type" : "date", "format" : "dateOptionalTime" }
        }
      }
    }
  }
}
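As a very simplified picture of what data detection means, the Python sketch below guesses a field type from each JSON value. The detection rules, the single date format and the sample document are assumptions made for this example; the real dynamic mapping logic is richer than this.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a mapping type for one JSON value (simplified illustration)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Single assumed date format for this sketch; real date
            # detection accepts more formats.
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    raise TypeError("unsupported value: {!r}".format(value))


def guess_mapping(document):
    """Build a {field: {"type": ...}} dict for a flat JSON document."""
    return {
        field: {"type": guess_field_type(value)}
        for field, value in document.items()
    }


# Made-up sample document.
doc = {"message": "Hello PyCon", "post_date": "2016-04-16T10:30:00", "attendees": 700}
mapping = guess_mapping(doc)
```

The first document decides the type of each field, which is why an explicit mapping is usually worth writing before indexing real content.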

Elasticsearch: Analyzing Text

❖ Tokenizer

➢ breaks up text into tokens

❖ Filters

➢ applied to tokens in sequence

❖ Analyzers

➢ associated with fields in the mapping, can be customized, applied at index and query time

...
"analyzer": {
  "italian": {
    "tokenizer": "standard",
    "filter": [
      "italian_elision",
      "lowercase",
      "italian_stop",
      "italian_keywords",
      "italian_stemmer"
    ]
  }
}
...
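The tokenizer-then-filters chain can be mimicked in a few lines of Python. This is a toy model, not Lucene: the regex tokenizer, the tiny stop word list and the function names are assumptions for illustration, and there is no elision handling or stemming.

```python
import re

# Tiny, hand-picked stop word list; an assumption for this sketch only.
ITALIAN_STOP_WORDS = {"il", "la", "di", "e", "un", "una", "per"}


def standard_tokenizer(text):
    """Break text into word tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in ITALIAN_STOP_WORDS]


def analyze(text, tokenizer=standard_tokenizer,
            filters=(lowercase_filter, stop_filter)):
    """Run the tokenizer, then apply each filter to the token list in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("La ricerca di un documento")
```

Filter order matters: if the stop filter ran before the lowercase filter, the capitalized "La" would not match the stop list and would survive as a token.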

Elasticsearch: Index Alias

➔ An alias is a view of one or more indexes

➔ Can be filtered

➔ Decouples the application from indexes

➔ Lightweight

➔ Used to re-index with no downtime, as an atomic operation

POST /_aliases
{
  "actions": [
    { "remove": { "index": "pyconit_v1", "alias": "pyconit" }},
    { "add": { "index": "pyconit_v2", "alias": "pyconit" }}
  ]
}
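From Python, the same request body can be built by a small helper, so a re-index script only has to name the alias and the two indexes. The helper name is hypothetical; the body structure is the one shown in the request above.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Build the body of an atomic alias switch from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("pyconit", "pyconit_v1", "pyconit_v2")
# Assumption: with the elasticsearch-py client of that period, this body
# could be sent with Elasticsearch().indices.update_aliases(body=body).
```

Both actions travel in one request, so the application that queries "pyconit" never sees a moment in which the alias points to no index.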

Elasticsearch: Querying the data

Elasticsearch provides a full Query DSL based on JSON to define queries.

Some options are:

➔ boost on fields at query time

➔ full-text and term queries

➔ score on results

➔ filter results

➔ aggregate

The documentation rocks! You’ll find everything you need. Trust me! :-)

{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "subject^3", "message" ]
  }
}

{
  "regexp" : {
    "name.first" : { "value" : "s.*y", "boost" : 1.2 }
  }
}
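Since the Query DSL is plain JSON, queries are easy to assemble from Python dictionaries. The sketch below builds the multi_match example above, with query-time boosts passed in as a dict; the helper name and its parameters are hypothetical.

```python
import json


def multi_match_query(text, boosts):
    """Build a search body with a multi_match query.

    boosts maps each field name to its boost factor; a boost of 1 is
    written as the bare field name, any other value as "field^boost".
    """
    fields = [
        field if boost == 1 else "{}^{}".format(field, boost)
        for field, boost in boosts.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


search_body = multi_match_query("this is a test", {"subject": 3, "message": 1})
print(json.dumps(search_body, indent=2))
```

Keeping boosts in a dict makes relevance tuning a matter of changing numbers in one place rather than editing query strings.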

Elasticsearch: Install and Run

➔ requires JVM 1.7+

➔ download and unzip

➔ run ./bin/elasticsearch

➔ edit the config file; some options can be given on the command line

Pretty simple! :-)


https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html

OK, but... Let’s talk about Django… :-)

Haystack: Modular search for Django

Haystack provides modular search for Django with an abstraction layer for different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.)

➔ It's a Django app

➔ The Elasticsearch backend depends on elasticsearch-py

➔ Provides signals, multiple routing, and a search query API similar to the Django ORM

➔ Lack of documentation, but enough to start

➔ You get your hands dirty if you want more

➔ Currently only supports ElasticSearch 1.x. ElasticSearch 2.x is not supported yet; if you would like to help, please see #1247.

➔ BSD License

Image from: http://haystacksearch.org/

https://github.com/django-haystack/django-haystack/issues/1247

Haystack: Install and Settings

(env) $ pip install django-haystack

# add 'haystack' to INSTALLED_APPS
# add HAYSTACK_CONNECTIONS to settings.py to select which backend to use, e.g.:
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}


Haystack: Handling Data

(env) $ ./manage.py startapp search
(env) $ cd search
(env) $ touch search_indexes.py

# Edit with your editor of choice
# ... Vim or Emacs? Fight! # @raymondh

import datetime

from haystack import indexes
from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        now = datetime.datetime.now()
        return self.get_model().objects.filter(pub_date__lte=now)

(env) $ ./manage.py rebuild_index


https://twitter.com/raymondh

Haystack: Setup Search View and URL

# inside urls.py

(r'^search/', include('haystack.urls')),

# Override the search/search.html default template

{# search.html #}
...
<form method="get" action=".">
    {{ form.as_p }}
    <p><input type="submit" value="Search"></p>
    ...
    {% for result in page.object_list %}
        <p>
            <a href="{{ result.object.get_absolute_url }}">
                {{ result.object.title }}</a>
        </p>
    {% empty %}
        <p>No results found.</p>
    {% endfor %}

Pay attention! Don't use result.object.something; use the fields stored on your index instead, e.g. result.title, because result.object.title hits the database!
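The cost difference can be shown with a toy model in plain Python. Nothing here is Haystack code: FakeDatabase and ToyResult are hypothetical classes that only imitate the idea of stored fields being plain attributes while .object is loaded lazily from the database.

```python
class FakeDatabase:
    """Hypothetical stand-in for the Django ORM that counts lookups."""

    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def get(self, pk):
        self.hits += 1
        return self.rows[pk]


class ToyResult:
    """Toy search result: stored fields are plain attributes, while
    .object is fetched from the database on first access."""

    def __init__(self, db, pk, **stored_fields):
        self._db = db
        self._pk = pk
        self._object = None
        self.__dict__.update(stored_fields)

    @property
    def object(self):
        if self._object is None:
            self._object = self._db.get(self._pk)
        return self._object


db = FakeDatabase({1: {"title": "Elasticsearch blueprint"}})
result = ToyResult(db, pk=1, title="Elasticsearch blueprint")

title_from_index = result.title         # served from the index, no lookup
hits_after_index_access = db.hits
title_from_db = result.object["title"]  # triggers one database lookup
hits_after_object_access = db.hits
```

With twenty results per page, the second style means twenty extra lookups for data the index may already hold, provided the field was added to the search index.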


That’s it! Ok… Let’s talk a bit about customizations...

Haystack: Customization - The Hard Part

Custom Backend

https://github.com/bennylope/elasticstack

https://github.com/wingify/superelasticsearch

https://github.com/Jiydam/haystack-elasticsearch-raw-query

https://wellfire.co/learn/custom-haystack-elasticsearch-backend/

http://www.stamkracht.com/extending-haystacks-elasticsearch-backend/

http://stackoverflow.com/questions/27802628/search-for-multiple-words-elasticsearch-haystack

http://cstrap.blogspot.it/2015/06/dealing-with-elasticsearch-reindex-and.html

Attachment

https://gist.github.com/frague59/aab071f0bdce5b010ce4

http://cstrap.blogspot.it/2015/06/django-haystack-elasticsearch-index-pdf.html

I told you so… Here’s your homework... ;-)



Final Thoughts

❖ Use Haystack if you want to be up and running in (almost) no time

❖ Take some time to learn the Elasticsearch API

❖ Learn to use the elasticsearch-py client provided by Elastic

❖ Avoid hitting the database by preparing a good mapping

❖ Tuning takes time, not on the bare metal but on the search contents

❖ Index aliases are your friend

❖ Good search needs good content

❖ You'll learn a lot about text processing

❖ Have Fun! :-)

Image from: http://www.focusonanimation.com/les-trois-courts-bip-bip-et-le-coyote-en-3d-6399/

Links Summary

https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html

https://www.elastic.co/products/beats

https://www.elastic.co/guide/en/kibana/current/index.html

https://www.elastic.co/guide/en/sense/current/index.html

https://www.elastic.co/learn

https://www.elastic.co/use-cases/green-man-gaming

https://www.elastic.co/v5

https://info.elastic.co/cloud-enterprise.html

https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-is-broken.html

https://www.elastic.co/blog/changing-mapping-with-zero-downtime

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

http://haystacksearch.org/

http://django-haystack.readthedocs.org/en/latest/

https://github.com/elastic/elasticsearch-py

https://qbox.io/blog/series/elasticsearch-python-django-series

Join Us on Slack! :-) https://pythonmilano.herokuapp.com

Image from: http://xmastime.blogspot.it/




Thanks!

Answers?

Credits: Valentino Volonghi… some PyCon Italy ago…

Keep in touch! @cstrap on Twitter, Github, Bitbucket, LinkedIn
