Scaling Analytics with elasticsearch
Dan Noble (@dwnoble)
Background
• Technologist at The HumanGeo
• We use elasticsearch to build social media analysis tools
• 100MM documents indexed
• 600GB+ index size
• Author of the Python elasticsearch driver “rawes”: https://github.com/humangeo/rawes
Overview
• What is elasticsearch?
• Scaling with elasticsearch
• How can I use elasticsearch to help with analytics?
• Use Case: Social Media Analytics
What is elasticsearch?
Search Engine
• Open source
• Distributed
• Automatic failover
• Crazy fast
Search Engine
• Actively maintained
• REST API
• JSON messages
• Lucene based
Search
• Simple case: one host
• One index containing a set of articles
[Diagram: an elasticsearch “cluster” of one host holding the index “Articles”]
Distributed Search
• Too much data?
• Add another host
• Indices can be broken up into “shards” and live on different machines
[Diagram: an elasticsearch “cluster” of two hosts, one holding shard Articles (a) and the other shard Articles (b)]
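To make the shard idea concrete: roughly speaking, elasticsearch decides which shard a document lives on by hashing a routing value (the document id by default) and taking it modulo the number of primary shards. The sketch below only illustrates that idea. The crc32 hash and the name shard_for are stand-ins chosen for this example, not what elasticsearch uses internally.

```python
import zlib


def shard_for(doc_id, number_of_shards):
    """Map a document id to a shard number in [0, number_of_shards).

    Assumption: crc32 stands in for the real hash function.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Two shards, like Articles (a) and Articles (b) in the slides.
doc_ids = ["article-%d" % n for n in range(10)]
assignment = {doc_id: shard_for(doc_id, 2) for doc_id in doc_ids}
```

The same id always maps to the same shard, which is why any host can work out where a given document lives.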
Redundancy
• Shards can be replicated to improve availability
[Diagram: an elasticsearch cluster of two hosts; each host holds one of the shards Articles (a) and Articles (b) plus a replica of the other]
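Shard and replica counts are index settings. A minimal sketch of a settings body for the two-shard, one-replica layout described here follows. The helper names are made up for illustration. The body would be sent with a PUT request when the index is created, and nothing below talks to a cluster.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Build an index-creation body with the given shard and replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def total_shard_copies(settings):
    """Primaries plus replicas: how many shard copies the cluster must place."""
    index = settings["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


settings = index_settings(number_of_shards=2, number_of_replicas=1)
body = json.dumps(settings)  # JSON body for the index-creation request
```

With two shards and one replica each, the cluster has four shard copies to spread across its hosts.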
Node Auto Discovery
• Say we add a third host
• elasticsearch will automatically start moving shards to this new host to distribute load
[Diagram: an elasticsearch cluster of three hosts; copies of Articles (a) and Articles (b) move onto the new host]
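A toy model of this balancing, under loud assumptions: the function below is not elasticsearch's allocator, only a greedy sketch that places each shard copy on the least-loaded host that does not already hold a copy of the same shard. Host names and the function name are hypothetical.

```python
def allocate(shards, replicas, hosts):
    """Greedy toy allocation: returns {host: [(shard, role), ...]}.

    Copies of the same shard never share a host.
    """
    if replicas + 1 > len(hosts):
        raise ValueError("need at least one host per copy of a shard")
    placement = {host: [] for host in hosts}
    for shard in shards:
        for copy in range(replicas + 1):
            # Hosts that do not yet hold any copy of this shard.
            candidates = [
                h for h in hosts
                if shard not in [name for name, _ in placement[h]]
            ]
            target = min(candidates, key=lambda h: len(placement[h]))
            role = "primary" if copy == 0 else "replica"
            placement[target].append((shard, role))
    return placement


two_hosts = allocate(["a", "b"], 1, ["host-1", "host-2"])
three_hosts = allocate(["a", "b"], 1, ["host-1", "host-2", "host-3"])
```

With two hosts every host carries two copies. With a third host the same four copies spread out, so no host carries more than two and the new host takes some of the load.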
Failover
• Say a host goes down
• Shards on that host are no longer available for search
• Elasticsearch automatically rebuilds these two shards on other hosts
[Diagram: an elasticsearch cluster of three hosts; one host goes down and its copies of Articles (a) and Articles (b) are rebuilt on the remaining hosts]
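One way to watch a failover is the cluster health API (GET _cluster/health), which reports a green, yellow or red status along with shard counts. The sketch below interprets such a response. The sample values are invented for illustration and describe_health is a hypothetical helper, not part of any client library.

```python
def describe_health(health):
    """Turn a cluster health response into a one-line summary."""
    status = health["status"]
    if status == "green":
        return "all primary and replica shards are allocated"
    if status == "yellow":
        return ("all primaries are allocated, %d shard copies are still unassigned"
                % health["unassigned_shards"])
    if status == "red":
        return "at least one primary shard is not allocated"
    raise ValueError("unexpected status: %r" % status)


# Illustrative response shortly after a host holding two shard copies went down.
sample_health = {
    "status": "yellow",
    "number_of_nodes": 2,
    "active_shards": 2,
    "initializing_shards": 0,
    "unassigned_shards": 2,
}
summary = describe_health(sample_health)
```

Once the missing copies have been rebuilt on the remaining hosts, the status should return to green.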
Querying
• Client (web application) searches for articles, e.g. query: “Barack Obama”
• Can query against any host
• That host sends the request to other shards if needed
[Diagram: a client querying an elasticsearch cluster of three hosts holding copies of Articles (a) and Articles (b)]
REST API
• JSON query syntax
• Developer friendly
• Easy to get started
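Since the API is plain JSON over HTTP, a search can be expressed with nothing but the standard library. The sketch below builds the request without sending it. The host elastic-00:9200 and the index name articles are this deck's example values, and build_search_request is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_search_request(host, index, query_text):
    """Build (but do not send) a _search request with a JSON query body."""
    body = {"query": {"query_string": {"query": query_text}}}
    return Request(
        url="http://%s/%s/_search" % (host, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("elastic-00:9200", "articles", "Barack Obama")
# Sending it would be: urllib.request.urlopen(request), given a reachable cluster.
```

A client library mostly saves you the URL building and JSON encoding shown here.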
Python Example
    import rawes
    es = rawes.Elastic('elastic-00:9200')
    es.get('articles/_search', data={
        "query": {
            "filtered": {
                "query": {
                    "query_string": {
                        "query": "Barack Obama"
                    }
                }
            }
        }
    })
Community
Elasticsearch Summary
• Scales horizontally
• Redundancy
• Configures itself automatically
• Developer friendly
Analytics and elasticsearch
• Date Histograms
• Statistical facets
• Geospatial queries
• All with arbitrary search parameters
• Again: Fast
Use Case: Social Media Analysis
• Use social media APIs to search for data on a topic of interest
• 100MM documents indexed
• Sentiment analysis
• Location extraction (“Geotagging”)
Sample Document
    es.post('articles/facebook', data={
        "date": "2012-09-01 08:37:55",
        "tags": {
            "sentiment": {
                "positive": 0.36,
                "negative": 0.10
            },
            "geotags": [{
                "term": "Cairo",
                "location": "30.0566,31.2262",
                "type": "geo_point"
            }],
            "search_terms": ["Mohamed Morsi"]
        },
        "item": {
            "publisher": "Facebook",
            "source_domain": "www.facebook.com",
            "author": "James Smith",
            "source_url": "http://www.facebook.com/5551231234/posts/414141414141",
            "content_text": "Mohamed Morsi visits Iran for first time since 1979 ....",
            "title": "James Smith posted a note to Facebook",
            "author_url": "http://www.facebook.com/profile.php?id=5551231234"
        }
    })
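For the geospatial queries to work, the location field has to be declared as a geo_point in the index mapping, since elasticsearch does not detect that type on its own. A hedged sketch of such a mapping for the facebook document type follows. The field layout is taken from the sample document, while the "string" type for the term field and the helper name are assumptions.

```python
import json


def geotag_mapping(doc_type):
    """Mapping sketch that declares tags.geotags.location as a geo_point."""
    return {
        doc_type: {
            "properties": {
                "tags": {
                    "properties": {
                        "geotags": {
                            "properties": {
                                "term": {"type": "string"},  # assumed type
                                "location": {"type": "geo_point"},
                            }
                        }
                    }
                }
            }
        }
    }


mapping = geotag_mapping("facebook")
mapping_json = json.dumps(mapping, indent=2)  # body for a put-mapping request
```

A "lat,lon" string such as "30.0566,31.2262" is one of the accepted input formats for a geo_point field.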
Analytical Queries
Date Histogram for Sentiment
    es.get('articles/_search', data={
        "query": {
            "query_string": {
                "query": "Mohamed Morsi"
            }
        },
        "facets": {
            "sentiment_histogram": {
                "date_histogram": {
                    "key_field": "date_of_information.$date",
                    "value_field": "tags.sentiment.positive",
                    "interval": "day"
                }
            }
        }
    })
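What the date histogram facet computes can be sketched in plain Python: bucket the matching documents by day (the key field) and aggregate the positive sentiment (the value field) per bucket. This is only a local illustration of the idea. It uses the date field from the sample document, and apart from the first document the sample values are invented.

```python
from collections import defaultdict
from datetime import datetime


def daily_sentiment(docs):
    """Group documents by calendar day and summarise positive sentiment."""
    buckets = defaultdict(list)
    for doc in docs:
        day = datetime.strptime(doc["date"], "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(doc["tags"]["sentiment"]["positive"])
    return [
        {
            "day": day.isoformat(),
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for day, values in sorted(buckets.items())
    ]


docs = [
    {"date": "2012-09-01 08:37:55", "tags": {"sentiment": {"positive": 0.36}}},
    {"date": "2012-09-01 19:02:10", "tags": {"sentiment": {"positive": 0.12}}},  # invented
    {"date": "2012-09-02 07:15:00", "tags": {"sentiment": {"positive": 0.25}}},  # invented
]
histogram = daily_sentiment(docs)
```

Elasticsearch does this bucketing across all shards at query time, so the client only receives one entry per day.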
Date Histogram for Sentiment
Statistical Facet for Sentiment: Query
    es.get('articles/_search', data={
        "query": {
            "query_string": {
                "query": "Mohamed Morsi"
            }
        },
        "facets": {
            "sentiment_stats": {
                "statistical": {
                    "field": "tags.sentiment.positive"
                }
            }
        }
    })
Statistical Facet for Sentiment: Result
    {
        "facets": {
            "sentiment_stats": {
                "_type": "statistical",
                "count": 8825,
                "max": 0.375,
                "mean": 0.008503991588291782,
                "min": 0.0,
                "std_deviation": 0.021251077265305472,
                "sum_of_squares": 4.623648343200283,
                "total": 75.04772576667497,
                "variance": 0.00045160828493598306
            }
        },
        "hits": {
            "hits": [],
            "max_score": 1.1120162,
            "total": 8825
        },
        "took": 60
    }
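The fields in this result are related in the usual way: the mean is total divided by count, and the variance is the mean of the squares minus the square of the mean. A short sketch that recomputes the derived values from count, total and sum_of_squares, using the numbers from the result above (derived_stats is a name made up for this example):

```python
import math


def derived_stats(count, total, sum_of_squares):
    """Recompute mean, variance and standard deviation from the raw sums."""
    mean = total / count
    variance = sum_of_squares / count - mean ** 2
    return {
        "mean": mean,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


# Values copied from the statistical facet result.
facet = {
    "count": 8825,
    "total": 75.04772576667497,
    "sum_of_squares": 4.623648343200283,
    "mean": 0.008503991588291782,
    "variance": 0.00045160828493598306,
    "std_deviation": 0.021251077265305472,
}
derived = derived_stats(facet["count"], facet["total"], facet["sum_of_squares"])
```

Because only count, total and sum of squares are needed, each shard can presumably report those three numbers and the results can be combined cheaply.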
Top Keywords
    es.get('articles/_search', data={
        "query": {
            "match_all": {}
        },
        "facets": {
            "search_terms": {
                "terms": {
                    "field": "tags.search_terms",
                    "size": 3
                }
            }
        }
    })
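Conceptually, the terms facet counts how many matching documents carry each value of the field and returns the most frequent ones. A local sketch of that counting, with invented sample documents and a hypothetical top_terms helper:

```python
from collections import Counter


def top_terms(docs, size=3):
    """Return the `size` most frequent search terms with their document counts."""
    counts = Counter()
    for doc in docs:
        # set() so a term is counted once per document
        counts.update(set(doc["tags"]["search_terms"]))
    return [{"term": term, "count": count} for term, count in counts.most_common(size)]


# Invented sample data for illustration.
docs = [
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama", "Cairo", "Iran"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama", "Cairo"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama"]}},
    {"tags": {"search_terms": ["Mohamed Morsi"]}},
]
top = top_terms(docs, size=3)
```

Swapping match_all for a real query narrows the counts to the documents that match it.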
Top Search Terms
Geospatial search
    es.get('articles/_search', data={
        "query": {
            "filtered": {
                "filter": {
                    "geo_distance": {
                        "distance": "20km",
                        "tags.geotags.location": {
                            "lat": 30,
                            "lon": 31
                        }
                    }
                }
            }
        }
    })
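The geo_distance filter keeps documents whose location lies within the given distance of a centre point. The sketch below approximates that check with the haversine formula. Elasticsearch's own distance calculation may differ, and the function names are chosen for this example only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(distance_km, center, point):
    """True if point lies within distance_km of center (both are (lat, lon))."""
    return haversine_km(center[0], center[1], point[0], point[1]) <= distance_km


center = (30.0, 31.0)  # lat/lon used in the query
# "lat,lon" string from the sample document's Cairo geotag
cairo = tuple(float(part) for part in "30.0566,31.2262".split(","))
distance = haversine_km(center[0], center[1], cairo[0], cairo[1])
```

By this estimate the Cairo geotag from the sample document sits a little over 22 km from the query centre, so it would fall just outside a 20km radius and inside a 25km one.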
Questions