Scaling Analytics with elasticsearch

Post on 28-Nov-2014

TRANSCRIPT

  • 1. Scaling Analytics with elasticsearch Dan Noble @dwnoble
  • 2. Background Technologist at The HumanGeo We use elasticsearch to build social media analysis tools 100MM documents indexed 600GB+ index size Author of Python elasticsearch driver rawes https://github.com/humangeo/rawes
  • 3. Overview What is elasticsearch? Scaling with elasticsearch How can I use elasticsearch to help with analytics? Use Case: Social Media Analytics
  • 4. What is elasticsearch?
  • 5. Search Engine Open source Distributed Automatic failover Crazy fast
  • 6. Search Engine Actively maintained REST API JSON messages Lucene based
  • 7. Search Elasticsearch Cluster Host Index: Articles Simple case: one host One index containing a set of articles
  • 8. Distributed Search Elasticsearch Cluster Host Host Articles (a) Articles (b) Too much data? Add another host Indices can be broken up into shards and live on different machines
  • 9. Redundancy Elasticsearch Cluster Host Host Articles (a) Articles (b) Articles (b) Articles (a) Shards can be replicated to improve availability
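The shard and replica layout on the slides above can be sketched as an index settings body. This is a minimal sketch, assuming the two-shard "Articles (a)" / "Articles (b)" layout from the diagrams; `number_of_shards` and `number_of_replicas` are standard elasticsearch index settings, while the helper names `index_settings` and `total_shard_copies` are hypothetical and only for illustration.

```python
import json


def index_settings(primary_shards, replicas_per_shard):
    """Build a settings body that could be sent when creating an index."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas_per_shard,
        }
    }


def total_shard_copies(primary_shards, replicas_per_shard):
    """Each primary shard exists once, plus one copy per replica."""
    return primary_shards * (1 + replicas_per_shard)


# Two primary shards with one replica each, as on the Redundancy slide.
body = index_settings(2, 1)
print(json.dumps(body, indent=2))
print(total_shard_copies(2, 1))
```

With one replica per shard the cluster holds four shard copies in total; the three-host diagram, which shows three copies of each shard, would correspond to two replicas per shard.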
  • 10. Node Auto Discovery Elasticsearch Cluster Host Host Host Articles (a) Articles (b) Articles (b) Articles (b) Articles (a) Articles (a) Say we add a third host: elasticsearch will automatically start moving shards to this new host to distribute load
  • 11. Failover Elasticsearch Cluster Host Host Host Articles (a) Articles (b) Articles (b) Articles (b) Articles (a) Articles (a) Say a host goes down: shards on that host are no longer available for search. Elasticsearch automatically rebuilds these two shards on other hosts
  • 12. Querying Elasticsearch Cluster Host Host Host Articles (a) Articles (b) Articles (b) Articles (a) Client (Web Application) Search for articles Query: Barack Obama Can query against any host Send request to other shards if needed
  • 13. REST API JSON query syntax Developer friendly Easy to get started
  • 14. Python Example
        import rawes
        es = rawes.Elastic('elastic-00:9200')
        es.get('articles/_search', data={
            "query": {
                "filtered": {
                    "query": {
                        "query_string": {"query": "Barack Obama"}
                    }
                }
            }
        })
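Because the REST API only needs JSON over HTTP, a similar search can be prepared without a driver. The following is a minimal sketch using the Python standard library, assuming the host name `elastic-00:9200` and the `articles` index from the slide; the helper name `build_search_request` is hypothetical, the query is a plain `query_string` without the `filtered` wrapper, and the request is sent as a POST with a JSON body. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(host, index, query_text):
    """Prepare (but do not send) a JSON search request for one index."""
    body = {"query": {"query_string": {"query": query_text}}}
    url = "http://{}/{}/_search".format(host, index)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("elastic-00:9200", "articles", "Barack Obama")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it to a running cluster; that step is left out so the sketch runs without one.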
  • 15. Community
  • 16. Elasticsearch Summary Scales horizontally Redundancy Configures itself automatically Developer friendly
  • 17. Analytics and elasticsearch Date Histograms Statistical facets Geospatial queries All with arbitrary search parameters Again: Fast
  • 18. Use Case: Social Media Analysis Use social media APIs to search for data on a topic of interest 100MM documents indexed Sentiment analysis Location extraction (Geotagging)
  • 19. Sample Document
        es.post('articles/facebook', data={
            "date": "2012-09-01 08:37:55",
            "tags": {
                "sentiment": {"positive": 0.36, "negative": 0.10},
                "geotags": [{
                    "term": "Cairo",
                    "location": "30.0566,31.2262"  # mapped as type geo_point
                }],
                "search_terms": ["Mohamed Morsi"]
            },
            "item": {
                "publisher": "Facebook",
                "source_domain": "www.facebook.com",
                "author": "James Smith",
                "source_url": "http://www.facebook.com/5551231234/posts/414141414141",
                "content_text": "Mohamed Morsi visits Iran for first time since 1979 ....",
                "title": "James Smith posted a note to Facebook",
                "author_url": "http://www.facebook.com/profile.php?id=5551231234"
            }
        })
  • 20. Analytical Queries
  • 21. Date Histogram for Sentiment
        es.get('articles/_search', data={
            "query": {
                "query_string": {"query": "Mohamed Morsi"}
            },
            "facets": {
                "sentiment_histogram": {
                    "date_histogram": {
                        "key_field": "date_of_information.$date",
                        "value_field": "tags.sentiment.positive",
                        "interval": "day"
                    }
                }
            }
        })
  • 22. Date Histogram for Sentiment
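Conceptually, a date histogram with a value field groups the matching documents by day and summarizes the value inside each bucket. The sketch below reproduces that idea locally in plain Python. It is an illustration under assumptions, not the facet's actual implementation or output format: the function name `daily_sentiment` and the sample documents are hypothetical, and the documents are keyed by the `date` field used in the sample document slide.

```python
from collections import defaultdict
from datetime import datetime


def daily_sentiment(docs):
    """Bucket documents by calendar day and summarize positive sentiment."""
    buckets = defaultdict(list)
    for doc in docs:
        day = datetime.strptime(doc["date"], "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(doc["tags"]["sentiment"]["positive"])
    return [
        {
            "day": day.isoformat(),
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for day, values in sorted(buckets.items())
    ]


# Hypothetical documents shaped like the sample document.
docs = [
    {"date": "2012-09-01 08:37:55", "tags": {"sentiment": {"positive": 0.36}}},
    {"date": "2012-09-01 17:02:10", "tags": {"sentiment": {"positive": 0.12}}},
    {"date": "2012-09-02 09:15:00", "tags": {"sentiment": {"positive": 0.25}}},
]
result = daily_sentiment(docs)
for bucket in result:
    print(bucket)
```

The cluster performs the same kind of grouping across all shards, which is why it can be combined with arbitrary search parameters.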
  • 23. Statistical Facet for Sentiment: Query
        es.get('articles/_search', data={
            "query": {
                "query_string": {"query": "Mohamed Morsi"}
            },
            "facets": {
                "sentiment_stats": {
                    "statistical": {"field": "tags.sentiment.positive"}
                }
            }
        })
  • 24. Statistical Facet for Sentiment: Result
        {
            "facets": {
                "sentiment_stats": {
                    "_type": "statistical",
                    "count": 8825,
                    "max": 0.375,
                    "mean": 0.008503991588291782,
                    "min": 0.0,
                    "std_deviation": 0.021251077265305472,
                    "sum_of_squares": 4.623648343200283,
                    "total": 75.04772576667497,
                    "variance": 0.00045160828493598306
                }
            },
            "hits": {
                "hits": [],
                "max_score": 1.1120162,
                "total": 8825
            },
            "took": 60
        }
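The fields in this result are related: the mean, variance, and standard deviation can be recovered from `count`, `total`, and `sum_of_squares` alone. The sketch below checks that relationship against the numbers on the slide, assuming the variance is the population variance (mean of squares minus the squared mean), which is what the reported values are consistent with. The helper name `derive_stats` is hypothetical.

```python
import math


def derive_stats(count, total, sum_of_squares):
    """Recover mean, population variance, and standard deviation."""
    mean = total / count
    variance = sum_of_squares / count - mean ** 2
    return {
        "mean": mean,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


# Values copied from the statistical facet result.
stats = derive_stats(
    count=8825,
    total=75.04772576667497,
    sum_of_squares=4.623648343200283,
)
print(stats)
```

Because only sums and counts are needed, each shard can compute its partial sums independently and the results can be merged cheaply.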
  • 25. Top Keywords
        es.get('articles/_search', data={
            "query": {"match_all": {}},
            "facets": {
                "search_terms": {
                    "terms": {
                        "field": "tags.search_terms",
                        "size": 3
                    }
                }
            }
        })
  • 26. Top Search Terms
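A terms facet amounts to counting how many matching documents carry each term and returning the most frequent ones. The following local sketch shows that counting step in plain Python. The function name `top_terms` and the sample documents are hypothetical, and this is an illustration of the idea rather than the facet's actual implementation.

```python
from collections import Counter


def top_terms(docs, size=3):
    """Count documents per search term and return the most frequent."""
    counts = Counter()
    for doc in docs:
        # A set ensures each document counts a term at most once.
        counts.update(set(doc["tags"]["search_terms"]))
    return counts.most_common(size)


# Hypothetical documents shaped like the sample document.
docs = [
    {"tags": {"search_terms": ["Mohamed Morsi"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama", "Cairo"]}},
]
print(top_terms(docs, size=2))
```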
  • 27. Geospatial search
        es.get('articles/_search', data={
            "query": {
                "filtered": {
                    "filter": {
                        "geo_distance": {
                            "distance": "20km",
                            "tags.geotags.location": {"lat": 30, "lon": 31}