Page 1: Scaling Analytics with elasticsearch

Scaling Analytics with elasticsearch

Dan Noble @dwnoble

Page 2: Scaling Analytics with elasticsearch

Background

• Technologist at The HumanGeo
• We use elasticsearch to build social media analysis tools
• 100MM documents indexed
• 600GB+ index size
• Author of the Python elasticsearch driver “rawes”

https://github.com/humangeo/rawes

Page 3: Scaling Analytics with elasticsearch

Overview

• What is elasticsearch?
• Scaling with elasticsearch
• How can I use elasticsearch to help with analytics?
• Use Case: Social Media Analytics

Page 4: Scaling Analytics with elasticsearch

What is elasticsearch?

Page 5: Scaling Analytics with elasticsearch

Search Engine

• Open source
• Distributed
• Automatic failover
• Crazy fast

Page 6: Scaling Analytics with elasticsearch

Search Engine

• Actively maintained
• REST API
• JSON messages
• Lucene based

Page 7: Scaling Analytics with elasticsearch

Search

• Simple case: one host
• One index containing a set of articles

[Diagram: an Elasticsearch “cluster” of one host holding the index “Articles”]

Page 8: Scaling Analytics with elasticsearch

Distributed Search

• Too much data?
• Add another host
• Indices can be broken up into “shards” and live on different machines

[Diagram: an Elasticsearch “cluster” of two hosts; one host holds shard Articles (a), the other holds shard Articles (b)]
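
How a document ends up on a particular shard can be sketched as hashing a routing value (by default the document ID) modulo the number of shards. The snippet below is a minimal illustration, not Elasticsearch's implementation: `zlib.crc32` stands in for the real hash function, and `shard_for` and the article IDs are hypothetical names.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number (toy stand-in for real routing)."""
    # crc32 is an assumption for illustration; Elasticsearch uses its own hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Hypothetical article IDs spread across the two shards "a" and "b".
shard_names = ["a", "b"]
for doc_id in ["article-1", "article-2", "article-3", "article-4"]:
    shard = shard_names[shard_for(doc_id, len(shard_names))]
    print(f"{doc_id} -> Articles ({shard})")
```

The same ID always maps to the same shard as long as the shard count stays fixed, so no central lookup table is needed to find a document.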

Page 9: Scaling Analytics with elasticsearch

Redundancy

• Shards can be replicated to improve availability

[Diagram: an Elasticsearch cluster of two hosts; each host holds one of the shards Articles (a) and Articles (b) plus a replica of the other shard]
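
Shard and replica counts are set per index. The sketch below builds a settings body using the `number_of_shards` and `number_of_replicas` index settings and counts the resulting shard copies. The values 2 and 1 are assumptions chosen to mirror the two-shard, one-replica layout on this slide, and `total_shard_copies` is a hypothetical helper.

```python
import json

# Assumed values mirroring the slide: two shards, each replicated once.
index_settings = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
    }
}


def total_shard_copies(settings: dict) -> int:
    """Primaries plus replicas: shards * (1 + replicas)."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


print(json.dumps(index_settings, indent=2))
print("shard copies in the cluster:", total_shard_copies(index_settings))
```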

Page 10: Scaling Analytics with elasticsearch

Node Auto Discovery

• Say we add a third host
• elasticsearch will automatically start moving shards to this new host to distribute load

[Diagram: an Elasticsearch cluster of three hosts; copies of shards Articles (a) and Articles (b) are moved onto the new third host]

Page 11: Scaling Analytics with elasticsearch

Failover

• Say a host goes down
• Shards on that host are no longer available for search
• Elasticsearch automatically rebuilds these two shards on other hosts

[Diagram: an Elasticsearch cluster of three hosts with one host down; the copies of Articles (a) and Articles (b) that lived on it are rebuilt on the remaining hosts]
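
The rebalancing and failover behaviour on the last two slides can be mimicked with a toy allocator. This is only a sketch under simplifying assumptions: `allocate` and the host names are hypothetical, and it recomputes the whole layout from scratch, whereas a real cluster moves or rebuilds only the affected shard copies. The one rule it does enforce is that a host never holds two copies of the same shard.

```python
def allocate(hosts, shards, replicas):
    """Toy allocator: spread each shard's copies over distinct hosts."""
    placement = {host: [] for host in hosts}
    for shard in shards:
        for _ in range(1 + replicas):
            # A host may hold at most one copy of a given shard.
            candidates = [h for h in hosts if shard not in placement[h]]
            if not candidates:
                break  # fewer hosts than copies: the extra copy stays unassigned
            # Prefer the least-loaded host to balance the cluster.
            target = min(candidates, key=lambda h: len(placement[h]))
            placement[target].append(shard)
    return placement


hosts = ["host-1", "host-2", "host-3"]
before = allocate(hosts, ["a", "b"], replicas=1)
print("three hosts:", before)

# host-3 goes down: re-run the allocation over the surviving hosts.
survivors = [h for h in hosts if h != "host-3"]
after = allocate(survivors, ["a", "b"], replicas=1)
print("after failure:", after)
```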

Page 12: Scaling Analytics with elasticsearch

Querying

[Diagram: a client (web application) searches for articles with the query “Barack Obama”. It can query against any host of the three-host Elasticsearch cluster; that host sends the request to other shards if needed.]

Page 13: Scaling Analytics with elasticsearch

REST API

• JSON query syntax
• Developer friendly
• Easy to get started

Page 14: Scaling Analytics with elasticsearch

Python Example

import rawes
es = rawes.Elastic('elastic-00:9200')

es.get('articles/_search', data={
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "Barack Obama"
                }
            }
        }
    }
})

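Because the REST API is plain HTTP with JSON bodies, the same kind of search can be expressed without a client library. The sketch below uses only the Python standard library and builds the request without sending it. `build_search_request` is a hypothetical helper, the host name is taken from the example above, and the body uses a bare `query_string` query as an assumption.

```python
import json
import urllib.request


def build_search_request(host: str, index: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a JSON search request for the REST API."""
    body = {"query": {"query_string": {"query": text}}}
    return urllib.request.Request(
        url=f"http://{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("elastic-00:9200", "articles", "Barack Obama")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending it would need a reachable cluster:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```
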
Page 15: Scaling Analytics with elasticsearch

Community

Page 16: Scaling Analytics with elasticsearch

Elasticsearch Summary

• Scales horizontally
• Redundancy
• Configures itself automatically
• Developer friendly

Page 17: Scaling Analytics with elasticsearch

Analytics and elasticsearch

• Date Histograms
• Statistical facets
• Geospatial queries
• All with arbitrary search parameters
• Again: Fast

Page 18: Scaling Analytics with elasticsearch

Use Case: Social Media Analysis

• Use social media APIs to search for data on a topic of interest
• 100MM documents indexed
• Sentiment analysis
• Location extraction (“Geotagging”)

Page 19: Scaling Analytics with elasticsearch

Sample Document

es.post('articles/facebook', data={
    "date": "2012-09-01 08:37:55",
    "tags": {
        "sentiment": {
            "positive": 0.36,
            "negative": 0.10
        },
        "geotags": [{
            "term": "Cairo",
            "location": "30.0566,31.2262",
            "type": "geo_point"
        }],
        "search_terms": [
            "Mohamed Morsi"
        ]
    },
    "item": {
        "publisher": "Facebook",
        "source_domain": "www.facebook.com",
        "author": "James Smith",
        "source_url": "http://www.facebook.com/5551231234/posts/414141414141",
        "content_text": "Mohamed Morsi visits Iran for first time since 1979 ....",
        "title": "James Smith posted a note to Facebook",
        "author_url": "http://www.facebook.com/profile.php?id=5551231234"
    }
})

Page 20: Scaling Analytics with elasticsearch

Analytical Queries

Page 21: Scaling Analytics with elasticsearch

Date Histogram for Sentiment

es.get('articles/_search', data={
    "query": {
        "query_string": {
            "query": "Mohamed Morsi"
        }
    },
    "facets": {
        "sentiment_histogram": {
            "date_histogram": {
                "key_field": "date_of_information.$date",
                "value_field": "tags.sentiment.positive",
                "interval": "day"
            }
        }
    }
})
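
What a per-day histogram over a value field computes can be sketched in plain Python: group documents by calendar day, then aggregate the sentiment values in each bucket. This is an illustration of the idea rather than the facet's actual output format; `daily_sentiment` is a hypothetical helper, and the documents are made up in the shape of the sample document (using its `date` field).

```python
from collections import defaultdict
from datetime import datetime


def daily_sentiment(docs):
    """Group documents by calendar day and average their positive sentiment."""
    buckets = defaultdict(list)
    for doc in docs:
        day = datetime.strptime(doc["date"], "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(doc["tags"]["sentiment"]["positive"])
    return {
        day.isoformat(): {
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for day, values in sorted(buckets.items())
    }


# Made-up documents in the shape of the sample document.
docs = [
    {"date": "2012-09-01 08:37:55", "tags": {"sentiment": {"positive": 0.25}}},
    {"date": "2012-09-01 17:02:10", "tags": {"sentiment": {"positive": 0.75}}},
    {"date": "2012-09-02 09:15:00", "tags": {"sentiment": {"positive": 0.5}}},
]
for day, stats in daily_sentiment(docs).items():
    print(day, stats)
```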

Page 22: Scaling Analytics with elasticsearch

Date Histogram for Sentiment

Page 23: Scaling Analytics with elasticsearch

Statistical Facet for Sentiment: Query

es.get('articles/_search', data={
    "query": {
        "query_string": {
            "query": "Mohamed Morsi"
        }
    },
    "facets": {
        "sentiment_stats": {
            "statistical": {
                "field": "tags.sentiment.positive"
            }
        }
    }
})

Page 24: Scaling Analytics with elasticsearch

Statistical Facet for Sentiment: Result

{
    "facets": {
        "sentiment_stats": {
            "_type": "statistical",
            "count": 8825,
            "max": 0.375,
            "mean": 0.008503991588291782,
            "min": 0.0,
            "std_deviation": 0.021251077265305472,
            "sum_of_squares": 4.623648343200283,
            "total": 75.04772576667497,
            "variance": 0.00045160828493598306
        }
    },
    "hits": {
        "hits": [],
        "max_score": 1.1120162,
        "total": 8825
    },
    "took": 60
}
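
The numbers in this result are consistent with the usual relationships between the fields: the mean is `total / count`, the variance is the population variance `sum_of_squares / count - mean²`, and the standard deviation is its square root. The short check below recomputes them from the reported `count`, `total` and `sum_of_squares`; `derive_stats` is a hypothetical helper, not part of any client library.

```python
import math

# Figures copied from the facet result on this slide.
reported = {
    "count": 8825,
    "total": 75.04772576667497,
    "sum_of_squares": 4.623648343200283,
    "mean": 0.008503991588291782,
    "variance": 0.00045160828493598306,
    "std_deviation": 0.021251077265305472,
}


def derive_stats(count, total, sum_of_squares):
    """Recompute mean, population variance and standard deviation."""
    mean = total / count
    variance = sum_of_squares / count - mean ** 2
    return {
        "mean": mean,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


derived = derive_stats(
    reported["count"], reported["total"], reported["sum_of_squares"]
)
for name, value in derived.items():
    print(f"{name}: derived={value:.12f} reported={reported[name]:.12f}")
```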

Page 25: Scaling Analytics with elasticsearch

Top Keywords

es.get('articles/_search', data={
    "query": {
        "match_all": {}
    },
    "facets": {
        "search_terms": {
            "terms": {
                "field": "tags.search_terms",
                "size": 3
            }
        }
    }
})
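
Conceptually, this query counts how often each value of `tags.search_terms` occurs and returns the most frequent ones. A plain-Python equivalent is sketched below, assuming each search term is treated as one whole value rather than split into words. `top_terms` is a hypothetical helper and the documents are made up.

```python
from collections import Counter


def top_terms(docs, size=3):
    """Return the `size` most frequent search terms with their counts."""
    counts = Counter(
        term for doc in docs for term in doc["tags"]["search_terms"]
    )
    return counts.most_common(size)


# Made-up documents carrying only the field the facet looks at.
docs = [
    {"tags": {"search_terms": ["Mohamed Morsi"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama"]}},
    {"tags": {"search_terms": ["Barack Obama"]}},
    {"tags": {"search_terms": ["Mohamed Morsi"]}},
    {"tags": {"search_terms": ["Cairo"]}},
]
for term, count in top_terms(docs, size=3):
    print(f"{term}: {count}")
```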

Page 26: Scaling Analytics with elasticsearch

Top Search Terms

Page 27: Scaling Analytics with elasticsearch

Geospatial search

es.get('articles/_search', data={
    "query": {
        "filtered": {
            "filter": {
                "geo_distance": {
                    "distance": "20km",
                    "tags.geotags.location": {
                        "lat": 30,
                        "lon": 31
                    }
                }
            }
        }
    }
})
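
A distance filter like this keeps only documents whose location lies within the given radius of a centre point. The sketch below approximates that test with the haversine great-circle formula on a spherical Earth. It is an illustration of the idea, not the engine's own distance calculation, and `haversine_km` and `within_distance` are hypothetical helpers. The location string is the "lat,lon" value from the sample document.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(location, center, distance_km):
    """True if a "lat,lon" string lies within distance_km of center."""
    lat, lon = (float(part) for part in location.split(","))
    return haversine_km(lat, lon, center["lat"], center["lon"]) <= distance_km


center = {"lat": 30, "lon": 31}
cairo = "30.0566,31.2262"  # geotag location from the sample document
print("distance in km:", haversine_km(30.0566, 31.2262, 30, 31))
print("within 20km:", within_distance(cairo, center, 20))
```

Under this approximation the sample Cairo geotag comes out slightly beyond the 20km radius, so that document would not match the filter above unless the distance were widened.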

Page 28: Scaling Analytics with elasticsearch

Questions

