Real-time analytics using Hadoop and Elasticsearch

ABHISHEK ANDHAVARAPU

Thank you Sponsors!

About Me

• Currently working as a Software Engineer (Data Platform) at Allegiance Software Inc.

• Passionate about distributed systems and data visualization.

• Master's in Distributed Systems.

• abhishek376.wordpress.com

Agenda

Use Case.

Architecture.

Elasticsearch 101.

Lessons learnt.

Legacy Architecture

Current Architecture

Why Hadoop ?

Elasticsearch 101

• Document-oriented, JSON-based search engine with Apache Lucene under the covers.

• Schema free.

• It's distributed and supports aggregations, similar to GROUP BY in SQL.

• Uses bit sets to cache filters efficiently.

• It's fast. Super fast.

• It has REST and Java-based APIs.

Elasticsearch CRUD

Index a person:

curl -XPUT 'localhost:9200/person/1' -d '{
  "first_name" : "Abhishek",
  "last_name" : "Andhavarapu"
}'

Get a person:

curl -XGET 'localhost:9200/person/1'

Delete a person:

curl -XDELETE 'localhost:9200/person/1'

Update a person:

curl -XPOST 'localhost:9200/person/1/_update' -d '{
  "doc" : {
    "first_name" : "Abhi"
  }
}'

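Elasticsearch data

[Diagram: data spread across Node1 and Node2]

Replicas

[Diagram: Node1 and Node2. Red = primary, blue = replica]

More nodes..

[Diagram: Node1, Node2, Node3 and Node4. Red = primary, blue = replica]

Node down

[Diagram: Node1, Node2, Node3 and Node4, with a node going down. Red = primary, blue = replica]

Node down

[Diagram: Node3 and Node4. Replicas are promoted to primary and then re-replicated. Red = primary, blue = replica]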

Elasticsearch 101

• Lucene is under the covers.

• Each index (like a database) is made up of multiple shards (each shard is a Lucene instance).

• Shards are distributed among all the nodes in the cluster.

• In case of failure or the addition of new nodes, shards are automatically moved from one node to another.
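
The failover behaviour described above (replica promoted to primary, then re-replicated) can be sketched as a toy model. This is not Elasticsearch's actual allocation algorithm; the function names, the round-robin placement, and the "least loaded survivor" rule are all assumptions made for illustration.

```python
def allocate(num_shards, nodes):
    """Place one primary and one replica per shard on different nodes (round-robin)."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to hold a primary and a replica")
    placement = {}
    for shard in range(num_shards):
        placement[shard] = {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
    return placement


def _load(placement, node):
    """Number of shard copies (primary or replica) currently held by `node`."""
    return sum(node in (c["primary"], c["replica"]) for c in placement.values())


def fail_node(placement, nodes, dead):
    """Promote replicas whose primary was on `dead`, then re-replicate lost copies."""
    survivors = [n for n in nodes if n != dead]
    for copies in placement.values():
        if copies["primary"] == dead:
            # The surviving replica takes over as primary.
            copies["primary"] = copies["replica"]
            copies["replica"] = None
        elif copies["replica"] == dead:
            copies["replica"] = None
        if copies["replica"] is None:
            # Rebuild the missing replica on the least loaded node
            # that does not already hold this shard's primary.
            candidates = [n for n in survivors if n != copies["primary"]]
            if candidates:
                copies["replica"] = min(candidates, key=lambda n: _load(placement, n))
    return placement
```

With four hypothetical nodes and two shards, calling `fail_node(allocate(2, nodes), nodes, "node2")` leaves every shard with a primary and a replica on two different surviving nodes.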

How is it fast?

Distributed execution

[Diagram: Client, Node 1 and Node 2, each node holding shards S0 and S1. Red = primary, blue = replica]
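
Distributed execution can be pictured as scatter-gather: every shard computes a partial result over its own documents, and the node that received the request merges them. The sketch below is a simplified stand-in for a GROUP BY style count, not Elasticsearch's implementation; the function names and sample documents are hypothetical, and the shards are processed in a plain loop rather than in parallel.

```python
from collections import Counter


def shard_terms_count(docs, field):
    """Runs per shard: count the values of `field` in that shard's local documents."""
    return Counter(doc[field] for doc in docs if field in doc)


def coordinate(shards, field, size=10):
    """Runs on the receiving node: fan out to every shard, then merge the partial counts."""
    merged = Counter()
    for docs in shards:  # a real cluster would query the shards concurrently
        merged.update(shard_terms_count(docs, field))
    return merged.most_common(size)


# Two hypothetical shards, S0 and S1, each holding part of the index.
s0 = [{"country": "US"}, {"country": "DE"}]
s1 = [{"country": "US"}]
top = coordinate([s0, s1], "country")
```

Because each shard only scans its own slice of the data, adding shards (and nodes to host them) spreads the work of a single query.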

• Import data from the SQL database into Hive. (Extract)

• Run the necessary computations using Hadoop/Hive. (Transform)

• Push the data into Elasticsearch. (Load)

• Run queries against Elasticsearch.
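
The slides do not say how the load step is implemented. One option is the Elasticsearch bulk API, which accepts newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end. The sketch below only builds that request body; the index name, type name, and row fields are hypothetical, and the `_type` key applies to older Elasticsearch versions such as the ones current when this talk was given.

```python
import json


def bulk_body(rows, index, doc_type, id_field):
    """Build a newline-delimited bulk request body that indexes every row."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index, "_type": doc_type, "_id": row[id_field]}}
        lines.append(json.dumps(action))  # action line: where the document goes
        lines.append(json.dumps(row))     # source line: the document itself
    if not lines:
        return ""
    return "\n".join(lines) + "\n"        # the bulk format requires a final newline


# Hypothetical rows, as they might come out of the Hive transform step.
rows = [{"id": 1, "score": 9}, {"id": 2, "score": 7}]
body = bulk_body(rows, "responses", "response", "id")
```

The resulting string would be sent as the body of a POST to the `_bulk` endpoint, so many documents travel in a single request.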

Current Elasticsearch Cluster

• 9 bare metal boxes

• 128 GB RAM

• 2X SSD

• 10 Gb Ethernet

• 2X 10 core Xeon Processors

• 2X 30GB Elasticsearch instances per box

• 1 Elasticsearch load balancing instance to handle index requests

Zabbix

What’s slow ?

Any request that takes more than 300ms is slow

Lessons Learnt

Concurrency

• More replicas for more concurrency. Updates are costly.

• More shards, much faster.

• SQL: 3 to 5k per minute

Filter Cache

• All filters have a cache flag that controls whether they are cached.

• Once the filter cache is warmed, all requests are served from memory.

• Default: 10% for the filter cache.

• LRU eviction.

• Bit sets.
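
The combination of bit sets and LRU eviction can be illustrated with a toy cache. This is a sketch of the idea only, not Elasticsearch's data structures: the class name, the use of a Python integer as the bit set, and the entry-count limit (instead of a percentage of memory) are assumptions.

```python
from collections import OrderedDict


class FilterCache:
    """Toy LRU cache mapping a filter (field, value) to a bit set of matching documents."""

    def __init__(self, docs, max_entries):
        self.docs = docs
        self.max_entries = max_entries
        self.cache = OrderedDict()
        self.misses = 0

    def bitset(self, field, value):
        key = (field, value)
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        self.misses += 1
        bits = 0
        for doc_id, doc in enumerate(self.docs):
            if doc.get(field) == value:
                bits |= 1 << doc_id          # bit i is set when document i matches
        self.cache[key] = bits
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return bits


def matching_ids(bits):
    """Return the document ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]
```

Once two filters are cached, combining them is a single bitwise AND of their bit sets, which is why a warmed cache answers filtered requests without touching the documents again.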

Field Data

• For sorting, aggregations, etc., all the field values are loaded into memory; this is called field data.

• By default it is unbounded.

• Field data is expensive to build, so it is recommended to keep it in memory.

• There are circuit breakers to protect against this.

• If a query is going to use more than 60% of the JVM heap, the circuit breaker kills the query.
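
The 60% rule above amounts to a simple check before field data is loaded. The sketch below shows only that arithmetic; the function and exception names are hypothetical, and how the memory estimate is produced is left out entirely.

```python
class CircuitBreakingError(Exception):
    """Raised when loading more field data would cross the heap limit."""


def check_fielddata(heap_bytes, loaded_bytes, estimated_bytes, limit_percent=60):
    """Return the new field data total, or raise if it would exceed the limit."""
    budget = heap_bytes * limit_percent // 100   # e.g. 60% of a 30 GB heap is 18 GB
    total = loaded_bytes + estimated_bytes
    if total > budget:
        raise CircuitBreakingError(
            f"field data would reach {total} bytes, over the limit of {budget} bytes"
        )
    return total


GB = 1024 ** 3
# With a 30 GB heap and 10 GB already loaded, another 5 GB fits under the limit.
new_total = check_fielddata(30 * GB, 10 * GB, 5 * GB)
```

The same call with 9 GB of additional field data would raise, because 19 GB exceeds the 18 GB budget. The query is rejected up front rather than being allowed to exhaust the heap.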

JVM memory - Friend or Foe ?

Once a node is down, the remaining nodes have to re-replicate its shards while still serving requests, which causes additional heap pressure.

Getting Bad

Solution ?

More memory.

Not necessarily more boxes.

Elasticsearch Cons

• Not commodity hardware: 6K (Hadoop) vs. 10K (SSD).

• GC issues.

• Circuit breakers don't protect you against everything.

• No built-in security. Use an nginx proxy with authentication.

• Learning curve.

• Lots of updates hurt: the filter cache has to be rebuilt, segments merged, etc.

Thank you

• abhishek376.wordpress.com

• abhishek376@gmail.com

• Twitter: abhishek376

We are hiring!!
