Real-Time Analytics Using Hadoop and Elasticsearch

by Abhishek Andhavarapu

Thank you Sponsors!

About Me

• Currently working as a Software Engineer (Data Platform) at Allegiance Software Inc.

• Passion for distributed systems and data visualization.

• Master’s in Distributed Systems.

• abhishek376.wordpress.com

Agenda

Use Case.

Architecture.

Elasticsearch 101.

Demo.

Lessons learnt.

Legacy Architecture

Current Architecture

Why Hadoop?

Elasticsearch 101

• Document-oriented search engine, JSON based, with Apache Lucene under the covers.

• Schema free.

• It’s distributed and supports aggregations similar to SQL GROUP BY.

• Uses bit sets to cache filters efficiently.

• It’s fast. Super fast.

• It has REST and Java-based APIs.
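
To make the GROUP BY comparison concrete, here is a minimal sketch in Python. The request body uses the terms aggregation; the field name last_name matches the person documents on the CRUD slide, while the label by_last_name and the sample documents are made up for illustration. The local Counter only mimics the bucket counts a cluster would return; nothing here talks to a real cluster.

```python
import json
from collections import Counter

# Hypothetical sample documents, shaped like the person documents in this deck.
docs = [
    {"first_name": "Abhishek", "last_name": "Andhavarapu"},
    {"first_name": "Abhi", "last_name": "Andhavarapu"},
    {"first_name": "Jane", "last_name": "Doe"},
]

# Request body for a terms aggregation, roughly:
#   SELECT last_name, COUNT(*) FROM person GROUP BY last_name
# "by_last_name" is an arbitrary label chosen for this sketch.
agg_request = {
    "size": 0,  # return only the aggregation buckets, no hits
    "aggs": {"by_last_name": {"terms": {"field": "last_name"}}},
}


def group_by_count(documents, field):
    """Local stand-in for the bucket counts the aggregation would return."""
    return Counter(doc[field] for doc in documents)


buckets = group_by_count(docs, "last_name")
print(json.dumps(agg_request))
print(buckets.most_common())
```

Against a real cluster, the buckets depend on how the field is mapped; an analyzed string field is bucketed per token rather than per whole value.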

Elasticsearch CRUD

Index a person:

curl -XPUT 'localhost:9200/person/1' -d '{
  "first_name" : "Abhishek",
  "last_name" : "Andhavarapu"
}'

Get a person:

curl -XGET 'localhost:9200/person/1'

Delete a person:

curl -XDELETE 'localhost:9200/person/1'

Update a person:

curl -XPOST 'localhost:9200/person/1/_update' -d '{
  "doc" : {
    "first_name" : "Abhi"
  }
}'

Elasticsearch data

(Diagram: an index split into two shards, S0 and S1, held on Node1 and Node2.)

Replicas

(Diagram: Node1 and Node2 each hold a copy of both S0 and S1. Red = primary, blue = replica.)

More nodes

(Diagram: with Node3 and Node4 added, the primary and replica copies of S0 and S1 are spread across four nodes.)

Node down

(Diagram: one node is lost. The surviving replica of S1 is promoted to primary and then re-replicated onto another node.)

Elasticsearch 101

• Lucene is under the covers.

• Each index (like a database) is made up of multiple shards (each a Lucene instance).

• Shards are distributed among all the nodes in the cluster.

• When a node fails or new nodes are added, shards are automatically moved from one node to another.
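
The failover behaviour described here can be sketched with a toy model in Python. This is not how Elasticsearch allocates shards internally; the function names allocate and fail_node, the round-robin placement, and the "least-loaded node" rule are all assumptions made for this illustration. It only shows the idea: one primary and one replica per shard on different nodes, promotion of a replica when a primary is lost, and re-replication afterwards.

```python
def allocate(nodes, num_shards):
    """Place one primary and one replica per shard on different nodes."""
    cluster = {node: {} for node in nodes}  # node -> {shard_id: role}
    for shard in range(num_shards):
        primary_node = nodes[shard % len(nodes)]
        replica_node = nodes[(shard + 1) % len(nodes)]
        cluster[primary_node][shard] = "primary"
        if replica_node != primary_node:
            cluster[replica_node][shard] = "replica"
    return cluster


def fail_node(cluster, dead):
    """Drop a node, promote replicas of lost primaries, then re-replicate."""
    lost = cluster.pop(dead)
    for shard, role in lost.items():
        holders = [n for n, shards in cluster.items() if shard in shards]
        if not holders:
            continue  # no surviving copy: this shard is unavailable
        if role == "primary":
            cluster[holders[0]][shard] = "primary"  # promote surviving replica
        # Re-replicate onto the least-loaded node without a copy of this shard.
        candidates = [n for n in cluster if shard not in cluster[n]]
        if candidates:
            target = min(candidates, key=lambda n: len(cluster[n]))
            cluster[target][shard] = "replica"
    return cluster


cluster = allocate(["node1", "node2", "node3", "node4"], num_shards=2)
cluster = fail_node(cluster, "node2")
for node, shards in cluster.items():
    print(node, shards)
```

After node2 is removed, every shard again has exactly one primary and one replica on the three remaining nodes.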

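How is it fast?

Distributed execution

(Diagram: the client sends a query to the cluster, where it is executed across the S0 and S1 shard copies held on Node 1 and Node 2. Red = primary, blue = replica.)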
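
Distributed execution can be pictured as scatter-gather: every shard answers the query for its own slice of the data, and the partial results are merged. The Python sketch below is a simplified illustration under stated assumptions. The shard contents, scores, and function names are invented, and threads stand in for nodes; a real cluster sends the query to one copy (primary or replica) of each shard.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each shard holds (score, doc_id) pairs for matching docs.
shards = {
    "S0": [(0.9, "doc-3"), (0.4, "doc-7"), (0.2, "doc-1")],
    "S1": [(0.8, "doc-5"), (0.7, "doc-2"), (0.1, "doc-9")],
}


def search_shard(hits, size):
    """Each shard returns only its own top `size` hits."""
    return heapq.nlargest(size, hits)


def search(shards, size):
    """Scatter the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(
            pool.map(lambda hits: search_shard(hits, size), shards.values())
        )
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(size, merged)


top = search(shards, size=3)
print(top)
```

Because each shard only returns its local top hits, the merge step handles a small, bounded number of results no matter how large the shards are.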

DEMO

• Import data from a SQL database into Hive. (Extract)

• Run the necessary computations using Hadoop/Hive. (Transform)

• Push the data into Elasticsearch. (Load)

• Run queries against Elasticsearch.
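
The deck does not say how the data was pushed in the demo. One common option for the load step is the _bulk endpoint, which takes a newline-delimited body. The Python sketch below only builds such a body and does not send it; the helper name to_bulk_body, the index name person, and the sample rows are assumptions for illustration. Older Elasticsearch versions also expect a type, either in the URL or as _type in the action line.

```python
import json


def to_bulk_body(rows, index, id_field):
    """Build a newline-delimited body for the _bulk endpoint.

    Each row becomes two lines: an action line, then the document itself.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index, "_id": row[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical rows, as they might come out of the Hive transform step.
rows = [
    {"id": 1, "first_name": "Abhishek", "last_name": "Andhavarapu"},
    {"id": 2, "first_name": "Jane", "last_name": "Doe"},
]
body = to_bulk_body(rows, index="person", id_field="id")
print(body)
```

Sending many documents per request in this way avoids paying one HTTP round trip per document.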

Current Elasticsearch Cluster

• 9 bare metal boxes

• 128 GB RAM

• 2X SSD

• 10 Gb Ethernet

• 2X 10 core Xeon Processors

• 2X 30GB Elasticsearch instances per box

• 1 Elasticsearch load balancing instance to handle index requests

Zabbix

What’s slow?

Any request that takes more than 300 ms is slow.

Lessons Learnt

Concurrency

• More replication for more concurrency. Updates are costly.

• More shards, much faster.

• SQL: 3 to 5k per minute.

Filter Cache

• All filters have a cache flag that controls whether they are cached.

• Once the filter cache is warmed, all requests are served from memory.

• Default: 10% of the heap for the filter cache.

• Eviction is LRU.

• Filters are stored as bit sets.
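
The combination of bit sets and LRU eviction can be sketched in a few lines of Python. This is a toy model, not Elasticsearch's implementation: the class name FilterCache, the string filter keys, and the entry-count limit are assumptions for illustration (the real cache is bounded by memory size, not by entry count). A Python int serves as the bit set, with bit i set when document i matches.

```python
from collections import OrderedDict


class FilterCache:
    """Toy LRU cache mapping a filter to a bit set of matching documents."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # filter key -> int used as a bit set

    def bitset_for(self, key, docs, predicate):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        bits = 0
        for doc_id, doc in enumerate(docs):
            if predicate(doc):
                bits |= 1 << doc_id  # bit i is set when document i matches
        self.entries[key] = bits
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used
        return bits


docs = [
    {"last_name": "Andhavarapu", "active": True},
    {"last_name": "Doe", "active": True},
    {"last_name": "Andhavarapu", "active": False},
]
cache = FilterCache(max_entries=2)
by_name = cache.bitset_for(
    "last_name:Andhavarapu", docs, lambda d: d["last_name"] == "Andhavarapu"
)
active = cache.bitset_for("active:true", docs, lambda d: d["active"])
both = by_name & active  # combining two cached filters is a bitwise AND
matching_ids = [i for i in range(len(docs)) if (both >> i) & 1]
print(matching_ids)
```

Once a filter's bit set is cached, combining it with other filters costs only a bitwise operation, which is why a warm cache makes repeated filters cheap.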

Field Data

• For sorting, aggregations, etc., all the field values are loaded into memory; this is called field data.

• By default it’s unbounded.

• It is expensive to build, so it’s recommended to keep it in memory.

• There are circuit breakers to protect against this.

• If a query is going to use more than 60% of the JVM heap, the circuit breaker kills the query.
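
The circuit breaker idea is to estimate the memory a request needs before loading anything, and to refuse the request if the estimate would cross the limit. Here is a minimal Python sketch of that check. The class and method names are invented for this example, and the 30 GB heap assumes that the "30GB Elasticsearch instances" on the cluster slide refers to heap size.

```python
class CircuitBreakingError(Exception):
    """Raised when loading more field data would exceed the limit."""


class FieldDataBreaker:
    def __init__(self, heap_bytes, limit_fraction=0.60):
        self.limit = int(heap_bytes * limit_fraction)
        self.used = 0

    def reserve(self, estimated_bytes):
        """Account for an estimated load up front, or refuse it."""
        if self.used + estimated_bytes > self.limit:
            raise CircuitBreakingError(
                f"would use {self.used + estimated_bytes} bytes, "
                f"limit is {self.limit}"
            )
        self.used += estimated_bytes


GB = 1024 ** 3
breaker = FieldDataBreaker(heap_bytes=30 * GB)  # assumed 30 GB heap
breaker.reserve(10 * GB)  # fits under the roughly 18 GB limit
try:
    breaker.reserve(9 * GB)  # 19 GB in total would cross the limit
    tripped = False
except CircuitBreakingError:
    tripped = True
print(tripped)
```

The rejected request leaves the accounting untouched, so queries that fit under the limit keep working after a breaker trip.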

JVM memory - Friend or Foe?

Once a node is down, the other nodes, which are still serving requests, have to re-replicate its shards, causing additional heap pressure.

Getting Bad

Solution?

More memory.

Not necessarily more boxes.

Elasticsearch Cons

• Not commodity hardware: 6K (Hadoop) vs 10K (SSD).

• GC issues.

• Circuit breakers don’t protect you against everything.

• No built-in security. Use an nginx proxy with authentication.

• Learning curve.

• Lots of updates hurt: the filter cache has to be rebuilt, merges are triggered, etc.

Thank you

• abhishek376.wordpress.com

• [email protected]

• Twitter: abhishek376

We are hiring!!
