Real time analytics using Hadoop and Elasticsearch

by Abhishek Andhavarapu

Post on 01-Jul-2015


Page 1: Real time analytics using Hadoop and Elasticsearch

Real time analytics using Hadoop and Elasticsearch

by Abhishek Andhavarapu

Page 2: Real time analytics using Hadoop and Elasticsearch

Thank you Sponsors!

Page 3: Real time analytics using Hadoop and Elasticsearch

About Me

• Currently working as a Software Engineer (Data Platform) at Allegiance Software Inc.

• Passion for distributed systems and data visualization.

• Master's in Distributed Systems.

• abhishek376.wordpress.com

Page 4: Real time analytics using Hadoop and Elasticsearch

Agenda

Use Case.

Architecture.

Elasticsearch 101.

Demo.

Lessons learnt.

Page 5: Real time analytics using Hadoop and Elasticsearch

Legacy Architecture


Page 6: Real time analytics using Hadoop and Elasticsearch

Current Architecture

Page 7: Real time analytics using Hadoop and Elasticsearch

Why Hadoop?

Page 8: Real time analytics using Hadoop and Elasticsearch

Elasticsearch 101

• Document-oriented, JSON-based search engine with Apache Lucene under the covers.

• Schema free.

• It's distributed and supports aggregations, similar to SQL GROUP BY.

• Uses bit sets to cache filter results efficiently.

• It's fast. Super fast.

• It has REST and Java-based APIs.
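The GROUP BY comparison can be made concrete. The sketch below builds the body of a terms aggregation, which returns one bucket per distinct field value, and computes the same per-value counts in plain Python so the analogy is visible. The field name `last_name` is borrowed from the CRUD slide; the helper names and the sample documents are hypothetical.

```python
from collections import Counter

def terms_aggregation_body(field, name="by_value"):
    """Request body for a terms aggregation: one bucket per distinct value."""
    return {
        "size": 0,  # return only the aggregation buckets, no search hits
        "aggs": {name: {"terms": {"field": field}}},
    }

def group_by_count(docs, field):
    """Local stand-in for SELECT field, COUNT(*) ... GROUP BY field."""
    return Counter(doc[field] for doc in docs if field in doc)

# Hypothetical sample documents, for illustration only.
docs = [
    {"first_name": "Abhishek", "last_name": "Andhavarapu"},
    {"first_name": "Abhi", "last_name": "Andhavarapu"},
    {"first_name": "Jane", "last_name": "Doe"},
]
body = terms_aggregation_body("last_name")
counts = group_by_count(docs, "last_name")
```

Sent to a search endpoint, `body` would ask the cluster to do the counting that `group_by_count` does locally here.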

Page 9: Real time analytics using Hadoop and Elasticsearch

Elasticsearch CRUD

Index a person:

curl -XPUT 'localhost:9200/person/1' -d '{
  "first_name" : "Abhishek",
  "last_name" : "Andhavarapu"
}'

Get a person:

curl -XGET 'localhost:9200/person/1'

Delete a person:

curl -XDELETE 'localhost:9200/person/1'

Update a person:

curl -XPOST 'localhost:9200/person/1/_update' -d '{
  "doc" : {
    "first_name" : "Abhi"
  }
}'
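For readers who would rather script these calls than type curl, here is a minimal sketch of the same four requests using only the Python standard library. It builds the requests without sending them, so it runs without a cluster. `BASE_URL` and the helper `build_request` are assumptions for illustration; the paths and bodies are copied verbatim from the slide.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port

def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request equivalent to one curl call."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )

# Paths and bodies are copied verbatim from the slide.
index_req = build_request(
    "PUT", "/person/1", {"first_name": "Abhishek", "last_name": "Andhavarapu"}
)
get_req = build_request("GET", "/person/1")
delete_req = build_request("DELETE", "/person/1")
update_req = build_request("POST", "/person/1/_update", {"doc": {"first_name": "Abhi"}})

# To send one against a running cluster:
#   with urllib.request.urlopen(get_req) as resp:
#       print(json.load(resp))
```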

Page 10: Real time analytics using Hadoop and Elasticsearch

Elasticsearch data

[Diagram: two nodes, Node1 and Node2, holding shards S0 and S1]

Page 11: Real time analytics using Hadoop and Elasticsearch

Replicas

[Diagram: Node1 and Node2 each hold a copy of shards S0 and S1; red = primary, blue = replica]

Page 12: Real time analytics using Hadoop and Elasticsearch

More nodes

[Diagram: four nodes (Node1 to Node4), each holding one copy of S0 or S1; red = primary, blue = replica]

Page 13: Real time analytics using Hadoop and Elasticsearch

Node down

[Diagram: four nodes (Node1 to Node4) holding copies of S0 and S1, with one node down; red = primary, blue = replica]

Page 14: Real time analytics using Hadoop and Elasticsearch

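Node down

[Diagram: Node2 is gone; Node1, Node3 and Node4 remain. The surviving replica of S1 is promoted to primary and a new copy of S1 is re-replicated onto another node; red = primary, blue = replica]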

Page 15: Real time analytics using Hadoop and Elasticsearch

Elasticsearch 101

• Lucene is under the covers.

• Each index (like a database) is made up of multiple shards (each a Lucene instance).

• Shards are distributed among all nodes in the cluster.

• When a node fails or new nodes are added, shards are automatically moved from one node to another.
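The failover behaviour on the preceding slides (replica promoted to primary, then re-replicated) can be sketched as a small simulation. This is a toy model under stated assumptions, not Elasticsearch's allocator: the function `fail_node`, the dictionary layout, and the "least-loaded node" placement rule are all hypothetical simplifications.

```python
def fail_node(cluster, dead):
    """Return a new allocation after `dead` leaves the cluster.

    `cluster` maps node name -> {shard name: "primary" | "replica"}.
    """
    survivors = {n: dict(s) for n, s in cluster.items() if n != dead}
    for shard, role in cluster[dead].items():
        holders = [n for n, s in survivors.items() if shard in s]
        # Step 1: if the primary was lost, promote a surviving replica.
        if role == "primary" and holders:
            survivors[holders[0]][shard] = "primary"
        # Step 2: re-replicate onto the least-loaded node without a copy.
        candidates = [n for n in survivors if shard not in survivors[n]]
        if holders and candidates:
            target = min(candidates, key=lambda n: (len(survivors[n]), n))
            survivors[target][shard] = "replica"
    return survivors

# Hypothetical four-node layout: one shard copy per node.
cluster = {
    "Node1": {"S0": "primary"},
    "Node2": {"S1": "primary"},
    "Node3": {"S1": "replica"},
    "Node4": {"S0": "replica"},
}
after = fail_node(cluster, "Node2")
```

After `Node2` fails, the replica of `S1` on `Node3` becomes the primary and a fresh replica of `S1` lands on one of the remaining nodes, so every shard is back to two copies.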

Page 16: Real time analytics using Hadoop and Elasticsearch

How is it fast?

Distributed execution

[Diagram: a client sends a query to a two-node cluster; Node 1 and Node 2 each hold a copy of shards S0 and S1; red = primary, blue = replica]
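Distributed execution follows a scatter-gather pattern: the query runs on one copy of every shard, and the partial results are merged into a single answer. The sketch below imitates that with in-memory dictionaries. The scoring (a plain term count) is a stand-in rather than Lucene's scoring, and the function names, the `request_no` rotation, and the sample documents are hypothetical.

```python
import heapq

def search_shard(docs, term, size):
    """Score docs on one shard copy by term count; return its local top hits."""
    hits = [(text.split().count(term), doc_id) for doc_id, text in docs.items()]
    return heapq.nlargest(size, [h for h in hits if h[0] > 0])

def distributed_search(shards, term, size=2, request_no=0):
    """Scatter the query to one copy of every shard, then merge the results.

    `shards` maps shard name -> list of (node, docs) copies. Primary and
    replica hold the same docs, so any copy can answer; rotating on
    `request_no` spreads successive requests across the copies.
    """
    partial, used = [], []
    for name, copies in sorted(shards.items()):
        node, docs = copies[request_no % len(copies)]
        used.append((name, node))
        partial.extend(search_shard(docs, term, size))
    return heapq.nlargest(size, partial), used

# Hypothetical documents spread over two shards, two copies each.
s0 = {"d1": "hadoop hive hadoop", "d2": "elasticsearch"}
s1 = {"d3": "hadoop", "d4": "hadoop hadoop hadoop"}
shards = {
    "S0": [("Node 1", s0), ("Node 2", s0)],
    "S1": [("Node 2", s1), ("Node 1", s1)],
}
top, used = distributed_search(shards, "hadoop")
```

Because each shard only returns its own top hits, the merge step handles a handful of candidates per shard instead of the whole data set, and because replicas can answer too, adding copies adds read capacity.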

Page 17: Real time analytics using Hadoop and Elasticsearch

DEMO

• Import data from a SQL database into Hive. (Extract)

• Run the necessary computations using Hadoop/Hive. (Transform)

• Push the data into Elasticsearch. (Load)

• Run queries against Elasticsearch.
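The deck does not say how the load step was implemented. One common route is the bulk API, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds such a payload from already transformed rows. The function name, index name, type name and sample rows are hypothetical, and `_type` reflects the 2015-era API (recent Elasticsearch versions no longer have mapping types).

```python
import json

def to_bulk_body(rows, index, doc_type, id_field):
    """Serialize rows into the newline-delimited format the _bulk endpoint expects."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index, "_type": doc_type, "_id": row[id_field]}}
        lines.append(json.dumps(action))  # action line: what to do and where
        lines.append(json.dumps(row))     # source line: the document itself
    return "\n".join(lines) + "\n"        # the bulk body must end with a newline

# Hypothetical output of the Hive transform step.
rows = [
    {"id": 1, "region": "west", "responses": 120},
    {"id": 2, "region": "east", "responses": 95},
]
bulk_body = to_bulk_body(rows, index="analytics", doc_type="summary", id_field="id")
```

Batching many documents into one request like this avoids paying one HTTP round trip per row.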

Page 18: Real time analytics using Hadoop and Elasticsearch

Current Elasticsearch Cluster

• 9 bare metal boxes

• 128 GB RAM

• 2X SSD

• 10 Gb Ethernet

• 2X 10 core Xeon Processors

• 2X 30GB Elasticsearch instances per box

• 1 Elasticsearch load balancing instance to handle index requests

Page 19: Real time analytics using Hadoop and Elasticsearch

Zabbix

What's slow?

Any request that takes more than 300 ms is slow.
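The 300 ms rule is easy to apply to any list of measured latencies. The sketch below is a generic illustration, not the Zabbix configuration used on the cluster; the function names and the sample latencies are hypothetical.

```python
SLOW_THRESHOLD_MS = 300  # from the slide: anything above 300 ms counts as slow

def slow_requests(latencies_ms, threshold_ms=SLOW_THRESHOLD_MS):
    """Return the latencies that exceed the threshold."""
    return [ms for ms in latencies_ms if ms > threshold_ms]

def slow_ratio(latencies_ms, threshold_ms=SLOW_THRESHOLD_MS):
    """Fraction of requests that are slow (0.0 for an empty sample)."""
    if not latencies_ms:
        return 0.0
    return len(slow_requests(latencies_ms, threshold_ms)) / len(latencies_ms)

sample = [12, 45, 310, 299, 300, 1200]  # hypothetical latencies in ms
```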

Page 20: Real time analytics using Hadoop and Elasticsearch

Lessons Learnt

Page 21: Real time analytics using Hadoop and Elasticsearch

Concurrency

• More replicas for more concurrency. Updates are costly.

• More shards, much faster.

• SQL: 3 to 5k per minute.

Page 22: Real time analytics using Hadoop and Elasticsearch

Filter Cache

• All filters have a cache flag that controls whether they are cached.

• Once the filter cache is warmed, all requests are served from memory.

• Default size: 10% of the heap for the filter cache.

• Eviction is LRU.

• Cached filters are stored as bit sets.
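To show how the LRU and bit set points fit together, here is a toy filter cache. Each cached filter is a bit set (a Python integer whose bit i is set when document i matches), and the least recently used entry is evicted when the cache is full. This is a sketch under simplifying assumptions: the class `FilterCache` is hypothetical, and it is bounded by entry count, whereas the real cache is bounded by memory.

```python
from collections import OrderedDict

class FilterCache:
    """LRU cache mapping a filter key to a bit set of matching doc positions."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def bitset_for(self, key, docs, predicate):
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        bits = 0
        for position, doc in enumerate(docs):
            if predicate(doc):
                bits |= 1 << position  # bit i is set when docs[i] matches
        self._entries[key] = bits
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return bits

# Hypothetical documents, for illustration only.
docs = [
    {"lang": "en", "year": 2015},
    {"lang": "de", "year": 2014},
    {"lang": "en", "year": 2014},
]
cache = FilterCache(max_entries=2)
english = cache.bitset_for("lang:en", docs, lambda d: d["lang"] == "en")
recent = cache.bitset_for("year:2015", docs, lambda d: d["year"] == 2015)
both = english & recent  # combining two cached filters is a bitwise AND
```

Once a filter is cached, answering it again costs a dictionary lookup, and combining filters costs a bitwise operation, which is why a warm cache serves requests from memory.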

Page 23: Real time analytics using Hadoop and Elasticsearch

Field Data

• For sorting, aggregations, etc., all the field values are loaded into memory. This is called field data.

• By default it's unbounded.

• It is expensive to build, so it's recommended to hold it in memory.

• There are circuit breakers to protect against this.

• If a query is going to use more than 60% of the JVM heap, the circuit breaker kills the query.
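The circuit breaker idea is to estimate the memory a load will need and refuse it up front, rather than discover the problem as an out-of-memory error. The sketch below models only that accounting. The class names are hypothetical, the 60% limit comes from the slide, and the 30 GiB heap assumes the 30 GB figure on the cluster slide refers to heap size.

```python
class CircuitBreakingError(Exception):
    """Raised when loading more field data would exceed the configured limit."""

class FieldDataBreaker:
    def __init__(self, heap_bytes, limit_percent=60):
        # Integer arithmetic keeps the limit exact.
        self.limit_bytes = heap_bytes * limit_percent // 100
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        """Account for a field-data load up front; refuse it if it would not fit."""
        wanted = self.used_bytes + estimated_bytes
        if wanted > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {wanted} bytes, limit is {self.limit_bytes}"
            )
        self.used_bytes = wanted

GIB = 1024 ** 3
breaker = FieldDataBreaker(heap_bytes=30 * GIB)  # assumed heap size
breaker.reserve(10 * GIB)      # fits: 10 GiB is under the 18 GiB limit
try:
    breaker.reserve(9 * GIB)   # 19 GiB in total would exceed the limit
    tripped = False
except CircuitBreakingError:
    tripped = True
```

The rejected request leaves `used_bytes` untouched, so queries that fit can still run after one has been refused.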

Page 24: Real time analytics using Hadoop and Elasticsearch

JVM memory - Friend or Foe?

Once a node is down, the remaining nodes have to re-replicate its shards while still serving requests, which causes additional heap pressure.

Page 25: Real time analytics using Hadoop and Elasticsearch

Getting Bad

Solution?

More memory.

Not necessarily more boxes.

Page 26: Real time analytics using Hadoop and Elasticsearch

Elasticsearch Cons

• Not commodity hardware: 6K (Hadoop) vs 10K (SSD).

• GC issues.

• Circuit breakers don't protect you against everything.

• No built-in security. Use an nginx proxy with authentication.

• Learning curve.

• A lot of updates hurt: the filter cache has to be rebuilt, merges are triggered, etc.

Page 27: Real time analytics using Hadoop and Elasticsearch

Thank you

• abhishek376.wordpress.com


• Twitter: abhishek376

We are hiring!