Go Big Quick with Elasticsearch


DESCRIPTION

Presentation at Elasticsearch-NY meetup in April http://www.meetup.com/Elasticsearch-NY/events/176640472/

TRANSCRIPT

Go Big Quick

Jason Scheller
Platform & Content Analytics, Eikon

Pricing & Text Analytics Platform

• Mission - Ingest, enrich, store, analyze everything. Provide a single platform for search and analytics capabilities over any hosted content. Serve as a platform for future innovation.

• Content

• Twitter (~675 Tweets/sec, 15 days history)

• News (~40 articles/sec, 18 months history)

• Research (40 million docs, 3 million/year)

• Filings (29 million docs, 2.5 million/year)

• Trade data (500K RICs, 30K/sec, 10 years)

• Various metadata and derived content sets

[Pricing & Text Analytics Platform: two slides whose content did not survive extraction]

Infrastructure

• IBM Streams: 30 servers

• 18 servers, 86 TB

Where to start?

[Diagram: pour data into an index with a single shard (shard 0) while JMeter drives query load, and watch for the point where that shard maxes out; a minimal sketch of the experiment follows the list below]

• Max shard size is bounded by:

• Disk space

• Request load

• RAM usage
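
A minimal sketch of that experiment, assuming a local test node at http://localhost:9200, a hypothetical news-test index, and the requests library (bulk-request details vary slightly between Elasticsearch versions):

import json
import requests

ES = "http://localhost:9200"   # assumption: a local test node
INDEX = "news-test"            # hypothetical index name

# One primary shard, no replicas: everything we load lands in a single shard.
requests.put(
    f"{ES}/{INDEX}",
    json={"settings": {"number_of_shards": 1, "number_of_replicas": 0}},
)

def bulk_load(docs):
    """Index one batch of documents through the _bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))   # action line
        lines.append(json.dumps(doc))             # document source line
    requests.post(
        f"{ES}/{INDEX}/_bulk",
        data="\n".join(lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
    )

def shard_size_bytes():
    """On-disk size of the single shard, from the index stats API."""
    stats = requests.get(f"{ES}/{INDEX}/_stats/store").json()
    return stats["indices"][INDEX]["primaries"]["store"]["size_in_bytes"]

Keep bulk-loading real documents with your real analyzer settings while JMeter hammers the index; the last shard_size_bytes() reading before disk, RAM, or latency gives out is your max shard size.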

Maximum Shard Size

• This same experiment will also give you the ratio of data to index size, which is great for planning. Just make sure you’re using your real analyzer settings.

• The rest is just math!

• Don’t forget to account for:

• Memory required to facet & sort

• Replica shards

• Data compression

Max Total Index Size / Max Shard Size = # Nodes
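
The rest of the math as a sketch; every number below is an invented example, the measured values come from the single-shard experiment above, and reading the result as "# Nodes" assumes roughly one primary shard per node, as the formula implies:

# All numbers are example values; substitute your own measurements.
max_shard_gb = 50            # largest shard that still met disk/RAM/latency targets
index_to_data_ratio = 1.4    # index bytes per raw byte, measured with real analyzers

raw_data_gb = 12_000         # projected raw content to host
replicas = 1                 # replica copies per primary shard

primary_index_gb = raw_data_gb * index_to_data_ratio
total_index_gb = primary_index_gb * (1 + replicas)   # primaries plus replicas

shards = total_index_gb / max_shard_gb
nodes = shards                                        # ~one shard per node

print(f"{primary_index_gb:.0f} GB of primaries, {total_index_gb:.0f} GB with replicas")
print(f"~{shards:.0f} shards, ~{nodes:.0f} nodes at one shard per node")

Memory for faceting and sorting, and the effect of compression, still need their own headroom on top of this, per the list above.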

SPREADSHEET

But do I always use Max Shards?

ALLOCATION & HARDWARE

Cluster Allocation

• Elasticsearch will figure out which node should host which shard. Let it! It's better than you at figuring this out and moving shards around.

• Well, mostly…

• Let’s say you have indices A – D, 4 shards each, 0 replicas, 4 nodes. Elasticsearch might arrange your shards like this based on the size of each shard.

[Diagram: shards A1–A4, B1–B4, C1–C4, and D1–D4 spread across the 4 nodes]

Cluster Allocation

• But what about other considerations?

• Hot spotting

• Access frequency

• Connectivity for River-based ingestion

• Heterogeneous hardware

[Diagram: the same 16-shard layout as above]

Cluster Allocation – Heterogeneous Hardware

• Suppose you know that indices A and B get queried thousands of times per second, but C and D are only hit about once a second. Maybe you bought some better hardware to host A and B and don't want to waste those machines on C and D.

• Is this a good allocation?

[Diagram: two slow-hardware nodes and two fast-hardware nodes, with shards of A, B, C, and D mixed across all four]

Cluster Allocation – Heterogeneous Hardware

• Is this a good allocation? Not really. The slower machines will slow all queries to A & B, and I'm not getting my money's worth from that better hardware!


Cluster Allocation – Heterogeneous Hardware

• Wouldn't this be better?

• Shard allocation settings allow us to “control” which nodes host which indices without ever specifying specific machines or IPs.

[Diagram: A and B shards confined to the two fast-hardware nodes, C and D shards on the two slow-hardware nodes]

Cluster Allocation – Heterogeneous Hardware

[Diagram: the same allocation as above, annotated with the node and index settings that produce it]

Node settings (slow machines): node.hardware: slow
Node settings (fast machines): node.hardware: fast

Index settings (A & B): index.routing.allocation.require.hardware: fast
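
A minimal sketch of applying that, assuming the hardware attribute is set in each node's elasticsearch.yml exactly as above (node.hardware: slow or node.hardware: fast; newer releases spell the attribute node.attr.hardware), the indices are literally named a and b, and the requests library talks to a hypothetical endpoint:

import requests

ES = "http://localhost:9200"   # assumption: cluster endpoint

# Pin indices A and B to nodes carrying the "fast" hardware attribute.
# Elasticsearch relocates their shards on its own; C and D stay unconstrained.
for index in ("a", "b"):       # hypothetical index names
    requests.put(
        f"{ES}/{index}/_settings",
        json={"index.routing.allocation.require.hardware": "fast"},
    )

The same family of settings also has include and exclude variants when "must run on" is stronger than you need.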

Cluster Allocation – Heterogeneous Hardware

[Diagram: the same indices spread over one slow-hardware node and three fast-hardware nodes, with A and B shards still on fast hardware]

• Is this ok? …Sure, why not?!

Cluster Allocation – Archive Example

• We can use the same feature for large data sets from a time-based feed. Say we keep an index for all news ever. People are generally searching the most recent 12 months, not the last 30 years.

[Diagram: a large pool of slow-hardware nodes holding the historical archive, with a small set of fast-hardware nodes serving the most recent data]
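
A hedged sketch of the aging step, assuming one index per month named like news-2014-03 (a made-up convention) and the same hardware attribute as before; anything older than 12 months has its allocation requirement flipped to slow, and Elasticsearch migrates those shards off the fast machines by itself:

from datetime import date
import requests

ES = "http://localhost:9200"     # assumption: cluster endpoint
HOT_MONTHS = 12                  # how far back "recent" searches usually reach

def monthly_indices(start_year=1985):
    """Yield (hypothetical monthly index name, age in months) up to the current month."""
    today = date.today()
    for year in range(start_year, today.year + 1):
        for month in range(1, 13):
            if (year, month) > (today.year, today.month):
                return
            age = (today.year - year) * 12 + (today.month - month)
            yield f"news-{year}-{month:02d}", age

# Recent indices require fast hardware; older ones are pushed to the slow pool.
for name, age_months in monthly_indices():
    target = "fast" if age_months < HOT_MONTHS else "slow"
    requests.put(                # a month with no index simply returns a 404 here
        f"{ES}/{name}/_settings",
        json={"index.routing.allocation.require.hardware": target},
    )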
