From Lucene to Elasticsearch, a short explanation of horizontal scalability

Scaling Lucene The event of ElasticSearch Stéphane Gamard

Post on 17-Jul-2015

Page 1: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Scaling Lucene The event of ElasticSearch

Stéphane Gamard

Page 2: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Scalability

Scalability is defined along 3 main axes:

• Index size - the number of entries upon which we act

• QPS - the number of requests serviced per second

• Time to operation - the time taken to become operational

Page 3: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene

In a nutshell, Lucene does not scale. Why?

• IR library - purely focused on TF-IDF

• Bounded by native resources - vertical scaling

• NRT (near-real-time) inverse lookup - segments

Page 4: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene

Segments: the Lucene storage

just a “bunch of files”

Page 5: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

In a “document” perspective

Documents: {#hello, #world}, {#there, #is, #a, #brown, #fox}, {#the, … , #kitchen}

Dictionary: #a -> T1, #is -> T2, #fox -> T45

Inverse Lookup: T1 -> {#1, #33}, T2 -> {#2, … , #87}, T45 -> {#2, …}

(diagram: the Dictionary and the Inverse Lookup sit inside a Segment)
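
The dictionary / inverse lookup pair above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function names `build_segment` and `lookup`, the `T<n>` term ids and the sample documents are made up for illustration, and nothing here is Lucene's actual API or file format.

```python
from collections import defaultdict


def build_segment(documents):
    """Build a toy 'segment': a term dictionary plus an inverse lookup.

    documents: list of token lists; document ids are 1-based positions.
    Returns (dictionary, inverse_lookup) where
      dictionary:     term -> term id (hypothetical "T<n>" naming)
      inverse_lookup: term id -> sorted list of document ids
    """
    dictionary = {}
    inverse_lookup = defaultdict(list)
    for doc_id, tokens in enumerate(documents, start=1):
        for term in sorted(set(tokens)):  # each term counted once per document
            term_id = dictionary.setdefault(term, f"T{len(dictionary) + 1}")
            inverse_lookup[term_id].append(doc_id)
    return dictionary, dict(inverse_lookup)


def lookup(term, dictionary, inverse_lookup):
    """Return the ids of the documents containing `term` (empty if unknown)."""
    term_id = dictionary.get(term)
    return inverse_lookup.get(term_id, [])


# Illustrative documents (not the ones from the slide).
docs = [
    ["hello", "world"],
    ["there", "is", "a", "brown", "fox"],
    ["the", "fox", "is", "in", "the", "kitchen"],
]
dictionary, inverse_lookup = build_segment(docs)
print(lookup("fox", dictionary, inverse_lookup))  # [2, 3]
```

Both structures grow with the data: every new term adds a dictionary entry, and every new document adds inverse entries.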

Page 6: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

Factors of growth

(diagram: same Dictionary / Inverse Lookup / Segment as on Page 5)

• Dictionary size - NLP*

• New inverse entries

Page 7: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

In a storage perspective

Segment

Page 8: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

In a storage perspective

Segment

Page 9: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

In a storage perspective

Segment

Page 10: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

In a storage perspective

Segment

IndexReader(s)

IndexWriter

Page 11: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

In a storage perspective

IndexReader(s)

IndexWriter

Lucene Index

Page 12: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene

Segments: the Lucene storage

just a “bunch of files”

Page 13: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Indexing

The wonderful world of merging segments

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
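
To give a feel for what merging segments means, here is a deliberately simplified sketch. It assumes a segment is just a mapping from term to document ids, and it uses a made-up policy (merge the smallest segments once `merge_factor` of them exist). The names `merge_segments`, `maybe_merge` and `merge_factor` are hypothetical, and this is not Lucene's real merge policy.

```python
def merge_segments(segments):
    """Merge several toy segments (term -> doc ids) into a single segment."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    return {term: sorted(ids) for term, ids in sorted(merged.items())}


def maybe_merge(segments, merge_factor=3):
    """Toy policy: once `merge_factor` segments exist, merge the smallest ones.

    Segment size is measured as its total number of inverse entries.
    """
    if len(segments) < merge_factor:
        return segments
    by_size = sorted(segments, key=lambda s: sum(len(ids) for ids in s.values()))
    smallest, rest = by_size[:merge_factor], by_size[merge_factor:]
    return rest + [merge_segments(smallest)]


segments = [{"a": [1]}, {"a": [2], "fox": [2]}, {"fox": [3]}]
print(maybe_merge(segments))  # [{'a': [1, 2], 'fox': [2, 3]}]
```

Fewer, larger segments mean fewer files to consult per search, at the cost of the I/O spent rewriting them.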

Page 14: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Wrap-up

A Lucene Index is:

• A collection of segments

• One or more IndexReaders

• A single IndexWriter

Page 15: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Lucene Wrap-up

A single Lucene Index scales to:

• Index size - available HDD/RAM for segments

• QPS - number of IndexReader threads

• Time to operation - speed at which the IndexWriter can ingest (IOPS)

It can only scale vertically!

Page 16: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Elasticsearch

Also known as the commodity scaling of Lucene ;)

There is no magic…

It’s about partitioning: using an index of indexes as its index.

Page 17: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Elasticsearch

A shard is the magic sauce of web scale

(diagram: one Elasticsearch Index made of five Lucene indexes, each one a shard)

Page 18: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Elasticsearch

Document Indexing

Lucene Lucene Lucene Lucene Lucene

• Distributed

• Routing
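
Routing can be pictured as hashing a routing key (by default the document id) modulo the number of shards. The sketch below assumes that simple formula; `zlib.crc32` is only a stand-in for a stable hash, since Elasticsearch uses its own hash function, and the helper names `route_to_shard` and `index_documents` are hypothetical.

```python
import zlib


def route_to_shard(routing_key, number_of_shards):
    """Pick the shard for a document from its routing key.

    zlib.crc32 is a stand-in for a stable hash (assumption for this sketch).
    """
    if number_of_shards < 1:
        raise ValueError("number_of_shards must be at least 1")
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


def index_documents(doc_ids, number_of_shards=5):
    """Distribute document ids over shards; each shard stands for one Lucene index."""
    shards = {shard: [] for shard in range(number_of_shards)}
    for doc_id in doc_ids:
        shards[route_to_shard(doc_id, number_of_shards)].append(doc_id)
    return shards


shards = index_documents([f"doc-{n}" for n in range(20)])
for shard, doc_ids in shards.items():
    print(shard, len(doc_ids))
```

Because the same key always lands on the same shard, a later lookup by id only needs to ask one shard. It also hints at why the number of shards is hard to change afterwards: a different modulus sends existing keys elsewhere.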

Page 19: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Elasticsearch

Request

Lucene Lucene Lucene Lucene Lucene

• Parallel

• Aggregated

{search: {…}}
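
The "parallel, then aggregated" request flow is a scatter-gather. As a rough sketch, assume each shard is a mapping from term to scored documents: every shard is searched in parallel for its own top hits, and the partial results are merged into a global top list. The data layout and the names `search_shard` and `search` are assumptions made for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, size):
    """Return the `size` best (score, doc_id) hits of one shard for `term`."""
    hits = [(score, doc_id) for doc_id, score in shard.get(term, {}).items()]
    return heapq.nlargest(size, hits)


def search(shards, term, size=3):
    """Query every shard in parallel, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_results = list(pool.map(lambda s: search_shard(s, term, size), shards))
    all_hits = [hit for hits in partial_results for hit in hits]
    return heapq.nlargest(size, all_hits)


# Three toy shards: term -> {doc_id: score}
shards = [
    {"fox": {"d1": 0.9, "d2": 0.1}},
    {"fox": {"d3": 0.5}},
    {"dog": {"d4": 1.0}},
]
print(search(shards, "fox", size=2))  # [(0.9, 'd1'), (0.5, 'd3')]
```

Each shard returns at most `size` hits, so the aggregation step handles at most `size` times the number of shards entries.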

Page 20: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Elasticsearch

In a nutshell

• Distributed - one IndexWriter per shard

• Parallel - requests are parallelised over one IndexReader per shard

Page 21: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

How to leverage ES to scale Lucene

Sample sizing of one Lucene index for xM indexed documents:

• 2 threads - 1 searcher, 1 writer

• 2G RAM - Lucene cache

• 30G disk - index size

Page 22: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

Elasticsearch Index: 4 Lucene indexes, each 2T/2G/30G (threads/RAM/disk)

Single machine scope: 8 cores, 16G RAM, 500G HDD

can sustain 4 times xM documents
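
The "4 times" figure can be checked with a little arithmetic: divide each machine resource by the per-Lucene requirement and keep the smallest quotient. The sketch below uses the numbers from the slides; treating 8 cores as 8 usable threads is an assumption, and `shards_per_machine` is a hypothetical helper.

```python
def shards_per_machine(machine, shard):
    """How many shards fit on one machine, and which resource runs out first."""
    fits = {resource: machine[resource] // shard[resource] for resource in shard}
    limiting = min(fits, key=fits.get)
    return fits[limiting], limiting


shard = {"threads": 2, "ram_gb": 2, "disk_gb": 30}       # per-Lucene sizing from the slides
machine = {"threads": 8, "ram_gb": 16, "disk_gb": 500}   # 8 cores taken as 8 threads (assumption)

count, limiting = shards_per_machine(machine, shard)
print(count, limiting)  # 4 threads
```

With these numbers, threads run out first (4 shards), while RAM would allow 8 and disk 16.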

Page 23: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

1 machine -> 4 * xM documents

(chart axes: # Documents, QPS)

Page 24: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

2 machines -> 2 * 4 * xM documents

(chart axes: # Documents, QPS)

• 4 threads - 3 searchers, 1 writer

• 4G RAM - Lucene cache

• 60G disk - index size

Page 25: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

4 machines -> 2 * 4 * xM documents, with twice the QPS

(chart axes: # Documents, QPS)

Page 26: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

Is there a limit to this scalability?

(chart axes: # Documents, QPS)

Page 27: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

4 machines -> 4 * 4 * xM documents

(chart axes: # Documents, QPS)

• 8 threads - 7 searchers, 1 writer

• 8G RAM - Lucene cache

• 120G disk - index size

Page 28: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

The rules of thumb

• Threads - are the core of the scalability factors

• IOPS - are generally the limiting factor of horizontal scaling

• RAM - is generally the limiting factor of vertical scaling

ES is generally excellent with its parameters

Page 29: From Lucene to Elasticsearch, a short explanation of horizontal scalability

Clustering

Health

• Redundancy - auto-balance shards for best possible HA

• Timing - Warmup and Commit points

• Latency - Result merging (especially on remote aggregations)