Elasticsearch Architecture & What's New in Version 5

ELASTICSEARCH ARCHITECTURE & WHAT’S NEW IN VERSION 5

H. BURAK TUNGUT

SOFTWARE ARCHITECT

03.02.2017

WHAT’S NEW IN ELASTICSEARCH 5

• New Data Structures

• Indexing Performance

• Ingest Node

• Painless Scripting

NEW DATA STRUCTURES

• Multi Dimensional Points

• Text & Keyword

Multi Dimensional Points

• Based on the k-d tree (a structure for range search and nearest-neighbor search)

• Support for byte[], IPv6, BigInteger, BigDecimal, and 2D or higher-dimensional points

• Allows up to 8 dimensions (versus 1) and up to 16 bytes (versus 8 bytes) per dimension

• 36% faster at querying, 71% faster at indexing, 66% less disk usage and 85% less memory consumption

• New numeric field types: half_float and scaled_float

k-d Tree
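
The idea behind the k-d tree can be sketched in a few lines. This is a minimal in-memory illustration of building a tree and running a range search, not Lucene's implementation (Lucene uses a disk-based block k-d variant). The names `Node`, `build` and `range_search` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class Node:
    """One k-d tree node: a point plus the axis it splits on."""

    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Split on the median, cycling through the axes level by level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def range_search(node: Optional[Node], lo: Point, hi: Point,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every point p with lo[i] <= p[i] <= hi[i] on all axes."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    # Descend only into the sides that can still overlap the query box.
    if lo[a] <= node.point[a]:
        range_search(node.left, lo, hi, out)
    if node.point[a] <= hi[a]:
        range_search(node.right, lo, hi, out)
    return out


tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
hits = sorted(range_search(tree, (3, 1), (8, 5)))
```

The pruning step is what makes range queries cheap: whole subtrees are skipped when the splitting coordinate already falls outside the query box.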

Text & Keyword

• The old string type caused problems when one field had to serve different use cases

• It is now split into text and keyword, both available on the same field

• Want full-text search? Use the foo path.

• Want exact match or aggregation? Use the foo.keyword path.
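
As a rough illustration of the two paths, the request bodies could be built like this. The field name `foo` comes from the slide; the helper function names and the aggregation name `by_value` are hypothetical, and the mapping shown is the shape Elasticsearch 5 generates for a dynamically mapped string.

```python
# Mapping for a string field that is both analyzed (text) and exact (keyword).
FOO_MAPPING = {
    "properties": {
        "foo": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
        }
    }
}


def full_text_query(field: str, text: str) -> dict:
    """Analyzed search: goes against the text field itself."""
    return {"query": {"match": {field: text}}}


def exact_match_query(field: str, value: str) -> dict:
    """Exact, non-analyzed match: goes against the keyword sub-field."""
    return {"query": {"term": {field + ".keyword": value}}}


def terms_aggregation(field: str, name: str = "by_value") -> dict:
    """Bucket documents by exact value using the keyword sub-field."""
    return {"size": 0, "aggs": {name: {"terms": {"field": field + ".keyword"}}}}
```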

Indexing Performance

• Concurrent update performance improvements

• Reduced locking during fsync and translog operations

• Async fsync support

• 25% to 80% indexing improvement, depending on the use case

Ingest Node

• %{IP:CLIENT} %{WORD:METHOD} %{URIPATHPARAM:REQUEST} %{NUMBER:BYTES} %{NUMBER:DURATION}
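
To show what a grok pattern like the one above extracts, here is a small Python sketch. The regular expression is a simplified stand-in for the real grok definitions of IP, WORD, URIPATHPARAM and NUMBER, the sample log line is invented for illustration, and the `message` field name in the pipeline body is an assumption.

```python
import re
from typing import Dict, Optional

GROK_PATTERN = (
    "%{IP:CLIENT} %{WORD:METHOD} %{URIPATHPARAM:REQUEST} "
    "%{NUMBER:BYTES} %{NUMBER:DURATION}"
)

# Body for an ingest pipeline that applies the pattern to a "message" field.
PIPELINE_BODY = {
    "description": "parse access log lines",
    "processors": [{"grok": {"field": "message", "patterns": [GROK_PATTERN]}}],
}

# Simplified regex approximation of the grok pattern (IPv4 only).
LINE_RE = re.compile(
    r"(?P<CLIENT>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<METHOD>\w+) "
    r"(?P<REQUEST>\S+) "
    r"(?P<BYTES>\d+(?:\.\d+)?) "
    r"(?P<DURATION>\d+(?:\.\d+)?)"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields, or None when the line does not match."""
    m = LINE_RE.fullmatch(line)
    return m.groupdict() if m else None


fields = parse_line("55.3.244.1 GET /index.html 15824 0.043")
```

In a real cluster the pipeline body would be registered through the ingest pipeline API and the extraction would happen on the ingest node before indexing.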

Painless Scripting

• New scripting language: Painless

• Promoted as fast, safe and secure, and enabled by default

• 4 times faster than Groovy, JavaScript and Python

• Combined with the Reindex API and Ingest Node, a powerful way to manipulate documents
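
A Reindex request that runs a Painless script on every document might look like the sketch below. The destination index name and the `legacy_field` field are hypothetical. The `inline` key is the 5.x name for the script source (later versions renamed it to `source`), so treat the exact key as version dependent.

```python
import json


def reindex_with_script(source_index: str, dest_index: str, painless: str) -> dict:
    """Build a _reindex request body that transforms documents with Painless."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        # "inline" is the 5.x key; newer versions use "source" instead.
        "script": {"lang": "painless", "inline": painless},
    }


body = reindex_with_script(
    "product_season",
    "product_season_v2",            # hypothetical destination index
    "ctx._source.remove('legacy_field')",  # hypothetical field to drop
)
print(json.dumps(body, indent=2))
```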

Parent Child vs Nested

• Parent/child types are good at normalization and updating

• Child docs can be searched without parent

• Nested types are good for search performance

Use nested types if the data can be duplicated; it is the efficient way

Use parent/child types for documents that really must be updated independently
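
The difference shows up in the mappings. The sketch below uses the Elasticsearch 5 mapping layout with types; the type names `post` and `comment` and the field names are hypothetical.

```python
# Nested: comments live inside the parent document and are indexed with it.
NESTED_MAPPING = {
    "mappings": {
        "post": {
            "properties": {
                "title": {"type": "text"},
                "comments": {
                    "type": "nested",
                    "properties": {"author": {"type": "keyword"},
                                   "body": {"type": "text"}},
                },
            }
        }
    }
}

# Parent/child: comments are separate documents that point at a parent type,
# so a comment can be added or updated without reindexing the post.
PARENT_CHILD_MAPPING = {
    "mappings": {
        "post": {"properties": {"title": {"type": "text"}}},
        "comment": {
            "_parent": {"type": "post"},
            "properties": {"author": {"type": "keyword"},
                           "body": {"type": "text"}},
        },
    }
}
```

With the nested layout, changing one comment means reindexing the whole post; with parent/child, only the child document changes, at the cost of slower joins at query time.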

Architecture

Hierarchy

• Cluster

• Node

• Index

• Types

• Document

Sharding

• Sharding is about scaling and failover

• Primary shards (each one is a Lucene instance)

• Default 5 per index

• Shards execute a query in parallel

• Replica shards (copies of a primary)

• Default 1 per primary shard

• Example use case: 1000 documents with more than one primary shard (PS) versus just one primary shard
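
The 1000-document example can be simulated with the routing rule `shard = hash(routing) % number_of_primary_shards`. Elasticsearch hashes the routing value with Murmur3; this sketch substitutes `zlib.crc32` as a stand-in hash, so the exact shard numbers will differ from a real cluster. The function names are hypothetical.

```python
import zlib
from collections import Counter
from typing import Iterable


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a primary shard from the routing value (stand-in hash: CRC32)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def distribute(doc_ids: Iterable[str], num_primary_shards: int) -> Counter:
    """Count how many documents land on each primary shard."""
    return Counter(shard_for(doc_id, num_primary_shards) for doc_id in doc_ids)


doc_ids = [str(i) for i in range(1000)]
five_shards = distribute(doc_ids, 5)   # documents spread over shards 0..4
one_shard = distribute(doc_ids, 1)     # everything lands on shard 0
print(dict(five_shards), dict(one_shard))
```

With five primary shards each shard searches only its share of the documents in parallel; with one shard a single Lucene instance scans all 1000. The modulo also shows why the number of primary shards cannot be changed after index creation without reindexing.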

DevOps

Memory Optimization

• The default heap size is small (the jvm.options file shipped with 5.x sets 2GB); it must be changed!

• More is better? We have 64GB RAM, should we give 64GB to Elasticsearch?

• More heap = more in-memory caching = better performance; that much is true!

• But we can get in trouble with Lucene!

• Lucene segments are stored in individual files and are immutable, so they are always ready to be cached by the OS

• In most cases Lucene deserves 50% of the total available memory, the same share as the ES heap

• (The heap matters most when running aggs on analyzed string fields)

Do Not Cross 32GB

• The JVM has a feature called compressed oops (ordinary object pointers)

• Objects are allocated on the heap, and pointers reference those memory blocks

• In 32-bit systems

• The heap size is limited to 4GB (2^32 bytes)

• We need more!

• In 64-bit systems

• The heap size is limited to 16 exabytes (2^64 bytes)

• That is more than enough, but 64-bit pointers waste memory bandwidth and CPU cache

• Compressed oops keep pointers at 32 bits by addressing 8-byte-aligned objects instead of bytes, which covers heaps up to about 32GB; above that the JVM falls back to 64-bit pointers
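
The limits above are simple arithmetic. The sketch assumes the JVM's default 8-byte object alignment, which is what lets a 32-bit compressed pointer reach about 32GB.

```python
GIB = 2 ** 30
EIB = 2 ** 60

# A plain 32-bit pointer addresses individual bytes.
plain_32bit_limit = 2 ** 32

# A compressed oop addresses 8-byte-aligned objects, not bytes
# (assumption: default object alignment of 8 bytes).
object_alignment = 8
compressed_oops_limit = 2 ** 32 * object_alignment

# A plain 64-bit pointer addresses individual bytes again.
plain_64bit_limit = 2 ** 64

print(plain_32bit_limit // GIB, "GiB with 32-bit pointers")
print(compressed_oops_limit // GIB, "GiB with compressed oops")
print(plain_64bit_limit // EIB, "EiB with 64-bit pointers")
```

A heap just above the compressed oops limit can hold less usable data than one just below it, because every pointer doubles in size.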

Build and Run ES in Docker

• docker network create es-net

• docker run --rm -p 9200:9200 -p 9300:9300 --name=es0 --network=es-net elasticsearch:latest -E cluster.name=burak -E network.host=172.18.0.2 -E node.name=node0 -E discovery.zen.ping.unicast.hosts="172.18.0.3:9300"

• docker run --rm -p 9201:9200 -p 9301:9300 --name=es1 --network=es-net elasticsearch:latest -E cluster.name=burak -E network.host=172.18.0.3 -E node.name=node1

Thread Pool

• Types

• Fixed

• Scaling

• Size

• Queue Size

• Processor limits

• Generic: scaling

• Index: fixed, size = # of available processors, queue size 200

• Search: fixed, size = ((3 * # of available processors) / 2) + 1, queue size 1000

• Get: fixed, size = # of available processors, queue size 1000

• ...
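
The sizing rules listed above can be written out as a small calculation. This is only a sketch of the formulas on this slide, using integer division for the search pool; the function name is hypothetical and the generic (scaling) pool is left out because it has no fixed size.

```python
def default_thread_pools(available_processors: int) -> dict:
    """Default sizes of the fixed pools listed above for a given CPU count."""
    n = available_processors
    return {
        "index": {"type": "fixed", "size": n, "queue_size": 200},
        "search": {"type": "fixed", "size": (3 * n) // 2 + 1, "queue_size": 1000},
        "get": {"type": "fixed", "size": n, "queue_size": 1000},
    }


pools = default_thread_pools(8)
for name, cfg in pools.items():
    print(name, cfg["size"], cfg["queue_size"])
```

Once a fixed pool's threads are busy and its queue is full, further requests are rejected, which is why queue size matters as much as thread count.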

Shard Allocation

• Not detailed in this presentation

• cluster.routing.allocation.node_concurrent_incoming_recoveries

• cluster.routing.allocation.node_concurrent_outgoing_recoveries

• cluster.routing.allocation.disk.watermark.low

• cluster.info.update.interval

• ...

Monitoring

• http://localhost:9200/_cluster/stats

• http://localhost:9200/_nodes/stats

• http://localhost:9200/product_season/_stats

• Marvel | X-Pack
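
The stats endpoints return JSON, so they are easy to poll from a script. A minimal sketch using only the standard library follows; it assumes a node listening on localhost:9200 as in the URLs above, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def stats_urls(base: str = "http://localhost:9200",
               index: str = "product_season") -> dict:
    """Build the three stats URLs listed above."""
    return {
        "cluster": base + "/_cluster/stats",
        "nodes": base + "/_nodes/stats",
        "index": base + "/" + index + "/_stats",
    }


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a stats endpoint and decode the JSON body (needs a running node)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


urls = stats_urls()
# Example, only meaningful against a live cluster:
# cluster_stats = fetch_json(urls["cluster"])
```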

Query Examples

Full Text Search

• Match

• Match Phrase

• Match Phrase Prefix

• Match All

• Common Terms (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html)

• Query String (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)

Term Level Queries

• Term

• Range

• Prefix

• Wildcard

• Regexp

• Fuzzy (Levenshtein edit distance)
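
For reference, the Levenshtein distance behind fuzzy matching counts the single-character insertions, deletions and substitutions needed to turn one string into another. Below is a plain dynamic-programming sketch, not the automaton-based approach a search engine uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

A fuzzy query with a maximum edit distance of 1 would match "elastic" for the misspelled input "elastik", but not "sitting" for "kitten".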

Compound Queries

• Constant score

• Bool query (must / should / must_not, with boosting)

• Function score (sum, multiply, max | min_score)

Joining Queries

• Nested Query

• Child / Parent Queries
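
A few of the query types above, written as request bodies. These are sketches only: the field names (`title`, `brand`, `price`), the helper function names and the boost value are hypothetical.

```python
def bool_search(text: str, brand: str, min_price: float, max_price: float) -> dict:
    """Compound query: full-text match plus term and range filters."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [
                    {"term": {"brand.keyword": brand}},
                    {"range": {"price": {"gte": min_price, "lte": max_price}}},
                ],
                # Optional clause: exact phrase matches score higher.
                "should": [
                    {"match_phrase": {"title": {"query": text, "boost": 2.0}}}
                ],
            }
        }
    }


def nested_search(path: str, field: str, value: str) -> dict:
    """Joining query over nested objects stored inside the parent document."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match": {path + "." + field: value}},
            }
        }
    }


def has_child_search(child_type: str, field: str, value: str) -> dict:
    """Joining query: return parents that have a matching child document."""
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"match": {field: value}},
            }
        }
    }
```

Clauses under `filter` do not contribute to the score and can be cached, so exact-value and range conditions usually belong there rather than under `must`.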
