Elasticsearch Architecture & What's New in Version 5
ELASTICSEARCH ARCHITECTURE & WHAT’S NEW IN VERSION 5
H. BURAK TUNGUT
SOFTWARE ARCHITECT
03.02.2017
WHAT’S NEW IN ELASTICSEARCH 5
• New Data Structures
• Indexing Performance
• Ingest Node
• Painless Scripting
NEW DATA STRUCTURES
• Multi Dimensional Points
• Text & Keyword
Multi Dimensional Points
• Based on the k-d tree (a solution for range search and nearest-neighbor search)
• Support for byte[], IPv6, BigInteger, BigDecimal, 2D and higher dimensions.
• Allows points with up to 8 dimensions (versus 1) and up to 16 bytes (versus 8 bytes) per dimension.
• 36% faster at querying, 71% faster at indexing, 66% less disk and 85% less memory consumption.
• New numeric types: half_float and scaled_float
k-d Tree
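To make the range-search idea concrete, here is a minimal sketch of a textbook k-d tree in Python. It is not Lucene's implementation (Lucene uses a block k-d tree variant that is laid out for disk); the names Node, build and range_search are chosen for illustration only.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class Node:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point = point
        self.axis = axis      # dimension this node splits on
        self.left = left      # points with a smaller or equal value on that axis
        self.right = right    # points with a larger or equal value on that axis


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split the points on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def range_search(node: Optional[Node], low: Point, high: Point) -> List[Point]:
    """Return every point p with low[i] <= p[i] <= high[i] on all axes."""
    if node is None:
        return []
    found = []
    if all(lo <= c <= hi for lo, c, hi in zip(low, node.point, high)):
        found.append(node.point)
    split = node.point[node.axis]
    # Descend only into the sides that the query box can overlap.
    if low[node.axis] <= split:
        found += range_search(node.left, low, high)
    if high[node.axis] >= split:
        found += range_search(node.right, low, high)
    return found
```

The pruning in range_search is what makes range queries cheap: whole subtrees are skipped when the query box lies entirely on one side of the split value.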
Text & Keyword
• The old string type caused problems when the same field served different use cases.
• It is now split into text and keyword on the same field.
• Want to do full-text search? Use the foo path.
• Want to do exact match or aggregation? Use the foo.keyword path.
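As a sketch of the foo / foo.keyword split, the request bodies below are written as Python dicts. The field name foo comes from the slide; the search terms and the aggregation name by_foo are made up for illustration.

```python
import json

# Mapping for the field "foo": analyzed text plus a non-analyzed keyword sub-field.
mapping = {
    "properties": {
        "foo": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
        }
    }
}

# Full-text search goes against the analyzed field.
full_text_query = {"query": {"match": {"foo": "quick brown fox"}}}

# Exact match goes against the keyword sub-field.
exact_match_query = {"query": {"term": {"foo.keyword": "Quick Brown Fox"}}}

# Aggregations also use the keyword sub-field.
aggregation = {
    "size": 0,
    "aggs": {"by_foo": {"terms": {"field": "foo.keyword"}}},
}

if __name__ == "__main__":
    for body in (mapping, full_text_query, exact_match_query, aggregation):
        print(json.dumps(body, indent=2))
```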
Indexing Performance
• Concurrent update performance improvements
• Reduced locking around fsync and the translog
• Async fsync support
• 25% to 80% indexing improvement, depending on the use case
Ingest Node
• %{IP:CLIENT} %{WORD:METHOD} %{URIPATHPARAM:REQUEST} %{NUMBER:BYTES} %{NUMBER:DURATION}
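The grok expression above names each piece of a log line. The sketch below imitates that behaviour with plain regular expressions to show what the processor extracts; the four regexes are simplified stand-ins for the real grok definitions, and the sample log line is an assumption, not taken from the slides.

```python
import re

# Simplified stand-ins for the grok patterns used on the slide.
# Assumption: the real grok definitions are stricter than these.
GROK = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "URIPATHPARAM": r"/\S*",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

SLIDE_PATTERN = ("%{IP:CLIENT} %{WORD:METHOD} %{URIPATHPARAM:REQUEST} "
                 "%{NUMBER:BYTES} %{NUMBER:DURATION}")


def grok_to_regex(pattern):
    """Turn every %{SYNTAX:NAME} token into a named regex group."""
    def expand(match):
        syntax, name = match.group(1), match.group(2)
        return "(?P<{}>{})".format(name, GROK[syntax])
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, pattern))


def parse_line(line):
    """Return the named fields of a log line, or None if it does not match."""
    match = grok_to_regex(SLIDE_PATTERN).fullmatch(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_line("55.3.244.1 GET /index.html 15824 0.043"))
```

In an ingest pipeline the same expression would go into the patterns list of a grok processor, which then adds CLIENT, METHOD, REQUEST, BYTES and DURATION as fields on the document.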
Painless Scripting
• New scripting language: Painless
• Promoted as fast, safe and secure, and enabled by default
• 4 times as fast as Groovy, JavaScript and Python
• Together with the Reindex API and Ingest Node, a powerful way to manipulate documents
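A small sketch of how a Painless script might be attached to an update and to a reindex request, written as Python dicts. The index names, field names and parameter values are hypothetical, and the use of the "inline" key for the script text is an assumption about the 5.x request format.

```python
def painless_script(source, **params):
    """Build the script clause shared by the update and reindex bodies."""
    # Assumption: 5.x takes the script text under the "inline" key.
    return {"lang": "painless", "inline": source, "params": params}


# Body for an update request: increment a counter on one document.
update_body = {
    "script": painless_script("ctx._source.views += params.count", count=1)
}

# Body for a reindex request: copy documents and rewrite a field on the way.
reindex_body = {
    "source": {"index": "products_v1"},   # hypothetical source index
    "dest": {"index": "products_v2"},     # hypothetical destination index
    "script": painless_script(
        "ctx._source.price = ctx._source.price * params.rate", rate=1.18
    ),
}
```

Passing values through params rather than concatenating them into the script text keeps the script source constant, so it can be compiled once and reused.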
Parent Child vs Nested
• Parent/child types are good for normalization and updating
• Child docs can be searched without the parent
• Nested types are good for search performance
Use nested types if the data can be duplicated; it is the efficient way
Use parent/child types for documents that really must be updatable independently
Architecture
Hierarchy
• Cluster
• Node
• Index
• Types
• Document
Sharding
• About scaling and failover
• Primary Shards (each one is a Lucene instance)
• Default 5 per index
• Executes simultaneously
• Replica Shards (duplication)
• Default 1 per primary shard
• A use case example with 1000 documents with more than one PS and just one PS
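The 1000-document example can be sketched in a few lines of Python. Elasticsearch picks a shard as hash(routing) % number_of_primary_shards, where the routing value defaults to the document id; in the sketch below crc32 is only a stand-in for the real hash function, so the exact per-shard counts are illustrative.

```python
import zlib
from collections import Counter


def shard_for(routing, primary_shards):
    """Pick a primary shard for a routing value (the document id by default)."""
    # Assumption: crc32 stands in for the hash Elasticsearch really uses.
    return zlib.crc32(routing.encode("utf-8")) % primary_shards


def distribute(doc_count, primary_shards):
    """Count how many of doc_count documents land on each primary shard."""
    return Counter(shard_for(str(doc_id), primary_shards)
                   for doc_id in range(doc_count))


if __name__ == "__main__":
    print("5 primary shards:", dict(distribute(1000, 5)))
    print("1 primary shard: ", dict(distribute(1000, 1)))
```

With five primary shards the documents are spread out and a search can run on all shards at the same time; with one primary shard every document lands on shard 0. The modulo also shows why the number of primary shards cannot be changed after index creation without reindexing: a different divisor would send existing ids to different shards.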
DevOps
Memory Optimization
• Default heap size is small (1GB in 2.x, 2GB in 5.x); it must be changed!
• More is better? We have 64GB RAM, should we give 64GB to Elasticsearch?
• More RAM = more in-memory caching = better performance. Agreed!
• But we can get in trouble with Lucene!
• Lucene segments are stored in individual files and are immutable, so they are always ready for caching.
• In most cases Lucene deserves 50% of the total available memory, just like ES.
• (Case of using aggs on analyzed string field)
Do not cross 32GB
• The JVM has a feature called compressed oops (ordinary object pointers)
• Objects are allocated on the heap, and pointers refer to those memory blocks
• In 32 bit systems
• The heap size is limited to 4GB (2^32 bytes)
• We need more! Compressed oops
• In 64 bit systems
• The heap size is limited to 16 exabytes
• It is enough. But 64-bit pointers are larger, so they consume more memory bandwidth and CPU cache.
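The two sizing rules (give half of the RAM to the heap, stay under 32GB) can be combined into a small calculation. This is a sketch only: the function name and the 31GB cap, which leaves a margin below the compressed-oops limit, are assumptions for illustration.

```python
GIB = 2 ** 30

# A 32-bit compressed pointer can address 2**32 objects. With the JVM's
# default 8-byte object alignment that covers 2**32 * 8 bytes of heap.
COMPRESSED_OOPS_LIMIT = 2 ** 32 * 8
assert COMPRESSED_OOPS_LIMIT == 32 * GIB


def recommended_heap_gib(total_ram_gib, cap_gib=31):
    """Half of the RAM for the heap, capped below the compressed-oops limit.

    The other half is left to the operating system for caching Lucene segments.
    """
    return min(total_ram_gib // 2, cap_gib)


if __name__ == "__main__":
    for ram in (8, 16, 64, 128):
        print("{} GB RAM -> {} GB heap".format(ram, recommended_heap_gib(ram)))
```

For the 64GB machine from the question above, this gives a heap of 31GB rather than 64GB, and the remaining memory is used for the filesystem cache.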
Build and Run ES in Docker
• docker network create es-net
• docker run --rm -p 9200:9200 -p 9300:9300 --name=es0 --network=es-net elasticsearch:latest -E cluster.name=burak -E network.host=172.18.0.2 -E node.name=node0 -E discovery.zen.ping.unicast.hosts="172.18.0.3:9300"
• docker run --rm -p 9201:9200 -p 9301:9300 --name=es1 --network=es-net elasticsearch:latest -E cluster.name=burak -E network.host=172.18.0.3 -E node.name=node1
Thread Pool
• Types
• Fixed
• Scaling
• Size
• Queue Size
• Processor limits
• Generic : scaling
• Index : one thread per available processor, queue size 200
• Search : (3 * available processors) / 2 + 1 threads, queue size 1000
• Get : one thread per available processor, queue size 1000
• ...
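The sizing formulas above can be worked through for a concrete machine. The sketch below only reproduces the numbers from the slide; the function name default_pools is made up for illustration.

```python
def default_pools(processors):
    """Default sizes of the fixed thread pools listed on the slide."""
    return {
        "index": {"size": processors, "queue_size": 200},
        # Integer division, as in (3 * processors) / 2 + 1.
        "search": {"size": (3 * processors) // 2 + 1, "queue_size": 1000},
        "get": {"size": processors, "queue_size": 1000},
    }


if __name__ == "__main__":
    for name, pool in default_pools(8).items():
        print(name, pool)
```

On an 8-core node this gives 8 index threads, 13 search threads and 8 get threads. Requests that arrive while all threads are busy wait in the queue, and are rejected once the queue is full.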
Shard Allocation
• Not detailed in this presentation
• cluster.routing.allocation.node_concurrent_incoming_recoveries
• cluster.routing.allocation.node_concurrent_outgoing_recoveries
• cluster.routing.allocation.disk.watermark.low
• cluster.info.update.interval
• ...
Monitoring
• http://localhost:9200/_cluster/stats
• http://localhost:9200/_nodes/stats
• http://localhost:9200/product_season/_stats
• Marvel | X-Pack
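The stats endpoints above return JSON, so they are easy to poll from a script. A minimal sketch using only the Python standard library follows; the helper names are made up, and fetch_stats needs a node listening on localhost:9200.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:9200"


def stats_urls(index, base_url=BASE_URL):
    """Cluster, node and index stats URLs, in the order listed on the slide."""
    return [
        base_url + "/_cluster/stats",
        base_url + "/_nodes/stats",
        "{}/{}/_stats".format(base_url, index),
    ]


def fetch_stats(url, timeout=5.0):
    """GET one stats endpoint and decode its JSON body (needs a running node)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    for url in stats_urls("product_season"):
        print(url, "->", sorted(fetch_stats(url)))
```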
Query Examples
Full Text Search
• Match
• Match Phrase
• Match Phrase Prefix
• Match All
• Common Terms (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html)
• Query String (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
Term Level Queries
• Term
• Range
• Prefix
• Wildcard
• Regexp
• Fuzziness (Levenshtein distance)
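Fuzziness is expressed as an edit distance. As an illustration, here is a plain Levenshtein distance in Python next to a fuzzy query body; the field name name and the search term are hypothetical, and the function is a textbook version, not the algorithm Lucene runs internally.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# A fuzzy query that accepts terms within two edits of the search term.
fuzzy_query = {
    "query": {
        "fuzzy": {
            "name": {"value": "elastik", "fuzziness": 2}
        }
    }
}

if __name__ == "__main__":
    print(levenshtein("elastik", "elastic"))
```

With fuzziness 2, a term such as elastic (one substitution away from elastik) would be a candidate match.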
Compound Queries
• Constant score
• Bool query (must-should-should with boosting)
• Function score (sum, multiply, max | min_score)
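A sketch of the must-should-should shape of a bool query with boosting, written as a Python dict. The field names and search terms are hypothetical.

```python
bool_query = {
    "query": {
        "bool": {
            # Documents have to match this clause.
            "must": [
                {"match": {"title": "elasticsearch"}}
            ],
            # Optional clauses: each match raises the score.
            "should": [
                # The boost makes this clause count twice as much.
                {"match": {"tags": {"query": "architecture", "boost": 2.0}}},
                {"match": {"tags": "devops"}},
            ],
        }
    }
}
```

A document that matches only the must clause is still returned; matching the should clauses moves it up in the ranking, with the boosted clause contributing more.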
Joining Queries
• Nested Query
• Child / Parent Queries
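The joining queries can be sketched as Python dicts as well. All names here are hypothetical: a nested field comments on a post document, and a child type comment whose parent type is post.

```python
# Nested query: search inside the nested objects stored with each document.
nested_query = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {"match": {"comments.text": "great"}},
        }
    }
}

# has_child: return parent documents that have at least one matching child.
has_child_query = {
    "query": {
        "has_child": {
            "type": "comment",
            "query": {"match": {"text": "great"}},
        }
    }
}

# has_parent: return child documents whose parent matches.
has_parent_query = {
    "query": {
        "has_parent": {
            "parent_type": "post",
            "query": {"match": {"title": "elasticsearch"}},
        }
    }
}
```

The nested query reads data that is stored together with the parent document, which is why it is fast; the child and parent queries join separate documents at search time, which costs more but allows them to be updated independently.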