elasticsearch data analyses

Elasticsearch Timed Data Analyses, by Alaa Elhadba (@aelhadba)


Post on 13-Jan-2017


Page 1: Elasticsearch Data Analyses

Elasticsearch

Elasticsearch Timed Data Analyses

By Alaa Elhadba (@aelhadba)

Page 2: Elasticsearch Data Analyses

Table of Contents

- Hot-Cold Architecture

- Data High Availability

- Data design at large scale

- Search Execution

- Time framed indices

- Aggregations

Page 3: Elasticsearch Data Analyses

Hot-Cold Architecture

Page 4: Elasticsearch Data Analyses

Hot-Cold Architecture

Hot Data Nodes

- Perform indexing
- Hold the most recent data
- Use SSD storage, because writing is an I/O-intensive operation

Cold Data Nodes

- Handle read-only operations
- Can use large spinning disks

Page 5: Elasticsearch Data Analyses

Hot-Cold Configuration

elasticsearch.yml on a hot node:

node.box_type: hot

elasticsearch.yml on a cold node:

node.box_type: cold

[Diagram: two nodes, one tagged hot and one tagged cold, each holding one shard]
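A minimal sketch of how the `box_type` tag is typically put to use, assuming the Elasticsearch 2.x attribute syntax shown on the slide (5.x and later spell the node attribute `node.attr.box_type`). The index names and the helper function are hypothetical; `index.routing.allocation.require.box_type` is the standard index-level allocation filter for a custom node attribute.

```python
import json


def allocation_settings(box_type):
    """Index settings that pin an index's shards to nodes tagged with box_type."""
    return {"index.routing.allocation.require.box_type": box_type}


# Body for creating a fresh index on hot nodes
# (PUT /logs-2016.06.22, hypothetical index name).
create_body = {"settings": allocation_settings("hot")}

# Body for retagging an aged index so its shards migrate to cold nodes
# (PUT /logs-2016.06.15/_settings, hypothetical index name).
move_body = allocation_settings("cold")

print(json.dumps(create_body))
print(json.dumps(move_body))
```

Updating the setting on a live index is enough to trigger the move; no reindexing is involved.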

Page 6: Elasticsearch Data Analyses

Data Availability

Availability Zone 1

Availability Zone 2

Page 8: Elasticsearch Data Analyses

Data Availability

Availability Zone 1

Availability Zone 2

Availability Zone / rack failure? Shard Allocation Awareness

Page 9: Elasticsearch Data Analyses

Shard Allocation Awareness

Availability Zone 1

Availability Zone 2

[Diagram: copies of shards 1, 2, and 3 spread across the nodes of the two availability zones]


Page 15: Elasticsearch Data Analyses

Shard Allocation Awareness

node.rack_id: rack_1 (nodes in Availability Zone 1)

node.rack_id: rack_2 (nodes in Availability Zone 2)

cluster.routing.allocation.awareness.attributes: rack_id (all nodes)

● Replicas are spread across AZs

● No two copies of the same shard are placed on the same rack

● Elasticsearch is fully aware of the shard distribution

● Awareness can be set at the cluster or index level

● Elasticsearch prefers local shards, i.e. shards in the same awareness group

● Always balance your nodes across AZs

● Routing allocation awareness can be updated on a live cluster

● Use forced awareness to avoid the extra load of reallocating missing shards

Make sure you can handle the load with fewer nodes!
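A minimal sketch of updating awareness on a live cluster through the cluster update settings API. It assumes every node already carries a node attribute, hypothetically named `rack_id` here (for example `node.rack_id: rack_1` in 2.x, `node.attr.rack_id` in 5.x and later); the awareness setting lists attribute names, not attribute values.

```python
import json


def awareness_settings(attributes):
    """Body for PUT /_cluster/settings enabling awareness on the given node attributes."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": ",".join(attributes)
        }
    }


# "rack_id" is an assumed attribute name, not taken from the slides.
body = awareness_settings(["rack_id"])
print(json.dumps(body, indent=2))
```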

Page 16: Elasticsearch Data Analyses

Forced Awareness

● Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.

● Avoids the extra load of reallocating unassigned shards after a rack failure.

● Leaves no single point of failure in your system.

● Make sure you can handle the load with fewer nodes.

cluster.routing.allocation.awareness.force.zone.values: zone1,zone2

cluster.routing.allocation.awareness.attributes: zone
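To make the difference concrete, here is a toy model (not Elasticsearch's allocator) of where the copies of a single shard end up when a zone fails. The function and zone names are illustrative assumptions.

```python
def allocate(copies, live_zones, forced_zones=None):
    """Toy model: spread `copies` of one shard over the live zones.

    Without forced awareness the copies are divided among the live zones only.
    With forced awareness each zone in forced_zones keeps its fair share,
    so the share of a dead zone stays unassigned.
    """
    zones = forced_zones if forced_zones else live_zones
    per_zone = -(-copies // len(zones))  # ceiling division
    placement = {}
    remaining = copies
    for zone in live_zones:
        take = min(per_zone, remaining)
        placement[zone] = take
        remaining -= take
    return placement, remaining  # remaining = unassigned copies


# One primary plus one replica, zone2 has failed.
print(allocate(2, ["zone1"]))                      # plain awareness
print(allocate(2, ["zone1"], ["zone1", "zone2"]))  # forced awareness
```

In the forced case the replica simply waits for zone2 to return, which is why the surviving zone must be able to carry the full load on its own.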

Page 17: Elasticsearch Data Analyses

Data design at large scale

Page 18: Elasticsearch Data Analyses

Searching

[Diagram: a query is sent to Shards 1 to 4, each on its own node, and their responses are merged into one result]

How to avoid asking all shards? Routing

I know my shards!

Page 21: Elasticsearch Data Analyses

Routing

PUT my_index/my_type/my_id?routing=shard1

GET my_index/_search?routing=shard1,shard2

● Avoid calling all shards
● Dedicated shards per purpose
● Talk to one dedicated shard
● Cut network traffic
● Better performance
● Handle sharding on your own

But: once in, never out

● Routing must always be specified
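A minimal sketch of why routing is "once in, never out". Elasticsearch picks the shard as `hash(routing) % number_of_primary_shards`; the routing value is an arbitrary string (the `shard1` in the requests above is a routing key, not a shard name). Elasticsearch hashes with Murmur3 internally; `zlib.crc32` below is only a stand-in, so the shard numbers will not match a real cluster, and the customer keys are hypothetical.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """Map a routing value to a shard number (stand-in hash, see note above)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


for key in ("customer_42", "customer_43", "customer_42"):
    print(key, "->", shard_for(key, 4))
```

The same key always lands on the same shard, so a document indexed with a routing value can only be found cheaply by supplying that same value again.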

Page 23: Elasticsearch Data Analyses

Routing

[Diagram: daily indices 21.06.2016, 20.06.2016, and 19.06.2016, each with numbered shards]

I MUST KNOW EVERYTHING!

Page 25: Elasticsearch Data Analyses

Talking to data

Page 26: Elasticsearch Data Analyses

Aliasing

[Diagram: daily indices 19.06.2016 to 22.06.2016, each with shards 1 to 3, behind the aliases today, yesterday, and 3_days_ago; the aliases shift when the 22.06.2016 index is added]

I MUST KNOW! It’s better performance

It’s a Data Problem!

Page 31: Elasticsearch Data Analyses

Aliasing + Routing

[Diagram: the daily indices 22.06.2016, 21.06.2016, and 20.06.2016 behind the aliases today, yesterday, and 3_days_ago, plus the aliases today_returns and recent_returns]

It’s a Data Problem!
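A minimal sketch of how such aliases could be maintained with the `_aliases` API. The index names, the `type` field, and the routing value are hypothetical; the `add`/`remove` actions and the `filter` and `routing` options on an alias are standard.

```python
import json


def move_alias(alias, old_index, new_index):
    """Actions that move an alias from one index to another in one atomic call."""
    return [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]


def routed_alias(index, alias, routing, field, value):
    """Action for an alias that carries both a filter and a routing value."""
    return {
        "add": {
            "index": index,
            "alias": alias,
            "routing": routing,
            "filter": {"term": {field: value}},
        }
    }


# Body for POST /_aliases when the 22.06.2016 index goes live.
body = {
    "actions": move_alias("today", "orders-2016.06.21", "orders-2016.06.22")
    + move_alias("yesterday", "orders-2016.06.20", "orders-2016.06.21")
    + [routed_alias("orders-2016.06.22", "today_returns", "returns", "type", "return")]
}
print(json.dumps(body, indent=2))
```

The application keeps querying `today` or `today_returns` and never needs to know index names or routing keys.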

Page 32: Elasticsearch Data Analyses

Aliasing + Routing + Search

[Diagram: an alias over two indices, a shard within an index, and a slice of that shard]

Page 33: Elasticsearch Data Analyses

Search Execution Preference

By default, Elasticsearch chooses among the copies of each shard (primary and replicas) in round-robin fashion, so every copy is queried about equally often.

_primary                Query only primary shards (latest info from the index, or to optimize for the write path)

_primary_first          Query the primary first, if available

_replica                Query replica shards only

_replica_first          Query a replica first, if available

_local                  Prefer shards available on the current node

_only_node:node_id      Query only a specific node

_only_nodes:*           Query only a set of nodes

_prefer_node:node_id    Query a preferred node

_shards:1,3             Query specific shards, optionally with a preference, e.g. _shards:1,3;_local

GET _search?preference=_replica
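A small sketch of building a search request path with a preference value, using only the Python standard library. The helper name is an assumption, and the set of accepted preference values depends on the Elasticsearch version (some of the values listed above are not available in recent releases).

```python
from urllib.parse import urlencode


def search_path(index, preference):
    """Path and query string for a search pinned to a shard preference."""
    return "/{}/_search?{}".format(index, urlencode({"preference": preference}))


print(search_path("my_index", "_replica"))
print(search_path("my_index", "_shards:1,3;_local"))
```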

Page 34: Elasticsearch Data Analyses

Time Framed Indices

Page 35: Elasticsearch Data Analyses

Data Flow

Hot ➔ Cold ➔ Closed ➔ Backed up ➔ Trashed (along the time axis)

Page 36: Elasticsearch Data Analyses

Closing/Opening Index

➔ Closing an index

◆ Removes all shard allocations from the cluster
◆ But keeps the index data around
◆ Helps reduce the resources used on the cluster
◆ Consumes only disk space

➔ Opening an index

◆ Reopens a closed index
◆ Note: this is not a millisecond operation; opening an index can take from a few seconds to a couple of minutes
◆ Flushing before closing reduces the opening time
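The flush-then-close order can be sketched as a list of REST calls. The helper names and the index name are hypothetical; `_flush`, `_close`, and `_open` are the standard index endpoints.

```python
def close_sequence(index):
    """REST calls, as (method, path) pairs, to flush an index and then close it."""
    return [
        ("POST", "/{}/_flush".format(index)),
        ("POST", "/{}/_close".format(index)),
    ]


def open_call(index):
    """REST call that reopens a closed index."""
    return ("POST", "/{}/_open".format(index))


for method, path in close_sequence("logs-2016.06.01") + [open_call("logs-2016.06.01")]:
    print(method, path)
```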

Page 37: Elasticsearch Data Analyses

Index Templates

- order lets a template override templates with a lower order

- settings lets you change the scaling of new indices (for example shard counts) at any time

- aliases can be attached to an index when it is created

Page 38: Elasticsearch Data Analyses

Index Templates
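A minimal sketch of a template body that uses all three features, written in the 2.x syntax that matches the rest of this deck (the `template` key became `index_patterns` in later versions). The template name, index pattern, alias name, and shard counts are illustrative assumptions.

```python
import json

# Body for PUT /_template/logs_template (hypothetical template name).
template = {
    "template": "logs-*",  # 2.x key; later versions use "index_patterns"
    "order": 1,            # applied after, and overriding, order-0 templates
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        # New indices start on hot nodes.
        "index.routing.allocation.require.box_type": "hot",
    },
    "aliases": {"recent_logs": {}},
}
print(json.dumps(template, indent=2))
```

Changing `number_of_shards` in the template affects only indices created afterwards, which is what makes time-framed indices easy to rescale.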

Page 39: Elasticsearch Data Analyses

Time framed indices lifecycle

1. Use index templates to generate mappings for new indices
2. Use aliases to decouple your application from data logic
3. Use hot nodes for fresh data
4. Move old data to cold nodes
5. Close old indices before deletion
6. Change your time frame at any point to scale (monthly, weekly, ...)
7. Use routing if you have too many shards in a big cluster
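The lifecycle above can be sketched as a function of index age. The thresholds, the index prefix, and the naming scheme are assumptions chosen for illustration; a real deployment would pick its own.

```python
from datetime import date


def index_name(prefix, day):
    """Daily index name such as logs-2016.06.22."""
    return "{}-{:%Y.%m.%d}".format(prefix, day)


def lifecycle_stage(index_day, today, hot_days=2, cold_days=14, closed_days=30):
    """Stage an index should be in, given its age in days (thresholds are assumed)."""
    age = (today - index_day).days
    if age < hot_days:
        return "hot"
    if age < cold_days:
        return "cold"
    if age < closed_days:
        return "closed"
    return "delete"


today = date(2016, 6, 22)
for day in (date(2016, 6, 22), date(2016, 6, 15), date(2016, 6, 1), date(2016, 5, 1)):
    print(index_name("logs", day), lifecycle_stage(day, today))
```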

Page 41: Elasticsearch Data Analyses

Aggregations

Page 42: Elasticsearch Data Analyses

Aggregation Types

Buckets, Metrics, Pipeline

Page 43: Elasticsearch Data Analyses

Nested Bucket Aggregations

Page 44: Elasticsearch Data Analyses

Aggregation Query

- Query part: better caching, fetches only the relevant documents

- Outer aggregation: first segmentation

- Sub-aggregation: nested segmentation
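A minimal sketch of a search body with this shape: a filter that narrows the documents, an outer bucket aggregation, and a nested one. All field names are hypothetical, and `interval` is the 2.x spelling (later versions use `calendar_interval`).

```python
import json

search_body = {
    "size": 0,  # only the aggregation results are wanted, not the hits
    "query": {
        "bool": {
            # Filters are cacheable and fetch only the relevant documents.
            "filter": [
                {"term": {"status": "returned"}},
                {"range": {"created_at": {"gte": "now-7d/d"}}},
            ]
        }
    },
    "aggs": {
        "per_day": {  # first segmentation
            "date_histogram": {"field": "created_at", "interval": "day"},
            "aggs": {
                "per_category": {  # nested segmentation
                    "terms": {"field": "category"}
                }
            },
        }
    },
}
print(json.dumps(search_body, indent=2))
```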

Page 46: Elasticsearch Data Analyses

Doc Values

- Why do we need this?
  - Sorting, aggregations, some scripting

- Doc values
  - Build a columnar-style data structure on disk
  - Created at indexing time, stored as part of the segment
  - Read like other pieces of the Lucene index
  - Don't take up heap space
  - Use the file system cache
  - Default for not_analyzed string and numeric fields in 2.0+

Page 47: Elasticsearch Data Analyses

Raw Fields

- Use customer_name.raw for aggregations

- Use customer_name for search
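A minimal sketch of a multi-field mapping that provides both variants, in the 2.x syntax used elsewhere in this deck (in 5.x and later the equivalent is a `text` field with a `keyword` sub-field). The search term and aggregation name are made up.

```python
import json

# Mapping fragment: analyzed field for search, not_analyzed sub-field for aggregations.
mapping = {
    "properties": {
        "customer_name": {
            "type": "string",
            "fields": {
                "raw": {"type": "string", "index": "not_analyzed"}
            },
        }
    }
}

# Search on the analyzed field, aggregate on the raw one.
search_body = {
    "size": 0,
    "query": {"match": {"customer_name": "smith"}},
    "aggs": {"top_customers": {"terms": {"field": "customer_name.raw"}}},
}
print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

Aggregating on the analyzed field would bucket by individual tokens ("john", "smith") rather than by whole names.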

Page 49: Elasticsearch Data Analyses

Metrics Aggregations

- Avg Aggregation

- Cardinality Aggregation

- Extended Stats Aggregation

- Max Aggregation

- Min Aggregation

- Percentiles Aggregation

- Percentile Ranks Aggregation

- Scripted Metric Aggregation

- Stats Aggregation

- Sum Aggregation

- Top hits Aggregation

- Value Count Aggregation

Page 50: Elasticsearch Data Analyses

Extended Stats Aggregation
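A rough sketch in plain Python of the figures an extended stats aggregation reports, assuming population variance and bounds at two standard deviations. The sample values are made up.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute the main extended-stats figures for a list of numbers."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    variance = max(0.0, sum_of_squares / count - avg * avg)  # population variance
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats([1, 2, 3, 4])
print(stats)
```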

Page 51: Elasticsearch Data Analyses

Aggregation Search

[Diagram: an aggregation query is sent to Shards 1 to 4, each on its own node, and their partial results are merged into one result]

Page 52: Elasticsearch Data Analyses

Scripted Metric Aggregation

- init_script: Executed first. Allows initialization of variables.
- map_script: Executed once per document collected.
- combine_script: Executed once on each shard after document collection is complete.
- reduce_script: Executed once on the coordinating node after all shards have returned their results.
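The four phases can be simulated in plain Python to show where each one runs. This is a sketch of the control flow only, not of Elasticsearch's scripting API; the documents and the profit calculation are illustrative assumptions.

```python
def scripted_metric(shards):
    """Simulate init/map/combine/reduce over a list of shards (lists of documents)."""
    partials = []
    for docs in shards:
        state = {"amounts": []}  # init_script: runs first on each shard
        for doc in docs:         # map_script: runs once per collected document
            signed = doc["amount"] if doc["type"] == "sale" else -doc["amount"]
            state["amounts"].append(signed)
        partials.append(sum(state["amounts"]))  # combine_script: once per shard
    return sum(partials)         # reduce_script: once on the coordinating node


shards = [
    [{"type": "sale", "amount": 80}, {"type": "cost", "amount": 10}],
    [{"type": "sale", "amount": 130}, {"type": "cost", "amount": 30}],
]
profit = scripted_metric(shards)
print(profit)
```

Only the per-shard partial results cross the network, which is the reason for doing as much work as possible in the combine phase.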

Page 53: Elasticsearch Data Analyses

Buckets Aggregations

- Children Aggregation

- Date Histogram Aggregation

- Date Range Aggregation

- Filter Aggregation

- Filters Aggregation

- Global Aggregation

- Histogram Aggregation

- Missing Aggregation

- Range Aggregation

- Reverse nested Aggregation

- Sampler Aggregation

- Significant Terms Aggregation

- Terms Aggregation

Page 54: Elasticsearch Data Analyses

Date Histogram Aggregation

Page 55: Elasticsearch Data Analyses

Date Range Aggregation

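Don’t forget!

Round your dates (for example now-7d/d rather than now-7d), so that repeated requests are identical and can be served from the cache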
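A minimal sketch of a date range aggregation with rounded date math. The field and aggregation names are hypothetical; the `/d` suffix rounds to the start of the day, so the request stays the same for the whole day.

```python
import json

date_range_agg = {
    "last_week_vs_older": {
        "date_range": {
            "field": "created_at",
            "ranges": [
                {"to": "now-7d/d"},    # everything older than a week
                {"from": "now-7d/d"},  # the last seven days
            ],
        }
    }
}
print(json.dumps({"size": 0, "aggs": date_range_agg}, indent=2))
```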

Page 56: Elasticsearch Data Analyses

Missing Aggregation

Page 57: Elasticsearch Data Analyses

Range Aggregation

Page 58: Elasticsearch Data Analyses

Histogram Aggregation
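A short sketch of how a histogram assigns values to buckets: each value is rounded down to the nearest multiple of the interval. The sample values are made up, and unlike Elasticsearch this sketch does not emit empty buckets.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Key of the bucket a value falls into."""
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval):
    """Document count per bucket key, sorted by key."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return dict(sorted(counts.items()))


prices = [3, 12, 18, 25, 49]
print(histogram(prices, 10))
```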

Page 60: Elasticsearch Data Analyses

Pipeline Aggregations

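Parent

- Receives the output of its parent aggregation and can compute new buckets or new aggregations to add to the existing buckets.

Sibling

- Receives the output of a sibling aggregation and can compute a new aggregation at the same level as that sibling.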

Page 61: Elasticsearch Data Analyses

Sibling Pipeline Aggregations

- min_bucket

- max_bucket

- sum_bucket

- avg_bucket

- stats_bucket

- extended_stats_bucket

- percentiles_bucket

Page 62: Elasticsearch Data Analyses

Average Aggregation
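As one example from the sibling family, here is a sketch of what `avg_bucket` computes: the mean of one metric across the buckets of a sibling aggregation. The request fragment uses hypothetical aggregation names, and the monthly figures are made up.

```python
# Request fragment: averages the "sales" metric over the buckets of "sales_per_month".
avg_bucket_agg = {
    "avg_monthly_sales": {
        "avg_bucket": {"buckets_path": "sales_per_month>sales"}
    }
}


def avg_bucket(buckets, metric):
    """Mean of a metric across buckets, skipping buckets without a value."""
    values = [b[metric] for b in buckets if b.get(metric) is not None]
    return sum(values) / len(values) if values else None


monthly = [
    {"key": "2016-04", "sales": 550.0},
    {"key": "2016-05", "sales": 60.0},
    {"key": "2016-06", "sales": 375.0},
]
print(avg_bucket(monthly, "sales"))
```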

Page 63: Elasticsearch Data Analyses

Parent Pipeline Aggregations

- moving_avg

- derivative

- cumulative_sum

- bucket_script

- bucket_selector

- serial_diff

Page 64: Elasticsearch Data Analyses

Cumulative Sum Aggregation

Page 65: Elasticsearch Data Analyses

Derivative Aggregation
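Cumulative sum and derivative are both simple passes over the ordered bucket values of a histogram. A plain-Python sketch, with made-up monthly values:

```python
from itertools import accumulate


def cumulative_sum(values):
    """Running total over the bucket values."""
    return list(accumulate(values))


def derivative(values):
    """Difference to the previous bucket; the first bucket has no derivative."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


monthly_sales = [550, 60, 375]
print(cumulative_sum(monthly_sales))
print(derivative(monthly_sales))
```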

Page 66: Elasticsearch Data Analyses

Moving Average Aggregation

Page 68: Elasticsearch Data Analyses

Moving Average Aggregation: Prediction
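A rough sketch of a simple windowed moving average with a naive prediction step. It illustrates the idea only and is not Elasticsearch's implementation: the window here includes the current bucket, and each predicted value is the mean of the last window, fed back in as history. The input series is made up.

```python
def moving_average(values, window):
    """Mean of the last `window` values up to and including each position."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def predict(values, window, steps):
    """Extend the series by repeatedly averaging the last `window` values."""
    history = list(values)
    predictions = []
    for _ in range(steps):
        chunk = history[-window:]
        nxt = sum(chunk) / len(chunk)
        predictions.append(nxt)
        history.append(nxt)
    return predictions


series = [10, 20, 30, 40]
print(moving_average(series, 2))
print(predict(series, 2, 2))
```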

Page 69: Elasticsearch Data Analyses

Bucket Selector Aggregation

Page 70: Elasticsearch Data Analyses

Bucket Script Aggregation
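Bucket script and bucket selector both work per bucket: the first derives a new value from a bucket's existing metrics, the second keeps or drops the bucket. The sketch below simulates them in plain Python instead of showing the version-specific script syntax; the metric names and figures are made up.

```python
def bucket_script(buckets, name, fn):
    """Add a computed value to every bucket."""
    return [dict(bucket, **{name: fn(bucket)}) for bucket in buckets]


def bucket_selector(buckets, predicate):
    """Keep only the buckets for which the predicate holds."""
    return [bucket for bucket in buckets if predicate(bucket)]


monthly = [
    {"key": "2016-04", "total": 550.0, "returns": 55.0},
    {"key": "2016-05", "total": 60.0, "returns": 30.0},
    {"key": "2016-06", "total": 375.0, "returns": 15.0},
]

with_rate = bucket_script(monthly, "return_rate", lambda b: 100 * b["returns"] / b["total"])
high_returns = bucket_selector(with_rate, lambda b: b["return_rate"] > 20)
print([b["key"] for b in high_returns])
```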

Page 71: Elasticsearch Data Analyses

The End