Elasticsearch: Introduction to Data Model, Search & Aggregations

Elasticsearch Zalando Elasticsearch By Alaa Elhadba

Upload: alae-samer

Post on 13-Jan-2017


Table of Contents

Why Elasticsearch

Elasticsearch at scale

Index / Type

- An index is a collection of documents that should be grouped together for a common reason.

- A type is a collection of documents that all share an identical (or very similar) schema.

Sharding
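
A minimal sketch of how a document ends up on one primary shard: the shard number is a hash of the routing value (the document ID by default) modulo the number of primary shards. Elasticsearch uses a Murmur3 hash internally; `zlib.crc32` below is only a stand-in, and the function name and sample IDs are hypothetical.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return the primary shard number for a routing value."""
    # crc32 is a stand-in hash for illustration; Elasticsearch uses Murmur3.
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % number_of_primary_shards


if __name__ == "__main__":
    for doc_id in ["article-1", "article-2", "article-3"]:
        print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

The same routing value always maps to the same shard, so changing the number of primary shards would send existing IDs to different shards.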

Talking to data

Distribution

Elasticsearch node

Cluster_state: yellow

Scaling

Cluster

Cluster_state: yellow

Replication

Cluster

Cluster_state: Green

Replication

Cluster

Cluster_state: Red
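
The cluster state colours on the slides above follow shard allocation: green when every primary and replica shard is assigned, yellow when all primaries are assigned but at least one replica is not, and red when at least one primary is unassigned. A small sketch of that rule, using a hypothetical list-of-dicts representation of shards:

```python
def cluster_health(shards):
    """Derive a health colour from shard allocation.

    shards: list of dicts with boolean keys 'primary' and 'assigned'
    (a hypothetical representation chosen for this sketch).
    """
    primaries_ok = all(s["assigned"] for s in shards if s["primary"])
    replicas_ok = all(s["assigned"] for s in shards if not s["primary"])
    if not primaries_ok:
        return "red"      # some data is unavailable
    if not replicas_ok:
        return "yellow"   # all data available, but not fully replicated
    return "green"        # every primary and replica is assigned


if __name__ == "__main__":
    # Single node: the replica has no second node to live on.
    single_node = [
        {"primary": True, "assigned": True},
        {"primary": False, "assigned": False},
    ]
    print(cluster_health(single_node))  # yellow
```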

Data Modeling

Schema

Type:

Index:

Doc_values:

Relationships

● Application Side Joins

● Parent-Child

● Nested objects

Note: Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

Searching

STRUCTURED SEARCH

A filter asks a yes|no question of every document and is used for fields that contain exact values:

- Is a date within the range 2012 to 2015?

- Is the status “Approved”?

- Is the language code “DE”?

UNSTRUCTURED SEARCH

A query calculates how relevant each document is to the query and assigns it a relevance score, which is later used to sort the matching documents.

- Containing the word run, but maybe also matching runs, running, jog, or sprint
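
As a sketch, the two kinds of search can be combined in one `bool` query: the `filter` clauses answer the yes|no questions without affecting the score, and the `must` clause contributes relevance. The field names (`body`, `date`, `status`, `language`) and the function name are assumptions made for illustration; the block only builds the request body and does not contact a cluster.

```python
import json


def build_search_body(text):
    """Build a request body mixing full-text scoring with exact-value filters."""
    return {
        "query": {
            "bool": {
                # Scored: how relevant is each document to the text?
                "must": [{"match": {"body": text}}],
                # Not scored: yes|no questions on exact values.
                "filter": [
                    {"range": {"date": {"gte": "2012-01-01", "lte": "2015-12-31"}}},
                    {"term": {"status": "Approved"}},
                    {"term": {"language": "DE"}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("run"), indent=2))
```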

Terms Query Example
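
The example from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of a `terms` query body, which matches documents whose field contains any of the listed exact values. The field name `language` and its values are hypothetical.

```python
import json


def terms_query(field, values):
    """Build a request body matching documents whose field holds any of the values."""
    return {"query": {"terms": {field: list(values)}}}


if __name__ == "__main__":
    print(json.dumps(terms_query("language", ["DE", "EN"]), indent=2))
```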

Unstructured Search (Full Text)

Original text: Quick brown foxes leap over lazy dogs in summer

Tokenization: Quick, brown, foxes, leap, over, lazy, dogs, in, summer

Stop-word removal: Quick, brown, foxes, leap, lazy, dogs, summer

Stemming: Quick, brown, fox, leap, lazy, dog, summer

Synonyms: fast, brown, fox, jump, lazy, dog, summer

Typo tolerance: tsar -> star
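
The analysis steps above can be sketched as a toy pipeline. The stop-word list, stemming table, and synonym map are tiny hand-made assumptions that cover only the example sentence; real analyzers use proper tokenizers, stemmers, and synonym filters, and this sketch also lowercases tokens.

```python
import re

STOP_WORDS = {"over", "in"}                    # toy stop-word list
STEMS = {"foxes": "fox", "dogs": "dog"}        # toy stemming table
SYNONYMS = {"quick": "fast", "leap": "jump"}   # toy synonym map


def analyze(text):
    """Tokenize, drop stop words, stem, then map synonyms."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [STEMS.get(t, t) for t in tokens]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return tokens


if __name__ == "__main__":
    print(analyze("Quick brown foxes leap over lazy dogs in summer"))
    # ['fast', 'brown', 'fox', 'jump', 'lazy', 'dog', 'summer']
```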

Inverted Index
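
An inverted index maps each term to the set of documents that contain it, so a lookup by term does not have to scan every document. A minimal sketch, with made-up documents and whitespace tokenization standing in for real analysis:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


if __name__ == "__main__":
    docs = {1: "quick brown fox", 2: "lazy brown dog"}  # hypothetical documents
    index = build_inverted_index(docs)
    print(sorted(index["brown"]))  # [1, 2]
    print(sorted(index["fox"]))    # [1]
```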

Relevance

Scoring & Relevance in Full-Text Search

Relevance is the algorithm used to calculate how similar the contents of a field are to a query.

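TF/IDF

- Term Frequency: How often does the term appear in the field? The more often, the more relevant.

- Inverse Document Frequency: How often does each term appear in the index? The more often, the less relevant.

- Field Length Norm: How long is the field? The longer the field, the less likely it is that words in it are relevant.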
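
The three factors can be sketched with the formulas of the classic Lucene TF/IDF similarity: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms). Multiplying them into one weight, as below, is a simplification of the full scoring function, and the function names are chosen for this sketch.

```python
import math


def term_frequency(freq):
    """More occurrences of the term in the field raise the weight."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """Terms that appear in many documents carry less weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """A match in a short field counts for more than one in a long field."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one field (product of the factors)."""
    return (term_frequency(freq)
            * inverse_document_frequency(num_docs, doc_freq)
            * field_length_norm(num_terms))


if __name__ == "__main__":
    # A rare term in a short field versus a common term in a long field.
    print(term_weight(freq=1, num_docs=1000, doc_freq=5, num_terms=4))
    print(term_weight(freq=1, num_docs=1000, doc_freq=500, num_terms=100))
```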

Vector Space Model

The vector space model provides a way of comparing a multiterm query against a document.

- The model represents both the document and the query as vectors.

1. I am happy in summer.

2. After Christmas I’m a hippopotamus.

3. The happy hippopotamus helped Harry.

- By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document.

- If the angle between a document and the query is large, the document is of low relevance.
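
A sketch of the angle comparison for the three sentences above, assuming the query is "happy hippopotamus" and assuming illustrative term weights of 2 for "happy" and 5 for "hippopotamus" (neither is stated on the slide). Each vector holds the weights of those two terms; a cosine closer to 1 means a smaller angle and therefore higher relevance.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Vector layout: [weight of "happy", weight of "hippopotamus"] (assumed weights).
QUERY = [2, 5]
DOCS = {
    1: [2, 0],  # "I am happy in summer."
    2: [0, 5],  # "After Christmas I'm a hippopotamus."
    3: [2, 5],  # "The happy hippopotamus helped Harry."
}


def rank(query, docs):
    """Return document IDs ordered from most to least similar to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)


if __name__ == "__main__":
    print(rank(QUERY, DOCS))  # [3, 2, 1]
```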

Constant Score

Field Value Factor
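
A hedged sketch of a `function_score` query that uses `field_value_factor` to let a numeric field influence the relevance score. The field names `title` and `votes`, the chosen modifier and factor, and the function name are illustrative assumptions; the block only builds the request body.

```python
import json


def popularity_boosted_query(text):
    """Build a request body whose score also depends on a numeric field."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": text}},
                "field_value_factor": {
                    "field": "votes",      # hypothetical popularity field
                    "modifier": "log1p",   # dampen the effect of large values
                    "factor": 0.1,
                },
                "boost_mode": "sum",       # add the result to the text score
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(popularity_boosted_query("elasticsearch"), indent=2))
```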

Script Scoring

Aggregations

Aggregation

Search

- Business requirement: “Help me find the best documents?”

- Enablers: Matching, Relevance, Filtering, Auto-completion, ...

Analytics

- Business requirement: “What do these documents tell me about my business?”

- Enablers: Summaries, Patterns, Trends, Outliers, Predictions, Visualization

- Aggregations help build complex summaries & analytics of the indexed data.
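
As a sketch of what a summary of the indexed data looks like, the block below builds a `terms` aggregation request body and mimics its bucket counting locally on a few made-up documents. The field name `brand`, the sample data, and the function names are assumptions for illustration.

```python
from collections import Counter


def terms_agg_body(field, size=10):
    """Request body for a terms aggregation; size 0 skips the search hits."""
    return {
        "size": 0,
        "aggs": {"by_" + field: {"terms": {"field": field, "size": size}}},
    }


def local_terms_buckets(docs, field, size=10):
    """Mimic terms buckets locally: one bucket per distinct value, with a count."""
    counts = Counter(d[field] for d in docs if field in d)
    return [{"key": k, "doc_count": c} for k, c in counts.most_common(size)]


if __name__ == "__main__":
    docs = [{"brand": "a"}, {"brand": "b"}, {"brand": "a"}]  # made-up documents
    print(terms_agg_body("brand"))
    print(local_terms_buckets(docs, "brand"))
```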

Aggregation

Terms Significant Terms

Bucket Aggregations

Nested Aggregations

Metrics Aggregations

● Extended Stats Aggregation

● Geo Bounds Aggregation

● Geo Centroid Aggregation

● Percentiles Aggregation

● Stats Aggregation

● Value Count Aggregation

● Avg, Sum, Min, Max Aggregations
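
A minimal sketch of what the Stats Aggregation reports (count, min, max, avg, sum), computed locally over a list of numbers, together with the matching request body. The field name `price`, the aggregation name `price_stats`, and the sample values are hypothetical.

```python
def stats_agg_body(field):
    """Request body for a stats aggregation on a numeric field."""
    return {"size": 0, "aggs": {field + "_stats": {"stats": {"field": field}}}}


def local_stats(values):
    """Compute the same five numbers locally for a list of values."""
    values = list(values)
    if not values:
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0.0}
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


if __name__ == "__main__":
    print(stats_agg_body("price"))
    print(local_stats([10, 20, 30]))
```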

Significant Terms

What’s uncommonly common about this sub-group ?

- significant_terms analyzes your data and finds terms that appear with a frequency that is statistically anomalous compared to the background data.

- It can uncover surprisingly sophisticated trends and correlations in your data.

- It is used to discover anomalies.
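
"Statistically anomalous compared to the background" can be sketched as a comparison of how often a term appears in the sub-group (foreground) versus the whole index (background). The score below multiplies the absolute change by the relative change, in the spirit of the JLH heuristic; it is a simplified illustration with made-up counts, not the exact implementation.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Score how unusually common a term is in the foreground set.

    fg_count / fg_total: term frequency in the sub-group.
    bg_count / bg_total: term frequency in the background data.
    """
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        return 0.0  # not more common in the sub-group than elsewhere
    return (fg - bg) * (fg / bg)  # absolute change times relative change


if __name__ == "__main__":
    # Term in 50 of 100 sub-group docs but only 100 of 10,000 overall.
    print(jlh_score(50, 100, 100, 10000))
    # Term equally common in both sets: not significant.
    print(jlh_score(5, 100, 500, 10000))
```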

Find all people who like these products

Summarise how their style differs from everyone else

Kibana: Data Visualization
