How We (Almost) Forgot Lambda Architecture and Used Elasticsearch

Upload: michael-stockerl

Post on 13-Jan-2017


Category: Software




Michael Stockerl Data Engineer

[email protected] @stockerlm

Gutefrage is the question-answering platform where millions of people help each other.

Views per month: 140 million

Registered users: 4 million

Questions: 16.5 million

Answers: 63 million

gutefrage architecture:

The big picture

App Analytics


Lambda Architecture Example: Answer Score
● Better sorting
● Hide bad answers
● Google Thin Content (SEO)

[Diagram labels: Incoming Event; Batch view; Live view; Join when read]

Implemented with Lambda Architecture


The batch layer


The Speed Layer


The Serving Layer
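The "join when read" step of the serving layer can be pictured with a small sketch. It is not the code from the talk (which used Spark); the event tuples, function names and answer ids are made up for illustration, and the score is assumed to be a simple sum of deltas.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute the score per answer from all archived events."""
    view = Counter()
    for answer_id, delta in events:
        view[answer_id] += delta
    return view


def serve_score(answer_id, batch, live):
    """Serving layer: join the batch view and the live view at read time."""
    return batch.get(answer_id, 0) + live.get(answer_id, 0)


# Hypothetical events already covered by the last batch run.
archived = [("a1", 2), ("a2", 1), ("a1", 3)]
batch = batch_view(archived)

# Speed layer: events that arrived after the batch run are added incrementally.
live = Counter()
for answer_id, delta in [("a1", 1), ("a3", 4)]:
    live[answer_id] += delta

print(serve_score("a1", batch, live))
```

The same scoring logic ends up in two places (the batch recomputation and the incremental update), which is the duplication the learnings refer to.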

Learnings
+ Reads are fast
+ Spark helps building a Lambda Architecture
- Still duplicate code and complexity
- Each change needs an update of the batch view

Recent problem:

A new point system with user ranking

Points based on Feedback


● Like Question
● Most Helpful Answer
● Say Thank You
● Rate the Answer

Overall Ranking

Tag Ranking

Overall ranking with MySQL

-- Sum the points of the last 90 days per user, highest score first
SELECT user_id, SUM(points) AS score
FROM event_log
WHERE created_at BETWEEN NOW() - INTERVAL 90 DAY AND NOW()
GROUP BY user_id
ORDER BY score DESC;

First results of performance test
● Some queries were fast enough
● BUT: queries took 17 - 20 seconds in the worst case

Solution:

Aggregations of Elasticsearch

Aggregations in Elasticsearch

The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.

elasticsearch documentation

Aggregation for Top User List

{
  "aggregations": {
    "top_users": {
      "terms": {
        "field": "user_id",
        "size": 100,
        "shard_size": 2000,
        "order": { "total_score": "desc" }
      },
      "aggregations": {
        "total_score": { "sum": { "field": "score" } }
      }
    }
  }
}

Aggregation for Top User List, annotated:

● groupBy: the "terms" aggregation on "user_id"
● order by: "order" on the "total_score" sum, descending
● tune accuracy: "shard_size" (each shard returns its top 2000 buckets, which are then reduced to the final "size" of 100)
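Why "shard_size" tunes accuracy can be shown with a toy model. This is a simplified simulation, not Elasticsearch code: each shard is assumed to return only its own top buckets by summed score, and the coordinating step merges those partial lists. The shard contents and function names are invented for the example.

```python
from collections import defaultdict


def shard_top(docs, shard_size):
    """One shard: sum the score per user and return only its top shard_size buckets."""
    sums = defaultdict(int)
    for user_id, score in docs:
        sums[user_id] += score
    ranked = sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:shard_size]


def top_users(shards, size, shard_size):
    """Coordinating step: merge the partial shard results and keep the top `size`."""
    merged = defaultdict(int)
    for docs in shards:
        for user_id, partial in shard_top(docs, shard_size):
            merged[user_id] += partial
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]


# u3 is only third on each shard, but first overall (8 + 8 = 16).
shards = [
    [("u1", 10), ("u2", 9), ("u3", 8)],
    [("u4", 10), ("u5", 9), ("u3", 8)],
]

print(top_users(shards, size=1, shard_size=1))  # u3 is cut off on both shards
print(top_users(shards, size=1, shard_size=3))  # every bucket survives, result is exact
```

A larger "shard_size" lets more per-shard candidates reach the merge step, at the cost of more data sent between nodes.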


Query response times:

Worst case 2.5 seconds (MySQL 17s)

Request cache
● Search runs on the local shards
● Results are cached locally, per shard
● Invalidated when the shard's data changes
● Caches hits.total, aggregations and suggestions

➔ Too many updates
➔ A lot of cache misses

Split data:
● Data of today: use index template to create index with first event
● Historical data: index without changes

[Diagram labels: Incoming Event; historical data; data of today]

Use filtered aliases to select data of time range

[Diagram labels: Incoming Event; historical data; data of today; filtered alias (today to 90 days)]
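A filtered alias over both indices could be set up roughly as below. This is a sketch under assumptions: the index names ("events-history", "events-today"), the alias name ("ranking-90d") and the "created_at" field are hypothetical, and the dictionary is meant as the body of a POST to Elasticsearch's _aliases endpoint. An explicit cutoff date is used instead of date math with "now", on the assumption that a fixed date is friendlier to caching.

```python
import json
from datetime import date, timedelta


def ranking_alias_actions(today, alias="ranking-90d",
                          indices=("events-history", "events-today"), days=90):
    """Build an _aliases body that limits the alias to the last `days` days."""
    cutoff = (today - timedelta(days=days)).isoformat()
    window = {"range": {"created_at": {"gte": cutoff}}}
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "filter": window}}
            for index in indices
        ]
    }


body = ranking_alias_actions(date(2017, 1, 13))
print(json.dumps(body, indent=2))
```

The service then only ever queries the alias, so the time range can be moved forward without touching the service.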

Use cached results from historical data

[Diagram labels: Incoming Event; historical data; data of today; filtered alias (today to 90 days); Cache; _search?request_cache=true; service]
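The request the service sends might look like the following sketch. It only builds the path and body and sends nothing; the alias name "ranking-90d" is hypothetical, and "size": 0 is set on the assumption that only the aggregation result is needed, not the hits.

```python
import json
from urllib.parse import urlencode


def cached_ranking_request(alias="ranking-90d", size=100, shard_size=2000):
    """Return (path, JSON body) for a top-user search that opts into the request cache."""
    path = "/%s/_search?%s" % (alias, urlencode({"request_cache": "true"}))
    body = {
        "size": 0,  # no hits, only aggregation results
        "aggregations": {
            "top_users": {
                "terms": {
                    "field": "user_id",
                    "size": size,
                    "shard_size": shard_size,
                    "order": {"total_score": "desc"},
                },
                "aggregations": {"total_score": {"sum": {"field": "score"}}},
            }
        },
    }
    # sort_keys keeps the serialized body identical from call to call.
    return path, json.dumps(body, sort_keys=True)


path, payload = cached_ranking_request()
print(path)
```

Serializing the body deterministically is a cautious choice here, so that repeated calls send byte-identical requests.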

The next day

[Diagram labels: Incoming Event; historical data; data of yesterday; data of today; filtered alias (today to 90 days); Cache; _search?request_cache=true; service]

Merge the old indices

[Diagram labels: Incoming Event; historical data; data of yesterday; data of today; filtered alias (today to 90 days); Cache; _search?request_cache=true; service]

Warm cache already in merge job

[Diagram labels: Incoming Event; historical data; data of today; filtered alias (today to 90 days); Cache; _search?request_cache=true; service]
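The nightly job could be planned as an ordered list of requests, as in this sketch. It is not the internal reindex framework mentioned in the talk: it assumes Elasticsearch's _reindex API is used for the merge, and the index names, alias name and function name are hypothetical. Nothing is sent; the function only returns the plan.

```python
def nightly_merge_plan(yesterday_index, ranking_query,
                       history_index="events-history", alias="ranking-90d"):
    """Return the (method, path, body) steps of a hypothetical nightly merge job."""
    return [
        # 1. Copy yesterday's events into the historical index.
        ("POST", "/_reindex",
         {"source": {"index": yesterday_index}, "dest": {"index": history_index}}),
        # 2. Drop the now redundant daily index.
        ("DELETE", "/" + yesterday_index, None),
        # 3. Warm the request cache with the same query the service will send.
        ("POST", "/%s/_search?request_cache=true" % alias, ranking_query),
    ]


plan = nightly_merge_plan("events-2017.01.12", {"size": 0})
for method, path, _ in plan:
    print(method, path)
```

Running the warm-up query inside the job means the first real request of the day does not pay for the cache miss caused by the merge.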

Query response times:

Worst case 90ms (MySQL 17s)

Learnings:

● Improved internal reindex framework
● Aliases are always your friends
● Request cache FTW
● Cache miss when you use the index name instead of the alias (?)
● Results may not be 100% accurate (but no problem for us)

Questions?

We’re hiring…

● Web-Developer
● We are looking for experts in the area of Search and NLP interested in supporting us for a couple of days!

Please get in touch. :)