MyHeritage Kafka use cases - Feb 2014 meetup

Post on 16-Apr-2017

MyHeritage and Kafka

Author: Ran Levy, Feb 2014

• MyHeritage use cases

• Possible solutions

• Kafka overview

• Actual implementation @MyHeritage

• Summary

Agenda

• Two major use cases:

– Indexing to SuperSearch and Record Matching.

– Stats reporting to BI.

Use cases

• Indexing to SuperSearch and Record Matching

Use case 1

• The existing solution was custom and non-scalable; it involved processing changes and updating SuperSearch (Solr over Lucene).

• Required solution should support:

– Continuous mode.

– High throughput.

– Scaling up.

– Repeating the process from some point.

– Guaranteed order of processed items.

– Reliability.

– Multiple consumers.

Use case 1 – cont’d

• Statistics reporting to BI system

Use case 2

• Required solution should support:

– High scale (~500 GB of data per day).

– Scale up – a few hundred million per day.

– Repeating the process from some point.

– Multiple consumers.

Use case 2 – cont’d
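To get a rough feel for these volumes as sustained rates, the arithmetic can be sketched as below. The value of 300 million events per day is only an assumed stand-in for "a few hundred million", and decimal gigabytes (10^9 bytes) are assumed as well; neither figure comes from the slides. These are averages, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(total_per_day: float) -> float:
    """Average per-second rate for a given daily total."""
    return total_per_day / SECONDS_PER_DAY


bytes_per_day = 500 * 10**9    # ~500 GB/day from the slide (decimal GB assumed)
events_per_day = 300 * 10**6   # assumption: "a few hundred million" taken as 300M

avg_mb_per_sec = per_second(bytes_per_day) / 10**6
avg_events_per_sec = per_second(events_per_day)

print(f"average data rate:  {avg_mb_per_sec:.1f} MB/s")
print(f"average event rate: {avg_events_per_sec:.0f} events/s")
```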

• So what have we considered?

– DB

– Queues

Possible Solutions

• Key points about queues

– Messages are deleted after being consumed.

– Messages are duplicated to support multiple readers.

Possible Solutions
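The two points about queues can be illustrated with a toy in-memory model. This is only a sketch of classic queue semantics, not the API of any particular queue product; `FanOutQueues` and its methods are hypothetical names introduced for illustration.

```python
from collections import deque


class FanOutQueues:
    """Toy model: one private queue per reader, so each message is stored once per reader."""

    def __init__(self, readers):
        self.queues = {name: deque() for name in readers}

    def publish(self, message):
        # Duplicate the message into every reader's queue.
        for queue in self.queues.values():
            queue.append(message)

    def consume(self, reader):
        # Consuming removes the message, so it cannot be read again.
        queue = self.queues[reader]
        return queue.popleft() if queue else None

    def stored_copies(self):
        return sum(len(queue) for queue in self.queues.values())


queues = FanOutQueues(["indexing", "record_matching"])
queues.publish("tree-change-1")
copies_after_publish = queues.stored_copies()  # one copy per reader
first = queues.consume("indexing")
second = queues.consume("indexing")            # nothing left to replay
```

In this model, storage grows with the number of readers, and a reader that needs to repeat the process from some earlier point has nothing left to read.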

• A high throughput distributed messaging system

– Fast

– Scalable

– Durable

– Distributed by design

– Simplicity (over functionality)

Kafka Overview

• Fast (very fast) – both for producer and consumer

Kafka Overview

Reference: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

• Main entities

– Producer – pushes data.

– Consumer – pulls data.

– Brokers – load balance producers by partition.

– Topic – a feed of messages belonging to the same logical category.

Kafka Overview

• Communication between the clients and the servers is done with a simple, high-performance TCP protocol.

• For each topic, the Kafka cluster maintains a partitioned log, which is an append-only commit log.

Kafka Overview – some internals
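The idea of a partitioned, append-only log can be sketched with a small in-memory model. The class name, the CRC32-based choice of partition, and the count of 32 partitions are illustrative assumptions, not Kafka's actual implementation.

```python
import zlib


class PartitionedLog:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Assumption: a stable hash of the key picks the partition,
        # so messages with the same key keep their relative order.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, message):
        partition = self.partition_for(key)
        self.partitions[partition].append(message)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_messages=10):
        # Reading never modifies the log; the caller tracks its own offset.
        return self.partitions[partition][offset:offset + max_messages]


log = PartitionedLog(num_partitions=32)
p1, o1 = log.append("tree-17", "person added")
p2, o2 = log.append("tree-17", "person renamed")
in_order = log.read(p1, 0)
```

Ordering is guaranteed only within a partition, which is why keying related messages to the same partition matters when processing order is a requirement.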

• Messages stay on disk after being consumed and are deleted only after a defined TTL.

• The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions.

• Each partition is replicated across a configurable number of servers for fault tolerance.

Kafka Overview – some internals
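The retention behaviour described above can be sketched with a toy single partition. The class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and replication is not modeled. Kafka itself applies retention to whole log segments, so the per-message expiry here is a simplification.

```python
class RetainedPartition:
    """Toy partition: messages survive reads and expire only by age."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.base_offset = 0   # offset of the oldest retained message
        self.entries = []      # (timestamp, message) pairs, oldest first

    def append(self, message, now):
        self.entries.append((now, message))
        return self.base_offset + len(self.entries) - 1

    def expire(self, now):
        # Drop messages older than the TTL; remaining offsets do not change.
        while self.entries and now - self.entries[0][0] > self.ttl_seconds:
            self.entries.pop(0)
            self.base_offset += 1

    def read_from(self, offset):
        start = max(offset, self.base_offset) - self.base_offset
        return [message for _, message in self.entries[start:]]


part = RetainedPartition(ttl_seconds=60)
part.append("e0", now=0)
part.append("e1", now=30)
part.append("e2", now=50)

indexing_view = part.read_from(0)  # first consumer reads everything
bi_view = part.read_from(0)        # second consumer sees the same messages
replay = part.read_from(1)         # repeat the process from a chosen offset

part.expire(now=70)                # only e0 is older than the TTL
after_expiry = part.read_from(0)
```

Because reads do not remove anything, multiple consumers and replay from an earlier offset need no duplicate copies of the data.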

High Level Overview

[Diagram: producers write to topics hosted on two brokers; consumers read from them.]

• Producers: Web, Daemons, Face recog.

• Broker 1: Family Tree changes Topic (part 1, part 2, … part 32), Activity Topic (part 1, part 2, … part 32), DRBD replica of Broker 2.

• Broker 2: Family Tree changes Topic (part 1, part 2, … part 32), Activity Topic (part 1, part 2, … part 32), DRBD replica of Broker 1.

• Consumers: Indexing, RecordMatching, Logstash reader.

Kafka @MyHeritage - producers

[Diagram elements: App Modules, ActivityManager, Events System, Subscribers, EventLoggerSubscriber, ILogWrite. Labeled arrows: "Dispatch event" into the Events System, "Notify" out to each subscriber.]

Kafka @MyHeritage - producers

[Diagram: KafkaWriter and its components: Topic, BrokersConfig, ISelector, ISerializer, ILogger, IStats.]

Kafka @MyHeritage - producers

[Diagram: App Modules dispatch events to the Events System, which notifies its Subscribers, including EventLoggerSubscriber. KafkaWriter writes to the brokers: attempt the 1st broker; if that fails, attempt the 2nd broker.]
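The "attempt the 1st broker, then the 2nd" behaviour can be sketched as follows. Only the names KafkaWriter, ISelector, and ISerializer come from the slides; the constructor arguments, method signatures, and the `StubBroker` and `BrokerDown` helpers are hypothetical, and no real Kafka client is used.

```python
import json


class BrokerDown(Exception):
    """Raised by a broker stub that cannot accept writes."""


class StubBroker:
    """Hypothetical stand-in for a broker connection."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.received = []

    def send(self, topic, partition, payload):
        if not self.up:
            raise BrokerDown(self.name)
        self.received.append((topic, partition, payload))


class KafkaWriter:
    def __init__(self, topic, brokers, select_partition, serialize):
        self.topic = topic
        self.brokers = brokers                    # tried in the given order
        self.select_partition = select_partition  # plays the ISelector role (assumed)
        self.serialize = serialize                # plays the ISerializer role (assumed)

    def write(self, key, event):
        payload = self.serialize(event)
        partition = self.select_partition(key)
        for broker in self.brokers:
            try:
                broker.send(self.topic, partition, payload)
                return broker.name
            except BrokerDown:
                continue  # attempt the next broker
        raise RuntimeError("all brokers failed")


primary = StubBroker("broker1", up=False)
secondary = StubBroker("broker2")
writer = KafkaWriter(
    topic="activity",
    brokers=[primary, secondary],
    select_partition=lambda key: sum(key.encode("utf-8")) % 32,
    serialize=lambda event: json.dumps(event, sort_keys=True).encode("utf-8"),
)
used = writer.write("user-42", {"type": "login"})
```

Passing the selector and serializer in as plain callables keeps the writer independent of how partitions are chosen or how events are encoded.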

Kafka @MyHeritage – Consumers (Indexing)

[Diagram of the indexing flow:]

• EventProcessor: 1 per consumer type, with a reader per partition.

• Each EventProcessor fetches events from Broker 1 and Broker 2 ("Fetch event from part<x>, offset <z>").

• KafkaWatermark: get/update watermark.

• IndexingQueue: add event to queue.

• IndexingWorkers: fetch work, then update the item in SOLR.
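The indexing flow can be sketched in memory as below. The names EventProcessor, KafkaWatermark, IndexingQueue, and IndexingWorkers come from the diagram; everything else (method names, the dict standing in for the brokers, the dict standing in for Solr) is a hypothetical simplification. The diagram does not show when the watermark is advanced, so this sketch assumes it moves once events are placed on the queue.

```python
from collections import deque


class KafkaWatermark:
    """Stores the next offset to read for each partition (in memory here)."""

    def __init__(self):
        self._offsets = {}

    def get(self, partition):
        return self._offsets.get(partition, 0)

    def update(self, partition, next_offset):
        self._offsets[partition] = next_offset


class EventProcessor:
    def __init__(self, partitions, watermark, queue):
        self.partitions = partitions  # {partition id: list of events}; stands in for the brokers
        self.watermark = watermark
        self.queue = queue            # plays the IndexingQueue role

    def poll(self, partition, max_events=100):
        offset = self.watermark.get(partition)           # get watermark
        events = self.partitions[partition][offset:offset + max_events]
        for event in events:
            self.queue.append(event)                     # add event to queue
        self.watermark.update(partition, offset + len(events))
        return len(events)


def run_indexing_worker(queue, index):
    """Drain the queue and apply each update; the dict stands in for Solr."""
    while queue:
        item_id, fields = queue.popleft()                # fetch work
        index[item_id] = fields                          # update item


partitions = {0: [("person-1", {"name": "Ann"}), ("person-1", {"name": "Anna"})]}
watermark, queue, index = KafkaWatermark(), deque(), {}
processor = EventProcessor(partitions, watermark, queue)

fetched = processor.poll(0)
run_indexing_worker(queue, index)
fetched_again = processor.poll(0)  # nothing new past the watermark

watermark.update(0, 0)             # rewind to repeat the process from offset 0
refetched = processor.poll(0)
```

Keeping the watermark outside the broker is what lets each consumer type progress, or rewind, independently of the others.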

Kafka is a very fast and scalable system that is used extensively at MyHeritage, and you should consider it for the high-scale systems you are running.

Summary

Thank you and questions

ranl@myheritage.com
