MyHeritage Kafka use cases - Feb 2014 meetup
TRANSCRIPT
MyHeritage and Kafka
Author: Ran Levy, Feb 2014
• MyHeritage use cases
• Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
• Two major use cases:
– Indexing to SuperSearch and Record Matching.
– Stats reporting to BI.
Use cases
• Indexing to SuperSearch and Record Matching
Use case 1
• A custom, non-scalable solution that processed changes and updated SuperSearch (SOLR over Lucene).
• Required solution should support:
– Continuous mode.
– High throughput.
– Scaling up.
– Repeating the process from some point.
– Guaranteed order of processed items.
– Reliability.
– Multiple consumers.
Use case 1 – cont'd
• Statistics reporting to BI system
Use case 2
• Required solution should support:
• High scale (~500 GB of data / day).
• Scale up – a few hundred million events per day.
• Repeating the process from some point.
• Multiple consumers.
Use case 2 – cont'd
MyHeritage use cases
• Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
• So, what have we considered…
– DB
• Queues
Possible Solutions
• Key points about queues:
– Messages are deleted after being consumed.
– Messages are duplicated to support multiple readers.
Possible Solutions
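A minimal sketch (hypothetical, not MyHeritage's code) of why those queue semantics make multiple readers and replay awkward, compared to a shared log where each reader only tracks an offset:

```python
from collections import deque

# Classic queue: a consumed message is removed, so every reader
# needs its own duplicated copy of each message.
queues = {"indexing": deque(), "bi": deque()}
for msg in ["m1", "m2", "m3"]:
    for q in queues.values():          # duplicate per reader
        q.append(msg)

first = queues["indexing"].popleft()   # destructive read: "m1" is gone

# Log: one shared, append-only list; each reader tracks an offset,
# so replaying is just resetting that offset.
log = ["m1", "m2", "m3"]
offsets = {"indexing": 0, "bi": 0}
offsets["indexing"] = 3                # indexing consumed everything
offsets["indexing"] = 1                # replay from the second message
replayed = log[offsets["indexing"]:]
print(replayed)                        # ['m2', 'm3']
```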
MyHeritage use cases
Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
• A high-throughput distributed messaging system
– Fast
– Scalable
– Durable
– Distributed by design
– Simplicity (over functionality)
Kafka Overview
• Fast (very fast) – both for producer and consumer
Kafka Overview
Reference: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
• Main entities:
– Producer – pushes data.
– Consumer – pulls data.
– Brokers – load-balance producers by partition.
– Topic – a feed of messages belonging to the same logical category.
Kafka Overview
• Communication between the clients and the servers is done with a simple, high-performance TCP protocol.
• For each topic, the Kafka cluster maintains a partitioned log, which is an append-only commit log.
Kafka Overview – some internals
• Messages stay on disk after being consumed and are deleted only after a configured TTL (retention period).
• The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions.
• Each partition is replicated across a configurable number of servers for fault tolerance.
Kafka Overview – some internals
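The internals above can be sketched as a toy in-memory commit log (a deliberate simplification; real Kafka persists segments on disk and keeps absolute offsets): appends go to the end of a partition, consumers pull from any offset without deleting anything, and expiry is driven only by the TTL.

```python
import time

class PartitionedLog:
    """Toy append-only commit log: one list of (timestamp, message) per partition."""
    def __init__(self, num_partitions, ttl_seconds):
        self.partitions = [[] for _ in range(num_partitions)]
        self.ttl = ttl_seconds

    def append(self, partition, message):
        # Appends only: new messages go to the end of the partition.
        self.partitions[partition].append((time.time(), message))

    def read(self, partition, offset):
        # Consumers pull from an offset; reading never deletes anything.
        return [m for _, m in self.partitions[partition][offset:]]

    def expire(self, now=None):
        # Deletion is driven only by the TTL, not by consumption.
        now = now if now is not None else time.time()
        for i, part in enumerate(self.partitions):
            self.partitions[i] = [(t, m) for t, m in part if now - t < self.ttl]

log = PartitionedLog(num_partitions=2, ttl_seconds=3600)
log.append(0, "family-tree-change-1")
log.append(0, "family-tree-change-2")
print(log.read(0, offset=0))   # both messages
print(log.read(0, offset=1))   # replay from offset 1
```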
MyHeritage use cases
Possible solutions
Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
High Level Overview
[Architecture diagram]
• Producers: Web, Daemons, Face recog., Logstash reader.
• Broker 1: Family Tree changes Topic and Activity Topic, 32 partitions each (part 1 … part 32), plus a DRBD replica of Broker 2.
• Broker 2: the same two topics and partitions, plus a DRBD replica of Broker 1.
• Consumers: Indexing, Record Matching.
Kafka @MyHeritage – producers
[Producer event-flow diagram]
• App modules and the ActivityManager dispatch events to the Events System.
• The Events System notifies its subscribers, including the EventLoggerSubscriber (which writes via ILogWrite).
Kafka @MyHeritage – producers
[KafkaWriter collaborators]
• KafkaWriter is configured with: Topic, BrokersConfig, ISelector, ISerializer, ILogger, IStats.
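A hypothetical sketch of that shape (the names come from the slide; the bodies are assumptions, not MyHeritage's actual code): the selector picks a partition, the serializer encodes the message, and the logger/stats hooks observe each write.

```python
import json

class HashSelector:
    """ISelector stand-in: map a message key to a partition by hashing."""
    def select(self, key, num_partitions):
        return hash(key) % num_partitions

class JsonSerializer:
    """ISerializer stand-in: encode a message for the wire (JSON assumed)."""
    def serialize(self, message):
        return json.dumps(message).encode("utf-8")

class KafkaWriter:
    """Sketch of the KafkaWriter from the slide: configured with a topic,
    a brokers config, and pluggable selector/serializer/logger/stats."""
    def __init__(self, topic, brokers_config, selector, serializer,
                 logger=print, stats=None):
        self.topic = topic
        self.brokers = brokers_config        # e.g. {"partitions": 32}
        self.selector = selector
        self.serializer = serializer
        self.logger = logger
        self.stats = stats or {"sent": 0}

    def write(self, key, message):
        partition = self.selector.select(key, self.brokers["partitions"])
        payload = self.serializer.serialize(message)
        # A real writer would push `payload` to a broker here.
        self.stats["sent"] += 1
        self.logger(f"topic={self.topic} part={partition} bytes={len(payload)}")
        return partition, payload

writer = KafkaWriter("activity", {"partitions": 32},
                     HashSelector(), JsonSerializer())
writer.write("user-42", {"event": "login"})
```

Swapping the selector or serializer changes partitioning or encoding without touching the writer itself, which is presumably the point of injecting them as interfaces.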
Kafka @MyHeritage – producers
[Producer flow through KafkaWriter]
• App modules dispatch events to the Events System, which notifies its subscribers, including the EventLoggerSubscriber.
• The EventLoggerSubscriber hands events to the KafkaWriter.
• The KafkaWriter attempts the 1st broker; if that fails, it attempts the 2nd broker.
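The failover step in that flow (attempt the 1st broker, fall back to the 2nd) could be sketched as a simple try-in-order loop; the broker callables here are stand-ins, not the actual MyHeritage client code.

```python
def send_with_failover(brokers, payload):
    """Try each broker in order; return the one that accepted the payload.

    `brokers` is a list of callables standing in for broker connections;
    each raises ConnectionError on failure (an assumption for this sketch).
    """
    last_error = None
    for broker in brokers:
        try:
            broker(payload)
            return broker
        except ConnectionError as exc:
            last_error = exc            # remember, then try the next broker
    raise RuntimeError("all brokers failed") from last_error

attempts = []

def broker1(payload):
    attempts.append("broker1")
    raise ConnectionError("broker1 down")

def broker2(payload):
    attempts.append("broker2")          # succeeds

winner = send_with_failover([broker1, broker2], b"event")
print(attempts)   # ['broker1', 'broker2'] - fell back to the 2nd broker
```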
Kafka @MyHeritage – Consumers (Indexing)
[Indexing consumer flow]
• One EventProcessor per consumer type, with a reader per partition.
• Each EventProcessor fetches events from part <x> at offset <z> on Broker 1 / Broker 2 and adds each event to the IndexingQueue.
• IndexingWorkers fetch work from the IndexingQueue and update items in SOLR.
• KafkaWatermark gets/updates the watermark (consumed offset) per partition.
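That flow amounts to: read the saved watermark, fetch a batch from the partition at that offset, enqueue the events for the workers, then advance the watermark. A toy sketch (the watermark store is a plain dict here; MyHeritage's KafkaWatermark presumably persists it so a restart resumes from the last committed offset):

```python
log = {0: ["e1", "e2", "e3", "e4"]}   # partition -> messages (toy broker)
watermarks = {0: 0}                    # partition -> last committed offset
indexing_queue = []

def consume(partition, max_batch=2):
    """Fetch events from `partition` at the saved offset, enqueue them for
    the indexing workers, then advance the watermark."""
    offset = watermarks[partition]                  # get watermark
    batch = log[partition][offset:offset + max_batch]
    indexing_queue.extend(batch)                    # add events to queue
    watermarks[partition] = offset + len(batch)     # update watermark

consume(0)
consume(0)
print(indexing_queue)   # ['e1', 'e2', 'e3', 'e4']
print(watermarks[0])    # 4 - a restart would resume from here
```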
MyHeritage use cases
Possible solutions
Kafka overview
Actual implementation @MyHeritage
• Summary
Agenda
Kafka is a very fast and scalable system that is used extensively at MyHeritage; you may want to consider it for the high-scale systems you are building.
Summary