MyHeritage Kafka use cases - Feb 2014 meetup
Post on 16-Apr-2017
MyHeritage and Kafka
Author: Ran Levy, Feb 2014
• MyHeritage use cases
• Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
• Two major use cases:
– Indexing to SuperSearch and Record Matching.
– Stats reporting to BI.
Use cases
• Indexing to SuperSearch and Record Matching
Use case 1
• The existing solution was custom and non-scalable: it processed changes and updated SuperSearch (SOLR over Lucene).
• The required solution should support:
– Continuous mode.
– High throughput.
– Scaling up.
– Repeating the process from some point.
– Guaranteed order of processed items.
– Reliability.
– Multiple consumers.
Use case 1 – cont’d
• Statistics reporting to BI system
Use case 2
• Required solution should support:
• High scale (~500 GB of data / day).
• Scale up – a few hundred million messages per day.
• Repeating the process from some point.
• Multiple consumers.
Use case 2 – cont’d
MyHeritage use cases
• Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
• So, what have we considered?
– DB
• Queues
Possible Solutions
• Key point about queues
– Messages are deleted once consumed.
– Messages are duplicated to support multiple readers.
Possible Solutions
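The Kafka-style alternative to the queue semantics above can be pictured as a log that never deletes on read: each consumer merely tracks its own offset, so many readers share one stored copy of each message and any reader can replay from an earlier point. A hypothetical illustration, not Kafka's actual implementation:

```python
# Sketch contrasting a queue with a log: reading deletes nothing, so
# multiple independent readers need no per-reader message copies.
class Log:
    def __init__(self):
        self.entries = []                 # append-only message store

    def append(self, msg):
        self.entries.append(msg)
        return len(self.entries) - 1      # offset of the new message

    def read(self, offset):
        # Each consumer passes its own offset; the log is never mutated.
        return self.entries[offset:]
```

Two consumers at different offsets read independently from the same single copy of the data.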
MyHeritage use cases
Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
• A high-throughput distributed messaging system
– Fast
– Scalable
– Durable
– Distributed by design
– Simplicity (over functionality)
Kafka Overview
• Fast (very fast) – both for producer and consumer
Kafka Overview
Reference: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
• Main entities:
– Producer – pushes data.
– Consumer – pulls data.
– Broker – load-balances producers by partition.
– Topic – a feed of messages belonging to the same logical category.
Kafka Overview
• Communication between the clients and the servers is done with a simple, high-performance TCP protocol.
• For each topic, the Kafka cluster maintains a partitioned log, which is an append-only commit log.
Kafka Overview – some internals
• Messages stay on disk after being consumed and are deleted only after a configured TTL.
• The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions.
• Each partition is replicated across a configurable number of servers for fault tolerance.
Kafka Overview – some internals
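The retention behaviour above (messages survive consumption and are pruned only by age) can be sketched as follows; the class and the timestamp handling are illustrative, not Kafka's actual on-disk segment format:

```python
import time

class RetentionLog:
    """Sketch of TTL-based retention: entries are pruned by age, never
    by consumption."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = []               # list of (timestamp, message)

    def append(self, msg, now=None):
        self.entries.append((now if now is not None else time.time(), msg))

    def purge(self, now=None):
        # Drop everything older than the retention window.
        cutoff = (now if now is not None else time.time()) - self.ttl
        self.entries = [(t, m) for t, m in self.entries if t >= cutoff]
```

Consumers reading a message have no effect on when it is removed; only the retention window does.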
MyHeritage use cases
Possible solutions
Kafka overview
• Actual implementation @MyHeritage
• Summary
Agenda
High Level Overview
[Diagram: two brokers, each hosting a "Family Tree changes" topic and an "Activity" topic, each topic with 32 partitions (part 1 … part 32). Each broker holds a DRBD replica of the other. Producers: Web, Daemons, Face recognition. Consumers: Indexing, Record Matching, Logstash reader.]
Kafka @MyHeritage - producers
[Diagram: application modules dispatch events to the Events System, which notifies its subscribers. The EventLoggerSubscriber writes the event (ILogWrite) to the ActivityManager, which dispatches it onward.]
Kafka @MyHeritage - producers
[Diagram: the KafkaWriter and its collaborators – Topic, BrokersConfig, ISelector, ISerializer, ILogger, IStats.]
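A hypothetical sketch of how the KafkaWriter's pluggable pieces named on the slide might fit together: ISelector picks the partition and ISerializer encodes the payload. All class and method names here are assumptions for illustration, not MyHeritage's real code:

```python
import json
import zlib

class ISelector:
    def select(self, key, num_partitions): ...

class ISerializer:
    def serialize(self, event): ...

class CrcSelector(ISelector):
    # Deterministic key-to-partition mapping (illustrative).
    def select(self, key, num_partitions):
        return zlib.crc32(key.encode("utf-8")) % num_partitions

class JsonSerializer(ISerializer):
    def serialize(self, event):
        return json.dumps(event).encode("utf-8")

class KafkaWriter:
    # `send` stands in for the real network call to a broker.
    def __init__(self, topic, num_partitions, selector, serializer, send):
        self.topic, self.n = topic, num_partitions
        self.selector, self.serializer, self.send = selector, serializer, send

    def write(self, key, event):
        part = self.selector.select(key, self.n)
        self.send(self.topic, part, self.serializer.serialize(event))
```

Swapping the selector or serializer changes routing or encoding without touching the writer itself, which is presumably why they are separate interfaces on the slide.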
Kafka @MyHeritage - producers
[Diagram: application modules dispatch events to the Events System, which notifies its subscribers; the EventLoggerSubscriber hands the event to the KafkaWriter, which attempts the first broker and, if that fails, attempts the second.]
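The broker failover shown on the slide (try the first broker, fall back to the second on failure) can be sketched as a loop over broker endpoints; the callables here stand in for real network sends:

```python
def send_with_failover(brokers, message):
    """Try each broker in order; raise only if all of them fail."""
    last_error = None
    for broker in brokers:          # e.g. [broker1_send, broker2_send]
        try:
            return broker(message)
        except ConnectionError as e:
            last_error = e          # remember failure, try the next broker
    raise last_error                # every broker failed
```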
Kafka @MyHeritage – Consumers (Indexing)
[Diagram: one EventProcessor per consumer type, with a reader per partition. Each EventProcessor fetches events from partition <x> at offset <z> on Broker 1 or Broker 2, gets/updates its offset via the KafkaWatermark, and adds events to the IndexingQueue. IndexingWorkers fetch work from the queue and update items in SOLR.]
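The consumer flow above (fetch from a stored offset, process, advance the watermark) can be sketched as follows; the names `Watermark` and `process_partition` are illustrative, not MyHeritage's actual code:

```python
class Watermark:
    """Per-partition committed offsets, so a restart resumes where it
    left off (illustrative in-memory version)."""

    def __init__(self):
        self.offsets = {}                      # partition -> next offset

    def get(self, partition):
        return self.offsets.get(partition, 0)

    def update(self, partition, offset):
        self.offsets[partition] = offset

def process_partition(log, partition, watermark, handle):
    offset = watermark.get(partition)
    for msg in log[offset:]:                   # fetch from last watermark
        handle(msg)                            # e.g. update a SOLR document
        offset += 1
        watermark.update(partition, offset)    # commit progress
```

Because the watermark is advanced only after handling, a crash causes at-worst reprocessing from the last committed offset, never a silent skip.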
MyHeritage use cases
Possible solutions
Kafka overview
Actual implementation @MyHeritage
• Summary
Agenda
Kafka is a very fast and scalable system that is used extensively at MyHeritage; consider it for the high-scale systems you are building.
Summary