when it absolutely, positively, has to be there: reliability guarantees in kafka, gwen shapira, jeff...

When it absolutely, positively, has to be there

Reliability Guarantees in Apache Kafka

@jeffholoman @gwenshap

Kafka
• High Throughput
• Low Latency
• Scalable
• Centralized
• Real-time

“If data is the lifeblood of high technology, Apache Kafka is the circulatory system”

--Todd Palino, Kafka SRE @ LinkedIn

If Kafka is a critical piece of our pipeline
• Can we be 100% sure that our data will get there?
• Can we lose messages?
• How do we verify?
• Whose fault is it?

Distributed Systems
• Things fail
• Systems are designed to tolerate failure

We must expect failures and design our code and configure our systems to handle them

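Data Flow (producer side)

[Diagram: on the Client Machine, the Application Thread hands data to the Kafka Client (async send, with a callback), which passes it through the O/S Socket Buffer to the NIC. The data crosses the Network to the Broker Machine: NIC, O/S Socket Buffer, Broker, Page Cache, Disk. An ack or exception travels back to the client. ✗ marks on the components and links show where failures can occur.]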

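Data Flow (consumer side)

[Diagram: on the Broker Machine, data moves from Disk and Page Cache (the Disk path is labelled "miss") through the Broker, the O/S Socket Buffer and the NIC. It crosses the Network to the Client Machine: NIC, O/S Socket Buffer, Kafka Client, Application Thread. Data flows to the consumer; offsets flow back and are stored in ZK or Kafka. ✗ marks on the components and links show where failures can occur.]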

Replication is your friend
• Kafka protects against failures by replicating data
• The unit of replication is the partition
• One replica is designated as the Leader
• Follower replicas fetch data from the leader
• The leader holds the list of “in-sync” replicas

Replication and ISRs

[Diagram: a Producer writing to Broker 100, Broker 101 and Broker 102, each holding replicas of partitions 0, 1 and 2.]

Topic: my_topic    Partitions: 3    Replicas: 3

Partition    Leader    ISR
0            100       101,102
1            101       100,102
2            102       101,100

ISR
• 2 things make a replica in-sync
– Lag behind leader
  • replica.lag.time.max.ms – replica that didn’t fetch or is behind
  • replica.lag.max.messages – removed in 0.9
– Connection to Zookeeper

Terminology
• Acked
– Producers will not retry sending.
– Depends on producer setting
• Committed
– Consumers can read.
– Only when message got to all ISR.
• replica.lag.time.max.ms
– how long can a dead replica prevent consumers from reading?
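The acked/committed distinction can be sketched with a toy model. This is not Kafka code: `CommittedOffsetSketch`, `committedOffset` and the broker ids are hypothetical, and the model assumes each replica simply reports the last offset it has written. It only illustrates the rule above: a message is committed once every replica still in the ISR has it, so a stalled replica holds back consumers until it is dropped from the ISR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model, not Kafka's implementation: a message is "committed" once
// every replica currently in the ISR has written it.
final class CommittedOffsetSketch {

    // lastWritten maps a (hypothetical) broker id to the last offset that replica holds.
    // Returns the highest offset present on every in-sync replica.
    // Assumes isr is non-empty (the leader is always a member).
    static long committedOffset(Map<Integer, Long> lastWritten, Set<Integer> isr) {
        long committed = Long.MAX_VALUE;
        for (int brokerId : isr) {
            committed = Math.min(committed, lastWritten.get(brokerId));
        }
        return committed;
    }

    public static void main(String[] args) {
        Map<Integer, Long> lastWritten = new HashMap<>();
        lastWritten.put(100, 101L); // leader
        lastWritten.put(101, 101L); // follower, caught up
        lastWritten.put(102, 100L); // follower that stopped fetching

        // While 102 is still in the ISR, offset 101 is not committed yet.
        Set<Integer> fullIsr = new TreeSet<>(lastWritten.keySet());
        System.out.println(committedOffset(lastWritten, fullIsr)); // 100

        // Once 102 is removed from the ISR, offset 101 becomes committed.
        Set<Integer> shrunkIsr = new TreeSet<>(fullIsr);
        shrunkIsr.remove(102);
        System.out.println(committedOffset(lastWritten, shrunkIsr)); // 101
    }
}
```

In this model, replica.lag.time.max.ms is the delay between the two calls in `main`: how long consumers wait before the stalled replica stops counting.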

Replication
• Acks = all
– only waits for in-sync replicas to reply.

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100
Replica 1: 100

• Replica 3 stopped replicating for some reason

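Replication

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101

• Acked in acks = all: “committed” (offset 100, present on every replica)
• Acked in acks = 1 but not “committed” (offset 101, missing on Replica 3)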

Replication

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101

• One replica drops out of ISR, or goes offline
• All messages are now acked and committed
• 2nd Replica drops out, or is offline

Replication

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102, 103, 104

Replication

[Diagram: offsets written to each replica over time; a ✗ marks a failure]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102, 103, 104

• Now we’re in trouble

Replication
• If Replica 2 or 3 comes back online before the leader, you will lose data.

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102, 103, 104

All those are “acked” and “committed”

So what to do
• Disable Unclean Leader Election
– unclean.leader.election.enable = false
• Set replication factor
– default.replication.factor = 3
• Set minimum ISRs
– min.insync.replicas = 2

Warning
• min.insync.replicas is applied at the topic-level.
• If the topic was created before the server-level change, you must alter the topic configuration manually
• Before 0.9.0 you must alter the topic manually (KAFKA-2114)

Replication
• Replication = 3
• Min ISR = 2

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100
Replica 1: 100

Replication

[Diagram: offsets written to each replica over time]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101

• One replica drops out of ISR, or goes offline

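Replication

[Diagram: offsets written to each replica over time; on Replica 1, offsets 103 and 104 are drawn separately from 100 to 102]
Replica 3: 100
Replica 2: 100, 101
Replica 1: 100, 101, 102 | 103, 104

• 2nd Replica fails out, or is out of sync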
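The effect of min.insync.replicas in this walkthrough can be sketched as a simple check. This is a toy model, not broker source code: `MinIsrSketch`, `tryWrite` and `WriteResult` are hypothetical names. It assumes the rule that a producer using acks = all is refused once the ISR shrinks below min.insync.replicas (the real client surfaces this as a NotEnoughReplicasException), while producers with weaker acks settings are not subject to the check.

```java
// Toy model of the min.insync.replicas check, not Kafka source code.
final class MinIsrSketch {

    enum WriteResult { ACKED, NOT_ENOUGH_REPLICAS }

    // acks is the producer setting ("0", "1" or "all"); isrSize counts the
    // replicas currently in sync, including the leader.
    static WriteResult tryWrite(String acks, int isrSize, int minInsyncReplicas) {
        // min.insync.replicas is only enforced for acks = all
        if ("all".equals(acks) && isrSize < minInsyncReplicas) {
            return WriteResult.NOT_ENOUGH_REPLICAS;
        }
        return WriteResult.ACKED;
    }

    public static void main(String[] args) {
        int minInsyncReplicas = 2;
        // Replication factor 3: the ISR shrinks from 3 to 1 as replicas drop out.
        for (int isrSize = 3; isrSize >= 1; isrSize--) {
            System.out.println("ISR size " + isrSize + " -> "
                    + tryWrite("all", isrSize, minInsyncReplicas));
        }
    }
}
```

The trade-off is availability for durability: with one replica left the partition stops accepting acks = all writes instead of acknowledging data that exists on a single machine.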

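Buffers in Producer

Producer Internals
• Producer sends batches of messages to a buffer

[Diagram: several Application Threads call send() with messages M0 to M3. The messages are collected into Batch 1, Batch 2 and Batch 3, which are drained and sent to the broker. When the response arrives the producer checks "Fail?": a failed batch is retried; otherwise the Future is updated and the callback is invoked with Metadata or an Exception.]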

Basics
• Durability can be configured with the producer configuration request.required.acks (acks in the new producer)
– 0 The message is written to the network (buffer)
– 1 The message is written to the leader
– all The producer gets an ack after all ISRs receive the data; the message is committed
• Make sure the producer doesn’t just throw messages away!
– block.on.buffer.full = true

“New” Producer
• All calls are non-blocking async
• 2 options for checking for failures:
– Immediately block for response: send().get()
– Do followup work in Callback, close producer after error threshold
• Be careful about buffering these failures. Future work? KAFKA-1955
• Don’t forget to close the producer! producer.close() will block until in-flight requests complete
• retries (producer config) defaults to 0
• message.send.max.retries (old producer config) defaults to 3
• In flight requests could lead to message re-ordering
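The re-ordering risk from retries combined with several in-flight requests can be shown with a small simulation. This is a simplified sketch, not the producer's actual sender logic: `InFlightReorderDemo` and `deliver` are hypothetical, and the model assumes batches are sent in windows of `maxInFlight` and that a failed batch is retried only after the rest of its window has been written.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Simplified simulation, not the real producer: shows how a retried batch can
// land behind a later batch when more than one request is in flight.
final class InFlightReorderDemo {

    // Returns the order in which batches reach the broker log when the first
    // attempt to send the batch named failOnce fails and is retried.
    static List<String> deliver(List<String> batches, int maxInFlight, String failOnce) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        List<String> log = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>(batches);
        boolean failedAlready = false;
        while (!pending.isEmpty()) {
            // Send up to maxInFlight batches without waiting for responses.
            List<String> window = new ArrayList<>();
            while (window.size() < maxInFlight && !pending.isEmpty()) {
                window.add(pending.pollFirst());
            }
            List<String> retry = new ArrayList<>();
            for (String batch : window) {
                if (!failedAlready && batch.equals(failOnce)) {
                    failedAlready = true;
                    retry.add(batch); // first attempt failed
                } else {
                    log.add(batch);   // written to the log
                }
            }
            // Failed batches go back to the front of the queue for a retry.
            for (int i = retry.size() - 1; i >= 0; i--) {
                pending.addFirst(retry.get(i));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> batches = Arrays.asList("B1", "B2", "B3");
        System.out.println(deliver(batches, 2, "B1")); // [B2, B1, B3]
        System.out.println(deliver(batches, 1, "B1")); // [B1, B2, B3]
    }
}
```

With a single in-flight request the retry happens before the next batch is sent, so order is preserved at the cost of throughput.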

Consumer
• Three choices for Consumer API
– Simple Consumer
– High Level Consumer (ZookeeperConsumer)
– New KafkaConsumer

New Consumer – attempt #1

    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "10000");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
    consumer.subscribe(Arrays.asList("foo", "bar"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            processAndUpdateDB(record);
        }
    }

Commit automatically every 10 seconds

What if we crash after 8 seconds?

New Consumer – attempt #2

    props.put("enable.auto.commit", "false");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
    consumer.subscribe(Arrays.asList("foo", "bar"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            processAndUpdateDB(record);
            // commits the offsets of everything returned by the last poll(),
            // not just the record that was processed
            consumer.commitSync();
        }
    }

What are you really committing?



    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            processAndUpdateDB(record);
            // commit only this record: the committed offset is the next one to read
            TopicPartition tp = new TopicPartition(record.topic(), record.partition());
            OffsetAndMetadata oam = new OffsetAndMetadata(record.offset() + 1);
            consumer.commitSync(Collections.singletonMap(tp, oam));
        }
    }

Is this fast enough?



    int counter = 0;
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(500);
        for (ConsumerRecord<String, String> record : records) {
            counter++;
            processAndUpdateDB(record);
            // commit once every 100 records instead of after each one
            if (counter % 100 == 0) {
                TopicPartition tp = new TopicPartition(record.topic(), record.partition());
                OffsetAndMetadata oam = new OffsetAndMetadata(record.offset() + 1);
                consumer.commitSync(Collections.singletonMap(tp, oam));
            }
        }
    }

Almost.

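Consumer Offsets

[Diagram: partitions P0, P2, P3, P4, P5, P6; a ✗ marks a Commit.]

Consumer Offsets

[Diagram: partitions P0, P2, P3, P4, P5, P6 read by a Consumer with Thread 1, Thread 2, Thread 3 and Thread 4; labelled “Duplicates”.]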

Rebalance Listener

    public class MyRebalanceListener implements ConsumerRebalanceListener {
        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        }

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            commitOffsets();
        }
    }

    consumer.subscribe(Arrays.asList("foo", "bar"), new MyRebalanceListener());

Careful! This method (commitOffsets()) will need to know the topic, partition and offset of the last record you got
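One way to have the topic, partition and offset available at commit time is to record them as each record is processed. The sketch below is a hypothetical helper, not part of the Kafka client: `OffsetTracker`, `markProcessed` and `offsetsToCommit` are made-up names, and string keys stand in for `TopicPartition` so the example runs without the client library. With the real consumer the map would be a `Map<TopicPartition, OffsetAndMetadata>` passed to `commitSync`.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Kafka client: remembers, per partition,
// the offset that should be committed next.
final class OffsetTracker {
    // Key is "topic-partition"; value is the next offset to read.
    private final Map<String, Long> nextOffsets = new TreeMap<>();

    // Call only after the record has been fully processed.
    void markProcessed(String topic, int partition, long offset) {
        // The committed offset is the last processed offset + 1.
        nextOffsets.put(topic + "-" + partition, offset + 1);
    }

    // Snapshot of what a commit (for example from onPartitionsRevoked) should send.
    Map<String, Long> offsetsToCommit() {
        return new TreeMap<>(nextOffsets);
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker();
        tracker.markProcessed("foo", 0, 41);
        tracker.markProcessed("foo", 0, 42);
        tracker.markProcessed("bar", 1, 7);
        System.out.println(tracker.offsetsToCommit());
    }
}
```

Because the tracker covers every partition, a commit triggered by a rebalance includes partitions whose last record did not happen to fall on a commit boundary in the processing loop.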

At Least Once Consuming
1. Commit your own offsets - Set enable.auto.commit = false
2. Use Rebalance Listener to limit duplicates
3. Make sure you commit only what you are done processing
4. Note: New consumer is single threaded – one consumer per thread.

Exactly Once Semantics
• At most once is easy
• At least once is not bad either – commit after 100% sure data is safe
• Exactly once is tricky
– Commit data and offsets in one transaction
– Idempotent producer

Using External Store
• Don’t use commitSync()
• Implement your own “commit” that saves both data and offsets to external store.
• Use the RebalanceListener to find the correct offset

Seeking right offset

    public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
        private final Consumer<?, ?> consumer;

        public SaveOffsetsOnRebalance(Consumer<?, ?> consumer) {
            this.consumer = consumer;
        }

        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // save the offsets in an external store using some custom code not described here
            for (TopicPartition partition : partitions)
                saveOffsetInExternalStore(partition, consumer.position(partition));
        }

        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // read the offsets from an external store using some custom code not described here
            for (TopicPartition partition : partitions)
                consumer.seek(partition, readOffsetFromExternalStore(partition));
        }
    }
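The custom store is left undescribed in the slide; the following is a minimal in-memory sketch of the idea, under the assumption that data and offset are written in one atomic step. `ExternalStoreSketch`, `save`, `readOffset` and `rowCount` are hypothetical names, and a `synchronized` method stands in for what would be a database transaction in a real system.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for an external store (hypothetical). A real store would
// write the row and the offset in a single database transaction.
final class ExternalStoreSketch {
    private final List<String> rows = new ArrayList<>();
    private final Map<Integer, Long> nextOffsets = new HashMap<>();

    // Saves the value and the next offset to read as one atomic step.
    // Returns false, and stores nothing, if this offset was already saved.
    synchronized boolean save(int partition, long offset, String value) {
        if (offset < readOffset(partition)) {
            return false; // redelivered record, already stored
        }
        rows.add(value);
        nextOffsets.put(partition, offset + 1);
        return true;
    }

    // The position to seek() to after a restart or rebalance.
    synchronized long readOffset(int partition) {
        return nextOffsets.getOrDefault(partition, 0L);
    }

    synchronized int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        ExternalStoreSketch store = new ExternalStoreSketch();
        store.save(0, 0, "a");
        store.save(0, 1, "b");
        // Simulated crash and restart: offset 1 is delivered again, then offset 2.
        System.out.println(store.save(0, 1, "b")); // false
        System.out.println(store.save(0, 2, "c")); // true
        System.out.println(store.rowCount() + " rows, resume at " + store.readOffset(0)); // 3 rows, resume at 3
    }
}
```

Because the offset can never be ahead of or behind the data it describes, seeking to `readOffset` on assignment resumes exactly where the stored data ends.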

Monitoring for Data Loss
• Monitor for producer errors – watch the retry numbers
• Monitor consumer lag – MaxLag or via offsets
• Standard schema:
– Each message should contain timestamp and originating service and host
• Each producer can report message counts and offsets to a special topic
• “Monitoring consumer” reports message counts to another special topic
• “Important consumers” also report message counts
• Reconcile the results
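The reconcile step can be sketched as a comparison of two count tables. This is an illustrative assumption about how the reports might be shaped, not something the slides specify: `CountReconciler` and `mismatches` are hypothetical, and the keys (for example "service@host" per time window) are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical reconciliation of reported message counts.
final class CountReconciler {

    // Returns produced minus consumed for every key where the two differ.
    // A positive value suggests lost messages, a negative one duplicates.
    static Map<String, Long> mismatches(Map<String, Long> produced, Map<String, Long> consumed) {
        Set<String> keys = new TreeSet<>(produced.keySet());
        keys.addAll(consumed.keySet());
        Map<String, Long> diff = new TreeMap<>();
        for (String key : keys) {
            long delta = produced.getOrDefault(key, 0L) - consumed.getOrDefault(key, 0L);
            if (delta != 0) {
                diff.put(key, delta);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        Map<String, Long> produced = new HashMap<>();
        produced.put("orders@host1", 100L);
        produced.put("orders@host2", 50L);

        Map<String, Long> consumed = new HashMap<>();
        consumed.put("orders@host1", 100L);
        consumed.put("orders@host2", 48L);

        System.out.println(mismatches(produced, consumed)); // {orders@host2=2}
    }
}
```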

Be Safe, Not Sorry
• acks = all
• block.on.buffer.full = true
• retries = MAX_INT
• ( max.in.flight.requests.per.connection = 1 )
• producer.close()
• Replication factor >= 3
• min.insync.replicas = 2
• unclean.leader.election.enable = false
• enable.auto.commit = false
• Commit after processing
• Monitor!
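The client-side items in this checklist could be collected into configuration like the following. This is a sketch using only `java.util.Properties`: `ReliableClientConfigs`, `producerProps` and `consumerProps` are hypothetical helper names, the bootstrap address and group id are placeholders, and serializer/deserializer settings that a real client also needs are omitted. block.on.buffer.full is taken from the slides and applies to the client versions they target; later clients replaced it with max.block.ms.

```java
import java.util.Properties;

// Hypothetical helpers that gather the checklist's client settings.
// Broker and topic settings (replication factor, min.insync.replicas,
// unclean.leader.election.enable) are configured on the cluster, not here.
final class ReliableClientConfigs {

    static Properties producerProps(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("acks", "all");
        p.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        // Optional: keeps ordering when retries happen, at a throughput cost.
        p.setProperty("max.in.flight.requests.per.connection", "1");
        // From the slides; version-dependent (see the note above).
        p.setProperty("block.on.buffer.full", "true");
        return p;
    }

    static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("group.id", groupId);
        // Offsets are committed by the application after processing.
        p.setProperty("enable.auto.commit", "false");
        return p;
    }

    public static void main(String[] args) {
        // Placeholder address and group id.
        producerProps("localhost:9092").list(System.out);
        consumerProps("localhost:9092", "example-group").list(System.out);
    }
}
```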
