Kafka Reliability Guarantees ATL Kafka User Group


Upload: jeff-holoman

Post on 10-Apr-2017


Page 1: Kafka Reliability Guarantees ATL Kafka User Group

1© Cloudera, Inc. All rights reserved.

Kafka Reliability Guarantees

Page 2: Kafka Reliability Guarantees ATL Kafka User Group

But First… What’s NEW???
• Released 0.9.0 in late November
• 87 contributors, 523 JIRAs, bunch o’ new features
• Security!
  • Kerberos/SASL authentication
  • Authorization plugin
  • SSL
• Kafka Connect
• Quotas
• New Consumer

Page 3: Kafka Reliability Guarantees ATL Kafka User Group

Kafka
• High Throughput
• Low Latency
• Scalable
• Centralized
• Real-time

Page 4: Kafka Reliability Guarantees ATL Kafka User Group

“If data is the lifeblood of high technology, Apache Kafka is the circulatory system”

-- Todd Palino, Kafka SRE @ LinkedIn

Page 5: Kafka Reliability Guarantees ATL Kafka User Group

If Kafka is a critical piece of our pipeline
• Can we be 100% sure that our data will get there?
• Can we lose messages?
• How do we verify?
• Whose fault is it?

Page 6: Kafka Reliability Guarantees ATL Kafka User Group

Distributed Systems
• Things fail
• Systems are designed to tolerate failure
• We must expect failures and design our code and configure our systems to handle them

Page 7: Kafka Reliability Guarantees ATL Kafka User Group

Data Flow

[Figure: producer-side data flow. Client machine: Application Thread → Kafka Client (async send; callback with ack / exception) → O/S Socket Buffer → NIC. Network. Broker machine: NIC → O/S Socket Buffer → Broker → Page Cache → Disk. Three points along the path are marked ✗.]

Page 8: Kafka Reliability Guarantees ATL Kafka User Group

Data Flow

[Figure: consumer-side data flow. Broker machine: Disk, Page Cache (miss), Broker, O/S Socket Buffer, NIC. Network. Client machine: NIC, O/S Socket Buffer, Kafka Client, Application Thread. Labels: data; offsets (ZK, Kafka). Three points along the path are marked ✗.]

Page 9: Kafka Reliability Guarantees ATL Kafka User Group

Replication is your friend
• Kafka protects against failures by replicating data
• The unit of replication is the partition
• One replica is designated as the leader
• Follower replicas fetch data from the leader
• The leader holds the list of “in-sync” replicas

Page 10: Kafka Reliability Guarantees ATL Kafka User Group

Replication and ISRs

[Figure: a producer writing to Broker 100, Broker 101, and Broker 102, each holding replicas of partitions 0, 1, and 2.]

Topic: my_topic   Partitions: 3   Replicas: 3

Partition: 0   Leader: 100   ISR: 101,102
Partition: 1   Leader: 101   ISR: 100,102
Partition: 2   Leader: 102   ISR: 101,100

Page 11: Kafka Reliability Guarantees ATL Kafka User Group

ISR
• 2 things make a replica in-sync
  • Lag behind leader
    • replica.lag.time.max.ms – replica that didn’t fetch or is behind
    • replica.lag.max.messages – will go away in 0.9
  • Connection to ZooKeeper
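The two in-sync conditions can be sketched as a small check. This is a simplified illustration in plain Python, not broker code: the function name, the input dictionaries, and the 5,000 ms threshold are all hypothetical, and the real broker tracks more state than this.

```python
def in_sync_replicas(last_caught_up_ms, now_ms, lag_time_max_ms, has_zk_session):
    """Return the sorted broker ids that satisfy both in-sync conditions.

    last_caught_up_ms: broker id -> time (ms) the replica last caught up to the leader
    has_zk_session:    broker id -> whether the broker still holds its ZooKeeper session
    (Both inputs are hypothetical stand-ins for state the leader tracks.)
    """
    isr = []
    for broker_id, caught_up in last_caught_up_ms.items():
        # Condition 1: the replica has not lagged longer than replica.lag.time.max.ms
        lagging = (now_ms - caught_up) > lag_time_max_ms
        # Condition 2: the broker is still connected to ZooKeeper
        if not lagging and has_zk_session.get(broker_id, False):
            isr.append(broker_id)
    return sorted(isr)


isr = in_sync_replicas(
    last_caught_up_ms={100: 9_900, 101: 9_500, 102: 1_000},
    now_ms=10_000,
    lag_time_max_ms=5_000,  # illustrative value only, not a recommendation
    has_zk_session={100: True, 101: True, 102: True},
)
print(isr)  # broker 102 last caught up 9 seconds ago, so it is left out
```

A replica that loses its ZooKeeper session drops out the same way, even if it was fully caught up a moment earlier.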

Page 12: Kafka Reliability Guarantees ATL Kafka User Group

Terminology
• Acked
  • Producers will not retry sending.
  • Depends on producer setting
• Committed
  • Consumers can read.
  • Only when message got to all ISR.
• replica.lag.time.max.ms
  • how long can a dead replica prevent consumers from reading?
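To make "committed" concrete, here is a minimal sketch, assuming each replica is summarized by the last offset it holds. The function and replica names are hypothetical, and the real broker tracks a high watermark with slightly different (exclusive) bookkeeping.

```python
def last_committed_offset(last_offset_by_replica, isr):
    """Highest offset that every in-sync replica already holds (simplified model)."""
    return min(last_offset_by_replica[replica] for replica in isr)


# Replica 3 has stopped fetching but is still listed in the ISR.
offsets = {"replica1": 101, "replica2": 101, "replica3": 100}

stuck = last_committed_offset(offsets, isr=["replica1", "replica2", "replica3"])
print(stuck)  # 100: message 101 is not yet readable by consumers

# Once replica.lag.time.max.ms expires, replica 3 is removed from the ISR.
advanced = last_committed_offset(offsets, isr=["replica1", "replica2"])
print(advanced)  # 101: the remaining ISR members all hold message 101
```

This is why replica.lag.time.max.ms bounds how long a dead replica can hold consumers back: the committed offset cannot advance past the slowest ISR member until that member is dropped.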

Page 13: Kafka Reliability Guarantees ATL Kafka User Group

Replication
• Acks = all
  • only waits for in-sync replicas to reply.

Replica 3: 100
Replica 2: 100
Replica 1: 100

Page 14: Kafka Reliability Guarantees ATL Kafka User Group

Replication

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101

• Replica 3 stopped replicating for some reason
• 100: acked in acks = all, “committed”
• 101: acked in acks = 1, but not “committed”

Page 15: Kafka Reliability Guarantees ATL Kafka User Group

Replication

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101

• One replica drops out of ISR, or goes offline
• All messages are now acked and committed

Page 16: Kafka Reliability Guarantees ATL Kafka User Group

Replication

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102 103 104

• 2nd replica drops out, or is offline

Page 17: Kafka Reliability Guarantees ATL Kafka User Group

Replication

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102 103 104

• Now we’re in trouble

Page 18: Kafka Reliability Guarantees ATL Kafka User Group

Replication
• If Replica 2 or 3 comes back online before the leader, you will lose data.

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102 103 104

All those are “acked” and “committed”

Page 19: Kafka Reliability Guarantees ATL Kafka User Group

So what to do
• Disable unclean leader election
  • unclean.leader.election.enable = false
• Set replication factor
  • default.replication.factor = 3
• Set minimum ISRs
  • min.insync.replicas = 2

Page 20: Kafka Reliability Guarantees ATL Kafka User Group

Warning
• min.insync.replicas is applied at the topic level.
• Must alter the topic configuration manually if created before the server level change
• Must manually alter the topic < 0.9.0 (KAFKA-2114)

Page 21: Kafka Reliability Guarantees ATL Kafka User Group

Replication
• Replication = 3
• Min ISR = 2

Replica 3: 100
Replica 2: 100
Replica 1: 100

Page 22: Kafka Reliability Guarantees ATL Kafka User Group

Replication

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101

• One replica drops out of ISR, or goes offline

Page 23: Kafka Reliability Guarantees ATL Kafka User Group

Replication

Replica 3: 100
Replica 2: 100 101
Replica 1: 100 101 102
Buffers in producer: 103 104

• 2nd replica fails out, or is out of sync
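The effect of min.insync.replicas with acks = all can be sketched as follows. This is a toy model, not broker code: NotEnoughReplicas is a hypothetical stand-in for the error the broker returns, and the ISR sizes per message are made up to mirror the picture above.

```python
class NotEnoughReplicas(Exception):
    """Hypothetical stand-in for the broker's not-enough-replicas error."""


def append(log, message, acks, isr_size, min_insync_replicas):
    """Append to the leader's log unless acks=all and the ISR is too small."""
    if acks == "all" and isr_size < min_insync_replicas:
        raise NotEnoughReplicas(f"ISR size {isr_size} < {min_insync_replicas}")
    log.append(message)


leader_log = [100, 101]
producer_buffer = []

# (message, ISR size at the time of the write); the sizes are illustrative
for message, isr_size in [(102, 2), (103, 1), (104, 1)]:
    try:
        append(leader_log, message, acks="all", isr_size=isr_size, min_insync_replicas=2)
    except NotEnoughReplicas:
        # The write is refused, so the message stays with the producer for a retry
        producer_buffer.append(message)

print(leader_log)       # [100, 101, 102]
print(producer_buffer)  # [103, 104]
```

The trade-off is availability for safety: with only the leader left, the partition stops accepting acks = all writes rather than accepting messages that exist on a single machine.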


Page 25: Kafka Reliability Guarantees ATL Kafka User Group

Producer Internals
• Producer sends batches of messages to a buffer

[Figure: application threads call send() with messages M0 to M3, which are grouped into Batch 1, Batch 2, and Batch 3. Batches are drained and sent; a failed response leads to a retry; otherwise the Future is updated and the callback is invoked with metadata or an exception.]

Page 26: Kafka Reliability Guarantees ATL Kafka User Group

Basics
• Durability can be configured with the producer configuration request.required.acks (acks in the new producer)
  • 0: The message is written to the network (buffer)
  • 1: The message is written to the leader
  • all: The producer gets an ack after all ISRs receive the data; the message is committed
• Make sure the producer doesn’t just throw messages away!
  • block.on.buffer.full = true

Page 27: Kafka Reliability Guarantees ATL Kafka User Group

“New” Producer
• All calls are non-blocking async
• 2 options for checking for failures:
  • Immediately block for response: send().get()
  • Do follow-up work in callback, close producer after error threshold
    • Be careful about buffering these failures. Future work? KAFKA-1955
• Don’t forget to close the producer! producer.close() will block until in-flight txns complete
• retries (producer config) defaults to 0
• message.send.max.retries (old producer config) defaults to 3
• In-flight requests could lead to message re-ordering
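The two failure-checking options can be illustrated without a Kafka client, using futures from the Python standard library. Everything here is a stand-in: fake_send, send_blocking, ErrorTracker, and the rule that messages starting with "bad" fail are hypothetical and only mimic the shape of an async send that returns a future.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_send(executor, message):
    """Hypothetical async send: returns a future that fails for 'bad' messages."""
    def deliver():
        if message.startswith("bad"):
            raise IOError("send failed: " + message)
        return message
    return executor.submit(deliver)


def send_blocking(executor, message):
    """Option 1: block on the future right away, in the style of send().get()."""
    try:
        fake_send(executor, message).result(timeout=5)
        return True
    except IOError:
        return False


class ErrorTracker:
    """Option 2: count failures in a callback and signal when to close."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.errors = 0
        self._lock = threading.Lock()  # callbacks may run on worker threads

    def on_completion(self, future):
        if future.exception() is not None:
            with self._lock:
                self.errors += 1

    @property
    def should_close(self):
        with self._lock:
            return self.errors >= self.threshold


def send_with_callbacks(messages, threshold):
    tracker = ErrorTracker(threshold)
    with ThreadPoolExecutor(max_workers=2) as executor:
        for message in messages:
            fake_send(executor, message).add_done_callback(tracker.on_completion)
    # Leaving the with-block waits for all sends, so the counts are final here
    return tracker


with ThreadPoolExecutor(max_workers=1) as pool:
    print(send_blocking(pool, "ok-1"), send_blocking(pool, "bad-1"))

tracker = send_with_callbacks(["a", "bad-1", "b", "bad-2"], threshold=2)
print(tracker.errors, tracker.should_close)
```

Blocking gives the simplest error handling at the cost of throughput; the callback style keeps sends asynchronous but leaves the application responsible for deciding what to do once the error threshold is reached.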


Page 29: Kafka Reliability Guarantees ATL Kafka User Group

Consumer
• Three choices for Consumer API
  • Simple Consumer
  • High Level Consumer
  • “New Consumer”

Page 30: Kafka Reliability Guarantees ATL Kafka User Group

New Consumer
• Available in Kafka 0.9.0
• Provides better control over offset management
• Enhanced server-side group management

Page 31: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by a consumer group made up of Consumer 1, Consumer 2, Consumer 3, and Consumer 4.]

Page 32: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by a consumer group made up of Consumer 1, Consumer 2, Consumer 3, and Consumer 4. Label: Commit?]

Page 33: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: same consumer group layout as the previous slide (Consumer 1 to Consumer 4 over partitions P0 P2 P3 P4 P5 P6). Label: Commit?]

Page 34: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by Consumer 1, Consumer 2, Consumer 3, and Consumer 4. Labels: Auto-commit enabled; ✗ Commit.]

Page 35: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by one consumer running Thread 1, Thread 2, Thread 3, and Thread 4. Label: Auto-commit enabled.]

Page 36: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by one consumer running Thread 1, Thread 2, Thread 3, and Thread 4. Labels: Auto-commit enabled; Consumer picks up here.]

Page 37: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by one consumer running Thread 1, Thread 2, Thread 3, and Thread 4. Label: Commit.]

Page 38: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by one consumer running Thread 1, Thread 2, Thread 3, and Thread 4. Labels: Commit; Offset commits for all threads.]

Page 39: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Offsets

[Figure: partitions P0 P2 P3 P4 P5 P6 read by Consumer 1, Consumer 2, Consumer 3, and Consumer 4. Labels: Auto-commit DISABLED; Commit.]

Page 40: Kafka Reliability Guarantees ATL Kafka User Group

Consumer Recommendations
• Set auto.commit.enable = false
• Manually commit offsets after the message data is processed / persisted
  • consumer.commitOffsets();
• Run each consumer in its own thread
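Why "commit after processing" gives at-least-once delivery can be shown with a toy loop. This sketch uses a plain list as the log and a hypothetical consume function with a simulated crash; it does not use a Kafka consumer.

```python
def consume(log, committed_offset, process, crash_before_commit_at=None):
    """Process messages from committed_offset on, committing after each one.

    Returns the committed offset the next run should resume from.
    crash_before_commit_at simulates dying after processing but before the commit.
    """
    for offset in range(committed_offset, len(log)):
        process(log[offset])
        if offset == crash_before_commit_at:
            return committed_offset  # work was done, but the offset was never committed
        committed_offset = offset + 1  # commit only after processing succeeded
    return committed_offset


log = ["m0", "m1", "m2"]
seen = []

committed = consume(log, 0, seen.append, crash_before_commit_at=1)
committed = consume(log, committed, seen.append)  # restart from the last commit

print(seen)       # ['m0', 'm1', 'm1', 'm2']
print(committed)  # 3
```

Nothing is lost, but m1 is processed twice. Committing before processing would flip the failure mode: no duplicates, but a crash could skip a message.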

Page 41: Kafka Reliability Guarantees ATL Kafka User Group

New Consumer!
• No ZooKeeper! At all!
• Rebalance listener
• Commit:
  • Commit
  • Commit async
  • Commit(offset)
• Seek(offset)

Page 42: Kafka Reliability Guarantees ATL Kafka User Group

Exactly Once Semantics
• At most once is easy
• At least once is not bad either – commit after 100% sure data is safe
• Exactly once is tricky
  • Commit data and offsets in one transaction
  • Idempotent producer
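"Commit data and offsets in one transaction" can be sketched with an external store that supports transactions. The example below assumes SQLite as that store and a list as the message log; the table names, open_store, and process_from are hypothetical, and the offset is kept in the database rather than in Kafka.

```python
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (msg_offset INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY CHECK (id = 0), next_offset INTEGER)")
    conn.execute("INSERT INTO progress VALUES (0, 0)")
    conn.commit()
    return conn


def next_offset(conn):
    return conn.execute("SELECT next_offset FROM progress").fetchone()[0]


def process_from(conn, log, fail_at=None):
    """Resume from the stored offset; write each result and its offset atomically."""
    for offset in range(next_offset(conn), len(log)):
        try:
            with conn:  # one transaction: both writes commit, or neither does
                conn.execute("INSERT INTO results VALUES (?, ?)", (offset, log[offset].upper()))
                if offset == fail_at:
                    raise RuntimeError("simulated crash before commit")
                conn.execute("UPDATE progress SET next_offset = ?", (offset + 1,))
        except RuntimeError:
            return  # the half-finished transaction was rolled back


store = open_store()
messages = ["a", "b", "c"]
process_from(store, messages, fail_at=1)  # crashes while handling offset 1
process_from(store, messages)             # restart resumes at offset 1

rows = store.execute("SELECT msg_offset, value FROM results ORDER BY msg_offset").fetchall()
print(rows)                # [(0, 'A'), (1, 'B'), (2, 'C')]
print(next_offset(store))  # 3
```

Because the result and the offset move together, a crash can neither record a result without advancing the offset nor advance the offset without the result, which is what removes the duplicate seen with plain commit-after-processing.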

Page 43: Kafka Reliability Guarantees ATL Kafka User Group

Monitoring for Data Loss
• Monitor for producer errors – watch the retry numbers
• Monitor consumer lag – MaxLag or via offsets
• Standard schema:
  • Each message should contain timestamp and originating service and host
  • Each producer can report message counts and offsets to a special topic
  • “Monitoring consumer” reports message counts to another special topic
  • “Important consumers” also report message counts
  • Reconcile the results
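The final "reconcile the results" step might look like the sketch below, assuming the counts from the special topics have already been aggregated per originating host. The reconcile function, host names, and numbers are hypothetical.

```python
def reconcile(produced_counts, consumed_counts):
    """Return {source: produced - consumed} for every source whose counts differ.

    A positive gap suggests lost messages; a negative gap suggests duplicates.
    """
    gaps = {}
    for source, produced in produced_counts.items():
        consumed = consumed_counts.get(source, 0)
        if consumed != produced:
            gaps[source] = produced - consumed
    return gaps


gaps = reconcile(
    produced_counts={"web-1": 1000, "web-2": 500},  # reported by producers
    consumed_counts={"web-1": 1000, "web-2": 497},  # reported by the monitoring consumer
)
print(gaps)  # {'web-2': 3}
```

In practice the comparison would be made per time window, since counts for the most recent window are still in flight.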

Page 44: Kafka Reliability Guarantees ATL Kafka User Group

Be Safe, Not Sorry
• acks = all
• block.on.buffer.full = true
• retries = MAX_INT
• ( max.in.flight.requests.per.connection = 1 )
• producer.close()
• Replication factor >= 3
• min.insync.replicas = 2
• unclean.leader.election.enable = false
• auto.commit.enable = false (enable.auto.commit in the new consumer)
• Commit after processing
• Monitor!

Page 45: Kafka Reliability Guarantees ATL Kafka User Group


Thank you