intro-kafka

Kafka Introduction
Rahul Shukla

Posted on 10-Apr-2017

Page 1: intro-kafka

Kafka Introduction

Rahul Shukla

Page 2: intro-kafka

© 2015 Cloudera, Inc. All rights reserved.

Agenda

• Introduction
• Concepts
• Use Cases
• Development
• Administration

Page 3: intro-kafka

Concepts
Basic Kafka Concepts

Page 4: intro-kafka

Why Kafka

(diagram: one Client connected to one Source)

Data pipelines start like this.

Page 5: intro-kafka

Why Kafka

(diagram: several Clients now sharing the same Source)

Then we reuse them.

Page 6: intro-kafka

Why Kafka

(diagram: several Clients connected to a Backend, with Another Backend added)

Then we add additional endpoints to the existing sources.

Page 7: intro-kafka

Why Kafka

(diagram: many Clients each wired directly to many Backends)

Then it starts to look like this.

Page 8: intro-kafka

Why Kafka

(diagram: the same mesh of Clients and Backends, now with even more cross-connections)

With maybe some of this.

As distributed systems and services increasingly become part of a modern architecture, this makes for a fragile system.

Page 9: intro-kafka

Why Kafka

Kafka decouples data pipelines.

(diagram: multiple Source Systems produce into a central Kafka cluster; Hadoop, Security Systems, Real-time Monitoring, and a Data Warehouse consume from it. Roles labelled: Producers, Brokers, Consumers)

Page 10: intro-kafka


Key terminology

• Kafka runs as a cluster of one or more servers, each of which is called a broker.

• Kafka maintains feeds of messages in categories called topics.

• Each topic is maintained in one or more partitions.

• Processes that publish messages to a Kafka topic are called producers.

• Processes that subscribe to topics and process the feed of published messages are called consumers.

• Communication between all components uses a simple, high-performance binary protocol over TCP.

Page 11: intro-kafka


Efficiency

• Kafka achieves its high throughput and low latency primarily from two key techniques:

• 1) Batching of individual messages to amortize network overhead and to append/consume chunks together

• 2) Zero-copy I/O using sendfile (Java NIO's FileChannel.transferTo method)
  • Uses the Linux sendfile() system call, which skips unnecessary buffer copies
  • Relies heavily on the Linux page cache

• The I/O scheduler batches consecutive small writes into bigger physical writes, which improves throughput.

• The I/O scheduler attempts to re-sequence writes to minimize movement of the disk head, which improves throughput.

• The page cache automatically uses all the free memory on the machine.
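The zero-copy idea above can be sketched in plain Java. This is illustrative only, not Kafka source code; the class and method names are made up for the example. On Linux, the JVM can implement FileChannel.transferTo with sendfile(), moving bytes kernel-to-kernel without copying them through user-space buffers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopyDemo {
    // Channel-to-channel transfer; transferTo() may return early, so loop
    // until the whole file has been handed off.
    static void transfer(Path src, Path dst) {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                                                StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write a string to a temp file, transfer it, and read it back.
    static String roundTrip(String s) {
        try {
            Path src = Files.createTempFile("zc-src", ".log");
            Path dst = Files.createTempFile("zc-dst", ".log");
            Files.write(src, s.getBytes(StandardCharsets.UTF_8));
            transfer(src, dst);
            return new String(Files.readAllBytes(dst), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello kafka"));
    }
}
```

The same transferTo call works file-to-socket, which is the case Kafka actually cares about when serving consumers.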

Page 12: intro-kafka


Efficiency - Implication

• In a system where consumers are roughly caught up with producers, you are essentially reading data from cache.

• This is what allows end-to-end latency to be so low.

Page 13: intro-kafka

Architecture

(diagram: Producers send to a Kafka Cluster of four Brokers; Consumers fetch from it; Zookeeper coordinates the cluster and tracks consumer Offsets)

Page 14: intro-kafka


Topics - Partitions

• Topics are broken up into ordered commit logs called partitions.

• Each message in a partition is assigned a sequential id called an offset.

• Data is retained for a configurable period of time*

(diagram: three partitions shown as ordered logs; Partition 1 and Partition 3 hold offsets 0-13, Partition 2 holds offsets 0-11; writes append at the "new" end, with the oldest messages at the other)
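The partition model above can be captured in a toy sketch. This is a hypothetical model for illustration, not Kafka code: a partition is an append-only log where each record gets the next sequential offset, and reading does not remove anything.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Append a record and return the offset it was assigned.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Consumers read by offset; the record stays in the log afterwards.
    String read(long offset) {
        return records.get((int) offset);
    }

    public static void main(String[] args) {
        PartitionLog p = new PartitionLog();
        System.out.println(p.append("a")); // offset 0
        System.out.println(p.append("b")); // offset 1
        System.out.println(p.read(0));     // "a" is still there after reading
    }
}
```

In real Kafka the log segments on disk age out per the retention configuration rather than living forever in memory, but the offset semantics are the same.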

Page 15: intro-kafka


Message Ordering

• Ordering is only guaranteed within a partition for a topic
• To ensure ordering:
  • Group messages in a partition by key (producer)
  • Configure exactly one consumer instance per partition within a consumer group

Page 16: intro-kafka


Guarantees

• Messages sent by a producer to a particular topic partition will be appended in the order they are sent

• A consumer instance sees messages in the order they are stored in the log

• For a topic with replication factor N, Kafka can tolerate up to N-1 server failures without “losing” any messages committed to the log

Page 17: intro-kafka


Topics - Replication

• Topics can (and should) be replicated.
• The unit of replication is the partition.
• Each partition in a topic has 1 leader and 0 or more replicas.
• A replica is deemed "in-sync" if:
  • The replica can communicate with Zookeeper
  • The replica is not "too far" behind the leader (configurable)

• The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)

• The replication factor cannot be lowered

Page 18: intro-kafka


Topics - Replication

• Durability can be configured with the producer configuration request.required.acks:
  • 0: The producer never waits for an ack
  • 1: The producer gets an ack after the leader replica has received the data
  • -1: The producer gets an ack after all ISRs receive the data

• A minimum available ISR count can also be configured, so that an error is returned if not enough replicas are available to replicate the data

Page 19: intro-kafka


Durable Writes

• Producers can choose to trade throughput for durability of writes:

Durability | Behaviour                        | Per-Event Latency | Required Acknowledgements (request.required.acks)
Highest    | ACK: all ISRs have received      | Highest           | -1
Medium     | ACK: once the leader has received| Medium            | 1
Lowest     | No ACKs required                 | Lowest            | 0

• Throughput can also be raised with more brokers… (so do this instead)!

• A sane configuration:

Property              | Value
replication           | 3
min.insync.replicas   | 2
request.required.acks | -1
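The "sane configuration" from the table can be expressed as a properties sketch. Note the caveat: the property names follow the 0.8-era producer API used in this deck, and min.insync.replicas is really a broker/topic-level setting, so treat this as an illustration of the values rather than a drop-in client config (newer clients use acks instead of request.required.acks).

```java
import java.util.Properties;

class DurableConfig {
    static Properties durableWriteProps() {
        Properties props = new Properties();
        props.put("request.required.acks", "-1"); // wait for all ISRs to ack
        props.put("min.insync.replicas", "2");    // broker/topic-side floor on ISR size
        props.put("replication", "3");            // topic created with 3 replicas
        return props;
    }

    public static void main(String[] args) {
        System.out.println(durableWriteProps());
    }
}
```

With these values, a write is only acknowledged once at least two in-sync replicas hold it, so a single broker failure cannot lose an acked message.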

Page 20: intro-kafka


Producer

• Producers publish to a topic of their choosing (push)
• Load can be distributed:
  • Typically by "round-robin"
  • Can also do "semantic partitioning" based on a key in the message

• Brokers load balance by partition
• Can support async (less durable) sending
• All nodes can answer metadata requests about:
  • Which servers are alive
  • Where leaders are for the partitions of a topic

Page 21: intro-kafka


Producer – Load Balancing and ISRs

(diagram: a Producer writing to topic my_topic with 3 partitions and 3 replicas across Broker 100, Broker 101, and Broker 102. Partition 0: leader 100, ISR 101,102. Partition 1: leader 101, ISR 100,102. Partition 2: leader 102, ISR 101,100)

Page 22: intro-kafka


Consumer

• Multiple Consumers can read from the same topic
• Each Consumer is responsible for managing its own offset
• Messages stay on Kafka; they are not removed after they are consumed

(diagram: a Producer writes offsets 1234567-1234577 to the log; three Consumers fetch from it independently)

Page 23: intro-kafka


Consumer

• Consumers can go away

(diagram: the same log; one Consumer has dropped out, and the remaining two keep fetching)

Page 24: intro-kafka


Consumer

• And then come back

(diagram: the departed Consumer rejoins and resumes fetching)

Page 25: intro-kafka


Consumer - Groups

• Consumers can be organized into Consumer Groups

Common patterns:
• 1) All consumer instances in one group
  • Acts like a traditional queue with load balancing

• 2) All consumer instances in different groups
  • All messages are broadcast to all consumer instances

• 3) "Logical Subscriber": many consumer instances in a group
  • Consumers are added for scalability and fault tolerance
  • Each consumer instance reads from one or more partitions for a topic
  • There cannot be more consumer instances than partitions

Page 26: intro-kafka


Consumer - Groups

(diagram: a Kafka Cluster with Broker 1 holding partitions P0 and P3 and Broker 2 holding P1 and P2; Consumer Group A has consumers C1 and C2, Consumer Group B has C3 through C6)

Consumer Groups provide isolation to topics and partitions.

Page 27: intro-kafka


Consumer - Groups

(diagram: the same cluster and groups; one consumer has failed, marked with an X, and its partitions move to the remaining consumers)

Consumer groups can rebalance themselves.

Page 28: intro-kafka


Delivery Semantics

• At least once (the default)
  • Messages are never lost but may be redelivered

• At most once
  • Messages may be lost but are never redelivered

• Exactly once
  • Messages are delivered once and only once
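Which of the first two semantics you get comes down to when the consumer commits its offset relative to processing. The simulation below is hypothetical, not Kafka client code; it just models the crash window between the two steps:

```java
class DeliverySemantics {
    // Commit the offset BEFORE processing: a crash after the commit but
    // before the work loses the record (at-most-once, 0 deliveries).
    // Commit AFTER processing: a crash after the work but before the
    // commit redelivers the record on restart (at-least-once, 2 deliveries).
    static int deliveries(boolean commitFirst, boolean crashBetween) {
        if (!crashBetween) {
            return 1; // happy path: exactly one delivery either way
        }
        return commitFirst ? 0 : 2;
    }

    public static void main(String[] args) {
        System.out.println(deliveries(true, true));  // at-most-once: 0
        System.out.println(deliveries(false, true)); // at-least-once: 2
    }
}
```

Note that neither ordering gives exactly-once on its own; that is the point of the slides that follow.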

Page 29: intro-kafka


Delivery Semantics

• At least once
  • Messages are never lost but may be redelivered

• At most once
  • Messages may be lost but are never redelivered

• Exactly once
  • Messages are delivered once and only once

Much harder (impossible??)

Page 30: intro-kafka


Getting Exactly Once Semantics

• Must consider two components:
  • Durability guarantees when publishing a message
  • Durability guarantees when consuming a message

• Producer
  • What happens when a produce request was sent but a network error returned before an ack?
  • Use a single writer per partition and check the latest committed value after network errors

• Consumer
  • Include a unique ID (e.g. UUID) and de-duplicate
  • Consider storing offsets with data
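The consumer-side de-duplication idea can be sketched as follows. This is a hypothetical example class, not a Kafka API: each message carries a unique ID, and a seen-set turns at-least-once delivery into effectively-once processing. A real system would persist the set (or the offsets) atomically with the work, rather than keep it in memory.

```java
import java.util.HashSet;
import java.util.Set;

class Deduplicator {
    private final Set<String> seen = new HashSet<>();
    private int processed = 0;

    // Process a message unless its ID has been handled already.
    void handle(String messageId, String payload) {
        if (!seen.add(messageId)) {
            return; // duplicate delivery: drop it
        }
        processed++; // the real side effect would happen here
    }

    int processedCount() {
        return processed;
    }

    public static void main(String[] args) {
        Deduplicator d = new Deduplicator();
        d.handle("id-1", "hello");
        d.handle("id-1", "hello"); // redelivered duplicate, ignored
        d.handle("id-2", "world");
        System.out.println(d.processedCount()); // 2
    }
}
```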

Page 31: intro-kafka


Use Cases

• Real-Time Stream Processing (combined with Spark Streaming)

• General purpose Message Bus

• Collecting User Activity Data

• Collecting Operational Metrics from applications, servers or devices

• Log Aggregation

• Change Data Capture

• Commit Log for distributed systems

Page 32: intro-kafka


Positioning

Should I use Kafka?

• For really large file transfers?
  • Probably not; it's designed for "messages", not really for files. If you need to ship large files, consider good old file transfer, or break the files up and move them to Kafka line by line.

• As a replacement for MQ/RabbitMQ/Tibco?
  • Probably. Performance numbers are drastically superior, it allows for transient consumers, and it handles failures pretty well.

• If security on the broker and across the wire is important?
  • From Kafka 0.9.0.0 onwards, Kafka provides SSL security and authentication (experimental).

• To do transformations of data?
  • From Kafka 0.9.0.0 onwards, Kafka introduces Kafka Connect for this.

Page 33: intro-kafka

Developing with Kafka
API and Clients

Page 34: intro-kafka


Kafka Clients

• Remember Kafka implements a binary TCP protocol.

• All clients except the JVM-based clients are maintained outside the Kafka code base.

Page 35: intro-kafka


Producer

import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public static void main(String[] args) throws InterruptedException {
    String topic = "kafka";
    String brokers = "localhost:9092";
    String stringSerializer = "org.apache.kafka.common.serialization.StringSerializer";
    Random rnd = new Random();

    Map<String, Object> config = new HashMap<>();
    config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
    config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, stringSerializer);
    config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, stringSerializer);

    KafkaProducer<String, String> producer = new KafkaProducer<>(config);

    while (true) {
        LocalDateTime runtime = LocalDateTime.now();
        String ip = "192.168.2." + rnd.nextInt(255);
        String msg = runtime + " www.example.com " + ip;
        System.out.println(msg);
        // Key by IP so all events for one IP land in the same partition
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, ip, msg);
        producer.send(record);
        Thread.sleep(100);
    }
}

Page 36: intro-kafka


Consumer

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public static void main(String[] args) throws InterruptedException {
    String topic = "kafka";
    String brokers = "localhost:9092";
    String stringDeserializer = "org.apache.kafka.common.serialization.StringDeserializer";

    Map<String, Object> config = new HashMap<>();
    config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
    config.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, stringDeserializer);
    config.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, stringDeserializer);
    config.put(ConsumerConfig.GROUP_ID_CONFIG, "test");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(config);
    HashSet<String> topics = new HashSet<>();
    topics.add(topic);
    consumer.subscribe(topics);

    while (true) {
        // Poll with a 200 ms timeout; an empty result just means no new data yet
        ConsumerRecords<String, String> records = consumer.poll(200);
        for (ConsumerRecord<String, String> record : records) {
            System.out.println(record);
        }
    }
}

Page 37: intro-kafka


Kafka Clients

• Full list on the wiki, some examples…

Page 38: intro-kafka

Administration
Basic

Page 39: intro-kafka


Installation / Configuration

1. Download and unpack Kafka:

– > wget http://apache.mirror.serversaustralia.com.au/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz

– > tar -xvzf kafka_2.11-0.10.0.0.tgz

– > cd kafka_2.11-0.10.0.0

2. Start the Zookeeper server:

– > bin/zookeeper-server-start.sh config/zookeeper.properties

3. Start the Kafka server:

– > bin/kafka-server-start.sh config/server.properties

Page 40: intro-kafka


Usage

• Create a topic*:

bin/kafka-topics.sh --zookeeper zkhost:2181 --create --topic foo --replication-factor 1 --partitions 1

• List topics:

bin/kafka-topics.sh --zookeeper zkhost:2181 --list

• Write data:

cat data | bin/kafka-console-producer.sh --broker-list brokerhost:9092 --topic test

• Read data:

bin/kafka-console-consumer.sh --zookeeper zkhost:2181 --topic test --from-beginning

Page 41: intro-kafka


Kafka Topics - Admin

• Commands to administer topics are via shell script:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

bin/kafka-topics.sh --list --zookeeper localhost:2181

Topics can be created dynamically by producers… don't do this.

Page 42: intro-kafka

Thank you.