Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid

Post on 22-May-2020





Distributed systems for stream processing

Apache Kafka and Spark Structured Streaming

Alena Hall - @lenadroid


✓ High Scale & Big Data

✓ Distributed Systems

✓ Functional Programming

✓ Data Science & Machine Learning

Alena Hall - lenadroid


Direct result of some action



Produced as a side effect



Continuous metrics




urgent: real-time (~ sub-milliseconds)

not-so-urgent: near-real-time (~ seconds)

flexible: batch (~ minutes, hours, days, weeks)
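
The three urgency classes above can be read as latency budgets. A minimal Python sketch, assuming illustrative cut-offs: the slide only gives rough orders of magnitude, so the 1 ms and 60 s boundaries and the function name `processing_tier` are hypothetical.

```python
def processing_tier(max_latency_seconds: float) -> str:
    """Map a latency budget to one of the three tiers on the slide.

    The boundaries below are assumptions for illustration only.
    """
    if max_latency_seconds < 0:
        raise ValueError("latency budget must be non-negative")
    if max_latency_seconds < 0.001:   # assumed: sub-millisecond budget
        return "real-time"
    if max_latency_seconds < 60:      # assumed: anything under a minute
        return "near-real-time"
    return "batch"
```

A workload that tolerates an hour of delay would land in the batch tier, while one that needs an answer within a few seconds would be near-real-time.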



Event Ingestion Processing & Reaction

real-time micro-batch




Are data workflows flexible enough?




Simplicity. Scalability. Reliability



Meet Apache Kafka



Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java.



Kafka Brokers



Inside of a Kafka Topic

Partition 0: 0 1 2 3 4

Partition 1: 0 1 2 3

Partition 2: 0 1 2 3 4 5 6 7 8
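
The picture above shows a topic as a set of partitions, each an append-only sequence of numbered records. A minimal Python sketch of that idea follows. It is an in-memory model, not the Kafka client API; the class names and the CRC32-based partition choice are assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only log; a record's offset is its position in the log."""
    records: list = field(default_factory=list)

    def append(self, value: str) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset of the record just written


@dataclass
class Topic:
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def send(self, key: str, value: str) -> tuple:
        # Assumption: CRC32 of the key picks the partition. Kafka's default
        # partitioner uses a different hash (murmur2); the idea is the same.
        index = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        offset = self.partitions[index].append(value)
        return index, offset


topic = Topic("feedback", num_partitions=3)
first = topic.send("user-1", "hello")
second = topic.send("user-1", "world")
```

Records with the same key land in the same partition and receive increasing offsets, which is why ordering holds within a partition but not across the whole topic.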



Kafka Topic Partition

0 1 2 3 4 5 6 7 8



Create a Kafka topic



Kafka Producers and Consumers
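
Producers append records to the end of a partition, and each consumer reads forward from its own position without removing anything. The following Python sketch models that relationship in memory. It is not the Kafka client API, and the names `PartitionLog` and `Consumer` are hypothetical.

```python
class PartitionLog:
    """Append-only list standing in for one topic partition."""

    def __init__(self):
        self._records = []

    def produce(self, value):
        self._records.append(value)
        return len(self._records) - 1  # offset assigned to the new record

    def read_from(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own position; reading never modifies the log."""

    def __init__(self, log):
        self._log = log
        self.position = 0

    def poll(self, max_records=10):
        batch = self._log.read_from(self.position, max_records)
        self.position += len(batch)
        return batch


log = PartitionLog()
for text in ["a", "b", "c"]:
    log.produce(text)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()               # reads everything available
slow_batch = slow.poll(max_records=1)  # reads only the first record
```

Because each consumer keeps its own position, a slow reader does not hold back a fast one, and the same records can feed several independent processing tasks.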



Systems for stream processing







Meet Apache Spark



Apache Spark is a unified analytics engine for large-scale data processing: batch, streaming, machine learning, and graph computation, with access to data in hundreds of sources.



✓ Spark SQL and batch processing

✓ Stream processing with Spark Streaming and Structured Streaming

✓ Continuous processing *

✓ Machine Learning with MLlib

✓ Graph computations with GraphX

* Experimental


Spark program



How does Spark work?



Apache Kafka + Apache Spark



Existing infrastructure and resources

✓ Kafka cluster (HDInsight or other)

✓ Spark cluster (Azure Databricks workspace, or other)

✓ Peered Kafka and Spark Virtual Networks

✓ Sources of data: Twitter & Slack



Databricks: Interactive Environment




Processing streams of events from multiple sources with Apache Kafka and Spark



Data sources: external, internal, ...

• Big number of data sources

• Most of the data sources are independent

• Sources of data used for many processing tasks & end-goals



Feedback from Slack

✓ Sending messages to Slack



Listener for new Slack messages

✓ Messages under specific channels

✓ Focused on a particular topic

✓ Sent to a specific Kafka topic
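
The listener described above can be sketched as a filter followed by a forward step. This Python example is an assumption-laden illustration: the channel name, keywords, Kafka topic name, and the `send` callable (a stand-in for a real Kafka producer call) are all hypothetical and do not come from the talk.

```python
WATCHED_CHANNELS = {"#product-feedback"}   # hypothetical channel names
KEYWORDS = {"kafka", "spark"}              # hypothetical subject of interest
KAFKA_TOPIC = "slack-feedback"             # hypothetical Kafka topic name


def is_relevant(message: dict) -> bool:
    """Keep messages from watched channels that mention a keyword."""
    if message.get("channel") not in WATCHED_CHANNELS:
        return False
    text = message.get("text", "").lower()
    return any(word in text for word in KEYWORDS)


def forward(messages, send):
    """Pass relevant messages to `send(topic, text)`.

    `send` stands in for a real Kafka producer call.
    Returns the number of messages forwarded.
    """
    count = 0
    for message in messages:
        if is_relevant(message):
            send(KAFKA_TOPIC, message["text"])
            count += 1
    return count


sent = []
forwarded = forward(
    [
        {"channel": "#product-feedback", "text": "Kafka setup was smooth"},
        {"channel": "#random", "text": "Kafka is a novelist too"},
        {"channel": "#product-feedback", "text": "Lunch at noon?"},
    ],
    send=lambda topic, text: sent.append((topic, text)),
)
```

Filtering before sending keeps unrelated chatter out of the Kafka topic, so downstream consumers only see messages about the subject being tracked.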



Receiving events in Kafka topic

✓ Spark consumer for Kafka topics

✓ Sending only topic-related messages to Kafka



Sending Twitter feedback to Kafka

✓ Getting the latest tweets about a specific topic to Kafka

✓ Receiving those events from Kafka in Spark



Analyzing feedback in real-time

✓ Kafka is receiving events from many sources

✓ Sentiment analysis on incoming Kafka events

✓ Sentiment <= 0.3 → #negative-feedback for review

✓ Sentiment >= 0.9 → #positive-feedback channel
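
The routing rule above can be written as a small function. This Python sketch uses the two thresholds from the slide. It assumes scores fall between 0 and 1 and that scores between the thresholds trigger no action, which the slide does not state. The function name `route` is hypothetical.

```python
NEGATIVE_MAX = 0.3   # from the slide: sentiment <= 0.3
POSITIVE_MIN = 0.9   # from the slide: sentiment >= 0.9


def route(sentiment: float):
    """Return the Slack channel for a score, or None if no action is taken."""
    if sentiment <= NEGATIVE_MAX:
        return "#negative-feedback"
    if sentiment >= POSITIVE_MIN:
        return "#positive-feedback"
    return None  # assumption: mid-range scores are not forwarded
```

In a streaming job this decision would run once per incoming event, after the sentiment score has been computed.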



Kafka + Spark = Reliable, scalable, durable event ingestion and efficient stream processing



trigger(Trigger.Continuous("1 second"))

Low (~1 ms) end-to-end latency

At-least-once fault-tolerance guarantees

Not nearly all operations are supported yet

No automatic retries of failed tasks

Needs enough cluster power to operate
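
At-least-once means that after a failure, records processed since the last saved progress point are replayed, so a record may be seen more than once but is never skipped. The Python sketch below simulates that behaviour. It is a toy model, not Spark's checkpointing mechanism, and the function name and parameters are hypothetical.

```python
def run_with_replay(records, checkpoint_every, fail_at):
    """Process `records`, crash once on reaching index `fail_at`, then resume.

    Progress is saved every `checkpoint_every` records. After the crash the
    loop restarts from the last saved position, so records processed since
    then reach the sink twice. Returns everything the sink received.
    """
    sink = []
    checkpoint = 0   # position to resume from after a failure
    crashed = False
    position = 0
    while position < len(records):
        if position == fail_at and not crashed:
            crashed = True
            position = checkpoint   # restart from the last checkpoint
            continue
        sink.append(records[position])
        position += 1
        if position % checkpoint_every == 0:
            checkpoint = position
    return sink


output = run_with_replay(
    ["e0", "e1", "e2", "e3", "e4"], checkpoint_every=2, fail_at=3
)
```

Here `e2` is delivered twice because the crash happens after it was processed but before the next checkpoint. Downstream logic that must avoid double counting would need to be idempotent or deduplicate.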



Every X seconds    Every X seconds    Every X seconds
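
Processing on a fixed interval means events are collected and handled together once per interval rather than one at a time. The Python sketch below groups timestamped events into such intervals. It is a simplified illustration, not Spark's scheduler, and the function name `micro_batches` and the sample data are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp_seconds, payload) pairs into fixed-length windows.

    Returns a list of (window_start, payloads) sorted by window start.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    buckets = defaultdict(list)
    for timestamp, payload in events:
        window_start = int(timestamp // interval_seconds) * interval_seconds
        buckets[window_start].append(payload)
    return sorted(buckets.items())


batches = micro_batches(
    [(0.5, "a"), (1.2, "b"), (4.9, "c"), (5.0, "d"), (11.0, "e")],
    interval_seconds=5,
)
```

With a 5-second interval the first three events are handled together, which trades some latency for the efficiency of processing records in groups.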












Thank you!

Apache Kafka: aka.ms/apache-kafka

Apache Spark: aka.ms/apache-spark

Event stream processing architecture on Azure with Apache Kafka and Spark: aka.ms/kafka-spark-azure

Create HDInsight Kafka cluster using ARM: aka.ms/hdi-kafka-arm

Create Kafka topics in HDInsight: aka.ms/hdi-kafka-topic



Alena Hall - lenadroid

✓ Works on Azure at

✓ Lives in Seattle

✓ F# Software Foundation Board of Trustees

✓ Organizes @ML4ALL

✓ Program Committee for Lambda World

✓ Has a channel: /c/AlenaHall