Distributed systems for stream processing
Apache Kafka and Spark Structured Streaming
Alena Hall - @lenadroid
✓ High Scale & Big Data
✓ Distributed Systems
✓ Functional Programming
✓ Data Science & Machine Learning
Ever-increasing
Data
Direct result of some action
Produced as a side effect
Continuous metrics
Reaction
urgent, not-so-urgent, flexible
real-time: ~ sub-milliseconds
near-real-time: ~ seconds
batch: ~ minutes, hours, days, weeks
Event Ingestion Processing & Reaction
real-time, micro-batch, batch
Are data workflows flexible enough?
Challenges
Simplicity. Scalability. Reliability
Meet Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java.
Kafka Brokers
Inside of a Kafka Topic
[Diagram: a topic with three partitions, holding messages at offsets 0-4, 0-3, and 0-8]
Kafka Topic Partition
[Diagram: a single partition holding messages at offsets 0-8]
Create a Kafka topic
Kafka Producers and Consumers
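The topic, partition, producer and consumer slides can be tied together with a small in-memory sketch. This is a toy model only, not the Kafka client API: the `Topic` class, the `feedback` topic name and the CRC32-based partition choice are assumptions made for illustration (real Kafka clients ship their own partitioner). It shows the two ideas from the diagrams: each partition is an append-only sequence addressed by offset, and records with the same key land in the same partition.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy in-memory model of a topic: a fixed set of append-only partitions."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # One empty log (a plain list) per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def produce(self, key: str, value: str) -> tuple:
        """Append a record and return (partition, offset) of the new record."""
        # Illustrative hash only: same key always maps to the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1

    def consume(self, partition: int, offset: int) -> list:
        """Return every record in one partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic("feedback", num_partitions=3)
p1, o1 = topic.produce("slack", "great talk")
p2, o2 = topic.produce("slack", "slides please")
p3, o3 = topic.produce("twitter", "nice demo")

# A consumer that has already handled offset 0 resumes from offset 1.
pending = topic.consume(p1, 1)
```

Because a consumer only needs to remember a partition and an offset, it can stop and later resume from where it left off, which is what makes replaying events possible.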
Systems for stream processing
Kafka
Storm
Spark
Flink
Meet Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing: batch, streaming, machine learning, and graph computation, with access to data in hundreds of sources.
✓ Spark SQL and batch processing
✓ Stream processing with Spark Streaming and Structured Streaming
✓ Continuous processing*
✓ Machine Learning with MLlib
✓ Graph computations with GraphX
* Experimental
Spark program
How does Spark work?
[Diagram: Spark application (Driver), Cluster Manager, and the tasks distributed across the cluster]
Apache Kafka + Apache Spark
Existing infrastructure and resources
✓ Kafka cluster (HDInsight or other)
✓ Spark cluster (Azure Databricks workspace, or other)
✓ Peered Kafka and Spark Virtual Networks
✓ Sources of data: Twitter & Slack
Databricks: Interactive Environment
Example
Processing streams of events from multiple sources with Apache Kafka and Spark
Data sources: external, internal, ...
• Large number of data sources
• Most of the data sources are independent
• Sources of data are used for many processing tasks & end goals
Feedback from Slack
✓ Sending messages to Slack
Listener for new Slack messages
✓ Messages under specific channels
✓ Focused on a particular topic
✓ Sent to a specific Kafka topic
Receiving events in Kafka topic
✓ Spark consumer for Kafka topics
✓ Sending only topic-related messages to Kafka
Sending Twitter feedback to Kafka
✓ Getting latest tweets about a specific topic to Kafka
✓ Receiving those events from Kafka in Spark
Analyzing feedback in real-time
✓ Kafka is receiving events from many sources
✓ Sentiment analysis on incoming Kafka events
✓ Sentiment <= 0.3 → #negative-feedback for review
✓ Sentiment >= 0.9 → #positive-feedback channel
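The routing rule on this slide can be written down as a small function. This is a minimal sketch, not the code from the demo: the function name `route_feedback`, the sample events and their scores are hypothetical, and it assumes the sentiment score is a number where lower means more negative. Only the two thresholds and the two channel names come from the slide.

```python
from typing import Optional

NEGATIVE_THRESHOLD = 0.3  # from the slide: sentiment <= 0.3
POSITIVE_THRESHOLD = 0.9  # from the slide: sentiment >= 0.9


def route_feedback(sentiment: float) -> Optional[str]:
    """Pick the Slack channel for an event, or None if it is not forwarded."""
    if sentiment <= NEGATIVE_THRESHOLD:
        return "#negative-feedback"
    if sentiment >= POSITIVE_THRESHOLD:
        return "#positive-feedback"
    return None  # scores between the thresholds are left alone in this sketch


# Hypothetical events, as they might look after sentiment analysis.
events = [
    {"source": "slack", "sentiment": 0.1},
    {"source": "twitter", "sentiment": 0.95},
    {"source": "twitter", "sentiment": 0.5},
]
routed = [(e["source"], route_feedback(e["sentiment"])) for e in events]
```

In a streaming job the same rule would be applied to each incoming Kafka event after scoring, so events from Slack and Twitter share one routing path regardless of where they originated.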
Kafka + Spark = Reliable, scalable, durable event ingestion and efficient stream processing
trigger(Trigger.Continuous("1 second"))
Low (~1 ms) end-to-end latency
At-least-once fault-tolerance guarantees
Only a subset of operations is supported so far
No automatic retries of failed tasks
Needs enough cluster power to operate
[Diagram: timeline with a checkpointing epoch every X seconds; the time between when an event is at the source and when it is processed to the sink is ~1 ms]
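The combination of checkpointing epochs and at-least-once guarantees can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Spark's implementation: the function `run_with_checkpoints`, the `fail_at` parameter and the event names are hypothetical, and an epoch is modeled as a fixed number of events rather than a time interval. It shows why a restart from the last checkpoint delivers some events to the sink more than once.

```python
def run_with_checkpoints(events, epoch_size, fail_at=None):
    """Process events in order, checkpointing the offset every `epoch_size` events.

    If `fail_at` is set, the run fails once just before processing that offset
    and restarts from the last checkpoint, replaying events since then.
    """
    sink = []
    checkpoint = 0   # offset to resume from after a failure
    offset = 0
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            offset = checkpoint  # recovery: go back to the last checkpoint
            continue
        sink.append(events[offset])
        offset += 1
        if offset % epoch_size == 0:
            checkpoint = offset  # epoch boundary: progress is now durable
    return sink


events = [f"e{i}" for i in range(6)]
clean = run_with_checkpoints(events, epoch_size=3)
replayed = run_with_checkpoints(events, epoch_size=3, fail_at=5)
# In `replayed`, e3 and e4 reach the sink twice: they were processed after
# the checkpoint at offset 3 but before the failure at offset 5.
```

No event is lost, but duplicates are possible, so a sink that must not double-count needs to tolerate or deduplicate replayed events.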
aka.ms/eventhubs-kafka
Operator
Operators
Thank you!
Apache Kafka: aka.ms/apache-kafka
Apache Spark: aka.ms/apache-spark
Event stream processing architecture on Azure with Apache Kafka and Spark: aka.ms/kafka-spark-azure
Create HDInsight Kafka cluster using ARM: aka.ms/hdi-kafka-arm
Create Kafka topics in HDInsight: aka.ms/hdi-kafka-topic
Alena Hall - lenadroid
✓ Works on Azure at [company logo]
✓ Lives in Seattle
✓ F# Software Foundation Board of Trustees
✓ Organizes @ML4ALL
✓ Program Committee for Lambda World
✓ Has a channel: /c/AlenaHall