guest lecture on spark streaming in stanford cme 323: distributed algorithms and optimization
TRANSCRIPT
![Page 1: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/1.jpg)
Streaming items through a cluster with Spark Streaming
Tathagata “TD” Das @tathadas
CME 323: Distributed Algorithms and OptimizationStanford, May 6, 2015
![Page 2: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/2.jpg)
Who am I?>Project Management Committee (PMC)
member of Apache Spark
>Lead developer of Spark Streaming
>Formerly in AMPLab, UC Berkeley
>Software developer at Databricks
>Databricks was started by creators of Spark to provide Spark-as-a-service in the cloud
![Page 3: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/3.jpg)
Big Data
![Page 4: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/4.jpg)
Big Streaming Data
![Page 5: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/5.jpg)
Why process Big Streaming Data?Fraud detection in bank transactions
Anomalies in sensor data
Cat videos in tweets
![Page 6: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/6.jpg)
How to Process Big Streaming Data
>Ingest – Receive and buffer the streaming
data
>Process – Clean, extract, transform the
data
>Store – Store transformed data for
consumption
Ingest data
Process data
Store results
Raw Tweets
![Page 7: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/7.jpg)
How to Process Big Streaming Data
>For big streams, every step requires a
cluster
>Every step requires a system that is
designed for it
Ingest data
Process data
Store results
Raw Tweets
![Page 8: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/8.jpg)
Stream Ingestion Systems
>Kafka – popular distributed pub-sub system
>Kinesis – Amazon managed distributed pub-sub
>Flume – like a distributed data pipe
Ingest data
Process data
Store results
Raw Tweets
AmazonKinesis
![Page 9: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/9.jpg)
Stream Ingestion Systems
>Spark Streaming – From UC Berkeley +
Databricks
>Storm – From Backtype + Twitter
>Samza – From LinkedIn
Ingest data
Process data
Store results
Raw Tweets
![Page 10: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/10.jpg)
Stream Ingestion Systems
>File systems – HDFS, Amazon S3, etc.
>Key-value stores – HBase, Cassandra, etc.
>Databases – MongoDB, MemSQL, etc.
Ingest data
Process data
Store results
Raw Tweets
![Page 11: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/11.jpg)
![Page 12: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/12.jpg)
Kafka Cluster
Producers and Consumers
>Producers publish data tagged by “topic”>Consumers subscribe to data of a particular
“topic”
Producer 1
Producer 2
(topicX, data1)
(topicY, data2)
(topicX, data3)
Topic X Consumer
Topic Y Consumer
(topicX, data1)
(topicX, data3)
(topicY, data2)
![Page 13: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/13.jpg)
>Topic = category of message, divided into partitions
>Partition = ordered, numbered stream of messages
>Producer decides which (topic, partition) to put each message in
Topics and Partitions
![Page 14: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/14.jpg)
>Topic = category of message, divided into partitions
>Partition = ordered, numbered stream of messages
>Producer decides which (topic, partition) to put each message in
>Consumer decides which (topic, partition) to pull messages from
- High-level consumer – handles fault-recovery with Zookeeper
- Simple consumer – low-level API for greater control
Topics and Partitions
![Page 15: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/15.jpg)
How to process Kafka messages?
Ingest data
Process data
Store results
Raw Tweets
>Incoming tweets received in distributed manner and buffered in Kafka
>How to process them?
![Page 16: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/16.jpg)
treaming
![Page 17: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/17.jpg)
What is Spark Streaming?
Scalable, fault-tolerant stream processing system
File systems
DatabasesDashboard
s
Flume
HDFS
Kinesis
Kafka
High-level API
joins, windows, …often 5x less code
Fault-tolerantExactly-once
semantics, even for stateful ops
IntegrationIntegrate with MLlib,
SQL, DataFrames, GraphX
![Page 18: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/18.jpg)
How does Spark Streaming work?
> Receivers chop up data streams into batches of few seconds
> Spark processing engine processes each batch and pushes out the results to external data stores
data streams
Receiv
ers
batches as RDDs
results as RDDs
![Page 19: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/19.jpg)
Spark Programming Model>Resilient distributed datasets (RDDs)
- Distributed, partitioned collection of objects
- Manipulated through parallel transformations
(map, filter, reduceByKey, …)
- All transformations are lazy, execution forced by actions
(count, reduce, take, …)
- Can be cached in memory across cluster
- Automatically rebuilt on failure
![Page 20: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/20.jpg)
Spark Streaming Programming Model>Discretized Stream (DStream)
- Represents a stream of data
- Implemented as a infinite sequence of RDDs
>DStreams API very similar to RDD API- Functional APIs in
- Create input DStreams from Kafka, Flume, Kinesis, HDFS, …
- Apply transformations
![Page 21: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/21.jpg)
Example – Get hashtags from Twitter val ssc = new StreamingContext(conf, Seconds(1))
StreamingContext is the starting point of all streaming functionality
Batch interval, by which streams will be chopped up
![Page 22: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/22.jpg)
Example – Get hashtags from Twitter val ssc = new StreamingContext(conf, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
Input DStream
batch @ t+1batch @ t batch @ t+2
tweets DStream
replicated and stored in memory as RDDs
Twitter Streaming API
![Page 23: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/23.jpg)
Example – Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
flatMap flatMap flatMap
…
transformation: modify data in one DStream to create another DStream
transformed DStream
new RDDs created for every batch
batch @ t+1batch @ t batch @ t+2
tweets DStream
hashTags Dstream[#cat, #dog, … ]
![Page 24: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/24.jpg)
Example – Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsTextFiles("hdfs://...")
output operation: to push data to external storage
flatMap flatMap flatMap
save save save
batch @ t+1batch @ t batch @ t+2tweets DStream
hashTags DStream
every batch saved to HDFS
![Page 25: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/25.jpg)
Example – Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })
foreachRDD: do whatever you want with the processed data
flatMap flatMap flatMap
foreach foreach foreach
batch @ t+1batch @ t batch @ t+2tweets DStream
hashTags DStream
Write to a database, update analytics UI, do whatever you want
![Page 26: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/26.jpg)
Example – Get hashtags from Twitter val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })
ssc.start()
all of this was just setup for what to do when streaming data is receiver
this actually starts the receiving and processing
![Page 27: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/27.jpg)
What’s going on inside?>Receiver buffers
tweets in Executors’ memory
>Spark Streaming Driver launches tasks to process tweets
Driver running
DStreams
Executors
launch tasks to process
tweets
Raw TweetsTw
itte
r R
ece
iver Buffere
d Tweets
Buffered
Tweets
Spark Cluster
![Page 28: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/28.jpg)
What’s going on inside?
Driver running
DStreams
Executors
launch tasks to process
data
Rece
iver
Kafka Cluster
Rece
iver
Rece
iver
receive data in parallel
Spark Cluster
![Page 29: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/29.jpg)
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency
0 50 1000
0.5
1
1.5
2
2.5
3
3.5WordCount
1 sec2 sec
# Nodes in Cluster
Clus
ter T
hrou
ghpu
t (GB
/s)
0 20 40 60 80100
0
1
2
3
4
5
6
7Grep
1 sec2 sec
# Nodes in Cluster
Clus
ter T
hhro
ughp
ut (G
B/s)
![Page 30: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/30.jpg)
DStream of data
Window-based Transformationsval tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
sliding window operation window length sliding interval
window length
sliding interval
![Page 31: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/31.jpg)
Arbitrary Stateful ComputationsSpecify function to generate new state based on previous state and new data- Example: Maintain per-user mood as state, and
update it with their tweets
def updateMood(newTweets, lastMood) => newMood
val moods = tweetsByUser.updateStateByKey(updateMood _)
![Page 32: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/32.jpg)
Integrates with Spark Ecosystem
Spark Core
Spark StreamingSpark SQL MLlib GraphX
![Page 33: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/33.jpg)
Combine batch and streaming processing
> Join data streams with static data sets
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile(“file”)
// Join each batch in stream with dataset
kafkaStream.transform { batchRDD =>
batchRDD.join(dataset)filter(...)
}
Spark Core
Spark StreamingSpark SQL MLlib GraphX
![Page 34: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/34.jpg)
Combine machine learning with streaming
> Learn models offline, apply them online
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
model.predict(event.feature)
}
Spark Core
Spark StreamingSpark SQL MLlib GraphX
![Page 35: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/35.jpg)
Combine SQL with streaming
> Interactively query streaming data with SQL
// Register each batch in stream as table
kafkaStream.map { batchRDD =>
batchRDD.registerTempTable("latestEvents")
}
// Interactively query table
sqlContext.sql("select * from latestEvents")
Spark Core
Spark StreamingSpark SQL MLlib GraphX
![Page 36: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/36.jpg)
100+ known industry deployments
![Page 37: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/37.jpg)
Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
37
![Page 38: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/38.jpg)
Neuroscience @ Freeman Lab, Janelia Farm
Spark Streaming and MLlib to analyze neural activities
Laser microscope scans Zebrafish brain Spark Streaming interactive visualization
laser ZAP to kill neurons!http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
![Page 39: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/39.jpg)
Neuroscience @ Freeman Lab, Janelia Farm
Streaming machine learning algorithms on time series data of every neuron
Upto 2TB/hour and increasing with brain size
Upto 80 HPC nodeshttp://www.jeremyfreeman.net/share/talks/spark-summit-2014/
![Page 40: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/40.jpg)
Streaming Machine Learning Algos>Streaming Linear Regression
>Streaming Logistic Regression
>Streaming KMeans
http://www.jeremyfreeman.net/share/talks/spark-summit-east-2015/#/algorithms-repeat
![Page 41: Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and Optimization](https://reader035.vdocuments.net/reader035/viewer/2022062420/55cb348bbb61eb7a098b46ba/html5/thumbnails/41.jpg)
Okay okay, how do I start off?>Online Streaming
Programming Guidehttp://spark.apache.org/docs/latest/streaming-programming-guide.html
>Streaming exampleshttps://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming