Deep Dive into Spark Streaming

TRANSCRIPT

  1. NRT: Deep Dive into Spark Streaming
  2. Agenda
     - Spark Streaming Introduction
     - Computing Model in Spark Streaming
     - System Model & Architecture
     - Fault-tolerance, Checkpointing
     - Comb on Spark Streaming
  3. What is Spark Streaming?
     - Extends Spark for large-scale stream processing
     - Efficient and fault-tolerant stateful stream processing
     - Provides a simple batch-like API for implementing complex algorithms
     (Stack diagram: Spark core alongside Spark SQL, Spark Streaming, GraphX, MLlib)
  4. Why Spark Streaming?
     - Scales to hundreds of nodes
     - Achieves low latency
     - Efficiently recovers from failures
     - Integrates with batch and interactive processing
  5. Spark Streaming
     - Runs a streaming computation as a series of very small, deterministic batch jobs
     - Chops the live stream into batches of X seconds (batch size can be as low as 0.5 seconds; latency is about 1 second)
     - Spark treats each batch of data as an RDD and processes it using RDD operations
     - Finally, the processed results of the RDD operations are returned in batches
     (Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results)
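A minimal sketch of this micro-batch model in code (the socket source, host, port, and word-count logic are illustrative assumptions, not from the slides): every 1-second batch of received lines becomes one RDD, and the same RDD-style operations run on each batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MicroBatchSketch {
      def main(args: Array[String]): Unit = {
        // One RDD is produced per 1-second batch interval.
        val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: text lines arriving on a TCP socket.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Ordinary RDD-style operations are applied to every batch.
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }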
  6. Agenda
     - Spark Streaming Introduction
     - Computing Model in Spark Streaming
     - System Model & Architecture
     - Fault-tolerance, Checkpointing
     - Comb on Spark Streaming
  7. Spark Computing Model: Discretized Stream (DStream)
     - Represents a stream of data
     - Implemented as a sequence of RDDs
     - DStreams can be either:
       - Created from streaming input sources
       - Created by applying transformations on existing DStreams
  8. Spark Streaming - Key Concepts
     - DStream: a sequence of RDDs representing a stream of data (Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets)
     - Transformations: modify data from one DStream to another
       - Standard RDD operations: map, countByValue, reduceByKey, join, ...
       - Stateful operations: window, countByValueAndWindow, ...
     - Output operations: send data to an external entity
       - saveAsHadoopFiles: saves to HDFS
       - foreach: do anything with each batch of results
  9. Example 1 - Get hashtags from Twitter

       val tweets = ssc.twitterStream(...)

     - DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data
     (Diagram: the Twitter Streaming API feeds the tweets DStream; each batch @ t, t+1, t+2 is stored in memory as an RDD - an immutable, distributed dataset)
  10. Example 1 - Get hashtags from Twitter

       val tweets = ssc.twitterStream(...)
       val hashTags = tweets.flatMap(status => getTags(status))

     - Transformation: modify data in one DStream to create another DStream
     (Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream; new RDDs such as [#cat, #dog, ...] are created for every batch)
  11. Example 1 - Get hashtags from Twitter

       val tweets = ssc.twitterStream(...)
       val hashTags = tweets.flatMap(status => getTags(status))
       hashTags.saveAsHadoopFiles("hdfs://...")

     - Output operation: push data to external storage
     (Diagram: flatMap then save for each batch; every batch of the hashTags DStream is saved to HDFS)
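The slides use an older ssc.twitterStream helper with the credentials elided. A roughly equivalent end-to-end sketch with the external spark-streaming-twitter connector (TwitterUtils) could look like the following; the getTags helper, the output path, and the use of saveAsTextFiles for a DStream of strings are assumptions for illustration, not the author's exact code.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils
    import twitter4j.Status

    object HashtagsSketch {
      // Assumed helper: extract hashtag texts from a tweet.
      def getTags(status: Status): Seq[String] =
        status.getHashtagEntities.map("#" + _.getText)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HashtagsSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // With None, OAuth credentials are read from twitter4j system properties.
        val tweets = TwitterUtils.createStream(ssc, None)
        val hashTags = tweets.flatMap(status => getTags(status))

        // One set of files is written per batch, named with the given prefix/suffix.
        hashTags.saveAsTextFiles("hdfs:///tmp/hashtags", "txt")

        ssc.start()
        ssc.awaitTermination()
      }
    }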
  12. Example 2 - Count the hashtags over the last 1 minute

       val tweets = ssc.twitterStream(...)
       val hashTags = tweets.flatMap(status => getTags(status))
       val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

     - Sliding window operation over the DStream of data: window length = Minutes(1), sliding interval = Seconds(1)
  13. Example 2 - Count the hashtags over the last 1 minute

       val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

     (Diagram: the hashTags DStream at t-1, t, t+1, t+2, t+3; the sliding window covers several batches and countByValue counts over all the data in the window to produce the tagCounts DStream)
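As a sketch of how this windowed count plugs into the pipeline above (continuing from the hashTags DStream; printing the result is an assumption for illustration), the window-plus-count can also be written with the fused countByValueAndWindow operator:

    import org.apache.spark.streaming.{Minutes, Seconds}

    // Count each hashtag over the last 1 minute of data, recomputed every second.
    val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

    // Equivalent shorthand that fuses the window and the count; its incremental
    // reduce requires checkpointing to be enabled via ssc.checkpoint(...).
    val tagCounts2 = hashTags.countByValueAndWindow(Minutes(1), Seconds(1))

    tagCounts.print()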
  14. Example 3 - Arbitrary Stateful Computations
     - Maintain arbitrary state and track sessions, e.g. maintain per-user mood as state and update it with his/her tweets:

       moods = tweets.updateStateByKey(tweet => updateMood(tweet))
       updateMood(newTweets, lastMood) => newMood

     (Diagram: the tweets DStream at t-1 ... t+3 continuously feeds the moods state DStream)
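The slide's pseudocode compresses the key-by-user step. A minimal sketch of the same pattern follows; the Mood type, the userOf/moodOf helpers, the scoring logic, and treating tweets as a DStream[String] are all assumptions for illustration.

    // Hypothetical per-user mood score maintained across batches.
    type Mood = Double

    // Assumed helpers: extract a user key and a per-tweet mood contribution.
    def userOf(tweet: String): String = tweet.takeWhile(_ != ':')
    def moodOf(tweet: String): Mood = if (tweet.contains(":)")) 1.0 else -1.0

    // updateStateByKey calls this once per key per batch with the new values
    // for that key and the previous state (None if the key is new).
    def updateMood(newTweets: Seq[Mood], lastMood: Option[Mood]): Option[Mood] =
      Some(lastMood.getOrElse(0.0) + newTweets.sum)

    // tweets: DStream[String]; stateful operators need ssc.checkpoint(...) to be set.
    val moods = tweets
      .map(t => (userOf(t), moodOf(t)))
      .updateStateByKey(updateMood _)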
  15. Example 4 - Combine Batch and Stream Processing
     - Do arbitrary Spark RDD computation within a DStream, e.g. join incoming tweets with a spam file to filter out bad tweets:

       tweets.transform(tweetsRDD => {
         tweetsRDD.join(spamHDFSFile).filter(...)
       })

     - Combine live data streams with historical data
     - Combine streaming with MLlib and GraphX algorithms
     - Query streaming data using SQL
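The join requires both sides to be keyed RDDs. A sketch of one way to do this follows, assuming a tweetsByUser DStream keyed by user id and a spam-user file on HDFS (both names and the path are illustrative); a leftOuterJoin is used so that only non-spam tweets survive, a variation on the slide's join(...).filter(...).

    // Static reference data: one spam user id per line (path is illustrative).
    val spamUsers = ssc.sparkContext
      .textFile("hdfs:///data/spam_users.txt")
      .map(user => (user, true))

    // transform exposes each batch as an ordinary RDD, so any RDD code works.
    // Assume tweetsByUser: DStream[(String, String)] keyed by user id.
    val cleanTweets = tweetsByUser.transform { tweetsRDD =>
      tweetsRDD
        .leftOuterJoin(spamUsers)                 // (user, (tweet, Some(true) | None))
        .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }
        .mapValues { case (tweet, _) => tweet }
    }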
  16. Agenda
     - Spark Streaming Introduction
     - Computing Model in Spark Streaming
     - System Model & Architecture
     - Fault-tolerance, Checkpointing
     - Comb on Spark Streaming
  17. System - Components
     - Network Input Tracker: keeps track of the data received by each network receiver and maps it to the corresponding input DStreams
     - Job Scheduler: periodically queries the DStream graph to generate Spark jobs from received data and hands them to the Job Manager for execution
     - Job Manager: maintains a job queue and executes the jobs in Spark

       ssc = new StreamingContext
       t = ssc.twitterStream()
       t.filter().foreach()

     (Diagram: your program builds a DStream graph on the driver - Spark Context, Network Input Tracker, Job Manager, Job Scheduler - which drives the Spark client (RDD graph, scheduler, block manager, shuffle tracker) and Spark workers (block manager, task threads) under a cluster manager)
  18. DStream Graph
     - A Spark Streaming program maps to a DStream graph, e.g.:

       t = ssc.twitterStream().map(); t.foreach()
       -> Twitter Input DStream -> Mapped DStream -> Foreach DStream (a dummy DStream signifying an output operation)

       t1 = ssc.twitterStream(); t2 = ssc.twitterStream()
       t = t1.union(t2).map()
       t.saveAsHadoopFiles(); t.map().foreach(); t.filter().foreach()
       -> two Twitter Input DStreams (T1, T2) -> Union -> Map, fanning out into several output (Foreach) DStreams
  19. DStream Graph -> RDD Graphs -> Spark Jobs
     - Every batch interval, an RDD graph is computed from the DStream graph
     - For each output operation, a Spark action is created
     - For each action, a Spark job is created to compute it
     (Diagram: the DStream graph becomes an RDD graph of block RDDs holding the data received in the last batch interval, with one Spark action per output operation, yielding 3 Spark jobs)
  20. Execution Model - Receiving Data
     (Diagram: on StreamingContext.start(), the driver's Network Input Tracker launches a Receiver on a Spark worker; received data is pushed as blocks into the worker's Block Manager, blocks are replicated to another worker's Block Manager, and block locations are reported to the Block Manager Master on the driver)
  21. Execution Model - Job Scheduling
     (Diagram: on the driver, the Network Input Tracker hands block IDs to the Job Scheduler, which walks the DStream graph to produce jobs over the corresponding RDDs; the Job Manager keeps them in a job queue and submits them to Spark's schedulers, and the jobs are executed on the worker nodes)
  22. Agenda
     - Spark Streaming Introduction
     - Computing Model in Spark Streaming
     - System Model & Architecture
     - Fault-tolerance, Checkpointing
     - Comb on Spark Streaming
  23. Fault-tolerance: Worker
     - RDDs remember the operations that created them
     - Batches of input data are replicated in memory for fault-tolerance
     - Data lost due to a worker failure can be recomputed from the replicated input data
     - All transformed data is fault-tolerant, with exactly-once transformations
     (Diagram: input data replicated in memory; flatMap from the tweets RDD to the hashTags RDD; lost partitions are recomputed on other workers)
  24. Fault-tolerance: Driver
     - The master saves the state of the DStreams to a checkpoint file
     - The checkpoint file is saved to HDFS periodically
     - If the master fails, it can be restarted using the checkpoint file
     - Automated master fault recovery
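In the user-facing API this recovery path is wired up with a checkpoint directory and StreamingContext.getOrCreate. A minimal sketch follows; the checkpoint path, the socket source, and the counting pipeline are illustrative assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // illustrative path

    // Builds the DStream graph and enables checkpointing; only called on a fresh start.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("DriverRecoverySketch")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint(checkpointDir)
      val lines = ssc.socketTextStream("localhost", 9999)  // assumed source
      lines.count().print()
      ssc
    }

    // After a driver failure, the context is rebuilt from the checkpoint instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()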
  25. Checkpointing - RDD
     - Saves an RDD to HDFS to prevent the RDD graph from growing too large
     - Done internally in Spark, transparent to the user program
     - Done lazily: saved to HDFS the first time it is computed

       red_rdd.checkpoint()

     - The contents of red_rdd are saved to an HDFS file, transparently to all child RDDs
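The same mechanism is available directly in core Spark. A small sketch (the checkpoint directory and the RDD contents are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("RddCheckpointSketch").setMaster("local[2]"))

    // Checkpoint files go to a reliable filesystem such as HDFS.
    sc.setCheckpointDir("hdfs:///checkpoints/rdds")  // illustrative path

    val redRdd = sc.parallelize(1 to 1000000).map(_ * 2)
    redRdd.checkpoint()   // marked for checkpointing; nothing is written yet (lazy)

    // The first action computes the RDD, writes it to the checkpoint directory,
    // and truncates its lineage so child RDDs depend on the checkpoint file.
    println(redRdd.count())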
  26. Why is RDD checkpointing necessary?
     - Stateful DStream operators can have infinite lineages
     - Large lineages lead to:
       - A large closure of the RDD object -> large task sizes -> high task launch times
       - High recovery times under failure
     (Diagram: the data DStream at t-1 ... t+3 continuously feeds the states DStream, so each state RDD depends on all previous batches)
  27. Checkpointing - Metadata
     - Saves the information defining the streaming computation to fault-tolerant storage such as HDFS. Metadata includes:
       - Configuration: the configuration that was used to create the streaming application
       - DStream operations: the set of DStream operations that define the streaming application
       - Incomplete batches: batches whose jobs are queued but have not completed yet
     - This is used to recover from failure of the node running the driver of the streaming application
  28. Agenda
     - Spark Streaming Introduction
     - Computing Model in Spark Streaming
     - System Model & Architecture
     - Fault-tolerance, Checkpointing
     - Comb on Spark Streaming
  29. Data Processing - Window Operations
     - Window-based computing: countByWindow, reduceByKeyAndWindow
     - Scenarios: app metrics within a time window; cache/persist events within a time window
     - Notes: checkpointing, performance tuning

       val eventCount = inputs.countByWindow(Seconds(30), Seconds(10))
       eventCount.print()

       val idCount = lines.map(x => (parser(x), 1)).reduceByKey(_ + _)
       val idWindowCount = idCount.reduceByKeyAndWindow(
         (a: Int, b: Int) => (a + b), Seconds(30), Seconds(10))
       idWindowCount.print()
  30. Data Processing - Session Operations
     - Sessionization: updateStateByKey

       val events = inputs.map[(String, (Long, String))](event => {
         val eventObj = event.getBondValue()
         val id = eventObj.getAppInfo().getId()
         val eventName = eventObj.getEventInfo.getName()
         val time = System.currentTimeMillis()
         (eventName, (time, id))
       })

       val latestSessions = events.map[(String, (Long, Long, Long))](imp => {
         (imp._1, (imp._2._1, imp._2._1, 1))
       }).reduceByKey((imp1, imp2) => {
         (Math.min(imp1._1, imp2._1), Math.max(imp1._2, imp2._2), imp1._3 + imp2._3)
       }).updateStateByKey(SessionObject.updateStatbyOfSessions)

     - How to update the session status?
  31.  def updateStatbyOfSessions(newBatch: Seq[(Long, Long, Long)],
                                  session: Option[(Long, Long, Long)]): Option[(Long, Long, Long)] = {
         var result: Option[(Long, Long, Long)] = null
         if (newBatch.isEmpty) {
           if (System.currentTimeMillis() - session.get._2 > SESSION_TIMEOUT) {
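The transcript ends mid-function. Purely as an illustration of the usual session-timeout pattern behind updateStateByKey, a sketch of how such an update function is commonly completed follows; the SESSION_TIMEOUT value and the (firstSeen, lastSeen, count) tuple layout are assumptions inferred from the code above, not the author's actual continuation.

    // Sketch only: per-key session state is (firstSeen, lastSeen, count).
    val SESSION_TIMEOUT = 30 * 60 * 1000L  // assumed: 30 minutes in milliseconds

    def updateSession(newBatch: Seq[(Long, Long, Long)],
                      session: Option[(Long, Long, Long)]): Option[(Long, Long, Long)] = {
      if (newBatch.isEmpty) {
        // No new events for this key: expire the session once it has been idle too long.
        session match {
          case Some((_, lastSeen, _))
            if System.currentTimeMillis() - lastSeen > SESSION_TIMEOUT => None
          case other => other
        }
      } else {
        // Merge the pre-aggregated batch values with the existing session, if any.
        val (bFirst, bLast, bCount) = newBatch.reduce { (a, b) =>
          (math.min(a._1, b._1), math.max(a._2, b._2), a._3 + b._3)
        }
        session match {
          case Some((first, last, count)) =>
            Some((math.min(first, bFirst), math.max(last, bLast), count + bCount))
          case None =>
            Some((bFirst, bLast, bCount))
        }
      }
    }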