Spark Streaming with Cassandra

Posted on 21-May-2015

Category: Data & Analytics

DESCRIPTION

This presentation introduces the Apache Spark Streaming system and shows how it can be used with the Apache Cassandra database.

TRANSCRIPT

1. Spark Streaming with C* (jacek.lewandowski@datastax.com)

2. Applies where you need near-realtime data analysis.

3. Spark vs Spark Streaming: Spark works on zillions of bytes in a static dataset; Spark Streaming works on gigabytes per second arriving as a stream of data.

4-5. What can you do with it? Sources include applications, sensors, web and mobile phones; use cases include intrusion detection, malfunction detection, site analytics, network metrics analysis, fraud detection, dynamic process optimisation, recommendations, location based ads, log processing, supply chain planning, sentiment analysis, spying.

6. Almost whatever source you want; almost whatever destination you want.

7. So, let's see how it works.

8. DStream: a continuous sequence of micro batches. A DStream is a series of Batches (ordinary RDDs); processing a DStream means processing its Batches (RDDs).

9-11. Receiver: the interface between different stream sources and Spark. Incoming messages cross the Spark memory boundary into the Block Manager, which handles replication and building Batches.

12-14. The Block Manager holds blocks of input data; a Batch is made of those blocks.

15-17. A Batch made of blocks: each block becomes a partition of the Batch.

18-19. Ingestion from multiple sources: several receivers run receiving and Batch building in parallel, and a new Batch is emitted every batch interval (0s, 1s, 2s, ...).

20. A well-worn example:
    - ingestion of text messages
    - splitting them into separate words
    - count the occurrence of words within 5-second windows
    - save the word counts from the last 5 seconds, every 5 seconds, to Cassandra, and display the first few results on the console

21. How to do that? Well...

22. Yes, it is that easy (see the setup sketch after slide 27 for the context this snippet assumes):

    case class WordCount(time: Long, word: String, count: Int)

    val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
    val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
    val wordCounts: DStream[(String, Long)] = words.countByValue()

    val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
      val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
        (count.toInt, WordCount(time.milliseconds, word, count.toInt))
      }
      val topWordCountsRDD: RDD[WordCount] = mappedWordCounts.sortByKey(ascending = false).values
      topWordCountsRDD
    })

    topWordCounts.saveToCassandra("meetup", "word_counts")
    topWordCounts.print()

23. DStream stateless operators (quick recap): map, flatMap, filter, repartition, union, count, countByValue, reduce, reduceByKey, joins, cogroup, transform, transformWith.

24-25. DStream[Bean].count(): the count is computed independently for each 1-second Batch (e.g. 4, then 3).

26. DStream[Orange].union(DStream[Apple]): the union of corresponding Batches.

27. Other stateless operations: join(DStream[(K, W)]), leftOuterJoin(DStream[(K, W)]), rightOuterJoin(DStream[(K, W)]) and cogroup(DStream[(K, W)]) are applied on pairs of corresponding Batches.
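The snippet on slide 22 assumes that a StreamingContext and an input DStream named stream already exist and that the DataStax Cassandra connector is on the classpath. Below is a minimal, hypothetical sketch of that surrounding setup; the socket source, host, checkpoint directory and Cassandra connection settings are illustrative assumptions, not part of the deck (the deck's stream appears to be a (key, message) pair DStream, for example from Kafka, whereas a socket source yields plain lines).

    // Minimal sketch of the setup assumed by the slide code (assumptions noted above).
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._ // enables saveToCassandra on DStreams

    object StreamingWordCountSetup {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-word-count")
          .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra

        // 5-second batch interval, matching the "well-worn example" on slide 20
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/streaming-checkpoint") // required later for window/stateful operators

        // Any receiver works here; a raw socket is the simplest thing to try locally.
        // Note: this yields plain lines, while the deck's `stream` is a (key, message) pair DStream.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Build the pipeline from slide 22 on top of `lines`
        lines.flatMap(_.split("""\s+""")).countByValue().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }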
28. transform, transformWith:
    DStream[T].transform(RDD[T] => RDD[U]): DStream[U]
    DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V]
    These allow you to create new stateless operators.

29-31. DStream[Blue].transformWith(DStream[Red], ...): DStream[Violet]: the function is applied to pairs of corresponding Batches (1-A with 1-B, 2-A with 2-B, 3-A with 3-B); see the code sketch after slide 43.

32-34. Windowing: by default, window = slide = Batch duration.

35-37. Windowing with a window longer than the slide: the resulting DStream consists of 3-second Batches, and each resulting Batch overlaps the preceding one by 1 second.

38-39. Windowing with window = 3s and slide = 1s: a Batch appears in the output stream every 1 second and contains the messages collected during the last 3 seconds.

40. DStream window operators:
    window(Duration, Duration)
    countByWindow(Duration, Duration)
    reduceByWindow(Duration, Duration, (T, T) => T)
    countByValueAndWindow(Duration, Duration)
    groupByKeyAndWindow(Duration, Duration)
    reduceByKeyAndWindow((V, V) => V, Duration, Duration)

41. Let's modify the example:
    - ingestion of text messages
    - splitting them into separate words
    - count the occurrence of words within 10-second windows
    - save the word counts from the last 10 seconds, every 2 seconds, to Cassandra, and display the first few results on the console

42. Yes, it is still easy to do:

    case class WordCount(time: Long, word: String, count: Int)

    val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
    val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
    val wordCounts: DStream[(String, Long)] = words.countByValueAndWindow(Seconds(10), Seconds(2))

    val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
      val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
        (count.toInt, WordCount(time.milliseconds, word, count.toInt))
      }
      val topWordCountsRDD: RDD[WordCount] = mappedWordCounts.sortByKey(ascending = false).values
      topWordCountsRDD
    })

    topWordCounts.saveToCassandra("meetup", "word_counts")
    topWordCounts.print()

43. DStream stateful operator:
    DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
    For example, with keys A, B and C, new values 1, 3, 5 / 2, 6 / 4 and previous states 7 / 8 / 9:
    R1 = f(Seq(1, 3, 5), Some(7)), R2 = f(Seq(2, 6), Some(8)), R3 = f(Seq(4), Some(9)),
    and the resulting Batch contains (A, R1), (B, R2), (C, R3).
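Slides 29-31 show transformWith only as a diagram. The sketch below, with made-up stream names and element types (clicks, views, keyed by a String id), shows what applying a function to pairs of corresponding Batches looks like in code; it is an illustration, not code from the deck.

    import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.x)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical streams keyed by an id; wrap in an object or paste into the REPL.
    def joinPerBatch(clicks: DStream[(String, Int)],
                     views: DStream[(String, Int)]): DStream[(String, (Int, Int))] =
      // For each batch interval Spark passes the pair of corresponding RDDs
      // (Batch N of `clicks` with Batch N of `views`), as on slides 29-31.
      clicks.transformWith(views, (clickRdd: RDD[(String, Int)], viewRdd: RDD[(String, Int)]) =>
        clickRdd.join(viewRdd)
      )

Any per-Batch RDD logic can be dropped in this way, which is what slide 28 means by "allow you to create new stateless operators".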
44. Total word count example:

    case class WordCount(time: Long, word: String, count: Int)

    def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
      val sum = counts.sum
      Some(state.getOrElse(0L) + sum)
    }

    val totalWords: DStream[(String, Long)] =
      stream.map { case (_, paragraph) => paragraph }
        .flatMap(_.split("""\s+"""))
        .countByValue()
        .updateStateByKey(update)

    val topTotalWordCounts: DStream[WordCount] =
      totalWords.transform((rdd, time) =>
        rdd.map { case (word, count) =>
          (count, WordCount(time.milliseconds, word, count.toInt))
        }.sortByKey(ascending = false).values
      )

    topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
    topTotalWordCounts.print()

45. Obtaining DStreams: ZeroMQ, Kinesis, HDFS-compatible file systems, Akka actors, Twitter, MQTT, Kafka, sockets, Flume.

46. Particular DStreams are available in separate modules (groupId org.apache.spark, version 1.1.0):
    spark-streaming-kinesis-asl_2.10
    spark-streaming-mqtt_2.10
    spark-streaming-zeromq_2.10
    spark-streaming-flume_2.10
    spark-streaming-flume-sink_2.10
    spark-streaming-kafka_2.10
    spark-streaming-twitter_2.10

47. If something goes wrong...

48. Fault tolerance: the sequence of transformations is known to Spark Streaming; Batches are replicated once they are received; lost data can be recomputed.

49. But there are pitfalls:
    - Spark replicates blocks, not single messages
    - it is up to a particular receiver to decide whether to form a block from a single message or to collect more messages before pushing the block
    - the data collected in the receiver before the block is pushed will be lost in case of a failure of the receiver
    - typical tradeoff: efficiency vs fault tolerance

50. Built-in receivers breakdown (see the receiver sketch after slide 51):
    - pushing single messages: Kafka, Twitter, Socket, MQTT
    - can do both: Akka, Custom
    - pushing whole blocks: RawNetworkReceiver, ZeroMQ

51. Thank you! Questions?
    http://spark.apache.org/
    https://github.com/datastax/spark-cassandra-connector
    http://cassandra.apache.org/
    http://www.datastax.com/
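Slides 49-50 describe the choice between pushing single messages and pushing whole blocks, but the deck shows no receiver code. The following is a minimal, hypothetical custom receiver (the class name, host/port and socket source are illustrative assumptions) that pushes single messages with store(line); a receiver that buffered messages and then called store on an Iterator or ArrayBuffer would be pushing whole blocks instead, trading latency and potential loss of data buffered in the receiver for efficiency.

    import java.net.Socket
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // A line-oriented socket receiver sketch (illustrative, not from the deck).
    class LineReceiver(host: String, port: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Run the blocking read loop on its own thread so onStart returns quickly.
        new Thread("line-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = () // the reading thread ends when the socket is closed

      private def receive(): Unit = {
        var socket: Socket = null
        try {
          socket = new Socket(host, port)
          val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
          // Pushing single messages: lowest latency, and Spark (not the receiver)
          // groups them into blocks; anything still buffered inside the receiver
          // when it fails is lost, as slide 49 warns.
          lines.foreach(line => store(line))
          restart("Source closed the connection, restarting")
        } catch {
          case e: Exception => restart("Error receiving data", e)
        } finally {
          if (socket != null) socket.close()
        }
      }
    }

Such a receiver would be attached with ssc.receiverStream(new LineReceiver(host, port)), which returns a DStream[String] just like any built-in source.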