Spark Streaming + Amazon Kinesis

Upload: yuta-imai

Post on 07-Aug-2015

TRANSCRIPT

  1. Spark Streaming + Amazon Kinesis @imai_factory
  2. Spark Streaming (keywords: Spark, Kafka, Kinesis, RDD, DStream, FRP)
  3. Conclusion (keywords: Kinesis consumer, Spark Streaming, SQL, Kinesis)
  4. [Figure: a DStream shown as a sequence of RDDs along a time axis: RDD @t1, RDD @t2, RDD @t3, RDD @t4, RDD @t5]
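  5. Programming with DStream

         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Seconds, StreamingContext}

         // Master and app name are expected to come from spark-submit.
         val conf = new SparkConf()
         val ssc = new StreamingContext(conf, Seconds(1))

         // Read lines from a TCP socket and split them into words.
         val lines = ssc.socketTextStream("localhost", 9999)
         val words = lines.flatMap(_.split(" "))

         // Count each word within every one-second batch.
         val pairs = words.map(word => (word, 1))
         val count = pairs.reduceByKey(_ + _)
         count.print()

         ssc.start()
         ssc.awaitTermination()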
  6-8. Programming with DStream (same code as slide 5)
  9. DStream data sources: Flume, Kafka, Kinesis, Twitter, file, socket
  10. Amazon Kinesis / Kafka
  11. Amazon Kinesis [Figure labels: Amazon Kinesis, data stream, store, shuffle & sort, consumer apps (x3), process]
  12. Spark Streaming + Amazon Kinesis [Figure labels: Amazon Kinesis, data stream, store, shuffle & sort, process]
  13. Spark Streaming + Amazon Kinesis (keywords: Kinesis, Spark, Kinesis + SparkSQL, Kinesis consumer)
  14. Building Amazon Kinesis Consumer app [Figure labels: Amazon Kinesis, data stream, store, shuffle & sort, process] (keywords: API, SDK, KCL, AWS Lambda, Spark, Kinesis, Storm, kinesis-spout, KCL, Storm, Spark)
  15. Run SparkSQL on Kinesis Stream [Figure labels: Amazon Kinesis, data stream, store, shuffle & sort, process, SQL]
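  16. Run SparkSQL on Kinesis Stream

          import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
          import org.apache.spark.storage.StorageLevel
          import org.apache.spark.streaming.kinesis.KinesisUtils

          // ssc, numStreams, streamName, endpointUrl and kinesisCheckpointInterval
          // are defined elsewhere (not shown on the slide).
          val kinesisStreams = (0 until numStreams).map { i =>
            KinesisUtils.createStream(
              ssc, streamName, endpointUrl, kinesisCheckpointInterval,
              InitialPositionInStream.LATEST, StorageLevel.MEMORY_ONLY
            )
          }
          val unionStreams = ssc.union(kinesisStreams)
          val words = unionStreams.flatMap(...)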
  17. Run SparkSQL on Kinesis Stream (same code as slide 16, annotated in order: DStream, DStream UNION, DStream transformation)
  18. Run SparkSQL on Kinesis Stream (keywords: DStream, JSON)

          // SQLContextSingleton is a helper that is not shown on the slide.
          words.foreachRDD(foreachFunc = (rdd: RDD[String], time: Time) => {
            val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)

            // Parse each record as JSON and expose the batch as a temporary table.
            sqlContext.read.json(rdd).registerTempTable("words")

            val wordCountsDataFrame = sqlContext.sql(
              """select level, count(*) as total
                |from words
                |group by level""".stripMargin)

            println(s"========= $time =========")
            wordCountsDataFrame.show()
          })
  19. Conclusion (keywords: Kinesis consumer, Spark Streaming)
  20. Under the hood

          KinesisUtils.createStream(
            ssc, streamName, endpointUrl, kinesisCheckpointInterval,
            InitialPositionInStream.LATEST, StorageLevel.MEMORY_ONLY
          )

      [Figure labels: PluggableInputDStream, KinesisReceiver, KinesisClientLibrary Worker thread, DynamoDB Table, Kinesis Stream, GetRecords, Checkpoint]
  21. One more thing: Amazon EMR now supports Apache Spark! (keywords: EMR, Spark, 2015/06/23, Spark 1.3.1)
  22. One more thing: Amazon EMR now supports Apache Spark! (Amazon Kinesis + Amazon EMR)
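
The DStream model on slides 4 and 5 (a stream as a sequence of RDDs, with the same flatMap, map and reduceByKey steps applied to every batch) can be tried without a Spark cluster. The following is a minimal sketch under that assumption: plain Scala collections stand in for RDDs and DStreams, and the object name, method name and sample batches are hypothetical, not taken from the slides or from the Spark API.

```scala
// Plain Scala collections stand in for Spark types here (an assumption for
// illustration): a Seq[String] plays the role of one RDD, and a Seq of
// batches plays the role of a DStream.
object MicroBatchWordCount {

  // Word count for a single micro-batch, mirroring the slide's
  // flatMap -> map -> reduceByKey pipeline.
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))   // lines -> words
      .filter(_.nonEmpty)      // drop empty tokens caused by repeated spaces
      .map(word => (word, 1))  // words -> (word, 1) pairs
      .groupBy(_._1)           // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum per word

  def main(args: Array[String]): Unit = {
    // Three hypothetical one-second batches (t1, t2, t3).
    val batches = Seq(
      Seq("spark streaming", "spark kinesis"),
      Seq("kinesis kinesis"),
      Seq.empty[String]
    )
    // Each batch is processed independently, as each RDD in a DStream is.
    batches.zipWithIndex.foreach { case (batch, i) =>
      println(s"batch t${i + 1}: ${wordCount(batch)}")
    }
  }
}
```

Running `main` prints one word-count map per batch. Counts never carry over from one batch to the next, which matches the per-interval behaviour of the `reduceByKey` pipeline on slide 5.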