Strata NYC 2015: What's new in Spark Streaming


Page 1: Strata NYC 2015: What's new in Spark Streaming

What’s new in Spark Streaming Tathagata “TD” Das Strata NY 2015

@tathadas

Page 2: Strata NYC 2015: What's new in Spark Streaming

Who am I?

•  Project Management Committee (PMC) member of Spark
•  Started Spark Streaming in AMPLab, UC Berkeley
•  Current technical lead of Spark Streaming
•  Software engineer at Databricks


Page 3: Strata NYC 2015: What's new in Spark Streaming

What is Databricks?

Founded by creators of Spark and remains largest contributor

Offers a hosted service:
•  Spark on EC2
•  Notebooks
•  Plot visualizations
•  Cluster management
•  Scheduled jobs

Page 4: Strata NYC 2015: What's new in Spark Streaming

Spark Streaming

Scalable, fault-tolerant stream processing system

[Diagram: input sources (Kafka, Flume, Kinesis, HDFS/S3, Twitter) feed Spark Streaming, which pushes results out to file systems, databases, and dashboards]

High-level API: joins, windows, … often 5x less code

Fault-tolerant: exactly-once semantics, even for stateful ops

Integration: integrates with MLlib, SQL, DataFrames, GraphX

Page 5: Strata NYC 2015: What's new in Spark Streaming

What can you use it for?

•  Real-time fraud detection in transactions
•  React to anomalies in sensors in real-time
•  Detect cat videos in tweets as soon as they go viral

Page 6: Strata NYC 2015: What's new in Spark Streaming

Spark Streaming

Receivers receive data streams and chop them up into batches

Spark processes the batches and pushes out the results

[Diagram: data streams → receivers → batches → results]

Page 7: Strata NYC 2015: What's new in Spark Streaming

Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))   // entry point of streaming functionality

val lines = KafkaUtils.createStream(context, ...)      // create DStream from Kafka data

Page 8: Strata NYC 2015: What's new in Spark Streaming

Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))

val lines = KafkaUtils.createStream(context, ...)

val words = lines.flatMap(_.split(" "))   // split lines into words

Page 9: Strata NYC 2015: What's new in Spark Streaming

Word Count with Kafka

val context = new StreamingContext(conf, Seconds(1))

val lines = KafkaUtils.createStream(context, ...)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1))
                      .reduceByKey(_ + _)   // count the words

wordCounts.print()            // print some counts on screen

context.start()               // start receiving and transforming the data
context.awaitTermination()    // keep the application running (needed in a real app)

Page 10: Strata NYC 2015: What's new in Spark Streaming

Integrates with Spark Ecosystem

[Diagram: Spark Streaming, Spark SQL + DataFrames, MLlib, and GraphX are all built on Spark Core]

Page 11: Strata NYC 2015: What's new in Spark Streaming

Combine batch and streaming processing

Join data streams with static data sets

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset)
          .filter( ... )
}

Page 12: Strata NYC 2015: What's new in Spark Streaming

Combine machine learning with streaming

Learn models offline, apply them online

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}

Page 13: Strata NYC 2015: What's new in Spark Streaming

Combine SQL with streaming

Interactively query streaming data with SQL and DataFrames

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.toDF.registerTempTable("events")
}

// Interactively query table
sqlContext.sql("select * from events")

Page 14: Strata NYC 2015: What's new in Spark Streaming

Spark Streaming Adoption


Page 15: Strata NYC 2015: What's new in Spark Streaming

Spark Survey by Databricks

Survey of 1417 individuals from 842 organizations

56% increase in Spark Streaming users since 2014

Fastest rising component in Spark

https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html

Page 16: Strata NYC 2015: What's new in Spark Streaming

Feedback from community

We have learnt a lot from our rapidly growing user base

Most of the development in the last few releases has been driven by community demands

Page 17: Strata NYC 2015: What's new in Spark Streaming

What have we added recently?

Page 18: Strata NYC 2015: What's new in Spark Streaming

Ease of use

Infrastructure

Libraries

Page 19: Strata NYC 2015: What's new in Spark Streaming

Streaming MLlib algorithms

val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Train on one DStream
model.trainOn(trainingDStream)

// Predict on another DStream
model.predictOnValues(
  testDStream.map { lp =>
    (lp.label, lp.features)
  }
).print()

Continuous learning and prediction on streaming data

StreamingLinearRegression [Spark 1.1]

StreamingKMeans [Spark 1.2]

StreamingLogisticRegression [Spark 1.3]

https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
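The regression flavors follow the same train-on-one-stream, predict-on-another pattern. A minimal sketch of StreamingLinearRegression, where the feature dimension (3) and the DStream names are illustrative assumptions:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Sketch only: feature dimension and DStream names are assumptions
val regression = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))     // weights are updated continuously as data arrives

regression.trainOn(trainingDStream)        // train on a DStream[LabeledPoint]
regression.predictOnValues(
  testDStream.map(lp => (lp.label, lp.features))
).print()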

Page 20: Strata NYC 2015: What's new in Spark Streaming

Python API Improvements

Added Python API for Streaming ML algos [Spark 1.5]

Added Python API for various data sources:
•  Kafka [Spark 1.3 - 1.5]
•  Flume, Kinesis, MQTT [Spark 1.5]

lines = KinesisUtils.createStream(streamingContext,
    appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2)

counts = lines.flatMap(lambda line: line.split(" "))

Page 21: Strata NYC 2015: What's new in Spark Streaming

Ease of use

Infrastructure

Libraries

Page 22: Strata NYC 2015: What's new in Spark Streaming

New Visualizations [Spark 1.4-1.5]

Stats over last 1000 batches

For stability:
•  Scheduling delay should be approximately 0
•  Processing time should be less than the batch interval
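These two signals can also be watched programmatically. A minimal sketch using the StreamingListener hook (the listener API behind these UI stats), assuming the `context` from the earlier word count example:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch: print the two stability signals after every completed batch
context.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"scheduling delay: ${info.schedulingDelay.getOrElse(-1L)} ms, " +
            s"processing time: ${info.processingDelay.getOrElse(-1L)} ms")
  }
})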

Page 23: Strata NYC 2015: What's new in Spark Streaming

New Visualizations [Spark 1.4-1.5]

Details of individual batches

Kafka offsets processed in each batch can help in debugging bad data

List of Spark jobs in each batch

Page 24: Strata NYC 2015: What's new in Spark Streaming

New Visualizations [Spark 1.4-1.5]

Full DAG of RDDs and stages generated by Spark Streaming

Page 25: Strata NYC 2015: What's new in Spark Streaming

New Visualizations [Spark 1.4-1.5]

Memory usage of received data

Can be used to understand memory consumption across executors

Page 26: Strata NYC 2015: What's new in Spark Streaming

Ease of use

Infrastructure

Libraries

Page 27: Strata NYC 2015: What's new in Spark Streaming

Zero data loss

System stability

Page 28: Strata NYC 2015: What's new in Spark Streaming

Zero data loss: Two cases

Replayable Sources: sources that allow data to be replayed from any position (e.g. Kafka, Kinesis). Spark Streaming saves only the record identifiers and replays the data directly from the source.

Non-replayable Sources: sources that do not support replay from an arbitrary position (e.g. Flume). Spark Streaming saves received data to a Write Ahead Log (WAL) and replays data from the WAL on failure.

Page 29: Strata NYC 2015: What's new in Spark Streaming

Write Ahead Log (WAL) [Spark 1.3]

Save received data in a WAL in a fault-tolerant file system

[Diagram: the driver runs user code and launches receivers and processing tasks on executors in the cluster; each receiver buffers incoming stream data in memory and also writes it to a WAL in HDFS]

Page 30: Strata NYC 2015: What's new in Spark Streaming

Write Ahead Log (WAL) [Spark 1.3]

Replay unprocessed data from the WAL if the driver fails and restarts

[Diagram: after a driver failure, the restarted driver reruns the failed tasks on restarted executors, and those tasks read the unprocessed data back from the WAL in HDFS]

Page 31: Strata NYC 2015: What's new in Spark Streaming

Write Ahead Log (WAL) [Spark 1.3]

WAL can be enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true

Use a reliable receiver, which acknowledges the source only after data has been written to the WAL

Reliable receiver + WAL gives an at-least-once guarantee
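A minimal sketch of wiring this up; the checkpoint path is a placeholder, and a fault-tolerant checkpoint directory is assumed since the WAL is written under it:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: enable the receiver WAL; the checkpoint path below is a placeholder
val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val context = new StreamingContext(conf, Seconds(1))
context.checkpoint("hdfs:///checkpoints/my-app")   // WAL and recovery metadata live here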

Page 32: Strata NYC 2015: What's new in Spark Streaming

Kinesis [Spark 1.5]

Save the Kinesis sequence numbers instead of raw data

[Diagram: the executor receives records using the KCL (Kinesis Client Library), sends the sequence number ranges to the driver, and the driver saves the ranges to HDFS]

Page 33: Strata NYC 2015: What's new in Spark Streaming

Kinesis [Spark 1.5]

Recover unprocessed data directly from Kinesis using the recovered sequence numbers

[Diagram: the restarted driver recovers the sequence number ranges from HDFS, and tasks rerun on the restarted executor refetch the corresponding records from Kinesis using the AWS SDK]

Page 34: Strata NYC 2015: What's new in Spark Streaming

Kinesis [Spark 1.5]

After any failure, records are either recovered from saved sequence numbers or replayed via the KCL

No need to replicate received data in Spark Streaming

Provides an end-to-end at-least-once guarantee
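For reference, a minimal sketch of the Scala Kinesis source this recovery applies to, assuming Spark 1.5's spark-streaming-kinesis-asl artifact and the `context` from earlier; the app name, stream name, endpoint, and region are placeholders:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Sketch only: all names, the endpoint, and the region are placeholders
val kinesisStream = KinesisUtils.createStream(
  context, "myApp", "myStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST,
  Seconds(2),                         // how often processed sequence numbers are checkpointed
  StorageLevel.MEMORY_AND_DISK_2)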

Page 35: Strata NYC 2015: What's new in Spark Streaming

Kafka [1.3, graduated in 1.5]

Decide a priori the offset ranges to consume in the next batch

[Diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition, decides the offset ranges for the next batch, and saves them to HDFS]

Page 36: Strata NYC 2015: What's new in Spark Streaming

Kafka [1.3, graduated in 1.5]

Decide a priori the offset ranges to consume in the next batch

[Diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition; tasks then run on each executor in parallel, each reading its assigned offset range directly from the Kafka brokers]

Page 37: Strata NYC 2015: What's new in Spark Streaming

Direct Kafka API [Spark 1.5]

Does not use receivers, so there is no need for Spark Streaming to replicate data

Can provide up to 10x higher throughput than the earlier receiver-based approach
https://spark-summit.org/2015/events/towards-benchmarking-modern-distributed-streaming-systems/

Can provide exactly-once semantics
•  Output operation to external storage should be idempotent or transactional

Can run Spark batch jobs directly on Kafka
•  # RDD partitions = # Kafka partitions, easy to reason about

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
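A minimal sketch of the direct API, assuming Spark 1.5 with the spark-streaming-kafka artifact and the `context` from earlier; the broker list and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Sketch only: brokers and topic are placeholders
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  context, kafkaParams, Set("events"))

// One RDD partition per Kafka partition; the exact offsets of each batch are visible per RDD
directStream.foreachRDD { rdd =>
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { r =>
    println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}")
  }
}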

Page 38: Strata NYC 2015: What's new in Spark Streaming

System stability

Streaming applications may have to deal with variations in data rates and processing rates

For stability, a streaming application must receive data only as fast as it can process it

Since Spark 1.1, Spark Streaming has allowed setting static limits on receiver ingestion rates to guard against spikes
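A sketch of that static limit, assuming the receiver rate configuration spark.streaming.receiver.maxRate; the cap of 10,000 records/sec is an arbitrary example:

// Cap each receiver at 10,000 records/sec to guard against spikes
val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "10000")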

Page 39: Strata NYC 2015: What's new in Spark Streaming

Backpressure [Spark 1.5]

System automatically and dynamically adapts rate limits to ensure stability under any processing conditions

If sinks slow down, the system automatically pushes back on the sources to slow down receiving

[Diagram: sources → receivers → processing → sinks, with backpressure flowing from the sinks back toward the sources]

Page 40: Strata NYC 2015: What's new in Spark Streaming

Backpressure [Spark 1.5]

System uses batch processing times and scheduling delays to set rate limits

Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits

Contributed by Typesafe
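Roughly, a sketch of the idea behind a PID-based rate estimator; Spark's actual implementation (PIDRateEstimator) differs in its details, and the gains below are illustrative assumptions:

// Sketch of a PID-style rate update; not Spark's exact code
def newRateLimit(latestRate: Double,       // records/sec allowed in the last batch
                 processingDelay: Double,  // seconds spent processing that batch
                 schedulingDelay: Double,  // seconds the batch waited before processing
                 processed: Long,          // records actually processed
                 batchInterval: Double): Double = {
  val kp = 1.0                             // proportional gain (illustrative)
  val ki = 0.2                             // integral gain (illustrative)
  val processingRate = processed / processingDelay                      // sustainable rate
  val error = latestRate - processingRate                               // proportional term
  val backlogError = schedulingDelay * processingRate / batchInterval   // accumulated backlog term
  math.max(latestRate - kp * error - ki * backlogError, 0.0)
}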

Page 41: Strata NYC 2015: What's new in Spark Streaming

Backpressure [Spark 1.5]

System uses batch processing times and scheduling delays to set rate limits

[Chart: the dynamic rate limit prevents receivers from receiving too fast, which keeps the scheduling delay in check]

Page 42: Strata NYC 2015: What's new in Spark Streaming

Backpressure [Spark 1.5]

Experimental, so disabled by default in Spark 1.5

Enabled by setting the Spark configuration spark.streaming.backpressure.enabled to true

Will be enabled by default in future releases
https://issues.apache.org/jira/browse/SPARK-7398
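Opting in is a single setting, using the configuration named above:

// Enable backpressure (experimental and off by default in Spark 1.5)
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")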

Page 43: Strata NYC 2015: What's new in Spark Streaming

What’s next?

Page 44: Strata NYC 2015: What's new in Spark Streaming

API and Libraries

Support for operations on event time and out-of-order data
•  Most demanded feature from the community

Tighter integration between Streaming and SQL + DataFrames
•  Helps leverage Project Tungsten

Page 45: Strata NYC 2015: What's new in Spark Streaming

Infrastructure

Add native support for Dynamic Allocation for Streaming
•  Dynamically scale the cluster resources based on processing load
•  Will work in collaboration with backpressure to scale up/down while maintaining stability

Note: As of Spark 1.5, the existing Dynamic Allocation is not optimized for streaming, but users can build their own scaling logic using the developer API (see the sketch below):

sparkContext.requestExecutors(), sparkContext.killExecutors()
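A hypothetical scaling hook built on those two developer-API calls; the load signal and the thresholds are illustrative assumptions, not a Spark facility:

import org.apache.spark.SparkContext

// Sketch: naive scale-up/scale-down driven by average batch processing time
def rescale(sc: SparkContext, avgProcessingTime: Double, batchInterval: Double): Unit = {
  if (avgProcessingTime > 0.9 * batchInterval) {
    sc.requestExecutors(1)                   // falling behind: ask for one more executor
  } else if (avgProcessingTime < 0.5 * batchInterval) {
    sc.killExecutors(Seq("<executor-id>"))   // over-provisioned: release one (id is a placeholder)
  }
}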

Page 46: Strata NYC 2015: What's new in Spark Streaming

Infrastructure

Higher throughput and lower latency by leveraging Project Tungsten

Specifically, improved performance of stateful ops

Page 47: Strata NYC 2015: What's new in Spark Streaming
Page 48: Strata NYC 2015: What's new in Spark Streaming

Fastest growing component in the Spark ecosystem

Significant improvements in fault-tolerance, stability, visualizations and Python API

More community requested features to come

@tathadas