Real-time big-data processing - North America - Databricks

Spark Streaming Real-time big-data processing Tathagata Das (TD) UC BERKELEY


TRANSCRIPT

Page 1:

Spark Streaming

Real-time big-data processing

Tathagata Das (TD)

UC BERKELEY

Page 2:

What is Spark Streaming?

§ Extends Spark for big data stream processing
§ Project started in early 2012, alpha released in Spring 2013 with Spark 0.7
§ Moving out of alpha in Spark 0.9

[Diagram: the Spark stack - the Spark core with Spark Streaming, GraphX, Shark, MLlib, BlinkDB, and more built on top]

Page 3:

Why Spark Streaming?

Many big-data applications need to process large data streams in real time:
- Website monitoring
- Fraud detection
- Ad monetization

Page 4:

Why Spark Streaming?

Need a framework for big data stream processing that:
- Scales to hundreds of nodes
- Achieves second-scale latencies
- Efficiently recovers from failures
- Integrates with batch and interactive processing

Page 5:

Integration with Batch Processing

§ Many environments require processing the same data both in live streaming and in batch post-processing
§ Existing frameworks cannot do both
   - Either stream processing of 100s of MB/s with low latency
   - Or batch processing of TBs of data with high latency
§ Extremely painful to maintain two different stacks
   - Different programming models
   - Double the implementation effort

Page 6:

Stateful Stream Processing

§ Traditional model
   - Processing pipeline of nodes
   - Each node maintains mutable state
   - Each input record updates the state and new records are sent out
§ Mutable state is lost if a node fails
§ Making stateful stream processing fault tolerant is challenging!

[Diagram: input records flow through a pipeline of nodes 1-3, each holding mutable state]

Page 7:

Existing Streaming Systems

§ Storm
   - Replays a record if it is not processed by a node
   - Processes each record at least once
   - May update mutable state twice!
   - Mutable state can be lost due to failure!
§ Trident - uses transactions to update state
   - Processes each record exactly once
   - Per-state transaction to an external database is slow

Page 8:

Spark Streaming


Page 9:

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs:

§ Chop up the live stream into batches of X seconds
§ Spark treats each batch of data as an RDD and processes it using RDD operations
§ Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
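The micro-batch model above can be sketched as a minimal driver program. This is an illustrative sketch, not from the deck: the app name, socket source, and word-count logic are all assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch of the micro-batch model: the live stream is chopped
// into 1-second batches, and each batch is processed with ordinary
// RDD operations. (Hypothetical source: text lines from a TCP socket.)
object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchSketch")
    val ssc  = new StreamingContext(conf, Seconds(1)) // batch interval = X seconds

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))           // per-batch RDD operation
    words.count().print()                             // processed results, per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```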

Page 10:

Spark Streaming

Run a streaming computation as a series of very small, deterministic batch jobs:

§ Batch sizes as low as ½ second, latency of about 1 second
§ Potential for combining batch processing and streaming processing in the same system

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Page 11:

Example - Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, t+1, t+2) is stored in memory as an RDD (immutable, distributed)]

Page 12:

Example - Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap applied to each batch of the tweets DStream produces the hashTags DStream ([#cat, #dog, …]); new RDDs are created for every batch]

Page 13:

Example - Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: flatMap and then save applied to every batch; every batch saved to HDFS]

Page 14:

Example - Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data

[Diagram: flatMap and then foreach applied to every batch. Write to a database, update an analytics UI, do whatever you want]

Page 15:

Demo

Page 16:

Java Example

Scala:

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java:

JavaDStream<Status> tweets = ssc.twitterStream();
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { });  // Function object
hashTags.saveAsHadoopFiles("hdfs://...");

Page 17:

Window-based Transformations

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation over a DStream of data, with a window length of 1 minute and a sliding interval of 5 seconds

Page 18:

Arbitrary Stateful Computations

Specify a function to generate new state based on the previous state and new data
- Example: maintain per-user mood as state, and update it with their tweets

def updateMood(newTweets, lastMood) => newMood

moods = tweetsByUser.updateStateByKey(updateMood _)
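The slide's update function is pseudocode. As a minimal sketch of the same pattern, here is a per-user tweet counter; the `tweetsByUser` stream and the `updateCount` helper are assumptions, not from the deck.

```scala
// Sketch: maintain a running tweet count per user as state, assuming
// tweetsByUser is a DStream[(String, String)] of (userId, tweetText) pairs.
// Note: updateStateByKey also requires ssc.checkpoint(...) to be set.

// newTweets: this user's tweets in the current batch;
// runningCount: previously saved state (None the first time a key appears).
def updateCount(newTweets: Seq[String], runningCount: Option[Long]): Option[Long] =
  Some(runningCount.getOrElse(0L) + newTweets.size) // fold the batch into the state

val countsByUser = tweetsByUser.updateStateByKey(updateCount _)
countsByUser.print()
```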

Page 19:

Arbitrary Combinations of Batch and Streaming Computations

Inter-mix RDD and DStream operations!
- Example: join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})

Page 20:

DStreams + RDDs = Power

§ Online machine learning
   - Continuously learn and update data models (updateStateByKey and transform)
§ Combine live data streams with historical data
   - Generate historical data models with Spark, etc.
   - Use data models to process the live data stream (transform)
§ CEP-style processing
   - Window-based operations (reduceByWindow, etc.)

Page 21:

Input Sources

§ Out of the box, we provide
   - Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc.
§ Very easy to write a receiver for your own data source
§ You can also generate your own RDDs from Spark, etc. and push them in as a "stream"
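One way to push your own RDDs in as a "stream", as the last bullet suggests, is `StreamingContext.queueStream`. A minimal sketch, assuming a `conf` (SparkConf) is already defined; the queue contents are illustrative.

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: feed hand-made RDDs into Spark Streaming as if they were a live source.
val ssc = new StreamingContext(conf, Seconds(1)) // `conf` assumed defined elsewhere
val rddQueue = new mutable.Queue[RDD[Int]]()

val stream = ssc.queueStream(rddQueue)           // each dequeued RDD becomes one batch
stream.map(_ * 2).print()

ssc.start()
rddQueue += ssc.sparkContext.makeRDD(1 to 100)   // push an RDD into the "stream"
```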

Page 22:

Fault-tolerance

§ Batches of input data are replicated in memory for fault tolerance
§ Data lost due to a worker failure can be recomputed from the replicated input data
§ All transformations are fault-tolerant and provide exactly-once semantics

[Diagram: input data replicated in memory; lost partitions of the tweets and hashTags RDDs are recomputed on other workers via flatMap]

Page 23:

Performance

Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs. number of nodes in the cluster, for WordCount and Grep, at 1-second and 2-second batch sizes]

Page 24:

Comparison with Other Systems

Higher throughput than Storm
- Spark Streaming: 670k records/sec/node
- Storm: 115k records/sec/node
- Commercial systems: 100-500k records/sec/node

[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes), Spark vs. Storm, for WordCount and Grep]

Page 25:

Fast Fault Recovery

Recovers from faults/stragglers within 1 sec

Page 26:

Mobile Millennium Project

Traffic transit time estimation using online machine learning on GPS observations

§ Markov chain Monte Carlo simulations on GPS observations
§ Very CPU intensive; requires dozens of machines for useful computation
§ Scales linearly with cluster size

[Chart: GPS observations processed per second vs. number of nodes in the cluster]

Page 27:

Advantage of a Unified Stack

§ Explore data interactively to identify problems
§ Use the same code in Spark for processing large logs
§ Use similar code in Spark Streaming for real-time processing

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Page 28:

Roadmap

§ Spark 0.8.1
   - Marked alpha, but has been quite stable
   - Master fault tolerance - manual recovery
      - Restart the computation from a checkpoint file saved to HDFS
§ Spark 0.9 in Jan 2014 - out of alpha!
   - Automated master fault recovery
   - Performance optimizations
   - Web UI and better monitoring capabilities

Page 29:

Roadmap

§ Long-term goals
   - Python API
   - MLlib for Spark Streaming
   - Shark Streaming
§ Community feedback is crucial!
   - Helps us prioritize the goals
§ Contributions are more than welcome!

Page 30:

Today's Tutorial

§ Process the Twitter data stream to find the most popular hashtags over a window
§ Requires a Twitter account
   - Need to set up Twitter OAuth keys to access tweets
   - All the instructions are in the tutorial
§ Your account will be safe!
   - No need to enter your password anywhere, only the keys
   - Destroy the keys after the tutorial is done
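The tutorial's core computation can be sketched by combining the operations shown earlier in this deck. The window sizes, the sort step, and the variable names are assumptions, not the tutorial's actual code.

```scala
// Sketch: most popular hashtags over a sliding window, using the
// 0.8-era API shown on the earlier slides (ssc and getTags assumed defined).
val tweets   = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

// Count tags over the last 5 minutes, sliding every 10 seconds, then
// sort each windowed batch by count (descending) to surface the top tags.
val tagCounts = hashTags.window(Minutes(5), Seconds(10)).countByValue()
val sorted = tagCounts
  .map { case (tag, count) => (count, tag) }
  .transform(_.sortByKey(ascending = false))

sorted.print()
```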

Page 31:

Conclusion

§ Streaming programming guide - spark.incubator.apache.org/docs/latest/streaming-programming-guide.html
§ Research paper - tinyurl.com/dstreams

Thank you!