Spark Streaming with Cassandra

Posted on 21-May-2015

Category: Data & Analytics

DESCRIPTION

This presentation introduces the Apache Spark Streaming system and shows how it can be used with the Apache Cassandra database.

TRANSCRIPT

1. Spark Streaming with C* (jacek.lewandowski@datastax.com)

2. Applies where you need near-realtime data analysis.

3. Spark vs Spark Streaming: Spark works on zillions of bytes in a static dataset; Spark Streaming works on gigabytes per second arriving as a stream of data.

4-5. What can you do with it? Sources include applications, sensors, web and mobile phones; use cases include intrusion detection, malfunction detection, site analytics, network metrics analysis, fraud detection, dynamic process optimisation, recommendations, location based ads, log processing, supply chain planning, sentiment analysis, spying.

6. Almost whatever source you want; almost whatever destination you want.

7. So, let's see how it works.

8. DStream: a continuous sequence of micro batches. A DStream is a series of Batches (ordinary RDDs); processing a DStream means processing its Batches (RDDs).

9-11. Receiver: the interface between different stream sources and Spark. Incoming messages cross the Spark memory boundary into the Block Manager, which handles replication and building Batches.

12-14. The Block Manager holds blocks of input data; a Batch is made of those blocks.

15-17. A Batch made of blocks: each block becomes a partition of the Batch.

18-19. Ingestion from multiple sources: several receivers run receiving and Batch building in parallel, and a new Batch is emitted every batch interval (0s, 1s, 2s, ...).

20. A well-worn example:
    - ingestion of text messages
    - splitting them into separate words
    - count the occurrence of words within 5-second windows
    - save the word counts from the last 5 seconds, every 5 seconds, to Cassandra, and display the first few results on the console

21. How to do that? Well...

22. Yes, it is that easy (see the setup sketch after slide 27 for the context this snippet assumes):

    case class WordCount(time: Long, word: String, count: Int)

    val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
    val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
    val wordCounts: DStream[(String, Long)] = words.countByValue()

    val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
      val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
        (count.toInt, WordCount(time.milliseconds, word, count.toInt))
      }
      val topWordCountsRDD: RDD[WordCount] = mappedWordCounts.sortByKey(ascending = false).values
      topWordCountsRDD
    })

    topWordCounts.saveToCassandra("meetup", "word_counts")
    topWordCounts.print()

23. DStream stateless operators (quick recap): map, flatMap, filter, repartition, union, count, countByValue, reduce, reduceByKey, joins, cogroup, transform, transformWith.

24-25. DStream[Bean].count(): the count is computed independently for each 1-second Batch (e.g. 4, then 3).

26. DStream[Orange].union(DStream[Apple]): the union of corresponding Batches.

27. Other stateless operations: join(DStream[(K, W)]), leftOuterJoin(DStream[(K, W)]), rightOuterJoin(DStream[(K, W)]) and cogroup(DStream[(K, W)]) are applied on pairs of corresponding Batches.
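The snippet on slide 22 assumes that a StreamingContext and an input DStream named stream already exist and that the DataStax Cassandra connector is on the classpath. Below is a minimal, hypothetical sketch of that surrounding setup; the socket source, host, checkpoint directory and Cassandra connection settings are illustrative assumptions, not part of the deck (the deck's stream appears to be a (key, message) pair DStream, for example from Kafka, whereas a socket source yields plain lines).

    // Minimal sketch of the setup assumed by the slide code (assumptions noted above).
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector.streaming._ // enables saveToCassandra on DStreams

    object StreamingWordCountSetup {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-word-count")
          .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra

        // 5-second batch interval, matching the "well-worn example" on slide 20
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/streaming-checkpoint") // required later for window/stateful operators

        // Any receiver works here; a raw socket is the simplest thing to try locally.
        // Note: this yields plain lines, while the deck's `stream` is a (key, message) pair DStream.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Build the pipeline from slide 22 on top of `lines`
        lines.flatMap(_.split("""\s+""")).countByValue().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }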
28. transform, transformWith:
    DStream[T].transform(RDD[T] => RDD[U]): DStream[U]
    DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V]
    These allow you to create new stateless operators.

29-31. DStream[Blue].transformWith(DStream[Red], ...): DStream[Violet]: the function is applied to pairs of corresponding Batches (1-A with 1-B, 2-A with 2-B, 3-A with 3-B); see the code sketch after slide 43.

32-34. Windowing: by default, window = slide = Batch duration.

35-37. Windowing with a window longer than the slide: the resulting DStream consists of 3-second Batches, and each resulting Batch overlaps the preceding one by 1 second.

38-39. Windowing with window = 3s and slide = 1s: a Batch appears in the output stream every 1 second and contains the messages collected during the last 3 seconds.

40. DStream window operators:
    window(Duration, Duration)
    countByWindow(Duration, Duration)
    reduceByWindow(Duration, Duration, (T, T) => T)
    countByValueAndWindow(Duration, Duration)
    groupByKeyAndWindow(Duration, Duration)
    reduceByKeyAndWindow((V, V) => V, Duration, Duration)

41. Let's modify the example:
    - ingestion of text messages
    - splitting them into separate words
    - count the occurrence of words within 10-second windows
    - save the word counts from the last 10 seconds, every 2 seconds, to Cassandra, and display the first few results on the console

42. Yes, it is still easy to do:

    case class WordCount(time: Long, word: String, count: Int)

    val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
    val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
    val wordCounts: DStream[(String, Long)] = words.countByValueAndWindow(Seconds(10), Seconds(2))

    val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
      val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
        (count.toInt, WordCount(time.milliseconds, word, count.toInt))
      }
      val topWordCountsRDD: RDD[WordCount] = mappedWordCounts.sortByKey(ascending = false).values
      topWordCountsRDD
    })

    topWordCounts.saveToCassandra("meetup", "word_counts")
    topWordCounts.print()

43. DStream stateful operator:
    DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
    For example, with keys A, B and C, new values 1, 3, 5 / 2, 6 / 4 and previous states 7 / 8 / 9:
    R1 = f(Seq(1, 3, 5), Some(7)), R2 = f(Seq(2, 6), Some(8)), R3 = f(Seq(4), Some(9)),
    and the resulting Batch contains (A, R1), (B, R2), (C, R3).
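Slides 29-31 show transformWith only as a diagram. The sketch below, with made-up stream names and element types (clicks, views, keyed by a String id), shows what applying a function to pairs of corresponding Batches looks like in code; it is an illustration, not code from the deck.

    import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.x)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical streams keyed by an id; wrap in an object or paste into the REPL.
    def joinPerBatch(clicks: DStream[(String, Int)],
                     views: DStream[(String, Int)]): DStream[(String, (Int, Int))] =
      // For each batch interval Spark passes the pair of corresponding RDDs
      // (Batch N of `clicks` with Batch N of `views`), as on slides 29-31.
      clicks.transformWith(views, (clickRdd: RDD[(String, Int)], viewRdd: RDD[(String, Int)]) =>
        clickRdd.join(viewRdd)
      )

Any per-Batch RDD logic can be dropped in this way, which is what slide 28 means by "allow you to create new stateless operators".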
44. Total word count example:

    case class WordCount(time: Long, word: String, count: Int)

    def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
      val sum = counts.sum
      Some(state.getOrElse(0L) + sum)
    }

    val totalWords: DStream[(String, Long)] =
      stream.map { case (_, paragraph) => paragraph }
        .flatMap(_.split("""\s+"""))
        .countByValue()
        .updateStateByKey(update)

    val topTotalWordCounts: DStream[WordCount] =
      totalWords.transform((rdd, time) =>
        rdd.map { case (word, count) =>
          (count, WordCount(time.milliseconds, word, count.toInt))
        }.sortByKey(ascending = false).values
      )

    topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
    topTotalWordCounts.print()

45. Obtaining DStreams: ZeroMQ, Kinesis, HDFS-compatible file systems, Akka actors, Twitter, MQTT, Kafka, sockets, Flume.

46. Particular DStreams are available in separate modules (groupId org.apache.spark, version 1.1.0):
    spark-streaming-kinesis-asl_2.10
    spark-streaming-mqtt_2.10
    spark-streaming-zeromq_2.10
    spark-streaming-flume_2.10
    spark-streaming-flume-sink_2.10
    spark-streaming-kafka_2.10
    spark-streaming-twitter_2.10

47. If something goes wrong...

48. Fault tolerance: the sequence of transformations is known to Spark Streaming; Batches are replicated once they are received; lost data can be recomputed.

49. But there are pitfalls:
    - Spark replicates blocks, not single messages
    - it is up to a particular receiver to decide whether to form a block from a single message or to collect more messages before pushing the block
    - the data collected in the receiver before the block is pushed will be lost in case of a failure of the receiver
    - typical tradeoff: efficiency vs fault tolerance

50. Built-in receivers breakdown (see the receiver sketch after slide 51):
    - pushing single messages: Kafka, Twitter, Socket, MQTT
    - can do both: Akka, Custom
    - pushing whole blocks: RawNetworkReceiver, ZeroMQ

51. Thank you! Questions?
    http://spark.apache.org/
    https://github.com/datastax/spark-cassandra-connector
    http://cassandra.apache.org/
    http://www.datastax.com/
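Slides 49-50 describe the choice between pushing single messages and pushing whole blocks, but the deck shows no receiver code. The following is a minimal, hypothetical custom receiver (the class name, host/port and socket source are illustrative assumptions) that pushes single messages with store(line); a receiver that buffered messages and then called store on an Iterator or ArrayBuffer would be pushing whole blocks instead, trading latency and potential loss of data buffered in the receiver for efficiency.

    import java.net.Socket
    import scala.io.Source
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // A line-oriented socket receiver sketch (illustrative, not from the deck).
    class LineReceiver(host: String, port: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Run the blocking read loop on its own thread so onStart returns quickly.
        new Thread("line-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = () // the reading thread ends when the socket is closed

      private def receive(): Unit = {
        var socket: Socket = null
        try {
          socket = new Socket(host, port)
          val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
          // Pushing single messages: lowest latency, and Spark (not the receiver)
          // groups them into blocks; anything still buffered inside the receiver
          // when it fails is lost, as slide 49 warns.
          lines.foreach(line => store(line))
          restart("Source closed the connection, restarting")
        } catch {
          case e: Exception => restart("Error receiving data", e)
        } finally {
          if (socket != null) socket.close()
        }
      }
    }

Such a receiver would be attached with ssc.receiverStream(new LineReceiver(host, port)), which returns a DStream[String] just like any built-in source.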