Spark Streaming into context


  • Spark Streaming into context

    David Martinez Rego, 20th of October 2016

  • About me

    PhD in ML (2013): predictive maintenance of windmills

    Living in London since then

    Postdoc @ UCL

    Teaching and mentoring @ UCL: internships inside financial institutions

    Consulting on data analytics

    Early-stage startup

  • Plethora of options?

  • Wishlist

    Easy to compose complex pipelines

    Easy scaling out

    Interoperable with a large ecosystem

    Low latency and high throughput

    Monitoring

  • Plethora of options?

  • Flume

    Its mechanism of scaling to different machines is managed in an ad hoc way

    Nice for solving simple custom data gathering from the exterior and throwing it into the perimeter for further processing.

  • Plethora of options?

  • Lessons learnt

    Each project added good ideas at the time they were most needed

    Eventually, all platforms have absorbed the best ideas from their peers

    It seems that we have a winner, for now?

  • Time view

    Pipelining vs. composition per engine:

    Storm: one at a time; spouts and bolts

    Spark: RDD (micro-batch)

  • Storm basic model

    Spouts (sources) feed bolts (processing steps) inside a topology; the edges between components are stream groupings (s.g.)

  • Guarantees and fault tolerance

    Anchoring and acking: each emitted tuple is anchored (ANCH) to the input tuple it derives from, and acks (ACK) propagate back so the spout can replay a tuple if any downstream bolt fails (see the sketch below)
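    As a concrete sketch of that mechanism (a hedged example assuming Storm 1.x; SplitBolt is a hypothetical word-splitting bolt, not taken from the talk):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: splits sentences into words, anchoring and acking each tuple.
public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            // Anchoring (ANCH): the emitted tuple is linked to its input tuple,
            // so a downstream failure triggers a replay from the spout.
            collector.emit(input, new Values(word));
        }
        // ACK: tell Storm this input tuple has been fully processed here.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```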

  • Spout

  • Bolt

  • Topology
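    A minimal wiring sketch for the three concepts above (again assuming Storm 1.x; SentenceSpout and WordCountBolt are hypothetical, and SplitBolt is the bolt sketched earlier):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout: the source that injects tuples into the topology.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // Stream groupings (s.g.) decide how tuples are routed between components.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");                 // random distribution
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));  // partition by key
        // Topology: the whole spout/bolt graph, submitted as one unit.
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```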


  • Lambda architecture

    Pair a batch layer (accurate, replayable) with a streaming speed layer (low latency), merged at a serving layer

  • Time view

    Pipelining vs. composition per engine:

    Storm: one at a time; spouts and bolts

    Samza: one at a time; system, stream, stream task

    Spark: RDD (micro-batch)

  • Samza
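    A minimal sketch of the model (assuming Samza's classic low-level API; the system, stream names and filter predicate are hypothetical): a job composes systems (e.g. Kafka), streams (topics) and stream tasks that process one message at a time.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical task: forwards only the messages we care about.
public class FilterTask implements StreamTask {
    // system = "kafka", stream = "filtered-events"
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-events");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        if (message.contains("relevant")) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
        }
    }
}
```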

  • Kappa architecture

    Drop the batch layer: serve everything from a single stream pipeline, and recompute by replaying the log

  • Time view

    Pipelining vs. composition per engine:

    Storm: one at a time; source, spouts, bolts and ack

    Samza: one at a time; system, stream, stream task

    Spark: RDD (micro-batch)

  • RDD

  • Micro-batch

  • Init + connect to source

    pipeline

    computation + state mgmt.
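    A minimal sketch of that structure in the DStream API (assuming Spark 2.x; the socket source, port and checkpoint path are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MicrobatchSketch {
    public static void main(String[] args) throws InterruptedException {
        // Init: the context cuts the stream into 2-second micro-batches.
        SparkConf conf = new SparkConf().setAppName("sketch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
        ssc.checkpoint("/tmp/checkpoint");  // required for stateful operations

        // Connect to source (a socket stream here, purely for illustration).
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Pipeline: each micro-batch is an RDD, transformed with the batch API.
        JavaPairDStream<String, Integer> counts = lines
            .flatMap(l -> Arrays.asList(l.split(" ")).iterator())
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey(Integer::sum);

        // Computation + state mgmt.: carry a running total across micro-batches.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> update =
            (values, state) -> {
                Integer sum = state.orElse(0);
                for (Integer v : values) sum += v;
                return Optional.of(sum);
            };
        JavaPairDStream<String, Integer> totals = counts.updateStateByKey(update);

        totals.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```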


  • Much better, but still introduces problems

    1. Still no full equivalence between batch and streaming

    2. Out-of-order management and early reporting have to be hand-coded

    3. Custom window code needs to be mixed with business logic (see the sketch below)

    4. Micro-batches impose a lower limit on latency
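    For point 3, a small continuation of the DStream sketch above: the window size and slide sit inline with the business logic, and out-of-order handling or early reporting would have to be hand-coded on top.

```java
// Window parameters interleaved with the business logic (point 3):
// a 5-minute window recomputed every minute over the counts stream.
JavaPairDStream<String, Integer> windowed = counts.reduceByKeyAndWindow(
    Integer::sum, Durations.minutes(5), Durations.minutes(1));
```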

  • Spark: batch and streaming

  • Lambda architecture?

  • Out of order

    Latency is unpredictable, so events can arrive out of order

  • Our aim

  • Final Spark (1)

  • Final Spark (2)

  • Batch vs. Streaming

    Data: streaming vs. batch

    A batch pipeline IS a streaming pipeline applied to a finite stream!
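    A sketch of that claim in Spark terms (assuming Spark 2.x; the path, servers and topic are hypothetical): the same Dataset transformations apply to both, only the source differs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FiniteStreamSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("finite").getOrCreate();

        // A batch is just a finite stream: read it with the batch API...
        Dataset<Row> batch = spark.read().json("events/");

        // ...or keep the pipeline identical and swap in an unbounded source.
        Dataset<Row> stream = spark.readStream().format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load();
    }
}
```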

  • Event time + Processing time

    Business logic is expressed over event time; the engine reconciles it with processing time
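    A hedged sketch of the idea in Structured Streaming (assuming a recent Spark 2.x; topic and field names are hypothetical): the window is declared over event time, while the watermark bounds how long to wait in processing time.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("event-time").getOrCreate();

        Dataset<Row> events = spark.readStream().format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load()
            .selectExpr("CAST(value AS STRING) AS word", "timestamp");

        Dataset<Row> counts = events
            // Processing time: accept events up to 10 minutes late...
            .withWatermark("timestamp", "10 minutes")
            // ...event time: the business logic is a 5-minute window per word.
            .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
            .count();

        counts.writeStream().outputMode("update").format("console")
              .start().awaitTermination();
    }
}
```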

  • Plethora of options?

  • Beam/Dataflow

  • Apache Beam

    Streaming API

    Execution engine

    Capability matrix: http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
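    A minimal Beam sketch (assuming a Beam 2.x Java SDK; the input path is hypothetical): the pipeline is written once against the streaming API, and the execution engine is picked at launch time through the runner option.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class BeamSketch {
    public static void main(String[] args) {
        // The execution engine is chosen here, e.g. --runner=SparkRunner
        // or --runner=FlinkRunner; the pipeline code does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("input.txt"))
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
         .apply(Count.perElement());

        p.run().waitUntilFinish();
    }
}
```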

  • Apache Beam

Kostas Tzoumas, data Artisans

    Tyler Akidau, Beam PMC

  • Other considerations

    Relative comparison of the engines on maturity, ecosystem, community and ops (pluses and minuses per engine)

  • Other considerations

    Flow of the experiment (the Yahoo streaming benchmark; a sketch follows below):

    Read an event from Kafka.

    Deserialize the JSON string.

    Filter out irrelevant events.

    Take a projection of the relevant fields.

    Join each event with its associated campaign (from Redis).

    Take a windowed count of events per campaign and store each window in Redis along with a last-updated timestamp (with late events).
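    A rough sketch of that flow in Structured Streaming (heavily hedged: the schema, topic and campaign table are hypothetical, and the Redis lookup and sink are replaced by a static join and a console sink purely for illustration):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.functions.window;

public class BenchmarkFlowSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("flow").getOrCreate();

        // Hypothetical event schema.
        StructType schema = new StructType()
            .add("ad_id", "string")
            .add("event_type", "string")
            .add("event_time", "timestamp");

        Dataset<Row> events = spark.readStream().format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "ad-events")                     // 1. read from Kafka
            .load()
            .select(from_json(col("value").cast("string"), schema).as("e"))
            .select("e.*")                                        // 2. deserialize JSON
            .filter(col("event_type").equalTo("view"))            // 3. filter events
            .select("ad_id", "event_time");                       // 4. projection

        // 5. Stand-in for the Redis lookup: a static ad_id -> campaign_id table.
        Dataset<Row> campaigns = spark.read().json("campaigns.json");
        Dataset<Row> joined = events.join(campaigns, "ad_id");

        // 6. Windowed count per campaign; the watermark bounds late events.
        Dataset<Row> counts = joined
            .withWatermark("event_time", "10 seconds")
            .groupBy(window(col("event_time"), "10 seconds"), col("campaign_id"))
            .count();

        counts.writeStream().outputMode("update").format("console")
              .start().awaitTermination();
    }
}
```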

  • Resources

    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

    https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

    http://data-artisans.com/why-apache-beam/#more-710

    http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

    http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html

    https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at


  • Spark Streaming into context

    Thanks for listening!