Spark Streaming into context


  • Spark Streaming into context

    David Martinez Rego, 20th of October 2016

  • About me

    PhD in ML (2013): predictive maintenance of windmills

    Living in London since then

    Postdoc @ UCL

    Teaching and mentoring @ UCL: internships inside financial institutions

    Consulting on data analytics

    Early-stage startup

  • Plethora of options?

  • Wishlist

    Easy to compose complex pipelines

    Easy scaling out

    Interoperable with a large ecosystem

    Low latency and high throughput

    Monitoring

  • Plethora of options?

  • Flume

    Its mechanism of scaling to different machines is managed in an ad hoc way

    Nice for solving simple custom data gathering from the exterior and throwing it into the perimeter for further processing.

  • Plethora of options?

  • Lessons learnt

    Each project added good ideas at the time they were most needed

    Eventually, all platforms have absorbed the best ideas from their peers

    It seems that we have a winner, for now?

  • Time view

    Pipelining vs. composition per engine:

    Storm: one at a time; spouts and bolts

    Spark: RDD (micro-batch)

  • Storm basic model

    Spouts (sources) feed bolts (processing steps) inside a topology; the edges between components are stream groupings (s.g.)

  • Guarantees and fault tolerance

    Anchoring and acking: each emitted tuple is anchored (ANCH) to the input tuple it derives from, and acks (ACK) propagate back so the spout can replay a tuple if any downstream bolt fails (see the sketch below)
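    As a concrete sketch of that mechanism (a hedged example assuming Storm 1.x; SplitBolt is a hypothetical word-splitting bolt, not taken from the talk):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: splits sentences into words, anchoring and acking each tuple.
public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            // Anchoring (ANCH): the emitted tuple is linked to its input tuple,
            // so a downstream failure triggers a replay from the spout.
            collector.emit(input, new Values(word));
        }
        // ACK: tell Storm this input tuple has been fully processed here.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```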

  • Spout

  • Bolt

  • Topology
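    A minimal wiring sketch for the three concepts above (again assuming Storm 1.x; SentenceSpout and WordCountBolt are hypothetical, and SplitBolt is the bolt sketched earlier):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout: the source that injects tuples into the topology.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // Stream groupings (s.g.) decide how tuples are routed between components.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");                 // random distribution
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));  // partition by key
        // Topology: the whole spout/bolt graph, submitted as one unit.
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```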


  • Lambda architecture

    Pair a batch layer (accurate, replayable) with a streaming speed layer (low latency), merged at a serving layer

  • Time view

    Pipelining vs. composition per engine:

    Storm: one at a time; spouts and bolts

    Samza: one at a time; system, stream, stream task

    Spark: RDD (micro-batch)

  • Samza
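    A minimal sketch of the model (assuming Samza's classic low-level API; the system, stream names and filter predicate are hypothetical): a job composes systems (e.g. Kafka), streams (topics) and stream tasks that process one message at a time.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical task: forwards only the messages we care about.
public class FilterTask implements StreamTask {
    // system = "kafka", stream = "filtered-events"
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-events");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        if (message.contains("relevant")) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
        }
    }
}
```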

  • Kappa architecture

    Drop the batch layer: serve everything from a single stream pipeline, and recompute by replaying the log

  • Time view

    Pipelining vs. composition per engine:

    Storm: one at a time; source, spouts, bolts and ack

    Samza: one at a time; system, stream, stream task

    Spark: RDD (micro-batch)

  • RDD

  • Micro-batch

  • Init + connect to source

    pipeline

    computation + state mgmt.
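    A minimal sketch of that structure in the DStream API (assuming Spark 2.x; the socket source, port and checkpoint path are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MicrobatchSketch {
    public static void main(String[] args) throws InterruptedException {
        // Init: the context cuts the stream into 2-second micro-batches.
        SparkConf conf = new SparkConf().setAppName("sketch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
        ssc.checkpoint("/tmp/checkpoint");  // required for stateful operations

        // Connect to source (a socket stream here, purely for illustration).
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Pipeline: each micro-batch is an RDD, transformed with the batch API.
        JavaPairDStream<String, Integer> counts = lines
            .flatMap(l -> Arrays.asList(l.split(" ")).iterator())
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey(Integer::sum);

        // Computation + state mgmt.: carry a running total across micro-batches.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> update =
            (values, state) -> {
                Integer sum = state.orElse(0);
                for (Integer v : values) sum += v;
                return Optional.of(sum);
            };
        JavaPairDStream<String, Integer> totals = counts.updateStateByKey(update);

        totals.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
```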


  • Much better, but still introduces problems

    1. Still no full equivalence between batch and streaming

    2. Out-of-order management and early reporting have to be hand-coded

    3. Custom window code needs to be mixed with business logic (see the sketch below)

    4. Micro-batches impose a lower limit on latency
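    For point 3, a small continuation of the DStream sketch above: the window size and slide sit inline with the business logic, and out-of-order handling or early reporting would have to be hand-coded on top.

```java
// Window parameters interleaved with the business logic (point 3):
// a 5-minute window recomputed every minute over the counts stream.
JavaPairDStream<String, Integer> windowed = counts.reduceByKeyAndWindow(
    Integer::sum, Durations.minutes(5), Durations.minutes(1));
```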

  • Spark: batch and streaming

  • Lambda architecture?

  • Out of order

    Latency is unpredictable, so events can arrive out of order

  • Our aim

  • Final Spark (1)

  • Final Spark (2)

  • Batch vs. Streaming

    Data: streaming vs. batch

    A batch pipeline IS a streaming pipeline applied to a finite stream!
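    A sketch of that claim in Spark terms (assuming Spark 2.x; the path, servers and topic are hypothetical): the same Dataset transformations apply to both, only the source differs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FiniteStreamSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("finite").getOrCreate();

        // A batch is just a finite stream: read it with the batch API...
        Dataset<Row> batch = spark.read().json("events/");

        // ...or keep the pipeline identical and swap in an unbounded source.
        Dataset<Row> stream = spark.readStream().format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load();
    }
}
```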

  • Event time + Processing time

    Business logic is expressed over event time; the engine reconciles it with processing time
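    A hedged sketch of the idea in Structured Streaming (assuming a recent Spark 2.x; topic and field names are hypothetical): the window is declared over event time, while the watermark bounds how long to wait in processing time.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("event-time").getOrCreate();

        Dataset<Row> events = spark.readStream().format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load()
            .selectExpr("CAST(value AS STRING) AS word", "timestamp");

        Dataset<Row> counts = events
            // Processing time: accept events up to 10 minutes late...
            .withWatermark("timestamp", "10 minutes")
            // ...event time: the business logic is a 5-minute window per word.
            .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
            .count();

        counts.writeStream().outputMode("update").format("console")
              .start().awaitTermination();
    }
}
```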

  • Plethora of options?

  • Beam/Dataflow

  • Apache Beam

    Streaming API

    Execution engine

    Capability matrix: http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
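    A minimal Beam sketch (assuming a Beam 2.x Java SDK; the input path is hypothetical): the pipeline is written once against the streaming API, and the execution engine is picked at launch time through the runner option.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class BeamSketch {
    public static void main(String[] args) {
        // The execution engine is chosen here, e.g. --runner=SparkRunner
        // or --runner=FlinkRunner; the pipeline code does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("input.txt"))
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
         .apply(Count.perElement());

        p.run().waitUntilFinish();
    }
}
```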

  • Apache Beam

Kostas Tzoumas, data Artisans

    Tyler Akidau, Beam PMC

  • Other considerations

    Relative comparison of the engines on maturity, ecosystem, community and ops (pluses and minuses per engine)

  • Other considerations

    Flow of the experiment (the Yahoo streaming benchmark; a sketch follows below):

    Read an event from Kafka.

    Deserialize the JSON string.

    Filter out irrelevant events.

    Take a projection of the relevant fields.

    Join each event with its associated campaign (from Redis).

    Take a windowed count of events per campaign and store each window in Redis along with a last-updated timestamp (with late events).
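    A rough sketch of that flow in Structured Streaming (heavily hedged: the schema, topic and campaign table are hypothetical, and the Redis lookup and sink are replaced by a static join and a console sink purely for illustration):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.functions.window;

public class BenchmarkFlowSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("flow").getOrCreate();

        // Hypothetical event schema.
        StructType schema = new StructType()
            .add("ad_id", "string")
            .add("event_type", "string")
            .add("event_time", "timestamp");

        Dataset<Row> events = spark.readStream().format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "ad-events")                     // 1. read from Kafka
            .load()
            .select(from_json(col("value").cast("string"), schema).as("e"))
            .select("e.*")                                        // 2. deserialize JSON
            .filter(col("event_type").equalTo("view"))            // 3. filter events
            .select("ad_id", "event_time");                       // 4. projection

        // 5. Stand-in for the Redis lookup: a static ad_id -> campaign_id table.
        Dataset<Row> campaigns = spark.read().json("campaigns.json");
        Dataset<Row> joined = events.join(campaigns, "ad_id");

        // 6. Windowed count per campaign; the watermark bounds late events.
        Dataset<Row> counts = joined
            .withWatermark("event_time", "10 seconds")
            .groupBy(window(col("event_time"), "10 seconds"), col("campaign_id"))
            .count();

        counts.writeStream().outputMode("update").format("console")
              .start().awaitTermination();
    }
}
```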

  • Resources

    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

    https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

    http://data-artisans.com/why-apache-beam/#more-710

    http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

    http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html

    https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at


  • Spark Streaming into context

    Thanks for listening!