TRANSCRIPT
Budapest Data Forum, 2018
Structured Streaming in Spark
Why Real-time?
Why Spark Streaming?
Why Real-time?
How to choose a streaming tool?
The Apache landscape
streams
Sometimes you just want to keep it simple
Remember this from 1 hour ago?
So, our fancy tools
streams
How to choose a fancy streaming tool?
Popularity
See the bigger picture
Throughput
source: https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
*as the Spark folks measured it
Throughput
source: https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime
*as the Flink folks measured it
Developers!
Latency
Native Streaming (event-based processing)
vs
Microbatching
streams
trident
https://www.theguardian.com/technology/2014/feb/05/why-google-engineers-designers
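The latency trade-off above can be condensed into a toy Python sketch. This is not Spark or Flink code and all names are invented: an event-based engine hands each record to the operator immediately, while a microbatch engine buffers records and runs a small batch job per group, so a record waits for its batch to trigger.

```python
# Toy sketch (illustrative names, not any real engine's API) of
# per-event processing vs microbatching.

def native_streaming(events, handle):
    # Event-based engines process each record as it arrives,
    # so per-record latency is roughly the processing time alone.
    for e in events:
        handle(e)

def microbatching(events, handle, batch_size=3):
    # Microbatch engines buffer records and process each group as a
    # small batch job; a record waits until its batch is triggered,
    # which adds the batch interval to its latency.
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:                  # flush the final partial batch
        handle(batch)

seen = []
microbatching([1, 2, 3, 4, 5], seen.append)
# → seen holds two batches: [1, 2, 3] and [4, 5]
```

Microbatching trades this extra latency for simpler fault tolerance and higher throughput, which is the debate the slide refers to.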
Structured Streaming
Pain points to solve
• Interoperability: batch, interactive and real-time analytics
• Event-time based processing: event time instead of processing time
• End-to-end guarantees: consistent data throughout the whole pipeline, exactly-once processing
Unbounded Table
image credit: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
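The unbounded-table model can be mimicked in a few lines of plain Python. This is a simulation of the mental model, not Spark's API: every trigger appends the new input rows to a conceptually infinite table, and the result is whatever a batch-style query over the whole table would return.

```python
# Simulation of the unbounded-table model (not Spark itself):
# a streaming word count expressed as repeated batch queries.
from collections import Counter

table = []  # the "unbounded input table"

def trigger(new_rows):
    table.extend(new_rows)   # new data = new rows appended to the table
    return Counter(table)    # batch-style aggregation over the full table

trigger(["cat", "dog"])
result = trigger(["dog"])
# → counts reflect everything seen so far: dog=2, cat=1
```

This is why the same DataFrame code can run in batch, interactive, and streaming mode: the streaming query is defined as if it were a batch query over the ever-growing table, and the engine incrementalizes it.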
Late data
Handling late data with Watermarking
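A toy sketch of the watermarking idea (illustrative names, not Spark's internals): the engine tracks the maximum event time seen so far, and admits late events only while they fall within the configured delay threshold; anything older can be dropped and its state cleaned up.

```python
# Toy sketch of watermarking (not Spark's implementation): events are
# admitted while event_time >= max_event_time - delay_threshold.

class Watermark:
    def __init__(self, delay_threshold):
        self.delay = delay_threshold
        self.max_event_time = float("-inf")

    def admit(self, event_time):
        """True if the event is on time or acceptably late."""
        self.max_event_time = max(self.max_event_time, event_time)
        return event_time >= self.max_event_time - self.delay

wm = Watermark(delay_threshold=10)
on_time = wm.admit(100)   # first event, always admitted
late_ok = wm.admit(95)    # 5 units late, within the 10-unit threshold
too_late = wm.admit(85)   # 15 units late, dropped
```

The threshold is a trade-off: a larger delay tolerates later data but forces the engine to keep aggregation state around for longer.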
The drama of Exactly-once processing (Act I)
Spark: give me data
Kafka: you were at the 10th line, there you go with the 11th.
Spark: got it, thanks! Consider line 11 done.
Spark: Hey Postgres, store the results please
Postgres: OK!
Spark: give me data
Kafka: you were at the 11th line, there you go with the 12th.
...
The drama of Exactly-once processing (Act II)
Spark: give me data
Kafka: you were at the 12th line, there you go with the 13th.
Spark: got it, thanks! Consider line 13 done.
Spark: Hey Postgres, store the re.....
Claudius: Hey Spark, got thirsty? ;)
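Act II's failure comes from treating the offset commit and the sink write as two independent steps: crash between them and line 13 is either lost or duplicated. A hedged sketch of the standard remedy (illustrative names, not the Kafka or Spark APIs): a replayable source plus an idempotent sink keyed by offset, so replaying after a crash is harmless.

```python
# Toy sketch of exactly-once via replayable source + idempotent sink
# (illustrative names only, not Kafka's or Spark's actual APIs).

class Pipeline:
    def __init__(self):
        self.committed_offset = 0   # what we told the source is "done"
        self.sink = {}              # idempotent sink: upserts keyed by offset

    def process(self, offset, record, crash_before_write=False):
        if crash_before_write:
            raise RuntimeError("poisoned by Claudius")
        self.sink[offset] = record      # upsert: replaying is harmless
        self.committed_offset = offset  # commit only AFTER the write

p = Pipeline()
p.process(11, "line 11")
try:
    p.process(12, "line 12", crash_before_write=True)  # the crash
except RuntimeError:
    pass
# On restart we replay from committed_offset + 1, i.e. offset 12 again:
p.process(12, "line 12")
# → the sink holds each line exactly once despite the crash and replay
```

This is the shape of Structured Streaming's end-to-end guarantee: the source must be replayable and the sink idempotent (or transactional), so that at-least-once replay plus deduplication yields exactly-once results.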
Demo
Summary
• Only use fancy tools if you need them ;)
• Structured Streaming
  • Great concept
  • Access to core Spark functionalities
  • Probably takes 1-2 years to make it feature-rich