Making Structured Streaming Ready for Production
TRANSCRIPT
Making Structured Streaming Ready For Production
Tathagata "TD" Das (@tathadas)
Spark Summit East, 8th February 2017
About Me
Spark PMC Member
Built Spark Streaming at UC Berkeley
Currently focused on Structured Streaming
building robust stream processing apps is hard
Complexities in stream processing

Complex data: diverse data formats (JSON, Avro, binary, ...); data can be dirty, late, out-of-order

Complex systems: diverse storage systems and formats (SQL, NoSQL, Parquet, ...); system failures

Complex workloads: event-time processing; combining streaming with interactive queries and machine learning
Structured Streaming

stream processing on the Spark SQL engine: fast, scalable, fault-tolerant
rich, unified, high-level APIs: deal with complex data and complex workloads
rich ecosystem of data sources: integrate with many storage systems
Philosophy

the simplest way to perform stream processing is not having to reason about streaming at all
Treat Streams as Unbounded Tables

data stream = unbounded input table
new data in the data stream = new rows appended to an unbounded table
New Model

Trigger: every 1 sec
[Diagram: along the time axis (t=1, t=2, t=3), the input table holds data up to t=1, t=2, t=3, and the query runs at each trigger]

Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input; the usual map/filter/reduce plus new window and session ops
New Model

[Diagram: at each trigger (t=1, t=2, t=3), the query turns input data up to t into a result up to t; in complete mode, all rows in the result table are written to storage]

Result: final operated table, updated after every trigger
Output: what part of the result to write to storage after every trigger
Complete output mode: write the full result table every time
New Model

[Diagram: same timeline; in append mode, only the result rows that are new since the last trigger are written to storage]

Result: final operated table, updated after every trigger
Output: what part of the result to write to storage after every trigger
Complete output: write the full result table every time
Append output: write only the new rows added to the result table since the previous batch
*Not all output modes are feasible with all queries; a sketch contrasting the two modes follows.
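The difference between the modes is easiest to see in code. Below is a minimal sketch, not from the talk: it assumes Spark 2.2+ for the built-in rate test source (the talk targets 2.1, where any other streaming source behaves the same), and the bucket column is purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("output-modes").getOrCreate()
import spark.implicits._

// The rate source continuously emits rows with (timestamp, value) columns.
val input = spark.readStream.format("rate").load()

// Complete mode: the whole result table is rewritten after every trigger.
val completeQuery = input
  .groupBy(($"value" % 10).as("bucket"))
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append mode: only rows added to the result since the last trigger are
// written. Plain selections qualify as-is; aggregations additionally need
// a watermark (covered later in this talk).
val appendQuery = input
  .select($"timestamp", $"value")
  .writeStream
  .outputMode("append")
  .format("console")
  .start()

appendQuery.awaitTermination()
```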
API - Dataset/DataFrame

static data = bounded table
streaming data = unbounded table

Single API!
Batch Queries with DataFrames

input = spark.read.format("json").load("source-path")            // read from JSON files
result = input.select("device", "signal").where("signal > 15")   // select some devices
result.write.format("parquet").save("dest-path")                 // write to Parquet files
Streaming Queries with DataFrames

input = spark.readStream.format("json").load("source-path")      // replace read with readStream
result = input.select("device", "signal").where("signal > 15")   // transformation code does not change
result.writeStream.format("parquet").start("dest-path")          // replace save() with start()
DataFrames, Datasets, SQL

input = spark.readStream.format("json").load("source-path")
result = input.select("device", "signal").where("signal > 15")
result.writeStream.format("parquet").start("dest-path")

Logical plan: streaming source -> project [device, signal] -> filter [signal > 15] -> streaming sink

Streaming query execution: Spark SQL converts the batch-like query into a series of incremental execution plans, each operating on a new batch of data.

[Diagram: process new files at t=1, t=2, t=3]
Fault-tolerance with Checkpointing

Checkpointing: metadata (e.g. offsets) of the current batch is stored in a write-ahead log in HDFS/S3

The query can be restarted from the log
Streaming sources can replay the exact data range in case of failure
Streaming sinks can deduplicate reprocessed data when writing, idempotent by design

= end-to-end exactly-once guarantees

[Diagram: process new files at t=1, t=2, t=3, with offsets recorded in the write-ahead log]
Complex Streaming ETL
Traditional ETL

Raw, dirty, un/semi-structured data is dumped as files
Periodic jobs run every few hours to convert the raw data into structured data ready for further analytics

[Diagram: raw records reach the file dump within seconds, but only land in a queryable table hours later]
Traditional ETL

Hours of delay before decisions can be made on the latest data
Unacceptable when time is of the essence [intrusion detection, anomaly detection, etc.]

[Diagram: the same file-dump pipeline, highlighting the hours of latency]
Streaming ETL w/ Structured Streaming

Structured Streaming enables raw data to be available as structured data as soon as possible

[Diagram: raw records land in the table within seconds]
Streaming ETL w/ Structured Streaming

Example:
- JSON data being received in Kafka
- Parse nested JSON and flatten it
- Store in a structured Parquet table
- Get end-to-end failure guarantees

val rawData = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", ...)
  .load()

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable/")

(Each step is walked through on the following slides; the schema passed to from_json is sketched there.)
Reading from Kafka [Spark 2.1]

Supports Kafka 0.10.0.1. Specify options to configure:

How?   kafka.bootstrap.servers => broker1
What?  subscribe => topic1,topic2,topic3   // fixed list of topics
       subscribePattern => topic*          // dynamic list of topics
       assign => {"topicA":[0,1]}          // specific partitions
Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()

Variants of these options are sketched below.
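As hedged sketches only, here are the other combinations of those options; brokers and topic names are placeholders, while the option keys themselves are the real Spark 2.1 Kafka source options:

```scala
// Dynamic topic list via a regex, replaying each topic from the beginning.
val fromPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "topic.*")   // dynamic list of topics (regex)
  .option("startingOffsets", "earliest")   // replay from the earliest offsets
  .load()

// Specific partitions with explicit per-partition starting offsets.
val fromPartitions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"topicA":[0,1]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()
```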
Reading from Kafka

val rawData = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", ...)
  .load()

The rawData DataFrame has the following columns:

key      | value    | topic    | partition | offset | timestamp
[binary] | [binary] | "topicA" | 0         | 345    | 1486087873
[binary] | [binary] | "topicB" | 3         | 2890   | 1486086721
Transforming Data

Cast the binary value to string, name the column json

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")
Transforming Data

Cast the binary value to string, name the column json
Parse the json string and expand into nested columns, name it data

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

from_json($"json", schema) as "data" turns each json string into a nested column:

json:
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}

data (nested):
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
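The slide elides the schema argument that from_json requires. A minimal sketch of how the missing pieces could look, with a hypothetical schema for these device events (the signal field is an assumption):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical schema matching the JSON shown above; timestamps arrive as
// epoch seconds, so LongType is the safe choice here.
val schema = new StructType()
  .add("timestamp", LongType)
  .add("device", StringType)
  .add("signal", DoubleType)

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")
```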
Transforming Data

Cast the binary value to string, name the column json
Parse the json string and expand into nested columns, name it data
Flatten the nested columns

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

select("data.*") flattens the nested data column into top-level columns:

data (nested):               (not nested):
timestamp  | device | …      timestamp  | device | …
1486087873 | devA   | …  =>  1486087873 | devA   | …
1486086721 | devX   | …      1486086721 | devX   | …
Transforming Data

Cast the binary value to string, name the column json
Parse the json string and expand into nested columns, name it data
Flatten the nested columns

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, ... 100s of functions
Writing to Parquet table

Save the parsed data as a Parquet table in the given path
Partition the files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data (see the pruning sketch below)

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
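A hedged sketch of why the partitioning pays off: with files laid out as /parquetTable/date=.../, a filter on the partition column lets Spark prune whole directories instead of scanning the full table (the 2-day filter is illustrative):

```scala
// Batch read of the same table the stream is writing.
val lastTwoDays = spark.read
  .format("parquet")
  .load("/parquetTable")
  .where("date >= date_sub(current_date(), 2)")  // touches only recent partitions

lastTwoDays.groupBy("device").count().show()
```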
Checkpointing

Enable checkpointing by setting the checkpoint location to save offset logs
start actually starts a continuously running StreamingQuery in the Spark cluster

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")
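A minimal sketch of the restart contract (broker address and paths are placeholders): as long as every start uses the same checkpointLocation, a relaunched query resumes from the offsets recorded in the write-ahead log instead of starting over.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("restartable-etl").getOrCreate()

def startQuery() = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast (value as string) as value")
  .writeStream
  .option("checkpointLocation", "/checkpoint")  // same path on every start
  .format("parquet")
  .start("/parquetTable/")

val query = startQuery()
// If the application crashes and is relaunched, calling startQuery() again
// picks up exactly where the last committed batch ended.
```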
Streaming Query

query is a handle to the continuously running StreamingQuery
Used to monitor and manage the execution (see the sketch below)

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")

[Diagram: the StreamingQuery processes new data at t=1, t=2, t=3]
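A few of the monitoring and management calls available on the handle, all part of the StreamingQuery API in Spark 2.1:

```scala
query.id                    // unique identifier of this query
query.status                // is a trigger active? waiting for data?
query.lastProgress          // metrics of the most recent completed trigger
query.recentProgress        // array of recent per-trigger progress reports
query.explain()             // print the current physical plan
query.awaitTermination()    // block until the query stops or fails
// query.stop()             // stop the query gracefully
```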
Data Consistency on Ad-hoc Queries

Data is available for complex, ad-hoc analytics within seconds
The Parquet table is updated atomically, ensuring prefix integrity: even when distributed, ad-hoc queries will see either all updates from the streaming query or none. Read more in our blog:
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

seconds! complex, ad-hoc queries on the latest data
Advanced Streaming Analytics
Event-time Aggregations

Many use cases require aggregate statistics by event time, e.g. what is the number of errors in each system in 1-hour windows?
Many challenges: extracting event time from the data, handling late, out-of-order data
DStream APIs were insufficient for such event-time processing
Event-time Aggregations

Windowing is just another type of grouping in Structured Streaming (UDAFs are supported too)

Number of records every hour:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 minutes:

parsedData
  .groupBy($"device", window($"timestamp", "10 minutes"))
  .avg("signal")
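Sliding windows are the same grouping with a slide interval added. A hedged sketch: 1-hour windows advancing every 10 minutes, so each record lands in six overlapping windows:

```scala
import org.apache.spark.sql.functions.{col, window}

val slidingCounts = parsedData
  .groupBy(window(col("timestamp"), "1 hour", "10 minutes"))
  .count()
```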
Stateful Processing for Aggregations

Aggregates have to be saved as distributed state between triggers
Each trigger reads the previous state and writes an updated state
State is stored in memory, backed by a write-ahead log in HDFS/S3
Fault-tolerant, exactly-once guarantee!
[Diagram: at each trigger t=1, t=2, t=3, the query reads new data from the source, reads and updates the state, and writes to the sink; state updates are written to the write-ahead log for checkpointing]
Watermarking and Late Data

Watermark [Spark 2.1]: a boundary in event time trailing behind the max observed event time
Windows older than the watermark are automatically deleted to limit the amount of intermediate state

[Diagram: the watermark trails the max event time on the event-time axis]
Watermarking and Late Data

The gap between the max event time and the watermark is the configurable allowed lateness
Data newer than the watermark may be late, but is allowed to aggregate
Data older than the watermark is "too late" and is dropped; its state is removed

[Diagram: the allowed-lateness gap separates the max event time from the watermark; late data above the watermark is aggregated, data below it is dropped]
Watermarking and Late Data

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

[Diagram: withWatermark sets an allowed lateness of 10 minutes; late data within it is still aggregated, older data is dropped]
Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

[Diagram: processing time (12:00 to 12:15) vs event time (12:00 to 12:20). The system tracks the max observed event time; when it reaches 12:14, the watermark for the next trigger is updated to 12:14 - 10 min = 12:04, and state for windows before 12:04 is deleted. An event at 12:08 that arrives late but above the watermark is still considered in the counts; data older than the watermark is ignored in the counts and its state is dropped.]

more details in the online programming guide
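Tying the pieces together, a hedged end-to-end sketch (sink paths are illustrative): with a watermark in place, the windowed counts can be written in append mode, so each window is emitted once the watermark passes its end, after which its state is dropped.

```scala
import org.apache.spark.sql.functions.{col, window}

val windowedCounts = parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

windowedCounts.writeStream
  .outputMode("append")                                // finalized windows only
  .option("checkpointLocation", "/checkpoint-counts")
  .format("parquet")
  .start("/windowedCounts")
```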
Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState allows any user-defined stateful ops on a user-defined state
fault-tolerant, exactly-once
supports type-safe languages: Scala and Java

dataset
  .groupByKey(groupingFunc)
  .mapGroupsWithState(mappingFunc)

def mappingFunc(key: K, values: Iterator[V], state: KeyedState[S]): U = {
  // update or remove state
  // return mapped value
}
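A sketch of one such user-defined op: a running count of events per device. Event is a hypothetical case class, and note that the KeyedState shown on the slide shipped as GroupState in the released Spark 2.2:

```scala
import org.apache.spark.sql.streaming.GroupState

case class Event(device: String, signal: Double)

def countFunc(device: String, events: Iterator[Event],
              state: GroupState[Long]): (String, Long) = {
  val updated = state.getOption.getOrElse(0L) + events.size
  state.update(updated)   // persisted to the state store for the next trigger
  (device, updated)       // mapped value emitted downstream
}

// events: Dataset[Event], with spark.implicits._ in scope for the encoders.
val runningCounts = events
  .groupByKey(_.device)
  .mapGroupsWithState(countFunc _)
```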
Many more updates!

StreamingQueryListener [Spark 2.1]: receive regular progress heartbeats for health and performance monitoring. Automatic in Databricks!!

Kafka Batch Queries [Spark 2.2]: run batch queries on Kafka just like on a file system (sketched below)

Kafka Sink [Spark 2.2]: write to Kafka; can only give an at-least-once guarantee as Kafka doesn't support transactional updates (sketched below)

Update Output Mode [Spark 2.2]: only the updated rows in the result table are written to the sink
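Hedged sketches of the two Kafka additions (brokers, topics, and paths are placeholders):

```scala
// Kafka batch query [Spark 2.2]: same source options, but read, not readStream.
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic")
  .load()

// Kafka sink [Spark 2.2]: the sink expects a value (and optional key) column.
// At-least-once, so downstream consumers should tolerate duplicates.
parsedData
  .selectExpr("to_json(struct(*)) as value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/checkpoint-kafka")
  .start()
```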
More Info

Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Databricks blog posts for more focused discussions:
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html

and more, stay tuned!!
Comparison with Other Engines

Read the blog to understand this table.