Continuous Processing with Apache Flink — Strata London 2016
TRANSCRIPT

Stephan Ewen (@stephanewen)
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
Continuous Apps before Streaming

[Diagram: over time, a scheduler repeatedly launches batch jobs — file 1 → Job 1, file 2 → Job 2, file 3 → Job 3 — each feeding a serving layer]
Continuous Apps with Lambda

[Diagram: as before, a scheduler launches periodic batch jobs (file 1 → Job 1, file 2 → Job 2) into the serving layer, but a streaming job now runs alongside them over the same data]
Continuous Apps with Streaming

[Diagram: a single continuous pipeline — collect → log → analyze → serve & store]
Continuous Data Sources

[Diagram: a partitioned log, supporting three access patterns on the same source]
- Process a period of historic data
- Process the latest data with low latency (the tail of the log)
- Reprocess the stream (historic data first, catching up with real-time data)
Continuous Data Sources

[Diagram: hourly timestamps from 2016-03-11 12:00 am through 2016-03-12 3:00 am, spread across two partitions]

The same data can be seen as a stream of events in Apache Kafka partitions, or as a stream view over a sequence of files.
Continuous Processing

Two core concerns: time and state.

Enter Apache Flink
Apache Flink Stack

- DataStream API — stream processing
- DataSet API — batch processing
- Runtime — distributed streaming dataflow
- Libraries on top of the APIs

Streaming and batch as first-class citizens.
Programs and Dataflows

A program is a dataflow: Source → Transformation → … → Sink.

    val lines: DataStream[String] = env.addSource(
        new FlinkKafkaConsumer09(…))

    val events: DataStream[Event] = lines.map((line) => parse(line))

    val stats: DataStream[Statistic] = events
        .keyBy("sensor")
        .timeWindow(Time.seconds(5))
        .apply(new MyAggregationFunction())

    stats.addSink(new RollingSink(path))

[Diagram: the resulting streaming dataflow with two parallel instances of each operator — Source[1]/[2] → map()[1]/[2] → keyBy()/window()/apply()[1]/[2] → Sink[1]]
What makes Flink flink?

- Low latency and high throughput, with well-behaved flow control (back pressure)
- True streaming: works on real-time and historic data
- Event time: make more sense of data
- Stateful streaming: globally consistent savepoints, exactly-once semantics for fault tolerance, windows & user-defined state
- Flexible windows (time, count, session, roll-your-own)
- APIs and libraries, including Complex Event Processing
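As a rough illustration of the time-driven vs. data-driven window distinction above, here is a plain-Scala sketch (hypothetical helpers, not the Flink API): `countWindows` emits a group per n records, while `timeWindows` groups events by the tumbling window their timestamp falls into.

```scala
// Hypothetical plain-Scala sketch, not the Flink API.
// Data-driven window: group every n records, emit only full windows.
def countWindows[A](records: Seq[A], n: Int): Seq[Seq[A]] =
  records.grouped(n).filter(_.size == n).toSeq

// Time-driven window: assign each timestamp to its tumbling window,
// keyed by the window's start time.
def timeWindows(timestampsMs: Seq[Long], sizeMs: Long): Map[Long, Seq[Long]] =
  timestampsMs.groupBy(t => t - (t % sizeMs))
```

For example, `countWindows(Seq(1, 2, 3, 4, 5, 6, 7), 3)` yields two full windows of three records, while `timeWindows(Seq(0L, 4000L, 5000L), 5000L)` places the first two timestamps into the window starting at 0 and the third into the window starting at 5000.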
(It's) About Time

Different Notions of Time

[Diagram: Event Producer → Message Queue → Flink Data Source → Flink Window Operator. The event carries its Event Time when produced; the stream processor assigns Ingestion Time as the event enters Flink; window processing happens in Processing Time.]
Event Time vs. Processing Time

[Diagram: the Star Wars saga as an example — Processing Time is release order (1977, 1980, 1983, 1999, 2002, 2005, 2015), while Event Time is story order (Episodes IV, V, VI, I, II, III, VII).]
Batch: Implicit Treatment of Time

Time is treated outside of your application. Data is grouped by storage ingestion time.

[Diagram: a sequence of 1-hour batch jobs feeding a serving layer]
Streaming: Windows

Aggregates on streams are scoped by windows:
- Time-driven, e.g. "the last X minutes"
- Data-driven, e.g. "the last X records"

Example: "average over the last 5 minutes" — a time window sliding along the stream.
Event Time Windows

Event time windows reorder the events into their event-time order, regardless of the processing-time order in which they arrive.
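The reordering property can be sketched in plain Scala (an illustration, not Flink internals): because each event is assigned to a window purely from its own timestamp, out-of-order arrival produces the same window contents as in-order arrival.

```scala
// Plain-Scala sketch: event-time window assignment depends only on the
// event's own timestamp, never on its arrival order.
def windowStart(timestampMs: Long, sizeMs: Long): Long =
  timestampMs - (timestampMs % sizeMs)

def assign(events: Seq[Long], sizeMs: Long): Map[Long, Seq[Long]] =
  events.groupBy(windowStart(_, sizeMs)).map { case (w, es) => w -> es.sorted }

val inOrder    = Seq(1000L, 2000L, 6000L, 7000L)
val outOfOrder = Seq(6000L, 1000L, 7000L, 2000L)

// Same 5-second windows either way.
assert(assign(inOrder, 5000L) == assign(outOfOrder, 5000L))
```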
Processing Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    val stream: DataStream[Event] = env.addSource(…)

    stream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
Ingestion Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)

    val stream: DataStream[Event] = env.addSource(…)

    stream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
Event Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream: DataStream[Event] = env.addSource(…)
    val tsStream = stream.assignTimestampsAndWatermarks(
        new MyTimestampsAndWatermarkGenerator())

    tsStream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
The Power of Event Time

Batch processors — event time in ingestion-time batches (a mix of data-driven and wall-clock time):
- Stable across re-executions
- Wrong grouping at batch boundaries

Traditional stream processors — processing time (purely wall-clock time):
- Results depend on when the program runs (different on re-execution)
- Results affected by network speed and delays

Event-time stream processors — event time (purely data-driven time):
- Stable across re-executions
- No incorrect results at batch boundaries
Event Time Progress: Watermarks

[Diagram: events, each carrying a timestamp, flow through the stream interleaved with watermarks — e.g. W(11) and W(17) in an in-order stream, W(11) and W(20) in an out-of-order stream. A watermark W(t) asserts that no more events with timestamp ≤ t are expected.]
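A minimal sketch of how such watermarks can be generated (plain Scala, assuming a fixed bound on out-of-orderness — one common strategy, not the only one): the watermark trails the highest timestamp seen so far by that bound.

```scala
// Sketch: periodic watermarks with a fixed out-of-orderness bound.
// The watermark W(t) asserts that no more events with timestamp <= t
// are expected.
class BoundedOutOfOrdernessWatermarks(maxOutOfOrderness: Long) {
  private var maxTimestampSeen = Long.MinValue

  def onEvent(timestampMs: Long): Unit =
    maxTimestampSeen = math.max(maxTimestampSeen, timestampMs)

  def currentWatermark: Long = maxTimestampSeen - maxOutOfOrderness
}

val wm = new BoundedOutOfOrdernessWatermarks(maxOutOfOrderness = 5)
Seq(7L, 11L, 15L, 9L, 12L).foreach(wm.onEvent)
// Highest timestamp seen is 15, so the watermark is 10: the late
// events 9 and 12 still arrived within the bound.
assert(wm.currentWatermark == 10L)
```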
Bounding the Latency for Results

Trigger on combinations of event time and processing time.

See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incubating). The concepts apply almost 1:1 to Apache Flink; the syntax varies.
Matters of State

Batch vs. Continuous

Batch jobs:
- No state across batches
- Fault tolerance within a job
- Re-processing starts empty

Continuous programs:
- Continuous state across time
- Fault tolerance guards state
- Reprocessing starts stateful
Continuous State

[Diagram: sessions spread over time — there is no stateless point in time at which the application could be cleanly cut.]
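Why there is no stateless point in time can be seen with a small sessionization sketch (plain Scala, a hypothetical helper): a session that straddles any cut point is lost if processing restarts with empty state.

```scala
// Sketch: group sorted timestamps into sessions separated by a gap.
def sessionize(timestamps: Seq[Long], gap: Long): Seq[Seq[Long]] =
  timestamps.sorted.foldLeft(Vector.empty[Vector[Long]]) {
    case (acc, t) if acc.nonEmpty && t - acc.last.last <= gap =>
      acc.init :+ (acc.last :+ t) // extend the current session
    case (acc, t) =>
      acc :+ Vector(t)            // start a new session
  }

// Events at 1 and 2 form one session; 10 and 11 another; 30 a third.
// Cutting the stream between 1 and 2 would split the first session.
assert(sessionize(Seq(1L, 2L, 10L, 11L, 30L), gap = 3) ==
  Seq(Seq(1L, 2L), Seq(10L, 11L), Seq(30L)))
```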
Re-processing data (in batch)

[Diagram: hourly files from 2016-03-11 12:00 am through 7:00 am, re-processed as separate batch jobs. State that straddles the batch boundaries is lost, producing wrong / corrupt results.]
Streaming: Savepoints

[Diagram: Savepoint A and Savepoint B taken along a running stream]

A savepoint is a globally consistent point-in-time snapshot of the streaming application.
Re-processing data (continuous)

[Diagram: a new streaming program starts from Savepoint A]

Draw savepoints at times that you will want to start new jobs from (daily, hourly, …). Reprocess by starting a new job from a savepoint:
- Defines the start position in the stream (for example, Kafka offsets)
- Initializes pending state (like partial sessions)
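Conceptually (a plain-Scala model, not the actual Flink API or savepoint format), a savepoint pairs each source partition's read position with the operator state, and resuming means re-reading each partition from its saved offset with that state.

```scala
// Conceptual model: savepoint = (source offsets, operator state).
// Here the "job" is a simple per-event counter.
case class Savepoint(offsets: Map[Int, Long], counts: Map[String, Long])

def resume(log: Map[Int, Seq[String]], sp: Savepoint): Map[String, Long] =
  log.foldLeft(sp.counts) { case (state, (partition, events)) =>
    // Skip everything the old job already processed in this partition.
    events.drop(sp.offsets(partition).toInt).foldLeft(state) { (s, e) =>
      s.updated(e, s.getOrElse(e, 0L) + 1)
    }
  }

val log = Map(0 -> Seq("a", "b", "a"), 1 -> Seq("b", "b"))
val sp  = Savepoint(offsets = Map(0 -> 2L, 1 -> 1L),
                    counts  = Map("a" -> 1L, "b" -> 2L))

// The resumed job processes only the unread tail of each partition.
assert(resume(log, sp) == Map("a" -> 2L, "b" -> 3L))
```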
Forking and Versioning Applications

[Diagram: from a series of savepoints of a running job, new application versions App. A, App. B, and App. C are forked, each starting from a different savepoint.]
Conclusion

Wrap up: Streaming is the architecture for continuous processing.

Continuous processing makes data applications:
- Simpler: fewer moving parts
- More correct: no broken state at any boundaries
- More flexible: reprocess data and fork applications via savepoints

It requires a powerful stream processor, like Apache Flink.
Upcoming Features

- Dynamic scaling, resource elasticity
- Stream SQL
- CEP enhancements
- Incremental & asynchronous state snapshotting
- Mesos support
- More connectors, end-to-end exactly-once
- API enhancements (e.g., joins, slowly changing inputs)
- Security (data encryption, Kerberos with Kafka)
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org

We are hiring! data-artisans.com/careers