Page 1
Apache Flink Meetup Berlin #6 April 29, 2015
Ufuk Celebi
[email protected]
Unified Batch & Stream Processing
in Apache Flink
Page 2
Batch vs. Stream Processing
Batch Processing
High Latency
Batch Processors
Static Files
Stream Processing
Low Latency
Stream Processors
Event Streams
Page 3
What's the difference between a Batch and Streaming Runtime?
Page 4
Batch Processing
HDFS: "O Romeo, Romeo, wherefore art thou Romeo?"
Read complete file
map
(Romeo, 1) (Romeo, 1) (wherefore, 1) (art, 1) (thou, 1) (Romeo, 1)
Write out intermediate data set
reduce
(Romeo, 3) (wherefore, 1) (art, 1) (thou, 1)
Write out complete result
HDFS
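The batch flow on this slide can be sketched in plain Java, without any Flink dependency. This is only an illustrative model: the class and method names (BatchWordCount, mapPhase, reducePhase) are made up for this sketch, and in-memory lists stand in for the files on HDFS. Unlike the slide, the sketch also counts the leading "O".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of the batch word count; not Flink code.
public class BatchWordCount {

    // Map phase: read the complete input and materialize every (word, 1) pair.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.split("\\W+")) {
                if (!token.isEmpty()) {
                    pairs.add(Map.entry(token, 1));
                }
            }
        }
        return pairs; // the complete intermediate data set
    }

    // Reduce phase: starts only once the full intermediate data set exists.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts; // the complete result
    }

    public static void main(String[] args) {
        List<String> file = List.of("O Romeo, Romeo,", "wherefore art thou Romeo?");
        System.out.println(reducePhase(mapPhase(file)));
        // prints {O=1, Romeo=3, wherefore=1, art=1, thou=1}
    }
}
```

The point of the sketch is the hand-over: reducePhase receives the whole list at once, which mirrors "write out intermediate data set" before the reduce step runs.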
Page 5
Stream Processing
Kafka: O Romeo Romeo wherefore art thou Romeo
Read incremental updates
map
(Romeo, 1) (Romeo, 1) (wherefore, 1) (art, 1) (thou, 1) (Romeo, 1)
Keep data moving
reduce
(Romeo, 3) (wherefore, 1) (art, 1) (thou, 1)
Write out incremental updates
Kafka
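For contrast with the batch flow, here is a minimal plain-Java sketch of the streaming variant, in which every incoming word immediately produces an updated count. The names (StreamingWordCount, onWord) and the Consumer used as a sink are assumptions made for illustration; a real job would read from and write to Kafka.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical plain-Java model of an incremental word count; not Flink code.
public class StreamingWordCount {

    private final Map<String, Integer> counts = new HashMap<>();
    private final Consumer<String> sink; // stands in for the outgoing stream

    StreamingWordCount(Consumer<String> sink) {
        this.sink = sink;
    }

    // Called once per incoming record; emits the updated count right away.
    void onWord(String word) {
        int updated = counts.merge(word, 1, Integer::sum);
        sink.accept(word + ", " + updated);
    }

    public static void main(String[] args) {
        List<String> updates = new ArrayList<>();
        StreamingWordCount wordCount = new StreamingWordCount(updates::add);
        for (String word : "Romeo Romeo wherefore art thou Romeo".split(" ")) {
            wordCount.onWord(word);
        }
        updates.forEach(System.out::println);
    }
}
```

No complete input or complete result exists at any point: the sink sees "Romeo, 1", then "Romeo, 2", and only later "Romeo, 3".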
Page 6
Batch and Stream Processing
Page 7
Do we really need two separate systems for this?
Page 8
Apache Flink
Batch Processing: DataSet (Java/Scala)
Stream Processing: DataStream (Java/Scala)
Flink Runtime
Page 9
Flink Runtime
Operator Intermediate Result
Page 10
Operators
Operators represent computations over data.
User: Map, Join, Reduce, CoGroup, ...
Internal: Sort, Hash, ...
Page 11
Operators
[Diagram: Map, Join, Sum]
Page 12
Intermediate Results
Logical handle to data produced by operators.
Page 13
Intermediate Results
Logical handle to data produced by operators.
[Diagram: Map, Join, Sum]
Page 14
Map-Reduce Example
DataSet<Tuple2<String, Integer>> counts;
counts = input
    .flatMap(new LineSplitter())
    .groupBy(0)
    .sum(1);
[Diagram: Map -> Intermediate Result -> Reduce]
Intermediate Result: logical handle to data produced by operators.
Page 15
Result Characteristics
Pipelined vs. Blocking
Ephemeral vs. Checkpointed
Page 16
Result Characteristics
Pipelined vs. Blocking
Ephemeral vs. Checkpointed
How and when to do data exchange?
Page 17
Blocking Results
[Animation: Map -> Blocking Result -> Reduce]
Page 29
Pipelined Results
[Animation: Map -> Pipelined Result -> Reduce]
Page 41
Result Characteristics
            Ephemeral                                   Checkpointed
Pipelined   Low-latency                                 Low-latency
Blocking    Easy to reason about resource consumption   Easy to reason about resource consumption
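The question of when data is exchanged can be illustrated with a small single-threaded model. This is a sketch under simplifying assumptions, not how the Flink runtime is implemented: the class name ExchangeModes is made up, doubling a number stands in for the Map operator, and a trace of events stands in for actual data transfer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of blocking vs. pipelined data exchange; not Flink code.
public class ExchangeModes {

    // Blocking: the producer finishes the whole result before the consumer starts.
    static List<String> blocking(List<Integer> input) {
        List<String> trace = new ArrayList<>();
        List<Integer> result = new ArrayList<>(); // fully materialized intermediate result
        for (int x : input) {
            int y = x * 2;
            result.add(y);
            trace.add("produce " + y);
        }
        for (int y : result) {
            trace.add("consume " + y);
        }
        return trace;
    }

    // Pipelined: each record is handed to the consumer as soon as it is produced.
    static List<String> pipelined(List<Integer> input) {
        List<String> trace = new ArrayList<>();
        for (int x : input) {
            int y = x * 2;
            trace.add("produce " + y);
            trace.add("consume " + y);
        }
        return trace;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3);
        System.out.println("blocking:  " + blocking(input));
        System.out.println("pipelined: " + pipelined(input));
    }
}
```

In the blocking trace the first "consume" appears only after the last "produce", which is why latency is higher but resource usage is easier to reason about: producer and consumer never run at the same time.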
Page 42
Result Characteristics
Pipelined vs. Blocking
Ephemeral vs. Checkpointed
How long to keep results around?
Page 43
Ephemeral Results
[Animation: Map -> Ephemeral Result -> Reduce]
Page 60
Checkpointed Results
[Animation: Map -> Checkpointed Result -> Reduce]
Page 73
Result Characteristics
            Ephemeral                                   Checkpointed
Pipelined   Low-latency                                 Low-latency; fine-grained fault tolerance
Blocking    Easy to reason about resource consumption   Easy to reason about resource consumption; fine-grained fault tolerance
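The question of how long to keep results around can be modeled with two tiny buffer holders. This is a rough sketch with hypothetical names (ResultLifetime, EphemeralResult, CheckpointedResult); it assumes that "ephemeral" means a buffer is released once consumed and that "checkpointed" means buffers are retained so a restarted consumer can read them again. The runtime's actual bookkeeping is not shown in the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of result lifetimes; not Flink code.
public class ResultLifetime {

    // Ephemeral: a buffer is released as soon as it has been consumed.
    static class EphemeralResult {
        private final Deque<String> buffers = new ArrayDeque<>();

        void add(String buffer) { buffers.addLast(buffer); }

        // Returns the next buffer and drops it, or null if none is left.
        String consume() { return buffers.pollFirst(); }

        int retained() { return buffers.size(); }
    }

    // Checkpointed: consumed buffers are kept so a restarted consumer can replay them.
    static class CheckpointedResult {
        private final List<String> buffers = new ArrayList<>();

        void add(String buffer) { buffers.add(buffer); }

        // Reads the buffer at the given position without releasing it.
        String read(int index) { return index < buffers.size() ? buffers.get(index) : null; }

        int retained() { return buffers.size(); }
    }

    public static void main(String[] args) {
        EphemeralResult ephemeral = new EphemeralResult();
        CheckpointedResult checkpointed = new CheckpointedResult();
        for (String buffer : List.of("1101", "0101", "0100")) {
            ephemeral.add(buffer);
            checkpointed.add(buffer);
        }
        ephemeral.consume();
        checkpointed.read(0);
        System.out.println("ephemeral retains " + ephemeral.retained() + " buffers");
        System.out.println("checkpointed retains " + checkpointed.retained() + " buffers");
    }
}
```

After a consumer failure, the checkpointed variant can serve read(0) again, whereas the ephemeral variant would require the producer to recompute the data. That replay ability is what the fine-grained fault tolerance entries on the slide refer to.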
Page 74
Benefits
• Very flexible design
• Decouples high-level requirements from runtime
• Fault tolerance for batch vs. streaming
• Different program optimization paths
• Iterative programs
• Interactive queries
Page 76
Flink Stack
Libraries on DataSet: Graph: Gelly, Flink ML, Table
Libraries on DataStream: Table
DataSet (Java/Scala) | DataStream (Java/Scala)
Optimizer | Stream Builder
Flink Runtime
Page 77
Current Implementations
Pipelined vs. Blocking
Ephemeral vs. Checkpointed
Backpressure vs. No Backpressure
Page 79
Pipelined vs. Blocking in Flink
• Default type for both batch and streaming programs: Pipelined
• In batch mode only: use blocking exchange if necessary (e.g. to avoid deadlocks or break up long pipelines)
• More details: https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
Page 80
ExecutionConfig
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.java.ExecutionEnvironment;

// Set up the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ExecutionConfig conf = env.getConfig();

// Remote data exchange is blocking (local is pipelined)
conf.setExecutionMode(ExecutionMode.BATCH);

// Remote and local data exchange is blocking
conf.setExecutionMode(ExecutionMode.BATCH_FORCED);

// Remote and local data exchange is pipelined except
// when necessary to avoid deadlocks etc. [DEFAULT]
conf.setExecutionMode(ExecutionMode.PIPELINED);

// Remote and local data exchange is always pipelined
conf.setExecutionMode(ExecutionMode.PIPELINED_FORCED);
Page 81
flink.apache.org
@ApacheFlink