apache gearpump next-gen streaming engine
TRANSCRIPT
Apache Gearpump next-gen streaming engineManu Zhang, [email protected] Zhong, [email protected]
What is Gearpump ?
● An Akka[2] based real-time streaming engine ● An Apache Incubator[1] project since Mar.8th, 2016● A super simple pump that consists of only two gears
but very powerful at streaming water
Gearpump Architecture
● Actor concurrency● Message passing
communication● error handling and
isolation with supervision hierarchy
● Master HA with Akka Cluster
Why Gearpump ?
Stream processing is hard but no system meets all requirements
● Infinite data● Out-of-order data● Low latency assurance (e.g real-time recommendation)● Correctness requirement (e.g. charge advertisers for ads) ● Cheap to update user logics
How Gearpump solves the hard parts
● User interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG
User Interface - DAG API
Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~> HashPartitioner ~> AnomalyDetector, AnomalyDetector ~> ShufflePartitioner ~> KafkaSink)
partitioner
User Interface - Partitioner
Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~ HashPartitioner ~> AnomalyDetector, AnomalyDetector ~ ShufflePartitioner ~> KafkaSink)
WindowCounter(parallelism = 3)
ModelCalculator(parallelism = 2)
Task
Task
Task
Task
Task
User Interface - Compose DAG
How Gearpump solves the hard parts
● User interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG
Flow Control
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
Without Flow Control - messages queue up
KafkaSource WindowCounter ModelCalculator
fast fast very slow
Without Flow Control - OOM
KafkaSource WindowCounter ModelCalculator
fast fast very slow
With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
fast slow down very slowbackpressure
With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
Slow down Slow down Very Slowbackpressurebackpressure
pull slower
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG
Out-of-order data
Event time321
Proc
essi
ng t
ime
● Event time - when data generated● Processing time - when data processed
1
2
3
4
56
Window Count
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
When can window counts be emitted ?
window messages count
[0, 2) 1
[2, 4)2
[4, 6)2
(“gearpump”, 1)
WindowCounter In-memory Table
● No window can be emitted since message as early as time 1 has not arrived
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
On Watermark[4]
Event time321
Proc
essi
ng t
ime
1
2
3
4
56 watermark
● No timestamp earlier than watermark will be seen
Out-of-order processing with watermark
window messages count
[0, 2) 1
[2, 4)2
[4, 6)2
(“gearpump”, 1)
WindowCounter In-memory Table
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
watermark = 0,No window can be emitted
Out-of-order processing with watermark
window messages count
[0, 2) 1
[2, 4)2
[4, 6)2
(“gearpump”, [
0, 2), 1
)
WindowCounter In-memory Table
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
watermark = 2,Window [0, 2) can be emitted
(“gearpump”, [0, 2), 1)
How to get watermark ?
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
From upstream
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
watermark
More on Watermark
● Source watermark defined by user● Usually heuristic based ● Users decide whether to drop data arriving after
watermark
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG
Exactly Once - no Lost or Duplicated States
● Lost states can cause anomaly words skipped
● Duplicated states can cause false alarm
States
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
kafka_offset window_counts
models
Exactly Once with asynchronous checkpointing
(2, kafka_offset)
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 0 Watermark = 0
(2, kafka_offset)
(2, kafka_offset)(2, window_counts)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 0
(2, window_counts)
(2, kafka_offset)(2, window_counts)
(2, models)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
(2, kafka_offset)(2, window_counts)
(2, models)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
(2, kafka_offset)(2, window_counts)
(2, models)
Crash
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
(2, kafka_offset)(2, window_counts)
(2, models)
Recover to latest checkpoint at 2
KafkaSource WindowCounter ModelCalculator
modelswindow_countskafka_offset
Replay from kafka
Get state at 2
More on Exactly Once
● User extends PersistentState interface and system checkpoints state for users automatically
● Watermark is persisted on master
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG
Can we update model algorithm in a cheap way ?
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSinkThe model algorithm doesn’t perform well
Dynamic DAG - modify application at runtime
Performance
Yahoo! Streaming Benchmark[5]
Yahoo! Streaming Benchmark
● 4 node cluster
● Each node has
32-core, 64 GB
memory
● 10GB Ethernet
Advanced features
DAG Visualization
● Watermark● Node size reflects throughput● Edge width represents flow rate
DAG Visualization
● Data skew analysis
Apache Beam[6] Gearpump Runner
Beam Model: Fn Runners
Apache Flink
Apache Spark
Beam Model: Pipeline Construction
OtherLanguagesBeam Java
Beam Python
Execution Execution
Cloud Dataflow
Execution
Apache Gearpump
Experimental features
● Web UI Authorization / OAuth2 Authentication● CGroup Resource Isolation● Binary Storm compatibility● Akka Streams integration
Summary
● Gearpump is good at streaming infinite out-of-order data and guarantees correctness
● Gearpump helps users to easily program streaming applications, get runtime information and update dynamically
References
1. gearpump.apache.org2. akka.io3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-ti
me-dagprocessing-with-akka-at-scale4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akid
au.pdf5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streami
ng-computation-engines-at6. Apache Beam [project overview]
Backup Slides
What is Akka
● Message driven● Shared nothing● Location transparent● Supervision based
failure management● High performance
Actor
ActorActor
Actor Actor
spawn