DATA PROCESSING WITH APACHE BEAM: TOWARDS PORTABILITY AND BEYOND

Maximilian Michels
@stadtlegende
maximilianmichels.com

Post on 28-May-2020


BEAM VISION

[Diagram: SDKs, Runners, Backends; labels: Write, Pipeline, Execute]

THE BEAM MODEL

• Unified batch and stream programming model

• Stream: Batch is just a bounded stream

• Massively parallelizable Transformations

• Event Time as an explicit concept

2015 VLDB: “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”

BEAM API

THE BEAM API

• Pipeline: An (acyclic) graph of PCollections

• [Output PCollection] = [Input PCollection].apply([Transform])

• Pipeline p = Pipeline.create(options)

• PCollection pCollection = p.apply(…).apply(…).…

• p.run()
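The pattern in the bullets above (build a graph with apply, then execute it with run) can be illustrated without Beam itself. The following is a toy sketch in plain Python, not Beam's implementation; the names ToyPipeline and ToyPCollection are hypothetical and exist only for this illustration.

```python
class ToyPCollection:
    """A deferred collection: it knows how to compute itself but holds no data yet."""

    def __init__(self, pipeline, compute):
        self.pipeline = pipeline
        self._compute = compute  # zero-argument function producing a list
        self.result = None       # filled in when the pipeline runs

    def apply(self, transform):
        # [Output PCollection] = [Input PCollection].apply([Transform])
        # Nothing is executed here; a new node is only recorded in the graph.
        return self.pipeline._add(lambda: transform(self._materialize()))

    def _materialize(self):
        if self.result is None:
            self.result = self._compute()
        return self.result


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: collects nodes, executes on run()."""

    def __init__(self):
        self.nodes = []

    def _add(self, compute):
        node = ToyPCollection(self, compute)
        self.nodes.append(node)
        return node

    def apply(self, source):
        # A root transform produces a collection from nothing (e.g. a read).
        return self._add(source)

    def run(self):
        # Execution happens only here, in the order the nodes were added.
        for node in self.nodes:
            node._materialize()


p = ToyPipeline()
words = p.apply(lambda: ["to", "be", "or", "not", "to", "be"])
upper = words.apply(lambda xs: [x.upper() for x in xs])
lengths = upper.apply(lambda xs: [len(x) for x in xs])

built_but_not_run = lengths.result is None  # True: the graph exists, no data yet
p.run()
print(lengths.result)  # [2, 2, 2, 3, 2, 2]
```

The point of the deferred design is that a runner can inspect and translate the whole graph before any element is processed.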

INPUT

Pipeline p = Pipeline.create(options);

// Read lines from a text file.
PCollection<String> input1 = p.apply(
    "ReadMyFile",
    TextIO.read().from("protocol://path/to/some/inputData.txt"));

// Create a PCollection from in-memory values.
PCollection<String> input2 = p.apply(
    Create.of("DataEngConf", "is", "awesome"));

// Read key/value records from Kafka.
PCollection<KV<Long, String>> input3 = p.apply(
    KafkaIO.<Long, String>read()
        .withBootstrapServers("broker_1:9092,broker_2:9092")
        .withTopic("my_topic")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata());  // emit KV<Long, String> rather than KafkaRecord

CORE PRIMITIVE TRANSFORMS

ParDo (Map/Reduce Phase): input -> output

"to"  -> KV<"to", 1>
"be"  -> KV<"be", 1>
"or"  -> KV<"or", 1>
"not" -> KV<"not", 1>
"to"  -> KV<"to", 1>
"be"  -> KV<"be", 1>

GroupByKey (Shuffle Phase): KV<k,v>… -> KV<k, [v…]>

KV<"to",  [1,1]>
KV<"be",  [1,1]>
KV<"or",  [1]>
KV<"not", [1]>
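The two primitives can be mimicked in a few lines of plain Python. This is a sketch of the semantics only, not of how a runner parallelizes the work; the helper names par_do and group_by_key are made up for this example.

```python
from collections import defaultdict


def par_do(elements, fn):
    """Apply fn to each element independently; fn returns zero or more outputs."""
    for element in elements:
        yield from fn(element)


def group_by_key(pairs):
    """Collect all values that share a key: (k, v)... -> (k, [v...])."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


words = ["to", "be", "or", "not", "to", "be"]
pairs = list(par_do(words, lambda word: [(word, 1)]))
grouped = group_by_key(pairs)
print(grouped)  # {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}
```

Because par_do looks at one element at a time, it can be split across workers freely; group_by_key is the step that forces data with the same key to meet in one place.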

CORE PRIMITIVE TRANSFORMS

ParDo

PCollection<String> words = ...;

PCollection<KV<String, Integer>> wordCounts = words.apply(
    "AssignWordCounts",
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @ProcessElement
      public void processElement(
          @Element String word, OutputReceiver<KV<String, Integer>> out) {
        // Emit each word paired with an initial count of 1.
        out.output(KV.of(word, 1));
      }
    }));

GroupByKey

PCollection<KV<String, Iterable<Integer>>> groupByWordCounts =
    wordCounts.apply("GroupByWords", GroupByKey.create());

COMPOSITE TRANSFORMS

Combine: (Map/) Shuffle/Reduce Phase

PCollection<KV<String, Integer>> wordCounts = …

PCollection<KV<String, Integer>> combinedWordCounts = wordCounts.apply(
    Combine.perKey(new SerializableFunction<Iterable<Integer>, Integer>() {
      @Override
      public Integer apply(Iterable<Integer> input) {
        // Sum all counts that share a key.
        int sum = 0;
        for (int count : input) {
          sum += count;
        }
        return sum;
      }
    }));

For more sophisticated combining, you can define a CombineFn.

Flatten
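As a rough illustration of what a CombineFn adds over a plain function, the sketch below mirrors its four steps (create an accumulator, add an input, merge accumulators, extract the output) in plain Python without depending on Beam. The class MeanFn and the helper combine_per_key are hypothetical names for this example; the bundles stand in for data that a runner would process on different workers.

```python
class MeanFn:
    """Mirrors the four CombineFn steps; plain Python, no Beam dependency."""

    def create_accumulator(self):
        return (0, 0)  # (sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine_per_key(bundles, fn):
    """Pre-combine each bundle locally, then merge the partial results per key."""
    partials = {}  # key -> list of per-bundle accumulators
    for bundle in bundles:
        local = {}
        for key, value in bundle:
            acc = local.get(key, fn.create_accumulator())
            local[key] = fn.add_input(acc, value)
        for key, acc in local.items():
            partials.setdefault(key, []).append(acc)
    return {
        key: fn.extract_output(fn.merge_accumulators(accs))
        for key, accs in partials.items()
    }


# Two bundles stand in for data processed on two different workers.
bundles = [
    [("a", 1), ("b", 10), ("a", 3)],
    [("a", 5), ("b", 20)],
]
means = combine_per_key(bundles, MeanFn())
print(means)  # {'a': 3.0, 'b': 15.0}
```

Keeping a (sum, count) accumulator is what lets the mean be computed partially before the shuffle; a function over the full Iterable cannot be split that way.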

MORE TRANSFORMS

• CoGroupByKey (Join)

• Flatten

• Partition

• Define your own!

• Also: Side Inputs / Multiple Outputs / State / Timers
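The first three transforms in the list above have simple semantics that can be sketched in plain Python. These helpers are illustrative stand-ins with made-up names, not Beam APIs, and the sample data is invented.

```python
from collections import defaultdict
from itertools import chain


def co_group_by_key(**tagged_pairs):
    """Join several (key, value) collections: key -> {tag: [values]}."""
    tags = list(tagged_pairs)
    joined = defaultdict(lambda: {tag: [] for tag in tags})
    for tag, pairs in tagged_pairs.items():
        for key, value in pairs:
            joined[key][tag].append(value)
    return dict(joined)


def flatten(*collections):
    """Merge several collections of the same type into one."""
    return list(chain(*collections))


def partition(elements, num_partitions, partition_fn):
    """Split one collection into a fixed number of collections."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        parts[partition_fn(element, num_partitions)].append(element)
    return parts


emails = [("amy", "amy@example.com"), ("carl", "carl@example.com")]
phones = [("amy", "111-222"), ("james", "222-333")]

joined = co_group_by_key(emails=emails, phones=phones)
merged = flatten([1, 2], [3], [4, 5])
evens_odds = partition(merged, 2, lambda x, n: x % n)

print(joined["amy"])  # {'emails': ['amy@example.com'], 'phones': ['111-222']}
print(evens_odds)     # [[2, 4], [1, 3, 5]]
```

Keys that appear in only one input still show up in the join, with an empty list for the other tag.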

PROCESSING UNBOUNDED DATA

PROCESSING (UN)BOUNDED DATA

• Streams are unbounded by nature

• Windows group data according to a windowing function

• Triggers decide when to kick off execution of windows

• Event Time is the predominant domain for windowing

• Every element has an event timestamp

• The watermark indicates how far event time has progressed
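To make the interplay of event timestamps, windows, and the watermark concrete, here is a small plain-Python sketch. It assumes a deliberately simplistic watermark (the highest event time seen so far); real runners estimate the watermark differently, so treat this as an illustration of the idea only. All names and sample values are made up.

```python
def fixed_window(event_time, size):
    """Return the [start, end) fixed window that contains event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


# (event_time, value) pairs in arrival order; arrival order != event-time order.
arrivals = [(12, "a"), (3, "b"), (61, "c"), (58, "d"), (125, "e"), (40, "f")]

WINDOW_SIZE = 60
watermark = 0   # assumption: highest event time seen so far
windows = {}    # window -> values that arrived before the watermark passed it
late = []       # elements whose window the watermark had already passed

for event_time, value in arrivals:
    window = fixed_window(event_time, WINDOW_SIZE)
    if window[1] <= watermark:
        late.append((event_time, value))
    else:
        windows.setdefault(window, []).append(value)
    watermark = max(watermark, event_time)

print(windows)  # {(0, 60): ['a', 'b'], (60, 120): ['c'], (120, 180): ['e']}
print(late)     # [(58, 'd'), (40, 'f')]
```

Elements "d" and "f" belong to the first window by event time but arrive after the watermark has moved past it, which is exactly the case that triggers and allowed lateness have to handle.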

EVENT VS PROCESSING TIME

[Figure: Event Time (vertical axis, 0 to 60) plotted against Processing Time (horizontal axis, 0 to 60); legend: Input, Ideal, Early, Late. Visualization by Frances Perry and Tyler Akidau.]

EVENT VS PROCESSING TIME: STAR WARS

PROCESSING TIME (release year): These episodes appeared out-of-order

Episode  IV    V     VI    I     II    III   VII   VIII  IX
Year     1977  1980  1983  1999  2002  2005  2015  2017  2019

EVENT TIME (episode order): That’s much better!

Episode  I     II    III   IV    V     VI    VII   VIII  IX
Year     1999  2002  2005  1977  1980  1983  2015  2017  2019

THE BEAM MODEL: A SIMPLE PIPELINE

What, Where, When, How

PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

A SIMPLE PIPELINE: WINDOWING

What, Where, When, How

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());

Window Types
• Global Window
• Fixed Time Windows
• Sliding Time Windows
• Per-Session Windows
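Fixed windows assign each element to exactly one window. The other two time-based types behave differently, which the following plain-Python sketch tries to show: a sliding window assigns an element to several overlapping windows, and session windows grow and merge depending on the data. The function names and the integer timestamps are assumptions made for this example.

```python
def sliding_windows(event_time, size, period):
    """All [start, end) windows of length size, starting every period, containing event_time."""
    last_start = event_time - (event_time % period)
    # Walk backwards while the window still covers event_time (start + size > event_time).
    starts = range(last_start, event_time - size, -period)
    return sorted((start, start + size) for start in starts)


def session_windows(event_times, gap):
    """Merge timestamps into sessions that are closed by `gap` units of inactivity."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t < sessions[-1][1]:
            # Falls inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], t + gap)
        else:
            # Start a new session.
            sessions.append((t, t + gap))
    return sessions


print(sliding_windows(7, size=10, period=5))        # [(0, 10), (5, 15)]
print(session_windows([1, 3, 20, 22, 50], gap=10))  # [(1, 13), (20, 32), (50, 60)]
```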

A SIMPLE PIPELINE: TRIGGERING

What, Where, When, How

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // Fire once the watermark passes the end of the window.
        // Note: Beam also requires an allowed lateness and an accumulation mode when a trigger is set.
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey());

Triggers
• Event time
• Processing time
• Data-driven
• Composite triggers

A SIMPLE PIPELINE: TRIGGERING

What, Where, When, How

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // Early firings: one minute (processing time) after the first element of a pane.
        // Late firings: once for every late element.
        // Note: Beam also requires an allowed lateness and an accumulation mode when a trigger is set.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1))))
    .apply(Sum.integersPerKey());

Triggers
• Event time
• Processing time
• Data-driven
• Composite triggers

A SIMPLE PIPELINE: REFINEMENT

What, Where, When, How

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Each firing emits the refined total for the window so far.
        // Note: Beam also requires .withAllowedLateness(...) when a trigger is set.
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
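What accumulating panes mean for the emitted results is easiest to see next to the alternative, discarding panes. The plain-Python sketch below sums one window over three firings in both modes; the function name fire_panes and the sample values are made up for illustration.

```python
def fire_panes(firings, accumulating):
    """Sum the values of each trigger firing for a single window.

    firings is a list of lists: the new values seen since the previous firing.
    """
    panes = []
    running_total = 0
    for new_values in firings:
        if accumulating:
            running_total += sum(new_values)
            panes.append(running_total)    # pane refines the previous result
        else:
            panes.append(sum(new_values))  # pane covers only the new values
    return panes


firings = [[5, 7], [3], [4, 3]]  # e.g. an early, an on-time, and a late firing
accumulating_panes = fire_panes(firings, accumulating=True)
discarding_panes = fire_panes(firings, accumulating=False)

print(accumulating_panes)  # [12, 15, 22]
print(discarding_panes)    # [12, 3, 7]
```

With accumulating panes a consumer can simply overwrite the previous value; with discarding panes it has to add the panes up itself to get the same final total of 22.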

JAVA VS PYTHON

Python

scores = (input
    | WindowInto(FixedWindows(120),
                 trigger=AfterWatermark(early=AfterProcessingTime(60),
                                        late=AfterCount(1)),
                 accumulation_mode=AccumulationMode.ACCUMULATING)
    | CombinePerKey(sum))

Java

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

EXECUTE WITH CHOICE OF RUNNER

input = pipeline | ReadFromText("/path/to/text*") | Map(lambda line: ...)

scores = (input
    | WindowInto(FixedWindows(120),
                 trigger=AfterWatermark(early=AfterProcessingTime(60),
                                        late=AfterCount(1)),
                 accumulation_mode=AccumulationMode.ACCUMULATING)
    | CombinePerKey(sum))

scores | WriteToText("/path/to/outputs")

MyRunner().run(pipeline)

RUNNERS

• Google Cloud Dataflow
• Apache Flink
• Apache Spark
• Apache Apex
• Alibaba JStorm
• Direct
• Apache Gearpump
• Hadoop MapReduce
• Apache Samza
• Apache Storm
• IBM Streams

[Some runners on the slide are labeled WIP.]

PROBLEMS WITH THIS APPROACH

• All execution backends written in Java

• Can Python run on top of Java?

• N Runners, M languages => N*M translation paths?

• Submission Flow

PORTABILITY

THE BEAM VISION

- Portability in Beam means Pipelines can be written and executed in any supported SDK (Java/Python/Go)

- Pipelines also contain language-specific code (e.g. map/reduce functions)

- Libraries of the language can be used (!)

[Diagram: the SDKs (Beam Go, Beam Java, Beam Python) hand a Pipeline to the runners via the Runner API; the runners (Apache Flink, Apache Spark, Cloud Dataflow) drive Execution via the Fn API]

WITHOUT PORTABILITY

[Diagram: SDK, Runner, and Backend (e.g. Flink) with Task 1, Task 2, Task 3, …, Task N; everything is marked language-specific]

All components are tied to a single language.

WITH PORTABILITY

[Diagram: the SDK (language-specific) submits a portable job through the Job API and Runner API to the Job Server and Runner (language-agnostic); the Backend (e.g. Flink) runs Task 1, Task 2, Task 3, …, Task N, which talk to the SDK Harnesses (language-specific) over the Fn API]

WHAT IS THE STATE OF THE BEAM VISION?

• Implementing |Runners| * |SDKs| translation paths is not feasible

• Instead, each SDK only implements a Portable Runner

• Flink Runner is the first OSS Runner to work with the Portable Runner

• Cross-language pipelines in the future

• Overhead has been measured to be 5-10%, could be less in real-world scenarios

DEMO TIME

HOW TO GET INVOLVED

• Visit beam.apache.org

• Documentation

• Examples

• Subscribe to the mailing lists:

user@beam.apache.org

dev@beam.apache.org

• Join the ASF Slack channel #beam

• Follow @ApacheBeam
