Continuous Processing with Apache Flink - Strata London 2016

Stephan Ewen @stephanewen Continuous Processing with Apache Flink

Upload: stephan-ewen

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Continuous Processing with Apache Flink - Strata London 2016

Stephan Ewen @stephanewen

Continuous Processing with Apache Flink

Page 2: Continuous Processing with Apache Flink - Strata London 2016

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Page 3: Continuous Processing with Apache Flink - Strata London 2016

Continuous Apps before Streaming

[Diagram: along a time axis, a Scheduler runs Job 1 on file 1, Job 2 on file 2, and Job 3 on file 3; results go to Serving]

Page 4: Continuous Processing with Apache Flink - Strata London 2016

Continuous Apps with Lambda

[Diagram: a Scheduler runs Job 1 on file 1 and Job 2 on file 2; a Streaming job runs alongside; results go to Serving]

Page 5: Continuous Processing with Apache Flink - Strata London 2016

Continuous Apps with Streaming

[Diagram: collect → log → analyze → serve & store]

Page 6: Continuous Processing with Apache Flink - Strata London 2016

Continuous Data Sources

[Diagram: a log with two partitions, read in three ways]

• Process a period of historic data
• Process latest data with low latency (tail of the log)
• Reprocess stream (historic data first, catches up with real-time data)

Page 7: Continuous Processing with Apache Flink - Strata London 2016

Continuous Data Sources

Stream of events in Apache Kafka partitions
[Diagram: two partitions]

Stream view over sequence of files
[Diagram: hourly files 2016-3-1 12:00 am, 2016-3-1 1:00 am, 2016-3-1 2:00 am, 2016-3-11 10:00 pm, 2016-3-11 11:00 pm, 2016-3-12 12:00 am, 2016-3-12 1:00 am, 2016-3-12 2:00 am, 2016-3-12 3:00 am, …]

Page 8: Continuous Processing with Apache Flink - Strata London 2016

Continuous Processing

Time State

Page 9: Continuous Processing with Apache Flink - Strata London 2016

Enter Apache Flink

Page 10: Continuous Processing with Apache Flink - Strata London 2016

Apache Flink Stack

DataStream API: Stream Processing
DataSet API: Batch Processing
Runtime: Distributed Streaming Data Flow
Libraries

Streaming and batch as first class citizens.

Page 11: Continuous Processing with Apache Flink - Strata London 2016

Programs and Dataflows

Source:
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

Transformation:
val events: DataStream[Event] = lines.map((line) => parse(line))

Transformation:
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

Sink:
stats.addSink(new RollingSink(path))

[Diagram: Streaming Dataflow with parallel operator instances: Source [1] and [2], map() [1] and [2], keyBy()/window()/apply() [1] and [2], Sink [1]]
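To illustrate how keyBy() can spread records over parallel operator instances, here is a minimal plain-Scala sketch. It assumes a simplified rule (hash of the key modulo the parallelism); the names `KeyedRouting`, `instanceFor`, and `route` are hypothetical, and Flink's actual key-to-instance mapping differs in detail.

```scala
object KeyedRouting {
  // Simplified assumption for illustration: a non-negative hash of the key,
  // modulo the number of parallel instances. Not Flink's real mapping.
  def instanceFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Groups keys by the instance they would be routed to.
  def route(keys: Seq[String], parallelism: Int): Map[Int, Seq[String]] =
    keys.groupBy(k => instanceFor(k, parallelism))
}
```

The point of the sketch is that all records with the same key reach the same instance, so per-key windows and state can live on one instance.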

Page 12: Continuous Processing with Apache Flink - Strata London 2016

What makes Flink flink?

True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)

Event Time
• Make more sense of data
• Works on real-time and historic data

Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state

APIs, Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing

Page 13: Continuous Processing with Apache Flink - Strata London 2016

(It's) About Time

Page 14: Continuous Processing with Apache Flink - Strata London 2016

Different Notions of Time

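[Diagram: Event Producer → Message Queue (partition 1, partition 2) → Flink Data Source → Flink Window Operator]

• Event Time: at the event producer
• Storage Ingestion Time: at the message queue
• Stream Processor Ingestion Time: at the Flink data source
• Window Processing Time: at the Flink window operator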

Page 15: Continuous Processing with Apache Flink - Strata London 2016

Event Time vs. Processing Time

Processing Time    Event Time
1977               Episode IV
1980               Episode V
1983               Episode VI
1999               Episode I
2002               Episode II
2005               Episode III
2015               Episode VII

Page 16: Continuous Processing with Apache Flink - Strata London 2016

Batch: Implicit Treatment of Time

Time is treated outside of your application. Data is grouped by storage ingestion time.

[Diagram: three 1h batch jobs feeding a Serving Layer]

Page 17: Continuous Processing with Apache Flink - Strata London 2016

Streaming: Windows

Aggregates on streams are scoped by windows

• Time-driven, e.g. last X minutes
• Data-driven, e.g. last X records

[Diagram: windows along a time axis]
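The two kinds of windows can be contrasted with a small plain-Scala sketch. It uses only the standard library, not the Flink API; `Reading`, `averageOverLastMillis`, and `averageOverLastRecords` are hypothetical names, and the readings are assumed to be held in arrival order.

```scala
object WindowKinds {
  // Hypothetical event type: a timestamp in milliseconds and a value.
  final case class Reading(timestampMs: Long, value: Double)

  // Time-driven: average over the readings in the interval (nowMs - spanMs, nowMs].
  def averageOverLastMillis(readings: Seq[Reading], nowMs: Long, spanMs: Long): Option[Double] = {
    val inWindow = readings.filter(r => r.timestampMs > nowMs - spanMs && r.timestampMs <= nowMs)
    if (inWindow.isEmpty) None else Some(inWindow.map(_.value).sum / inWindow.size)
  }

  // Data-driven: average over the last `count` readings, whenever they occurred.
  def averageOverLastRecords(readings: Seq[Reading], count: Int): Option[Double] = {
    val inWindow = readings.takeRight(count)
    if (inWindow.isEmpty) None else Some(inWindow.map(_.value).sum / inWindow.size)
  }
}
```

A time-driven window may contain any number of records, including none; a data-driven window always covers a fixed number of records but an arbitrary span of time.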

Page 18: Continuous Processing with Apache Flink - Strata London 2016

Streaming: Windows

[Diagram: a window sliding along a time axis]

"Average over the last 5 minutes"

Page 19: Continuous Processing with Apache Flink - Strata London 2016

Event Time Windows

Event Time Windows reorder the events to their Event Time order
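A minimal plain-Scala sketch of this idea, assuming tumbling windows aligned to time 0 and non-negative timestamps. The `Event` class matches the one used on the following slides; `windowStart` and `assign` are hypothetical helpers, not Flink API.

```scala
object EventTimeWindows {
  final case class Event(id: String, measure: Double, timestamp: Long)

  // Start of the tumbling window of length sizeMs that contains the timestamp.
  // Assumes timestamp >= 0.
  def windowStart(timestamp: Long, sizeMs: Long): Long =
    timestamp - (timestamp % sizeMs)

  // Assigns events to windows by the timestamp they carry and sorts each
  // window's content by that timestamp, whatever the arrival order was.
  // (Events with equal timestamps keep their arrival order.)
  def assign(arrivals: Seq[Event], sizeMs: Long): Map[Long, Seq[Event]] =
    arrivals
      .groupBy(e => windowStart(e.timestamp, sizeMs))
      .map { case (start, es) => start -> es.sortBy(_.timestamp) }
}
```

Feeding the same events in a different arrival order yields the same windows, which is why event-time results are stable across re-executions.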

Page 20: Continuous Processing with Apache Flink - Strata London 2016

Processing Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
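The two-argument time window above is a sliding window: 15 seconds long, advancing every 5 seconds, so each event falls into three overlapping windows. A plain-Scala sketch of that assignment, assuming windows aligned to time 0 and non-negative timestamps (`SlidingWindows`, `Window`, and `windowsFor` are hypothetical names, not Flink classes):

```scala
object SlidingWindows {
  // Half-open window [start, end) in milliseconds.
  final case class Window(start: Long, end: Long)

  // All windows of length sizeMs, advancing every slideMs, that contain the
  // timestamp. Assumes timestamp >= 0 and windows aligned to time 0.
  def windowsFor(timestamp: Long, sizeMs: Long, slideMs: Long): List[Window] = {
    val lastStart = timestamp - (timestamp % slideMs)
    Iterator
      .iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > timestamp - sizeMs)
      .map(start => Window(start, start + sizeMs))
      .toList
  }
}
```

For example, an event at 12 seconds belongs to the windows starting at 10, 5, and 0 seconds.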

Page 21: Continuous Processing with Apache Flink - Strata London 2016

Ingestion Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Page 22: Continuous Processing with Apache Flink - Strata London 2016

Event Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Page 23: Continuous Processing with Apache Flink - Strata London 2016

The Power of Event Time

Batch Processors: Event-time in ingestion-time batches
• Stable across re-executions
• Wrong grouping at batch boundaries

Traditional Stream Processors: Processing time
• Results depend on when the program runs (different on re-execution)
• Results affected by network speed and delays

Event-Time Stream Processors: Event time
• Stable across re-executions
• No incorrect results at batch boundaries

Page 24: Continuous Processing with Apache Flink - Strata London 2016

The Power of Event Time

Batch Processors: Event-time in ingestion-time batches (mix of data-driven and wall clock time)
• Stable across re-executions
• Wrong grouping at batch boundaries

Traditional Stream Processors: Processing time (purely wall clock time)
• Results depend on when the program runs (different on re-execution)
• Results affected by network speed and delays

Event-Time Stream Processors: Event time (purely data-driven time)
• Stable across re-executions
• No incorrect results at batch boundaries
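The "wrong grouping at batch boundaries" point can be made concrete with a plain-Scala sketch. It assumes each event also records the time at which storage received it; `Stored`, `ingestedAt`, and both counting functions are hypothetical names introduced for illustration.

```scala
object BatchBoundaries {
  final case class Event(id: String, measure: Double, timestamp: Long)
  // An event plus the (hypothetical) time at which storage received it.
  final case class Stored(event: Event, ingestedAt: Long)

  // Start of the bucket of length sizeMs containing the time (time >= 0).
  private def bucket(time: Long, sizeMs: Long): Long = time - (time % sizeMs)

  // Batch style: events are grouped by the hour in which they were ingested.
  def countByIngestionHour(stored: Seq[Stored], hourMs: Long): Map[Long, Int] =
    stored.groupBy(s => bucket(s.ingestedAt, hourMs)).map { case (h, ss) => h -> ss.size }

  // Event-time style: events are grouped by the hour in which they occurred.
  def countByEventHour(stored: Seq[Stored], hourMs: Long): Map[Long, Int] =
    stored.groupBy(s => bucket(s.event.timestamp, hourMs)).map { case (h, ss) => h -> ss.size }
}
```

An event that occurs shortly before the full hour but is ingested shortly after it is counted in the wrong hour by the first function and in the right one by the second.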

Page 25: Continuous Processing with Apache Flink - Strata London 2016

Event Time Progress: Watermarks

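[Diagram: two event streams with watermarks; each box is an event labeled with its event timestamp, and W(t) is a watermark]

Stream (in order): event timestamps 7, 9, 9, 10, 11, 14, 15, 17, 18, 19, 20, 21, 23; watermarks W(11), W(20)

Stream (out of order): event timestamps 7, 11, 15, 9, 12, 14, 17, 12, 22, 20, 17, 19, 21; watermarks W(11), W(17)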
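How a watermark lets an event-time window close can be sketched in plain Scala. The sketch assumes one simple strategy: the watermark trails the highest timestamp seen so far by a fixed bound, and a tumbling window is emitted once the watermark reaches its end. The names `Watermarks`, `run`, and `maxOutOfOrderness` are hypothetical, and the exact conventions are chosen for the sketch and differ in detail from Flink's.

```scala
object Watermarks {
  final case class Event(id: String, measure: Double, timestamp: Long)

  // Replays events in arrival order (timestamps assumed >= 0). After each event
  // the watermark is (highest timestamp seen so far - maxOutOfOrderness).
  // A tumbling window [start, start + size) is emitted with its sum once the
  // watermark reaches its end. Events for an already emitted window are dropped.
  def run(events: Seq[Event], size: Long, maxOutOfOrderness: Long): List[(Long, Double)] = {
    var open = Map.empty[Long, Double] // window start -> running sum
    var maxTs = Long.MinValue
    var watermark = Long.MinValue
    val out = List.newBuilder[(Long, Double)]
    for (e <- events) {
      val start = e.timestamp - (e.timestamp % size)
      if (start + size > watermark)
        open = open.updated(start, open.getOrElse(start, 0.0) + e.measure)
      maxTs = math.max(maxTs, e.timestamp)
      watermark = maxTs - maxOutOfOrderness
      val (done, pending) = open.partition { case (s, _) => s + size <= watermark }
      out ++= done.toList.sortBy(_._1)
      open = pending
    }
    out.result()
  }
}
```

A larger bound tolerates more disorder but delays results; a bound of zero emits windows as early as possible and drops any event that arrives behind the watermark.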

Page 26: Continuous Processing with Apache Flink - Strata London 2016

Bounding the Latency for Results

• Triggering on combinations of Event Time and Processing Time
• See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incub.)
• Concepts apply almost 1:1 to Apache Flink
• Syntax varies

Page 27: Continuous Processing with Apache Flink - Strata London 2016

Matters of State

Page 28: Continuous Processing with Apache Flink - Strata London 2016

Batch vs. Continuous

Batch Jobs
• No state across batches
• Fault tolerance within a job
• Re-processing starts empty

Continuous Programs
• Continuous state across time
• Fault tolerance guards state
• Reprocessing starts stateful

Page 29: Continuous Processing with Apache Flink - Strata London 2016

Continuous State

[Diagram: sessions over a time axis]

Sessions over time

No stateless point in time

Page 30: Continuous Processing with Apache Flink - Strata London 2016

Re-processing data (in batch)

[Diagram: hourly files on 2016-3-1: 12:00 am, 1:00 am, 2:00 am, 3:00 am, 4:00 am, 5:00 am, 6:00 am, 7:00 am]

Page 31: Continuous Processing with Apache Flink - Strata London 2016

Re-processing data (in batch)

[Diagram: hourly files on 2016-3-1: 12:00 am, 1:00 am, 2:00 am, 3:00 am, 4:00 am, 5:00 am, 6:00 am, 7:00 am]

Wrong / corrupt results

Page 32: Continuous Processing with Apache Flink - Strata London 2016

Streaming: Savepoints

[Diagram: Savepoint A and Savepoint B on the application's timeline]

Globally consistent point-in-time snapshot of the streaming application

Page 33: Continuous Processing with Apache Flink - Strata London 2016

Re-processing data (continuous)

Savepoint A

Page 34: Continuous Processing with Apache Flink - Strata London 2016

Re-processing data (continuous)

Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)

Reprocess by starting a new job from a savepoint
• Defines start position in stream (for example Kafka offsets)
• Initializes pending state (like partial sessions)

[Diagram: Savepoint → Run new streaming program from savepoint]
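The two roles of a savepoint listed above (start position plus pending state) can be modeled in a few lines of plain Scala. This is a toy model, not Flink's savepoint mechanism: `Savepoint`, `offset`, `counts`, and `runFrom` are hypothetical names, and the "stream" is an in-memory vector of keys being counted.

```scala
object Savepoints {
  // Toy model: a savepoint holds the read position in the input and the
  // pending per-key state (here: running counts).
  final case class Savepoint(offset: Int, counts: Map[String, Long])

  val empty: Savepoint = Savepoint(0, Map.empty)

  // Processes input records from the savepoint's offset up to (excluding)
  // `until` and returns the resulting snapshot.
  def runFrom(sp: Savepoint, input: Vector[String], until: Int): Savepoint = {
    val counts = input.slice(sp.offset, until).foldLeft(sp.counts) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }
    Savepoint(math.max(sp.offset, math.min(until, input.length)), counts)
  }
}
```

Starting a new run from an intermediate savepoint gives the same final state as processing from the beginning, because the run resumes at the saved offset with the saved counts rather than empty.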

Page 35: Continuous Processing with Apache Flink - Strata London 2016

Forking and Versioning Applications

[Diagram: four savepoints and three applications: App. A, App. B, App. C]

Page 36: Continuous Processing with Apache Flink - Strata London 2016

Conclusion

Page 37: Continuous Processing with Apache Flink - Strata London 2016

Wrap up

Streaming is the architecture for continuous processing

Continuous processing makes data applications
• Simpler: Fewer moving parts
• More correct: No broken state at any boundaries
• More flexible: Reprocess data and fork applications via savepoints

Requires a powerful stream processor, like Apache Flink

Page 38: Continuous Processing with Apache Flink - Strata London 2016

Upcoming Features

• Dynamic Scaling, Resource Elasticity
• Stream SQL
• CEP enhancements
• Incremental & asynchronous state snapshotting
• Mesos support
• More connectors, end-to-end exactly once
• API enhancements (e.g., joins, slowly changing inputs)
• Security (data encryption, Kerberos with Kafka)

Page 39: Continuous Processing with Apache Flink - Strata London 2016

What makes Flink flink?

True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)

Event Time
• Make more sense of data
• Works on real-time and historic data

Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state

APIs, Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing

Page 40: Continuous Processing with Apache Flink - Strata London 2016

Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016

www.flink-forward.org

Page 41: Continuous Processing with Apache Flink - Strata London 2016

We are hiring!
data-artisans.com/careers