Data Stream Processing with Apache Flink Fabian Hueske @fhueske Apache Flink Meetup Madrid, 25.02.2016


Page 1: Data Stream Processing with Apache Flink

Data Stream Processing with Apache Flink

Fabian Hueske@fhueske

Apache Flink Meetup Madrid, 25.02.2016

Page 2: Data Stream Processing with Apache Flink

What is Apache Flink?

Apache Flink is an open source platform for scalable stream and batch processing.

2

• The core of Flink is a distributed streaming dataflow engine
• Executes dataflows in parallel on clusters
• Provides a reliable backend for various workloads
• DataStream and DataSet programming abstractions are the foundation for user programs and higher layers

Page 3: Data Stream Processing with Apache Flink

What is Apache Flink?

3

A stream processor with many faces:
• Streaming topologies
• Long batch pipelines
• Machine learning at scale
• Graph analysis
• Low-latency processing
• Mutable state
• Iterative algorithms
• Resource utilization

Page 4: Data Stream Processing with Apache Flink

History & Community of Flink

From incubation until now

4

Page 5: Data Stream Processing with Apache Flink

5

Apr ’14: enters incubation; 0.5, 0.6, 0.7 follow
Dec ’14: becomes a top-level project
Mar ’15: 0.8
Jun ’15: 0.9
Nov ’15: 0.10
Coming up: 1.0!

Page 6: Data Stream Processing with Apache Flink

Growing and Vibrant Community

Flink is one of the largest and most active Apache big data projects:
• more than 150 contributors
• more than 600 forks
• more than 1,000 GitHub stars (since yesterday)

6

Page 7: Data Stream Processing with Apache Flink

Flink Meetups around the Globe

7

Page 8: Data Stream Processing with Apache Flink

Flink Meetups around the Globe

8

Page 9: Data Stream Processing with Apache Flink

Organizations at Flink Forward

9

Page 10: Data Stream Processing with Apache Flink

The streaming era

Coming soon…

10

Page 11: Data Stream Processing with Apache Flink

11

What is Stream Processing?

Today, most data is continuously produced
• user activity logs, web logs, sensors, database transactions, …

The common approach to analyzing such data so far:
• Record the data stream to stable storage (DBMS, HDFS, …)
• Periodically analyze the data with a batch processing engine (DBMS, MapReduce, …)

Stream processing engines analyze data while it arrives.

Page 12: Data Stream Processing with Apache Flink

Why do Stream Processing?

Decreases the overall latency to obtain results
• No need to persist data in stable storage
• No periodic batch analysis jobs

Simplifies the data infrastructure
• Fewer moving parts to be maintained and coordinated

Makes the time dimension of data explicit
• Each event has a timestamp
• Data can be processed based on timestamps

12

Page 13: Data Stream Processing with Apache Flink

What are the Requirements?

Low latency
• Results in milliseconds

High throughput
• Millions of events per second

Exactly-once consistency
• Correct results in case of failures

Out-of-order events
• Process events based on their associated time

Intuitive APIs

13

Page 14: Data Stream Processing with Apache Flink

Open-Source Stream Processors so far

Either low latency or high throughput

Exactly-once guarantees only with high latency

Lacking time semantics
• Processing by wall clock time only
• Events are processed in arrival order, not in the order they were created

Shortcomings lead to complicated system designs
• Lambda architecture

14

Page 15: Data Stream Processing with Apache Flink

Stream Processing with Flink

15

Page 16: Data Stream Processing with Apache Flink

Stream Processing with Flink

Low latency
• Pipelined processing engine

High throughput
• Controllable checkpointing overhead

Exactly-once guarantees
• Distributed snapshots

Support for out-of-order streams
• Processing semantics based on event time

Programmability
• APIs similar to those known from the batch world

16

Page 17: Data Stream Processing with Apache Flink

17

Flink in Streaming Architectures

[Diagram: Flink handles data ingestion and ETL from Kafka, analytics on data in motion, and analytics on static data in HDFS; results go to Elasticsearch, HBase, Cassandra, …]

Page 18: Data Stream Processing with Apache Flink

The DataStream API

Concise and easy-to-grasp code

18

Page 19: Data Stream Processing with Apache Flink

The DataStream API

19

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }

Page 20: Data Stream Processing with Apache Flink

The DataStream API

20

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.minutes(15), Time.minutes(5))
  .sum("numVehicles")
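The `timeWindow(Time.minutes(15), Time.minutes(5))` call defines sliding windows: with a 15-minute size and a 5-minute slide, every timestamp falls into size / slide = 3 overlapping windows. A minimal plain-Scala sketch of that assignment logic (no Flink dependency; `windowsFor` is an illustrative helper, not a Flink API):

```scala
// Sliding windows of size 15 min with a 5 min slide: each timestamp
// belongs to size / slide = 3 overlapping windows (fewer near time 0).
val sizeMs  = 15 * 60 * 1000L
val slideMs =  5 * 60 * 1000L

// Start timestamps of all windows that contain `ts` (illustrative helper).
def windowsFor(ts: Long): Seq[Long] = {
  val lastStart = ts - (ts % slideMs) // latest window starting at or before ts
  (lastStart to (lastStart - sizeMs + slideMs) by -slideMs).filter(_ >= 0L)
}
```

An event at minute 16 (960,000 ms) lands in the windows starting at minutes 15, 10, and 5, which is why the sum above is re-emitted every 5 minutes over the trailing 15.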

Page 21: Data Stream Processing with Apache Flink

The DataStream API

21

case class Event(location: Location, numVehicles: Long)

val stream: DataStream[Event] = …

stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.minutes(15), Time.minutes(5))
  .sum("numVehicles")
  .keyBy("location")
  .mapWithState { (evt, state: Option[Model]) =>
    val model = state.getOrElse(new Model())
    (model.classify(evt), Some(model.update(evt)))
  }

Page 22: Data Stream Processing with Apache Flink

Event-time processing

Consistent and sound results

22

Page 23: Data Stream Processing with Apache Flink

Event-time Processing

Most data streams consist of events
• log entries, sensor data, user actions, …
• Events have an associated timestamp

Many analysis tasks are based on time
• “Average temperature every minute”
• “Count of processed parcels per hour”
• …

Events often arrive out-of-order at the processor
• Distributed sources, network delays, non-synced clocks, …

A stream processor must respect the time of events for consistent and sound results
• Most stream processors use wall clock time

23

Page 24: Data Stream Processing with Apache Flink

Event Processing

24

[Diagram: Events occur on devices → events are stored in a log (queue / log) → events are analyzed in a stream processor (stream analysis)]

Page 25: Data Stream Processing with Apache Flink

Event Processing

25

Page 26: Data Stream Processing with Apache Flink

Event Processing

26

Page 27: Data Stream Processing with Apache Flink

Event Processing

27

Page 28: Data Stream Processing with Apache Flink

Event Processing

28

Out of order!!!

First burst of events / Second burst of events

Page 29: Data Stream Processing with Apache Flink

Event Processing

29

Event-time windows

Arrival-time windows

Instant event-at-a-time processing

Flink supports out-of-order streams with event-time windows, arrival-time windows (and mixtures), plus low-latency event-at-a-time processing.

First burst of events / Second burst of events
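The difference between arrival-time and event-time windows can be sketched in a few lines of plain Scala (no Flink; the `Event` class and the one-minute tumbling windows are illustrative assumptions):

```scala
// Out-of-order input: events in the order they *arrive* at the processor.
case class Event(id: String, eventTime: Long) // event time in epoch millis

val windowMs = 60 * 1000L

val arrived = Seq(
  Event("a", 5000L),  // belongs to event-time window [0, 60000)
  Event("c", 65000L), // belongs to window [60000, 120000)
  Event("b", 59000L)  // late arrival: still belongs to [0, 60000)
)

// Event-time windowing groups each event by its own timestamp, so the
// late event "b" is counted in the first window, not the second.
def eventTimeCounts(events: Seq[Event]): Map[Long, Int] =
  events
    .groupBy(e => (e.eventTime / windowMs) * windowMs)
    .map { case (start, es) => start -> es.size }
```

An arrival-time window that closed at the one-minute mark would have missed "b"; the event-time grouping puts it where it semantically belongs.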

Page 30: Data Stream Processing with Apache Flink

Event-time Processing

Event-time processing decouples job semantics from processing speed

Analyze events from a static data store and an online stream using the same program

Semantically sound and consistent results

Details: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1

30

Page 31: Data Stream Processing with Apache Flink

Operational Features

Running Flink 24*7*52

31

Page 32: Data Stream Processing with Apache Flink

Monitoring & Dashboard

Many metrics exposed via a REST interface

Web dashboard
• Submit, stop, and cancel jobs
• Inspect running and completed jobs
• Analyze performance
• Check exceptions
• Inspect configuration
• …

32

Page 33: Data Stream Processing with Apache Flink

Highly-available Cluster Setup

Stream applications run for weeks, months, …
• Application must never fail!
• No single-point-of-failure component allowed

Flink supports highly-available cluster setups
• Master failures are resolved using Apache ZooKeeper
• Worker failures are resolved by the master

Stand-alone cluster setup
• Requires (manually started) stand-by masters and workers

YARN cluster setup
• Masters and workers are automatically restarted

33

Page 34: Data Stream Processing with Apache Flink

Save Points

A save point is a consistent snapshot of a job
• Includes source offsets and operator state
• Stop job
• Restart job from save point

What can I use it for?
• Fix or update your job
• A/B testing
• Update Flink
• Migrate cluster
• …

Details: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications

34
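Conceptually, a save point pairs source offsets with operator state, so a restarted job resumes reading exactly where the old one stopped and keeps its accumulated state. A toy sketch in plain Scala (all names here are illustrative, not Flink APIs):

```scala
// Toy model of a save point: a source offset plus operator state
// (here, per-key counts of a word-counting "job").
case class Savepoint(offset: Long, counts: Map[String, Long])

// Resume from a save point: skip everything the old job already
// consumed, then fold the remaining records into the restored state.
def resume(sp: Savepoint, log: Vector[String]): Savepoint = {
  val newCounts = log.drop(sp.offset.toInt).foldLeft(sp.counts) { (st, key) =>
    st.updated(key, st.getOrElse(key, 0L) + 1L)
  }
  Savepoint(log.size.toLong, newCounts)
}
```

Restarting from `Savepoint(2, Map("a" -> 2))` over a four-record log processes only the last two records, which is how a fixed, updated, or migrated job picks up without reprocessing or losing events.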

Page 35: Data Stream Processing with Apache Flink

Performance: Summary

35

Continuous streaming

Latency-bound buffering

Distributed snapshots

High throughput & low latency, with a configurable throughput/latency tradeoff

Details: http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink

Page 36: Data Stream Processing with Apache Flink

Integration (picture not complete)

36

[Diagram: integration points, including POSIX file systems and Java/Scala collections]

Page 37: Data Stream Processing with Apache Flink

Post v1.0 Roadmap

What’s coming next?

37

Page 38: Data Stream Processing with Apache Flink

Stream SQL and Table API

Structured queries over data streams
• LINQ-style Table API
• Stream SQL

Based on Apache Calcite
• SQL parser and optimizer

“Compute every hour the number of orders and the number of ordered units for each product.”

38

SELECT STREAM
  productId,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  COUNT(*) AS cnt,
  SUM(units) AS units
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

Page 39: Data Stream Processing with Apache Flink

Complex Event Processing

Identify complex patterns in event streams
• Correlations & sequences

Many applications
• Network intrusion detection via access patterns
• Item tracking (parcels, devices, …)
• …

CEP depends on low-latency processing
• Most CEP systems are not distributed

CEP in Flink
• Easy-to-use API to define CEP patterns
• Integration with the Table API for structured analytics
• Low-latency and high-throughput engine
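What a CEP pattern boils down to can be shown with a tiny plain-Scala sketch that scans a stream for three consecutive failed logins, a crude stand-in for an intrusion-detection pattern (illustrative only; Flink's CEP library expresses such patterns declaratively over distributed streams):

```scala
// Find the start indices of every run of three consecutive "FAIL"
// events in a stream, as a minimal sequence-detection pattern.
def threeFailures(events: Seq[String]): Seq[Int] =
  events.sliding(3).zipWithIndex.collect {
    case (window, i) if window.forall(_ == "FAIL") => i
  }.toSeq
```

A real CEP engine does the same matching incrementally as events arrive, which is why the low-latency, high-throughput engine underneath matters.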

39

Page 40: Data Stream Processing with Apache Flink

Dynamic Job Parallelism

Adjusting the parallelism of tasks without (significantly) interrupting the program

Initial version based on save points
• Trigger save point
• Stop job
• Restart job with adjusted parallelism

Later: change parallelism while the job is running

Vision: automatic adaptation based on throughput

40

Page 41: Data Stream Processing with Apache Flink

Wrap up!

Flink is a kick-ass stream processor…
• Low latency & high throughput
• Exactly-once consistency
• Event-time processing
• Support for out-of-order streams
• Intuitive API

with lots of features in the pipeline…

and a reliable batch processor as well!

41

Page 42: Data Stream Processing with Apache Flink

I ♥ Squirrels, do you?

More information at
• http://flink.apache.org/

Free Flink training at
• http://dataartisans.github.io/flink-training

Sign up for the user/dev mailing lists

Get involved and contribute

Follow @ApacheFlink on Twitter

42

Page 43: Data Stream Processing with Apache Flink

43