Streaming Dataflow with Apache Flink

Ufuk Celebi, HUG London, October 15, 2015: Streaming Data Flow with Apache Flink

Upload: huguk

Post on 14-Apr-2017

Page 1: Streaming Dataflow with Apache Flink

Ufuk Celebi

HUG London October 15, 2015

Streaming Data Flow with Apache Flink

Page 2: Streaming Dataflow with Apache Flink

Recent History

Timeline: April ‘14 · December ‘14 · April ‘15

Milestones: Project Incubation · Top Level Project

Releases: v0.5 · v0.6 · v0.7 · v0.8 · v0.9

Currently moving towards 0.10 and 1.0 release.

Page 3: Streaming Dataflow with Apache Flink

What is Flink?

Streaming Topologies: Low Latency (figure: Stream, TimeWindow, Count)

Long Batch Pipelines: Resource Utilization

Machine Learning: Iterative Algorithms (figure: Rating Matrix = User Matrix × Item Matrix)

Graph Analysis: Mutable State (figure: weighted graph)

Page 4: Streaming Dataflow with Apache Flink

Overview

Deployment Local (Single JVM) · Cluster (Standalone, YARN)

DataStream API Unbounded Data

DataSet API Bounded Data

Runtime Distributed Streaming Data Flow

Libraries Machine Learning · Graph Processing · SQL-like API

Page 5: Streaming Dataflow with Apache Flink

Stream Processing

Real-world data is unbounded and is pushed to systems.

Batch · Streaming

Page 6: Streaming Dataflow with Apache Flink

Stream Platform Architecture

Server Logs

Trxn Logs

Sensor Logs

Downstream Systems

Flink
– Analyze and correlate streams
– Create derived streams

Kafka
– Gather and back up streams
– Offer streams

Page 7: Streaming Dataflow with Apache Flink

Cornerstones of Flink

Low Latency for fast results.

High Throughput to handle many events per second.

Exactly-once guarantees for correct results.

Intuitive APIs for productivity.

Page 8: Streaming Dataflow with Apache Flink

DataStream API

StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();

DataStream<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);

// DataStream Windowed WordCount
DataStream<Tuple2<String, Integer>> counts = data
    .flatMap(new SplitByWhitespace()) // (word, 1)
    .keyBy(0) // [word, [1, 1, …]] for 10 seconds
    .timeWindow(Time.of(10, TimeUnit.SECONDS))
    .sum(1); // sum per word per 10 second window

counts.print();

env.execute();
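To make the semantics of the windowed word count concrete, here is a minimal plain-Java sketch of what "split, key by word, sum per 10 second window" computes. It does not use the Flink API: the class `TumblingWordCount`, the `TimedLine` type, and the explicit millisecond timestamps are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of tumbling-window word count semantics (NOT Flink API).
final class TumblingWordCount {

    /** A line of text tagged with an arrival time in milliseconds (hypothetical type). */
    static final class TimedLine {
        final long timestampMs;
        final String text;

        TimedLine(long timestampMs, String text) {
            this.timestampMs = timestampMs;
            this.text = text;
        }
    }

    /** Returns windowStart -> (word -> count); timestamps are assumed to be non-negative. */
    static Map<Long, Map<String, Integer>> count(List<TimedLine> lines, long windowSizeMs) {
        Map<Long, Map<String, Integer>> windows = new TreeMap<>();
        for (TimedLine line : lines) {
            // Tumbling windows do not overlap: each element belongs to exactly one window.
            long windowStart = line.timestampMs - (line.timestampMs % windowSizeMs);
            Map<String, Integer> counts =
                windows.computeIfAbsent(windowStart, k -> new TreeMap<>());
            // "flatMap" step: split by whitespace, one (word, 1) per token.
            for (String word : line.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // "keyBy + sum" step: add 1 to this word's counter in this window.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        List<TimedLine> input = Arrays.asList(
            new TimedLine(1_000, "O Romeo, Romeo! wherefore art thou Romeo?"),
            new TimedLine(12_000, "Romeo? Romeo?"));
        System.out.println(count(input, 10_000));
    }
}
```

Because the split is on whitespace only, punctuation stays attached to the tokens, so "Romeo," and "Romeo?" are counted as different words.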

Page 18: Streaming Dataflow with Apache Flink

Pipelining

Source (s1, s2) · Tokenizer (t1, t2) · Window Count (w1, w2)

Complete pipeline online concurrently.

Chained tasks · Pipelined shuffle
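As a rough illustration of "complete pipeline online concurrently", the sketch below runs a source, a tokenizer, and a counter at the same time, connected by small bounded queues, so every stage works on data while upstream stages are still producing. This is plain Java, not Flink's runtime; the class name `PipelinedWordCount`, the end-of-stream marker, and the queue capacity of 4 are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Three stages running concurrently and exchanging data through bounded queues.
final class PipelinedWordCount {

    // End-of-stream marker (hypothetical convention for this sketch).
    private static final String END = "\u0000END";

    private static void put(BlockingQueue<String> queue, String item) {
        try {
            queue.put(item); // blocks while the queue is full (back pressure)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static String take(BlockingQueue<String> queue) {
        try {
            return queue.take(); // blocks while the queue is empty
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    static Map<String, Integer> run(List<String> lines) {
        BlockingQueue<String> sourceToTokenizer = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> tokenizerToCounter = new ArrayBlockingQueue<>(4);

        Thread source = new Thread(() -> {
            for (String line : lines) {
                put(sourceToTokenizer, line);
            }
            put(sourceToTokenizer, END);
        });

        Thread tokenizer = new Thread(() -> {
            for (String line = take(sourceToTokenizer); !line.equals(END);
                    line = take(sourceToTokenizer)) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        put(tokenizerToCounter, word);
                    }
                }
            }
            put(tokenizerToCounter, END);
        });

        source.start();
        tokenizer.start();

        // The counter runs on the calling thread while the other stages are still active.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word = take(tokenizerToCounter); !word.equals(END);
                word = take(tokenizerToCounter)) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b a"))); // prints {a=3, b=2}
    }
}
```

The bounded queues mean a slow consumer slows its producer down instead of letting intermediate data pile up, which is one reason a pipelined exchange needs no full materialization between stages.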

Page 21: Streaming Dataflow with Apache Flink

Streaming Fault Tolerance

At Least Once
• Ensure that all operators see all events.

Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.

Flink guarantees exactly-once processing.
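The difference between the two guarantees can be shown with a toy model: an operator sums events, a checkpoint records the source offset, and after a failure the source replays from that offset. If the operator state is not rolled back together with the offset, replayed events update the state a second time. The class `ReplaySemantics` and its parameters are hypothetical, and the model says nothing about how Flink implements either mode.

```java
// Toy model of recovery by replay; names and parameters are illustrative assumptions.
final class ReplaySemantics {

    /**
     * Sums all events with one failure in between.
     * A checkpoint is taken just before event index checkpointAt is processed,
     * and the job fails after failAfter events have been processed.
     * Requires 0 <= checkpointAt < failAfter <= events.length.
     */
    static long sumWithRecovery(int[] events, int checkpointAt, int failAfter,
                                boolean restoreState) {
        long state = 0;
        long snapshot = 0;
        for (int i = 0; i < failAfter; i++) {
            if (i == checkpointAt) {
                snapshot = state; // state as of the checkpointed offset
            }
            state += events[i];
        }
        // Failure: the source rewinds to the checkpointed offset.
        if (restoreState) {
            state = snapshot; // exactly once: state is rolled back with the offset
        }
        // Replay: events since the checkpoint are seen again.
        for (int i = checkpointAt; i < events.length; i++) {
            state += events[i];
        }
        return state;
    }

    public static void main(String[] args) {
        int[] events = {1, 2, 3, 4, 5};
        System.out.println(sumWithRecovery(events, 2, 4, true));  // prints 15
        System.out.println(sumWithRecovery(events, 2, 4, false)); // prints 22
    }
}
```

In both runs every event is seen at least once; only the run that restores the snapshot avoids counting events 3 and 4 twice.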

Page 22: Streaming Dataflow with Apache Flink

Distributed Snapshots

Barriers flow through the topology in line with data.

Flink guarantees exactly-once processing.

Part of snapshot
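One way to picture barriers flowing in line with data is an operator with several input channels that stops reading a channel once its barrier has arrived and takes its snapshot only when all channels have delivered the barrier. The following is a simplified, single-threaded sketch under that assumption; `BarrierAlignment`, the `BARRIER` sentinel, and the round-robin reading order are all illustrative choices, not Flink internals.

```java
import java.util.Arrays;
import java.util.List;

// Simplified barrier alignment for an operator that sums its inputs.
final class BarrierAlignment {

    // Sentinel marking a checkpoint barrier inside a channel (assumption of this sketch).
    static final int BARRIER = Integer.MIN_VALUE;

    /** Returns the operator state (a running sum) at the moment the snapshot is taken. */
    static long snapshotSum(List<int[]> channels) {
        int n = channels.size();
        int[] position = new int[n];
        boolean[] blocked = new boolean[n];
        int barriersSeen = 0;
        long sum = 0;

        while (barriersSeen < n) {
            boolean progressed = false;
            for (int c = 0; c < n; c++) {
                // A channel whose barrier already arrived is not read until alignment is done.
                if (blocked[c] || position[c] >= channels.get(c).length) {
                    continue;
                }
                int item = channels.get(c)[position[c]++];
                progressed = true;
                if (item == BARRIER) {
                    blocked[c] = true;
                    barriersSeen++;
                } else {
                    sum += item; // pre-barrier record: part of the snapshot
                }
            }
            if (!progressed) {
                throw new IllegalStateException("a channel ended without a barrier");
            }
        }
        // All barriers received: the state now reflects exactly the pre-barrier records.
        return sum;
    }

    public static void main(String[] args) {
        int[] first = {1, 2, BARRIER, 100};
        int[] second = {10, 20, 30, BARRIER, 200};
        System.out.println(snapshotSum(Arrays.asList(first, second))); // prints 63
    }
}
```

The record 100 is already available on the first channel before the second channel's barrier arrives, but it is held back, so the snapshot contains only records that precede the barrier on every input.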

Page 33: Streaming Dataflow with Apache Flink

Distributed Snapshots

Flink guarantees exactly-once processing.

JobManager

Master

State Backend

Checkpoint Data
Source 1:          State 1:
Source 2:          State 2:
Source 3:          Sink 1:
Source 4:          Sink 2:

Source offsets: 6791 · 7252 · 5589 · 6843

Start Checkpoint Message

Page 34: Streaming Dataflow with Apache Flink

Distributed Snapshots

Flink guarantees exactly-once processing.

JobManager

Master

State Backend

Checkpoint Data
Source 1: 6791     State 1:
Source 2: 7252     State 2:
Source 3: 5589     Sink 1:
Source 4: 6843     Sink 2:

Emit Barriers

Acknowledge with Position

Page 36: Streaming Dataflow with Apache Flink

Distributed Snapshots

Flink guarantees exactly-once processing.

JobManager

Master

State Backend

Checkpoint Data
Source 1: 6791     State 1:
Source 2: 7252     State 2:
Source 3: 5589     Sink 1:
Source 4: 6843     Sink 2:

Received barrier at each input

s1: Write snapshot of its state

Page 37: Streaming Dataflow with Apache Flink

Distributed Snapshots

Flink guarantees exactly-once processing.

JobManager

Master

State Backend

Checkpoint Data
Source 1: 6791     State 1: PTR1
Source 2: 7252     State 2: PTR2
Source 3: 5589     Sink 1:
Source 4: 6843     Sink 2:

s1, s2: Acknowledge with pointer to state

Page 38: Streaming Dataflow with Apache Flink

Distributed Snapshots

Flink guarantees exactly-once processing.

JobManager

Master

State Backend

Checkpoint Data
Source 1: 6791     State 1: PTR1
Source 2: 7252     State 2: PTR2
Source 3: 5589     Sink 1: ACK
Source 4: 6843     Sink 2: ACK

s1 s2

Received barrier at each input

Acknowledge Checkpoint
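The bookkeeping on the master side of this protocol can be sketched as a table that is filled in as acknowledgements arrive: sources report a position, stateful operators report a pointer to their stored state, and sinks report a plain ACK. The checkpoint counts as complete once every task has answered. The class `CheckpointCoordinator` below is a hypothetical plain-Java model of that table, not the JobManager's actual implementation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "Checkpoint Data" table kept by the master.
final class CheckpointCoordinator {

    private final Set<String> pending;
    private final Map<String, String> acknowledged = new LinkedHashMap<>();

    CheckpointCoordinator(Collection<String> tasks) {
        this.pending = new LinkedHashSet<>(tasks);
    }

    /**
     * Records one acknowledgement (offset, state pointer, or "ACK").
     * Returns true once all tasks have acknowledged, i.e. the checkpoint is complete.
     */
    boolean acknowledge(String task, String payload) {
        if (!pending.remove(task)) {
            throw new IllegalArgumentException("unexpected acknowledgement from " + task);
        }
        acknowledged.put(task, payload);
        return pending.isEmpty();
    }

    /** Read-only view of what has been acknowledged so far. */
    Map<String, String> checkpointData() {
        return Collections.unmodifiableMap(acknowledged);
    }

    public static void main(String[] args) {
        CheckpointCoordinator coordinator = new CheckpointCoordinator(
            Arrays.asList("Source 1", "State 1", "Sink 1"));
        coordinator.acknowledge("Source 1", "6791");   // source: position
        coordinator.acknowledge("State 1", "PTR1");    // operator: pointer to state
        boolean complete = coordinator.acknowledge("Sink 1", "ACK");
        System.out.println(complete + " " + coordinator.checkpointData());
    }
}
```

Only a complete table is usable for recovery; a checkpoint with missing entries would restore some tasks to a different point in the stream than others.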

Page 40: Streaming Dataflow with Apache Flink

Operator State

User-defined state
• Flink’s transformations are long-running operators
• Feel free to keep objects around
• Hooks to include them in the system’s checkpoints

Windowed streams
• Time, count, and data-driven windows
• Managed by the system
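A minimal sketch of what such checkpoint hooks could look like: a long-running operator keeps an ordinary field as its state and exposes one method to hand a copy to the checkpoint and one to take it back on recovery. The interface name `Checkpointable` and its two methods are hypothetical and chosen for this example; consult the Flink documentation for the interfaces it actually provides.

```java
// Hypothetical snapshot/restore hooks for user-defined operator state.
final class OperatorStateSketch {

    /** Illustrative hook interface; not a Flink type. */
    interface Checkpointable<S> {
        S snapshotState();

        void restoreState(S state);
    }

    /** A long-running operator that simply keeps a counter object around. */
    static final class EventCounter implements Checkpointable<Long> {
        private long count;

        void onEvent(String event) {
            count++;
        }

        long count() {
            return count;
        }

        @Override
        public Long snapshotState() {
            return count; // value handed to the checkpoint
        }

        @Override
        public void restoreState(Long state) {
            count = state; // roll back to the checkpointed value
        }
    }

    public static void main(String[] args) {
        EventCounter counter = new EventCounter();
        counter.onEvent("a");
        counter.onEvent("b");
        Long checkpoint = counter.snapshotState();
        counter.onEvent("c");              // progress made after the checkpoint
        counter.restoreState(checkpoint);  // failure: go back to the checkpoint
        System.out.println(counter.count()); // prints 2
    }
}
```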

Page 41: Streaming Dataflow with Apache Flink

Batch on Streaming

DataStream API Unbounded Data

DataSet API Bounded Data

Runtime Distributed Streaming Data Flow

Libraries Machine Learning · Graph Processing · SQL-like API

Page 42: Streaming Dataflow with Apache Flink

Batch on Streaming

Run a bounded stream (data set) on a stream processor.

Bounded data set

Unbounded data stream

Page 43: Streaming Dataflow with Apache Flink

Batch on Streaming

Run a bounded stream (data set) on a stream processor.

Infinite Streams: Stream Windows · Pipelined Data Exchange

Finite Streams: Global View · Pipelined or Blocking Data Exchange

Page 46: Streaming Dataflow with Apache Flink

Batch Pipelines

Data exchange is mostly streamed

Some operators block (e.g. sort, hash table)
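The difference between a streamed and a blocking operator can be made visible by counting how many input elements have to be pulled before the first output exists. In the sketch below, a map-like operator needs one input per output, while a sort has to consume its whole input first. The class `StreamedVsBlocking` and its helpers are assumptions for illustration and use plain Java iterators rather than Flink operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Contrasts a streamed operator (map) with a blocking operator (sort).
final class StreamedVsBlocking {

    /** Source that counts how many elements downstream operators have pulled. */
    static final class CountingSource implements Iterator<Integer> {
        private final Iterator<Integer> data;
        int pulled = 0;

        CountingSource(List<Integer> data) {
            this.data = data.iterator();
        }

        @Override
        public boolean hasNext() {
            return data.hasNext();
        }

        @Override
        public Integer next() {
            pulled++;
            return data.next();
        }
    }

    /** Streamed operator: each output needs exactly one input element. */
    static Iterator<Integer> doubled(final Iterator<Integer> in) {
        return new Iterator<Integer>() {
            @Override
            public boolean hasNext() {
                return in.hasNext();
            }

            @Override
            public Integer next() {
                return in.next() * 2;
            }
        };
    }

    /** Blocking operator: consumes the whole input before emitting anything. */
    static Iterator<Integer> sorted(Iterator<Integer> in) {
        List<Integer> buffer = new ArrayList<>();
        while (in.hasNext()) {
            buffer.add(in.next());
        }
        Collections.sort(buffer);
        return buffer.iterator();
    }

    public static void main(String[] args) {
        CountingSource a = new CountingSource(Arrays.asList(3, 1, 2));
        System.out.println(doubled(a).next() + " after pulling " + a.pulled); // 6 after pulling 1

        CountingSource b = new CountingSource(Arrays.asList(3, 1, 2));
        System.out.println(sorted(b).next() + " after pulling " + b.pulled);  // 1 after pulling 3
    }
}
```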

Page 47: Streaming Dataflow with Apache Flink

DataSet API

ExecutionEnvironment env = ExecutionEnvironment
    .getExecutionEnvironment();

DataSet<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);

// DataSet WordCount
DataSet<Tuple2<String, Integer>> counts = data
    .flatMap(new SplitByWhitespace()) // (word, 1)
    .groupBy(0) // [word, [1, 1, …]]
    .sum(1); // sum per word for all occurrences

counts.print();

Page 54: Streaming Dataflow with Apache Flink

Batch-specific optimizations

Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user-types

Cost-based optimizer
• Program adapts to changing data size

Page 56: Streaming Dataflow with Apache Flink

Getting Started

Project Page: http://flink.apache.org

Quickstarts: Java & Scala API

Page 57: Streaming Dataflow with Apache Flink

Getting Started

Project Page: http://flink.apache.org

Docs: Programming Guides

Page 58: Streaming Dataflow with Apache Flink

Getting Started

Project Page: http://flink.apache.org

Get Involved: Mailing Lists, Stack Overflow, IRC, …

Page 59: Streaming Dataflow with Apache Flink

Blogs http://flink.apache.org/blog http://data-artisans.com/blog

Twitter @ApacheFlink

Mailing lists (news|user|dev)@flink.apache.org

Apache Flink