FEBRUARY 9, 2017, WARSAW Stream Analytics with SQL on Apache Flink® Fabian Hueske | Apache Flink PMC member | Co-founder dataArtisans

Upload: dataartisans

Post on 21-Feb-2017


TRANSCRIPT

Page 1: Fabian Hueske - Stream Analytics with SQL on Apache Flink

FEBRUARY 9, 2017, WARSAW

Stream Analytics with SQL on Apache Flink®

Fabian Hueske | Apache Flink PMC member | Co-founder dataArtisans

Page 2: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Streams are Everywhere

Page 3: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Data Analytics on Streaming Data

• Periodic batch processing
  • Lots of duct tape and baling wire
  • It’s up to you to make everything work… reliably!
  • High latency
• Continuous stream processing
  • Framework takes care of failures
  • Low latency

Page 4: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Stream Processing in Apache Flink

• Platform for scalable stream processing
• Fast
  • Low latency and high throughput
• Accurate
  • Stateful stream processing in event time
• Reliable
  • Exactly-once state guarantees
  • Highly available cluster setup

Page 5: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Streaming Applications Powered by Flink

30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily

Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees

Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second

Page 6: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Stream Processing is not for Everybody, … yet

• APIs of open source stream processors target developers
• Implementing streaming applications requires knowledge & skill
  • Stream processing concepts (time, state, windows, triggers, ...)
  • Programming experience (Java / Scala APIs)
• Stream processing technology spreads rapidly
  • There is a talent gap

Page 7: Fabian Hueske - Stream Analytics with SQL on Apache Flink

What about SQL?

• SQL is the most widely used language for data analytics
• Many good reasons to use SQL
  • Declarative specification
  • Optimization
  • Efficient execution
  • “Everybody” knows SQL
• SQL would make stream processing much more accessible, but…

Page 8: Fabian Hueske - Stream Analytics with SQL on Apache Flink

No Open Source Stream Processor Offers Decent SQL Support

• SQL was not designed with streaming data in mind
  • Relations are sets. Streams are infinite sequences.
  • Records arrive over time.
• Syntax
  • Time-based operations are cumbersome to specify (aggregates, joins)
• Semantics
  • A SQL query should compute the same result on a batch table and a stream

Page 9: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Flink’s SQL Support & Table API

• Standard SQL and LINQ-style Table API
• Unified APIs for batch & streaming data
• Common translation layers
  • Optimization based on Apache Calcite
  • Type system & code-generation
  • Table sources & sinks
• Streaming SQL & Table API is work in progress

Page 10: Fabian Hueske - Stream Analytics with SQL on Apache Flink

What are the Use Cases for Stream SQL?

• Continuous ETL & Data Import

• Live Dashboards & Reports

• Ad-hoc Analytics & Exploration

Page 11: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Dynamic Tables

• Core concept is a “Dynamic Table”
• Dynamic tables change over time
• Dynamic tables are treated like static batch tables
  • Dynamic tables are queried with standard SQL
  • A query returns another dynamic table
• Stream ←→ Dynamic Table conversions without information loss
  • “Stream / Table Duality”

Page 12: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Stream → Dynamic Table

• Append
• Replace by Key

Stream of (time, k) records:
(1, A), (2, B), (4, A), (5, C), (7, B), (8, A), (9, B), …

Append:
time  k
1     A
2     B
4     A
5     C
7     B
8     A
9     B
…     …

Replace by Key:
time  k
8     A
9     B
5     C
…     …
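The two conversion modes on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Flink code: the function names `to_table_append` and `to_table_replace_by_key` are made up for this example, and the input is the (time, k) stream from the figure.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str]  # (time, k), as in the slide's example stream

# The example stream from the figure.
STREAM: List[Record] = [
    (1, "A"), (2, "B"), (4, "A"), (5, "C"), (7, "B"), (8, "A"), (9, "B"),
]


def to_table_append(stream: Iterable[Record]) -> List[Record]:
    """Append mode: every stream record becomes a new row of the table."""
    table: List[Record] = []
    for record in stream:
        table.append(record)
    return table


def to_table_replace_by_key(stream: Iterable[Record]) -> Dict[str, int]:
    """Replace-by-key mode: a record overwrites the row with the same key k."""
    table: Dict[str, int] = {}
    for time, k in stream:
        table[k] = time  # only the most recent time per key is kept
    return table


append_table = to_table_append(STREAM)
keyed_table = to_table_replace_by_key(STREAM)
print(append_table)
print(keyed_table)
```

In append mode the table keeps growing with the stream, while in replace-by-key mode it holds one row per key.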

Page 13: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Querying a Dynamic Table

• Dynamic tables change over time
  • A[t]: Table A at time t
• Dynamic tables are queried with regular SQL
  • Result of a query changes as input table changes
  • q(A[t]): Evaluate query q on table A at time t
• As time t progresses, the query result is continuously updated
  • similar to maintaining a materialized view
  • t is current event time

Page 14: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Querying a Dynamic Table

q:
SELECT k, COUNT(k) AS cnt
FROM A
GROUP BY k

Table A:
time  k
1     A
2     B
4     A
5     C
7     B
8     A     ← A[8]
9     B     ← A[9]
12    C     ← A[12]

q(A[8]):
k  cnt
A  3
B  2
C  1

q(A[9]):
k  cnt
A  3
B  3
C  1

q(A[12]):
k  cnt
A  3
B  3
C  2
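The notation q(A[t]) can be made concrete with a small Python sketch that reproduces the numbers in the figure. It assumes that A[t] means "all rows with time ≤ t" and simply re-evaluates the query on each snapshot; the helper names `snapshot` and `q` are chosen for this example, and a real engine would presumably maintain the result incrementally instead of recomputing it.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (time, k)

# Table A from the figure, including the rows that arrive at times 9 and 12.
TABLE_A: List[Row] = [
    (1, "A"), (2, "B"), (4, "A"), (5, "C"),
    (7, "B"), (8, "A"), (9, "B"), (12, "C"),
]


def snapshot(table: List[Row], t: int) -> List[Row]:
    """A[t]: the rows of the table that have arrived up to time t."""
    return [row for row in table if row[0] <= t]


def q(rows: List[Row]) -> Dict[str, int]:
    """SELECT k, COUNT(k) AS cnt FROM A GROUP BY k"""
    return dict(Counter(k for _, k in rows))


for t in (8, 9, 12):
    print(f"q(A[{t}]) =", q(snapshot(TABLE_A, t)))
```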

Page 15: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Querying a Dynamic Table

q:
SELECT k, COUNT(k) AS cnt, TUMBLE_END(time, INTERVAL '5' SECONDS) AS endT
FROM A
GROUP BY k, TUMBLE(time, INTERVAL '5' SECONDS)

Table A:
time  k
1     A
2     B
4     A
5     C     ← A[5]
7     B
8     A
9     B     ← A[10]
11    A
12    C
14    C
15    A     ← A[15]

q(A):
k  cnt  endT
A  2    5     ← q(A[5])
B  1    5
C  1    5
A  1    10    ← q(A[10])
B  2    10
A  2    15    ← q(A[15])
C  2    15
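The windowed result in the figure can be reproduced with a short Python sketch. One assumption is made explicit here: to match the figure, a row with time 5 is counted in the window ending at 5, so each window covers the times end-4 … end. The helper names `window_end` and `tumbling_counts` are invented for this illustration and are not part of any Flink API.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[int, str]  # (time, k)

# Table A from the figure.
ROWS: List[Row] = [
    (1, "A"), (2, "B"), (4, "A"), (5, "C"),
    (7, "B"), (8, "A"), (9, "B"),
    (11, "A"), (12, "C"), (14, "C"), (15, "A"),
]

WINDOW_SIZE = 5  # seconds


def window_end(time: int, size: int = WINDOW_SIZE) -> int:
    """End of the tumbling window a row falls into (assumption: end is inclusive)."""
    return ((time + size - 1) // size) * size


def tumbling_counts(rows: List[Row]) -> List[Tuple[str, int, int]]:
    """Rows (k, cnt, endT), one per key and window, ordered by endT and then k."""
    counts: Counter = Counter()
    for time, k in rows:
        counts[(k, window_end(time))] += 1
    result = [(k, cnt, end) for (k, end), cnt in counts.items()]
    return sorted(result, key=lambda r: (r[2], r[0]))


for row in tumbling_counts(ROWS):
    print(row)
```

Because every row belongs to exactly one window, the result for a window is final once the window has passed, and its rows are only ever appended to the result table.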

Page 16: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Can We Run Any Query on Dynamic Tables?

• No
• There are state and computation constraints
• State may not grow infinitely as more data arrives
  • Clean-up timeout must be defined
• Input updates may only trigger partial re-computation of the result
• Queries with possibly unbounded state or computation are rejected
  • Optimizer performs validation

Page 17: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Bounding the State of a Query

SELECT k, COUNT(k) AS cnt
FROM A
GROUP BY k

STOP! UNBOUNDED STATE!

• State grows infinitely with domain of grouping attribute
  • Bound query input by time

SELECT k, COUNT(k) AS cnt
FROM A
WHERE last(time, INTERVAL '1' DAY)
GROUP BY k

• Query aggregates data of last 24 hours. Older data is discarded.
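A minimal Python sketch of the time-bounded variant is shown below. It assumes records arrive in time order and that "last 24 hours" is measured against the time of the newest record; the class name `LastIntervalCount` and its methods are hypothetical and only illustrate how discarding old rows keeps the state bounded.

```python
from collections import Counter, deque
from typing import Deque, Dict, Tuple


class LastIntervalCount:
    """COUNT per key over the rows of the last `horizon` time units."""

    def __init__(self, horizon: int) -> None:
        self.horizon = horizon
        self.rows: Deque[Tuple[int, str]] = deque()  # rows still inside the horizon
        self.counts: Counter = Counter()

    def add(self, time: int, k: str) -> None:
        """Add a row (assumption: rows arrive in non-decreasing time order)."""
        self.rows.append((time, k))
        self.counts[k] += 1
        self._evict(now=time)

    def _evict(self, now: int) -> None:
        # Discard rows older than the horizon and undo their contribution.
        while self.rows and self.rows[0][0] <= now - self.horizon:
            _, old_k = self.rows.popleft()
            self.counts[old_k] -= 1
            if self.counts[old_k] == 0:
                del self.counts[old_k]  # keys without rows no longer occupy state

    def result(self) -> Dict[str, int]:
        return dict(self.counts)


# Times are in hours for this example; the horizon is 24 hours.
agg = LastIntervalCount(horizon=24)
for time, k in [(1, "A"), (10, "B"), (20, "A"), (30, "C")]:
    agg.add(time, k)
after_30 = agg.result()

agg.add(50, "A")
after_50 = agg.result()
print(after_30, after_50)
```

The state now depends on how many rows fall into the 24-hour horizon rather than on the total number of keys ever seen.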

Page 18: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Updating Results and Late Arriving Data

• Sometimes emitted results need to be updated
  • Results which are continuously updated
  • Results for which relevant records arrived late
• Results that might be updated must be kept as state
  • Clean-up timeout
• When a table is converted into a stream, updates must be propagated
  • Update mode
  • Add/Retract mode
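How late records and a clean-up timeout interact can be sketched as follows. This is an illustrative model only: the class `WindowedCounts`, its window convention (end inclusive) and the exact expiry rule (a window's state is dropped once the newest seen time reaches window end plus timeout) are assumptions made for this example, not Flink's actual behavior.

```python
from typing import Dict, Optional, Tuple

Result = Tuple[str, int, int]  # (k, cnt, window_end)


class WindowedCounts:
    """Per-key counts in tumbling windows, kept as state until a clean-up timeout."""

    def __init__(self, size: int, cleanup_timeout: int) -> None:
        self.size = size
        self.cleanup_timeout = cleanup_timeout
        self.state: Dict[Tuple[str, int], int] = {}  # (k, window_end) -> cnt
        self.max_time = 0  # newest time seen so far

    def _expired(self, window_end: int) -> bool:
        return window_end + self.cleanup_timeout <= self.max_time

    def add(self, time: int, k: str) -> Optional[Result]:
        """Process one record; return the new or updated result row, or None."""
        self.max_time = max(self.max_time, time)
        # Clean up results that can no longer be updated.
        for key in [key for key in self.state if self._expired(key[1])]:
            del self.state[key]
        end = ((time + self.size - 1) // self.size) * self.size
        if self._expired(end):
            return None  # too late: the state for this window is already gone
        self.state[(k, end)] = self.state.get((k, end), 0) + 1
        return (k, self.state[(k, end)], end)


wc = WindowedCounts(size=5, cleanup_timeout=10)
emitted = [
    wc.add(1, "A"),   # first result for window ending at 5
    wc.add(7, "B"),   # first result for window ending at 10
    wc.add(4, "A"),   # late record: the earlier result for (A, 5) is updated
    wc.add(16, "C"),  # advances time; state of the window ending at 5 is cleaned up
    wc.add(3, "A"),   # arrives after the timeout and is ignored
]
print(emitted)
```

A longer timeout tolerates more lateness but keeps more results in state, which is the trade-off the bullet points describe.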

Page 19: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Dynamic Table → Stream: Update Mode

Table A:
time  k
1     A
2     B
4     A
5     C
7     B
8     A
…     …

SELECT k, COUNT(k) AS cnt
FROM A
GROUP BY k

Update by Key (output stream, in order of emission):
(A, 1), (B, 1), (A, 2), (C, 1), (B, 2), (A, 3)
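Update mode can be illustrated with a small Python sketch that turns the continuously updated count result into a stream of (k, cnt) records, where each record replaces the previous one with the same key. The function names `update_stream` and `apply_updates` are made up for this example.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]      # (time, k) rows of Table A
Update = Tuple[str, int]   # (k, cnt) record of the output stream

TABLE_A: List[Row] = [(1, "A"), (2, "B"), (4, "A"), (5, "C"), (7, "B"), (8, "A")]


def update_stream(rows: List[Row]) -> List[Update]:
    """Emit the new count for a key every time a row for that key arrives."""
    counts: Dict[str, int] = {}
    out: List[Update] = []
    for _, k in rows:
        counts[k] = counts.get(k, 0) + 1
        out.append((k, counts[k]))
    return out


def apply_updates(stream: List[Update]) -> Dict[str, int]:
    """Consumer side: rebuild the result table by overwriting rows by key."""
    table: Dict[str, int] = {}
    for k, cnt in stream:
        table[k] = cnt
    return table


updates = update_stream(TABLE_A)
print(updates)
print(apply_updates(updates))
```

The consumer needs to know the key (here k) to interpret the stream, since a record carries no marker saying which earlier record it replaces.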

Page 20: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Dynamic Table → Stream: Add/Retract Mode

Table A:
time  k
1     A
2     B
4     A
5     C
7     B
8     A
…     …

SELECT k, COUNT(k) AS cnt
FROM A
GROUP BY k

Add (+) / Retract (-) (output stream, in order of emission):
+ (A, 1), + (B, 1), - (A, 1), + (A, 2), + (C, 1), - (B, 1), + (B, 2), - (A, 2), + (A, 3)
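Add/retract mode can be sketched in the same style: before a changed result row is emitted, the old row is explicitly retracted. The sketch below is an illustration only; `retract_stream` and `apply_changes` are hypothetical helper names, and the input is Table A from the figure.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Row = Tuple[int, str]           # (time, k) rows of Table A
Change = Tuple[str, str, int]   # (op, k, cnt) with op "+" (add) or "-" (retract)

TABLE_A: List[Row] = [(1, "A"), (2, "B"), (4, "A"), (5, "C"), (7, "B"), (8, "A")]


def retract_stream(rows: List[Row]) -> List[Change]:
    """Emit a retraction of the old result row before adding the new one."""
    counts: Dict[str, int] = {}
    out: List[Change] = []
    for _, k in rows:
        old = counts.get(k)
        if old is not None:
            out.append(("-", k, old))  # withdraw the previously emitted row
        counts[k] = (old or 0) + 1
        out.append(("+", k, counts[k]))
    return out


def apply_changes(stream: List[Change]) -> Set[Tuple[str, int]]:
    """Consumer side: rebuild the result table without knowing any key."""
    rows: Counter = Counter()
    for op, k, cnt in stream:
        rows[(k, cnt)] += 1 if op == "+" else -1
    return {row for row, n in rows.items() if n > 0}


changes = retract_stream(TABLE_A)
print(changes)
print(apply_changes(changes))
```

Compared with update mode, the stream carries more records, but the consumer can apply it without being told which attribute is the key.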

Page 21: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Current State of SQL and Table API

• Huge interest and many contributors
• Current development efforts
  • Adding more window operators
  • Introducing dynamic tables
• And there is a lot more to do
  • New operators and features for streaming and batch
  • Performance improvements
  • Tooling and integration
• Try it out, give feedback, and start contributing!

Page 22: Fabian Hueske - Stream Analytics with SQL on Apache Flink

Stream Analytics with SQL on Apache Flink

Fabian Hueske | @fhueske