the future of real-time in spark - github pages...2016/02/18  · structured streaming high-level...

30
The Future of Real-Time in Spark Reynold Xin @rxin Spark Summit, New York, Feb 18, 2016

Upload: others

Post on 20-May-2020

16 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

The Future of Real-Time in Spark

Reynold Xin @rxinSpark Summit, New York, Feb 18, 2016

Page 2: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Why Real-Time?

Making decisions faster is valuable.

• Preventing credit card fraud• Monitoring industrial machinery• Human-facing dashboards• …

Page 3: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Streaming Engine

Noun.

Takes an input stream and produces an output stream.

Page 4: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

SQL Streaming MLlib

Spark Core

GraphX

Spark Unified Stack

Page 5: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

StreamingSQL MLlib

Spark Core

GraphXStreaming

Introduced 3 years ago in Spark 0.750% users consider most important part of Spark

Spark Unified Stack

Page 6: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Spark Streaming

• First attempt at unifying streaming and batch• State management built in• Exactly once semantics• Features required for large clusters

• Straggler mitigation, dynamic load balancing, fast fault-recovery

Page 7: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Streaming computations don’t run in isolation.

Page 8: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Use Case: Fraud Detection

STREAM

ANOMALY

Machine learning modelcontinuously updates to detect new anomalies

Ad-hoc analyze historic data

Page 9: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Continuous Application

noun.

An end-to-end application that acts on real-time data.

Page 10: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Challenges Building Continuous Applications

Integration with non-streaming systems often an after-thought• Interactive, batch, relational databases, machine learning, …

Streaming programming models are complex

Page 11: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Integration Example

Streaming engine

Stream(home.html, 10:08)

(product.html, 10:09)

(home.html, 10:10)

. . .

What can go wrong?• Late events• Partial outputs to MySQL• State recovery on failure• Distributed reads/writes • ...

MySQL

Page Minute Visits

home 10:09 21

pricing 10:10 30

... ... ...

Page 12: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

ProcessingBusiness logic change & new ops

(windows, sessions)

Complex Programming Models

OutputHow do we define

output over time & correctness?

DataLate arrival, varying distribution over time, …

Page 13: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Structured Streaming

Page 14: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

The simplest way to perform streaming analyticsis not having to reason about streaming.

Page 15: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Spark 2.0Infinite DataFrames

Spark 1.3Static DataFrames

Single API !

Page 16: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Structured Streaming

High-level streaming API built on Spark SQL engine• Runs the same queries on DataFrames• Event time, windowing, sessions, sources & sinks

Unifies streaming, interactive and batch queries• Aggregate data in a stream, then serve using JDBC• Change queries at runtime• Build and apply ML models

Page 17: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

output fordata at 1

Result

Que

ry

Time

data upto PT 1

Input

completeoutput

Output

1 2 3

Trigger: every 1 sec

data upto PT 2

output fordata at 2

data upto PT 3

output fordata at 3

Model

Page 18: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

deltaoutput

output fordata at 1

Result

Que

ry

Time

data upto PT 2

data upto PT 3

data upto PT 1

Input

output fordata at 2

output fordata at 3

Output

1 2 3

Trigger: every 1 secModel

Page 19: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Model Details

Input sources: append-only tables

Queries: new operators for windowing, sessions, etc

Triggers: based on time (e.g. every 1 sec)

Output modes: complete, deltas, update-in-place

Page 20: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Example: ETL

Input: files in S3

Query: map (transform each record)

Trigger: “every 5 sec”

Output mode: “new records”, into S3 sink

Page 21: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Example: Page View Count

Input: records in Kafka

Query: select count(*) group by page, minute(evtime)

Trigger: “every 5 sec”

Output mode: “update-in-place”, into MySQL sink

Note: this will automatically update “old” records on late data!

Page 22: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Logically:DataFrame operations on static data(i.e. as easy to understand as batch)

Physically:Spark automatically runs the query in streaming fashion(i.e. incrementally and continuously)

DataFrame

Logical Plan

Continuous, incremental execution

Catalyst optimizer

Execution

Page 23: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

logs = ctx.read.format("json").open("s3://logs")

logs.groupBy(logs.user_id).agg(sum(logs.time))

.write.format("jdbc")

.save("jdbc:mysql//...")

Example: Batch Aggregation

Page 24: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

logs = ctx.read.format("json").stream("s3://logs")

logs.groupBy(logs.user_id).agg(sum(logs.time))

.write.format("jdbc")

.stream("jdbc:mysql//...")

Example: Continuous Aggregation

Page 25: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

T = 0 Aggregate

AggregateT = 1

AggregateT = 2

Automatic Incremental Execution

Page 26: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Rest of Spark will follow

• Interactive queries should just work

• Spark’s data source API will be updated to support seamless streaming integration• Exactly once semantics end-to-end• Different output modes (complete, delta, update-in-place)

• ML algorithms will be updated too

Page 27: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

What can we do with this that’s hard with other engines?Ad-hoc, interactive queries

Dynamic changing queries

Benefits of Spark: elastic scaling, straggler mitigation, etc

Page 28: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Use Case: Fraud Detection

STREAM

ANOMALY

Machine Learning Modelcontinuously updates to detect new anomalies

Analyze Historic Data

Page 29: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Timeline

Spark 2.0• API foundation• Kafka, file systems, and

databases• Event-time aggregations

Spark 2.1 + • Continuous SQL• BI app integration• Other streaming sources / sinks• Machine learning

Page 30: The Future of Real-Time in Spark - GitHub Pages...2016/02/18  · Structured Streaming High-level streaming API built on Spark SQL engine • Runs the same queries on DataFrames •

Thank you.@rxin