Page 1: SamzaSQL - Scalable Fast Data Management with Streaming SQLweb.cse.ohio-state.edu/~lu.932/hpbdc2016/slides/hpbdc16-milinda.pdf · SamzaSQL Scalable Fast Data Management with Streaming

SamzaSQL: Scalable Fast Data Management with Streaming SQL

Milinda Pathirage (IU), Julian Hyde (Hortonworks), Yi Pan (LinkedIn), Beth Plale (IU)

School of Informatics and Computing, INDIANA UNIVERSITY

Fast Data

Data has to be processed as it arrives, so that we can react immediately to changing conditions.

BIG DATA ISN’T JUST BIG; IT’S ALSO FAST.

Big data is often data that is generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, and sensor data.

John Hugg, "Fast data: The next step after big data"

Applications

# Real-time distributed tracing for website performance and efficiency optimizations

# Calculating click-through rates

# Data stream enrichment

◦ Count page views by group key, where the group key is retrieved from a key/value store

◦ Enrich data streams related to user activities with user information such as location and company

# At the time of writing, LinkedIn uses 90 Kafka clusters deployed across 1500 nodes to process 150 TB of input data daily

Lambda Architecture (LA)

LA is a technology-agnostic data processing architecture that attempts to balance latency, accuracy, throughput, and fault tolerance by providing a unified serving layer on top of batch and stream processing sub-systems.

From: https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Kappa Architecture (KA)

KA is a simplification of the Lambda Architecture that uses an append-only immutable log as the canonical data store; batch processing is replaced by stream replay.

From: https://www.oreilly.com/ideas/questioning-the-lambda-architecture
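
The replay idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from Samza or Kafka: the class `KappaReplay`, the in-memory `log` list, and the `pageViewCounts` job are hypothetical names chosen for the example. The log is only ever appended to, and a derived view is rebuilt by running the same streaming logic again from an earlier offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KappaReplay {
    // Append-only log standing in for one partition of a durable log
    // (assumption: a plain in-memory list is enough for illustration).
    private final List<String> log = new ArrayList<>();

    /** Appends an event and returns its offset. Existing entries are never modified. */
    public long append(String event) {
        log.add(event);
        return log.size() - 1;
    }

    /** The streaming job: counts events per key, reading the log from the given offset. */
    public Map<String, Long> pageViewCounts(long fromOffset) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = (int) fromOffset; i < log.size(); i++) {
            counts.merge(log.get(i), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        KappaReplay k = new KappaReplay();
        k.append("/home");
        k.append("/jobs");
        k.append("/home");
        // Reprocessing is the same job replayed from offset 0; no separate batch system.
        System.out.println(k.pageViewCounts(0));
    }
}
```

Because the job is a pure function of the log and a starting offset, changing the job's logic and replaying from offset 0 plays the role that the batch layer plays in LA.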

MOTIVATION

Programming APIs for LA and KA

Summingbird is a well-known abstraction for writing LA-style applications. KA-style applications are mainly written against the stateful stream processing APIs provided by frameworks such as Apache Samza.

Limitations

# Need to maintain two complex distributed systems

# Users need to understand complex programming abstractions

# Long turnaround times

Summingbird

WORD COUNT

def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)

More examples at https://github.com/twitter/summingbird

Samza API

WINDOW AGGREGATION

public class WikipediaStatsStreamTask implements StreamTask, InitableTask, WindowableTask {
  ...
  // Local key-value store that holds the aggregation state
  private KeyValueStore<String, Integer> store;

  public void init(Config config, TaskContext context) {
    this.store = (KeyValueStore<String, Integer>) context.getStore("wikipedia-stats");
  }

  // Called once per incoming message
  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    Map<String, Object> edit = (Map<String, Object>) envelope.getMessage();
    ...
  }

  // Called at each window boundary to emit the aggregated counts
  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    ...
    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wikipedia-stats"), counts));
    ...
  }
}

SQL for Big Data

There are several well-known SQL-on-Hadoop solutions, and most organizations that use Hadoop use one or more of them.

# Apache Hive

# Presto

# Apache Drill

# Apache Impala

# Apache Kylin

# Apache Tajo

# Apache Phoenix

Motivating Research Questions

# Can the same low barrier and the clear semantics of SQL be extended to queries that execute simultaneously over data streams (in motion) and tables (at rest)?

# Can this be done with minimal and well-founded extensions to SQL?

# And with minimal overhead over a non-SQL-based LA/KA?

SAMZASQL

Streaming SQL - Data Model

# Stream: A stream S is a possibly indefinite partitioned sequence of temporally-defined elements, where an element is a tuple belonging to the schema of S.

# Partition: A partition is a time-ordered, immutable sequence of elements existing within a single stream.

# Relation: Analogous to a relation/table in relational databases, a relation R is a bag of tuples belonging to the schema of R.

Streaming SQL - Continuous Queries

SAMZASQL

SELECT STREAM rowtime, productId, units FROM Orders
WHERE units > 25

CQL

SELECT ISTREAM rowtime, productId, units FROM Orders
WHERE units > 25;

Streaming SQL - Window Aggregations

SAMZASQL

SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId

CQL

SELECT ISTREAM ... AS rowtime, productId, COUNT(*) AS c,
  SUM(units) AS units
FROM Orders[Range '1' HOUR, Slide '1' HOUR]
GROUP BY productId;
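
What the tumbling-window query computes can be sketched in plain Java. This is a minimal illustration and not SamzaSQL's generated code: the class `TumblingWindow` and its methods are hypothetical, rowtimes are assumed to be epoch milliseconds, and windows are assumed to be half-open intervals aligned to the hour. The sketch aggregates a finite batch for clarity; a streaming runtime would emit each group once its window closes.

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindow {
    static final long HOUR_MS = 60L * 60L * 1000L;

    /** End of the one-hour tumbling window [start, start + 1h) that contains rowtime. */
    static long tumbleEnd(long rowtime) {
        return Math.floorDiv(rowtime, HOUR_MS) * HOUR_MS + HOUR_MS;
    }

    /**
     * Groups orders by (window end, productId) and returns {COUNT(*), SUM(units)}
     * per group. Each order is given as {rowtime, productId, units}.
     */
    static Map<String, long[]> aggregate(long[][] orders) {
        Map<String, long[]> result = new TreeMap<>();
        for (long[] o : orders) {
            String key = tumbleEnd(o[0]) + "/" + o[1];
            long[] agg = result.computeIfAbsent(key, k -> new long[2]);
            agg[0] += 1;     // COUNT(*)
            agg[1] += o[2];  // SUM(units)
        }
        return result;
    }

    public static void main(String[] args) {
        long[][] orders = {
            {10 * 60_000L, 10, 5},  // 00:10, product 10
            {50 * 60_000L, 10, 7},  // 00:50, product 10, same window
            {70 * 60_000L, 10, 2},  // 01:10, product 10, next window
        };
        for (Map.Entry<String, long[]> e : aggregate(orders).entrySet()) {
            System.out.println(e.getKey() + " c=" + e.getValue()[0] + " units=" + e.getValue()[1]);
        }
    }
}
```

Every row falls into exactly one window, which is why the CQL form needs a slide equal to its range to express the same thing.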

Streaming SQL - Sliding Windows

SAMZASQL

SELECT STREAM rowtime, productId, units,
  SUM(units) OVER (PARTITION BY productId ORDER BY rowtime
    RANGE INTERVAL '1' HOUR PRECEDING) unitsLastHour
FROM Orders;

CQL

SELECT ISTREAM rowtime, productId, units,
  SUM(units) AS unitsLastHour
FROM Orders[Range '1' HOUR]
GROUP BY productId;
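
A sliding-window sum of this kind can be maintained incrementally. The Java sketch below is an illustration under assumptions, not the SamzaSQL operator: the class `SlidingWindowSum` is hypothetical, rowtimes are epoch milliseconds, orders of a product are assumed to arrive in rowtime order, and rows sharing a rowtime are handled in arrival order rather than as SQL peers. State is held in memory here, whereas the slides describe Samza's local storage for window state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SlidingWindowSum {
    static final long HOUR_MS = 60L * 60L * 1000L;

    // Per-product window contents ({rowtime, units}) and running sums.
    private final Map<Integer, Deque<long[]>> windows = new HashMap<>();
    private final Map<Integer, Long> sums = new HashMap<>();

    /**
     * Processes one order and returns unitsLastHour for it: the sum of units over
     * orders of the same product with rowtime in [rowtime - 1 hour, rowtime].
     */
    public long onOrder(long rowtime, int productId, long units) {
        Deque<long[]> window = windows.computeIfAbsent(productId, k -> new ArrayDeque<>());
        long sum = sums.getOrDefault(productId, 0L);
        // Evict rows that have fallen out of the one-hour range.
        while (!window.isEmpty() && window.peekFirst()[0] < rowtime - HOUR_MS) {
            sum -= window.pollFirst()[1];
        }
        window.addLast(new long[] {rowtime, units});
        sum += units;
        sums.put(productId, sum);
        return sum;
    }

    public static void main(String[] args) {
        SlidingWindowSum q = new SlidingWindowSum();
        long minute = 60_000L;
        System.out.println(q.onOrder(0, 10, 5));            // prints 5
        System.out.println(q.onOrder(30 * minute, 10, 7));  // prints 12
        System.out.println(q.onOrder(61 * minute, 10, 2));  // prints 9 (the first order has expired)
    }
}
```

Unlike the tumbling case, one output row is produced per input row, and each input row contributes to many overlapping windows.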

Streaming SQL - Window Joins

SAMZASQL

SELECT STREAM
  GREATEST(PacketsR1.rowtime, PacketsR2.rowtime) AS rowtime,
  PacketsR1.sourcetime,
  PacketsR1.packetId,
  PacketsR2.rowtime - PacketsR1.rowtime AS timeToTravel
FROM PacketsR1 JOIN PacketsR2 ON
  PacketsR1.rowtime BETWEEN
    PacketsR2.rowtime - INTERVAL '2' SECOND
    AND PacketsR2.rowtime + INTERVAL '2' SECOND
  AND PacketsR1.packetId = PacketsR2.packetId
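
One common way to evaluate such a join is a symmetric buffer: each side stores what it has seen and probes the other side on arrival. The Java sketch below illustrates that technique under assumptions and is not taken from SamzaSQL: the class `WindowJoin` and its methods are hypothetical, rowtimes are in milliseconds, the `sourcetime` column is left out, and eviction of packets older than the window is omitted for brevity (a real operator would need it to bound state).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowJoin {
    static final long WINDOW_MS = 2_000L;

    // Buffered rowtimes per packetId for each input stream.
    private final Map<Long, List<Long>> r1 = new HashMap<>();
    private final Map<Long, List<Long>> r2 = new HashMap<>();

    /** Packet seen on PacketsR1; returns joined rows {rowtime, packetId, timeToTravel}. */
    public List<long[]> onR1(long rowtime, long packetId) {
        r1.computeIfAbsent(packetId, k -> new ArrayList<>()).add(rowtime);
        List<long[]> out = new ArrayList<>();
        for (long t2 : r2.getOrDefault(packetId, new ArrayList<Long>())) {
            if (Math.abs(rowtime - t2) <= WINDOW_MS) {
                out.add(new long[] {Math.max(rowtime, t2), packetId, t2 - rowtime});
            }
        }
        return out;
    }

    /** Packet seen on PacketsR2; symmetric to onR1. */
    public List<long[]> onR2(long rowtime, long packetId) {
        r2.computeIfAbsent(packetId, k -> new ArrayList<>()).add(rowtime);
        List<long[]> out = new ArrayList<>();
        for (long t1 : r1.getOrDefault(packetId, new ArrayList<Long>())) {
            if (Math.abs(t1 - rowtime) <= WINDOW_MS) {
                out.add(new long[] {Math.max(t1, rowtime), packetId, rowtime - t1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        WindowJoin join = new WindowJoin();
        join.onR1(1_000L, 7L);
        for (long[] row : join.onR2(2_500L, 7L)) {
            System.out.println("rowtime=" + row[0] + " packetId=" + row[1] + " timeToTravel=" + row[2]);
        }
    }
}
```

The time bound in the join condition is what keeps the buffers finite: a packet can be dropped once the other stream has advanced more than two seconds past it.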

Streaming SQL - Stream-to-Relation Joins

SAMZASQL

SELECT STREAM *
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId
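
A stream-to-relation join amounts to a lookup against a cached copy of the relation for every stream element. The Java sketch below illustrates that idea under assumptions: the class `StreamRelationJoin` and its methods are hypothetical, the column names follow the Orders and Products schemas given in the evaluation slides, the column types are guesses, and a `HashMap` stands in for whatever local store holds the cached relation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StreamRelationJoin {
    /** One row of the Products relation (column types are assumed). */
    static final class Product {
        final String name;
        final int supplierId;
        Product(String name, int supplierId) {
            this.name = name;
            this.supplierId = supplierId;
        }
    }

    // Local cache of the relation, keyed by the join column productId.
    private final Map<Integer, Product> products = new HashMap<>();

    /** Applies one insert or update to the cached relation. */
    public void onProduct(int productId, String name, int supplierId) {
        products.put(productId, new Product(name, supplierId));
    }

    /** Joins one order with the cache; empty when no product matches (inner join). */
    public Optional<String> onOrder(long rowtime, int productId, int orderId, int units) {
        Product p = products.get(productId);
        if (p == null) {
            return Optional.empty();
        }
        return Optional.of(rowtime + "," + productId + "," + orderId + "," + units
                + "," + p.name + "," + p.supplierId);
    }

    public static void main(String[] args) {
        StreamRelationJoin join = new StreamRelationJoin();
        join.onProduct(10, "widget", 3);
        System.out.println(join.onOrder(1000L, 10, 1, 5).orElse("no match"));
    }
}
```

The relation has to be loaded before orders are processed, otherwise early orders would miss their match.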

SamzaSQL - Implementation

# Uses the Apache Calcite query planning framework

# Utilizes Calcite’s code generation framework

# Generates Samza jobs for streaming SQL queries

# Uses Samza’s local storage to implement fault-tolerant window aggregations

# Uses Samza’s bootstrap stream feature to cache the relation for stream-to-relation join queries

# Uses Janino to compile the generated operators during stream task initialization

SamzaSQL - Architecture

[Figure: architecture diagram. Components shown: SamzaSQL Shell, Samza YARN Client, Calcite Model, Schema Registry, Zookeeper, SamzaSQL Job]

SamzaSQL - Samza Job

[Figure: Samza job deployment diagram. Components shown: Samza YARN Client, YARN Resource Manager, Samza App Master, three Node Managers, SamzaContainers labeled s-p0, s-p1 and s-p2, and a Kafka Cluster with Kafka Broker 1, Kafka Broker 2, ..., Kafka Broker n holding s-p0, s-p1 and s-p2]

SamzaSQL - Kafka

SamzaSQL - Query Planner

[Figure: query planner diagram, labeled Apache Calcite. Stages shown: SELECT STREAM … query, Parser, Validator, Convert to Logical Plan, Generic Optimizations, Convert to SamzaSQL Model, SamzaSQL Optimizations, Execution Plan, Samza Job Configuration*]

EVALUATION

Evaluation - Environment

# 100 byte messages (based on previous Kafka benchmarks)

# 3 node (EC2 r3.2xlarge) Kafka cluster

# 3 node (EC2 r3.2xlarge) YARN cluster

# Each r3.2xlarge instance has 8 vCPUs, 61 GB of RAM and 160 GB of SSD-backed storage

# Data model

◦ Stream - Orders (rowtime, productId, orderId, units)

◦ Table - Products (productId, name, supplierId)

Evaluation - Results

# Per-task throughput is around 550 MB/min for simple queries (100-byte messages)

# Throughput is around 200 MB/min when local storage is used (100-byte messages)

# 30-40% overhead for simple queries compared with Samza jobs written in Java

# Overheads are mainly due to message format transformations required in the streaming SQL runtime

# Overheads increase when local storage is used, due to message serialization/deserialization
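
To relate the byte rates above to the message rates plotted in the throughput charts, the figures can be converted as follows. This is a back-of-the-envelope conversion, assuming that the rates are per minute and that 1 MB is taken as 10^6 bytes; neither assumption is stated on the slides.

```latex
\[
\frac{550 \times 10^{6}\ \text{B/min}}{100\ \text{B/msg}}
  = 5.5 \times 10^{6}\ \text{msg/min}
  \approx 9.2 \times 10^{4}\ \text{msg/s}
\]
\[
\frac{200 \times 10^{6}\ \text{B/min}}{100\ \text{B/msg}}
  = 2 \times 10^{6}\ \text{msg/min}
  \approx 3.3 \times 10^{4}\ \text{msg/s}
\]
```

Under these assumptions, a handful of tasks running simple queries lands in the 10^7 msg/min range, which is the scale used on the chart axes.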

Evaluation - SamzaSQL Message Processing Flow

MESSAGE PROCESSING FLOW

Decode → AvroToArray → Process → ArrayToAvro → Encode

Evaluation - Filter Throughput

[Figure: filter query throughput (msg/min, axis scale ·10^7) against number of tasks (2, 4, 8, 16); series: SamzaSQL and Native]

SELECT STREAM * FROM Orders WHERE units > 50

Evaluation - Project Throughput

[Figure: project query throughput (msg/min, axis scale ·10^7) against number of tasks (2, 4, 8, 16); series: SamzaSQL and Native]

SELECT STREAM rowtime, productId, units FROM Orders

Evaluation - Stream-to-Relation Join Throughput

[Figure: stream-to-relation join throughput (msg/min, axis scale ·10^7) against number of tasks (2, 4, 8, 16); series: SamzaSQL and Native]

SELECT STREAM Orders.rowtime, Orders.orderId, Orders.productId, Orders.units,
  Products.supplierId
FROM Orders JOIN Products ON Orders.productId = Products.productId

Evaluation - Sliding Window Throughput

[Figure: sliding window query throughput (msg/min, axis from 0 to 1·10^6) for 1 task; series: SamzaSQL and Samza]

SELECT STREAM rowtime, productId, units,
  SUM(units) OVER (PARTITION BY productId ORDER BY rowtime
    RANGE INTERVAL '5' MINUTE PRECEDING) unitsLastFiveMinutes
FROM Orders

Sliding window query throughput was measured on an iMac because of limitations in EC2 I/O rates.

RELATED WORK

Related Work

# Early work on streaming SQL - TelegraphCQ, Tribeca, GSQL

# CQL

# Streaming SQL for Apache Flink and Apache Storm, based on our work in Apache Calcite

FUTURE WORK AND CONCLUSION

Future Work

# Code generation to bring SamzaSQL-generated physical plans closer to Samza Java API based queries

# Local storage related improvements to reduce serialization/deserialization overheads

# Streaming query optimizations for fast data management systems

# Ordering guarantees in the presence of stream repartitioning

# Stream-to-relation queries

# Intra-query optimizations

# Handling out-of-order arrivals

Summary and Conclusion

# We proposed a novel set of extensions to standard SQL for expressing streaming queries.

# SamzaSQL is an implementation of the proposed streaming SQL variant on top of Apache Samza.

# We demonstrate that reasonable performance can be achieved by utilizing existing libraries.

# Our evaluation results show that further improvements, such as code generation, are needed to bring the streaming SQL runtime closer in performance to streaming queries written in languages such as Java and Scala.

References

# Apache Samza

# Apache Calcite

# High-Level Language for Samza

# Calcite Streaming SQL

# Stream Processing for Everyone with SQL and Apache Flink

Acknowledgments

# The authors thank

◦ Chris Riccomini, Jay Kreps, Martin Kleppmann, Navina Ramesh, Guozhang Wang and the Apache Samza and Apache Calcite communities for their valuable feedback.

◦ Amazon Web Services for the resource allocation award.