Is Spark Streaming based on Reactive Streams? 2016-12-14 Another Hadoop Summit (もう1つのHadoop Summit)

Upload: chibochibo

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Is spark streaming based on reactive streams?

Is Spark Streaming based on Reactive Streams?

2016-12-14 Another Hadoop Summit (もう1つのHadoop Summit)

Page 2: Is spark streaming based on reactive streams?

Agenda

Why back pressure matters

What is Reactive Streams?

How Spark Streaming implements back pressure

Page 3: Is spark streaming based on reactive streams?

About me

Takako Shimamoto (島本 多可子, @chibochibo03)

BizReach, Inc., Office of the CTO

Active in the Scala community

Please check out GitBucket too

Page 4: Is spark streaming based on reactive streams?

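Why talk about Spark?

We started working with Spark in mid-2014

As use cases go, ours is small-scale

We use it knowing that it carries some overhead

Even at small scale, there was one long-awaited feature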

back pressure

Page 5: Is spark streaming based on reactive streams?

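What is back pressure?

Flow control for data in stream processing

A mechanism for feeding back that a component is overloaded

The receiver tells the sender how much data it can handle at its own processing capacity

(Illustration) Receiver: "I'm swamped!! Only two more!"

(Illustration) Sender: "All right, two more."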

Page 6: Is spark streaming based on reactive streams?

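Why back pressure matters

It is hard for the sender to keep the data volume constant at all times

There are waves, such as temporary spikes

It is important that the system as a whole keeps running

Data is basically flowing all the time

During a momentary overload, it is better to lose immediacy than to stop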

Page 7: Is spark streaming based on reactive streams?

Spark supports back pressure as of 1.5

spark.streaming.backpressure.enabled (since 1.5)

Default is false

Set it to true to enable

spark.streaming.receiver.maxRate

Upper limit on the number of records per second

Acts as the ceiling when back pressure adjusts the rate
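
As a minimal sketch, the two properties on this slide could be set in spark-defaults.conf like this. The values are illustrative assumptions, in particular the 1000 records per second, and are not recommendations from the talk.

```
# spark-defaults.conf (values are illustrative assumptions)
spark.streaming.backpressure.enabled  true
spark.streaming.receiver.maxRate      1000
```

With both set, the rate chosen by back pressure should never exceed the maxRate ceiling.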

Page 8: Is spark streaming based on reactive streams?

How is it implemented?

Page 9: Is spark streaming based on reactive streams?

Stream processing

back pressure

Page 10: Is spark streaming based on reactive streams?

Reactive Streams

Page 11: Is spark streaming based on reactive streams?

What is Reactive Streams?

An initiative to standardize asynchronous stream processing

In Scala, Akka Streams already supports it

Introduced in JDK 9 as the Flow API

Spring 5 is going reactive

Back pressure included

Page 12: Is spark streaming based on reactive streams?

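Principles

The Subscriber bounds how much it receives

When it faces overload, the Subscriber sends a back-pressure signal

The back-pressure signal must be asynchronous

Dynamic Push-Pull

When the Subscriber is faster, the stream behaves like push

When the Publisher is faster, the stream behaves like pull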

Page 13: Is spark streaming based on reactive streams?

Specification - Flow API

public final class Flow {

    @FunctionalInterface
    public static interface Publisher<T> {
        public void subscribe(Subscriber<? super T> subscriber);
    }

    public static interface Subscription {
        public void request(long n);
        public void cancel();
    }

Page 14: Is spark streaming based on reactive streams?

Specification - Flow API

    public static interface Subscriber<T> {
        public void onSubscribe(Subscription subscription);
        public void onNext(T item);
        public void onError(Throwable throwable);
        public void onComplete();
    }

    public static interface Processor<T,R> extends Subscriber<T>, Publisher<R> {}

}

Page 15: Is spark streaming based on reactive streams?

Flow

Participants: Publisher, Subscription, Subscriber

Publisher -> Subscriber: onSubscribe(Subscription)

Page 16: Is spark streaming based on reactive streams?

Flow

Publisher -> Subscriber: onSubscribe(Subscription)

Subscriber -> Subscription: request(1) (asks for 1 item)

Page 17: Is spark streaming based on reactive streams?

Flow

Publisher -> Subscriber: onSubscribe(Subscription)

Subscriber -> Subscription: request(1) (asks for 1 item)

Publisher -> Subscriber: onNext(data)

Page 18: Is spark streaming based on reactive streams?

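Flow

Publisher -> Subscriber: onSubscribe(Subscription)

Subscriber -> Subscription: request(1) (asks for 1 item)

Publisher -> Subscriber: onNext(data)

Subscriber -> Subscription: request(3) (asks for 3 items)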

Page 19: Is spark streaming based on reactive streams?

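Flow

Publisher -> Subscriber: onSubscribe(Subscription)

Subscriber -> Subscription: request(1) (asks for 1 item)

Publisher -> Subscriber: onNext(data)

Subscriber -> Subscription: request(3) (asks for 3 items)

Publisher -> Subscriber: onNext(data) (for the 3 requested items)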

Page 20: Is spark streaming based on reactive streams?

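Flow

Publisher -> Subscriber: onSubscribe(Subscription)

Subscriber -> Subscription: request(1) (asks for 1 item)

Publisher -> Subscriber: onNext(data)

Subscriber -> Subscription: request(3) (asks for 3 items)

Publisher -> Subscriber: onNext(data) (for the 3 requested items)

Publisher -> Subscriber: onComplete() ("That's everything!")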
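
The exchange on these slides can be tried out with the JDK's own Flow API. The following is a minimal sketch, assuming JDK 9 or later. The class name BatchingSubscriber and its demo method are hypothetical names chosen for illustration, and SubmissionPublisher stands in for whatever Publisher a real system would use. The subscriber asks for 1 item first and then for 3 at a time, as in the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

/** Hypothetical subscriber: requests 1 item, then 3 at a time (as on the slides). */
final class BatchingSubscriber implements Flow.Subscriber<Integer> {
    final List<Integer> received = new ArrayList<>();
    final CountDownLatch done = new CountDownLatch(1);
    private Flow.Subscription subscription;
    private int outstanding; // items requested but not yet delivered

    @Override
    public void onSubscribe(Flow.Subscription s) {
        subscription = s;
        outstanding = 1;
        s.request(1); // back-pressure signal: "I can take 1 item"
    }

    @Override
    public void onNext(Integer item) {
        received.add(item);
        if (--outstanding == 0) {
            outstanding = 3;
            subscription.request(3); // "I can take 3 more"
        }
    }

    @Override
    public void onError(Throwable t) {
        done.countDown();
    }

    @Override
    public void onComplete() {
        done.countDown(); // the publisher has nothing more to send
    }

    /** Publishes 1..4 and returns what the subscriber received. */
    static List<Integer> demo() {
        BatchingSubscriber sub = new BatchingSubscriber();
        try (SubmissionPublisher<Integer> pub = new SubmissionPublisher<>()) {
            pub.subscribe(sub);
            for (int i = 1; i <= 4; i++) {
                pub.submit(i); // buffered until the subscriber requests it
            }
        } // close() signals onComplete once buffered items are delivered
        try {
            sub.done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sub.received;
    }
}
```

Calling BatchingSubscriber.demo() should return the four submitted items in order. The publisher never delivers more than the subscriber has requested, which is the back-pressure guarantee the specification is about.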

Page 21: Is spark streaming based on reactive streams?

Does Spark comply with Reactive Streams?

The answer is: No!!

Though we will just take inspiration from some of the design principles of the Reactive Streams specification, we do not intend for Spark's internals to comply with this specification.

In short: Spark takes inspiration from the design principles of Reactive Streams, but its internals are not intended to comply with the specification.

"4.1 Back-pressure signaling". Spark Streaming back-pressure signaling. https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing (accessed 2016-12-11)

Page 22: Is spark streaming based on reactive streams?

How does Spark do back pressure?

The trigger is StreamingListener's onBatchCompleted

StreamingListener trait

A listener for receiving information about ongoing stream processing

onBatchCompleted method

Called when one mini-batch has completed

Page 23: Is spark streaming based on reactive streams?

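What information does onBatchCompleted provide?

batchTime: the mini-batch time

submissionTime: the time the batch was submitted to the JobScheduler's queue

processingStartTime, processingEndTime: the times processing started and ended

schedulingDelay: time from being scheduled to the start of processing (wait time)

processingDelay: processing time

totalDelay: time from being scheduled to the end of processing (total time taken)

Slide annotation: 0 when there is no delay

Slide annotation: stays within the batch interval when things are healthy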

Page 24: Is spark streaming based on reactive streams?

What information does onBatchCompleted provide?

(Same fields as on page 23: batchTime, submissionTime, processingStartTime, processingEndTime, schedulingDelay, processingDelay, totalDelay)

These values are passed to the RateEstimator to compute the new rate
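
To make the arithmetic concrete, here is a small sketch of how the three delays follow from the timestamps, and how a new rate could be derived from them. BatchTimes and SketchRateEstimator are hypothetical names, and the formula (measured throughput minus a correction proportional to the backlog implied by the scheduling delay) is a simplified illustration under stated assumptions, not Spark's RateEstimator implementation.

```java
/** Hypothetical holder for the timestamps (in ms) reported when a batch completes. */
final class BatchTimes {
    final long submissionTime;
    final long processingStartTime;
    final long processingEndTime;

    BatchTimes(long submissionTime, long processingStartTime, long processingEndTime) {
        this.submissionTime = submissionTime;
        this.processingStartTime = processingStartTime;
        this.processingEndTime = processingEndTime;
    }

    /** Time spent waiting in the queue before processing started. */
    long schedulingDelay() {
        return processingStartTime - submissionTime;
    }

    /** Time spent actually processing the batch. */
    long processingDelay() {
        return processingEndTime - processingStartTime;
    }

    /** Wait time plus processing time. */
    long totalDelay() {
        return schedulingDelay() + processingDelay();
    }
}

/** Illustrative estimator only; the formula is an assumption, not Spark's code. */
final class SketchRateEstimator {
    /**
     * Returns a new rate in records per second: the measured throughput,
     * reduced in proportion to the backlog implied by the scheduling delay,
     * and never below minRate.
     */
    static double newRate(long numRecords, BatchTimes t, long batchIntervalMs,
                          double backlogGain, double minRate) {
        if (numRecords <= 0 || t.processingDelay() <= 0) {
            return minRate; // nothing measurable in this batch
        }
        double processingRate = numRecords * 1000.0 / t.processingDelay();
        double backlogRate = t.schedulingDelay() * processingRate / batchIntervalMs;
        return Math.max(minRate, processingRate - backlogGain * backlogRate);
    }
}
```

For a batch submitted at 0 ms, started at 1000 ms and finished at 3000 ms, the scheduling delay is 1000 ms, the processing delay 2000 ms and the total delay 3000 ms. With 1000 records, a 1000 ms batch interval and a gain of 0.2, the sketch yields 400 records per second. With no scheduling delay it simply returns the measured throughput.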

Page 25: Is spark streaming based on reactive streams?

Mechanism - when data is received

Driver: JobScheduler, JobGenerator, ReceiverTracker, ReceiverInputDStream

Executor: ReceiverSupervisor, BlockGenerator, Receiver

Page 26: Is spark streaming based on reactive streams?

Mechanism - when data is received

Driver: JobScheduler, JobGenerator, ReceiverTracker, ReceiverInputDStream

Executor: ReceiverSupervisor, BlockGenerator, Receiver

Receiver: push data

BlockGenerator: cuts the data into blocks at each block interval

Page 27: Is spark streaming based on reactive streams?

Mechanism - when data is received

Driver: JobScheduler, JobGenerator, ReceiverTracker, ReceiverInputDStream

Executor: ReceiverSupervisor, BlockGenerator, Receiver

Receiver: push data

ReceiverSupervisor -> ReceiverTracker: add new blocks

Page 28: Is spark streaming based on reactive streams?

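Mechanism - when data is received

Driver: JobScheduler, JobGenerator, ReceiverTracker, ReceiverInputDStream

Executor: ReceiverSupervisor, BlockGenerator, Receiver

Receiver: push data

ReceiverSupervisor -> ReceiverTracker: add new blocks

JobGenerator: generateJob (every batch interval)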

Page 29: Is spark streaming based on reactive streams?

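Mechanism - when data is received

Driver: JobScheduler, JobGenerator, ReceiverTracker, ReceiverInputDStream

Executor: ReceiverSupervisor, BlockGenerator, Receiver

Receiver: push data

ReceiverSupervisor -> ReceiverTracker: add new blocks

JobGenerator: generateJob

ReceiverInputDStream -> ReceiverTracker: get blocks (asks which blocks are available)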

Page 30: Is spark streaming based on reactive streams?

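Mechanism - when data is received

Driver: JobScheduler, JobGenerator, ReceiverTracker, ReceiverInputDStream

Executor: ReceiverSupervisor, BlockGenerator, Receiver

Receiver: push data

ReceiverSupervisor -> ReceiverTracker: add new blocks

JobGenerator: generateJob

ReceiverInputDStream -> ReceiverTracker: get blocks

JobGenerator -> JobScheduler: submitJobSet (scheduled as a job)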

Page 31: Is spark streaming based on reactive streams?

Mechanism - when a batch completes

Driver: JobScheduler, ReceiverTracker, ReceiverInputDStream, ReceiverRateController

Executor: ReceiverSupervisor, BlockGenerator

Page 32: Is spark streaming based on reactive streams?

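Mechanism - when a batch completes

Driver: JobScheduler, ReceiverTracker, ReceiverInputDStream, ReceiverRateController

Executor: ReceiverSupervisor, BlockGenerator

ReceiverRateController: a RateController, which is a StreamingListener

BlockGenerator: a RateLimiter, which controls the amount of data the Receiver receives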

Page 33: Is spark streaming based on reactive streams?

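Mechanism - when a batch completes

Driver: JobScheduler, ReceiverTracker, ReceiverInputDStream, ReceiverRateController

Executor: ReceiverSupervisor, BlockGenerator

ReceiverRateController: computes the new rate in onBatchCompleted

ReceiverRateController -> ReceiverTracker: sendRateUpdate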

Page 34: Is spark streaming based on reactive streams?

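Mechanism - when a batch completes

Driver: JobScheduler, ReceiverTracker, ReceiverInputDStream, ReceiverRateController

Executor: ReceiverSupervisor, BlockGenerator

ReceiverRateController: computes the new rate in onBatchCompleted

ReceiverRateController -> ReceiverTracker: sendRateUpdate

ReceiverTracker -> ReceiverSupervisor: RPC send

ReceiverSupervisor -> BlockGenerator: updateRate (sets the new rate on Guava's RateLimiter)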

Page 35: Is spark streaming based on reactive streams?

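Key points

A new rate is determined by hooking into batch completion

Wait time, processing time, and similar values are taken into account

The new rate propagates to the BlockGenerator via the ReceiverSupervisor

Guava's RateLimiter controls the amount of data that is pushed

acquire is called for every record

Each receiver waits once it exceeds the number of records permitted per second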
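
The per-record acquire pattern can be sketched with a small hand-rolled limiter. This is a stand-in written for illustration and is not Guava's RateLimiter or Spark's code. The class name SimpleRateLimiter is hypothetical, and the method names acquire and setRate are chosen only to mirror the idea on the slide.

```java
import java.util.concurrent.TimeUnit;

/** Hand-rolled permits-per-second limiter; a stand-in, not Guava's RateLimiter. */
final class SimpleRateLimiter {
    private double permitsPerSecond;
    private long nextFreeNanos; // earliest time the next permit may be handed out

    SimpleRateLimiter(double permitsPerSecond) {
        setRate(permitsPerSecond);
        this.nextFreeNanos = System.nanoTime();
    }

    /** Applies a new rate, e.g. one computed after a batch completes. */
    synchronized void setRate(double permitsPerSecond) {
        if (permitsPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        this.permitsPerSecond = permitsPerSecond;
    }

    /** Takes one permit, sleeping if needed; returns the nanoseconds waited. */
    long acquire() {
        long waitNanos;
        synchronized (this) {
            long now = System.nanoTime();
            // Reserve the next slot: either right now or at the pending slot.
            long start = (nextFreeNanos - now > 0) ? nextFreeNanos : now;
            nextFreeNanos = start + (long) (1_000_000_000L / permitsPerSecond);
            waitNanos = start - now;
        }
        if (waitNanos > 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
        }
        return waitNanos;
    }
}
```

A receiver-side loop would call acquire() once per record before buffering it, so a burst above the permitted rate is slowed down rather than dropped. Calling setRate with a lower value takes effect from the next reserved slot onward.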

Page 36: Is spark streaming based on reactive streams?

What happens next?

Page 37: Is spark streaming based on reactive streams?

Compliance with Reactive Streams!?

https://issues.apache.org/jira/browse/SPARK-10420

Problem

Back-pressure information is not visible from the Receiver

The Receiver cannot send a signal such as request(1)

Page 38: Is spark streaming based on reactive streams?

Where is the problem?

Executor: ReceiverSupervisor, BlockGenerator, Receiver

RPC send -> ReceiverSupervisor: New Rate

ReceiverSupervisor -> BlockGenerator: updateRate

Page 39: Is spark streaming based on reactive streams?

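Where is the problem?

Executor: ReceiverSupervisor, BlockGenerator, Receiver

RPC send -> ReceiverSupervisor: New Rate

ReceiverSupervisor -> BlockGenerator: updateRate

The Receiver does not know the new rate!!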

Page 40: Is spark streaming based on reactive streams?

Proposed improvement

Executor: ReceiverSupervisor, BlockGenerator, Receiver

RPC send -> ReceiverSupervisor: New Rate

updateRateLimit

Page 41: Is spark streaming based on reactive streams?

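Proposed improvement

Executor: ReceiverSupervisor, BlockGenerator, Receiver

RPC send -> ReceiverSupervisor: New Rate

updateRateLimit

The update goes through the Receiver

It is propagated via a method on the Supervisor

This makes it possible to build a Reactive Streams-based Receiver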

Page 42: Is spark streaming based on reactive streams?

Summary

Back pressure is implemented by making good use of features Spark already had

A JIRA ticket is open, but its priority is low

Spark's back pressure ≠ Reactive Streams