Streaming Data Analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
TRANSCRIPT
Streaming Analytics on AWS
Dmitri Tchikatilov
AdTech BD, [email protected]
Agenda
1. Streaming principles
2. Streaming analytics on AWS
3. Kinesis and Apache Spark on EMR
4. Querying and scaling
5. Best practices
Batch vs. Stream
              Batch Processing                        Stream Processing
Data scope    Queries or processing over all          Queries or processing over a rolling
              or most of the data                     window, or the most recent record
Data size     Large batches of data                   Individual records or micro-batches
                                                      of a few records
Performance   Latencies of minutes to hours           Requires latency on the order of
                                                      seconds or milliseconds
Analytics     Complex analytics                       Simple response functions, aggregates,
                                                      and rolling metrics
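The distinction above can be illustrated with a toy sketch in plain Python (no streaming framework; the records are made up): a batch query aggregates over the whole dataset, while a streaming query maintains an aggregate over a rolling window of the most recent records.

```python
from collections import deque

events = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical stream of numeric records

# Batch processing: one query over all (or most) of the data.
batch_total = sum(events)

# Stream processing: rolling sum over the 3 most recent records,
# updated as each record arrives.
window = deque(maxlen=3)
rolling_sums = []
for record in events:
    window.append(record)
    rolling_sums.append(sum(window))

print(batch_total)       # 31
print(rolling_sums[-1])  # 9 + 2 + 6 = 17
```

The batch answer is computed once over everything; the streaming answer is cheap to update per record but only ever describes the current window.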
Streaming App Challenges

Usability:
- Simple and flexible analytics
- Elastic - adapts to input surges and backpressure

Performance:
- Fast - ~1 s to 100 ms for the majority of apps
- Scalable - ~1M records/sec
- Available - low tolerance for record losses
“We are our choices...”
J.P. Sartre
Stream Processing Choices on AWS

                  Operations                                  Analytics
Storm             Zookeeper/Nimbus for HA                     SQL - 3rd party, roll your own
Kafka             Zookeeper (failure detection,               SQL - 3rd party, roll your own
                  partitioning, replication)
Druid             Zookeeper; multiple node roles              OLAP engine (JSON) on denormalized
                  scale independently                         data, real-time indexing
Kinesis           AWS service                                 SQL - Kinesis Analytics (in development)
Spark Streaming   EMR bootstraps latest 1.6, YARN,            SparkSQL on DataFrames, joins,
                  monitoring                                  Zeppelin notebooks
Components

Storage layer - ingest: record storing, ordering, strong consistency, and replayable reads.
Processing layer - analytics: consume data from the storage layer, run computations, remove processed data from storage.
Real-Time Streaming Data Ingestion
Custom-built streaming applications (KCL)
Inexpensive: $0.014 per 1,000,000 PUT Payload Units
Storage - Amazon Kinesis Streams
Kinesis stream, per shard: 1 MB/sec in, 2 MB/sec out
Each record < 1 MB
PutRecords(): up to 500 records (5 MB) per call
Increased retention: up to 7 days
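Given those per-shard limits, a back-of-the-envelope shard count for a stream can be sketched in plain Python (the input rates in the example call are hypothetical):

```python
import math

# Per-shard Kinesis limits (MB/sec), as listed above.
SHARD_IN_MBPS = 1.0
SHARD_OUT_MBPS = 2.0

def shards_needed(ingress_mbps: float, egress_mbps: float) -> int:
    """Minimum shard count satisfying both ingress and egress throughput."""
    return max(
        math.ceil(ingress_mbps / SHARD_IN_MBPS),
        math.ceil(egress_mbps / SHARD_OUT_MBPS),
        1,
    )

print(shards_needed(3.5, 4.0))  # ingress-bound: ceil(3.5 / 1) = 4
```

Whichever of the two ratios is larger dictates the shard count; scaling up is then a matter of splitting shards.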
Processing - Spark Streaming

[Diagram: input data streams -> receivers -> Spark job -> results published to destinations]

RDD = Resilient Distributed Dataset
DStream = collection of RDDs
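A DStream is essentially the stream chopped into per-interval batches, each backed by an RDD. A framework-free Python sketch of that chopping (timestamps and records are made up):

```python
from collections import defaultdict

# (arrival_time_sec, record) pairs - a hypothetical input stream.
stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

BATCH_INTERVAL = 1.0  # seconds, like Spark Streaming's batch interval

# Chop the stream into micro-batches keyed by interval index;
# each batch plays the role of one RDD in a DStream.
batches = defaultdict(list)
for t, record in stream:
    batches[int(t // BATCH_INTERVAL)].append(record)

print(dict(batches))  # {0: ['a', 'b'], 1: ['c'], 2: ['d', 'e']}
```

Each per-interval list is then handed to the processing layer as one unit, which is why latency is bounded below by the batch interval.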
Spark Streaming - Long-Running Spark App

[Diagram: the driver program holds a StreamingContext (wrapping a SparkContext) and submits Spark jobs to process received data. One worker node runs an executor with a long-running receiver task consuming the input stream; the other worker nodes run executors whose tasks process the data and emit output batches.]
Analytics - DataFrames on Streaming Data
• KCL - Kinesis Client Library (helps take data off Kinesis)
• Spark Streaming uses the KCL to read data from Kinesis and form a DStream (pull mechanism)
• DataFrames are then created from the DStream in Spark Streaming
Kinesis and Spark Streaming
Full Kinesis + Spark Pipeline
What About Analytics?
What operations are possible? Filter, GroupBy, Join, window operations.
Not all queries make sense to run on the stream. Large joins on RDDs in DStreams can be expensive.
Spark Streaming - Operations on DStreams: Window Operations
Query the Data in DStreams?
This is all great, but I’d like to query my data!
StreamingContext > DStream (RDDs) > DataFrame
Each DataFrame is converted to a temporary table and queried with SQL through a HiveContext.
Example: Querying DStreams with SQL
Courtesy of Amo Abeyarante, AWS Big Data Blog
Setup
1. Kinesis stream with data provided by a Python script
2. KCL Scala app launched as a Spark job:
   • Checks the number of shards and instantiates the same number of streams
   • Receives data from Kinesis in small batches
   • Creates a DataFrame and registers it as a temp table
   • Creates a HiveContext
3. Use a Hive app to query the data
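The Spark/Hive setup above needs a cluster to run, but the core pattern - load each micro-batch into a table, then query it with SQL - can be sketched with Python's stdlib sqlite3 in place of Spark's temp-table registration (the records and schema here are invented for illustration):

```python
import sqlite3

# A hypothetical micro-batch pulled from the stream.
micro_batch = [("ad1", 3), ("ad2", 7), ("ad1", 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impressions (ad_id TEXT, clicks INTEGER)")
# Register the batch as a queryable table (the registerTempTable analogue).
conn.executemany("INSERT INTO impressions VALUES (?, ?)", micro_batch)

# Query the batch with SQL, as the Hive app does against the temp table.
rows = conn.execute(
    "SELECT ad_id, SUM(clicks) FROM impressions GROUP BY ad_id ORDER BY ad_id"
).fetchall()
print(rows)  # [('ad1', 5), ('ad2', 7)]
```

In the real pipeline this load-and-query cycle repeats once per micro-batch, so each SQL result reflects only the latest batch unless you accumulate state elsewhere.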
Demo – Querying Streams
Analytics – Choosing Where to Join Data
Option 1: join the data in a custom KCL app - denormalize and publish to another Kinesis stream.
Option 2: join the streaming data using DStreams.
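Either way, a streaming join is a keyed merge of two batches. A minimal plain-Python sketch of inner-joining one micro-batch of clicks against one of impressions by key (the data is hypothetical; DStream.join has the same semantics per batch):

```python
# Hypothetical per-batch records: (ad_id, payload)
impressions = [("ad1", "banner"), ("ad2", "video")]
clicks = [("ad1", 3), ("ad1", 1), ("ad3", 2)]

# Build a key -> values index for one side, then probe with the other
# (a hash join, which is what makes large joins on RDDs expensive at scale).
imp_by_key = {}
for ad_id, creative in impressions:
    imp_by_key.setdefault(ad_id, []).append(creative)

joined = [
    (ad_id, (creative, n))
    for ad_id, n in clicks
    for creative in imp_by_key.get(ad_id, [])
]
print(joined)  # [('ad1', ('banner', 3)), ('ad1', ('banner', 1))]
```

Doing this denormalization once in a KCL app and republishing avoids repeating the join in every downstream consumer.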
Amazon Kinesis + Spark on EMR

[Diagram: Producers 1..N write to a Kinesis stream with Shard1 and Shard2. On EMR, YARN Executor 1 runs KCL Worker 1 with Receiver 1 and RecordProcessor 1 and 2 (one per shard); YARN Executor 2 runs processing tasks.]
Create DStream to Scale Out
from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

kinesisStream = KinesisUtils.createStream(
    streamingContext,
    [Kinesis app name],       # also names the KCL's DynamoDB checkpoint table
    [Kinesis stream name],
    [endpoint URL],
    [region name],
    [initial position],       # e.g. InitialPositionInStream.LATEST
    [checkpoint interval],
    StorageLevel.MEMORY_AND_DISK_2)
Amazon Kinesis + Spark on EMR (scaled out)

[Diagram: the same producers and shards, but now YARN Executor 1 runs KCL Worker 1 / Receiver 1 / RecordProcessor 1, and YARN Executor 2 runs KCL Worker 2 / Receiver 2 / RecordProcessor 2 - one receiver per shard.]
Scaling Kinesis
• Can accumulate data at any rate, but input batching is needed for high rates of small messages to optimize cost
• Scales input by splitting shards
• Never "pressures" Spark - Spark and the KCL pull the data

Scaling EMR/Spark
• Scales by adding task nodes - these can be EC2 Spot instances
• YARN can be configured for "dynamic resource allocation" with a variable number of executors per app - the new default for the upcoming EMR 4.4 release. Works well for batch, but not always for streaming.
• Automatic - same number of receivers (in case of shard split/merge operations)
• Manual (app restart) - if you need to change the number of receivers
Stability in Spark Streaming

Example 1: Tb (batch interval) = 4 s, Tp (processing time) = 2 s - each batch finishes before the next one arrives.
Example 2: Tb (batch interval) = 4 s, Tp (processing time) = 5 s - batches queue up behind each other.

Stable: Tp <= Tb
Unstable: Tp > Tb

In the unstable state, the scheduling delay grows with every batch.
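The growth of the scheduling delay when Tp > Tb can be sketched numerically in plain Python (the intervals match the 4 s / 5 s example above):

```python
def scheduling_delays(tb: float, tp: float, n_batches: int) -> list:
    """Delay between each batch's arrival and the start of its processing."""
    delays = []
    free_at = 0.0  # time at which the processor becomes free
    for i in range(n_batches):
        arrival = i * tb           # batches arrive every tb seconds
        start = max(arrival, free_at)
        delays.append(start - arrival)
        free_at = start + tp       # each batch takes tp seconds to process
    return delays

print(scheduling_delays(4.0, 2.0, 4))  # [0.0, 0.0, 0.0, 0.0] - stable
print(scheduling_delays(4.0, 5.0, 4))  # [0.0, 1.0, 2.0, 3.0] - delay grows
```

With Tp > Tb the delay increases by (Tp - Tb) per batch without bound, which is why the best-practices list below insists on Tp < Tb.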
Spark Backpressure Feature
After every micro-batch finishes, its statistics are used to estimate the processing rate.
A PID (proportional-integral-derivative) controller estimates the maximum ingest rate the system can sustain (rows/sec) and limits the ingest accordingly.

SparkConf:
spark.streaming.backpressure.enabled = true
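Spark's actual rate estimator has proportional, integral, and derivative terms; here is a deliberately simplified proportional-only sketch of the idea, with invented numbers, just to show how the ingest limit is pulled toward the measured processing rate:

```python
def next_rate(current_rate: float, processed: int, processing_secs: float,
              k_p: float = 1.0) -> float:
    """Proportional-only sketch of a backpressure rate estimator.

    processed / processing_secs is the rate the system actually sustained
    in the last batch; the controller moves the ingest limit toward it.
    """
    measured_rate = processed / processing_secs
    error = current_rate - measured_rate   # positive when we over-ingested
    return max(current_rate - k_p * error, 0.0)

# If we ingested at 1000 rows/sec but only processed 800 rows/sec,
# the next batch's ingest limit drops toward 800.
print(next_rate(1000.0, 4000, 5.0))  # 800.0
```

The integral and derivative terms in the real controller smooth this correction and react to the trend of the error, but the feedback loop is the same: measure, compare, throttle.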
Analytics on Streaming Data
Streaming analytics is here today, but requires some work. Major advancements are coming soon in Kinesis Analytics and Spark 2.0.
A lot of analytics can be done simply in a custom KCL app (moving averages, joins, filters, etc.).

Flexibility vs. Performance
Streaming Best Practices Summary
1. Total processing time is less than the batch interval (Tp < Tb).
2. Load is well balanced - the number of receivers is a multiple of the number of executors.
3. Spark Streaming reading from Kinesis defaults to a 1 sec interval.
4. Enable Spark checkpoints for reliable (at-least-once) semantics. Use Spark 1.6 with EMRFS for S3.
5. Give streaming apps different names to avoid them sharing the same DynamoDB table.
Dmitri Tchikatilov
Digital Advertising, [email protected]