
Page 1: Processing Streams of Data with Amazon Kinesis

Processing Streams of Data with Amazon Kinesis (and other tools)

Michael Hanisch, Solutions Architect - Amazon Web Services

Page 2: Processing Streams of Data with Amazon Kinesis

Amazon Kinesis

  Managed service for real-time big data processing
  Create Streams to produce & consume data
  Elastically add and remove Shards for performance
  Use the Kinesis Client Library, AWS Lambda, Apache Spark, and Apache Storm to process data
  Integration with S3, Redshift, and DynamoDB

[Diagram: the AWS platform categories (Compute, Storage, Database, App Services, Deployment & Administration, Networking, Analytics) on top of the AWS Global Infrastructure, with Analytics highlighted.]

Page 3: Processing Streams of Data with Amazon Kinesis

[Diagram: Amazon Kinesis Dataflow. Data sources put events to the AWS endpoint, which distributes them across Shard 1 … Shard N, replicated across multiple Availability Zones. Consuming applications read the stream: App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), App.4 (Machine Learning), emitting results to S3, DynamoDB, and Redshift.]

Page 4: Processing Streams of Data with Amazon Kinesis

[Diagram: an example metering architecture with the components Billing Auditors, Incremental Bill Computation, Metering Archive, and Billing Management Service connected through a Kinesis stream.]

Example Architecture - Metering

Page 5: Processing Streams of Data with Amazon Kinesis

Streams: Named event streams of data

Shards: You scale Kinesis streams by adding or removing Shards. Each Shard ingests up to 1 MB/sec of data and up to 1,000 TPS. All data is stored for 24 hours (up to 7 days on request).

Partition Key: Identifier used for ordered delivery & partitioning of data across Shards

Sequence Number: Number of an event as assigned by Kinesis

Amazon Kinesis Components
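To make these components concrete, here is a minimal sketch, assuming Python with boto3 and default AWS credentials; the stream name and shard count are illustrative, not from the deck:

    import boto3

    kinesis = boto3.client("kinesis")

    # Create a stream with 2 shards (about 2 MB/sec aggregate ingest capacity).
    kinesis.create_stream(StreamName="example-stream", ShardCount=2)

    # Wait until the stream becomes ACTIVE before writing to it.
    kinesis.get_waiter("stream_exists").wait(StreamName="example-stream")

    # List the shards; each shard owns a range of the partition-key hash space.
    description = kinesis.describe_stream(StreamName="example-stream")
    for shard in description["StreamDescription"]["Shards"]:
        print(shard["ShardId"], shard["HashKeyRange"])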

Page 6: Processing Streams of Data with Amazon Kinesis

Getting Data In

Page 7: Processing Streams of Data with Amazon Kinesis

  Producers use a PUT call to store data in a Stream
  A Partition Key is used to distribute the PUTs across Shards
  A unique Sequence Number is returned to the Producer for each event
  Data can be ingested at 1 MB/second or 1,000 transactions/second per Shard
  Up to 1 MB per event

[Diagram: many Producers putting data into Kinesis, which distributes it across Shard 1, Shard 2, Shard 3, Shard 4, … Shard n.]

Kinesis - Ingesting Fast Moving Data
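As a rough sketch of that PUT path (boto3 assumed again; the stream name, payload, and partition key are made up), a single-event write and the sequence number Kinesis returns look like this:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # PUT one event; the partition key determines which shard receives it,
    # and events sharing a partition key are ordered within their shard.
    response = kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps({"user": "alice", "action": "click"}).encode("utf-8"),
        PartitionKey="alice",  # hashed to pick a shard
    )

    # Kinesis acknowledges the write with the shard and a unique sequence number.
    print(response["ShardId"], response["SequenceNumber"])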

Page 8: Processing Streams of Data with Amazon Kinesis

  Native code module to perform efficient writes to multiple Kinesis Streams
  C++ / Boost
  Asynchronous execution
  Configurable aggregation of events

Introducing the Kinesis Producer Library

[Diagram: My Application hands events to the KPL daemon, which asynchronously issues PutRecord(s) calls against multiple Kinesis Streams.]

Page 9: Processing Streams of Data with Amazon Kinesis

KPL Aggregation

[Diagram: My Application hands small events (e.g. 100 KB, 20 KB, 500 KB, 200 KB, 40 KB) to the KPL daemon, which aggregates them, wrapped in a Protobuf header and footer, into records up to the 1 MB maximum event size before asynchronously calling PutRecord(s) on the Kinesis Stream.]
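The KPL's protobuf aggregation format is handled for you, but the underlying idea (batch many small events into fewer, larger PUTs) can be sketched against the plain PutRecords API. Everything below (names, sizes, payloads) is illustrative, not the KPL wire format:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Collect small events and ship them in one PutRecords call (up to 500
    # records per call), instead of one HTTP round trip per event.
    events = [{"id": i, "payload": "x" * 100} for i in range(200)]

    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["id"]),
        }
        for event in events
    ]

    response = kinesis.put_records(StreamName="example-stream", Records=records)

    # PutRecords is not all-or-nothing: re-send any records that failed.
    if response["FailedRecordCount"] > 0:
        retries = [
            record
            for record, result in zip(records, response["Records"])
            if "ErrorCode" in result
        ]
        kinesis.put_records(StreamName="example-stream", Records=retries)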

Page 10: Processing Streams of Data with Amazon Kinesis

Apache Flume: Source & Sink (https://github.com/pdeyhim/flume-kinesis)

FluentD: Dynamic partitioning support (https://github.com/awslabs/aws-fluent-plugin-kinesis)

Log4J & Log4Net: Included in the Kinesis samples

Logstash: Plugins to use Kinesis as input or output

Kinesis Ecosystem - Ingest

Page 11: Processing Streams of Data with Amazon Kinesis

Getting Data Out

Page 12: Processing Streams of Data with Amazon Kinesis

  KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support

  All State Management in DynamoDB

Kinesis Client Library

[Diagram: KCL workers tracking their state in a DynamoDB table.]

Page 13: Processing Streams of Data with Amazon Kinesis

Distributed Event Processing Platform

  Stateless JavaScript, Java & Python functions run against an Event Stream
  AWS SDK built in
  Configure RAM and execution timeout (up to 5 minutes)
  Functions automatically invoked against a Shard
  Invoked once per event or in batches
  Access to underlying filesystem
  Call other Lambda Functions

Consuming Data - AWS Lambda

[Diagram: Lambda functions invoked against each of Shard 1 … Shard n of a Kinesis stream.]
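A minimal sketch of such a Lambda consumer in Python (the processing logic is a stand-in): Kinesis invokes the handler with a batch of records whose payloads arrive base64-encoded:

    import base64
    import json

    def handler(event, context):
        # Lambda delivers a batch of records from one shard per invocation.
        for record in event["Records"]:
            # Record payloads are base64-encoded bytes.
            payload = base64.b64decode(record["kinesis"]["data"])
            data = json.loads(payload)
            print(record["kinesis"]["sequenceNumber"], data)  # stand-in for real work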

Page 14: Processing Streams of Data with Amazon Kinesis

[Diagram: Shards 1 … n of a Kinesis stream assigned to KCL Workers 1 … n, which run across multiple EC2 instances.]

Client library for fault-tolerant, at-least-once, real-time processing

  Kinesis Client Library (KCL) simplifies reading from the stream by abstracting your code from individual shards
  Automatically starts a Worker Thread for each Shard
  Increases and decreases Thread count as the number of Shards changes
  Uses checkpoints to keep track of a Thread’s location in the stream
  Restarts Threads & Workers if they fail

Consuming Data - Kinesis Enabled Applications
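To show what the KCL automates, here is a deliberately simplified single-shard polling loop against the raw API (boto3 assumed; stream and shard names are illustrative). A real KCL worker adds leases, checkpoints in DynamoDB, and multi-shard coordination on top of this:

    import time
    import boto3

    kinesis = boto3.client("kinesis")

    # Read one shard from its oldest available record (TRIM_HORIZON).
    iterator = kinesis.get_shard_iterator(
        StreamName="example-stream",
        ShardId="shardId-000000000000",
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            print(record["SequenceNumber"], record["Data"])  # stand-in for real work
            # The KCL would checkpoint this sequence number in DynamoDB here.
        iterator = batch["NextShardIterator"]
        time.sleep(1)  # stay under the per-shard read limits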

Page 15: Processing Streams of Data with Amazon Kinesis

Analytics Tooling Integration (github.com/awslabs/amazon-kinesis-connectors)

S3: Batch write files for archive into S3; sequence-based file naming

Redshift: Once written to S3, load to Redshift; manifest support; user-defined transformers

DynamoDB: BatchPut append to table; user-defined transformers

Elasticsearch: Automatically index stream contents

Kinesis Connectors

[Diagram: Kinesis fanning out through connectors to S3, DynamoDB, Redshift, and Amazon Elasticsearch Service.]

Page 16: Processing Streams of Data with Amazon Kinesis

Connectors Architecture

Page 17: Processing Streams of Data with Amazon Kinesis

Fully managed ingestion of stream data into Amazon S3 or Redshift

•  Set up a Firehose to ingest data straight into S3 and/or Redshift
•  Takes care of buffering & batching of your data (based on time or size), encryption, and compression
•  Scales the stream automatically; no need to take care of Shards

Kinesis Firehose

[Diagram: Kinesis Firehose delivering the stream into S3 and Redshift.]
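A hedged sketch of the producer side (boto3 assumed; the delivery stream name and payload are made up): you write records to the Firehose and it handles buffering, batching, and delivery:

    import json
    import boto3

    firehose = boto3.client("firehose")

    # Firehose buffers records by size or time and delivers them to S3/Redshift;
    # there are no shards to manage on this path.
    payload = json.dumps({"metric": "latency", "value": 42}) + "\n"
    firehose.put_record(
        DeliveryStreamName="example-firehose",
        Record={"Data": payload.encode("utf-8")},
    )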

Page 18: Processing Streams of Data with Amazon Kinesis

Apache Hive (on EMR): Breaks the stream into iterations; iterations are represented as partitions of external tables in Hive, for batch processing using Hive-QL

Presto: Near-real-time exploration of a stream using SQL

Kinesis Ecosystem - EMR

Page 19: Processing Streams of Data with Amazon Kinesis

Apache Storm Kinesis Spout: Real-time processing of individual events; automatic checkpointing with Zookeeper
https://github.com/awslabs/kinesis-storm-spout

Kinesis Ecosystem - Storm

[Diagram: Kinesis feeding Storm.]

Page 20: Processing Streams of Data with Amazon Kinesis

Apache Spark Streaming: Presents slices of the stream's data as frames; micro-batches (usually between 0.5 s and 5 s windows); same programming & processing model as used for batch data!

Kinesis Ecosystem - Spark

Page 21: Processing Streams of Data with Amazon Kinesis

Apache Spark DStream: Receiver runs the KCL; one DStream per Shard; checkpointed via the KCL

Spark natively available on EMR: EMRFS overlay on HDFS; AMI 4.x (or AMI 3.8.0)
https://aws.amazon.com/elasticmapreduce/details/spark

Kinesis Ecosystem - Spark

Page 22: Processing Streams of Data with Amazon Kinesis

Kinesis – Spark Streaming Example

    KinesisUtils.createStream("twitter-stream")   // simplified: the real call also takes the StreamingContext, app name, endpoint, region, initial position, and storage level
      .filter(_.getText.contains("Open-Source"))  // keep only tweets mentioning "Open-Source"
      .countByWindow(Seconds(5), Seconds(5))      // tweet count per 5-second window

Counting tweets on a sliding window

Page 23: Processing Streams of Data with Amazon Kinesis

Kinesis – Spark Pipeline

Page 24: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Durability

Kinesis: Regional service; synchronous writes to multiple AZs; extremely high durability

?: May be in-memory for performance; requires understanding disk-sync semantics; user-managed replication; replication lag -> RPO

Page 25: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Performance

Kinesis: Perform continual processing on streaming big data. Processing latencies fall to under 1 second, compared with the minutes or hours associated with batch processing

?: Processing latencies under 1 second depend on CPU & disk performance; a cluster interruption means a processing outage

Page 26: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Availability

Kinesis: Regional service; synchronous writes to multiple AZs; extremely high durability; AZ, networking, and chain-server issues are transparent to Producers & Consumers

?: Many alternatives depend on a CP database, and a lost quorum can result in failure or inconsistency of the cluster; the highest achievable availability is determined by the availability of cross-AZ links, or the availability of an AZ

Page 27: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Operations

Kinesis: Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest

?: Build instances; install software; operate the cluster; manage disk space; manage replication; migrate to a new stream on scale-up

Page 28: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Elasticity

Kinesis: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs

?: Fixed partition count up front; maximum performance of roughly 1 partition per core or machine; converting from one stream to another to scale requires application reconfiguration

Page 29: Processing Streams of Data with Amazon Kinesis

Scaling Streams

https://github.com/awslabs/amazon-kinesis-scaling-utils
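The scaling-utils project automates resharding; underneath, scaling comes down to the SplitShard and MergeShards APIs, sketched here with boto3 (shard IDs and the hash key are illustrative):

    import boto3

    kinesis = boto3.client("kinesis")

    # Scale up: split a hot shard in two at a chosen point in its hash-key range.
    kinesis.split_shard(
        StreamName="example-stream",
        ShardToSplit="shardId-000000000000",
        NewStartingHashKey=str(2**127),  # midpoint of the full hash-key space
    )

    # Scale down: merge two adjacent shards back into one.
    kinesis.merge_shards(
        StreamName="example-stream",
        ShardToMerge="shardId-000000000001",
        AdjacentShardToMerge="shardId-000000000002",
    )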

Page 30: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Cost

Kinesis: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use. Scale up/down dynamically. $0.015/hour per 1 MB/sec Shard

?: Run your own EC2 instances; multi-AZ configuration for increased durability; utilise instance auto-scaling on Worker lag from HEAD with custom metrics

Page 31: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Cost

  Price dropped on 2nd June 2015, restructured to support the KPL
  Old pricing: $0.028 / 1M records PUT
  New pricing: $0.014 / 1M 25 KB "Payload Units"

Scenario: 50,000 events/second, 512 B/event = 24.4 MB/second

                 Old Pricing                New Pricing + KPL
                 Units            Cost      Units                  Cost
    Shards       50               $558      25                     $279
    PutRecords   4,320M Records   $120.96   2,648M Payload Units   $37.50
    Total                         $678.96                          $316.50

Page 32: Processing Streams of Data with Amazon Kinesis

Page 33: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: Shards 1-5 and a single Worker.]

Page 34: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: the Worker starts one RecordProcessor per shard (RecordProcessor for Shard 1 through Shard 5).]

Page 35: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: a second Worker joins while the first Worker still runs the RecordProcessors for Shards 1-5.]

Page 36: Processing Streams of Data with Amazon Kinesis

Page 37: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: animation steps repeating the previous slide as shard leases are redistributed between the two Workers.]
Page 38: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: Shards 1-5 consumed by four Workers running in an Auto Scaling group.]

KCL Workers emit CloudWatch metrics you can use to auto-scale; in general, though, scaling on CPU utilization should be sufficient.

Page 39: Processing Streams of Data with Amazon Kinesis

•  Be distributed, to handle multiple shards
•  Be fault tolerant, to handle failures in hardware or software
•  Scale up and down as the number of shards increases or decreases

KCL Applications & Autoscaling

•  Auto Scaling policies will restart EC2 instances if they fail

•  Automatically add EC2 instances when load increases

•  KCL will automatically redistribute Workers to use the new EC2 instances

Page 40: Processing Streams of Data with Amazon Kinesis

Kinesis – Consumer Application Best Practices

  Tolerate failure of: Threads (consider event serialisation issues and lease stealing); Hardware (Auto Scaling may add nodes as needed)
  Scale Consumers up and down as the number of Shards increases or decreases
  Don’t store data in memory in the Workers; use an elastic data store such as DynamoDB for state
  Elastic Beanstalk provides all best practices in a simple-to-deploy, multi-version application container
  KCL will automatically redistribute Workers to use new instances
  Logic implemented in Lambda doesn’t require any servers at all!

Page 41: Processing Streams of Data with Amazon Kinesis

Managing Application State

Page 42: Processing Streams of Data with Amazon Kinesis

Consumer Local State Anti-Pattern

  Consumer binds to a configured number of Partitions

  Consumer stores the ‘state’ of a data structure, as defined by the event stream, on local storage

  Read API can access that local storage as a ‘shard’ of the overall database

?: [Diagram: Partitions 1 … P split across two Consumers; each Consumer keeps the state for its partitions (P1..P/2 and P/2+1..P) on local disk, and the Read API reads from that local storage.]

Page 43: Processing Streams of Data with Amazon Kinesis

Consumer Local State Anti-Pattern

But what happens when an instance fails?

?: [Diagram: the same layout as above, with one Consumer and its local disk failed.]

Page 44: Processing Streams of Data with Amazon Kinesis

Consumer Local State Anti-Pattern

  A new Consumer process starts up for the required Partitions
  The Consumer must read from the beginning of the Stream to rebuild local storage
  Complex, error-prone, user-constructed software
  Long startup time

?: [Diagram: the replacement Consumer re-reads the stream from T0 to THead to rebuild local storage for its partitions before the Read API can use it.]

Page 45: Processing Streams of Data with Amazon Kinesis

External Highly Available State – Best Practice

  Each Consumer binds to an even share of the Shards, based on the number of Consumers
  Consumer stores the ‘state’ in DynamoDB
  DynamoDB is highly available, elastic & durable
  Read API can access DynamoDB

[Diagram: Shards 1 … S split across two Consumers, which write state to DynamoDB; the Read API reads from DynamoDB.]
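A minimal sketch of such externalized state (boto3 assumed; the table name "stream-state" and its hash key "metric" are made up): each Consumer folds events into DynamoDB counters instead of local memory, so a failed worker loses nothing:

    import boto3

    table = boto3.resource("dynamodb").Table("stream-state")

    def apply_event(event):
        # Fold the event into durable, highly available state in DynamoDB.
        # ADD is atomic, so concurrent Consumers can update the same item safely.
        table.update_item(
            Key={"metric": event["metric"]},
            UpdateExpression="ADD event_count :one, total :val",
            ExpressionAttributeValues={":one": 1, ":val": event["value"]},
        )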

Page 46: Processing Streams of Data with Amazon Kinesis

External Highly Available State – Best Practice

(bullets repeated from the previous slide)

[Diagram: the same architecture, redrawn with the Read API served entirely from DynamoDB.]

Page 47: Processing Streams of Data with Amazon Kinesis

External Highly Available State – Best Practice

(bullets repeated from the previous slide)

[Diagram: the same architecture with AWS Lambda as the Consumer: Lambda reads Shards 1 … S and stores state in DynamoDB, which serves the Read API.]

Page 48: Processing Streams of Data with Amazon Kinesis

Idempotency

Page 49: Processing Streams of Data with Amazon Kinesis

Property of a system whereby the repeated application of a function to a single input results in the same end state of the system

Exactly Once Processing

Page 50: Processing Streams of Data with Amazon Kinesis

Idempotency – Writing Data

  The Kinesis SDK & KPL may retry a PUT in certain circumstances
  A Kinesis Record acknowledged with a Sequence Number is durable across multiple Availability Zones…
  But there could be a duplicate entry

[Diagram: My Application calls PutRecord(s); the record reaches storage, but the response is a 403 (Endpoint Redirect).]

Page 51: Processing Streams of Data with Amazon Kinesis

Idempotency – Writing Data

  The Kinesis SDK & KPL may retry a PUT in certain circumstances
  A Kinesis Record acknowledged with a Sequence Number is durable across multiple Availability Zones…
  But there could be a duplicate entry

[Diagram: the first PutRecord(s) is written to storage (Write 123) but answered with a 403 (Endpoint Redirect); the retried PutRecord(s) writes 123 again and returns 200 (OK), leaving a duplicate.]

Page 52: Processing Streams of Data with Amazon Kinesis

Coming Soon…

Page 53: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record IDs in DynamoDB
  Record IDs are user-based
  Duplicates in the storage tier will be acknowledged as successful

[Diagram: My Application calls PutRecord(s) toward storage; DynamoDB holds an X-hour rolling window of Record IDs.]

Page 54: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated from the previous slide)

[Diagram step 1: before the record is stored, its Record ID is written to the DynamoDB rolling window (Write Record ID -> OK).]

Page 55: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated)

[Diagram step 2: the Record ID write succeeds and the record is stored, but the PutRecord(s) response is a 403 (Endpoint Redirect).]

Page 56: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated)

[Diagram step 3: the retried PutRecord(s) attempts to write the Record ID again; DynamoDB rejects it with a ConditionCheckFailedException, so Kinesis knows the record is a duplicate.]

Page 57: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated)

[Diagram step 4: the retry is acknowledged to the application with 200 (OK), so the producer sees success without a second copy being created.]

Page 58: Processing Streams of Data with Amazon Kinesis

  Easy Administration
  Real-time Performance
  High Durability
  High Throughput
  Elastic
  S3, Redshift & DynamoDB Integration
  Large Ecosystem
  Low Cost

In Short…

Page 59: Processing Streams of Data with Amazon Kinesis