deep dive – amazon kinesis · 2015-07-10 · analytics amazon kinesis managed service for real...

48
Deep Dive – Amazon Kinesis Guy Ernest, Solution Architect - Amazon Web Services @guyernest

Upload: others

Post on 13-Apr-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Deep Dive – Amazon Kinesis

Guy Ernest, Solution Architect - Amazon Web Services @guyernest

Page 2: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Analytics

Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add and Remove Shards for Performance Use Kinesis Worker Library, AWS Lambda, Apache Spark and Apache Storm to Process Data Integration with S3, Redshift and Dynamo DB

Compute   Storage  

AWS  Global  Infrastructure  

Database  

App  Services  

Deployment  &  Administra=on  

Networking  

Analy=cs  

Page 3: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

 Data  Sources  

App.4    

[Machine  Learning]  

                             

     AW

S  En

dpoint  

App.1    

[Aggregate  &  De-­‐Duplicate]  

 Data  Sources  

Data  Sources  

 Data  Sources  

App.2    

[Metric  Extrac=on]  

S3

DynamoDB

Redshift

App.3  [Sliding  Window  Analysis]  

 Data  Sources  

Shard 1

Shard 2

Shard N

Availability Zone

Availability Zone

Amazon Kinesis Dataflow

Availability Zone

Page 4: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Billing Auditors

Incremental Bill

Computation

Metering Archive

Billing Management

Service

Example Architecture - Metering

Page 5: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Streams Named Event Streams of Data All data is stored for 24 hours

Shards You scale Kinesis streams by adding or removing Shards Each Shard ingests up to 1MB/sec of data and up to 1000 TPS

Partition Key Identifier used for Ordered Delivery & Partitioning of Data across Shards

Sequence Number of an event as assigned by Kinesis

Amazon Kinesis Components

Page 6: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Getting Data In

Page 7: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

  Producers use a PUT call to store data in a Stream

  A Partition Key is used to distribute the PUTs across Shards

  A unique Sequence # is returned to the Producer for each Event

  Data can be ingested at 1MB/second or 1000 Transactions/second per Shard

  1MB / Event

Producer

Shard 1

Shard 2

Shard 3

Shard n

Shard 4

Producer

Producer

Producer

Producer

Producer

Producer

Producer

Producer

Kinesis

Kinesis - Ingesting Fast Moving Data

Page 8: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

  Native Code Module to perform efficient writes to Multiple Kinesis Streams

  C++/Boost   Asynchronous Execution   Configurable Aggregation of Events

Introducing the Kinesis Producer Library

My Application KPL Daemon

PutRecord(s)

Kinesis Stream

Kinesis Stream

Kinesis Stream

Kinesis Stream

Async

Page 9: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

KPL Aggregation

My Application KPL Daemon

PutRecord(s)

Kinesis Stream

Kinesis Stream

Kinesis Stream

Kinesis Stream

Async

1MB Max Event Size

Aggregate

100k 20k 500k 200k

40k 20k 40k

500k 100k 200k 20k

40

k

40

k

20k

Protobuf Header Protobuf Footer

Page 10: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Apache Flume Source & Sink https://github.com/pdeyhim/flume-kinesis

FluentD Dynamic Partitioning Support https://github.com/awslabs/aws-fluent-plugin-kinesis

Log4J & Log4Net Included in Kinesis Samples

Kinesis Ecosystem - Ingest

Kinesis

Page 11: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Best Practices for Partition Key

•  Random will give even distribution •  If events should be processed together, choose

a relevant high cardinality partition key and monitor shard distribution

•  If partial order is important use sequence number

Page 12: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Getting Data Out

Page 13: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

  KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support

  All State Management in Dynamo DB

Kinesis Client Library

DynamoDB

Page 14: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Shard 1

Shard 2

Shard 3

Shard n

Shard 4

KCL Worker 1

KCL Worker 2

EC2 Instance

KCL Worker 3

KCL Worker 4

EC2 Instance

KCL Worker n

EC2 Instance

Kinesis

Client library for fault-tolerant, at least-once, real-time processing

  Kinesis Client Library (KCL) simplifies reading from the stream by abstracting your code from individual shards

  Automatically starts a Worker Thread for each Shard

  Increases and decreases Thread count as number of Shards changes

  Uses checkpoints to keep track of a Thread’s location in the stream

  Restarts Threads & Workers if they fail

Consuming Data - Kinesis Enabled Applications

Page 15: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Analytics Tooling Integration (github.com/awslabs/amazon-kinesis-connectors)

S3 Batch Write Files for Archive into S3 Sequence Based File Naming

Redshift Once Written to S3, Load to Redshift Manifest Support User Defined Transformers

DynamoDB BatchPut Append to Table User Defined Transformers

ElasticSearch Automatically index Stream Contents

Kinesis Connectors

S3 Dynamo DB

Redshift

Kinesis

ElasticSearch

Page 16: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Connectors Architecture

Page 17: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Apache Storm Kinesis Spout Automatic Checkpointing with Zookeeper https://github.com/awslabs/kinesis-storm-spout

Kinesis Ecosystem - Storm

Storm

Kinesis

Page 18: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Apache Spark DStream Receiver runs KCL One DStream per Shard Checkpointed via KCL

Spark Natively Available on EMR

EMRFS overlay on HDFS AMI 3.8.0 https://aws.amazon.com/elasticmapreduce/details/spark

Kinesis Ecosystem - Spark

Page 19: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Distributed Event Processing Platform

  Stateless JavaScript & Java functions run against an Event Stream

  AWS SDK Built In   Configure RAM and Execution Timeout   Functions automatically invoked against a

Shard   Community libraries for Python & Go   Access to underlying filesystem for read/

write   Call other Lambda Functions

Consuming Data - AWS Lambda

Kinesis

Shard 1

Shard 2

Shard 3

Shard 4

Shard n

Page 20: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Durability

Regional Service Synchronous Writes to

Multiple AZ’s Extremely High Durability

?May be in-memory for

Performance Requirement to understand

Disk Sync Semantics User Managed Replication

Replication Lag -> RPO

Page 21: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Performance

Perform continual processing on streaming big data. Processing latencies

fall to a <1 second, compared with the minutes or hours

associated with batch processing

?Processing latencies < 1

second Based on CPU & Disk

Performance Cluster Interruption ->

Processing Outage

Page 22: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Availability

Regional Service Synchronous Writes to

Multiple AZ’s Extremely High Durability

AZ, Networking, & Chain Server Issues Transparent to

Producers & Consumers

?Many Depend on a CP

Database Lost Quorum can result in

failure/inconsistency of the cluster

Highest Availability is determined by Availability of Cross-AZ Links or Availability

of an AZ

Page 23: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Operations

Managed service for real-time streaming data

collection, processing and analysis. Simply create a

new stream, set the desired level of capacity, and let the

service handle the rest

?Build Instances Install Software Operate Cluster

Manage Disk Space Manage Replication

Migrate to new Stream on Scale Up

Page 24: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Elasticity

Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or

down based on your operational or business needs

?Fixed Partition Count up

Front Maximum Performance ~ 1

Partition/Core | Machine Convert from 1 Stream to

Another to Scale Application Reconfiguration

Page 25: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Scaling Streams

https://github.com/awslabs/amazon-kinesis-scaling-utils

Page 26: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Cost

Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only

for what you use. Scale Up/Down Dynamically

$.015/Hour/1MB

?Run your Own EC2 Instances

Multi-AZ Configuration for increased Durability

Utilise Instance AutoScaling on Worker Lag from HEAD

with Custom Metrics

Page 27: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Why Kinesis? Cost   Price Dropped on 2nd June 2015, Restructured to support KPL   Old Pricing: $.028 / 1M Records PUT   New Pricing: $.014/1M 25KB “Payload Units”

Units Cost Units Cost

Shards 50 $558 25 $279 PutRecords 4,320M

Records $120.96 2,648M

Payload Units

$37.50

$678.96 $316.50

Scenario: 50,000 Events / Second, 512B / Event = 24.4 MB/Second

Old Pricing New Pricing + KPL

Page 28: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add
Page 29: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Kinesis – Consumer Application Best Practices   Tolerate Failure of: Threads – Consider Data Serialisation issues and Lease Stealing;

Hardware – AutoScaling may add nodes as needed   Scale Consumers up and down as the number of Shards increase or decrease   Don’t store data in memory in the workers. Use an elastic data store such as Dynamo

DB for State   Elastic Beanstalk provides all Best Practices in a simple to

deploy, multi-version Application Container   KCL will automatically redistribute Workers to use new

Instances   Logic implemented in Lambda doesn’t require

any Servers at all!

Page 30: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Managing Application State

Page 31: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Consumer Local State Anti-Pattern

  Consumer binds to a configured number of Partitions

  Consumer stores the ‘state’ of a data structure, as defined by the event stream, on local storage

  Read API can access that local storage as a ‘shard’ of the overall database

?Consumer Consumer

Partition 1

Partition …

Partition P/2

Partition P/2+1

Partition …

Partition P

Local Disk Local Disk

Local Storage Partitions P1..P/2

Local Storage Partitions P/2+1..P

Read API

Page 32: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Consumer Local State Anti-Pattern

But what happens when an instance fails?

?Consumer Consumer

Partition 1

Partition …

Partition P/2

Partition P/2+1

Partition …

Partition P

Local Disk Local Disk

Local Storage Partitions P1..P/2

Local Storage Partitions P/2+1..P

Read API

Page 33: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Consumer Local State Anti-Pattern

  A new consumer process starts up for the required Partitions

  Consumer must read from the beginning of the Stream to rebuild local storage

  Complex, error prone, user constructed software

  Long Startup Time

?Consumer Consumer

Partition 1

Partition …

Partition P/2

Partition P/2+1

Partition …

Partition P

Local Disk Local Disk

Local Storage Partitions P1..P/2

Local Storage Partitions P/2+1..P

Read API

T0

THead

Page 34: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

External Highly Available State – Best Practice

Consumer Consumer

Shard 1

Shard …

Shard S/2

Shard S/2+1

Shard…

Shard S

Read API

  Consumer binds to a even number of Shards based on number of Consumers

  Consumer stores the ‘state’ in Dynamo DB

  Dynamo DB is Highly Available, Elastic & Durable

  Read API can access Dynamo DB

DynamoDB

Page 35: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

External Highly Available State – Best Practice

  Consumer binds to a even number of Shards based on number of Consumers

  Consumer stores the ‘state’ in Dynamo DB

  Dynamo DB is Highly Available, Elastic & Durable

  Read API can access Dynamo DB

Read API

DynamoDB

Shard 1

Shard …

Shard S/2

Shard S/2+1

Shard…

Shard S

Consumer Consumer

Page 36: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

External Highly Available State – Best Practice

Read API

DynamoDB

AWS Lambda

Shard 1

Shard …

Shard S/2

Shard S/2+1

Shard…

Shard S   Consumer binds to a even number of

Shards based on number of Consumers

  Consumer stores the ‘state’ in Dynamo DB

  Dynamo DB is Highly Available, Elastic & Durable

  Read API can access Dynamo DB

Page 37: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency

Page 38: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Property of a system whereby the repeated application of a function on a single input results in the same end state of the system

Exactly Once Processing

Page 39: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Writing Data

  The Kinesis SDK & KPL may retry PUT in certain circumstances

  Kinesis Record acknowledged with a Sequence Number is durable to Multiple Availability Zones…

  But there could be a duplicate entry

My Application PutRecord(s)

403 (Endpoint Redirect)

Sto

rage

Page 40: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Writing Data

  The Kinesis SDK & KPL may retry PUT in certain circumstances

  Kinesis Record acknowledged with a Sequence Number is durable to Multiple Availability Zones…

  But there could be a duplicate entry

My Application PutRecord(s)

PutRecord(s)

200 (OK)

403 (Endpoint Redirect)

Sto

rage

S

tora

ge

Write 123

Write 123

Page 41: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Coming Soon…

Page 42: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record ID’s in Dynamo DB   Record ID’s are User Based   Duplicates in storage tier will be acknowledged as Successful

My Application PutRecord(s)

Sto

rage

DynamoDB X Hour Rolling

Window

Page 43: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record ID’s in Dynamo DB   Record ID’s are User Based   Duplicates in storage tier will be acknowledged as Successful

My Application PutRecord(s)

Sto

rage

Write Record ID

OK DynamoDB

X Hour Rolling Window

Page 44: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record ID’s in Dynamo DB   Record ID’s are User Based   Duplicates in storage tier will be acknowledged as Successful

My Application PutRecord(s)

403 (Endpoint Redirect)

Sto

rage

Write Record ID

OK DynamoDB

X Hour Rolling Window

Page 45: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record ID’s in Dynamo DB   Record ID’s are User Based   Duplicates in storage tier will be acknowledged as Successful

My Application PutRecord(s)

PutRecord(s)

403 (Endpoint Redirect)

Sto

rage

S

tora

ge

Write Record ID

OK

Write Record ID

DynamoDB X Hour Rolling

Window

ConditionCheckFailedException

Page 46: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record ID’s in Dynamo DB   Record ID’s are User Based   Duplicates in storage tier will be acknowledged as Successful

My Application PutRecord(s)

PutRecord(s)

200 (OK)

403 (Endpoint Redirect)

Sto

rage

S

tora

ge

Write Record ID

OK

Write Record ID

DynamoDB X Hour Rolling

Window

ConditionCheckFailedException

Page 47: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Easy Administration

Real-time Performance.

High Durability

High Throughput. Elastic

S3, Redshift, & DynamoDB Integration

Large Ecosystem Low Cost

In Short…

Page 48: Deep Dive – Amazon Kinesis · 2015-07-10 · Analytics Amazon Kinesis Managed Service for Real Time Big Data Processing Create Streams to Produce & Consume Data Elastically Add

Amsterdam