
Page 1: Processing Streams of Data with Amazon Kinesis

Processing Streams of Data with Amazon Kinesis (and other tools)

Michael Hanisch, Solutions Architect - Amazon Web Services

Page 2: Processing Streams of Data with Amazon Kinesis

Amazon Kinesis

  Managed service for real-time big data processing
  Create Streams to produce & consume data
  Elastically add and remove Shards for performance
  Use the Kinesis Client Library, AWS Lambda, Apache Spark, and Apache Storm to process data
  Integration with S3, Redshift, and DynamoDB

[Diagram: the AWS platform categories (Compute, Storage, Database, App Services, Deployment & Administration, Networking, Analytics) on top of the AWS Global Infrastructure, with Analytics highlighted.]

Page 3: Processing Streams of Data with Amazon Kinesis

[Diagram: Amazon Kinesis Dataflow. Data sources put events to the AWS endpoint, which distributes them across Shard 1 … Shard N, replicated across multiple Availability Zones. Consuming applications read the stream: App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), App.4 (Machine Learning), emitting results to S3, DynamoDB, and Redshift.]

Page 4: Processing Streams of Data with Amazon Kinesis

[Diagram: an example metering architecture with the components Billing Auditors, Incremental Bill Computation, Metering Archive, and Billing Management Service connected through a Kinesis stream.]

Example Architecture - Metering

Page 5: Processing Streams of Data with Amazon Kinesis

Streams: Named event streams of data

Shards: You scale Kinesis streams by adding or removing Shards. Each Shard ingests up to 1 MB/sec of data and up to 1,000 TPS. All data is stored for 24 hours (up to 7 days on request).

Partition Key: Identifier used for ordered delivery & partitioning of data across Shards

Sequence Number: Number of an event as assigned by Kinesis

Amazon Kinesis Components
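To make these components concrete, here is a minimal sketch, assuming Python with boto3 and default AWS credentials; the stream name and shard count are illustrative, not from the deck:

    import boto3

    kinesis = boto3.client("kinesis")

    # Create a stream with 2 shards (about 2 MB/sec aggregate ingest capacity).
    kinesis.create_stream(StreamName="example-stream", ShardCount=2)

    # Wait until the stream becomes ACTIVE before writing to it.
    kinesis.get_waiter("stream_exists").wait(StreamName="example-stream")

    # List the shards; each shard owns a range of the partition-key hash space.
    description = kinesis.describe_stream(StreamName="example-stream")
    for shard in description["StreamDescription"]["Shards"]:
        print(shard["ShardId"], shard["HashKeyRange"])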

Page 6: Processing Streams of Data with Amazon Kinesis

Getting Data In

Page 7: Processing Streams of Data with Amazon Kinesis

  Producers use a PUT call to store data in a Stream
  A Partition Key is used to distribute the PUTs across Shards
  A unique Sequence Number is returned to the Producer for each event
  Data can be ingested at 1 MB/second or 1,000 transactions/second per Shard
  Up to 1 MB per event

[Diagram: many Producers putting data into Kinesis, which distributes it across Shard 1, Shard 2, Shard 3, Shard 4, … Shard n.]

Kinesis - Ingesting Fast Moving Data
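As a rough sketch of that PUT path (boto3 assumed again; the stream name, payload, and partition key are made up), a single-event write and the sequence number Kinesis returns look like this:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # PUT one event; the partition key determines which shard receives it,
    # and events sharing a partition key are ordered within their shard.
    response = kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps({"user": "alice", "action": "click"}).encode("utf-8"),
        PartitionKey="alice",  # hashed to pick a shard
    )

    # Kinesis acknowledges the write with the shard and a unique sequence number.
    print(response["ShardId"], response["SequenceNumber"])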

Page 8: Processing Streams of Data with Amazon Kinesis

  Native code module to perform efficient writes to multiple Kinesis Streams
  C++ / Boost
  Asynchronous execution
  Configurable aggregation of events

Introducing the Kinesis Producer Library

[Diagram: My Application hands events to the KPL daemon, which asynchronously issues PutRecord(s) calls against multiple Kinesis Streams.]

Page 9: Processing Streams of Data with Amazon Kinesis

KPL Aggregation

[Diagram: My Application hands small events (e.g. 100 KB, 20 KB, 500 KB, 200 KB, 40 KB) to the KPL daemon, which aggregates them, wrapped in a Protobuf header and footer, into records up to the 1 MB maximum event size before asynchronously calling PutRecord(s) on the Kinesis Stream.]
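The KPL's protobuf aggregation format is handled for you, but the underlying idea (batch many small events into fewer, larger PUTs) can be sketched against the plain PutRecords API. Everything below (names, sizes, payloads) is illustrative, not the KPL wire format:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Collect small events and ship them in one PutRecords call (up to 500
    # records per call), instead of one HTTP round trip per event.
    events = [{"id": i, "payload": "x" * 100} for i in range(200)]

    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": str(event["id"]),
        }
        for event in events
    ]

    response = kinesis.put_records(StreamName="example-stream", Records=records)

    # PutRecords is not all-or-nothing: re-send any records that failed.
    if response["FailedRecordCount"] > 0:
        retries = [
            record
            for record, result in zip(records, response["Records"])
            if "ErrorCode" in result
        ]
        kinesis.put_records(StreamName="example-stream", Records=retries)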

Page 10: Processing Streams of Data with Amazon Kinesis

Apache Flume: Source & Sink (https://github.com/pdeyhim/flume-kinesis)

FluentD: Dynamic partitioning support (https://github.com/awslabs/aws-fluent-plugin-kinesis)

Log4J & Log4Net: Included in the Kinesis samples

Logstash: Plugins to use Kinesis as input or output

Kinesis Ecosystem - Ingest

Page 11: Processing Streams of Data with Amazon Kinesis

Getting Data Out

Page 12: Processing Streams of Data with Amazon Kinesis

  KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support

  All State Management in DynamoDB

Kinesis Client Library

[Diagram: KCL workers tracking their state in a DynamoDB table.]

Page 13: Processing Streams of Data with Amazon Kinesis

Distributed Event Processing Platform

  Stateless JavaScript, Java & Python functions run against an Event Stream
  AWS SDK built in
  Configure RAM and execution timeout (up to 5 minutes)
  Functions automatically invoked against a Shard
  Invoked once per event or in batches
  Access to underlying filesystem
  Call other Lambda Functions

Consuming Data - AWS Lambda

[Diagram: Lambda functions invoked against each of Shard 1 … Shard n of a Kinesis stream.]
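A minimal sketch of such a Lambda consumer in Python (the processing logic is a stand-in): Kinesis invokes the handler with a batch of records whose payloads arrive base64-encoded:

    import base64
    import json

    def handler(event, context):
        # Lambda delivers a batch of records from one shard per invocation.
        for record in event["Records"]:
            # Record payloads are base64-encoded bytes.
            payload = base64.b64decode(record["kinesis"]["data"])
            data = json.loads(payload)
            print(record["kinesis"]["sequenceNumber"], data)  # stand-in for real work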

Page 14: Processing Streams of Data with Amazon Kinesis

[Diagram: Shards 1 … n of a Kinesis stream assigned to KCL Workers 1 … n, which run across multiple EC2 instances.]

Client library for fault-tolerant, at-least-once, real-time processing

  Kinesis Client Library (KCL) simplifies reading from the stream by abstracting your code from individual shards
  Automatically starts a Worker Thread for each Shard
  Increases and decreases Thread count as the number of Shards changes
  Uses checkpoints to keep track of a Thread’s location in the stream
  Restarts Threads & Workers if they fail

Consuming Data - Kinesis Enabled Applications
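To show what the KCL automates, here is a deliberately simplified single-shard polling loop against the raw API (boto3 assumed; stream and shard names are illustrative). A real KCL worker adds leases, checkpoints in DynamoDB, and multi-shard coordination on top of this:

    import time
    import boto3

    kinesis = boto3.client("kinesis")

    # Read one shard from its oldest available record (TRIM_HORIZON).
    iterator = kinesis.get_shard_iterator(
        StreamName="example-stream",
        ShardId="shardId-000000000000",
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            print(record["SequenceNumber"], record["Data"])  # stand-in for real work
            # The KCL would checkpoint this sequence number in DynamoDB here.
        iterator = batch["NextShardIterator"]
        time.sleep(1)  # stay under the per-shard read limits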

Page 15: Processing Streams of Data with Amazon Kinesis

Analytics Tooling Integration (github.com/awslabs/amazon-kinesis-connectors)

S3: Batch write files for archive into S3; sequence-based file naming

Redshift: Once written to S3, load to Redshift; manifest support; user-defined transformers

DynamoDB: BatchPut append to table; user-defined transformers

Elasticsearch: Automatically index stream contents

Kinesis Connectors

[Diagram: Kinesis fanning out through connectors to S3, DynamoDB, Redshift, and Amazon Elasticsearch Service.]

Page 16: Processing Streams of Data with Amazon Kinesis

Connectors Architecture

Page 17: Processing Streams of Data with Amazon Kinesis

Fully managed ingestion of stream data into Amazon S3 or Redshift

•  Set up a Firehose to ingest data straight into S3 and/or Redshift
•  Takes care of buffering & batching of your data (based on time or size), encryption, and compression
•  Scales the stream automatically; no need to take care of Shards

Kinesis Firehose

[Diagram: Kinesis Firehose delivering the stream into S3 and Redshift.]
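A hedged sketch of the producer side (boto3 assumed; the delivery stream name and payload are made up): you write records to the Firehose and it handles buffering, batching, and delivery:

    import json
    import boto3

    firehose = boto3.client("firehose")

    # Firehose buffers records by size or time and delivers them to S3/Redshift;
    # there are no shards to manage on this path.
    payload = json.dumps({"metric": "latency", "value": 42}) + "\n"
    firehose.put_record(
        DeliveryStreamName="example-firehose",
        Record={"Data": payload.encode("utf-8")},
    )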

Page 18: Processing Streams of Data with Amazon Kinesis

Apache Hive (on EMR): Breaks the stream into iterations; iterations are represented as partitions of external tables in Hive, for batch processing using Hive-QL

Presto: Near-real-time exploration of a stream using SQL

Kinesis Ecosystem - EMR

Page 19: Processing Streams of Data with Amazon Kinesis

Apache Storm Kinesis Spout: Real-time processing of individual events; automatic checkpointing with Zookeeper
https://github.com/awslabs/kinesis-storm-spout

Kinesis Ecosystem - Storm

[Diagram: Kinesis feeding Storm.]

Page 20: Processing Streams of Data with Amazon Kinesis

Apache Spark Streaming: Presents slices of the stream's data as frames; micro-batches (usually between 0.5 s and 5 s windows); same programming & processing model as used for batch data!

Kinesis Ecosystem - Spark

Page 21: Processing Streams of Data with Amazon Kinesis

Apache Spark DStream: Receiver runs the KCL; one DStream per Shard; checkpointed via the KCL

Spark natively available on EMR: EMRFS overlay on HDFS; AMI 4.x (or AMI 3.8.0)
https://aws.amazon.com/elasticmapreduce/details/spark

Kinesis Ecosystem - Spark

Page 22: Processing Streams of Data with Amazon Kinesis

Kinesis – Spark Streaming Example

    KinesisUtils.createStream("twitter-stream")   // simplified: the real call also takes the StreamingContext, app name, endpoint, region, initial position, and storage level
      .filter(_.getText.contains("Open-Source"))  // keep only tweets mentioning "Open-Source"
      .countByWindow(Seconds(5), Seconds(5))      // tweet count per 5-second window

Counting tweets on a sliding window

Page 23: Processing Streams of Data with Amazon Kinesis

Kinesis – Spark Pipeline

Page 24: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Durability

Kinesis: Regional service; synchronous writes to multiple AZs; extremely high durability

?: May be in-memory for performance; requires understanding disk-sync semantics; user-managed replication; replication lag -> RPO

Page 25: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Performance

Kinesis: Perform continual processing on streaming big data. Processing latencies fall to under 1 second, compared with the minutes or hours associated with batch processing

?: Processing latencies under 1 second depend on CPU & disk performance; a cluster interruption means a processing outage

Page 26: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Availability

Kinesis: Regional service; synchronous writes to multiple AZs; extremely high durability; AZ, networking, and chain-server issues are transparent to Producers & Consumers

?: Many alternatives depend on a CP database, and a lost quorum can result in failure or inconsistency of the cluster; the highest achievable availability is determined by the availability of cross-AZ links, or the availability of an AZ

Page 27: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Operations

Kinesis: Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest

?: Build instances; install software; operate the cluster; manage disk space; manage replication; migrate to a new stream on scale-up

Page 28: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Elasticity

Kinesis: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs

?: Fixed partition count up front; maximum performance of roughly 1 partition per core or machine; converting from one stream to another to scale requires application reconfiguration

Page 29: Processing Streams of Data with Amazon Kinesis

Scaling Streams

https://github.com/awslabs/amazon-kinesis-scaling-utils
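The scaling-utils project automates resharding; underneath, scaling comes down to the SplitShard and MergeShards APIs, sketched here with boto3 (shard IDs and the hash key are illustrative):

    import boto3

    kinesis = boto3.client("kinesis")

    # Scale up: split a hot shard in two at a chosen point in its hash-key range.
    kinesis.split_shard(
        StreamName="example-stream",
        ShardToSplit="shardId-000000000000",
        NewStartingHashKey=str(2**127),  # midpoint of the full hash-key space
    )

    # Scale down: merge two adjacent shards back into one.
    kinesis.merge_shards(
        StreamName="example-stream",
        ShardToMerge="shardId-000000000001",
        AdjacentShardToMerge="shardId-000000000002",
    )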

Page 30: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Cost

Kinesis: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use. Scale up/down dynamically. $0.015/hour per 1 MB/sec Shard

?: Run your own EC2 instances; multi-AZ configuration for increased durability; utilise instance auto-scaling on Worker lag from HEAD with custom metrics

Page 31: Processing Streams of Data with Amazon Kinesis

Why Kinesis? Cost

  Price dropped on 2nd June 2015, restructured to support the KPL
  Old pricing: $0.028 / 1M records PUT
  New pricing: $0.014 / 1M 25 KB "Payload Units"

Scenario: 50,000 events/second, 512 B/event = 24.4 MB/second

                 Old Pricing                New Pricing + KPL
                 Units            Cost      Units                  Cost
    Shards       50               $558      25                     $279
    PutRecords   4,320M Records   $120.96   2,648M Payload Units   $37.50
    Total                         $678.96                          $316.50

Page 32: Processing Streams of Data with Amazon Kinesis

Page 33: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: Shards 1-5 and a single Worker.]

Page 34: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: the Worker starts one RecordProcessor per shard (RecordProcessor for Shard 1 through Shard 5).]

Page 35: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: a second Worker joins while the first Worker still runs the RecordProcessors for Shards 1-5.]

Page 36: Processing Streams of Data with Amazon Kinesis

Page 37: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: animation steps repeating the previous slide as shard leases are redistributed between the two Workers.]
Page 38: Processing Streams of Data with Amazon Kinesis

Deploying a KCL Application

[Diagram: Shards 1-5 consumed by four Workers running in an Auto Scaling group.]

KCL Workers emit CloudWatch metrics you can use to auto-scale; in general, though, scaling on CPU utilization should be sufficient.

Page 39: Processing Streams of Data with Amazon Kinesis

•  Be distributed, to handle multiple shards
•  Be fault tolerant, to handle failures in hardware or software
•  Scale up and down as the number of shards increases or decreases

KCL Applications & Autoscaling

•  Auto Scaling policies will restart EC2 instances if they fail

•  Automatically add EC2 instances when load increases

•  KCL will automatically redistribute Workers to use the new EC2 instances

Page 40: Processing Streams of Data with Amazon Kinesis

Kinesis – Consumer Application Best Practices

  Tolerate failure of: Threads (consider event serialisation issues and lease stealing); Hardware (Auto Scaling may add nodes as needed)
  Scale Consumers up and down as the number of Shards increases or decreases
  Don’t store data in memory in the Workers; use an elastic data store such as DynamoDB for state
  Elastic Beanstalk provides all best practices in a simple-to-deploy, multi-version application container
  KCL will automatically redistribute Workers to use new instances
  Logic implemented in Lambda doesn’t require any servers at all!

Page 41: Processing Streams of Data with Amazon Kinesis

Managing Application State

Page 42: Processing Streams of Data with Amazon Kinesis

Consumer Local State Anti-Pattern

  Consumer binds to a configured number of Partitions

  Consumer stores the ‘state’ of a data structure, as defined by the event stream, on local storage

  Read API can access that local storage as a ‘shard’ of the overall database

?: [Diagram: Partitions 1 … P split across two Consumers; each Consumer keeps the state for its partitions (P1..P/2 and P/2+1..P) on local disk, and the Read API reads from that local storage.]

Page 43: Processing Streams of Data with Amazon Kinesis

Consumer Local State Anti-Pattern

But what happens when an instance fails?

?: [Diagram: the same layout as above, with one Consumer and its local disk failed.]

Page 44: Processing Streams of Data with Amazon Kinesis

Consumer Local State Anti-Pattern

  A new Consumer process starts up for the required Partitions
  The Consumer must read from the beginning of the Stream to rebuild local storage
  Complex, error-prone, user-constructed software
  Long startup time

?: [Diagram: the replacement Consumer re-reads the stream from T0 to THead to rebuild local storage for its partitions before the Read API can use it.]

Page 45: Processing Streams of Data with Amazon Kinesis

External Highly Available State – Best Practice

  Each Consumer binds to an even share of the Shards, based on the number of Consumers
  Consumer stores the ‘state’ in DynamoDB
  DynamoDB is highly available, elastic & durable
  Read API can access DynamoDB

[Diagram: Shards 1 … S split across two Consumers, which write state to DynamoDB; the Read API reads from DynamoDB.]
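A minimal sketch of such externalized state (boto3 assumed; the table name "stream-state" and its hash key "metric" are made up): each Consumer folds events into DynamoDB counters instead of local memory, so a failed worker loses nothing:

    import boto3

    table = boto3.resource("dynamodb").Table("stream-state")

    def apply_event(event):
        # Fold the event into durable, highly available state in DynamoDB.
        # ADD is atomic, so concurrent Consumers can update the same item safely.
        table.update_item(
            Key={"metric": event["metric"]},
            UpdateExpression="ADD event_count :one, total :val",
            ExpressionAttributeValues={":one": 1, ":val": event["value"]},
        )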

Page 46: Processing Streams of Data with Amazon Kinesis

External Highly Available State – Best Practice

(bullets repeated from the previous slide)

[Diagram: the same architecture, redrawn with the Read API served entirely from DynamoDB.]

Page 47: Processing Streams of Data with Amazon Kinesis

External Highly Available State – Best Practice

(bullets repeated from the previous slide)

[Diagram: the same architecture with AWS Lambda as the Consumer: Lambda reads Shards 1 … S and stores state in DynamoDB, which serves the Read API.]

Page 48: Processing Streams of Data with Amazon Kinesis

Idempotency

Page 49: Processing Streams of Data with Amazon Kinesis

Property of a system whereby the repeated application of a function to a single input results in the same end state of the system

Exactly Once Processing

Page 50: Processing Streams of Data with Amazon Kinesis

Idempotency – Writing Data

  The Kinesis SDK & KPL may retry a PUT in certain circumstances
  A Kinesis Record acknowledged with a Sequence Number is durable across multiple Availability Zones…
  But there could be a duplicate entry

[Diagram: My Application calls PutRecord(s); the record reaches storage, but the response is a 403 (Endpoint Redirect).]

Page 51: Processing Streams of Data with Amazon Kinesis

Idempotency – Writing Data

  The Kinesis SDK & KPL may retry a PUT in certain circumstances
  A Kinesis Record acknowledged with a Sequence Number is durable across multiple Availability Zones…
  But there could be a duplicate entry

[Diagram: the first PutRecord(s) is written to storage (Write 123) but answered with a 403 (Endpoint Redirect); the retried PutRecord(s) writes 123 again and returns 200 (OK), leaving a duplicate.]

Page 52: Processing Streams of Data with Amazon Kinesis

Coming Soon…

Page 53: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

  Kinesis will manage a rolling time window of Record IDs in DynamoDB
  Record IDs are user-based
  Duplicates in the storage tier will be acknowledged as successful

[Diagram: My Application calls PutRecord(s) toward storage; DynamoDB holds an X-hour rolling window of Record IDs.]

Page 54: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated from the previous slide)

[Diagram step 1: before the record is stored, its Record ID is written to the DynamoDB rolling window (Write Record ID -> OK).]

Page 55: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated)

[Diagram step 2: the Record ID write succeeds and the record is stored, but the PutRecord(s) response is a 403 (Endpoint Redirect).]

Page 56: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated)

[Diagram step 3: the retried PutRecord(s) attempts to write the Record ID again; DynamoDB rejects it with a ConditionCheckFailedException, so Kinesis knows the record is a duplicate.]

Page 57: Processing Streams of Data with Amazon Kinesis

Idempotency – Rolling Idempotency Check

(bullets repeated)

[Diagram step 4: the retry is acknowledged to the application with 200 (OK), so the producer sees success without a second copy being created.]

Page 58: Processing Streams of Data with Amazon Kinesis

  Easy Administration
  Real-time Performance
  High Durability
  High Throughput
  Elastic
  S3, Redshift & DynamoDB Integration
  Large Ecosystem
  Low Cost

In Short…

Page 59: Processing Streams of Data with Amazon Kinesis