Introduction to AWS Kinesis

Wellington AWS Meetup
Introduction to Kinesis

Who Am I?

• Team Leader/Architect in Business Intelligence/databases
• 17 years' experience
• MCSE BI, OCP DBA, MCDBA
• AWS-ASA-2505

Who are OptimalBI?

• Wellington-based BI Consultancy
• “Making Information Visible”

Talk Outline
1. Why do we need Kinesis?
2. What is Kinesis?
3. Demo
4. How does it fit into an existing data warehouse?
5. When to use Kinesis

Big Data
1. Volume
2. Velocity
3. Variety

Kinesis is an answer to Velocity

Machine learning looks simple: data is collected, magic happens, and we output it to our users

Traditional Business Intelligence

Data Store → Data Warehouse → Query Tool

• Periodic, batch Extract-Transform-Load
• Persistent data source
• High latency

Internet of Things
• Large number of sensors
• Self-registering
• Pushing data
• May or may not retain any historic data
= Only one chance to get data

Batch ETL
• Data needs to wait somewhere between loads.
• If data is only loaded during six hours per day, then four times as much hardware is needed as for a continuous load (24 / 6 = 4).
• Latency of hours
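The hardware claim above is simple arithmetic. A minimal sketch, assuming the same daily volume has to be squeezed into a shorter load window (the function name is illustrative, not from the talk):

```python
def hardware_factor(load_hours_per_day: float) -> float:
    """Capacity needed relative to a load spread evenly over 24 hours."""
    if not 0 < load_hours_per_day <= 24:
        raise ValueError("load_hours_per_day must be in (0, 24]")
    return 24 / load_hours_per_day


print(hardware_factor(6))   # 4.0
print(hardware_factor(24))  # 1.0
```

Streaming moves the load towards the 24-hour end of that range, which is where the factor drops to 1.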

DIY Streaming ETL

“Realtime” “ETL” cluster

DIY Streaming ETL 2.0

Add a queue

DIY Streaming ETL 3+
Cluster more

Getting messy, still problems

Problems with DIY Streaming ETL
1. Message queues deliver once. If you want to fan out to many readers, the application in front needs to know about each of them and queue the same message repeatedly.
2. Order of message delivery is not guaranteed.
3. If the program reading data crashes partway through aggregating, messages are lost.

What is Kinesis?
• Kinesis is like a message queue, but more scalable and with multiple readers of each message.
• Kinesis is like a NoSQL database, but with message delivery and daily purging.
• Kinesis is like an Enterprise Service Bus focused on analytics.
• For a limited, if common, use case, Kinesis is the best of all three.
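The difference from a deliver-once queue can be shown with a toy model. This is an in-memory sketch only, not the Kinesis API: the class and method names are made up for illustration, and it stands in for a single shard.

```python
from collections import defaultdict


class ToyStream:
    """In-memory stand-in for one shard: an append-only, ordered log."""

    def __init__(self):
        self._records = []                    # records are kept, not popped
        self._checkpoints = defaultdict(int)  # reader name -> next position

    def put(self, record):
        self._records.append(record)

    def read(self, reader, limit=10):
        """Return the next records for this reader without removing them."""
        start = self._checkpoints[reader]
        return self._records[start:start + limit]

    def checkpoint(self, reader, count):
        """Record that the reader has fully processed `count` more records."""
        self._checkpoints[reader] += count

    def rewind(self, reader, position=0):
        """Replay: move the reader back to an earlier position."""
        self._checkpoints[reader] = position


stream = ToyStream()
for reading in ["t=1", "t=2", "t=3"]:
    stream.put(reading)

# Two independent readers each see every record, in order.
dashboard = stream.read("dashboard")
archiver = stream.read("archiver")

# A reader that crashes before checkpointing reads the same records again.
after_crash = stream.read("dashboard")

# Only checkpointed records are skipped on the next read.
stream.checkpoint("archiver", 2)
remaining = stream.read("archiver")
```

Because records stay in the log and each reader tracks its own position, the producer does not need to know how many readers exist, which is the fan-out problem from the DIY slides.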

Kinesis Qualities
• Scalable
• Elastic
• Durable
• Fault Tolerant
• Replayable

Kinesis Components
• Each Queue/DB is called a Stream
• Each stream scales by adding Shards
• Each Shard provides 1 MB/s in and 2 MB/s out
• Shards are only $0.44/day, so autoscale them to give some safety margin
• Also pay about 2 cents per million puts
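The shard figures above translate into a quick sizing estimate. A rough sketch using only the numbers on this slide; the 25% headroom and the example traffic are assumptions, and the price should be checked against current AWS pricing for your region:

```python
import math

INGEST_MB_PER_SHARD = 1.0    # MB/s in, per the slide
EGRESS_MB_PER_SHARD = 2.0    # MB/s out, per the slide
SHARD_COST_PER_DAY = 0.44    # USD, per the slide; check current pricing


def shards_needed(in_mb_s, out_mb_s, headroom=0.25):
    """Smallest shard count covering both limits plus a safety margin."""
    for_in = in_mb_s * (1 + headroom) / INGEST_MB_PER_SHARD
    for_out = out_mb_s * (1 + headroom) / EGRESS_MB_PER_SHARD
    return max(1, math.ceil(max(for_in, for_out)))


def daily_shard_cost(shards):
    """Shard-hours only; the per-put charge is not included."""
    return shards * SHARD_COST_PER_DAY


# Hypothetical load: 3 MB/s arriving, read in full by three consumers (9 MB/s out).
n = shards_needed(in_mb_s=3.0, out_mb_s=9.0)
print(n)                                # 6
print(f"{daily_shard_cost(n):.2f}")     # 2.64
```

In this example the read side drives the shard count, which is worth remembering when adding consumers to an existing stream.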

Kinesis Client Library
• Kinesis expects you to write bespoke producer and consumer programs
• KCL provides automatic multi-threading with one worker thread per shard.
• Similar to Hadoop, the framework handles the heavy lifting and the bespoke program does the “reduce”
• You have to autoscale the EC2 groups.
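A bespoke producer can be very small. The sketch below assumes the AWS SDK for Python (boto3) and its Kinesis `put_record` call; the stream name, the `sensor_id` field and the helper names are hypothetical. The client is passed in so the request-building part can be exercised without AWS credentials.

```python
import json


def build_put_request(stream_name, reading):
    """Build keyword arguments for the Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(reading).encode("utf-8"),
        # Records with the same partition key go to the same shard,
        # so using the sensor id keeps each sensor's readings together.
        "PartitionKey": str(reading["sensor_id"]),
    }


def send_reading(client, stream_name, reading):
    """Send one reading; `client` is expected to be boto3.client("kinesis")."""
    return client.put_record(**build_put_request(stream_name, reading))


request = build_put_request("sensor-stream", {"sensor_id": 42, "temp_c": 21.5})
```

With credentials configured, `send_reading(boto3.client("kinesis"), "sensor-stream", {...})` would put one record on the stream. The consumer side is what the KCL bullets above describe.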

Diagram: Amazon Kinesis, with Kinesis Application instances in an Auto Scaling group

Existing Kinesis Connectors
Sending
• HTTP POST
• AWS SDK
• Log4j
• Flume
• Fluentd
Reading
• Get* APIs
• Amazon Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-kinesis.html

Standard AWS Demo Script

1. HIVE already running in EMR
2. Create Kinesis Stream
3. Start Producer
4. Configure HIVE as consumer

Integrating Kinesis into an existing Data Warehouse

1. Access data in near real-time
2. Facilitate more-traditional ETL
3. Archive

Near Real-time Data
1. Analyze individual transactions
2. Send alerts for both individual transactions and trends
3. Aggregate to feed a live dashboard
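The three uses above can share one pass over the records. A minimal sketch of what a consumer might do with each batch, assuming records arrive as (timestamp in seconds, amount) pairs; the window size and both thresholds are made-up values:

```python
from collections import defaultdict

WINDOW_SECONDS = 60          # hypothetical dashboard granularity
SINGLE_ALERT_AMOUNT = 1000   # hypothetical per-transaction threshold
TREND_ALERT_TOTAL = 2500     # hypothetical per-window threshold


def process(transactions):
    """Return (per-window totals, alerts) for (timestamp, amount) pairs."""
    totals = defaultdict(float)
    alerts = []
    for ts, amount in transactions:
        window = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        # Alert on an individual transaction.
        if amount >= SINGLE_ALERT_AMOUNT:
            alerts.append(("transaction", ts, amount))
        before = totals[window]
        totals[window] += amount
        # Alert on a trend once, when the window first crosses the threshold.
        if before < TREND_ALERT_TOTAL <= totals[window]:
            alerts.append(("trend", window, totals[window]))
    return dict(totals), alerts


totals, alerts = process([(5, 900), (20, 1200), (50, 700), (65, 100)])
print(totals)  # {0: 2800.0, 60: 100.0}
print(alerts)  # [('transaction', 20, 1200), ('trend', 0, 2800.0)]
```

The per-window totals are what a live dashboard would poll; the alerts list is what would be pushed to a notification channel.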

Facilitate Traditional ETL
1. Write lightly transformed data to S3 to batch COPY into Redshift
2. Pre-compute aggregates, then write them to S3
3. Provide a durable, replayable buffer in front of traditional ETL tools.
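For the first option above, the consumer's main job is to turn a stream of small records into a few larger files. A sketch of that batching step, assuming newline-delimited JSON as the file format and a batch size chosen for illustration; the upload and the COPY itself are left out:

```python
import json


def to_batches(records, max_records=500):
    """Group records into newline-delimited JSON payloads, one per S3 object."""
    batch = []
    for record in records:
        batch.append(json.dumps(record, sort_keys=True))
        if len(batch) == max_records:
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield "\n".join(batch) + "\n"


records = ({"id": i, "value": i * 2} for i in range(5))
payloads = list(to_batches(records, max_records=2))
print(len(payloads))  # 3
```

Each payload could then be written to S3 as one object and picked up by a scheduled Redshift COPY, which generally works better with fewer, larger files than with one file per record.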

Archive
1. In addition to using your data, Kinesis makes it easy to log the full incoming data set to S3.
2. An object store makes more sense for write-once/read-never data than a database.

When to use Kinesis
1. Internet of Things (IoT)
2. You need near-real-time access to data.
3. You have more than one consumer for each piece of data.

Thanks
1. Our sponsors:
   • API Talent
   • AWS
   • OptimalPeople
2. Bronwyn and Wyn
3. AWS for images on slides
