Introduction to AWS Kinesis
Wellington AWS Meetup
Introduction to Kinesis
Who Am I?
• Team Leader/Architect in Business Intelligence/databases
• 17 years' experience
• MCSE BI, OCP DBA, MCDBA
• AWS-ASA-2505
Who are OptimalBI?
• Wellington-based BI Consultancy
• “Making Information Visible”
Talk Outline
1. Why do we need Kinesis?
2. What is Kinesis?
3. Demo
4. How does it fit into an existing data warehouse?
5. When to use Kinesis
Big Data
1. Volume
2. Velocity
3. Variety
Kinesis is an answer to Velocity
Machine learning looks simple: data is collected, magic happens, and we output it to our users
Traditional Business Intelligence
Data Store Data Warehouse
Query Tool
• Periodic, Batch Extract-Transform-Load.
• Persistent data source
• High latency
Internet of Things
• Large number of sensors
• Self-registering
• Pushing data
• May or may not retain any historic data
= Only one chance to get data
Batch ETL
• Data needs to wait somewhere between loads.
• If data is only loaded six hours per day, then four times as much hardware is needed as for loading around the clock.
• Latency of hours
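The four-times figure is simple arithmetic: a full day's data squeezed into a six-hour window needs 24 / 6 = 4 times the throughput of a load spread evenly over the day. A minimal sketch, assuming data arrives at a steady rate; the function name is illustrative only:

```python
def peak_capacity_factor(load_hours_per_day: float) -> float:
    """Throughput needed in the load window, relative to loading evenly all day."""
    if not 0 < load_hours_per_day <= 24:
        raise ValueError("load window must be between 0 and 24 hours")
    return 24 / load_hours_per_day


six_hour_window = peak_capacity_factor(6)    # the slide's case
continuous_load = peak_capacity_factor(24)   # streaming-style, no idle time
print(six_hour_window, continuous_load)
```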
DIY Streaming ETL
“Realtime” “ETL” cluster
DIY Streaming ETL 2.0
Add a queue
DIY Streaming ETL 3+
Cluster more
Getting messy, still problems
Problems with DIY Streaming ETL
1. Message queues deliver once. If you want to fan out to many readers, the application in front needs to know about each of them and queue the same message repeatedly.
2. Order of message delivery is not guaranteed.
3. If the program reading data crashes partway through aggregating, messages are lost.
What is Kinesis
• Kinesis is like a message queue, but more scalable and with multiple readers of each message.
• Kinesis is like a NoSQL database, but with message delivery and daily purging.
• Kinesis is like an Enterprise Service Bus focused on analytics.
• For a limited, if common, use case Kinesis is the best of all.
Kinesis Qualities
• Scalable
• Elastic
• Durable
• Fault Tolerant
• Replayable
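To make "multiple readers" and "replayable" concrete, here is a toy in-memory model. It is not the Kinesis API: the `Stream` and `Reader` classes and the sample readings are invented for illustration. The point is only that records stay in the log after being read and each reader keeps its own position.

```python
class Stream:
    """Toy append-only log: records are kept, not removed, when read."""

    def __init__(self):
        self._records = []

    def put(self, data):
        """Append a record and return its sequence number."""
        self._records.append(data)
        return len(self._records) - 1

    def read(self, start, limit=10):
        """Return up to `limit` records starting at sequence number `start`."""
        return self._records[start:start + limit]


class Reader:
    """Each reader tracks its own position, so readers never affect each other."""

    def __init__(self, stream, position=0):
        self.stream = stream
        self.position = position

    def poll(self, limit=10):
        batch = self.stream.read(self.position, limit)
        self.position += len(batch)
        return batch


stream = Stream()
for reading in ["t=1 20.5C", "t=2 20.7C", "t=3 21.0C"]:
    stream.put(reading)

dashboard_batch = Reader(stream).poll()
archiver_batch = Reader(stream).poll()            # sees the same records
replayed_batch = Reader(stream, position=1).poll()  # re-read from an earlier point
```

A deliver-once queue would have handed each reading to only one of the two readers; here both get everything, and a reader that crashed can start again from an earlier sequence number.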
Kinesis Components
• Each Queue/DB is called a Stream
• Each Stream scales by adding Shards
• Each Shard provides 1 MB/s in and 2 MB/s out
• Shards are only $0.44/day, so autoscale them to give some safety margin
• Also pay about 2 cents per million puts
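The shard figures above are enough for a back-of-envelope sizing. The sketch below uses the slide's numbers as constants; prices differ by region and change over time, so treat them as inputs. The 25% safety margin, the function names and the example workload are assumptions for illustration.

```python
import math

# Figures taken from the slide; check current pricing before relying on them.
SHARD_IN_MB_S = 1.0
SHARD_OUT_MB_S = 2.0
SHARD_USD_PER_DAY = 0.44
USD_PER_MILLION_PUTS = 0.02


def shards_needed(in_mb_s, out_mb_s, safety_margin=0.25):
    """Smallest shard count that covers both write and read rates plus a margin."""
    by_in = in_mb_s * (1 + safety_margin) / SHARD_IN_MB_S
    by_out = out_mb_s * (1 + safety_margin) / SHARD_OUT_MB_S
    return max(1, math.ceil(max(by_in, by_out)))


def daily_cost_usd(shards, puts_per_day):
    """Shard charge plus put charge for one day."""
    return shards * SHARD_USD_PER_DAY + puts_per_day / 1_000_000 * USD_PER_MILLION_PUTS


# Hypothetical workload: 3 MB/s written, 9 MB/s read in total by all consumers.
shards = shards_needed(in_mb_s=3.0, out_mb_s=9.0)
cost = daily_cost_usd(shards, puts_per_day=50_000_000)
print(shards, round(cost, 2))
```

In this example the read side drives the shard count, which is common once several consumers read the same stream.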
Kinesis Client Library
• Kinesis expects you to write bespoke producer and consumer programs
• KCL provides automatic multi-threading with one worker thread per shard
• Similar to Hadoop: the framework handles the lifting, and the bespoke program does the “reduce”
• You have to autoscale the EC2 groups
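The division of labour described above can be sketched with the Python standard library. This is not the KCL API; it is a stand-in that starts one worker thread per shard and leaves only the "reduce" step to the bespoke function. The shard names and record values are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def process_shard(records):
    """The bespoke part: reduce one shard's records to a single total."""
    return sum(records)


def run_workers(shards):
    """The framework part: run one worker thread per shard and collect results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = {
            shard_id: pool.submit(process_shard, records)
            for shard_id, records in shards.items()
        }
        return {shard_id: future.result() for shard_id, future in futures.items()}


# Hypothetical shard contents.
shards = {
    "shard-0": [3, 5, 8],
    "shard-1": [1, 1],
    "shard-2": [10],
}
totals = run_workers(shards)
print(totals)
```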
Kinesis Application
[Diagram: Amazon Kinesis and three Auto Scaling groups of instances]
Existing Kinesis Connectors
Sending
• HTTP POST
• AWS SDK
• Log4j
• Flume
• Fluentd
Reading
• Get* APIs
• Amazon Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-kinesis.html
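As an example of the AWS SDK sending route, the sketch below wraps the `put_record` call that boto3's Kinesis client provides (`StreamName`, `Data`, `PartitionKey`). The helper name, stream name, sensor id and payload are hypothetical, and `RecordingClient` is a stand-in so the sketch runs without AWS credentials; its return values are dummies.

```python
import json


def send_reading(client, stream_name, sensor_id, reading):
    """Put one JSON record; records sharing a partition key land on the same shard."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=sensor_id,
    )


class RecordingClient:
    """Stand-in for boto3.client("kinesis"); it only records what was sent."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        # Dummy values shaped like the real response fields.
        return {"ShardId": "shardId-000000000000", "SequenceNumber": "0"}


client = RecordingClient()
response = send_reading(client, "sensor-readings", "sensor-42", {"temp_c": 20.5})
print(response["ShardId"], client.calls[0]["PartitionKey"])
```

Against a real stream, pass `boto3.client("kinesis")` as `client` instead of the stand-in.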
Standard AWS Demo Script
1. Hive already running in EMR
2. Create Kinesis Stream
3. Start Producer
4. Configure Hive as consumer
Integrating Kinesis into an existing Data Warehouse
1. Access data in near real-time
2. Facilitate more-traditional ETL
3. Archive
Near Real-time Data
1. Analyze individual transactions
2. Send alerts for both individual transactions and trends
3. Aggregate to feed a live dashboard
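A consumer that alerts on both individual transactions and trends might look like the sketch below. The thresholds, the three-record rolling window and the sample amounts are all assumptions chosen for illustration; a real consumer would read the amounts from the stream.

```python
from collections import deque


def alerts(amounts, single_limit=1000.0, window=3, trend_limit=500.0):
    """Yield (index, kind) for large single amounts and for a high rolling mean."""
    recent = deque(maxlen=window)
    for i, amount in enumerate(amounts):
        if amount > single_limit:
            yield (i, "single")
        recent.append(amount)
        # Only judge the trend once the window is full.
        if len(recent) == window and sum(recent) / window > trend_limit:
            yield (i, "trend")


found = list(alerts([100, 1200, 400, 450, 700, 50]))
print(found)
```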
Facilitate Traditional ETL
1. Write lightly transformed data to S3 to batch COPY into Redshift
2. Pre-compute aggregates, then write them to S3
3. Provide a durable, replayable buffer in front of traditional ETL tools.
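Writing to S3 for a later batch COPY usually means grouping many small records into fewer, larger files. A minimal sketch of that buffering step follows; the class name, file naming scheme and batch size are assumptions, and the `flush` callable stands in for whatever performs the actual S3 upload.

```python
import json


class BatchBuffer:
    """Collect records and hand them off as one newline-delimited JSON blob per batch."""

    def __init__(self, flush, max_records=3):
        self.flush = flush              # callable taking (key, body), e.g. an S3 upload
        self.max_records = max_records
        self.pending = []
        self.batches_written = 0

    def add(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.max_records:
            self.close()

    def close(self):
        """Flush whatever is pending; call at shutdown so a partial batch is not lost."""
        if not self.pending:
            return
        body = "\n".join(json.dumps(r, sort_keys=True) for r in self.pending) + "\n"
        self.flush(f"batch-{self.batches_written:06d}.json", body)
        self.batches_written += 1
        self.pending = []


written = {}                            # a dict stands in for the S3 bucket
buffer = BatchBuffer(flush=written.__setitem__)
for n in range(4):
    buffer.add({"sensor": "sensor-42", "seq": n})
buffer.close()
print(sorted(written))
```

Four records with a batch size of three produce one full file and one partial file, which is why the final `close()` matters.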
Archive
1. In addition to using your data, Kinesis makes it easy to log the full incoming data set to S3.
2. An object store makes more sense for write-once/read-never data than a database.
When to use Kinesis
1. Internet of Things (IoT)
2. You need near-real-time access to data.
3. You have more than one consumer for each piece of data.
Thanks
1. Our sponsors:
• API Talent
• AWS
• OptimalPeople
2. Bronwyn and Wyn
3. AWS for images on slides