AWS re:Invent 2016: Analyzing Streaming Data in Real-Time with Amazon Kinesis Analytics (BDM304)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
November 2016
BDM304
Analyzing Streaming Data in Real-Time
with Amazon Kinesis Analytics
Allan MacInnis, Senior Solutions Architect, AWS
What to Expect from the Session
• Streaming data overview
• Amazon Kinesis platform review
• Amazon Kinesis Analytics overview
• Amazon Kinesis Analytics patterns
• Streaming data end-to-end example and walkthrough
• Amazon Kinesis Analytics best practices
Streaming Data Overview
Most data is produced continuously
Mobile apps, web clickstream, application logs, metering records, IoT sensors, smart buildings
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + recent data is more valuable
• If you have the means to combine them
Processing real-time, streaming data
What are the key requirements?
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Ingest → Transform → Analyze → React → Persist
Amazon Kinesis Platform Overview
Amazon Kinesis makes it easy to work with real-time streaming data
Amazon Kinesis Streams
• For technical developers
• Collect and stream data for ordered, replayable, real-time processing
Amazon Kinesis Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
Amazon Kinesis Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
Amazon Kinesis Firehose
• Reliably ingest and deliver batched, compressed, and encrypted data to S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES)
• Point-and-click setup with zero administration and seamless elasticity
Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
Amazon Kinesis Analytics: Service Overview
Amazon Kinesis Analytics
• Pay for only what you use
• Automatic elasticity
• Standard SQL for analytics
• Real-time processing
• Easy to use
Use SQL to build real-time applications
• Connect to streaming source
• Easily write SQL code to process streaming data
• Continuously deliver SQL results
Connect to streaming source
• Streaming data sources include Amazon Kinesis Firehose or Amazon Kinesis Streams
• Input formats include JSON, CSV, variable column, or unstructured text
• Each input has a schema; the schema is inferred, but you can edit it
• Reference data sources (in S3) can be used for data enrichment
Write SQL code
• Build streaming applications with one or more SQL statements
• Robust SQL support and advanced analytic functions
• Extensions to the SQL standard to work seamlessly with streaming data
• Support for at-least-once processing semantics
Continuously deliver SQL results
• Send processed data to multiple destinations
• S3, Amazon Redshift, or Amazon ES (through Firehose)
• Streams (with AWS Lambda integration for custom destinations)
• End-to-end processing speed as low as sub-second
• Separation of processing and data delivery
What are common uses for Amazon Kinesis Analytics?
Generate time series analytics
• Compute key performance indicators over time periods (see the sketch below)
• Combine with static or historical data in S3 or Amazon Redshift
[Architecture: Streams / Firehose → Analytics → Firehose → Amazon Redshift / S3, or → Streams → custom, real-time destinations]
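A minimal sketch of the KPI pattern, assuming the default source in-application stream name (SOURCE_SQL_STREAM_001) and a hypothetical per-minute event count; grouping by FLOOR(ROWTIME TO MINUTE) gives a one-minute tumbling window:
CREATE OR REPLACE STREAM "KPI_STREAM" (WINDOW_TIME TIMESTAMP, EVENT_COUNT INTEGER);
CREATE OR REPLACE PUMP "KPI_PUMP" AS
INSERT INTO "KPI_STREAM"
SELECT STREAM
    FLOOR(s.ROWTIME TO MINUTE) AS window_time, -- start of the one-minute bucket
    COUNT(*) AS event_count                    -- events that arrived in the bucket
FROM "SOURCE_SQL_STREAM_001" s
GROUP BY FLOOR(s.ROWTIME TO MINUTE);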
Feed real-time dashboards
• Validate and transform raw data, and then process it to calculate meaningful statistics (a sketch follows below)
• Send processed data downstream for visualization in BI and visualization services
[Architecture: Streams / Firehose → Analytics → Amazon ES / Amazon Redshift / Amazon RDS → Amazon QuickSight]
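A sketch of the validate-and-transform step; the column names (user_id, page, latency_ms) are hypothetical, and malformed rows are dropped before anything reaches the dashboard:
CREATE OR REPLACE STREAM "CLEAN_STREAM" (USER_ID VARCHAR(64), PAGE VARCHAR(256), LATENCY_MS INTEGER);
CREATE OR REPLACE PUMP "CLEAN_PUMP" AS
INSERT INTO "CLEAN_STREAM"
SELECT STREAM "user_id", LOWER("page"), "latency_ms"  -- normalize the page name
FROM "SOURCE_SQL_STREAM_001"
WHERE "latency_ms" IS NOT NULL                        -- drop records missing the metric
  AND "latency_ms" BETWEEN 0 AND 60000;               -- discard implausible values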
Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications (see the sketch below)
[Architecture: Streams / Firehose → Analytics → Streams → Amazon SNS / Amazon CloudWatch / Lambda]
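A sketch of the alarm pattern, with a hypothetical status column and threshold; a one-minute sliding window counts errors, and only rows that cross the threshold reach the output stream, where a Lambda or SNS consumer can pick them up:
CREATE OR REPLACE STREAM "ALARM_STREAM" (ERROR_COUNT INTEGER);
CREATE OR REPLACE PUMP "ALARM_PUMP" AS
INSERT INTO "ALARM_STREAM"
SELECT STREAM error_count
FROM (
    SELECT STREAM COUNT(*) OVER W1 AS error_count  -- errors seen in the trailing minute
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "status" = 'ERROR'
    WINDOW W1 AS (RANGE INTERVAL '1' MINUTE PRECEDING)
)
WHERE error_count > 100;                           -- hypothetical alarm threshold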
Example: NHL Tweet Analysis
Example Scenario Requirements
Data to capture
• Filter for NHL-related tweets
• Total number of tweets per hour that contain hashtags for NHL teams
• Top 5 NHL-related hashtags per hour
Output Requirements
• Filtered tweets are saved to Amazon S3
• Hourly aggregate count is saved to Amazon ES
• Full team names of the top 5 hashtags are saved to Amazon ES
Why use Amazon Kinesis Analytics for this solution?
Challenges
• Twitter stream can be noisy
• Tweet structure is complex, with several levels of nested JSON
• NHL-related tweet volume is cyclical
With Amazon Kinesis Analytics:
• Easily filter out unwanted tweets
• Normalize tweet schema for simple SQL queries
• Automatically scale to meet demand
End-to-End Architecture
[Architecture: EC2 instance → Amazon Kinesis Streams → Amazon Kinesis Analytics (with reference data in Amazon S3) → Amazon Kinesis Firehose → Amazon Elasticsearch Service / Amazon S3]
Twitter Data Input
EC2 instance → Amazon Kinesis stream
Source JSON data
• Hundreds of attributes per tweet
• ~5 KB per tweet
• Most attributes are unnecessary for this use case
Amazon Kinesis Streams input
• 5 attributes per record
• ~250 B per record
• Only the relevant attributes
How are tweets mapped to a schema?
Amazon Kinesis stream → Amazon Kinesis Analytics
{
"id": 795296435386388500,
"text": "#Canes game tonight! #NHL",
"created_at": "11-06-2016 16:07:00",
"tags": [{
"tag": "Canes"
}, {
"tag": "NHL"
}]
}
id    text      created_at     tag
795…  #Canes…   11-06-2016…    Canes
795…  #Canes…   11-06-2016…    NHL
Source data for Amazon Kinesis Analytics: the nested tags array is unnested into one row per hashtag.
How is streaming data accessed with SQL?
STREAM
• Analogous to a TABLE
• Represents continuous data flow
CREATE OR REPLACE STREAM "NHL_TWEET_STREAM" (
    ID BIGINT, TWEET_TEXT VARCHAR(140), HASHTAG VARCHAR(140));
PUMP
• Continuous INSERT query
• Inserts data from one in-application stream to another
CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
INSERT INTO "NHL_TWEET_STREAM"
SELECT STREAM * FROM . . .
How do we model our data?
Amazon Kinesis stream → SOURCE_STREAM (id, text, tag)
SOURCE_STREAM is filtered into NHL_TWEET_STREAM (ID, TEXT, HASHTAG):
SELECT STREAM *
FROM "SOURCE_STREAM"
WHERE "HASHTAG" NOT IN <exclude list>
NHL_TWEET_STREAM is joined with the TeamName reference table (hashtag, team) and feeds:
• TOTAL_TWEETS_STREAM (TWEET_COUNT) → Amazon Kinesis Firehose
• MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT) → Amazon Kinesis Firehose
How do we filter unwanted tweets?
Use a PUMP to insert filtered data into a STREAM:
SOURCE_STREAM (id, text, tag) → NHL_TWEET_STREAM (ID, TEXT, HASHTAG)
CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
INSERT INTO "NHL_TWEET_STREAM"
SELECT STREAM "id", "text", LOWER("tag")
FROM "SOURCE_STREAM"
WHERE LOWER("tag") NOT IN ('nfl', 'mlb', 'nba');
How do we get team name from the hashtag?
• Create a CSV file with a hashtag-to-team-name map in S3
• Configure the Amazon Kinesis Analytics application to import the file as reference data
• Reference data appears as a table
• Join streaming data on reference data
hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks
s3://mybucket/team_map.csv
Use Reference Data in Query
SELECT STREAM tn."team"
FROM "NHL_TWEET_STREAM" tweets
INNER JOIN "TeamName" tn
    ON tweets."HASHTAG" = LOWER(tn."hashtag")
The TeamName reference table (loaded from s3://mybucket/team_map.csv) maps each HASHTAG in NHL_TWEET_STREAM to a full team name. Example output:
New York Rangers
New York Rangers
Carolina Hurricanes
Edmonton Oilers
San Jose Sharks
How do we aggregate streaming data?
• A common requirement in streaming analytics is to perform set-based operations (count, average, max, min, ...) over events that arrive within a specified period of time
• We cannot simply aggregate over an entire table as in a typical static database
• How do we define a subset of a potentially infinite stream?
• Windowing functions!
Windowing Concepts
• Windows can be tumbling or sliding
• Windows are fixed length
• The output record has the timestamp of the end of the window
[Figure: events arriving over time are grouped into fixed-length windows; a sum aggregate over each window emits one output event per window (e.g., 18 and 14).]
Comparing Types of Windows
• Output is created at the end of the window
• The output of the window is a single event, based on the aggregate function used
• Tumbling window: aggregate per time interval
• Sliding window: windows are constantly re-evaluated (both forms are contrasted in the sketch below)
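A side-by-side sketch under the same assumptions as the earlier examples (default source stream name, hypothetical one-minute windows); the tumbling form groups rows into fixed buckets, while the sliding form re-evaluates an OVER window on every arriving row:
-- Tumbling: one output row per one-minute bucket
SELECT STREAM FLOOR(s.ROWTIME TO MINUTE) AS bucket, COUNT(*) AS cnt
FROM "SOURCE_SQL_STREAM_001" s
GROUP BY FLOOR(s.ROWTIME TO MINUTE);
-- Sliding: one output row per input row, counting the trailing minute
SELECT STREAM COUNT(*) OVER W1 AS cnt
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (RANGE INTERVAL '1' MINUTE PRECEDING);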
How do we aggregate team mentions per hour?
• Use the TOP_K_ITEMS_TUMBLING function
• Pass a cursor over the team name stream
• Define a window size of 3600 seconds
INSERT INTO "MENTION_COUNT_STREAM"
SELECT STREAM *
FROM TABLE(TOP_K_ITEMS_TUMBLING(
CURSOR(SELECT STREAM tn."team"... ),
'team', -- name of column to aggregate
5, -- number of top items
3600 -- tumbling window size in seconds
));
Output to Amazon Kinesis Firehose
MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT) → Amazon Kinesis Firehose → Amazon Elasticsearch Service → Kibana
{
"aggregationtime": "2016-11-06T14:42:03.335",
"teamname": "New York Rangers",
"mentioncount": 604
}
5 records, every hour
Visualize Results with Kibana
Amazon Kinesis Analytics Best Practices
Managing Applications
Set up CloudWatch alarms
• The MillisBehindLatest metric tracks how far behind the application is from the source
• Alarm on the MillisBehindLatest metric. Consider triggering when the application is 1 hour behind, on a 1-minute average. Adjust accordingly for applications with lower end-to-end processing needs.
Managing Applications
Increase input parallelism to improve performance
• By default, a single source in-application stream is created
• If the application is not keeping up with the input stream, consider increasing input parallelism to create multiple source in-application streams (see the sketch below)
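A sketch of what the application code might look like with an input parallelism of 2; the service names the source streams SOURCE_SQL_STREAM_001 and SOURCE_SQL_STREAM_002, and each one gets its own pump (the columns here are hypothetical):
CREATE OR REPLACE STREAM "MERGED_STREAM" (ID BIGINT, PAYLOAD VARCHAR(1024));
CREATE OR REPLACE PUMP "PUMP_001" AS INSERT INTO "MERGED_STREAM"
SELECT STREAM * FROM "SOURCE_SQL_STREAM_001";  -- first source in-application stream
CREATE OR REPLACE PUMP "PUMP_002" AS INSERT INTO "MERGED_STREAM"
SELECT STREAM * FROM "SOURCE_SQL_STREAM_002";  -- second source in-application stream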
Managing Applications
Limit the number of applications reading from the same source
• Avoid ReadProvisionedThroughputExceeded exceptions
• For an Amazon Kinesis Streams source, limit to 2 total applications
• For an Amazon Kinesis Firehose source, limit to 1 application
Defining Input Schema
• Review and adequately test the inferred input schema
• Manually update the schema to handle nested JSON with more than 2 levels of depth
• Use SQL functions in your application for unstructured data (see the sketch below)
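For instance, a sketch that pulls the first token out of an unstructured line using standard SQL string functions; the column name log_line and both stream names are hypothetical:
CREATE OR REPLACE STREAM "PARSED_STREAM" (FIRST_TOKEN VARCHAR(128));
CREATE OR REPLACE PUMP "PARSE_PUMP" AS
INSERT INTO "PARSED_STREAM"
SELECT STREAM
    -- everything before the first space in the raw line
    SUBSTRING("log_line" FROM 1 FOR POSITION(' ' IN "log_line") - 1)
FROM "SOURCE_SQL_STREAM_001";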
Authoring Application Code
• Avoid time-based windows greater than one hour
• Keep window sizes small during development
• Use smaller SQL queries, with multiple in-application streams, rather than a single, large query (see the sketch below)
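A sketch of that decomposition, with hypothetical names: a filter step feeds an intermediate in-application stream, and a separate aggregation step reads from it, instead of nesting both in one large query:
CREATE OR REPLACE STREAM "FILTERED_STREAM" (TEAM VARCHAR(64));
CREATE OR REPLACE PUMP "FILTER_PUMP" AS INSERT INTO "FILTERED_STREAM"
SELECT STREAM "team" FROM "SOURCE_SQL_STREAM_001"
WHERE "team" IS NOT NULL;                               -- small step 1: filter
CREATE OR REPLACE STREAM "TEAM_COUNT_STREAM" (TEAM VARCHAR(64), TEAM_COUNT INTEGER);
CREATE OR REPLACE PUMP "COUNT_PUMP" AS INSERT INTO "TEAM_COUNT_STREAM"
SELECT STREAM TEAM, COUNT(*)
FROM "FILTERED_STREAM"                                  -- small step 2: aggregate
GROUP BY TEAM, FLOOR("FILTERED_STREAM".ROWTIME TO MINUTE);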
Limits
• Maximum row size in an in-application stream is 50 KB.
• Maximum input parallelism is 10 in-application streams.
• Each application supports one streaming source and one reference data source. The reference data source can be no larger than 1 GB in size.
Pricing
• Pay only for what you use.
• Charged an hourly rate, based on the average number of Amazon Kinesis Processing Units (KPUs) used to run your application.
• A single KPU provides one vCPU and 4 GB of memory.
• $0.11 per KPU-hour (US East). See the worked example below.
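As a rough worked example (assuming a steadily loaded 2-KPU application at the US East rate above): 2 KPUs × 24 hours × 30 days × $0.11 per KPU-hour ≈ $158.40 per month.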
Thank you!
Remember to complete your evaluations!