AWS re:Invent 2016: Analyzing Streaming Data in Real-Time with Amazon Kinesis Analytics (BDM304)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
November 2016
BDM304
Analyzing Streaming Data in Real-Time
with Amazon Kinesis Analytics
Allan MacInnis, Senior Solutions Architect, AWS
What to Expect from the Session
• Streaming data overview
• Amazon Kinesis platform review
• Amazon Kinesis Analytics overview
• Amazon Kinesis Analytics patterns
• Streaming data end-to-end example and walkthrough
• Amazon Kinesis Analytics best practices
Streaming Data Overview
Most data is produced continuously
Mobile apps, web clickstream, application logs, metering records, IoT sensors, smart buildings
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + recent data is more valuable
• If you have the means to combine them
Processing real-time, streaming data
What are the key requirements?
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
Ingest → Transform → Analyze → React → Persist
Amazon Kinesis Platform Overview
Amazon Kinesis makes it easy to work with real-time streaming data
Amazon Kinesis Streams
• For technical developers
• Collect and stream data for ordered, replayable, real-time processing
Amazon Kinesis Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
Amazon Kinesis Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries
Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
Amazon Kinesis Firehose
• Reliably ingest and deliver batched, compressed, and encrypted data to S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES)
• Point-and-click setup with zero administration and seamless elasticity
Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
Amazon Kinesis Analytics: Service Overview
Amazon Kinesis Analytics
• Pay for only what you use
• Automatic elasticity
• Standard SQL for analytics
• Real-time processing
• Easy to use
Use SQL to build real-time applications
• Connect to streaming source
• Easily write SQL code to process streaming data
• Continuously deliver SQL results
Connect to streaming source
• Streaming data sources include Amazon Kinesis Firehose or Amazon Kinesis Streams
• Input formats include JSON, CSV, variable column, or unstructured text
• Each input has a schema; the schema is inferred, but you can edit it
• Reference data sources (in S3) can be used for data enrichment
Write SQL code
• Build streaming applications with one or more SQL statements
• Robust SQL support and advanced analytic functions
• Extensions to the SQL standard to work seamlessly with streaming data
• Support for at-least-once processing semantics
Continuously deliver SQL results
• Send processed data to multiple destinations
• S3, Amazon Redshift, or Amazon ES (through Firehose)
• Streams (with AWS Lambda integration for custom destinations)
• End-to-end processing speed as low as sub-second
• Separation of processing and data delivery
What are common uses for Amazon Kinesis Analytics?
Generate time series analytics
• Compute key performance indicators over time periods (see the sketch below)
• Combine with static or historical data in S3 or Amazon Redshift
[Architecture: Streams / Firehose → Analytics → Firehose → Amazon Redshift / S3, or → Streams → custom, real-time destinations]
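A minimal sketch of the KPI pattern, assuming the default source in-application stream name (SOURCE_SQL_STREAM_001) and a hypothetical per-minute event count; grouping by FLOOR(ROWTIME TO MINUTE) gives a one-minute tumbling window:
CREATE OR REPLACE STREAM "KPI_STREAM" (WINDOW_TIME TIMESTAMP, EVENT_COUNT INTEGER);
CREATE OR REPLACE PUMP "KPI_PUMP" AS
INSERT INTO "KPI_STREAM"
SELECT STREAM
    FLOOR(s.ROWTIME TO MINUTE) AS window_time, -- start of the one-minute bucket
    COUNT(*) AS event_count                    -- events that arrived in the bucket
FROM "SOURCE_SQL_STREAM_001" s
GROUP BY FLOOR(s.ROWTIME TO MINUTE);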
Feed real-time dashboards
• Validate and transform raw data, and then process it to calculate meaningful statistics (a sketch follows below)
• Send processed data downstream for visualization in BI and visualization services
[Architecture: Streams / Firehose → Analytics → Amazon ES / Amazon Redshift / Amazon RDS → Amazon QuickSight]
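A sketch of the validate-and-transform step; the column names (user_id, page, latency_ms) are hypothetical, and malformed rows are dropped before anything reaches the dashboard:
CREATE OR REPLACE STREAM "CLEAN_STREAM" (USER_ID VARCHAR(64), PAGE VARCHAR(256), LATENCY_MS INTEGER);
CREATE OR REPLACE PUMP "CLEAN_PUMP" AS
INSERT INTO "CLEAN_STREAM"
SELECT STREAM "user_id", LOWER("page"), "latency_ms"  -- normalize the page name
FROM "SOURCE_SQL_STREAM_001"
WHERE "latency_ms" IS NOT NULL                        -- drop records missing the metric
  AND "latency_ms" BETWEEN 0 AND 60000;               -- discard implausible values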
Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications (see the sketch below)
[Architecture: Streams / Firehose → Analytics → Streams → Amazon SNS / Amazon CloudWatch / Lambda]
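A sketch of the alarm pattern, with a hypothetical status column and threshold; a one-minute sliding window counts errors, and only rows that cross the threshold reach the output stream, where a Lambda or SNS consumer can pick them up:
CREATE OR REPLACE STREAM "ALARM_STREAM" (ERROR_COUNT INTEGER);
CREATE OR REPLACE PUMP "ALARM_PUMP" AS
INSERT INTO "ALARM_STREAM"
SELECT STREAM error_count
FROM (
    SELECT STREAM COUNT(*) OVER W1 AS error_count  -- errors seen in the trailing minute
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "status" = 'ERROR'
    WINDOW W1 AS (RANGE INTERVAL '1' MINUTE PRECEDING)
)
WHERE error_count > 100;                           -- hypothetical alarm threshold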
Example: NHL Tweet Analysis
Example Scenario Requirements
Data to capture
• Filter for NHL-related tweets
• Total number of tweets per hour that contain hashtags for NHL teams
• Top 5 NHL-related hashtags per hour
Output Requirements
• Filtered tweets are saved to Amazon S3
• Hourly aggregate count is saved to Amazon ES
• Full team names of the top 5 hashtags are saved to Amazon ES
Why use Amazon Kinesis Analytics for this solution?
Challenges
• Twitter stream can be noisy
• Tweet structure is complex, with several levels of nested JSON
• NHL-related tweet volume is cyclical
With Amazon Kinesis Analytics:
• Easily filter out unwanted tweets
• Normalize tweet schema for simple SQL queries
• Automatically scale to meet demand
End-to-End Architecture
[Architecture: EC2 instance → Amazon Kinesis Streams → Amazon Kinesis Analytics (with reference data in Amazon S3) → Amazon Kinesis Firehose → Amazon Elasticsearch Service / Amazon S3]
Twitter Data Input
EC2 instance → Amazon Kinesis stream
Source JSON data
• Hundreds of attributes per tweet
• ~5 KB per tweet
• Most attributes are unnecessary for this use case
Amazon Kinesis Streams input
• 5 attributes per record
• ~250 B per record
• Only the relevant attributes
How are tweets mapped to a schema?
Amazon Kinesis stream → Amazon Kinesis Analytics
{
"id": 795296435386388500,
"text": "#Canes game tonight! #NHL",
"created_at": "11-06-2016 16:07:00",
"tags": [{
"tag": "Canes"
}, {
"tag": "NHL"
}]
}
id    text      created_at     tag
795…  #Canes…   11-06-2016…    Canes
795…  #Canes…   11-06-2016…    NHL
Source data for Amazon Kinesis Analytics: the nested tags array is unnested into one row per hashtag.
How is streaming data accessed with SQL?
STREAM
• Analogous to a TABLE
• Represents continuous data flow
CREATE OR REPLACE STREAM "NHL_TWEET_STREAM" (
    ID BIGINT, TWEET_TEXT VARCHAR(140), HASHTAG VARCHAR(140));
PUMP
• Continuous INSERT query
• Inserts data from one in-application stream to another
CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
INSERT INTO "NHL_TWEET_STREAM"
SELECT STREAM * FROM . . .
How do we model our data?
Amazon Kinesis stream → SOURCE_STREAM (id, text, tag)
SOURCE_STREAM is filtered into NHL_TWEET_STREAM (ID, TEXT, HASHTAG):
SELECT STREAM *
FROM "SOURCE_STREAM"
WHERE "HASHTAG" NOT IN <exclude list>
NHL_TWEET_STREAM is joined with the TeamName reference table (hashtag, team) and feeds:
• TOTAL_TWEETS_STREAM (TWEET_COUNT) → Amazon Kinesis Firehose
• MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT) → Amazon Kinesis Firehose
How do we filter unwanted tweets?
Use a PUMP to insert filtered data into a STREAM:
SOURCE_STREAM (id, text, tag) → NHL_TWEET_STREAM (ID, TEXT, HASHTAG)
CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
INSERT INTO "NHL_TWEET_STREAM"
SELECT STREAM "id", "text", LOWER("tag")
FROM "SOURCE_STREAM"
WHERE LOWER("tag") NOT IN ('nfl', 'mlb', 'nba');
How do we get team name from the hashtag?
• Create a CSV file with a hashtag-to-team-name map in S3
• Configure the Amazon Kinesis Analytics application to import the file as reference data
• Reference data appears as a table
• Join streaming data on reference data
hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks
s3://mybucket/team_map.csv
Use Reference Data in Query
SELECT STREAM tn."team"
FROM "NHL_TWEET_STREAM" tweets
INNER JOIN "TeamName" tn
    ON tweets."HASHTAG" = LOWER(tn."hashtag")
The TeamName reference table (loaded from s3://mybucket/team_map.csv) maps each HASHTAG in NHL_TWEET_STREAM to a full team name. Example output:
New York Rangers
New York Rangers
Carolina Hurricanes
Edmonton Oilers
San Jose Sharks
How do we aggregate streaming data?
• A common requirement in streaming analytics is to perform set-based operations (count, average, max, min, ...) over events that arrive within a specified period of time
• We cannot simply aggregate over an entire table as in a typical static database
• How do we define a subset of a potentially infinite stream?
• Windowing functions!
Windowing Concepts
• Windows can be tumbling or sliding
• Windows are fixed length
• The output record has the timestamp of the end of the window
[Figure: events arriving over time are grouped into fixed-length windows; a sum aggregate over each window emits one output event per window (e.g., 18 and 14).]
Comparing Types of Windows
• Output is created at the end of the window
• The output of the window is a single event, based on the aggregate function used
• Tumbling window: aggregate per time interval
• Sliding window: windows are constantly re-evaluated (both forms are contrasted in the sketch below)
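A side-by-side sketch under the same assumptions as the earlier examples (default source stream name, hypothetical one-minute windows); the tumbling form groups rows into fixed buckets, while the sliding form re-evaluates an OVER window on every arriving row:
-- Tumbling: one output row per one-minute bucket
SELECT STREAM FLOOR(s.ROWTIME TO MINUTE) AS bucket, COUNT(*) AS cnt
FROM "SOURCE_SQL_STREAM_001" s
GROUP BY FLOOR(s.ROWTIME TO MINUTE);
-- Sliding: one output row per input row, counting the trailing minute
SELECT STREAM COUNT(*) OVER W1 AS cnt
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (RANGE INTERVAL '1' MINUTE PRECEDING);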
How do we aggregate team mentions per hour?
• Use the TOP_K_ITEMS_TUMBLING function
• Pass a cursor over the team name stream
• Define a window size of 3600 seconds
INSERT INTO "MENTION_COUNT_STREAM"
SELECT STREAM *
FROM TABLE(TOP_K_ITEMS_TUMBLING(
CURSOR(SELECT STREAM tn."team"... ),
'team', -- name of column to aggregate
5, -- number of top items
3600 -- tumbling window size in seconds
));
Output to Amazon Kinesis Firehose
MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT) → Amazon Kinesis Firehose → Amazon Elasticsearch Service → Kibana
{
"aggregationtime": "2016-11-06T14:42:03.335",
"teamname": "New York Rangers",
"mentioncount": 604
}
5 records, every hour
Visualize Results with Kibana
Amazon Kinesis Analytics Best Practices
Managing Applications
Set up CloudWatch alarms
• The MillisBehindLatest metric tracks how far behind the application is from the source
• Alarm on the MillisBehindLatest metric. Consider triggering when the application is 1 hour behind, on a 1-minute average. Adjust accordingly for applications with lower end-to-end processing needs.
Managing Applications
Increase input parallelism to improve performance
• By default, a single source in-application stream is created
• If the application is not keeping up with the input stream, consider increasing input parallelism to create multiple source in-application streams (see the sketch below)
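A sketch of what the application code might look like with an input parallelism of 2; the service names the source streams SOURCE_SQL_STREAM_001 and SOURCE_SQL_STREAM_002, and each one gets its own pump (the columns here are hypothetical):
CREATE OR REPLACE STREAM "MERGED_STREAM" (ID BIGINT, PAYLOAD VARCHAR(1024));
CREATE OR REPLACE PUMP "PUMP_001" AS INSERT INTO "MERGED_STREAM"
SELECT STREAM * FROM "SOURCE_SQL_STREAM_001";  -- first source in-application stream
CREATE OR REPLACE PUMP "PUMP_002" AS INSERT INTO "MERGED_STREAM"
SELECT STREAM * FROM "SOURCE_SQL_STREAM_002";  -- second source in-application stream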
Managing Applications
Limit the number of applications reading from the same source
• Avoid ReadProvisionedThroughputExceeded exceptions
• For an Amazon Kinesis Streams source, limit to 2 total applications
• For an Amazon Kinesis Firehose source, limit to 1 application
Defining Input Schema
• Review and adequately test the inferred input schema
• Manually update the schema to handle nested JSON with more than 2 levels of depth
• Use SQL functions in your application for unstructured data (see the sketch below)
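For instance, a sketch that pulls the first token out of an unstructured line using standard SQL string functions; the column name log_line and both stream names are hypothetical:
CREATE OR REPLACE STREAM "PARSED_STREAM" (FIRST_TOKEN VARCHAR(128));
CREATE OR REPLACE PUMP "PARSE_PUMP" AS
INSERT INTO "PARSED_STREAM"
SELECT STREAM
    -- everything before the first space in the raw line
    SUBSTRING("log_line" FROM 1 FOR POSITION(' ' IN "log_line") - 1)
FROM "SOURCE_SQL_STREAM_001";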
Authoring Application Code
• Avoid time-based windows greater than one hour
• Keep window sizes small during development
• Use smaller SQL queries, with multiple in-application streams, rather than a single, large query (see the sketch below)
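A sketch of that decomposition, with hypothetical names: a filter step feeds an intermediate in-application stream, and a separate aggregation step reads from it, instead of nesting both in one large query:
CREATE OR REPLACE STREAM "FILTERED_STREAM" (TEAM VARCHAR(64));
CREATE OR REPLACE PUMP "FILTER_PUMP" AS INSERT INTO "FILTERED_STREAM"
SELECT STREAM "team" FROM "SOURCE_SQL_STREAM_001"
WHERE "team" IS NOT NULL;                               -- small step 1: filter
CREATE OR REPLACE STREAM "TEAM_COUNT_STREAM" (TEAM VARCHAR(64), TEAM_COUNT INTEGER);
CREATE OR REPLACE PUMP "COUNT_PUMP" AS INSERT INTO "TEAM_COUNT_STREAM"
SELECT STREAM TEAM, COUNT(*)
FROM "FILTERED_STREAM"                                  -- small step 2: aggregate
GROUP BY TEAM, FLOOR("FILTERED_STREAM".ROWTIME TO MINUTE);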
Limits
• Maximum row size in an in-application stream is 50 KB.
• Maximum input parallelism is 10 in-application streams.
• Each application supports one streaming source and one reference data source. The reference data source can be no larger than 1 GB in size.
Pricing
• Pay only for what you use.
• Charged an hourly rate, based on the average number of Amazon Kinesis Processing Units (KPUs) used to run your application.
• A single KPU provides one vCPU and 4 GB of memory.
• $0.11 per KPU-hour (US East). See the worked example below.
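As a rough worked example (assuming a steadily loaded 2-KPU application at the US East rate above): 2 KPUs × 24 hours × 30 days × $0.11 per KPU-hour ≈ $158.40 per month.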
Thank you!
Remember to complete your evaluations!