AWS Webcast - Amazon Kinesis and Apache Storm

© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

CLICKSTREAM ANALYTICS – AMAZON KINESIS AND APACHE STORM

Agenda

Clickstream Analytics

Data Ingestion

Amazon Kinesis

Data Processing

Apache Storm

Amazon EMR

Q & A

Clickstream Analytics in Real-time

Clickstream Analytics

From Wikipedia

“… clicks anywhere in the webpage or application, the action is logged on …”

“… useful for web activity analysis, software testing, market research …”

It’s all about People & Products !!!

Clickstream Analytics in Real-time

Ingestion: Files to Events

Processing: Batch to Continuous

Consumption: Reports to Alerts

Real-Time Analytics

Real-time Ingest

• Highly Scalable

• Durable

• Elastic

• Replay-able Reads

Continuous Processing Framework

• Load-balancing incoming streams

• Fault-tolerance, Checkpoint / Replay

• Elastic

• Enable multiple apps to process in parallel
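
The checkpoint / replay and multiple-apps bullets above can be illustrated with a minimal sketch. It is not Kinesis or KCL code: the stream is a plain Python list, and `process`, `checkpoints`, and the app names `top10` and `archive` are hypothetical names chosen for illustration.

```python
def process(stream, checkpoints, app, handler, fail_at=None):
    """Process records from the app's last checkpoint onward.

    stream      : list standing in for a durable, replay-able stream
    checkpoints : dict mapping app name -> next position to read
    fail_at     : position at which to simulate a worker crash
    """
    start = checkpoints.get(app, 0)
    for seq in range(start, len(stream)):
        if seq == fail_at:
            raise RuntimeError("simulated worker crash")
        handler(stream[seq])
        checkpoints[app] = seq + 1  # record progress after each record


stream = ["c1", "c2", "c3", "c4"]
checkpoints = {}
seen, archived = [], []

try:
    process(stream, checkpoints, "top10", seen.append, fail_at=2)
except RuntimeError:
    pass  # the worker dies after handling c1 and c2

# A replacement worker resumes from the stored checkpoint.
process(stream, checkpoints, "top10", seen.append)

# A second application reads the same stream independently.
process(stream, checkpoints, "archive", archived.append)

print(seen, archived, checkpoints)
```

Because the stream keeps its records, a crashed worker loses no data; it only needs its last checkpoint. Each application keeps its own position, so several can read the same stream without interfering.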

Continuous data flow

Low end-to-end latency

Continuous, real-time workloads

Data Ingestion

Global top-10

foo-analysis.com

Starting simple...

Global top-10

Elastic Beanstalk

foo-analysis.com

Distributing the workload…

Global top-10

Elastic Beanstalk

foo-analysis.com

Local top-10

Local top-10

Local top-10
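
A minimal sketch of the local top-10 to global top-10 idea, assuming each worker counts the clicks it receives and a final step merges the per-worker results. The function names and the sample referrers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def local_top_k(events: Iterable[str], k: int = 10) -> List[Tuple[str, int]]:
    """Count the events seen by one worker and return its k most frequent."""
    return Counter(events).most_common(k)


def global_top_k(local_results: Iterable[List[Tuple[str, int]]],
                 k: int = 10) -> List[Tuple[str, int]]:
    """Merge per-worker top-k lists by summing counts, then re-rank."""
    merged = Counter()
    for result in local_results:
        for key, count in result:
            merged[key] += count
    return merged.most_common(k)


# Hypothetical referrer clicks, already split across two workers by key.
worker_a = ["google.com", "google.com", "bing.com", "google.com"]
worker_b = ["yahoo.com", "yahoo.com", "example.org"]

top = global_top_k([local_top_k(worker_a), local_top_k(worker_b)], k=2)
print(top)
```

The merge is exact only when every key is routed to a single worker. If the same key can reach several workers, truncating each local list to ten entries can drop counts that would have mattered globally.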

Or using an Elastic Data Broker…

Global top-10

Elastic Beanstalk

foo-analysis.com

Kinesis

[Diagram labels: Data Record, Stream, Shard, Partition Key, Worker, Sequence Number; each Worker computes "My top-10"; example sequence numbers: 14, 17, 18, 21, 23]

Amazon Kinesis – Managed Stream

[Diagram: Data Sources connect to the AWS Endpoint; the stream (Shard 1, Shard 2, … Shard N) spans three Availability Zones; App. 1 [Data Archive], App. 2 [Metric Extraction], App. 3 [Sliding Window Analysis], and App. 4 [Machine Learning] consume the stream; also shown: S3, DynamoDB, Redshift, EMR]

Amazon Kinesis – Common Data Broker

Amazon Kinesis – Distributed Streams

From batch to continuous processing

Scale UP or DOWN without losing sequencing

Workers can replay records for up to 24 hours

Scale up to GB/sec without losing durability

Records stored across multiple availability zones

Multiple parallel Kinesis Apps

RDBMS, S3, Data Warehouse
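
A minimal sketch of how a partition key selects a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash key space. The even split in `even_shard_ranges` and the key `user-42` are assumptions for illustration; a real stream's ranges depend on how it was split or merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def hash_key(partition_key):
    """Map a partition key to an integer in [0, 2**128)."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def even_shard_ranges(shard_count):
    """Split the hash key space into contiguous, near-equal ranges."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


ranges = even_shard_ranges(4)
print(shard_for("user-42", ranges))
```

Records with the same partition key always land in the same shard, which is what preserves per-key ordering by sequence number while the stream scales out.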

Data Processing

Batch

Real Time

Clickstream – Real-time and Batch

[Diagram: Clickstream data flows to Batch Analysis and Streaming Analytics. Labels shown: Data Archive, DW, Hadoop, Deep Learning, Spark, Storm, KCL, Notifications & Alerts, Dashboards/visualizations, APIs]

Processing Stream in real-time

Storm Concepts

Streams: Unbounded sequence of tuples

Spout: Source of a stream, e.g. read from the Twitter streaming API

Bolts: Process input streams and produce new streams, e.g. functions, filters, aggregation, joins

Topologies: Network of spouts and bolts
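
The four concepts above can be sketched with Python generators. This is a toy, single-process analogy and not the Storm API: `click_spout`, `parse_bolt`, `count_bolt`, and the `page,referrer` line format are hypothetical.

```python
from collections import Counter


def click_spout(lines):
    """Spout: the source of the stream; emits one tuple per raw click line."""
    for line in lines:
        yield (line,)


def parse_bolt(stream):
    """Bolt (function + filter): extract the referrer, drop malformed lines."""
    for (line,) in stream:
        parts = line.split(",")
        if len(parts) == 2 and parts[1]:
            yield (parts[1].strip(),)


def count_bolt(stream):
    """Bolt (aggregation): emit the running count for each referrer."""
    counts = Counter()
    for (referrer,) in stream:
        counts[referrer] += 1
        yield (referrer, counts[referrer])


def run_topology(lines):
    """Topology: wire the spout and bolts into one pipeline."""
    return list(count_bolt(parse_bolt(click_spout(lines))))


clicks = ["/home,google.com", "/cart,bing.com", "broken-line", "/home,google.com"]
result = run_topology(clicks)
print(result)
```

In Storm each spout and bolt runs as many parallel tasks across the cluster; the sketch only shows how tuples flow from one stage to the next.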

Storm Architecture

[Diagram: Master Node runs Nimbus; Cluster Coordination is handled by Zookeeper (three nodes shown); Worker Processes: each Supervisor launches Workers]

Apache Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

Integration with queuing system

Higher level abstractions

Demo: Real time stream processing

Real-time: Event-based processing

Producer → Amazon Kinesis → KinesisStormSpout → Apache Storm → ElastiCache (Redis) → Node.js → Client (D3)

http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache

Creating a Storm Topology

// Spout configuration: which stream to read and where to keep state in Zookeeper
new KinesisSpoutConfig(streamName, zookeeperEndpoint)
    .withZookeeperPrefix(zookeeperPrefix)
    .withInitialPositionInStream(initialPositionInStream)
    .withRegion(Regions.fromName(regionName));
…
// Wire the topology: Kinesis spout -> Parse bolt -> Count bolt
builder.setSpout("Kinesis", spout, 2);
builder.setBolt("Parse", new ParseReferrerBolt(), 6).shuffleGrouping("Kinesis");
builder.setBolt("Count", new RollingCountBolt(5, 2, elasticCacheRedisEndpoint), 6)
    .fieldsGrouping("Parse", new Fields("referrer"));
…
StormSubmitter.submitTopology(topologyName, topoConf, builder.createTopology());

KinesisStormSpout
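
The topology above uses two stream groupings: shuffle grouping for the Parse bolt and fields grouping on "referrer" for the Count bolt. A minimal sketch of the difference follows. It is not Storm's implementation: round-robin stands in for Storm's random shuffle, CRC32 stands in for its field hashing, and the function names are hypothetical.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Spread tuples evenly over tasks (round-robin as a stand-in for random)."""
    cycle = itertools.cycle(range(num_tasks))
    return lambda _tuple: next(cycle)


def fields_grouping(field, num_tasks):
    """Route by a hash of one field, so equal values reach the same task."""
    def choose(tup):
        return zlib.crc32(str(tup[field]).encode("utf-8")) % num_tasks
    return choose


parse_task = shuffle_grouping(6)              # any Parse task can take any tuple
count_task = fields_grouping("referrer", 6)   # one Count task per referrer value

tuples = [{"referrer": r} for r in ["google.com", "bing.com", "google.com"]]
shuffle_targets = [parse_task(t) for t in tuples]
field_targets = [count_task(t) for t in tuples]
print(shuffle_targets, field_targets)
```

Parsing is stateless, so any task can handle any tuple. Counting is stateful, so all tuples for one referrer must reach the same task or the counts would be split across tasks.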

Sliding window using Tick Tuple

…
public void execute(Tuple tuple) {
    if (TupleHelpers.isTickTuple(tuple)) {
        LOG.debug("Received tick tuple, triggering emit of current window counts");
        emitCurrentWindowCounts();
    } else {
        countObjAndAck(tuple);
    }
}
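
A minimal sketch of the slot-based sliding window that a tick-driven bolt like the one above could maintain. It is an illustration under stated assumptions, not the code behind `emitCurrentWindowCounts()`: the class `SlidingWindowCounter`, the `TICK` marker, and the two-slot window are hypothetical.

```python
from collections import defaultdict


class SlidingWindowCounter:
    """Per-object counts kept in a ring of slots; one slot per tick interval."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.head = 0  # slot that receives new counts
        self.slots = defaultdict(lambda: [0] * num_slots)

    def count(self, obj):
        self.slots[obj][self.head] += 1

    def emit_and_advance(self):
        """Return totals over the whole window, then expire the oldest slot."""
        totals = {obj: sum(s) for obj, s in self.slots.items()}
        self.head = (self.head + 1) % self.num_slots
        for obj in list(self.slots):
            self.slots[obj][self.head] = 0   # reuse the oldest slot
            if sum(self.slots[obj]) == 0:
                del self.slots[obj]          # forget objects with no counts left
        return totals


TICK = object()  # stand-in for a tick tuple


def execute(counter, tup, emitted):
    """Mirror of the bolt logic: ticks trigger an emit, other tuples are counted."""
    if tup is TICK:
        emitted.append(counter.emit_and_advance())
    else:
        counter.count(tup)


counter = SlidingWindowCounter(num_slots=2)
emitted = []
for tup in ["a", "a", TICK, "b", TICK, TICK]:
    execute(counter, tup, emitted)
print(emitted)
```

With two slots, each emit covers the current and the previous tick interval, so the clicks on "a" appear in the first two emits and then slide out of the window.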

Using Redis as an Event relay

for (Entry<Object, Long> entry : counts.entrySet()) {
    …
    msg.put("name", referrer);
    msg.put("time", currentEPOCH);
    msg.put("count", count);
    …
    jedis.publish("pubsubCounters", msg.toString());
}

ElastiCache(Redis)
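
A minimal sketch of the relay pattern: each window count becomes a small JSON message published on the "pubsubCounters" channel. The `InMemoryPubSub` class is a hypothetical in-process stand-in for Redis publish/subscribe, used so the example runs without a server; it is not a Redis client.

```python
import json
import time
from collections import defaultdict


class InMemoryPubSub:
    """In-process stand-in for Redis publish/subscribe (assumption)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver to every subscriber; return how many received the message."""
        for callback in self.subscribers[channel]:
            callback(channel, message)
        return len(self.subscribers[channel])


def publish_counts(broker, counts, now=None):
    """Publish one JSON message per referrer with its name, time and count."""
    epoch = int(time.time()) if now is None else now
    for referrer, count in counts.items():
        msg = {"name": referrer, "time": epoch, "count": count}
        broker.publish("pubsubCounters", json.dumps(msg))


broker = InMemoryPubSub()
received = []
broker.subscribe("pubsubCounters", lambda channel, message: received.append(message))
publish_counts(broker, {"google.com": 3}, now=1420070400)
print(received)
```

Publish/subscribe decouples the Storm bolt from the web tier: the bolt does not need to know how many Node.js servers are listening.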

Node.js – PubSub to Server-Sent Events

function ticker(req, res) {
    …
    subscriber.subscribe("pubsubCounters");
    subscriber.on("message", function(channel, message) {
        res.json(message);
    });
    …
    res.json = function(obj) { res.write("data: " + obj + "\n\n"); };
}

connect(){
    ...
    if (req.url == '/eventCounters') { ticker(req, res); }

Node.js
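
The `res.write("data: "+obj+"\n\n")` call above produces the Server-Sent Events wire format: each event is one or more `data:` lines followed by a blank line. A minimal sketch of that framing, with a hypothetical helper name `sse_frame`:

```python
def sse_frame(data):
    """Frame a payload as one Server-Sent Event.

    Every line of the payload gets its own "data: " prefix, and an empty
    line terminates the event.
    """
    lines = data.split("\n")
    return "".join("data: " + line + "\n" for line in lines) + "\n"


frame = sse_frame('{"name": "google.com", "count": 3}')
print(repr(frame))
```

The browser's `EventSource` joins multiple `data:` lines with newlines and fires one `message` event per blank-line-terminated block. The response is served with the `text/event-stream` content type.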

Visualizing the events in Client

var source = new EventSource('/ticker');
source.addEventListener('message', tick);

function tick(e) {
    if (e) {
        var eventData = JSON.parse(e.data);
        window[eventData.name].push([{ time: eventData.time, y: eventData.count }]);
    }
}

Client(D3)

Amazon EMR

Processing Streams with Hadoop

Amazon EMR?

Map-Reduce engine

Integrated with tools

Hadoop-as-a-service

Massively parallel

Cost effective AWS wrapper

Integrated to AWS services

Introduction to Amazon EMR

[Diagram: Master instance group; Core instance group (HDFS); Task instance group; Amazon S3]

Amazon EMR - Architecture

Master instance

Controls the cluster

Core instance

Life of cluster

DataNode and TaskTracker daemons

Task instances

Added or subtracted to perform work (SPOT)

S3 as underlying ‘file system’

Analyzing Kinesis using Amazon EMR

[Diagram: two analysis paths, labeled "Offline Analysis" and "Ad-hoc Analysis". Components shown: Producer, Amazon Kinesis, Kinesis Application, S3, EMR; and Amazon Kinesis with EMR running Hive, Pig, Spark, MapReduce]

Demo: Stream processing with Spark

Spark Streaming and Kinesis

Launch an EMR cluster with Spark

http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster

Spark Streaming

http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html

Spark Streaming Kinesis integration

http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

Kinesis Word Count Example

private object KinesisWordCountASL extends Logging {
  …
  val sparkConfig = new SparkConf().setAppName("KinesisWordCount")
  val ssc = new StreamingContext(sparkConfig, batchInterval)

  val unionStreams = ssc.union(kinesisStreams)

  /* Convert each line of Array[Byte] to String, split into words, and count them */
  val words = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))

  /* Map each word to a (word, 1) tuple so we can reduce/aggregate by key. */
  val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
  …
}
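
A minimal sketch of what the flatMap / map / reduceByKey chain above computes for a single micro-batch, written in plain Python without Spark. The function name `word_counts` and the sample records are hypothetical.

```python
from collections import Counter


def word_counts(batch):
    """Count words in one micro-batch of raw byte records."""
    # flatMap: decode each record and split it into words
    words = [w for record in batch for w in record.decode("utf-8").split(" ")]
    # map: pair every word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts for each distinct word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


batch = [b"kinesis storm", b"kinesis spark"]
counts = word_counts(batch)
print(counts)
```

Spark Streaming applies the same three steps to every batch interval, with the work distributed across the cluster instead of running in one loop.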

Amazon Kinesis with Apache Storm

http://d0.awsstatic.com/whitepapers/building-sliding-window-analysis-of-clickstream-data-kinesis.pdf

Amazon Kinesis with Amazon EMR

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-kinesis.html

Amazon Kinesis with Apache Spark


Q & A

THANK YOU !!!

http://aws.amazon.com/big-data
