AWS Webcast - Amazon Kinesis and Apache Storm

© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

CLICKSTREAM ANALYTICS – AMAZON KINESIS AND APACHE STORM

Uploaded by: amazon-web-services

Posted on 16-Jul-2015


TRANSCRIPT

Page 1: AWS Webcast - Amazon Kinesis and Apache Storm


CLICKSTREAM ANALYTICS – AMAZON KINESIS AND APACHE STORM

Page 2: AWS Webcast - Amazon Kinesis and Apache Storm

Agenda

Clickstream Analytics

Data Ingestion

Amazon Kinesis

Data Processing

Apache Storm

Amazon EMR

Q & A

Page 3: AWS Webcast - Amazon Kinesis and Apache Storm

Clickstream Analytics in Real-time

Page 4: AWS Webcast - Amazon Kinesis and Apache Storm

Clickstream Analytics

From Wikipedia

“… clicks anywhere in the webpage or application, the action is logged on …”

“… useful for web activity analysis, software testing, market research …”

It’s all about People & Products !!!

Page 5: AWS Webcast - Amazon Kinesis and Apache Storm

Clickstream Analytics in Real-time

Ingestion

Files to Events

Processing

Batch to Continuous

Consumption

Reports to Alerts

Page 6: AWS Webcast - Amazon Kinesis and Apache Storm

Real-Time Analytics

Real-time Ingest

• Highly Scalable

• Durable

• Elastic

• Replay-able Reads

Continuous Processing FX

• Load-balancing incoming streams

• Fault-tolerance, Checkpoint / Replay

• Elastic

• Enable multiple apps to process in parallel

Continuous data flow

Low end-to-end latency

Continuous, real-time workloads


Page 7: AWS Webcast - Amazon Kinesis and Apache Storm

Data Ingestion

Page 8: AWS Webcast - Amazon Kinesis and Apache Storm

Global top-10

foo-analysis.com

Starting simple...

Page 9: AWS Webcast - Amazon Kinesis and Apache Storm

Global top-10

Elastic Beanstalk

foo-analysis.com

Distributing the workload…

Page 10: AWS Webcast - Amazon Kinesis and Apache Storm

Global top-10

Elastic Beanstalk

foo-analysis.com

Local top-10

Local top-10

Local top-10

Or using an Elastic Data Broker…

Page 11: AWS Webcast - Amazon Kinesis and Apache Storm

Global top-10

Elastic Beanstalk

foo-analysis.com

Kinesis

Diagram labels: Stream, Shard, Data Record, Partition Key, Sequence Number (14, 17, 18, 21, 23), Worker, My top-10

Amazon Kinesis – Managed Stream
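
The diagram above names the partition key and the shard but not how one leads to the other. Kinesis hashes each partition key with MD5 into a 128-bit integer, and every shard owns a range of that hash key space. The sketch below illustrates the idea only: the class `ShardRouterSketch`, its method names, and the assumption that the shards split the key space into equal-width ranges are all hypothetical, chosen for illustration (a real stream's ranges depend on how it was created and resharded).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not part of the Kinesis client library.
final class ShardRouterSketch {

    /** MD5 of the partition key, read as an unsigned 128-bit integer. */
    static BigInteger hashKey(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: treat the bytes as non-negative
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * Index of the shard that would own the key, ASSUMING the 2^128 hash key
     * space is split into shardCount equal-width ranges.
     */
    static int shardFor(String partitionKey, int shardCount) {
        // floor(hash * shardCount / 2^128) is always in [0, shardCount - 1]
        return hashKey(partitionKey)
                .multiply(BigInteger.valueOf(shardCount))
                .shiftRight(128)
                .intValueExact();
    }

    public static void main(String[] args) {
        // Records with the same partition key always land on the same shard,
        // which is what preserves per-key ordering within the stream.
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.println(key + " -> shard " + shardFor(key, 4));
        }
    }
}
```

Because the mapping depends only on the key, a producer that wants per-user ordering can simply use the user ID as the partition key.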

Page 12: AWS Webcast - Amazon Kinesis and Apache Storm

Diagram labels:

Data Sources (several) send records to the AWS Endpoint

Stream shards: Shard 1, Shard 2, … Shard N, across three Availability Zones

Consuming applications: App. 1 [Data Archive], App. 2 [Metric Extraction], App. 3 [Sliding Window Analysis], App. 4 [Machine Learning]

Destinations: S3, DynamoDB, Redshift, EMR

Amazon Kinesis – Common Data Broker

Page 13: AWS Webcast - Amazon Kinesis and Apache Storm

Amazon Kinesis – Distributed Streams

From batch to continuous processing

Scale UP or DOWN without losing sequencing

Workers can replay records for up to 24 hours

Scale up to GB/sec without losing durability

Records stored across multiple availability zones

Multiple parallel Kinesis Apps

RDBMS, S3, Data Warehouse

Page 14: AWS Webcast - Amazon Kinesis and Apache Storm

Data Processing

Page 15: AWS Webcast - Amazon Kinesis and Apache Storm

Clickstream – Real-time and Batch

Diagram labels:

Clickstream

Real Time: Streaming Analytics (Spark, Storm, KCL); Notifications & Alerts; Dashboards/visualizations; APIs

Batch: Data Archive; Batch Analysis (DW, Hadoop); Deep Learning; Dashboards/visualizations

Page 16: AWS Webcast - Amazon Kinesis and Apache Storm

Processing Stream in real-time

Page 17: AWS Webcast - Amazon Kinesis and Apache Storm

Storm Concepts

Streams

Unbounded sequence of tuples

Spout

Source of Stream e.g. Read from Twitter streaming API

Bolts

Processes input streams and produces new streams e.g. Functions, Filters, Aggregation, Joins

Topologies

Network of spouts and bolts

Page 18: AWS Webcast - Amazon Kinesis and Apache Storm

Storm Architecture

Diagram labels:

Master Node: Nimbus

Cluster Coordination: Zookeeper (several nodes)

Worker Processes: Supervisors, each of which launches Workers

Page 19: AWS Webcast - Amazon Kinesis and Apache Storm

Apache Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

Integration with queuing systems

Higher level abstractions

Page 20: AWS Webcast - Amazon Kinesis and Apache Storm

Demo: Real time stream processing

Page 21: AWS Webcast - Amazon Kinesis and Apache Storm

Real-time: Event-based processing

Producer → Amazon Kinesis → KinesisStormSpout → Apache Storm → ElastiCache (Redis) → Node.js → Client (D3)

http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache

Page 22: AWS Webcast - Amazon Kinesis and Apache Storm

Creating a Storm Topology

new KinesisSpoutConfig(streamName, zookeeperEndpoint)
    .withZookeeperPrefix(zookeeperPrefix)
    .withInitialPositionInStream(initialPositionInStream)
    .withRegion(Regions.fromName(regionName));
…
builder.setSpout("Kinesis", spout, 2);
builder.setBolt("Parse", new ParseReferrerBolt(), 6)
    .shuffleGrouping("Kinesis");
builder.setBolt("Count", new RollingCountBolt(5, 2, elasticCacheRedisEndpoint), 6)
    .fieldsGrouping("Parse", new Fields("referrer"));
…
StormSubmitter.submitTopology(topologyName, topoConf, builder.createTopology());

KinesisStormSpout
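
The topology uses two groupings: `shuffleGrouping` spreads tuples across the "Parse" tasks with no regard to content, while `fieldsGrouping` on "referrer" sends every tuple with the same referrer to the same "Count" task, so each referrer's count lives in exactly one place. A minimal sketch of that routing idea follows. The class `FieldsGroupingSketch` and the hash-modulo rule are illustrative assumptions, not Storm's actual code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of fields grouping; not Storm's implementation.
final class FieldsGroupingSketch {

    /**
     * Picks a task index from the values of the grouping fields.
     * Equal values always yield the same index.
     */
    static int taskFor(List<?> groupingValues, int numTasks) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int countTasks = 6; // parallelism of the "Count" bolt on the slide
        for (String referrer : new String[] {"google.com", "bing.com", "google.com"}) {
            int task = taskFor(Arrays.asList(referrer), countTasks);
            System.out.println(referrer + " -> Count task " + task);
        }
    }
}
```

Both "google.com" lines print the same task index, which is the property a per-referrer rolling count relies on.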

Page 23: AWS Webcast - Amazon Kinesis and Apache Storm

Sliding window using Tick Tuple

…
public void execute(Tuple tuple) {
    if (TupleHelpers.isTickTuple(tuple)) {
        LOG.debug("Received tick tuple, triggering emit of current window counts");
        emitCurrentWindowCounts();
    } else {
        countObjAndAck(tuple);
    }
}
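
The bolt above counts ordinary tuples and emits the window totals whenever a tick tuple arrives. One common way to keep such a window, sketched here under stated assumptions, is a ring of per-tick slots: new counts go into the current slot, and each tick reports the sum of all slots and then recycles the oldest one. The class `SlidingWindowCounterSketch` and its methods are hypothetical and are not the `RollingCountBolt` from the demo.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical slot-based sliding window; window length = numSlots ticks.
final class SlidingWindowCounterSketch {
    private final int numSlots;
    private final Map<String, long[]> slots = new HashMap<>();
    private int head = 0; // index of the slot that receives new counts

    SlidingWindowCounterSketch(int numSlots) {
        if (numSlots < 1) {
            throw new IllegalArgumentException("numSlots must be >= 1");
        }
        this.numSlots = numSlots;
    }

    /** Called for every ordinary tuple: count the key in the current slot. */
    void increment(String key) {
        slots.computeIfAbsent(key, k -> new long[numSlots])[head]++;
    }

    /** Called on every tick: report window totals, then slide by one slot. */
    Map<String, Long> emitAndAdvance() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : slots.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            totals.put(e.getKey(), sum);
        }
        head = (head + 1) % numSlots;
        for (long[] perKey : slots.values()) {
            perKey[head] = 0; // the oldest slot falls out of the window
        }
        return totals;
    }

    public static void main(String[] args) {
        SlidingWindowCounterSketch counter = new SlidingWindowCounterSketch(3);
        counter.increment("google.com");
        counter.increment("google.com");
        System.out.println(counter.emitAndAdvance()); // {google.com=2}
        counter.increment("google.com");
        System.out.println(counter.emitAndAdvance()); // {google.com=3}
    }
}
```

The sketch never evicts keys whose counts have dropped to zero; a long-running bolt would want to remove them to bound memory.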

Page 24: AWS Webcast - Amazon Kinesis and Apache Storm

Using Redis as an Event relay

for (Entry<Object, Long> entry : counts.entrySet()) {
    …
    msg.put("name", referrer);
    msg.put("time", currentEPOCH);
    msg.put("count", count);
    …
    jedis.publish("pubsubCounters", msg.toString());
}

ElastiCache(Redis)

Page 25: AWS Webcast - Amazon Kinesis and Apache Storm

Node.js – Pub/Sub to Server-Sent Events

function ticker(req, res) {
    …
    subscriber.subscribe("pubsubCounters");
    subscriber.on("message", function(channel, message) {
        res.json(message);
        …
    res.json = function(obj) { res.write("data: " + obj + "\n\n"); }
}

connect(){
    ...
    if (req.url == '/eventCounters') { ticker(req, res); }

Node.js
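
The `res.write("data: " + obj + "\n\n")` call above is the whole wire format of a server-sent event: one or more `data:` lines followed by a blank line. The sketch below shows that framing in isolation, including the case the slide does not cover, where a payload contains line breaks and each line needs its own `data:` prefix. The class `SseFrameSketch` is hypothetical and written in Java only to match the other examples.

```java
// Hypothetical helper that frames a payload as one server-sent event.
final class SseFrameSketch {

    /** Prefixes each payload line with "data: " and ends the event with a blank line. */
    static String frame(String payload) {
        StringBuilder sb = new StringBuilder();
        // limit -1 keeps trailing empty lines instead of silently dropping them
        for (String line : payload.split("\n", -1)) {
            sb.append("data: ").append(line).append('\n');
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // A JSON message like the ones published to the "pubsubCounters" channel
        System.out.print(frame("{\"name\":\"google.com\",\"count\":3}"));
    }
}
```

On the browser side, `EventSource` joins the `data:` lines of one event back together with newlines before handing the result to the `message` listener.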

Page 26: AWS Webcast - Amazon Kinesis and Apache Storm

Visualizing the events in Client

var source = new EventSource('/ticker');
source.addEventListener('message', tick);

function tick(e) {
    if (e) {
        var eventData = JSON.parse(e.data);
        window[eventData.name].push([{ time: eventData.time, y: eventData.count }]);

Client(D3)

Page 27: AWS Webcast - Amazon Kinesis and Apache Storm

Amazon EMR

Processing Streams with Hadoop

Page 28: AWS Webcast - Amazon Kinesis and Apache Storm

Amazon EMR?

Map-Reduce engine

Integrated with tools

Hadoop-as-a-service

Massively parallel

Cost effective AWS wrapper

Integrated with AWS services

Introduction to Amazon EMR

Page 29: AWS Webcast - Amazon Kinesis and Apache Storm

Master instance group

Core instance group

Task instance group

HDFS HDFS

Amazon S3

Amazon EMR - Architecture

Master instance

Controls the cluster

Core instance

Life of cluster

DataNode and TaskTracker daemons

Task instances

Added or subtracted to perform work (SPOT)

S3 as underlying ‘file system’

Page 30: AWS Webcast - Amazon Kinesis and Apache Storm

Analyzing Kinesis using Amazon EMR

Diagram labels:

Offline Analysis: Producer, Amazon Kinesis, Kinesis Application, S3, EMR

Ad-hoc Analysis: Amazon Kinesis, EMR (Hive, Pig, Spark, MapReduce)

Page 31: AWS Webcast - Amazon Kinesis and Apache Storm

Demo: Stream processing with Spark

Page 32: AWS Webcast - Amazon Kinesis and Apache Storm

Spark Streaming and Kinesis

Launch an EMR cluster with Spark

http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster

Spark Streaming

http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html

Spark Streaming Kinesis integration

http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

Page 33: AWS Webcast - Amazon Kinesis and Apache Storm

Kinesis Word Count Example

private object KinesisWordCountASL extends Logging {
  …
  val sparkConfig = new SparkConf().setAppName("KinesisWordCount")
  val ssc = new StreamingContext(sparkConfig, batchInterval)
  val unionStreams = ssc.union(kinesisStreams)

  /* Convert each line of Array[Byte] to String, split into words, and count them */
  val words = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))

  /* Map each word to a (word, 1) tuple so we can reduce/aggregate by key. */
  val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

Page 34: AWS Webcast - Amazon Kinesis and Apache Storm

Amazon Kinesis with Apache Storm:

http://d0.awsstatic.com/whitepapers/building-sliding-window-analysis-of-clickstream-data-kinesis.pdf

Amazon Kinesis with Amazon EMR

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-kinesis.html

Amazon Kinesis with Apache Spark

http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

Q & A

Page 35: AWS Webcast - Amazon Kinesis and Apache Storm

THANK YOU !!!

http://aws.amazon.com/big-data