AWS Webcast - Amazon Kinesis and Apache Storm
TRANSCRIPT
© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed
in whole or in part without the express consent of Amazon.com, Inc.
CLICKSTREAM ANALYTICS –
AMAZON KINESIS AND
APACHE STORM
Agenda
Clickstream Analytics
Data Ingestion
Amazon Kinesis
Data Processing
Apache Storm
Amazon EMR
Q & A
Clickstream Analytics in Real-time
Clickstream Analytics
From Wikipedia
“… clicks anywhere in the webpage or application, the action is logged on …”
“… useful for web activity analysis, software testing, market research …”
It’s all about People & Products !!!
Clickstream Analytics in Real-time
Ingestion: Files to Events
Processing: Batch to Continuous
Consumption: Reports to Alerts
Real-Time Analytics
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads
Continuous Processing FX
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Elastic
• Enable multiple apps to process in parallel
Continuous data flow
Low end-to-end latency
Continuous, real-time workloads
Data Ingestion
Global top-10
foo-analysis.com
Starting simple...
Global top-10
Elastic Beanstalk
foo-analysis.com
Distributing the workload…
Global top-10
Elastic Beanstalk
foo-analysis.com
Local top-10 (x3)
Or using an Elastic Data Broker…
Global top-10
Elastic Beanstalk
foo-analysis.com
Kinesis
Data Record
Stream
Shard
Partition Key
Worker
My top-10
Data Record
Sequence Number: 14, 17, 18, 21, 23
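The labels above (partition key, shard, sequence number) can be made concrete with a small sketch. Kinesis hashes each partition key with MD5 into a 128-bit hash key, and every shard owns a range of that key space. The code below assumes the shards split the range evenly, which is only a simplification: a stream that has been resharded can have uneven ranges. `ShardRouter` and its methods are hypothetical names for illustration, not part of the AWS SDK.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps a partition key to a shard index,
// assuming the shards divide the 128-bit hash key space evenly.
final class ShardRouter {
    // Hash keys span 0 .. 2^128 - 1.
    private static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128);

    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // MD5 of the UTF-8 partition key, read as an unsigned 128-bit integer.
    static BigInteger hashKey(String partitionKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // index = floor(hash * shardCount / 2^128), always in [0, shardCount).
    int shardFor(String partitionKey) {
        return hashKey(partitionKey)
                .multiply(BigInteger.valueOf(shardCount))
                .divide(HASH_SPACE)
                .intValue();
    }
}
```

Because the mapping depends only on the key, all records with the same partition key land in the same shard, which is what keeps their relative order intact.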
Amazon Kinesis – Managed Stream
Data Sources
AWS Endpoint
Availability Zones (three)
Shard 1, Shard 2, … Shard N
App. 1 [Data Archive]
App. 2 [Metric Extraction]
App. 3 [Sliding Window Analysis]
App. 4 [Machine Learning]
S3, DynamoDB, Redshift, EMR
Amazon Kinesis – Common Data Broker
Amazon Kinesis – Distributed Streams
From batch to continuous processing
Scale UP or DOWN without losing sequencing
Workers can replay records for up to 24 hours
Scale up to GB/sec without losing durability
Records stored across multiple availability zones
Multiple parallel Kinesis Apps
RDBMS, S3, Data Warehouse
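The replay property listed above (workers can re-read records for up to 24 hours) can be sketched as an in-memory log keyed by sequence number. This is an illustration only, not how the service is implemented: `ReplayableShard` is a hypothetical class, the clock is passed in explicitly to keep the example deterministic, and it hands out consecutive sequence numbers although real Kinesis sequence numbers only increase and are not consecutive.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of one shard with a retention window.
final class ReplayableShard {
    private static final class StoredRecord {
        final long arrivalMillis;
        final String data;

        StoredRecord(long arrivalMillis, String data) {
            this.arrivalMillis = arrivalMillis;
            this.data = data;
        }
    }

    // Sorted by sequence number, so reads always come back in order.
    private final TreeMap<Long, StoredRecord> log = new TreeMap<>();
    private final long retentionMillis;
    private long nextSequence = 1;

    ReplayableShard(Duration retention) {
        this.retentionMillis = retention.toMillis();
    }

    // Appends a record and returns its sequence number.
    long put(String data, long nowMillis) {
        long sequence = nextSequence++;
        log.put(sequence, new StoredRecord(nowMillis, data));
        return sequence;
    }

    // Returns every retained record after the given checkpoint.
    // A checkpoint of 0 replays everything still inside the retention window.
    List<String> readAfter(long checkpoint, long nowMillis) {
        log.values().removeIf(r -> nowMillis - r.arrivalMillis >= retentionMillis);
        List<String> out = new ArrayList<>();
        for (StoredRecord r : log.tailMap(checkpoint, false).values()) {
            out.add(r.data);
        }
        return out;
    }
}
```

Each application keeps its own checkpoint, so several applications can read the same shard in parallel, and a restarted worker resumes from its last checkpoint instead of losing data.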
Data Processing
Batch
Real Time
Clickstream – Real-time and Batch
Batch Analysis
DW
Hadoop
Notifications & Alerts
Dashboards/visualizations
APIs
Streaming Analytics
Clickstream
Deep Learning
Dashboards/visualizations
Spark
Storm
KCL
Data Archive
Processing Stream in real-time
Storm Concepts
Streams
Unbounded sequence of tuples
Spout
Source of a stream, e.g. reading from the Twitter streaming API
Bolts
Process input streams and produce new streams, e.g. functions, filters, aggregations, joins
Topologies
Network of spouts and bolts
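To make the four concepts above tangible, here is a toy model in plain Java. It deliberately does not use the Storm API: `MiniTopology` is a hypothetical class, the spout is just an `Iterator`, each bolt is a function from one tuple to zero or more tuples, and everything runs in a single thread with a simple linear chain of bolts.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Toy model only: one spout feeding a linear chain of bolts.
final class MiniTopology {
    private final Iterator<String> spout;
    private final List<Function<String, List<String>>> bolts = new ArrayList<>();

    MiniTopology(Iterator<String> spout) {
        this.spout = spout;
    }

    // A bolt consumes one tuple and emits zero or more new tuples.
    MiniTopology addBolt(Function<String, List<String>> bolt) {
        bolts.add(bolt);
        return this;
    }

    // Pulls tuples from the spout one at a time and pushes each through every bolt.
    List<String> run() {
        List<String> sink = new ArrayList<>();
        while (spout.hasNext()) {
            emit(spout.next(), 0, sink);
        }
        return sink;
    }

    private void emit(String tuple, int boltIndex, List<String> sink) {
        if (boltIndex == bolts.size()) {
            sink.add(tuple); // past the last bolt: collect the result
            return;
        }
        for (String out : bolts.get(boltIndex).apply(tuple)) {
            emit(out, boltIndex + 1, sink);
        }
    }
}
```

A bolt that returns an empty list acts as a filter, and one that returns several tuples acts as a splitter. A real topology differs in that the stream is unbounded and spouts and bolts run in parallel across the cluster.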
Storm Architecture
Master Node: Nimbus
Cluster Coordination: Zookeeper (three nodes)
Worker Processes: Supervisor (four nodes)
Supervisor: Launches Workers
Apache Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
Integration with queuing system
Higher level abstractions
Demo: Real time stream processing
Real-time: Event-based processing
Producer
Amazon Kinesis
KinesisStormSpout
Apache Storm
ElastiCache (Redis)
Node.js
Client (D3)
http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache
Creating a Storm Topology
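KinesisSpoutConfig config = new KinesisSpoutConfig(streamName, zookeeperEndpoint)
    .withZookeeperPrefix(zookeeperPrefix)
    .withInitialPositionInStream(initialPositionInStream)
    .withRegion(Regions.fromName(regionName));
…
// Spout reads from the Kinesis stream with a parallelism hint of 2.
builder.setSpout("Kinesis", spout, 2);
// Shuffle grouping spreads tuples randomly across the Parse bolt's tasks.
builder.setBolt("Parse", new ParseReferrerBolt(), 6).shuffleGrouping("Kinesis");
// Fields grouping sends tuples with the same "referrer" to the same Count task.
builder.setBolt("Count", new RollingCountBolt(5, 2, elasticCacheRedisEndpoint), 6)
    .fieldsGrouping("Parse", new Fields("referrer"));
…
StormSubmitter.submitTopology(topologyName, topoConf, builder.createTopology());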
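The topology uses `shuffleGrouping` for parsing and `fieldsGrouping` on "referrer" for counting. The reason for the latter can be sketched in plain Java: if the target task is picked from a hash of the field value, every tuple for a given referrer reaches the same task, so each task's local count for that referrer is complete. `FieldsGroupingSketch` is a hypothetical illustration using `String.hashCode()`, and the exact hashing Storm applies internally is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration of hash-based routing; not Storm's own implementation.
final class FieldsGroupingSketch {
    // Same field value -> same task index, always in [0, numTasks).
    static int taskFor(String referrer, int numTasks) {
        return Math.floorMod(referrer.hashCode(), numTasks);
    }

    // Routes each referrer to its task and keeps a separate count map per task.
    static List<Map<String, Integer>> countPerTask(List<String> referrers, int numTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new HashMap<>());
        }
        for (String referrer : referrers) {
            tasks.get(taskFor(referrer, numTasks)).merge(referrer, 1, Integer::sum);
        }
        return tasks;
    }
}
```

With a shuffle grouping the same referrer could be counted on several tasks, and the partial counts would then need a further merge step.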
KinesisStormSpout
Sliding window using Tick Tuple
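…
public void execute(Tuple tuple) {
    if (TupleHelpers.isTickTuple(tuple)) {
        // Tick tuples arrive at a fixed interval and trigger emission of the window counts.
        LOG.debug("Received tick tuple, triggering emit of current window counts");
        emitCurrentWindowCounts();
    } else {
        // Regular tuples are counted and acknowledged.
        countObjAndAck(tuple);
    }
}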
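What happens behind `countObjAndAck` and `emitCurrentWindowCounts` is not shown on the slide. One common way to build a tick-driven sliding window is a ring of slots per key: regular tuples increment the current slot, and each tick reports the sum over all slots and then recycles the oldest one. The sketch below assumes that design. `SlidingWindowCounter` here is a hypothetical stand-alone class and is not claimed to match the demo's `RollingCountBolt`.

```java
import java.util.HashMap;
import java.util.Map;

// Slot-based sliding window: the window spans numSlots tick intervals.
final class SlidingWindowCounter {
    private final Map<String, long[]> slotsByKey = new HashMap<>();
    private final int numSlots;
    private int head = 0;

    SlidingWindowCounter(int numSlots) {
        if (numSlots < 2) {
            throw new IllegalArgumentException("need at least 2 slots");
        }
        this.numSlots = numSlots;
    }

    // Called for every regular tuple: count it in the current slot.
    void count(String key) {
        slotsByKey.computeIfAbsent(key, k -> new long[numSlots])[head]++;
    }

    // Called for every tick tuple: report window totals, then slide the window.
    Map<String, Long> onTick() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : slotsByKey.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            totals.put(e.getKey(), sum);
        }
        // Advance to the oldest slot and clear it for the next interval.
        head = (head + 1) % numSlots;
        for (long[] slots : slotsByKey.values()) {
            slots[head] = 0;
        }
        // Forget keys that no longer have any counts inside the window.
        slotsByKey.values().removeIf(SlidingWindowCounter::isAllZero);
        return totals;
    }

    private static boolean isAllZero(long[] slots) {
        for (long c : slots) {
            if (c != 0) {
                return false;
            }
        }
        return true;
    }
}
```

With two slots, a key counted twice before the first tick and once before the second reports 2, then 3, then 1, and disappears on the fourth tick once both of its slots have been recycled.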
Using Redis as an Event relay
for (Entry<Object, Long> entry : counts.entrySet()) {
    …
    msg.put("name", referrer);
    msg.put("time", currentEPOCH);
    msg.put("count", count);
    …
    // Publish the message on the Redis channel that the Node.js server subscribes to.
    jedis.publish("pubsubCounters", msg.toString());
}
ElastiCache(Redis)
Node.js – Pub/Sub to Server-Sent Events
function ticker(req, res) {
  …
  subscriber.subscribe("pubsubCounters");
  subscriber.on("message", function(channel, message) {
    res.json(message);
    …
    res.json = function(obj) { res.write("data: " + obj + "\n\n"); }
}
connect(){
  ...
  if (req.url == '/eventCounters') { ticker(req, res); }
Node.js
Visualizing the events in Client
var source = new EventSource('/ticker');
source.addEventListener('message', tick);
function tick(e) {
  if (e) {
    var eventData = JSON.parse(e.data);
    window[eventData.name].push([{ time: eventData.time, y: eventData.count }]);
Client(D3)
Amazon EMR
Processing Streams with Hadoop
Amazon EMR?
Map-Reduce engine
Integrated with tools
Hadoop-as-a-service
Massively parallel
Cost effective AWS wrapper
Integrated with AWS services
Introduction to Amazon EMR
Master instance group
Core instance group
Task instance group
HDFS
Amazon S3
Amazon EMR - Architecture
Master instance
Controls the cluster
Core instance
Runs for the life of the cluster
DataNode and TaskTracker daemons
Task instances
Added or removed to perform work (e.g. Spot Instances)
S3 as underlying ‘file system’
Offline Analysis
Ad-hoc Analysis
Analyzing Kinesis using Amazon EMR
Producer, Amazon Kinesis, Kinesis Application, S3, EMR
Amazon Kinesis, EMR: Hive, Pig, Spark, MapReduce
Demo: Stream processing with Spark
Spark Streaming and Kinesis
Launch an EMR cluster with Spark
http://blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/Installing-Apache-Spark-on-an-Amazon-EMR-Cluster
Spark Streaming
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html
Spark Streaming Kinesis integration
http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html
Kinesis Word Count Example
private object KinesisWordCountASL extends Logging {
  …
  val sparkConfig = new SparkConf().setAppName("KinesisWordCount")
  val ssc = new StreamingContext(sparkConfig, batchInterval)
  val unionStreams = ssc.union(kinesisStreams)
  /* Convert each line of Array[Byte] to String, split into words, and count them */
  val words = unionStreams.flatMap(byteArray => new String(byteArray).split(" "))
  /* Map each word to a (word, 1) tuple so we can reduce/aggregate by key. */
  val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
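The flatMap / map / reduceByKey chain above is applied by Spark Streaming to each batch of records. As a rough single-machine analogue, the same three steps can be written with Java streams over one batch. `BatchWordCount` is a hypothetical class for illustration and involves no Spark or Kinesis dependency. Decoding as UTF-8 and dropping empty words are assumptions added here, not part of the original example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Single-batch analogue of the word count; no Spark involved.
final class BatchWordCount {
    static Map<String, Integer> count(List<byte[]> records) {
        return records.stream()
                // flatMap: decode each record and split it into words
                .flatMap(bytes -> Arrays.stream(
                        new String(bytes, StandardCharsets.UTF_8).split(" ")))
                // assumption: skip empty strings produced by repeated spaces
                .filter(word -> !word.isEmpty())
                // map to (word, 1) and reduce by key with addition
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }
}
```

In the streaming version this computation repeats for every `batchInterval`, so each result covers only the records that arrived during that interval.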
Amazon Kinesis with Apache Storm:
http://d0.awsstatic.com/whitepapers/building-sliding-window-analysis-of-clickstream-data-kinesis.pdf
Amazon Kinesis with Amazon EMR:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-kinesis.html
Amazon Kinesis with Apache Spark:
http://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html
Q & A
THANK YOU !!!
http://aws.amazon.com/big-data