realtime distributed analysis of datastreams
DESCRIPTION
A talk by Philipp Nolte from the advanced seminar "Personalisierung mit großen Daten" (Personalization with Big Data).
Realtime Distributed Analysis of Datastreams
Philipp Nolte – University of Passau – January 2014
1
Learn
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.
2
Limits
Imagine a traditional web analytics software:
Every page view increments the URL's database row.
3
First Aid
Queue your writes and write in batches.
Shard your data: Partition horizontally.
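
The two first-aid measures can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the class `BatchingWriter`, the helper `shard_for`, the shard count, and the use of plain dicts as stand-ins for database shards are all assumptions.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical number of database shards


def shard_for(url: str) -> int:
    """Pick a shard by hashing the URL (horizontal partitioning)."""
    return zlib.crc32(url.encode("utf-8")) % NUM_SHARDS


class BatchingWriter:
    """Buffers page views in memory and writes them out in batches."""

    def __init__(self, shards, batch_size=1000):
        self.shards = shards          # list of dicts standing in for database shards
        self.batch_size = batch_size
        self.pending = Counter()      # url -> views not yet written
        self.buffered = 0

    def record_view(self, url):
        self.pending[url] += 1
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        # One increment per URL per batch instead of one per page view.
        for url, count in self.pending.items():
            shard = self.shards[shard_for(url)]
            shard[url] = shard.get(url, 0) + count
        self.pending.clear()
        self.buffered = 0
```

Views that are still buffered are lost if the process crashes before `flush()`, which is one reason fault-tolerance stays hard with this approach.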
4
Chronic Issues
Fault-tolerance is hard.
Applications become more and more complex.
You have to do all the work.
5
New Tools
Large scale computation systems such as Hadoop.
Scalable databases such as Cassandra and Riak.
Easy-to-use frameworks such as Storm and Dempsy.
6
Lambda Architecture
Speed Layer
Serving Layer
Batch Layer
Theoretical, abstract architecture for working with big data.
7
Goal
Compute arbitrary functions on arbitrary data.
query = function ( all data )
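
As a tiny illustration of this idea, a query can be written as a pure function that receives the complete dataset. The event format and the function name `count_page_views` are made up for this sketch.

```python
def count_page_views(all_data, url):
    """A query expressed as a pure function over the complete dataset."""
    return sum(1 for event in all_data if event["url"] == url)


# Hypothetical page-view events.
all_data = [
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "carol"},
]

print(count_page_views(all_data, "/home"))
```

Scanning all data on every query is too slow at scale, which is what the layers of the lambda architecture are meant to work around.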
8
Properties
Robust and fault-tolerant.
Low latency reads and updates.
Scalable.
Minimal maintenance.
9
Batch Layer
Stores the immutable master dataset.
Precomputes arbitrary batch views.
Home of batch processing and mapreduce systems such as Hadoop.
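
A batch view can be pictured as the result of a map and a reduce step over the whole master dataset. The sketch below imitates that in plain Python and does not use Hadoop; the names `map_phase`, `reduce_phase`, and `build_batch_view` are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per page view."""
    yield record["url"], 1


def reduce_phase(key, values):
    """Combine all values that were emitted for one key."""
    return key, sum(values)


def build_batch_view(master_dataset):
    """Recompute the view from scratch over the complete dataset."""
    grouped = defaultdict(list)
    for record in master_dataset:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# The master dataset is only ever appended to, so a tuple stands in for it here.
master_dataset = (
    {"url": "/home"},
    {"url": "/about"},
    {"url": "/home"},
)
batch_view = build_batch_view(master_dataset)
```

Because the view is always rebuilt from the raw data, a bug in the view logic can be fixed by correcting the code and recomputing.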
10
Serving Layer
Read-only random-access to batch views.
Updated by batch layer.
Indexes batch views.
Home of real-time query systems such as Cloudera Impala for Hadoop.
11
Speed Layer
Compensates for high-latency batch views.
Fast, incremental algorithms.
More complex because it requires random writes.
Home of Apache HBase or Storm.
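
An incremental algorithm touches only the state affected by a new event instead of recomputing everything. A minimal sketch, assuming a page-view counter; the class `RealtimeView` is hypothetical and unrelated to the HBase or Storm APIs.

```python
class RealtimeView:
    """Keeps page-view counts current by updating them per incoming event."""

    def __init__(self):
        self.counts = {}  # url -> views seen since the last batch run

    def update(self, event):
        # A random write: only the entry for this URL is modified in place.
        url = event["url"]
        self.counts[url] = self.counts.get(url, 0) + 1


view = RealtimeView()
for event in [{"url": "/home"}, {"url": "/about"}, {"url": "/home"}]:
    view.update(event)
```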
12
Lambda Architecture
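[Diagram: Data flows into the Batch Layer and the Speed Layer. The Batch Layer produces Batch Views, exposed through the Serving Layer; the Speed Layer produces Realtime Views. A Query reads both.]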
13
Available Data
[Diagram: Batch View and Realtime View coverage of the available data, plotted over Time]
Discard the realtime view as soon as its data is represented in the batch view.
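
One way to picture how a query combines both views, and when realtime data can be dropped, is sketched below. The timestamped event tuples, the `batch_horizon` parameter, and both function names are assumptions made for this example.

```python
def query_page_views(url, batch_view, realtime_events, batch_horizon):
    """Answer a query from the batch view plus the events it does not cover yet.

    batch_view covers all events with timestamp < batch_horizon.
    realtime_events is a list of (timestamp, url) tuples.
    """
    recent = sum(
        1 for ts, event_url in realtime_events
        if event_url == url and ts >= batch_horizon
    )
    return batch_view.get(url, 0) + recent


def discard_absorbed(realtime_events, batch_horizon):
    """Drop realtime events that the latest batch view already represents."""
    return [(ts, url) for ts, url in realtime_events if ts >= batch_horizon]


# Hypothetical state: the batch run covered timestamps 1 and 2.
realtime_events = [(1, "/a"), (2, "/a"), (3, "/b"), (4, "/a")]
batch_view = {"/a": 2}
batch_horizon = 3

total_a = query_page_views("/a", batch_view, realtime_events, batch_horizon)
realtime_events = discard_absorbed(realtime_events, batch_horizon)
```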
14
Twitter's Early Days
[Diagram: a pipeline of Queues and Workers; labels: Tweets, Map URLs, Hadoop, Cassandra]
15
Storm
Guaranteed message processing without message brokers.
Horizontal scalability.
Fault-tolerance.
High level of abstraction.
Just works.
16
Storm Topologies
[Diagram: two Spouts emit Streams into a graph of four ⚡️Bolts]
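
The spout and bolt roles can be imitated with plain Python generators. This is a toy model for intuition only and does not use Storm's actual API; the names `sentence_spout`, `split_bolt`, and `count_bolt` are invented for the sketch.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of a stream (here, a fixed list of sentences)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence


def split_bolt(stream):
    """Bolt: consumes a stream of sentences and emits a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count) updates."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wiring spouts and bolts together forms the topology.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

A real topology runs continuously on an unbounded stream; the finite list here only keeps the example short.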
17
Parallel Tasks
[Diagram: the same topology of Spouts, ⚡️Bolts, and Streams; each Spout and Bolt runs as several parallel Tasks (T)]
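
Running a bolt as several parallel tasks raises the question of which task receives which tuple. The sketch below assumes hash partitioning on the word, similar in spirit to what Storm calls a fields grouping; it runs sequentially in plain Python, and `CountTask` and `route` are hypothetical names.

```python
import zlib
from collections import Counter


class CountTask:
    """One task of a counting bolt; it owns the counts for its share of keys."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1


def route(word, num_tasks):
    """Send equal words to the same task so each count lives in one place."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


tasks = [CountTask() for _ in range(3)]
for word in "a b a c b a".split():
    tasks[route(word, len(tasks))].execute(word)

# Merging the per-task counts gives the overall result.
total = sum((task.counts for task in tasks), Counter())
```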
18
Demo
Storm in action
19
Know
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.
20
The End.
Questions?
21