Short introduction to Storm
Presentation given in class for Cloud Computing at Universitat Politècnica de Catalunya
STORM
DISTRIBUTED AND FAULT-TOLERANT REALTIME COMPUTATION
Jimmy Zöger
CLC < FIB < UPC
2013-06-03
INTRODUCTION
• Like Hadoop for realtime processing instead of batch
• Open source
• Developed by BackType, which was later acquired by Twitter
• Developed for analyzing Twitter data
• Similar to S4
STORM TOPOLOGY
SPOUTS
• The component responsible for feeding messages into the topology
• Emits tuples
• Can be reliable or unreliable (ack() and fail())
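The reliable case can be sketched in a few lines of plain Python. This is a toy stand-in, not the Storm API: the class name ToyReliableSpout and its internals are hypothetical, and only the method names loosely mirror the ack() and fail() callbacks mentioned above. The idea shown is that a reliable spout remembers each emitted tuple until it is acknowledged, and replays it if processing fails.

```python
from collections import deque
import itertools


class ToyReliableSpout:
    """Toy stand-in for a reliable spout (hypothetical, not the Storm API)."""

    def __init__(self, messages):
        self._queue = deque(messages)   # messages waiting to be emitted
        self._pending = {}              # msg_id -> message awaiting ack/fail
        self._ids = itertools.count()   # source of unique message ids

    def next_tuple(self):
        """Emit the next message as (msg_id, values), or None when idle."""
        if not self._queue:
            return None
        msg_id = next(self._ids)
        message = self._queue.popleft()
        self._pending[msg_id] = message
        return msg_id, (message,)

    def ack(self, msg_id):
        """The tuple was fully processed, so it can be forgotten."""
        self._pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Processing failed, so the message is queued again for replay."""
        message = self._pending.pop(msg_id, None)
        if message is not None:
            self._queue.append(message)


spout = ToyReliableSpout(["a", "b"])
first_id, first_values = spout.next_tuple()
spout.fail(first_id)                      # "a" goes back behind "b"
second_id, second_values = spout.next_tuple()
spout.ack(second_id)                      # "b" is done
replayed_id, replayed_values = spout.next_tuple()
```

An unreliable spout would simply skip the pending map: it emits and forgets, so a failed tuple is lost rather than replayed.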
INTEGRATION
• Kestrel
• RabbitMQ
• Kafka
• JMS
• Integration is easy with the simple Spout abstraction
BOLTS
• A component that takes tuples as input and produces tuples as output
• Can do filtering, joining, functions, aggregations etc.
• Does not have to process a tuple immediately and may hold onto tuples to process later
• Comparison with Hadoop: A bolt can be a mapper or a reducer (or anything)
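As a rough illustration of the points above, bolts can be modelled as Python generators that consume tuples and yield tuples. The names filter_bolt and batch_sum_bolt are hypothetical and nothing here is Storm code. The first bolt filters; the second holds onto tuples and only emits once it has a full batch.

```python
def filter_bolt(tuples, predicate):
    """Filtering bolt: passes through only the tuples matching predicate."""
    for tup in tuples:
        if predicate(tup):
            yield tup


def batch_sum_bolt(tuples, batch_size):
    """Aggregating bolt: holds tuples and emits one (sum,) tuple per batch."""
    held = []
    for tup in tuples:
        held.append(tup)
        if len(held) == batch_size:
            yield (sum(value for (value,) in held),)
            held.clear()
    # The toy input is finite, so a partial batch is flushed at the end.
    # A real stream is unbounded and would need a timer or similar instead.
    if held:
        yield (sum(value for (value,) in held),)


stream = [(n,) for n in range(1, 11)]                    # (1,), ..., (10,)
evens = filter_bolt(stream, lambda tup: tup[0] % 2 == 0)
batch_sums = list(batch_sum_bolt(evens, batch_size=2))
```

In Hadoop terms, filter_bolt plays a mapper-like role and batch_sum_bolt a reducer-like one, but both are the same kind of component here.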
STORM TOPOLOGY
• Spouts, bolts and streams
• Distributed
• Runs indefinitely until it is stopped
• Arbitrary complexity
• Streams requiring multiple steps also require multiple bolts
• No intermediate queues for streams
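A word count makes the spout, bolt and stream wiring concrete. The sketch below is a single-process toy with hypothetical names (sentence_spout, split_bolt, count_bolt, run_topology); a real topology is a distributed graph that runs until stopped, while this one is a linear chain over a finite input. Each step hands tuples directly to the next through generators, with no queue in between.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: feeds one-field tuples into the topology."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """First step: one sentence tuple in, one tuple per word out."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Second step: emits the running (word, count) after every word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(spout, bolts):
    """Chain the spout through each bolt in order and drain the result."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


emitted = run_topology(
    sentence_spout(["the cow jumped", "the moon"]),
    [split_bolt, count_bolt],
)
final_counts = dict(emitted)   # later counts overwrite earlier ones
```

Splitting and counting are two steps, so the stream passes through two bolts, which matches the multiple-steps bullet above.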
FAULT-TOLERANCE
• Nimbus daemon and Supervisor daemons are fail-fast and stateless
• Each worker sends heartbeats to Nimbus
• Transactional topologies → Guaranteed processing
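Heartbeat-based failure detection, in general, works by recording when each worker was last heard from and treating silence beyond a timeout as failure. The sketch below shows only that general idea: the class ToyHeartbeatMonitor, the 30-second timeout and the worker ids are all assumptions for illustration, and the slides do not describe how Storm implements this internally.

```python
class ToyHeartbeatMonitor:
    """Toy heartbeat tracker (hypothetical, not Storm's implementation)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}   # worker_id -> time of the latest heartbeat

    def heartbeat(self, worker_id, now):
        """Record that worker_id reported in at time now (in seconds)."""
        self._last_seen[worker_id] = now

    def dead_workers(self, now):
        """Return workers whose last heartbeat is older than the timeout."""
        return sorted(
            worker_id
            for worker_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = ToyHeartbeatMonitor(timeout_seconds=30)   # timeout is an assumption
monitor.heartbeat("worker-1", now=0)
monitor.heartbeat("worker-2", now=0)
monitor.heartbeat("worker-1", now=25)               # worker-2 stays silent
stale = monitor.dead_workers(now=40)
```

Timestamps are passed in explicitly so the example is deterministic; a coordinator would then reassign the work of any worker reported as stale.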
[Diagram: Nimbus, ZooKeeper nodes and Supervisor nodes]
USE CASES
• Counting words!
• Realtime analytics - trending topics on Twitter
• Online machine learning
• Continuous computation
• Distributed RPC
• Extract, Transform and Load (ETL)
FAST
One benchmark clocked it at over a million tuples processed per second per node