Apache Storm Tutorial
Team
Riccardo Di Stefano
Roberto Gaudenzi
Davide Mazza
Lorenzo Rutigliano
Sara Veterini
Federico Croce
Apache Storm - Sapienza University of Rome - Data Mining Class - A.Y. 2016-2017
https://it.linkedin.com/in/lorenzo-rutigliano-00a007135/it
https://it.linkedin.com/in/sara-veterini-667684116
https://it.linkedin.com/in/roberto-gaudenzi-4b0422116
https://it.linkedin.com/in/federico-croce-921a19134/it
https://it.linkedin.com/in/riccardo-di-stefano-439a11134
https://it.linkedin.com/in/davide-mazza-33a9b291
Contacts and Links
https://github.com/davidemazza/ApacheStorm
http://www.slideshare.net/DavideMazza6/apache-storm-tutorial
Introduction
Apache Storm is a free and open-source distributed fault-tolerant realtime computation system that makes it easy to process unbounded streams of data.
> Use-cases: financial applications, network monitoring, social network analysis, online machine learning, etc.
> Different from traditional batch systems (store, then process).
Stream
Unbounded Sequence of Tuples
Tuple: the core unit of data; a named list of values
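Conceptually, a tuple maps declared field names to values. A minimal plain-Java sketch of the idea (this is an illustration only, not Storm's actual Tuple interface):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a tuple: a named list of values, addressable by field name.
// Storm's real Tuple interface is richer; this only models the concept.
class SimpleTuple {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    SimpleTuple(List<String> names, List<?> values) {
        for (int i = 0; i < names.size(); i++) {
            fields.put(names.get(i), values.get(i));
        }
    }

    Object getValueByField(String name) {
        return fields.get(name);
    }
}
```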
Topologies
An application is defined in Storm through a Topology that describes its logic as a DAG of operators and streams.
Spouts: the sources of data streams. They usually read data from external sources (e.g. the Twitter API) or from disk and emit it into the topology.
Bolts: process input streams and (possibly) produce output streams. They represent the application logic.
Architecture
There are two kinds of nodes in a Storm cluster:
➢ The Master node runs a daemon called “Nimbus” to which topologies are submitted. It is responsible for scheduling, job orchestration, and monitoring for failures.
➢ Each Worker (slave) node runs a daemon called “Supervisor”, which can run one or more worker processes in which applications are executed.
The coordination between these two entities is done through Zookeeper, which is mainly used to maintain state, because Nimbus and the Supervisors are stateless.
Architecture
Three entities are involved in running a topology:
➢ Worker process: one or more per cluster; each one belongs to exactly one topology (a design choice made for fault-tolerance and isolation).
➢ Executor: a thread within a worker process. It runs one or more tasks of the same component (spout or bolt).
➢ Task: a replica of a component.
Therefore, workers provide inter-topology parallelism, executors intra-topology parallelism, and tasks intra-component parallelism.
[Diagram: a worker process containing executors, each running one or more tasks]
Example
We will show how to compute the average of the grades using a simple Storm topology.
We will use:
➢ one spout;
➢ two bolts that work in parallel;
➢ another bolt in which the previous two converge.
Spout
This represents the spout.
Its job is to read a stream of numbers.
Our stream represents the grades, so the values are between 18 and 30.
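Leaving the Storm API aside, the spout's core job (emitting random grades in the 18–30 range) can be sketched in plain Java; the class name is illustrative:

```java
import java.util.Random;

// Sketch of the grade source: emits random integer grades in [18, 30],
// mimicking what the spout's nextTuple() would produce on each call.
class GradeSource {
    private final Random rand = new Random();

    int nextGrade() {
        return 18 + rand.nextInt(13); // 13 possible values: 18..30
    }
}
```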
Bolt
This represents the bolt.
We can distinguish three different bolts in our example:
1. SummationBolt: computes the sum of the numbers;
2. CounterBolt: counts the numbers;
3. AverageBolt: computes the average.
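Stripped of the Storm API, the logic of the three bolts is just a running sum, a running count, and a final division. A plain-Java sketch:

```java
// Sketch of the three bolts' logic, without the Storm API:
// SummationBolt keeps a running sum, CounterBolt a running count,
// and AverageBolt divides the two results that converge on it.
class AverageLogic {
    private long sum = 0;   // SummationBolt state
    private long count = 0; // CounterBolt state

    void accept(int grade) { // one incoming tuple
        sum += grade;
        count += 1;
    }

    double average() { // AverageBolt
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```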
Topology
Stream Output
Trident
➢ A high-level abstraction on top of Storm
➢ Spouts and bolts are auto-generated by Trident before execution
➢ Trident has functions, filters, joins, grouping, and aggregation
➢ Processes streams as a series of batches
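Unlike core Storm's tuple-at-a-time model, Trident cuts the stream into small batches and processes each batch as a unit. The batching idea, sketched in plain Java (illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Trident's batching model: a stream of values is cut into
// fixed-size batches, and each batch is then processed as a unit.
class Batcher {
    static List<List<Integer>> toBatches(List<Integer> stream, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < stream.size(); i += batchSize) {
            batches.add(stream.subList(i, Math.min(i + batchSize, stream.size())));
        }
        return batches;
    }
}
```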
Topology
➢ Receives the input stream from a spout
➢ Applies an ordered sequence of operations (filter, aggregation, grouping, etc.) to the stream
Tuples & Spout
➢ TridentTuple is a named list of values.
➢ TridentTuple interface is the data model of a Trident topology
➢ TridentSpout is similar to the Storm spout, with additional options
➢ Trident provides many sample spout implementations
Operations
➢ Filter
➢ Function
➢ Aggregation
➢ Grouping
➢ Merging and Joining
Operations: Filter
➢ An object used to perform input validation
➢ Gets a subset of trident tuple fields as input
➢ Returns either true or false
➢ True → the tuple is kept in the output stream
➢ False → the tuple is removed from the stream
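The keep/drop semantics above can be sketched in plain Java (illustrative names; not Trident's actual Filter class):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of filter semantics: true keeps the tuple, false drops it.
class GradeFilter {
    // keep only passing grades (>= 18)
    static boolean isKeep(int grade) {
        return grade >= 18;
    }

    static List<Integer> apply(List<Integer> stream) {
        return stream.stream().filter(GradeFilter::isKeep).collect(Collectors.toList());
    }
}
```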
Operations: Function
➢ An object used to perform a simple operation on a single trident tuple
➢ Takes a subset of trident tuple fields
➢ Emits zero or more new trident tuple fields
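The "zero or more" behavior is what distinguishes a function from a filter; a plain-Java sketch (illustrative class, not Trident's Function interface) that splits a sentence into words and emits nothing for blank input:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of function semantics: for each input tuple the function may
// emit zero or more new values (here: the words of a sentence).
class SplitFunction {
    static List<String> execute(String sentence) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(w); // blank input emits nothing
        }
        return out;
    }
}
```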
Operations: Aggregation
An object used to perform aggregation operations on an input batch, partition, or stream.
➢ Aggregate → aggregates each batch of trident tuples in isolation
➢ PartitionAggregate → aggregates each partition instead of the entire batch
➢ PersistentAggregate → aggregates over all trident tuples across all batches
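The difference between per-batch and persistent aggregation can be sketched in plain Java (illustrative names, not Trident's aggregator interfaces):

```java
import java.util.List;

// Sketch of the two aggregation scopes: per-batch (each batch reduced
// in isolation) vs persistent (a running total across all batches).
class AggregationSketch {
    private long persistentSum = 0;

    // Aggregate: each batch is reduced on its own
    long aggregateBatch(List<Integer> batch) {
        long sum = 0;
        for (int v : batch) sum += v;
        return sum;
    }

    // PersistentAggregate: the result accumulates across batches
    long persistentAggregate(List<Integer> batch) {
        persistentSum += aggregateBatch(batch);
        return persistentSum;
    }
}
```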
Operations: Grouping
➢ A built-in operation, invoked via the groupBy method
➢ Repartitions the stream by doing a partitionBy on the specified fields
➢ Groups tuples together whose group fields are equal
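The grouping semantics can be sketched in plain Java: tuples whose group field is equal end up in the same group (the course/grade fields here are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of groupBy: tuples of the form {course, grade} whose group
// field (the course) is equal are collected together.
class GroupBySketch {
    static Map<String, List<Integer>> groupByCourse(List<Object[]> tuples) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (Object[] t : tuples) {
            String course = (String) t[0];
            Integer grade = (Integer) t[1];
            groups.computeIfAbsent(course, k -> new ArrayList<>()).add(grade);
        }
        return groups;
    }
}
```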
Operations: Merging and Joining
➢ Merging combines two or more streams into one
➢ Joining uses a trident tuple field from both sides to match and join two streams
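The two combination operations, sketched in plain Java (illustrative; merge concatenates streams, join matches tuples on a shared key field):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of stream combination: merge concatenates streams,
// join pairs up values from both sides that share the same key.
class CombineSketch {
    static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // inner join of two (key -> value) streams on the key
    static Map<String, String> join(Map<String, String> left, Map<String, String> right) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            if (right.containsKey(e.getKey())) {
                out.put(e.getKey(), e.getValue() + "|" + right.get(e.getKey()));
            }
        }
        return out;
    }
}
```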
State Maintenance
➢ State information can be stored in the topology itself
➢ If any tuple fails during processing, the failed tuple is retried
➢ If the tuple failed before updating the state → retrying it keeps the state stable
➢ If the tuple failed after updating the state → retrying it makes the state unstable
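Trident avoids the unstable case by tagging each batch with a transaction id and applying each id to the state at most once, so a replay after a failed update cannot double-count. A sketch of the idea (illustrative, not Trident's State API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of exactly-once state updates: each batch carries a
// transaction id; a replayed batch with an already-applied id is
// skipped, so retries leave the state stable.
class TransactionalState {
    private long count = 0;
    private final Map<Long, Boolean> applied = new HashMap<>();

    void applyBatch(long txId, int batchCount) {
        if (applied.containsKey(txId)) return; // replay: skip
        count += batchCount;
        applied.put(txId, true);
    }

    long count() { return count; }
}
```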
When to use Trident?
Trident is useful for those use-cases that require exactly-once processing, which is difficult to achieve with plain Storm.
Trident Demo: Twitter Languages
Which are the most used languages on Twitter?
The code is built on top of Trident and gets a stream of tweets using the Twitter4J library.
For each tweet, the language is extracted.
A hashmap of counters is maintained and periodically published in a tweet by the code itself.
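The counting step of the demo boils down to a hashmap of per-language counters; a plain-Java sketch (illustrative class name, not the demo's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the demo's counting step: one counter per language code,
// incremented for every tweet whose language was extracted.
class LanguageCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    void record(String langCode) {
        counts.merge(langCode, 1, Integer::sum);
    }

    int count(String langCode) {
        return counts.getOrDefault(langCode, 0);
    }
}
```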
Trident example setup
To set up your Twitter application:
● go to https://apps.twitter.com/ and create a new app
● fill the form, leaving the callback URL empty
● after creating the app, go to Keys and Access Tokens
● copy the consumer key and consumer secret
● select Create my access tokens if no tokens are present, then copy the access token and access token secret
● open the project TwitterTridentExample in Eclipse, open the file twitter4j.properties in the project, and copy in your info
Now you are ready!
Homework
https://github.com/davidemazza/ApacheStorm
Folder “Homework”