Apache Storm: Hands-on Session
Academic Year 2020/21
Fabiana Rossi
Master's Degree (Laurea Magistrale) in Computer Engineering - 2nd year
Macroarea di Ingegneria
Department of Civil Engineering and Computer Science Engineering
The reference Big Data stack
Fabiana Rossi - SABD 2020/21 2
[Figure: the reference Big Data stack, with layers Resource Management, Data Storage, Data Processing, and High-level Interfaces, plus a cross-cutting Support / Integration component]
Apache Storm
• Apache Storm
• Open-source, real-time, scalable streaming system
• Provides an abstraction layer to execute DSP applications
• Initially developed by Twitter
• Topology
• DAG of spouts (sources of streams) and bolts (operators and data sinks)
• Stream: unbounded sequence of tuples (named lists of values)
Stream grouping in Storm
• Data parallelism in Storm: how are streams
partitioned among multiple tasks (threads of
execution)?
• Shuffle grouping
• Randomly partitions the tuples
• Fields grouping
• Hashes on a subset of the tuple attributes, so tuples with equal values for those attributes reach the same task
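The two groupings above can be pictured with a minimal plain-Java sketch. It is not Storm's internal implementation: the class GroupingSketch and its methods are hypothetical names, and the hash-modulo rule is an assumed simplification of how a fields grouping maps equal keys to the same task.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not part of Storm): models how a tuple is routed
// to one of numTasks consumer tasks under two grouping strategies.
class GroupingSketch {

    // Shuffle grouping: pick a consumer task at random.
    static int shuffleTask(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    // Fields grouping (assumed simplification): hash the selected attribute
    // values, so equal values always map to the same task index.
    static int fieldsTask(List<?> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (String word : new String[] {"storm", "bolt", "storm"}) {
            System.out.printf("%s -> shuffle: task %d, fields: task %d%n",
                    word, shuffleTask(rng, 4), fieldsTask(Arrays.asList(word), 4));
        }
    }
}
```

In the output, the two occurrences of "storm" may land on different tasks under shuffle grouping, but always on the same task under fields grouping.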
Stream grouping in Storm
• All grouping (i.e., broadcast)
• Replicates the entire stream to all the consumer tasks
• Global grouping
• Sends the entire stream to a single task of the consumer bolt (the one with the lowest id)
• Direct grouping
• The producer of each tuple decides which task of the consumer bolt receives it
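As a rough illustration of these three groupings, the plain-Java sketch below returns the consumer task ids that would receive a tuple. The class BroadcastGroupingSketch and its methods are hypothetical names used only for this example; the code models the routing decision, not Storm's actual messaging layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Storm): returns the ids of the consumer
// tasks that receive a tuple under each grouping.
class BroadcastGroupingSketch {

    // All grouping: every consumer task receives a copy of the tuple.
    static List<Integer> allGrouping(List<Integer> consumerTaskIds) {
        return new ArrayList<>(consumerTaskIds);
    }

    // Global grouping: the whole stream goes to one task, the one with the lowest id.
    static List<Integer> globalGrouping(List<Integer> consumerTaskIds) {
        return Collections.singletonList(Collections.min(consumerTaskIds));
    }

    // Direct grouping: the producer names the receiving task explicitly.
    static List<Integer> directGrouping(List<Integer> consumerTaskIds, int chosenTaskId) {
        if (!consumerTaskIds.contains(chosenTaskId)) {
            throw new IllegalArgumentException("unknown task id: " + chosenTaskId);
        }
        return Collections.singletonList(chosenTaskId);
    }

    public static void main(String[] args) {
        List<Integer> tasks = Arrays.asList(7, 5, 9); // illustrative task ids
        System.out.println("all:    " + allGrouping(tasks));
        System.out.println("global: " + globalGrouping(tasks));
        System.out.println("direct: " + directGrouping(tasks, 9));
    }
}
```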
Storm architecture
• Master-worker architecture
Storm components: Nimbus and Zookeeper
• Nimbus
– The master node
– Clients submit topologies to it
– Responsible for distributing and coordinating the
topology execution
• Zookeeper
– Nimbus uses a combination of the local disk(s)
and Zookeeper to store state about the topology
Storm components: worker
• Task: operator instance
– The actual work for a bolt or a spout is done in the
task
• Executor: smallest schedulable entity
– A thread that executes one or more tasks of the same operator
• Worker process: Java process running one or
more executors
• Worker node: computing
resource, a container for
one or more worker processes
Storm components: supervisor
• Each worker node runs a supervisor
• The supervisor:
– receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
– sends a periodic heartbeat to Nimbus (through ZooKeeper)
– advertises the topologies it is currently running and any vacancies available to run more topologies
Example of a running topology
What makes a running topology
Configuring the parallelism of a topology
Number of worker processes
• How many worker processes to create for the topology across the machines in the cluster
• Configuration option: TOPOLOGY_WORKERS (in code: Config#setNumWorkers)
Number of executors (threads)
• How many executors to spawn per component
• Configuration option: none (pass the parallelism_hint parameter to setSpout or setBolt)
Number of tasks
• How many tasks to create per component
• Configuration option: TOPOLOGY_TASKS (in code: setNumTasks on the component declarer)
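The relation between the three settings can be checked with simple arithmetic. The sketch below is a plain-Java model with hypothetical names (ParallelismSketch and its methods) and illustrative numbers: 2 worker processes, one spout with parallelism hint 2, one bolt with hint 2 and 4 tasks, and one bolt with hint 6. It assumes executors and tasks are spread evenly and ignores the executors that Storm adds for its own system components.

```java
// Hypothetical helper (not part of Storm): arithmetic behind workers,
// executors, and tasks, under an even-distribution assumption.
class ParallelismSketch {

    // Total executors of the topology = sum of the parallelism hints.
    static int totalExecutors(int... executorsPerComponent) {
        int total = 0;
        for (int executors : executorsPerComponent) {
            total += executors;
        }
        return total;
    }

    // Executors hosted by each worker process (integer division; any
    // remainder is ignored in this sketch).
    static int executorsPerWorker(int totalExecutors, int numWorkers) {
        return totalExecutors / numWorkers;
    }

    // Tasks run by each executor of one component (integer division).
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    public static void main(String[] args) {
        int total = totalExecutors(2, 2, 6);          // spout, first bolt, second bolt
        System.out.println("executors in the topology: " + total);
        System.out.println("executors per worker:      " + executorsPerWorker(total, 2));
        System.out.println("tasks per executor (bolt with 4 tasks, hint 2): "
                + tasksPerExecutor(4, 2));
    }
}
```

When the number of tasks is not set explicitly, Storm runs one task per executor, so the last ratio is 1.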
Example of a running topology
Running a Topology in Storm
Storm supports two running modes: local and cluster
• Local mode: the topology is executed on a single node
• local mode is usually used for testing purposes
• we can check whether our application runs as expected
• Cluster mode: the topology is distributed by Storm across multiple workers
• cluster mode should be used to run our application on the real dataset
• it better exploits parallelism
• the application code is transparently distributed
• the topology is managed and monitored at run-time
Running a Topology in Storm
To run a topology in local mode, we just need to create an in-process cluster
• it is a simplified version of a cluster
• lightweight Storm functions wrap our code
• it can be instantiated using the LocalCluster class

For example:

    ...
    conf.setMaxTaskParallelism(3);
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("myTopology", conf, topology);
    Utils.sleep(10000); // wait 10000 ms while the topology runs
    cluster.killTopology("myTopology");
    cluster.shutdown();
    ...

conf.setMaxTaskParallelism(...)
• This caps the number of executors (threads) that can be spawned for each component; it is typically used to limit the threads created in local mode.
Running a Topology in Storm
To run a topology in cluster mode, we need to perform the following steps:
1. Configure the application for the submission, using the StormSubmitter class. For example:

    ...
    Config conf = new Config();
    conf.setNumWorkers(NUM_WORKERS);
    StormSubmitter.submitTopology("mytopology", conf, topology);
    ...

NUM_WORKERS
• number of worker processes to be used for running the topology
Running a Topology in Storm
2. Create a jar containing your code and all the dependencies of your code
• do not include the Storm library
• this can be easily done using Maven: use the Maven Assembly Plugin and configure your pom.xml:

    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>com.path.to.main.Class</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
Running a Topology in Storm
3. Submit the topology to the cluster using the storm client, as follows:

    $ $STORM_HOME/bin/storm jar path/to/allmycode.jar full.classname.Topology arg1 arg2 arg3
Running a Topology in Storm
[Figure legend: application code, control messages]
A container-based Storm cluster
Running a Topology in Storm
We are going to create a (local) Storm cluster using Docker
We need to run several containers, each of which will
manage a service of our system:
• Zookeeper
• Nimbus
• Worker1, Worker2, Worker3
• Storm Client (storm-cli): we use storm-cli to run topologies or
scripts that feed our DSP application
Auxiliary services that will be useful to interact with our Storm topologies:
• Redis
• RabbitMQ: a message queue service
Docker Compose
To easily coordinate the execution of these multiple services,
we use Docker Compose
• Read more at https://docs.docker.com/compose/
Docker Compose:
• is not bundled with the Docker installation; it can be installed following the official Docker documentation
• https://docs.docker.com/compose/install/
• lets us declare the containers to be instantiated together and the relations among them
• by itself runs the composition on a single machine; however, in combination with Docker Swarm, containers can be deployed on multiple nodes
Docker Compose
• We specify how to compose containers in an easy-to-read file, by default named docker-compose.yml
• To start the docker composition (in background with -d):

    $ docker-compose up -d

• To stop the docker composition:

    $ docker-compose down

• By default, docker-compose looks for the docker-compose.yml file in the current working directory; we can point it to a different configuration file using the -f flag
Docker Compose
• There are different versions of the docker compose file format
• We will use version 3, supported since Docker Compose 1.13
• On the docker compose file format: https://docs.docker.com/compose/compose-file/
Storm UI
In addition to the bolts defined in your topology, Storm runs its own bolts (ackers) that perform background work: they keep track of whether each tuple has been successfully processed or has failed, based on the acknowledgements sent by the topology components.
By default, Storm sets the number of acker executors equal to the number of workers configured for the topology.
Storm Examples
Example: Exclamation
• Problem: suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.
• Solution (1):
A simple topology: ExclamationTopology

    ...
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("word", new RandomNamesSpout(), 1);
    builder.setBolt("exclaim1", new ExclamationBolt(), 1)
           .shuffleGrouping("word");
    builder.setBolt("exclaim2", new ExclamationBolt(), 1)
           .shuffleGrouping("exclaim1");

    Config conf = new Config();
    conf.setNumWorkers(3);

    StormSubmitter.submitTopologyWithProgressBar(
        "ExclamationTopology", conf, builder.createTopology());
    ...
Example: Exclamation
• Solution (2):
Example: WordCount
• Problem: suppose we have a random source of sentences. Create a DSP application that counts the number of occurrences of each word.
• Solution:
WordCount

    ...
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("spout", new RandomSentenceSpout(), 5);

    builder.setBolt("split", new SplitSentenceBolt(), 8)
           .shuffleGrouping("spout");

    builder.setBolt("count", new WordCountBolt(), 12)
           .fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    ...
    StormSubmitter.submitTopologyWithProgressBar(
        "WordCount", conf, builder.createTopology());
    ...
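The logic carried by the "split" and "count" operators can be sketched in plain Java, outside of Storm. The class WordCountSketch is a hypothetical stand-in: it is not the course's SplitSentenceBolt or WordCountBolt, whose code is not shown here, and it assumes that words are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the two operators of the WordCount topology.
class WordCountSketch {

    // Role of the "split" operator: one sentence in, one word per output.
    // Assumes words are separated by whitespace.
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Role of the "count" operator: state local to one task.
    private final Map<String, Integer> counts = new HashMap<>();

    // Updates and returns the running count of the given word.
    int count(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        WordCountSketch counter = new WordCountSketch();
        for (String word : split("the cow jumped over the moon")) {
            System.out.println(word + " " + counter.count(word));
        }
    }
}
```

Since each counting task keeps its own map, the fields grouping on "word" is what guarantees that all occurrences of a word reach the same task and are summed in a single place.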
Example: Rolling Count
• Problem: suppose we have a random source of words. Create a DSP application that determines the top-N ranking of words within a sliding window of X seconds with a sliding interval of Y seconds.
• Solution:
Rolling Count

    ...
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout(spoutId, new RandomNamesSpout(), 5);

    builder.setBolt(counterId, new RollingCountBolt(), 4)
           .fieldsGrouping(spoutId, new Fields("word"));

    builder.setBolt(intermediateRankerId, new IntermediateRankingBolt(TOP_N), 4)
           .fieldsGrouping(counterId, new Fields("obj"));

    builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N), 1)
           .globalGrouping(intermediateRankerId);

    StormSubmitter.submitTopologyWithProgressBar(...);
    ...
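The two-level ranking used by this topology (several intermediate rankers, then a single total ranker) can be sketched in plain Java. RankingSketch and its methods are hypothetical names, not the bolts of the topology; the sketch assumes that each word is counted by exactly one intermediate ranker, which is what the fields grouping provides, and it breaks ties alphabetically only to make the result deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not the topology's bolts): partial and total top-N.
class RankingSketch {

    // Top-n entries by count, highest first; ties broken alphabetically.
    static Map<String, Long> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> top = new LinkedHashMap<>(); // keeps ranking order
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // Total ranking: merge the partial top-n lists and rank again.
    // Correct only if every key appears in at most one partial ranking.
    static Map<String, Long> mergeTopN(List<Map<String, Long>> partialRankings, int n) {
        Map<String, Long> union = new HashMap<>();
        for (Map<String, Long> partial : partialRankings) {
            union.putAll(partial);
        }
        return topN(union, n);
    }

    public static void main(String[] args) {
        Map<String, Long> countsA = new HashMap<>();
        countsA.put("anna", 5L);
        countsA.put("bob", 3L);
        countsA.put("carl", 1L);
        Map<String, Long> countsB = new HashMap<>();
        countsB.put("dora", 4L);
        countsB.put("eve", 2L);

        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(topN(countsA, 2)); // first intermediate ranker
        partials.add(topN(countsB, 2)); // second intermediate ranker
        System.out.println(mergeTopN(partials, 2));
    }
}
```

Each intermediate ranker forwards at most N entries, so the single total ranker handles a small input regardless of how many distinct words flow through the topology.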
Word Count on a Window (1)
• Storm 1.0 introduced explicit support for windows.
• We revisit a simplified version of the previous Word Count application, relying on the window primitives provided by Storm.
• The idea is to compute the word count over a sliding window.
Word Count on a Window (2)
• We create a data stream processing application which comprises the following operators:
• a data source, which emits sentences
• a splitter
• a word count operator with a sliding window
• the sliding window is 9 seconds long and slides every 3 seconds
• To better visualize the results, we include an auxiliary operator that exports results to a message queue, implemented with RabbitMQ
Word Count on a Window (3)

    ...
    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("spout", new RandomSentenceSpout(), 5);

    builder.setBolt("split", new SplitSentenceBolt(), 8)
           .shuffleGrouping("spout");

    builder.setBolt("count",
            new WordCountWindowBasedBolt().withWindow(
                BaseWindowedBolt.Duration.seconds(9),   // window length
                BaseWindowedBolt.Duration.seconds(3)),  // sliding interval
            12)
           .fieldsGrouping("split", new Fields("word"));

    StormSubmitter.submitTopologyWithProgressBar(...);
    ...
Word Count on a Window (4)
Implementation of the windowed operator:

    public class WordCountWindowBasedBolt extends BaseWindowedBolt {
        ...
        public void execute(TupleWindow tuples) {
            List<Tuple> incoming = tuples.getNew();
            for (Tuple tuple : incoming) { ... }

            List<Tuple> expired = tuples.getExpired();
            for (Tuple tuple : expired) { ... }
        }
        ...
    }
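One possible way to fill the two loops is to update the counts incrementally: add one for each word that entered the window and subtract one for each word that left it. The plain-Java sketch below illustrates this idea with lists of words in place of Storm tuples; SlidingCountSketch and onWindow are hypothetical names, and this is an assumed implementation strategy, not necessarily the one used in the course code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a windowed word count: the two lists play the role
// of the tuples returned by getNew() and getExpired().
class SlidingCountSketch {

    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per window activation; returns a snapshot of the counts
    // of the words currently inside the window.
    Map<String, Integer> onWindow(List<String> newWords, List<String> expiredWords) {
        for (String word : newWords) {
            counts.merge(word, 1, Integer::sum);
        }
        for (String word : expiredWords) {
            // Decrement; returning null removes the entry when it reaches zero.
            counts.computeIfPresent(word, (key, count) -> count > 1 ? count - 1 : null);
        }
        return new HashMap<>(counts);
    }

    public static void main(String[] args) {
        SlidingCountSketch bolt = new SlidingCountSketch();
        List<String> none = Collections.emptyList();
        System.out.println(bolt.onWindow(Arrays.asList("a", "b", "a"), none));
        System.out.println(bolt.onWindow(Arrays.asList("b"), Arrays.asList("a", "b")));
        System.out.println(bolt.onWindow(none, Arrays.asList("a")));
    }
}
```

With this approach, each activation costs time proportional to the number of tuples that changed, rather than to the size of the whole window.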
DEBS Grand Challenge 2015 (1)
• Analysis of taxi trips based on data streams originating
from New York City taxis
• Input data streams: include starting point, drop-off point,
timestamps, and information related to the payment
• Query 1: identify the top 10 most frequent routes during
the last 30 minutes (sliding window)
• Use geo-spatial grids to define the events of interest
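Mapping coordinates to grid cells, and a trip to a route, might look like the plain-Java sketch below. Everything in it is an assumption made for illustration: the class GridSketch, the "column.row" cell identifier, the "start->end" route identifier, and the grid parameters in main, which are toy values and not the ones defined by the challenge.

```java
import java.util.Optional;

// Hypothetical grid: square, anchored at its north-west corner, with cells
// numbered from 1 (columns west to east, rows north to south).
class GridSketch {

    private final double topLat;     // latitude of the northern edge
    private final double leftLon;    // longitude of the western edge
    private final double cellLat;    // cell height, in degrees of latitude
    private final double cellLon;    // cell width, in degrees of longitude
    private final int cellsPerSide;

    GridSketch(double topLat, double leftLon, double cellLat, double cellLon, int cellsPerSide) {
        this.topLat = topLat;
        this.leftLon = leftLon;
        this.cellLat = cellLat;
        this.cellLon = cellLon;
        this.cellsPerSide = cellsPerSide;
    }

    // Returns "column.row" for a point inside the grid, or empty for an
    // outlier (such points would be filtered out).
    Optional<String> cellId(double lat, double lon) {
        int col = (int) Math.floor((lon - leftLon) / cellLon) + 1;
        int row = (int) Math.floor((topLat - lat) / cellLat) + 1;
        if (col < 1 || col > cellsPerSide || row < 1 || row > cellsPerSide) {
            return Optional.empty();
        }
        return Optional.of(col + "." + row);
    }

    // A route is identified by its starting cell and its ending cell.
    static String routeId(String startCell, String endCell) {
        return startCell + "->" + endCell;
    }

    public static void main(String[] args) {
        GridSketch grid = new GridSketch(10.0, 20.0, 1.0, 1.0, 3); // toy 3x3 grid
        Optional<String> start = grid.cellId(9.5, 20.5);
        Optional<String> end = grid.cellId(8.5, 22.5);
        if (start.isPresent() && end.isPresent()) {
            System.out.println(routeId(start.get(), end.get()));
        }
    }
}
```

Counting trips per route identifier within the 30-minute window then reduces Query 1 to the same count-and-rank pattern used in the Rolling Count example.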
DEBS Grand Challenge 2015 (2)

    TopologyBuilder builder = new TopologyBuilder();

    builder.setSpout("datasource", new RedisSpout(redisUrl, redisPort));

    builder.setBolt("parser", new ParseLine())
           .setNumTasks(numTasks)
           .shuffleGrouping("datasource");

    builder.setBolt("filterByCoordinates", new FilterByCoordinates())
           .setNumTasks(numTasks)
           .shuffleGrouping("parser");

    builder.setBolt("metronome", new Metronome())
           .setNumTasks(numTasksMetronome)
           .shuffleGrouping("filterByCoordinates");

    builder.setBolt("computeCellID", new ComputeCellID())
           .setNumTasks(numTasks)
           .shuffleGrouping("filterByCoordinates");
DEBS Grand Challenge 2015 (3)

    builder.setBolt("countByWindow", new CountByWindow())
           .setNumTasks(numTasks)
           .fieldsGrouping("computeCellID", new Fields(ComputeCellID.F_ROUTE))
           .allGrouping("metronome", Metronome.S_METRONOME);

    builder.setBolt("partialRank", new PartialRank(10))
           .setNumTasks(numTasks)
           .fieldsGrouping("countByWindow", new Fields(ComputeCellID.F_ROUTE));

    builder.setBolt("globalRank", new GlobalRank(...), 1)
           .setNumTasks(numTasksGlobalRank)
           .shuffleGrouping("partialRank");

    StormTopology stormTopology = builder.createTopology();