
INTRODUCTION TO APACHE STORM

Sapienza University of Rome - Data Mining Class - A.Y. 2016-2017

Team


Riccardo Di Stefano

Roberto Gaudenzi

Davide Mazza

Lorenzo Rutigliano

Sara Veterini

Federico Croce


https://it.linkedin.com/in/lorenzo-rutigliano-00a007135/it

https://it.linkedin.com/in/sara-veterini-667684116

https://it.linkedin.com/in/roberto-gaudenzi-4b0422116

https://it.linkedin.com/in/federico-croce-921a19134/it

https://it.linkedin.com/in/riccardo-di-stefano-439a11134

https://it.linkedin.com/in/davide-mazza-33a9b291

Contacts and Links


https://github.com/davidemazza/ApacheStorm

http://www.slideshare.net/DavideMazza6/apache-storm-tutorial

[email protected]


Introduction

Apache Storm is a free and open-source distributed, fault-tolerant, real-time computation system that makes it easy to process unbounded streams of data.

> Use cases: financial applications, network monitoring, social network analysis, online machine learning, etc.

> Different from traditional batch systems (store, then process).


Companies

[Slide 5: logos of companies using Storm]


Stream

Unbounded Sequence of Tuples

Tuple: the core unit of data; it is a named list of values


Topologies

An application is defined in Storm through a topology, which describes its logic as a DAG of operators and streams.

Spouts are the sources of data streams. They usually read data from external sources (e.g. the Twitter API) or from disk and emit it into the topology.

Bolts process input streams and possibly produce output streams. They implement the application logic.
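A minimal sketch of what the two kinds of components look like in Java (Storm 1.x API; the class names, field names, and emitted values are illustrative, not taken from the slides):

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A spout emits tuples into the topology.
class NumberSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;               // called once, when the spout starts
    }

    public void nextTuple() {
        collector.emit(new Values(42));           // called repeatedly to emit the next tuple
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));   // names the fields of the emitted tuples
    }
}

// A bolt consumes tuples and (possibly) emits new ones.
class PrinterBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        System.out.println(input.getIntegerByField("number"));
        collector.ack(input);                     // acknowledge the tuple once it is processed
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt emits nothing, so no output fields are declared
    }
}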


Architecture

Two kinds of nodes in a Storm cluster:

➢ The Master node runs a daemon called “Nimbus” to which topologies are submitted. It is responsible for scheduling, job orchestration, and monitoring for failures.

➢ Each Worker (slave) node runs a daemon called “Supervisor”, which can start one or more worker processes in which the application code is executed.

Coordination between these two entities is done through ZooKeeper, which is mainly used to maintain state, because Nimbus and the Supervisors are themselves stateless.
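From the client's point of view, submitting a topology to Nimbus looks roughly like this sketch (the topology name is a placeholder, and the builder is assumed to be filled with spouts and bolts):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... set spouts and bolts on the builder (see the following slides) ...

        Config conf = new Config();
        conf.setNumWorkers(2);   // ask the cluster for two worker processes

        // The client contacts Nimbus, which distributes the code to the Supervisors
        // and schedules the executors onto the worker processes.
        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}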


Architecture

Three entities are involved in running a topology:

➢ Worker process: one or more run on each worker node; each worker process belongs to exactly one topology (a design choice motivated by fault tolerance and isolation).

➢ Executor: a thread within a worker process. It runs one or more tasks of the same component (spout or bolt).

➢ Task: a replica of a component (an instance of a spout or bolt) that performs the actual processing.

Therefore, workers provide inter-topology parallelism, executors intra-topology parallelism, and tasks intra-component parallelism.

[Diagram: a worker process contains one or more executors, and each executor runs one or more tasks.]
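To make the three levels concrete, here is a hedged sketch of how parallelism is configured when a topology is built (it reuses the illustrative NumberSpout and PrinterBolt from above): the number of workers is set on the Config, the number of executors through the parallelism hint, and the number of tasks through setNumTasks.

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        Config conf = new Config();

        conf.setNumWorkers(2);                                // 2 worker processes for this topology

        builder.setSpout("numbers", new NumberSpout(), 1);    // 1 executor for the spout

        builder.setBolt("printer", new PrinterBolt(), 2)      // 2 executors for this bolt...
               .setNumTasks(4)                                // ...running 4 tasks in total
               .shuffleGrouping("numbers");                   // subscribe to the spout's stream
    }
}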


Simple Example


Example

We will show how to compute the average of the grades using a simple Storm topology.

We will use:

➢ one spout;
➢ two bolts that work in parallel;
➢ another bolt in which the previous two converge.


Spout

This slide showed the code of the spout.

Its job is to read a stream of numbers.

Our stream represents the grades, so the values are between 18 and 30.
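The spout's code is not included in the transcript; a minimal sketch of what it could look like, assuming a spout that emits random grades between 18 and 30 (the class name GradeSpout is an assumption):

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class GradeSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    public void nextTuple() {
        Utils.sleep(100);                          // throttle emission a little
        int grade = 18 + random.nextInt(13);       // random grade in [18, 30]
        collector.emit(new Values(grade));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("grade"));
    }
}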


Bolt

This slide showed the code of the bolts.

We can distinguish three different bolts in our example:

1. SummationBolt: computes the sum of the numbers;
2. CounterBolt: counts the numbers;
3. AverageBolt: computes the average.
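The bolts' code is also not in the transcript; as an illustration, the SummationBolt could look like the sketch below, emitting the running sum for every incoming grade. The CounterBolt would keep a running count in the same way, and the AverageBolt would divide the latest sum by the latest count.

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SummationBolt extends BaseBasicBolt {
    private long sum = 0;

    public void execute(Tuple input, BasicOutputCollector collector) {
        sum += input.getIntegerByField("grade");   // add the incoming grade
        collector.emit(new Values(sum));           // emit the running sum
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sum"));
    }
}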


Topology

[Slides 14-20: animated diagram of the example topology, showing the stream of grades flowing from the spout through the two parallel bolts and into the final bolt, which produces the output stream.]
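Putting the pieces together, the topology could be wired up roughly as follows and run on a local in-process cluster (the component IDs, the groupings, and the CounterBolt and AverageBolt classes are assumptions based on the description above, not code from the slides):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;

public class AverageTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("grades", new GradeSpout());

        // The two intermediate bolts consume the grade stream in parallel.
        builder.setBolt("sum", new SummationBolt()).shuffleGrouping("grades");
        builder.setBolt("count", new CounterBolt()).shuffleGrouping("grades");

        // The final bolt receives both the running sum and the running count.
        builder.setBolt("average", new AverageBolt())
               .globalGrouping("sum")
               .globalGrouping("count");

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();   // in-process cluster, handy for testing
        cluster.submitTopology("average-topology", conf, builder.createTopology());

        Utils.sleep(30000);                          // let the topology run for 30 seconds
        cluster.shutdown();
    }
}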

Trident


Trident

➢ A high-level abstraction on top of Storm

➢ The spouts and bolts that actually run are auto-generated by Trident before execution

➢ Trident has functions, filters, joins, grouping, and aggregation

➢ Processes streams as a series of batches


Topology

➢ Receives an input stream from a spout

➢ Applies an ordered sequence of operations (filter, aggregation, grouping, etc.) to the stream
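A hedged sketch of such a chain, using Trident's built-in test spout and Count aggregator (the field name and the emitted grades are illustrative):

import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentChainExample {
    public static void main(String[] args) {
        // A test spout that emits small batches of ("grade") tuples.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("grade"), 3,
                new Values(30), new Values(18), new Values(30), new Values(27));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("grades", spout)                                          // input stream from the spout
                .groupBy(new Fields("grade"))                                        // grouping
                .aggregate(new Fields("grade"), new Count(), new Fields("count"));   // aggregation per group
        // topology.build() returns a StormTopology that is submitted as usual.
    }
}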


Tuples & Spout

➢ TridentTuple is a named list of values.

➢ The TridentTuple interface is the data model of a Trident topology

➢ A Trident spout is similar to a Storm spout, but with additional options

➢ Trident provides several sample spout implementations


Example of Spout
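The code of this slide is not in the transcript; as a stand-in, here is how Trident's built-in FixedBatchSpout (a test spout that emits a fixed set of tuples in small batches) is created and attached to a topology. The sentences are the standard Trident tutorial data, used here purely as an example:

import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SpoutExample {
    public static void main(String[] args) {
        // Output field "sentence", at most 3 tuples per batch, then the tuples to emit.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);   // keep cycling over the same tuples forever

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout);   // the stream the operations are applied to
    }
}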


Operations

➢ Filter

➢ Function

➢ Aggregation

➢ Grouping

➢ Merging and Joining


Operations: Filter

➢ An object used to perform input validation.

➢ Receives a subset of the Trident tuple's fields as input

➢ Returns either true or false

➢ True → the tuple is kept in the output stream

➢ False → the tuple is removed from the stream
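For example, a filter that keeps only passing grades might look like this sketch (the class name and the field it is applied to are assumptions); it would be attached to a stream with stream.each(new Fields("grade"), new PassingGradeFilter()):

import org.apache.storm.trident.operation.BaseFilter;
import org.apache.storm.trident.tuple.TridentTuple;

public class PassingGradeFilter extends BaseFilter {
    // Called once per tuple; returning false removes the tuple from the stream.
    public boolean isKeep(TridentTuple tuple) {
        return tuple.getInteger(0) >= 18;   // keep only passing grades
    }
}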


Operations: Function

➢ An object used to perform a simple operation on a single Trident tuple.

➢ Takes a subset of the Trident tuple's fields as input

➢ Emits zero or more new tuples, whose fields are appended to the original tuple's fields.
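As an illustration (the names are assumptions), a function that computes a pass/fail flag from the grade field; it would be applied with stream.each(new Fields("grade"), new PassFlagFunction(), new Fields("outcome")):

import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Values;

public class PassFlagFunction extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        int grade = tuple.getInteger(0);
        // The emitted field is appended to the input tuple's fields.
        collector.emit(new Values(grade >= 18 ? "pass" : "fail"));
    }
}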


Operations: Aggregation

An object used to perform aggregation operations on an input batch, partition, or stream.

➢ Aggregate → aggregates each batch of Trident tuples in isolation

➢ PartitionAggregate → aggregates each partition separately, instead of the entire batch of Trident tuples

➢ PersistentAggregate → aggregates over all Trident tuples across all batches, storing the result in a state
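A brief sketch of the three variants using the built-in Count aggregator (the input stream and its "grade" field are placeholders, and the three calls are independent illustrations):

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class AggregationExamples {
    static void apply(Stream stream) {
        // Aggregate: one count per batch, each batch aggregated in isolation.
        stream.aggregate(new Count(), new Fields("count"));

        // PartitionAggregate: one count per partition of each batch.
        stream.partitionAggregate(new Fields("grade"), new Count(), new Fields("count"));

        // PersistentAggregate: a count maintained across all batches, stored in a state
        // (an in-memory map here; a real deployment would use an external store).
        stream.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    }
}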



Operations: Grouping

➢ A built-in operation, invoked with the groupBy method

➢ Repartitions the stream by doing a partitionBy on the specified fields

➢ Within each partition, groups together the tuples whose grouping fields are equal
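For example, counting how many times each grade occurs (a sketch; the stream and field names are assumptions):

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.tuple.Fields;

public class GroupingExample {
    static void apply(Stream grades) {
        // partitionBy("grade") under the hood, then count the tuples of each group.
        grades.groupBy(new Fields("grade"))
              .aggregate(new Fields("grade"), new Count(), new Fields("count"));
    }
}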


Operations: Merging and Joining

➢ Merging combines two or more streams into one

➢ Joining combines two streams by matching them on the specified Trident tuple fields from both sides (within each batch)
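A brief sketch of both operations (the topology, the streams, and the field names are placeholders):

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.tuple.Fields;

public class MergeJoinExample {
    static void apply(TridentTopology topology, Stream s1, Stream s2) {
        // Merge: concatenate the streams into a single stream with the same fields.
        topology.merge(s1, s2);

        // Join: match tuples of s1 ("key", "val1") with tuples of s2 ("key", "val2")
        // on the "key" field; the output tuples have fields ("key", "val1", "val2").
        topology.join(s1, new Fields("key"), s2, new Fields("key"),
                      new Fields("key", "val1", "val2"));
    }
}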


State Maintenance

➢ State information can be stored in the topology itself (in memory) or in an external database

➢ If any tuple fails during processing, the failed tuple is retried.

➢ If the tuple failed before updating the state → retrying it leaves the state consistent.

➢ If the tuple failed after updating the state → naively retrying it would apply the update twice, leaving the state inconsistent. Trident's state APIs are designed to handle exactly this case.
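A minimal sketch of stateful aggregation in Trident, which keeps the counts correct even when batches are retried (the in-memory state factory is for testing; a real deployment would use an external store):

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class StateExample {
    static TridentState apply(Stream grades) {
        // One counter per grade; Trident coordinates the updates so that the
        // counts stay correct even when a batch is replayed.
        return grades.groupBy(new Fields("grade"))
                     .persistentAggregate(new MemoryMapState.Factory(),
                                          new Count(), new Fields("count"));
    }
}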


When to use Trident?

It is difficult to achieve exactly-once processing with plain Storm.

Trident is useful for those use cases where you require exactly-once processing.

Trident Example


Trident Demo: Twitter Languages

Which are the most used languages on Twitter?

The code is built on top of Trident and gets a stream of tweets using the Twitter4J library.

For each tweet, the language is extracted.

A hashmap of per-language counters is maintained, and the code itself periodically publishes the counts in a tweet.
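The actual demo code is in the linked repository; purely to illustrate the idea (this is not the repository's code), the language-counting part could be expressed in Trident as follows, assuming a spout that emits twitter4j Status objects in a field named "tweet" and an in-memory state for the counters:

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import twitter4j.Status;

public class LanguageCount {
    // Extracts the language code from a twitter4j Status object.
    static class ExtractLanguage extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            Status status = (Status) tuple.getValue(0);
            collector.emit(new Values(status.getLang()));
        }
    }

    // Counts tweets per language, keeping the counters as Trident state.
    static TridentState countLanguages(Stream tweets) {
        return tweets.each(new Fields("tweet"), new ExtractLanguage(), new Fields("lang"))
                     .groupBy(new Fields("lang"))
                     .persistentAggregate(new MemoryMapState.Factory(),
                                          new Count(), new Fields("count"));
    }
}

The step that periodically publishes the counts back to Twitter is not shown here.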


Trident example setup

To set up your Twitter application:

● go to https://apps.twitter.com/ and create a new app
● fill in the form, leaving the callback URL empty
● after creating the app, go to “Keys and Access Tokens”
● copy the consumer key and consumer secret
● select “Create my access token” if no tokens are present, then copy the access token and access token secret
● open the project TwitterTridentExample in Eclipse, open the file twitter4j.properties in the project, and copy in your info

Now you are ready!
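The twitter4j.properties file then holds the four credentials, using Twitter4J's standard property names (the values below are placeholders):

oauth.consumerKey=YOUR_CONSUMER_KEY
oauth.consumerSecret=YOUR_CONSUMER_SECRET
oauth.accessToken=YOUR_ACCESS_TOKEN
oauth.accessTokenSecret=YOUR_ACCESS_TOKEN_SECRET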


Homework


https://github.com/davidemazza/ApacheStorm

Folder “Homework”

Thanks!