Apache Storm and Twitter Streaming API Integration


Upload: udayaprasad-v

Post on 30-Jun-2015


DESCRIPTION

What is Storm?

1) Storm is a distributed, real-time computation system.
2) The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a bolt, which either persists the data in some sort of storage or passes it on to another bolt. You can picture a Storm cluster as a chain of bolts, each applying some transformation to the data exposed by the spout.

Storm benefits

1) A real-time system must guarantee data processing.
2) It should be horizontally scalable, meaning that adding a few nodes is enough to increase the capacity of the cluster.
3) It should be fault tolerant: if an error occurs or a node goes down, the system should keep working.
4) It should avoid intermediate message brokers, because they are complex and slow. Instead of going directly from producer to consumer, each message passes through a third-party broker, and the broker also persists the input data to disk. All of this adds processing time.
5) Downtime matters less for Hadoop, because Hadoop is already a high-latency system: after a few hours of downtime you still have high latency. A real-time system that is down for a few hours is no longer real-time, so its robustness requirements are much stricter.

Storm satisfies all of these properties.

Storm vs. Hadoop

1) Both Hadoop and Storm are distributed, fault-tolerant systems, but Hadoop is mainly used for batch processing, whereas Storm is used for real-time computation.
2) Storm has no built-in storage system; it is built mainly on a “come and get some” strategy. Hadoop, on the other hand, uses HDFS as its file system.

Storm vs. Flume

1) Both Storm and Flume are used for real-time data processing, but Flume does not provide real-time computation.
2) Flume depends on a channel, a message-broker component, for guaranteed data processing, and the channel always persists the data before sending it to the consumer. Storm has no intermediate message broker and stays as lightweight as possible.

Whatever business logic you want to write goes into the bolt components of Storm.

TRANSCRIPT

Page 1: Apache Storm and twitter Streaming API integration

Welcome

Page 2: Apache Storm and twitter Streaming API integration

Integration of Storm and Twitter Streaming API

Page 3: Apache Storm and twitter Streaming API integration

Agenda

• What is Storm?
• Storm Benefits
• How Storm differs from Hadoop
• Storm vs. Flume
• Storm Example using Twitter Streaming API
• Quiz

Page 4: Apache Storm and twitter Streaming API integration

• Storm is a fault-tolerant, distributed, real-time computation system.
• It is a non-persistent API.
• On a Storm cluster we execute topologies, which process streams of tuples (data).
• Each topology is a graph consisting of spouts (which produce tuples) and bolts (which transform tuples).

What is Storm?
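
To make the spout and bolt vocabulary concrete, here is a minimal sketch in plain Python, not the Storm API. The names `number_spout`, `double_bolt`, `keep_large_bolt`, `apply_bolt` and `run_topology` are hypothetical and exist only for this illustration; a real topology would be defined through Storm and submitted to a cluster.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A Storm "tuple" is a small record; here it is modelled as a Python tuple.
Record = Tuple[int]
Bolt = Callable[[Record], Iterable[Record]]


def number_spout() -> Iterator[Record]:
    """Hypothetical spout: the source of the stream (finite in this sketch)."""
    for n in range(1, 6):
        yield (n,)


def double_bolt(record: Record) -> Iterable[Record]:
    """Hypothetical bolt: turns each incoming tuple into one new tuple."""
    yield (record[0] * 2,)


def keep_large_bolt(record: Record) -> Iterable[Record]:
    """Hypothetical bolt: emits nothing for the tuples it filters out."""
    if record[0] > 4:
        yield record


def apply_bolt(bolt: Bolt, stream: Iterable[Record]) -> Iterator[Record]:
    """Feed every tuple of the upstream stream through one bolt."""
    for record in stream:
        yield from bolt(record)


def run_topology(spout: Iterable[Record], bolts: List[Bolt]) -> List[Record]:
    """Chain the bolts behind the spout and collect what the last bolt emits."""
    stream: Iterable[Record] = spout
    for bolt in bolts:
        stream = apply_bolt(bolt, stream)
    return list(stream)
```

Calling `run_topology(number_spout(), [double_bolt, keep_large_bolt])` pushes the numbers 1 to 5 through both bolts. A bolt may emit zero, one or several tuples per input, which is why each bolt here returns an iterable.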

Page 5: Apache Storm and twitter Streaming API integration

• Once a Storm topology is submitted, and provided the computation logic written in the bolts is correct, it just works.

Storm Benefits

Page 6: Apache Storm and twitter Streaming API integration

Storm                           | Hadoop
Distributed and fault tolerant  | Distributed and fault tolerant
Real-time computation system    | Batch processing system
Non-persistent                  | Persistent; uses HDFS for file storage

Storm Vs. Hadoop

Page 7: Apache Storm and twitter Streaming API integration

Storm                                                             | Flume
Real-time streaming system                                        | Real-time streaming system
Real-time computation system                                      | Not a real-time computation system
Does not use any message broker for real-time processing of data  | Uses a channel as a message broker between source and sink

Storm Vs. Flume

Page 8: Apache Storm and twitter Streaming API integration

Topology scenario: this project uses one spout (TwitterSampleSpout) and three bolts (WordSplitterBolt, IgnoreWordsBolt, WordCounterBolt).

The spout (TwitterSampleSpout) downloads tweets from Twitter and sends them to WordSplitterBolt.

WordSplitterBolt splits the text of each tweet into words, using a space as the delimiter, and sends those words to IgnoreWordsBolt.

IgnoreWordsBolt acts as a filter: it drops determiners such as "a", "an" and "the", then sends the remaining words to WordCounterBolt. The actual counting happens there, and the console shows the most frequent words, much like Twitter trends.

This process runs continuously, aggregating the words and updating their counts.

Storm Example using Twitter Streaming API

Page 9: Apache Storm and twitter Streaming API integration

TwitterSampleSpout
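
The code shown on this slide is not part of the transcript. As a rough illustration of what such a spout has to do, the following plain-Python sketch buffers incoming tweets in a queue and hands them out one at a time. The class name `TwitterSampleSpoutSketch`, the `on_status` callback, the `max_buffered` parameter and the choice to drop tweets when the buffer is full are all assumptions made for this example; no real Twitter client is involved.

```python
import queue
from typing import Optional, Tuple


class TwitterSampleSpoutSketch:
    """Stand-in for a spout that buffers tweets arriving from a stream."""

    def __init__(self, max_buffered: int = 1000) -> None:
        # Bounded buffer between the streaming callback and the topology.
        self._buffer: "queue.Queue[str]" = queue.Queue(maxsize=max_buffered)

    def on_status(self, tweet_text: str) -> None:
        """Called by a (hypothetical) streaming client for each new tweet."""
        try:
            self._buffer.put_nowait(tweet_text)
        except queue.Full:
            # Assumption: losing a sample tweet is better than blocking the stream.
            pass

    def next_tuple(self) -> Optional[Tuple[str]]:
        """Return one buffered tweet as a 1-field tuple, or None if none is waiting."""
        try:
            return (self._buffer.get_nowait(),)
        except queue.Empty:
            return None
```

The queue decouples the rate at which tweets arrive from the rate at which the topology asks for the next tuple, so neither side has to wait for the other.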

Page 10: Apache Storm and twitter Streaming API integration

WordSplitterBolt
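
The slide's code is not in the transcript; the splitting step described in the scenario can be sketched in plain Python as follows. The function name `word_splitter_bolt` and the decision to skip empty strings are assumptions for illustration.

```python
from typing import Iterator, Tuple


def word_splitter_bolt(tweet: Tuple[str]) -> Iterator[Tuple[str]]:
    """Split the tweet text on spaces and emit one tuple per word."""
    text = tweet[0]
    for word in text.split(" "):
        # Consecutive spaces produce empty strings; those are not words.
        if word:
            yield (word,)
```

One input tuple fans out into several output tuples here, one per word in the tweet.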

Page 11: Apache Storm and twitter Streaming API integration

IgnoreWordsBolt
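
The slide's code is not in the transcript; a plain-Python sketch of the filtering step might look like this. The name `ignore_words_bolt`, the `IGNORED_WORDS` set (limited to the three determiners the scenario names) and the case-insensitive comparison are assumptions.

```python
from typing import Iterator, Tuple

# Only the determiners named in the scenario; a real list would be longer.
IGNORED_WORDS = frozenset({"a", "an", "the"})


def ignore_words_bolt(word_tuple: Tuple[str]) -> Iterator[Tuple[str]]:
    """Pass a word through unchanged unless it is in the ignore list."""
    word = word_tuple[0]
    if word.lower() not in IGNORED_WORDS:
        yield word_tuple
```

A filtering bolt simply emits nothing for the tuples it rejects, so downstream bolts never see them.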

Page 12: Apache Storm and twitter Streaming API integration

WordCounterBolt
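
The slide's code is not in the transcript; the counting step can be sketched in plain Python with a running tally. The class name `WordCounterBoltSketch` and its `execute` and `top` methods are hypothetical names for this example.

```python
from collections import Counter
from typing import List, Tuple


class WordCounterBoltSketch:
    """Keeps a running count per word and reports the most frequent ones."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def execute(self, word_tuple: Tuple[str]) -> None:
        """Count one incoming word."""
        self._counts[word_tuple[0]] += 1

    def top(self, n: int = 10) -> List[Tuple[str, int]]:
        """Return the n most frequent words with their counts."""
        return self._counts.most_common(n)
```

Unlike the splitter and the filter, this bolt is stateful: its counts accumulate for as long as the stream keeps running, which is what lets it report a trend-like top list.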

Page 13: Apache Storm and twitter Streaming API integration

Topology
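
The slide's code is not in the transcript. In Storm itself the wiring is declared in Java with `TopologyBuilder`; the plain-Python sketch below only imitates the idea of connecting one source to a chain of processing steps. The names `TOPOLOGY`, `split_words`, `drop_ignored`, `apply_bolt` and `run` are hypothetical, and a fixed list of strings stands in for the live tweet stream.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str]
Bolt = Callable[[Record], Iterable[Record]]


def split_words(record: Record) -> Iterable[Record]:
    """Compact stand-in for WordSplitterBolt."""
    return [(word,) for word in record[0].split(" ") if word]


def drop_ignored(record: Record) -> Iterable[Record]:
    """Compact stand-in for IgnoreWordsBolt."""
    return [] if record[0].lower() in {"a", "an", "the"} else [record]


# The wiring, in stream order: each entry consumes the output of the previous one.
TOPOLOGY: List[Tuple[str, Bolt]] = [
    ("WordSplitterBolt", split_words),
    ("IgnoreWordsBolt", drop_ignored),
]


def apply_bolt(bolt: Bolt, stream: Iterable[Record]) -> Iterator[Record]:
    """Feed every tuple of the upstream stream through one bolt."""
    for record in stream:
        yield from bolt(record)


def run(tweets: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Run sample tweets through the wired steps and return the top words."""
    stream: Iterable[Record] = ((text,) for text in tweets)  # stands in for the spout
    for _name, bolt in TOPOLOGY:
        stream = apply_bolt(bolt, stream)
    counts = Counter(record[0] for record in stream)  # stands in for WordCounterBolt
    return counts.most_common(top_n)
```

Keeping the wiring in one list separates what each step does from how the steps are connected, so a step can be added or reordered without touching the others.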

Page 14: Apache Storm and twitter Streaming API integration

Thanks to all