streams processing with storm
TRANSCRIPT
Data streams processing with
STORM
Mariusz Gil
data expires fast. very fast
realtime processing?
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
concept architecture
Stream: unbounded sequence of tuples
(val1, val2), (val3, val4), (val5, val6), ...
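A stream is an unbounded sequence of such tuples, each an ordered list of values addressable by field name. A minimal sketch of that idea (illustrative only; Storm's real Tuple interface in backtype.storm.tuple is richer):

```java
import java.util.List;

// Minimal model of a Storm-style tuple: a fixed schema of field names
// plus the positional values for one tuple instance.
class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    // Positional access, like Tuple.getValue(i)
    Object getValue(int i) {
        return values.get(i);
    }

    // Named access, like Tuple.getValueByField(field)
    Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }
}
```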
Spouts: source of streams
Reliable and unreliable Spouts: replay or forget about a tuple
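The reliable-spout contract can be pictured as a pending map keyed by message id: an ack forgets the tuple, a fail puts it back for replay. A simplified sketch of that bookkeeping (assumed names, not Storm's actual API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of reliable-spout semantics: emitted messages stay "pending"
// under a message id until acked; a fail() requeues them for replay.
class ReliableSource {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<Integer, String> pending = new HashMap<>();
    private int nextId = 0;

    ReliableSource(List<String> messages) {
        queue.addAll(messages);
    }

    // Like nextTuple(): emit the next message, remember it by id.
    Integer emit() {
        String msg = queue.poll();
        if (msg == null) return null;
        pending.put(nextId, msg);
        return nextId++;
    }

    // Fully processed by the topology: forget it.
    void ack(int id) {
        pending.remove(id);
    }

    // Processing failed: put it back at the front for replay.
    void fail(int id) {
        queue.addFirst(pending.remove(id));
    }

    int pendingCount() {
        return pending.size();
    }
}
```

An unreliable spout is the same loop without the pending map: once emitted, a tuple is simply forgotten.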
Spouts: ready-made sources of streams
Storm-Kafka
Storm-Kestrel
Storm-AMQP-Spout
Storm-JMS
Storm-PubSub*
Storm-Beanstalkd-Spout
Bolts: process input streams and produce new streams
Topologies: network of spouts and bolts
TextSpout -> SplitSentenceBolt -> WordCountBolt
[sentence] -> [word] -> [word, count]
Topologies: network of spouts and bolts
TextSpout -> SplitSentenceBolt -> WordCountBolt
[sentence] -> [word] -> [word, count]
TextSpout -> xyzBolt
[sentence]
servers architecture
Nimbus: process responsible for distributing processing across the cluster
Supervisors: worker processes responsible for executing a subset of a topology
ZooKeepers: coordination layer between Nimbus and Supervisors
fail fast: cluster state is stored locally or in ZooKeepers
sample code
Spouts
public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) { }

    @Override
    public void fail(Object id) { }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
Bolts
public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
Bolts
public static class ExclamationBolt implements IRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    public void cleanup() { }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    public Map getComponentConfiguration() {
        return null;
    }
}
Topology
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}
Bolts
public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

SplitSentenceBolt().run()
github.com/nathanmarz/storm-starter
streams grouping
Topology
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}
Groupings: shuffle, fields, all, global, none, direct, local-or-shuffle
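fieldsGrouping partitions a stream by the values of the named fields, so every tuple carrying the same word reaches the same WordCount task, which is what keeps the per-task counts correct. The routing idea can be sketched as hash partitioning (simplified; not Storm's actual implementation):

```java
// Sketch of fields grouping: pick a consumer task by hashing the
// grouped field's value modulo the number of target tasks.
// Equal field values always map to the same task index.
class FieldsGrouping {
    static int chooseTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

shuffleGrouping, by contrast, distributes tuples randomly and evenly, with no such per-value affinity.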
distributed rpc
RPC, distributed: the client's arguments enter the topology as [request-id, arguments]; the topology emits [request-id, results], which are returned to the client as results.
public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String input = tuple.getString(1);
        collector.emit(new Values(tuple.getValue(0), input + "!"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "result"));
    }
}
public static void main(String[] args) throws Exception {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3);

    Config conf = new Config();
    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();

    cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));

    System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

    cluster.shutdown();
    drpc.shutdown();
}
realtime analytics, personalization, search, revenue optimization, monitoring

content search, realtime analytics, generating feeds; integrated with ElasticSearch, HBase, Hadoop and HDFS

realtime scoring, moments generation; integrated with Kafka queues and HDFS storage
Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS.
thanks!
mail: [email protected]
twitter: @mariuszgil