streams processing with storm
TRANSCRIPT
Data streams processing with
STORM
Mariusz Gil
data expires fast. very fast
realtime processing?
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
concept architecture
Stream: unbounded sequence of tuples
(val1, val2), (val3, val4), (val5, val6), ...
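A stream is an unbounded sequence of such tuples, each an ordered list of values addressable by field name. A minimal sketch of that idea (illustrative only; Storm's real Tuple interface in backtype.storm.tuple is richer):

```java
import java.util.List;

// Minimal model of a Storm-style tuple: a fixed schema of field names
// plus the positional values for one tuple instance.
class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    // Positional access, like Tuple.getValue(i)
    Object getValue(int i) {
        return values.get(i);
    }

    // Named access, like Tuple.getValueByField(field)
    Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }
}
```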
Spouts: source of streams
Reliable and unreliable Spouts: replay or forget about a tuple
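The reliable-spout contract can be pictured as a pending map keyed by message id: an ack forgets the tuple, a fail puts it back for replay. A simplified sketch of that bookkeeping (assumed names, not Storm's actual API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of reliable-spout semantics: emitted messages stay "pending"
// under a message id until acked; a fail() requeues them for replay.
class ReliableSource {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<Integer, String> pending = new HashMap<>();
    private int nextId = 0;

    ReliableSource(List<String> messages) {
        queue.addAll(messages);
    }

    // Like nextTuple(): emit the next message, remember it by id.
    Integer emit() {
        String msg = queue.poll();
        if (msg == null) return null;
        pending.put(nextId, msg);
        return nextId++;
    }

    // Fully processed by the topology: forget it.
    void ack(int id) {
        pending.remove(id);
    }

    // Processing failed: put it back at the front for replay.
    void fail(int id) {
        queue.addFirst(pending.remove(id));
    }

    int pendingCount() {
        return pending.size();
    }
}
```

An unreliable spout is the same loop without the pending map: once emitted, a tuple is simply forgotten.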
Spouts: ready-made sources of streams
Storm-Kafka
Storm-Kestrel
Storm-AMQP-Spout
Storm-JMS
Storm-PubSub*
Storm-Beanstalkd-Spout
Bolts: process input streams and produce new streams
Topologies: network of spouts and bolts
TextSpout -> SplitSentenceBolt -> WordCountBolt
[sentence] -> [word] -> [word, count]
Topologies: network of spouts and bolts
TextSpout -> SplitSentenceBolt -> WordCountBolt
[sentence] -> [word] -> [word, count]
TextSpout -> xyzBolt
[sentence]
servers architecture
Nimbus: process responsible for distributing processing across the cluster
Supervisors: worker processes responsible for executing a subset of a topology
ZooKeepers: coordination layer between Nimbus and Supervisors
fail fast: cluster state is stored locally or in ZooKeepers
sample code
Spouts
public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) { }

    @Override
    public void fail(Object id) { }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
Bolts
public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
Bolts
public static class ExclamationBolt implements IRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    public void cleanup() { }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    public Map getComponentConfiguration() {
        return null;
    }
}
Topology
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}
Bolts
public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

SplitSentenceBolt().run()
github.com/nathanmarz/storm-starter
streams grouping
Topology
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}
Groupings: shuffle, fields, all, global, none, direct, local-or-shuffle
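fieldsGrouping partitions a stream by the values of the named fields, so every tuple carrying the same word reaches the same WordCount task, which is what keeps the per-task counts correct. The routing idea can be sketched as hash partitioning (simplified; not Storm's actual implementation):

```java
// Sketch of fields grouping: pick a consumer task by hashing the
// grouped field's value modulo the number of target tasks.
// Equal field values always map to the same task index.
class FieldsGrouping {
    static int chooseTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

shuffleGrouping, by contrast, distributes tuples randomly and evenly, with no such per-value affinity.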
distributed rpc
RPC, distributed: the client's arguments enter the topology as [request-id, arguments]; the topology emits [request-id, results], which are returned to the client as results.
public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String input = tuple.getString(1);
        collector.emit(new Values(tuple.getValue(0), input + "!"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "result"));
    }
}
public static void main(String[] args) throws Exception {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3);

    Config conf = new Config();
    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();

    cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));

    System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

    cluster.shutdown();
    drpc.shutdown();
}
realtime analytics, personalization, search, revenue optimization, monitoring

content search, realtime analytics, generating feeds; integrated with ElasticSearch, HBase, Hadoop and HDFS

realtime scoring, moments generation; integrated with Kafka queues and HDFS storage
Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS.
thanks!
mail: [email protected]
twitter: @mariuszgil