Storm real-time processing

Upload: michael-vogiatzis

Post on 16-Apr-2017

Storm

Real-time computation made easy

Michael Vogiatzis

What's Storm?

Distributed real-time computation system

Fault tolerant

Fast

Scalable

Guaranteed message processing

Open source

Multilang capabilities

Purpose

Ok but why?

Motivation

Queues / Workers paradigm

Scaling is hard

System is not robust

Coding is not fun!

No abstraction

Low level message passing

Intermediate message brokers

Use cases

Stream processing

Consume stream, update db, etc

Distributed RPC

Run an intense function on top of Storm

Ongoing computation

Computing music trends on Twitter

Architecture

Elements

Streams

Set of tuples

Unbounded sequence of data

Spout

Source of streams

Bolts

Application logic

Functions

Streaming aggregations, joins, DB ops

Topology
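To make the elements above concrete, here is a minimal plain-Java sketch of how a spout, bolts and a topology relate. It does not use the Storm API: `TopologySketch`, `spout`, `SPLIT_WORDS`, `UPPERCASE` and `run` are hypothetical names invented for illustration, and the input is a finite list only so that the demo terminates (a real stream is unbounded).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for Storm's concepts; none of these are Storm classes.
class TopologySketch {

    /** A "spout": a source that hands out tuples one at a time. */
    static Iterator<String> spout(List<String> source) {
        return source.iterator();
    }

    /** A "bolt": application logic that turns one tuple into zero or more tuples. */
    static final Function<String, List<String>> SPLIT_WORDS =
            line -> Arrays.asList(line.trim().split("\\s+"));

    static final Function<String, List<String>> UPPERCASE =
            word -> Collections.singletonList(word.toUpperCase());

    /** A "topology": the spout wired to a chain of bolts, applied in order. */
    static List<String> run(Iterator<String> spout,
                            List<Function<String, List<String>>> bolts) {
        List<String> current = new ArrayList<>();
        spout.forEachRemaining(current::add);
        for (Function<String, List<String>> bolt : bolts) {
            List<String> next = new ArrayList<>();
            for (String tuple : current) {
                next.addAll(bolt.apply(tuple));
            }
            current = next;
        }
        return current;
    }

    static List<String> demo() {
        List<Function<String, List<String>>> bolts = new ArrayList<>();
        bolts.add(SPLIT_WORDS);
        bolts.add(UPPERCASE);
        return run(spout(Arrays.asList("storm is fast", "storm scales")), bolts);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [STORM, IS, FAST, STORM, SCALES]
    }
}
```

In Storm itself the bolts run in parallel on different workers and tuples flow continuously; the sketch only shows the wiring.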

Storm UI

Demo

Unshorten URLs

Evil Shorteners

Demo

Trident

Higher level of abstraction on top of Storm

Batch processing

Keeps state using your persistence store e.g. DBs, Memcached, etc.

Exactly once semantics

Tuples can be replayed!

Similar API to Pig / Cascading

Trident operations

[Diagram: an Operation, with its input fields and function fields]

Trident operations

Joins

Aggregations

Grouping

Functions

Filtering

Sorting

Trident State

Solid API for reading / writing to stateful sources

State updates are idempotent

Different levels of fault tolerance, depending on the Spout implementation
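One common way to make a state update idempotent when batches can be replayed is to store the id of the last applied batch next to the value. The sketch below illustrates that idea in plain Java. `TxCountStore` and its methods are hypothetical names, not Trident classes, and the sketch assumes that a replayed batch carries the same id and the same tuples as the original and that batches are applied in order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store; not a Trident class.
class TxCountStore {

    /** A value stored together with the id of the last batch that updated it. */
    static final class Entry {
        final long count;
        final long lastTxId;

        Entry(long count, long lastTxId) {
            this.count = count;
            this.lastTxId = lastTxId;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Adds delta for the given batch; re-applying the same batch id is a no-op. */
    void applyBatch(String key, long delta, long txId) {
        Entry old = store.get(key);
        if (old != null && old.lastTxId == txId) {
            return; // this batch was already counted, so a replay changes nothing
        }
        long base = (old == null) ? 0L : old.count;
        store.put(key, new Entry(base + delta, txId));
    }

    long get(String key) {
        Entry e = store.get(key);
        return e == null ? 0L : e.count;
    }

    public static void main(String[] args) {
        TxCountStore s = new TxCountStore();
        s.applyBatch("Female", 2, 1);
        s.applyBatch("Female", 2, 1); // replay of batch 1 is ignored
        s.applyBatch("Female", 3, 2);
        System.out.println(s.get("Female")); // prints 5
    }
}
```

If the spout cannot guarantee that a replayed batch contains the same tuples, this simple check is not enough, which is why the achievable guarantee depends on the spout.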

Learn by example

Compute the Male / Female count on a particular topic on Twitter over time

Trident Gender

Stream of incoming tweets

Filter out the tweets that are not relevant to the topic

Check gender by checking first name

Update either male or female counter

Input (Spout impl.)

Receives public stream (~1% of tweets) and emits them into the system

List tweets;

public void emitBatch(long batchId, TridentCollector collector) {
    for (Object o : tweets)
        collector.emit(new Values(o));
}

Filter

Implement a Filter class called FilterWords

.each(new Fields("status"), new FilterWords(interestingWords))

String[] words = {"instagram", "flickr", "pinterest", "picasa"};

public boolean isKeep(TridentTuple tuple) {
    Tweet t = (Tweet) tuple.getValue(0);
    // is the tweet an interesting one?
    for (String word : words)
        if (t.getText().toLowerCase().contains(word))
            return true;
    return false;
}
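The keep-or-drop decision made by the filter can be tried without a Storm cluster. The following is a plain-Java sketch of the same keyword check. `KeywordFilterSketch` is a hypothetical class that works on tweet text directly rather than on a TridentTuple, and it assumes the keywords are given in lower case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain-Java stand-in for a keyword filter; it does not use any Trident type.
class KeywordFilterSketch {
    private final List<String> words;

    /** Keywords are assumed to be lower case already. */
    KeywordFilterSketch(String... words) {
        this.words = Arrays.asList(words);
    }

    /** Mirrors the keep/drop decision: true when the text mentions any keyword. */
    boolean isKeep(String tweetText) {
        String lower = tweetText.toLowerCase(Locale.ROOT);
        for (String word : words) {
            if (lower.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /** Applies the decision to a whole batch of tweet texts. */
    List<String> filter(List<String> batch) {
        return batch.stream().filter(this::isKeep).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        KeywordFilterSketch f =
                new KeywordFilterSketch("instagram", "flickr", "pinterest", "picasa");
        // prints [New photos on Flickr!, my INSTAGRAM feed]
        System.out.println(f.filter(Arrays.asList(
                "New photos on Flickr!", "Lunch time", "my INSTAGRAM feed")));
    }
}
```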

Function

Implement a function class

.each(new Fields("status"), new ExpandName(), new Fields("name"))

Tuple before:[{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}]

Tuple after: [{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}, Iris]
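The before/after tuples show that a function keeps the input tuple and appends its output as a new field. A minimal sketch of that behaviour, modelling a tuple as a plain `List<Object>`: `ExpandNameSketch`, `firstName` and `execute` are hypothetical names, and taking the text before the first space as the first name is an assumption about what ExpandName does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models a tuple as a List<Object>; hypothetical, not the TridentTuple API.
class ExpandNameSketch {

    /** Takes the text before the first space of a full name. */
    static String firstName(String fullName) {
        String trimmed = fullName.trim();
        int space = trimmed.indexOf(' ');
        return space < 0 ? trimmed : trimmed.substring(0, space);
    }

    /** Returns a copy of the tuple with the first name appended as a new last field. */
    static List<Object> execute(List<Object> tuple, int fullNameIndex) {
        List<Object> out = new ArrayList<>(tuple);
        out.add(firstName((String) tuple.get(fullNameIndex)));
        return out;
    }

    public static void main(String[] args) {
        List<Object> before = Arrays.<Object>asList("Iris HappyWorker");
        System.out.println(execute(before, 0)); // prints [Iris HappyWorker, Iris]
    }
}
```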

State Query

Implement a QueryFunction to query the persistence storage.

.stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
    List<String> batchToQuery = new ArrayList<String>();
    for (TridentTuple t : tuples) {
        String name = t.getStringByField("name");
        batchToQuery.add(name);
    }
    return state.getGenders(batchToQuery);
}
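The point of batchRetrieve is that a whole batch of names is resolved in one call to the store instead of one call per tuple. A rough plain-Java sketch of such a store follows. The slides do not show GenderDB, so `GenderLookupSketch`, its hard-coded names, the `"Unknown"` fallback and the `lookups` counter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory gender lookup standing in for the GenderDB state.
class GenderLookupSketch {
    private final Map<String, String> genders = new HashMap<>();
    int lookups = 0; // number of round trips made to the "database"

    GenderLookupSketch() {
        genders.put("Iris", "Female");
        genders.put("Lena", "Female");
        genders.put("Michael", "Male");
    }

    /** One call resolves a whole batch; the result is aligned with the input order. */
    List<String> getGenders(List<String> names) {
        lookups++;
        List<String> result = new ArrayList<>(names.size());
        for (String name : names) {
            result.add(genders.getOrDefault(name, "Unknown"));
        }
        return result;
    }

    public static void main(String[] args) {
        GenderLookupSketch db = new GenderLookupSketch();
        System.out.println(db.getGenders(Arrays.asList("Iris", "Michael", "Zed")));
        // prints [Female, Male, Unknown]
        System.out.println("lookups = " + db.lookups); // prints lookups = 1
    }
}
```

Keeping the result list in the same order as the input is what lets each gender be attached to the tuple it was requested for.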

State Query

Tuple before: [{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}, Iris]

Tuple after: [{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}, Iris, Female]

Grouping

.groupBy(new Fields("gender"))

Groups the tuples containing the same gender value together

Re-partitions the stream

Tuples are sent over the network

Grouping

Tuples before:
1st Partition: [{TweetJson1}, Iris, Female]
1st Partition: [{TweetJson2}, Michael, Male]
2nd Partition: [{TweetJson3}, Lena, Female]

Group By Gender

Tuples after:
new 1st Partition: [{TweetJson1}, Iris, Female]
new 1st Partition: [{TweetJson3}, Lena, Female]
new 2nd Partition: [{TweetJson2}, Michael, Male]
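The re-partitioning step can be pictured as routing each tuple by a hash of its grouping field, so that tuples with the same gender always reach the same partition. The sketch below shows that idea in plain Java. `GroupByPartitionSketch` is a hypothetical class, tuples are modelled as `List<String>`, and hash-modulo routing is an assumption about the mechanism rather than a description of Storm's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of key-based re-partitioning; not Storm code.
class GroupByPartitionSketch {

    /** Picks the target partition from the key's hash, so equal keys land together. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Redistributes tuples across partitions according to the field at keyIndex. */
    static List<List<List<String>>> repartition(List<List<String>> tuples,
                                                 int keyIndex, int numPartitions) {
        List<List<List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (List<String> tuple : tuples) {
            int target = partitionFor(tuple.get(keyIndex), numPartitions);
            partitions.get(target).add(tuple);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> tuples = new ArrayList<>();
        tuples.add(Arrays.asList("Iris", "Female"));
        tuples.add(Arrays.asList("Michael", "Male"));
        tuples.add(Arrays.asList("Lena", "Female"));
        // Both "Female" tuples end up in the same inner list.
        System.out.println(repartition(tuples, 1, 2));
    }
}
```

With hash routing, two different keys may still share a partition; the only guarantee is that one key is never split across partitions.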

Aggregators (general case)

Run the init() function before processing the batch

Aggregate over a number of tuples (usually already grouped) and emit one or more results from the aggregate or complete methods.

public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}
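The three methods are called in a fixed order for every batch: init once, aggregate once per tuple, complete once at the end. A simplified plain-Java sketch of that lifecycle follows. `BatchAggregator` is a hypothetical local interface that drops the batch id and replaces the collector with a plain list, and `CountAggregatorSketch` is an invented example implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the three-step lifecycle; the collector is a plain List.
interface BatchAggregator<T> {
    T init();
    void aggregate(T state, String tuple);
    void complete(T state, List<Object> emitted);
}

class CountAggregatorSketch implements BatchAggregator<long[]> {

    public long[] init() {
        return new long[] {0L}; // fresh mutable state for each batch
    }

    public void aggregate(long[] state, String tuple) {
        state[0]++; // one more tuple seen in this batch
    }

    public void complete(long[] state, List<Object> emitted) {
        emitted.add(state[0]); // emit the final count for the batch
    }

    /** Drives the lifecycle for one batch: init once, aggregate per tuple, complete once. */
    static <T> List<Object> runBatch(BatchAggregator<T> agg, List<String> batch) {
        T state = agg.init();
        for (String tuple : batch) {
            agg.aggregate(state, tuple);
        }
        List<Object> emitted = new ArrayList<>();
        agg.complete(state, emitted);
        return emitted;
    }

    public static void main(String[] args) {
        // prints [3]
        System.out.println(runBatch(new CountAggregatorSketch(), Arrays.asList("a", "b", "c")));
    }
}
```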

Combiner Aggregator

Run init(TridentTuple t) on every tuple

Apply combine to the resulting values until a single value is left, then return it. If there are no tuples, return zero().

public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) { return 1L; }
    public Long combine(Long val1, Long val2) { return val1 + val2; }
    public Long zero() { return 0L; }
}
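Because combine only needs two partial values, each partition can be folded on its own and the partial results merged afterwards. Here is a small plain-Java sketch of that property. `Combiner` is a hypothetical local interface that mirrors the init/combine/zero shape with a `String` in place of a TridentTuple, and `CombinerSketch.fold` is an invented helper.

```java
import java.util.Arrays;
import java.util.List;

// Local interface mirroring the init/combine/zero shape; not the Trident interface.
interface Combiner<T> {
    T init(String tuple);
    T combine(T a, T b);
    T zero();
}

class CombinerSketch {

    /** Counting: every tuple contributes 1 and partial counts are added. */
    static final Combiner<Long> COUNT = new Combiner<Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };

    /** Folds one partition: start from zero(), combine in init(tuple) for each tuple. */
    static <T> T fold(Combiner<T> c, List<String> partition) {
        T acc = c.zero();
        for (String tuple : partition) {
            acc = c.combine(acc, c.init(tuple));
        }
        return acc;
    }

    public static void main(String[] args) {
        Long partial1 = fold(COUNT, Arrays.asList("t1", "t2")); // first partition
        Long partial2 = fold(COUNT, Arrays.asList("t3"));       // second partition
        System.out.println(COUNT.combine(partial1, partial2));  // prints 3
    }
}
```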

Reducer Aggregator

Run init() to get an initial value

Iterate over the tuples, folding each one into the current value, to emit a single result

public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
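A reducer differs from a combiner in that it folds each tuple directly into the running value. A minimal plain-Java sketch of that loop: `Reducer` is a hypothetical local interface using `String` in place of a TridentTuple, and the "longest text" reducer is an invented example, not one from the talk.

```java
import java.util.Arrays;
import java.util.List;

// Local interface mirroring the init/reduce shape; not the Trident interface.
interface Reducer<T> {
    T init();
    T reduce(T curr, String tuple);
}

class ReducerSketch {

    /** Example reducer: length of the longest text seen so far. */
    static final Reducer<Integer> LONGEST = new Reducer<Integer>() {
        public Integer init() { return 0; }
        public Integer reduce(Integer curr, String tuple) {
            return Math.max(curr, tuple.length());
        }
    };

    /** Starts from init() and folds every tuple into the running value. */
    static <T> T run(Reducer<T> reducer, List<String> tuples) {
        T curr = reducer.init();
        for (String tuple : tuples) {
            curr = reducer.reduce(curr, tuple);
        }
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(run(LONGEST, Arrays.asList("hi", "storm", "ok"))); // prints 5
    }
}
```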

Back to the example

For each gender group, run the Count() aggregator

Not only aggregate, but also store the value in memory

Why?

Over time count

persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
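The reason for storing the count is that each batch only knows its own tuples, so an over-time total has to live outside the batch. The following plain-Java sketch shows per-batch counts being folded into a stored total. `RunningCountSketch` is a hypothetical class using a `HashMap` as the memory store; it ignores replays and fault tolerance entirely.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory running count; not MemoryMapState.
class RunningCountSketch {
    private final Map<String, Long> totals = new HashMap<>();

    /** Counts one batch per gender, then folds the batch counts into the stored totals. */
    void processBatch(List<String> genders) {
        Map<String, Long> batchCounts = new HashMap<>();
        for (String gender : genders) {
            batchCounts.merge(gender, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> e : batchCounts.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    long total(String gender) {
        return totals.getOrDefault(gender, 0L);
    }

    public static void main(String[] args) {
        RunningCountSketch counts = new RunningCountSketch();
        counts.processBatch(Arrays.asList("Female", "Male", "Female"));
        counts.processBatch(Arrays.asList("Male"));
        System.out.println(counts.total("Female")); // prints 2
        System.out.println(counts.total("Male"));   // prints 2
    }
}
```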

Putting it all together

TridentState genderDB = topology.newStaticState(new GenderDBFactory());

Stream gender = topology.newStream("spout", spout)
    .each(new Fields("status"), new FilterWords(topicWords))
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    .parallelismHint(4)
    .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
    .parallelismHint(10)
    .groupBy(new Fields("gender"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .newValuesStream();

Demo

Gender count

Some drawbacks

Hard debugging

There is a pseudo-distributed mode, but still..

Object serialization

When using 3rd party libraries

Register your own serializers for better performance e.g. Kryo

What I didn't cover

Reliability

Guaranteed message processing

Distributed RPC example

Storm-deploy companion

One-click automated deployment of a Storm cluster, e.g. on EC2

Contributions

Overall

Express your realtime needs naturally

Growing community

System rapidly improving

Not a Hadoop/MR competitor

Fun to use

Resources

Storm Unshortening example https://github.com/mvogiatzis/storm-unshortening

Understanding the Storm Parallelism http://bit.ly/RCx4Ln

http://storm-project.net/

https://github.com/nathanmarz/storm

The End

Michael Vogiatzis

Follow me @mvogiatzis

Q & A

30/4/2013