storm: distributed and fault-tolerant realtime computation

75
Nathan Marz Twitter Distributed and fault-tolerant realtime computation Storm

Upload: nathanmarz

Post on 08-Sep-2014

90.557 views

Category:

Technology


4 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Storm: distributed and fault-tolerant realtime computation

Nathan MarzTwitter

Distributed and fault-tolerant realtime computation

Storm

Page 2: Storm: distributed and fault-tolerant realtime computation

Storm at Twitter

Twitter Web Analytics

Page 3: Storm: distributed and fault-tolerant realtime computation

Before Storm

Queues Workers

Page 4: Storm: distributed and fault-tolerant realtime computation

Example

(simplified)

Page 5: Storm: distributed and fault-tolerant realtime computation

Example

Workers schemify tweets and append to Hadoop

Page 6: Storm: distributed and fault-tolerant realtime computation

Example

Workers update statistics on URLs by incrementing counters in Cassandra

Page 7: Storm: distributed and fault-tolerant realtime computation

Example

Distribute tweets randomly on multiple queues

Page 8: Storm: distributed and fault-tolerant realtime computation

Example

Workers share the load of schemifying tweets

Page 9: Storm: distributed and fault-tolerant realtime computation

Example

Desire all updates for same URL go to same worker

Page 10: Storm: distributed and fault-tolerant realtime computation

Message locality

• Because:

• No transactions in Cassandra (and no atomic increments at the time)

• More effective batching of updates

Page 11: Storm: distributed and fault-tolerant realtime computation

• Have a queue for each consuming worker

• Choose queue for a URL using consistent hashing

Implementing message locality

Page 12: Storm: distributed and fault-tolerant realtime computation

Example

Workers choose queue to enqueue to using hash/mod of URL

Page 13: Storm: distributed and fault-tolerant realtime computation

Example

All updates for same URL guaranteed to go to same worker

Page 14: Storm: distributed and fault-tolerant realtime computation

Adding a worker

Page 15: Storm: distributed and fault-tolerant realtime computation

Adding a workerDeploy

Reconfigure/redeploy

Page 16: Storm: distributed and fault-tolerant realtime computation

Problems

• Scaling is painful

• Poor fault-tolerance

• Coding is tedious

Page 17: Storm: distributed and fault-tolerant realtime computation

What we want

• Guaranteed data processing

• Horizontal scalability

• Fault-tolerance

• No intermediate message brokers!

• Higher level abstraction than message passing

• “Just works”

Page 18: Storm: distributed and fault-tolerant realtime computation

Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

Page 19: Storm: distributed and fault-tolerant realtime computation

Streamprocessing

Continuouscomputation

DistributedRPC

Use cases

Page 20: Storm: distributed and fault-tolerant realtime computation

Storm Cluster

Page 21: Storm: distributed and fault-tolerant realtime computation

Storm Cluster

Master node (similar to Hadoop JobTracker)

Page 22: Storm: distributed and fault-tolerant realtime computation

Storm Cluster

Used for cluster coordination

Page 23: Storm: distributed and fault-tolerant realtime computation

Storm Cluster

Run worker processes

Page 24: Storm: distributed and fault-tolerant realtime computation

Starting a topology

Page 25: Storm: distributed and fault-tolerant realtime computation

Killing a topology

Page 26: Storm: distributed and fault-tolerant realtime computation

Concepts

• Streams

• Spouts

• Bolts

• Topologies

Page 27: Storm: distributed and fault-tolerant realtime computation

Streams

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Page 28: Storm: distributed and fault-tolerant realtime computation

Spouts

Source of streams

Page 29: Storm: distributed and fault-tolerant realtime computation

Spout examples

• Read from Kestrel queue

• Read from Twitter streaming API

Page 30: Storm: distributed and fault-tolerant realtime computation

Bolts

Processes input streams and produces new streams

Page 31: Storm: distributed and fault-tolerant realtime computation

Bolts

• Functions

• Filters

• Aggregation

• Joins

• Talk to databases

Page 32: Storm: distributed and fault-tolerant realtime computation

Topology

Network of spouts and bolts

Page 33: Storm: distributed and fault-tolerant realtime computation

Tasks

Spouts and bolts execute as many tasks across the cluster

Page 34: Storm: distributed and fault-tolerant realtime computation

Stream grouping

When a tuple is emitted, which task does it go to?

Page 35: Storm: distributed and fault-tolerant realtime computation

Stream grouping

• Shuffle grouping: pick a random task

• Fields grouping: consistent hashing on a subset of tuple fields

• All grouping: send to all tasks

• Global grouping: pick task with lowest id

Page 36: Storm: distributed and fault-tolerant realtime computation

Topology

shuffle

[“url”]

shuffle

shuffle

[“id1”, “id2”]

all

Page 37: Storm: distributed and fault-tolerant realtime computation

Streaming word count

TopologyBuilder is used to construct topologies in Java

Page 38: Storm: distributed and fault-tolerant realtime computation

Streaming word count

Define a spout in the topology with parallelism of 5 tasks

Page 39: Storm: distributed and fault-tolerant realtime computation

Streaming word count

Split sentences into words with parallelism of 8 tasks

Page 40: Storm: distributed and fault-tolerant realtime computation

Consumer decides what data it receives and how it gets grouped

Streaming word count

Split sentences into words with parallelism of 8 tasks

Page 41: Storm: distributed and fault-tolerant realtime computation

Streaming word count

Create a word count stream

Page 42: Storm: distributed and fault-tolerant realtime computation

Streaming word count

splitsentence.py

Page 43: Storm: distributed and fault-tolerant realtime computation

Streaming word count

Page 44: Storm: distributed and fault-tolerant realtime computation

Streaming word count

Submitting topology to a cluster

Page 45: Storm: distributed and fault-tolerant realtime computation

Streaming word count

Running topology in local mode

Page 46: Storm: distributed and fault-tolerant realtime computation

Demo

Page 47: Storm: distributed and fault-tolerant realtime computation

Traditional data processing

Page 48: Storm: distributed and fault-tolerant realtime computation

Traditional data processing

Intense processing (Hadoop, databases, etc.)

Page 49: Storm: distributed and fault-tolerant realtime computation

Traditional data processing

Light processing on a single machine to resolve queries

Page 50: Storm: distributed and fault-tolerant realtime computation

Distributed RPC

Distributed RPC lets you do intense processing at query-time

Page 51: Storm: distributed and fault-tolerant realtime computation

Game changer

Page 52: Storm: distributed and fault-tolerant realtime computation

Distributed RPC

Data flow for Distributed RPC

Page 53: Storm: distributed and fault-tolerant realtime computation

DRPC Example

Computing “reach” of a URL on the fly

Page 54: Storm: distributed and fault-tolerant realtime computation

Reach

Reach is the number of unique peopleexposed to a URL on Twitter

Page 55: Storm: distributed and fault-tolerant realtime computation

Computing reach

URL

Tweeter

Tweeter

Tweeter

Follower

Follower

Follower

Follower

Follower

Follower

Distinct follower

Distinct follower

Distinct follower

Count Reach

Page 56: Storm: distributed and fault-tolerant realtime computation

Reach topology

Page 57: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

“Tuple tree”

Page 58: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

• A spout tuple is not fully processed until all tuples in the tree have been completed

Page 59: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

• If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

Page 60: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

Reliability API

Page 61: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

“Anchoring” creates a new edge in the tuple tree

Page 62: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

Marks a single node in the tree as complete

Page 63: Storm: distributed and fault-tolerant realtime computation

Guaranteeing message processing

• Storm tracks tuple trees for you in an extremely efficient way

Page 64: Storm: distributed and fault-tolerant realtime computation

Storm UI

Page 65: Storm: distributed and fault-tolerant realtime computation

Storm UI

Page 66: Storm: distributed and fault-tolerant realtime computation

Storm UI

Page 67: Storm: distributed and fault-tolerant realtime computation

Storm on EC2

https://github.com/nathanmarz/storm-deploy

One-click deploy tool

Page 68: Storm: distributed and fault-tolerant realtime computation

Documentation

Page 69: Storm: distributed and fault-tolerant realtime computation

Synchronize a large amount of frequently changing state into a topology

State spout (almost done)

Page 70: Storm: distributed and fault-tolerant realtime computation

State spout (almost done)

Optimizing reach topology by eliminating the database calls

Page 71: Storm: distributed and fault-tolerant realtime computation

State spout (almost done)

Each GetFollowers task keeps a synchronous cache of a subset of the social graph

Page 72: Storm: distributed and fault-tolerant realtime computation

State spout (almost done)

This works because GetFollowers repartitions the social graph the same way it partitions GetTweeter’s stream

Page 73: Storm: distributed and fault-tolerant realtime computation

Future work

• Storm on Mesos

• “Swapping”

• Auto-scaling

• Higher level abstractions

Page 74: Storm: distributed and fault-tolerant realtime computation

Questions?

http://github.com/nathanmarz/storm

Page 75: Storm: distributed and fault-tolerant realtime computation

What Storm does

• Distributes code and configurations

• Robust process management

• Monitors topologies and reassigns failed tasks

• Provides reliability by tracking tuple trees

• Routing and partitioning of streams

• Serialization

• Fine-grained performance stats of topologies