
Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan

@WalmartLabs

Muppet: Scalable MapUpdate data-stream processing

Road Map

• Motivation

• The MapUpdate framework

• An example data-stream computation

• Muppet implementation


The challenge

• Growing numbers of large, fast data streams

– 300+ million Twitter status updates daily

– 5+ million Foursquare checkins daily

– 3+ billion Facebook Likes and comments daily

• Streams never stop

• Growing numbers of applications for data streams

– Computations need to scale with the data

– Applications need to stay up-to-date (“What’s going on now?”)

• Machines fail


The wish list

• Deliver low-latency processing

– Application stays near real-time with its input stream

– Computed data can be queried live

• Scale up on commodity hardware with computation and stream rate

• Easy to program

– Simple model to enable rapid development of many applications

– Ideally resembles the widely adopted MapReduce


Data-stream computation

• Big data: MapReduce (Hadoop)

– Map and Reduce steps

– Batch process large input (e.g., from HDFS)

– Hadoop distributes computation

• Fast data: MapUpdate (Muppet)

– Map and Update steps

– Continuously process streaming input (e.g., from network)

– Muppet maintains computation and manages memory/storage


The MapReduce framework (Hadoop)

• Event

– A <key, value> pair of data

• Map

– A function that performs (stateless) computation on incoming events

• Reduce

– A function that combines all input for a particular key

• Application

– Map -> Reduce


The MapUpdate framework (Muppet)

• Event

– A <key, value> pair of data

• Map

– A function that performs (stateless) computation on incoming events

• Update

– A function that updates a slate using incoming events

• Application

– A directed graph of Mappers and Updaters
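
The definitions above can be illustrated with a minimal single-process sketch in Perl. This is not Muppet's API: the `run_flow` helper, the `retailer_count` updater name, the in-memory slate hash, and the venue ids 1001 and 1002 are all assumptions made for illustration. A map function receives an event and may publish new events; an update function receives an event together with the slate stored under that event's key and returns the new slate.

```perl
use strict;
use warnings;

# Hypothetical single-process runner (not Muppet's API).
# $flow maps a stream name to the performers subscribed to that stream.
sub run_flow {
    my ($flow, $slates, @queue) = @_;
    while (my $item = shift @queue) {
        my ($stream, $key, $value) = @$item;
        for my $performer (@{ $flow->{$stream} || [] }) {
            if ($performer->{type} eq 'map') {
                # Mappers keep no state: they only publish new events.
                push @queue, $performer->{code}->($key, $value);
            } else {
                # Updaters read and rewrite the slate stored under the event key.
                my $slot = \$slates->{ $performer->{name} }{$key};
                $$slot = $performer->{code}->($key, $value, $$slot || {});
            }
        }
    }
    return $slates;
}

my %flow = (
    FoursquareCheckin => [ { type => 'map', code => sub {
        my ($venue_id, $venue_name) = @_;
        # Publish a retailer event only when the venue name matches.
        return $venue_name =~ /wal.*mart/i
            ? ([ 'FoursquareRetailerCheckin', 'Walmart', 1 ])
            : ();
    } } ],
    FoursquareRetailerCheckin => [ { type => 'update', name => 'retailer_count', code => sub {
        my ($retailer, $increment, $slate) = @_;
        $slate->{count} += $increment;
        return $slate;
    } } ],
);

my $slates = run_flow(\%flow, {},
    [ 'FoursquareCheckin', 453407, 'Walmart Neighborhood Market' ],
    [ 'FoursquareCheckin', 1001,   'Corner Cafe' ],           # made-up venue
    [ 'FoursquareCheckin', 1002,   'Wal-Mart Supercenter' ],  # made-up venue
);
print "Walmart: $slates->{retailer_count}{Walmart}{count}\n";   # prints "Walmart: 2"
```

The flow graph here is just a hash from stream name to subscribers, which is enough to express a directed graph of Mappers and Updaters in one process; distribution, queuing, and durable slates are left out.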


A MapUpdate application


An example Muppet application

Checkin counts on Foursquare

• Identify Foursquare checkins at various retailers

• Maintain a live count of retailer checkins

• Enable a display of the current counts at any time



Checkin counts on Foursquare

• Source: Read Foursquare stream and create key-value-pair events.

• Map: For each checkin event, identify a retailer and publish if found.

• Update: For each retailer checkin, increment appropriate count.

Updater slates hold live retailer check-in counts.



• Source: Read Foursquare stream and create key-value-pair events.

Input (excerpt):

{
  "checkin": {
    "created": 1288052432,
    "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
  }
}



• Source: Read Foursquare stream and create key-value-pair events.

Output:

453407,
{
  "checkin": {
    "created": 1288052432,
    "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
  }
}



• Map: For each checkin event, identify a retailer and publish if found.

Input:

453407,
{
  "checkin": {
    "created": 1288052432,
    "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
  }
}



• Map: For each checkin event, identify a retailer and publish if found.

Output:

Walmart.1288052100,
{
  "checkin": {
    "created": 1288052432,
    "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
  },
  "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" }
}



• Update: For each retailer checkin, increment appropriate count.

Input:

Walmart.1288052100,
{
  "checkin": {
    "created": 1288052432,
    "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
  },
  "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" }
}



• Update: For each retailer checkin, increment appropriate count.

Slate:

Walmart.1288052100,
{ "retailer": "Walmart", "timeslot": 1288052100, "interval": 900, "count": 1 }


The Source (stream receiver)

while ($checkin = <$sock>) {
    # Drop anything before the first "{"; a line with no JSON becomes empty.
    $checkin =~ s/^[^{]*//;
    next if ($checkin eq "");
    $checkin_count++;

    my $event;
    eval { $event = decode_json($checkin); };
    if ($@ or (!defined($event->{checkin}))) {
        $invalid_count++;
    } else {
        # Publish the whole event, keeping the "checkin" wrapper that the
        # example output shows and that the mapper reads, keyed by venue id.
        my $venue = $event->{checkin}->{venue}->{id};
        $self->publish("FoursquareCheckin", $event, $venue);
    }
}


The Map (Foursquare::CheckinMapper)

sub map {
    my $self = shift;
    my $event = shift;

    # Round the checkin time down to the start of its 900-second (15-minute) slot.
    my $checkin = $event->{checkin};
    my $timeslot = int($checkin->{created} / 900) * 900;
    $event->{kosmix}->{timeslot} = $timeslot;
    $event->{kosmix}->{interval} = 900;

    # Match the venue name against known retailers; 0 means no match.
    my $venue_name = $checkin->{venue}->{name};
    my $retailer = 0;
    $retailer = 'ToysRUs' if ($venue_name =~ /toys.*r.*us/i);
    $retailer = 'Walmart' if ($venue_name =~ /wal.*mart/i);
    $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);
    if ($retailer) {
        # Key the event by "<retailer>.<timeslot>" so that each retailer
        # gets one slate per time slot.
        $event->{kosmix}->{retailer} = $retailer;
        $self->publish("FoursquareRetailerCheckin", $event, $retailer.".".$timeslot);
    }
}
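
The key arithmetic in the mapper can be checked on its own. The sketch below is standalone Perl with two hypothetical helpers, `timeslot` and `slate_key`, that mirror the mapper's rounding and key construction. The first timestamp is the sample checkin time from the slides; the other two are made-up values chosen to fall at the end of the same slot and at the start of the next one.

```perl
use strict;
use warnings;

# Round a Unix timestamp down to the start of its interval (900 s = 15 min).
sub timeslot {
    my ($created, $interval) = @_;
    return int($created / $interval) * $interval;
}

# Build the updater key the way the mapper does: "<retailer>.<timeslot>".
sub slate_key {
    my ($retailer, $created) = @_;
    return $retailer . "." . timeslot($created, 900);
}

# 1288052432 is the sample checkin; the other timestamps are illustrative.
for my $created (1288052432, 1288052999, 1288053000) {
    print slate_key("Walmart", $created), "\n";
}
# Prints:
#   Walmart.1288052100
#   Walmart.1288052100
#   Walmart.1288053000
```

Checkins that fall in the same 15-minute slot share a key and therefore a slate, so the updater's count is a per-retailer, per-slot total.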


The Update (Foursquare::RetailerUpdater)

use Muppet::Updater;
package Foursquare::RetailerUpdater;
@ISA = qw( Muppet::Updater );

use strict;

sub update {
    my $self = shift;
    my $event = shift;
    my $slate = shift;
    my $config = shift;
    my $key = shift;

    $slate->{timeslot} = $event->{kosmix}->{timeslot};
    $slate->{interval} = $event->{kosmix}->{interval};
    $slate->{retailer} = $event->{kosmix}->{retailer};
    $slate->{count} += 1;

    return $slate;
}

1;


The application configuration (flow graph)

{
  "performer" : "foursquare_mapper",
  "type" : "perl",
  "class" : "Foursquare::CheckinMapper",
  "muppet_type" : "Mapper",
  "subscribes_to" : [ "FoursquareCheckin" ],
  "publishes_to" : [ "FoursquareRetailerCheckin" ]
},
{
  "performer" : "foursquare_retailer",
  "type" : "perl",
  "class" : "Foursquare::RetailerUpdater",
  "muppet_type" : "Updater",
  "workers" : 4,
  "slate_cache_max" : 10000,
  "slate_cache_write_after" : 1,
  "subscribes_to" : [ "FoursquareRetailerCheckin" ]
}


Example results


Implementation


Implementation

• Slate management

– Slates are cached for performance

– Cache is sharded by key for load distribution across machines

– Slates are written to distributed key-value store for durability

• Event flow

– Event queues buffer transient load spikes within an application

– Host failover remaps load away from an unresponsive machine
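
The slate-management points above (cache, shard by key, write to a key-value store) can be sketched in a few lines of Perl. Everything here is an assumption for illustration rather than Muppet's implementation: the worker count, the byte-sum checksum used to pick a worker, the `worker_for`, `fetch_slate`, and `bump` helpers, and the plain hash standing in for the distributed store. The sketch writes through on every update, which is only one possible policy.

```perl
use strict;
use warnings;

my $workers = 4;                          # assumed worker count
my @caches  = map { +{} } 1 .. $workers;  # one in-memory slate cache per worker
my %store;                                # stand-in for a durable key-value store

# Pick the worker that owns a key using a simple byte-sum checksum.
sub worker_for {
    my ($key) = @_;
    return unpack("%32C*", $key) % $workers;
}

# Return the cached slate, loading a copy from the store on a cache miss.
sub fetch_slate {
    my ($key) = @_;
    my $cache = $caches[ worker_for($key) ];
    $cache->{$key} //= { %{ $store{$key} || {} } };
    return $cache->{$key};
}

# Apply an update in the cache, then write the slate through to the store.
sub bump {
    my ($key) = @_;
    my $slate = fetch_slate($key);
    $slate->{count}++;
    $store{$key} = { %$slate };
    return $slate->{count};
}

bump("Walmart.1288052100") for 1 .. 3;
bump("SamsClub.1288052100");
printf "%s -> worker %d, count %d\n", $_, worker_for($_), $store{$_}{count}
    for sort keys %store;
```

Because the owning worker is a pure function of the key, every event for a given key reaches the same cache, so the cached slate never competes with a copy on another worker.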


Challenges

• Host failover

• Hotspots (uneven load)

• Parallelization

• Slate caching

• Overload stability


Hotspots

• Some key distributions are highly nonuniform (e.g., Zipfian)

– Keys based on natural-language word usage

– Keys based on a set of varying popularity

• Mappers: Run any event anywhere.

• Updaters: Popular keys need access to the same slate.

– Split associative and commutative computations

• Split computation parallelizes partial results.

• Propagate partial results to final result.

– Reduce slate serialization/deserialization overhead
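
Splitting an associative and commutative computation can be sketched as follows. This is a minimal illustration and not Muppet's mechanism: the `#<n>` suffix on partial keys, the round-robin choice of partial slate, the split count of 3, and the `update_partial` and `merge_partials` helpers are all assumptions. Events for one hot key are spread over several partial slates, which could be updated in parallel, and the partial counts are then combined into the final result.

```perl
use strict;
use warnings;

my $splits = 3;   # assumed number of partial slates per hot key
my %slates;       # in-memory stand-in for updater slates
my $next = 0;     # round-robin cursor, which keeps this example deterministic

# First stage: send each event for a hot key to one of several partial slates.
sub update_partial {
    my ($key) = @_;
    my $partial_key = $key . "#" . ($next++ % $splits);
    $slates{$partial_key}{count}++;
    return $partial_key;
}

# Second stage: combine the partial counts into the final result.
# Addition is associative and commutative, so merge order does not matter.
sub merge_partials {
    my ($key) = @_;
    my $total = 0;
    $total += $slates{"$key#$_"}{count} || 0 for 0 .. $splits - 1;
    return $total;
}

update_partial("Walmart.1288052100") for 1 .. 10;
print "partials: ",
    join(",", map { $slates{"Walmart.1288052100#$_"}{count} } 0 .. $splits - 1), "\n";
print "total: ", merge_partials("Walmart.1288052100"), "\n";
# Prints:
#   partials: 4,3,3
#   total: 10
```

The same split only works when partial results can be merged in any order, which is why the slide restricts it to associative and commutative computations such as counts and sums.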


Usage

• Time

– Running since mid-2010

• Developers

– More than a dozen developers at WalmartLabs have used Muppet to develop their applications

• Data

– Billions of events, tens of millions of slates processed


Related work

• MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)

– MapReduce Online (Condie et al.)

– Nova (Olston et al.)

• Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer

– S4 (Neumeyer et al.)

– Storm (Marz et al.)

• Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)

– Aurora (StreamBase Systems) (Zdonik et al.)

– SPADE for System S (InfoSphere Streams) (Gedik et al.)


Conclusion

Big Data : MapReduce :: Fast Data : MapUpdate

Create soft-real-time applications on a simple programming model.

Distributed stream-processing infrastructure scales computation across cores.


Muppet: Scalable data-stream processing

Big Fast Data @WalmartLabs
