Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan
@WalmartLabs
Muppet: Scalable MapUpdate data-stream processing
Road Map
• Motivation
• The MapUpdate framework
• An example data-stream computation
• Muppet implementation
The challenge
• Growing numbers of large, fast data streams
– 300+ million Twitter status updates daily
– 5+ million Foursquare checkins daily
– 3+ billion Facebook Likes and comments daily
• Streams never stop
• Growing numbers of applications for data streams
– Computations need to scale with the data
– Applications need to stay up-to-date (“What’s going on now?”)
• Machines fail
The wish list
• Deliver low-latency processing
– Application stays near real-time with its input stream
– Computed data can be queried live
• Scale up on commodity hardware as computation and stream rate grow
• Easy to program
– Simple model to enable rapid development of many applications
– Ideally resembles the widely adopted MapReduce
Data-stream computation
• Big data: MapReduce (Hadoop)
– Map and Reduce steps
– Batch process large input (e.g., from HDFS)
– Hadoop distributes computation
• Fast data: MapUpdate (Muppet)
– Map and Update steps
– Continuously process streaming input (e.g., from network)
– Muppet maintains computation and manages memory/storage
The MapReduce framework (Hadoop)
• Event
– A <key, value> pair of data
• Map
– A function that performs (stateless) computation on incoming events
• Reduce
– A function that combines all input for a particular key
• Application
– Map -> Reduce
The MapUpdate framework (Muppet)
• Event
– A <key, value> pair of data
• Map
– A function that performs (stateless) computation on incoming events
• Update
– A function that updates a slate (the state kept for one key of one updater) using incoming events
• Application
– A directed graph of Mappers and Updaters
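The roles of events, mappers, updaters, and slates can be sketched in a few lines of Python. This is a toy single-process runner written for illustration only: the class `MapUpdateApp`, its methods, and the stream and performer names are all hypothetical and are not Muppet's API. The word count is just a familiar stand-in computation.

```python
from collections import defaultdict, deque


class MapUpdateApp:
    """Toy single-process MapUpdate runner (hypothetical; not Muppet's API)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # stream -> [(kind, name, fn)]
        self.slates = defaultdict(dict)       # (updater name, key) -> slate
        self.queue = deque()                  # pending (stream, key, value) events

    def add_mapper(self, name, stream, fn):
        self.subscribers[stream].append(("map", name, fn))

    def add_updater(self, name, stream, fn):
        self.subscribers[stream].append(("update", name, fn))

    def publish(self, stream, key, value):
        self.queue.append((stream, key, value))

    def run(self):
        # Deliver each queued event to every performer subscribed to its stream.
        while self.queue:
            stream, key, value = self.queue.popleft()
            for kind, name, fn in self.subscribers[stream]:
                if kind == "map":
                    fn(self, key, value)  # stateless: sees only the event
                else:
                    # An updater also receives the slate for (updater, key).
                    fn(self, key, value, self.slates[(name, key)])


def split_words(app, key, line):
    # Mapper: publish one event per word, keyed by the word.
    for word in line.split():
        app.publish("words", word, 1)


def count(app, key, value, slate):
    # Updater: keep a running count in the slate for this key.
    slate["count"] = slate.get("count", 0) + value


app = MapUpdateApp()
app.add_mapper("splitter", "lines", split_words)
app.add_updater("counter", "words", count)
app.publish("lines", "line-1", "fast data fast")
app.run()
print(app.slates[("counter", "fast")])  # {'count': 2}
```

Because a mapper may publish to a stream that another mapper or updater subscribes to, chaining performers this way yields the directed graph described above.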
An example Muppet application
Checkin counts on Foursquare
• Identify Foursquare checkins at various retailers
• Maintain a live count of retailer checkins
• Enable a display of the current counts at any time
• Source: Read Foursquare stream and create key-value-pair events.
• Map: For each checkin event, identify a retailer and publish if found.
• Update: For each retailer checkin, increment appropriate count.
Updater slates hold live retailer check-in counts.
• Source: Read Foursquare stream and create key-value-pair events.
– Input (excerpt):
{ "checkin": { "created": 1288052432,
               "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } } }
– Output:
453407,
{ "checkin": { "created": 1288052432,
               "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } } }
• Map: For each checkin event, identify a retailer and publish if found.
– Input: the Source output above
– Output:
Walmart.1288052100,
{ "checkin": { "created": 1288052432,
               "venue": { "id": 453407, "name": "Walmart Neighborhood Market" } },
  "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" } }
• Update: For each retailer checkin, increment appropriate count.
– Input: the Map output above
– Slate:
Walmart.1288052100,
{ "retailer": "Walmart", "timeslot": 1288052100, "interval": 900, "count": 1 }
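The walk-through can be reproduced end to end with a short Python sketch. The functions `source`, `map_checkin`, and `update` are hypothetical stand-ins for the three steps, not Muppet code, and a plain dictionary stands in for the slate storage. The timeslot is the checkin time rounded down to a multiple of 900 seconds, which gives 1288052100 for the sample event.

```python
import json
import re

INTERVAL = 900  # seconds per timeslot (15 minutes), as in the walk-through

# Only Walmart is exercised by the sample event.
RETAILERS = [("Walmart", re.compile(r"wal.*mart", re.IGNORECASE))]


def source(line):
    """Parse one raw checkin and key it by venue id."""
    event = json.loads(line)
    return event["checkin"]["venue"]["id"], event


def map_checkin(event):
    """Tag a retailer checkin and re-key it as '<retailer>.<timeslot>'.

    Returns None when the venue matches no known retailer.
    """
    checkin = event["checkin"]
    timeslot = checkin["created"] // INTERVAL * INTERVAL
    for retailer, pattern in RETAILERS:
        if pattern.search(checkin["venue"]["name"]):
            event["kosmix"] = {
                "timeslot": timeslot,
                "interval": INTERVAL,
                "retailer": retailer,
            }
            return f"{retailer}.{timeslot}", event
    return None


def update(event, slate):
    """Fold one retailer checkin into the slate for its key."""
    slate.update(event["kosmix"])
    slate["count"] = slate.get("count", 0) + 1
    return slate


raw = ('{ "checkin": { "created": 1288052432, "venue": '
       '{ "id": 453407, "name": "Walmart Neighborhood Market" } } }')

slates = {}  # key -> slate; stands in for the updater's slate storage
venue_id, event = source(raw)
mapped = map_checkin(event)
if mapped is not None:
    key, event = mapped
    slates[key] = update(event, slates.get(key, {}))

print(key)          # Walmart.1288052100
print(slates[key])
```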
The Source (stream receiver)
# decode_json comes from the JSON module.
while ($checkin = <$sock>) {
    # Drop anything before the first "{" and skip lines with no JSON.
    $checkin =~ s/^[^{]*//;
    next if ($checkin eq "");
    $checkin_count++;

    my $event;
    eval { $event = decode_json($checkin); };
    if ($@ or (!defined($event->{checkin}))) {
        $invalid_count++;
    } else {
        # Key the event by venue id. Publish the whole event, since the
        # mapper reads $event->{checkin}.
        my $venue = $event->{checkin}->{venue}->{id};
        $self->publish("FoursquareCheckin", $event, $venue);
    }
}
The Map (Foursquare::CheckinMapper)
sub map {
    my $self  = shift;
    my $event = shift;

    # Round the checkin time down to the start of its 900-second timeslot.
    my $checkin  = $event->{checkin};
    my $timeslot = int($checkin->{created} / 900) * 900;
    $event->{kosmix}->{timeslot} = $timeslot;
    $event->{kosmix}->{interval} = 900;

    # Identify the retailer by venue name; if several patterns match,
    # the last match wins.
    my $venue_name = $checkin->{venue}->{name};
    my $retailer = 0;
    $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
    $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
    $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

    # Publish only recognized retailers, keyed by "<retailer>.<timeslot>".
    if ($retailer) {
        $event->{kosmix}->{retailer} = $retailer;
        $self->publish("FoursquareRetailerCheckin", $event,
                       $retailer.".".$timeslot);
    }
}
The Update (Foursquare::RetailerUpdater)
use Muppet::Updater;
package Foursquare::RetailerUpdater;
@ISA = qw( Muppet::Updater );

use strict;

sub update {
    my $self   = shift;
    my $event  = shift;
    my $slate  = shift;
    my $config = shift;
    my $key    = shift;

    $slate->{timeslot} = $event->{kosmix}->{timeslot};
    $slate->{interval} = $event->{kosmix}->{interval};
    $slate->{retailer} = $event->{kosmix}->{retailer};
    $slate->{count} += 1;

    return $slate;
}
1;
The application configuration (flow graph)
{
    "performer"     : "foursquare_mapper",
    "type"          : "perl",
    "class"         : "Foursquare::CheckinMapper",
    "muppet_type"   : "Mapper",
    "subscribes_to" : [ "FoursquareCheckin" ],
    "publishes_to"  : [ "FoursquareRetailerCheckin" ]
},
{
    "performer"     : "foursquare_retailer",
    "type"          : "perl",
    "class"         : "Foursquare::RetailerUpdater",
    "muppet_type"   : "Updater",
    "workers"       : 4,
    "slate_cache_max"         : 10000,
    "slate_cache_write_after" : 1,
    "subscribes_to" : [ "FoursquareRetailerCheckin" ]
}
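One way to read this configuration is that `subscribes_to` and `publishes_to` together determine the edges of the flow graph. The Python sketch below derives those edges. It assumes the two performer objects are wrapped in a JSON array, which the slide does not show, and keeps only the fields needed for routing. The function names are hypothetical, and this is not how Muppet itself loads its configuration.

```python
import json

# Assumption: the performer entries form a JSON array.
CONFIG = json.loads("""
[
  {"performer": "foursquare_mapper", "muppet_type": "Mapper",
   "subscribes_to": ["FoursquareCheckin"],
   "publishes_to": ["FoursquareRetailerCheckin"]},
  {"performer": "foursquare_retailer", "muppet_type": "Updater",
   "subscribes_to": ["FoursquareRetailerCheckin"]}
]
""")


def subscribers_by_stream(performers):
    """Map each stream name to the performers that subscribe to it."""
    routes = {}
    for performer in performers:
        for stream in performer.get("subscribes_to", []):
            routes.setdefault(stream, []).append(performer["performer"])
    return routes


def flow_edges(performers):
    """List (publisher, stream, subscriber) edges of the flow graph."""
    routes = subscribers_by_stream(performers)
    return [
        (performer["performer"], stream, target)
        for performer in performers
        for stream in performer.get("publishes_to", [])
        for target in routes.get(stream, [])
    ]


routes = subscribers_by_stream(CONFIG)
edges = flow_edges(CONFIG)
print(routes)
print(edges)
```

The stream `FoursquareCheckin` has no publisher among the performers because the Source feeds it from outside the graph.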
Implementation
• Slate management
– Slates are cached for performance
– Cache is sharded by key for load distribution across machines
– Slates are written to distributed key-value store for durability
• Event flow
– Event queues buffer transient load spikes within an application
– Host failover remaps load away from an unresponsive machine
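The bullets above can be illustrated with a small Python sketch, under loud assumptions: a dictionary stands in for the distributed key-value store, the cache evicts the least recently used slate, and routing hashes each key with CRC32 and skips hosts marked as down. The names `SlateCache` and `route` are hypothetical, and none of these choices is confirmed as Muppet's actual policy.

```python
import json
import zlib
from collections import OrderedDict


def route(key, hosts, down=frozenset()):
    """Pick the host for a key; walk to the next live host on failure."""
    start = zlib.crc32(key.encode("utf-8")) % len(hosts)
    for step in range(len(hosts)):
        host = hosts[(start + step) % len(hosts)]
        if host not in down:
            return host
    raise RuntimeError("no live hosts")


class SlateCache:
    """In-memory slate cache backed by a durable store (a dict here)."""

    def __init__(self, store, max_slates):
        if max_slates < 1:
            raise ValueError("max_slates must be at least 1")
        self.store = store
        self.max_slates = max_slates
        self.cache = OrderedDict()  # key -> slate, least recently used first

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
        else:
            raw = self.store.get(key)
            self.cache[key] = json.loads(raw) if raw is not None else {}
            self._evict()
        return self.cache[key]

    def flush(self, key):
        """Write one cached slate to the store for durability."""
        self.store[key] = json.dumps(self.cache[key])

    def _evict(self):
        # Persist and drop the oldest slates once the cache is over its limit.
        while len(self.cache) > self.max_slates:
            old_key, slate = self.cache.popitem(last=False)
            self.store[old_key] = json.dumps(slate)


hosts = ["host-a", "host-b", "host-c"]
primary = route("Walmart.1288052100", hosts)
fallback = route("Walmart.1288052100", hosts, down={primary})

store = {}
cache = SlateCache(store, max_slates=2)
cache.get("k1")["count"] = 1
cache.get("k2")
cache.get("k3")                  # evicts k1, writing it to the store
reloaded = cache.get("k1")       # read back from the store
```

With this routing, only keys whose primary host is down move to another host; keys on live hosts keep their placement and their cached slates.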
Challenges
• Host failover
• Hotspots (uneven load)
• Parallelization
• Slate caching
• Overload stability
Hotspots
• Some key distributions are highly nonuniform (e.g., Zipfian)
– Keys based on natural-language word usage
– Keys drawn from a set whose members vary in popularity
• Mappers: Run any event anywhere.
• Updaters: Popular keys need access to the same slate.
– Split associative and commutative computations
  • Split computation parallelizes partial results.
  • Propagate partial results to final result.
– Reduce slate serialization/deserialization overhead
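Splitting works for a count because addition is associative and commutative, so partial counts can be combined in any order. A minimal Python sketch, assuming each event for a hot key goes to one of `FANOUT` partial slates chosen at random; the constant and both function names are hypothetical, and the direct sum in `final_count` stands in for propagating partial results to a final updater.

```python
import random
from collections import defaultdict

FANOUT = 4  # number of partial slates per hot key (hypothetical setting)


def partial_update(partials, key, rng):
    """Count one event in a randomly chosen partial slate for the key."""
    partials[(key, rng.randrange(FANOUT))] += 1


def final_count(partials, key):
    """Combine the partial counts into the final result for the key."""
    return sum(partials[(key, i)] for i in range(FANOUT))


rng = random.Random(0)
partials = defaultdict(int)  # (key, partial index) -> partial count
for _ in range(1000):
    partial_update(partials, "Walmart.1288052100", rng)

total = final_count(partials, "Walmart.1288052100")
print(total)  # 1000
```

Each partial slate can live on a different worker, so no single worker has to absorb every event for the popular key.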
Usage
• Time
– Running since mid-2010
• Developers
– More than a dozen developers at WalmartLabs have used Muppet to develop their applications
• Data
– Billions of events, tens of millions of slates processed
Related work
• MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
– MapReduce Online (Condie et al.)
– Nova (Olston et al.)
• Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
– S4 (Neumeyer et al.)
– Storm (Marz et al.)
• Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
– Aurora (StreamBase Systems) (Zdonik et al.)
– SPADE for System S (InfoSphere Streams) (Gedik et al.)
Conclusion
Big Data : MapReduce :: Fast Data : MapUpdate
Create soft-real-time applications on a simple programming model.
Distributed stream-processing infrastructure scales computation across cores.