Piccolo – Paper Discussion Big Data Reading Group 9/20/2010

TRANSCRIPT

Page 1: Piccolo – Paper Discussion Big Data Reading Group

Piccolo – Paper Discussion

Big Data Reading Group

9/20/2010

Page 2: Piccolo – Paper Discussion Big Data Reading Group

Motivation / Goals

• Rising demand for distributing computation
  • PageRank, K-Means, N-Body simulation
• Data-centric frameworks simplify programming
• Existing models (e.g. MapReduce) are insufficient
  • Designed for large-scale data analysis as opposed to in-memory computation
• Make in-memory computations fast
• Enable asynchronous computation


Page 3: Piccolo – Paper Discussion Big Data Reading Group

Overview

• Global in-memory key-value tables for sharing state (sketched below)
• Concurrently running instances of kernel applications modifying global state
• Locality optimized (user-specified policies)
• Reduced synchronization (accumulation, global barriers)
• Checkpoint-based recovery
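The transcript contains no code, so the following is a minimal, single-process Python sketch of the model these bullets describe: a global key-value table with a user-supplied accumulator, updated by a kernel function running over one partition. Every name in it (Table, pagerank_kernel, the damping factor 0.85) is an illustrative assumption, not the Piccolo API.

    # Single-process stand-in for the model above (illustrative only).
    # A global key-value table holds shared state; a user-supplied accumulator
    # merges write/write conflicts so kernels need no pairwise synchronization.
    class Table:
        def __init__(self, accumulator, default=0.0):
            self.accumulator = accumulator
            self.default = default
            self.data = {}

        def get(self, key):
            return self.data.get(key, self.default)

        def update(self, key, value):
            # concurrent updates to the same key are merged by accumulation
            self.data[key] = self.accumulator(self.get(key), value)

    # Hypothetical PageRank-style kernel: each instance walks its partition of
    # the link graph and accumulates rank contributions into the next table.
    def pagerank_kernel(links, curr, nxt, partition):
        for page in partition:
            share = curr.get(page) / max(len(links[page]), 1)
            for target in links[page]:
                nxt.update(target, 0.85 * share)

    add = lambda old, new: old + new
    curr, nxt = Table(add), Table(add)
    links = {"a": ["b"], "b": ["a"]}
    for page in links:
        curr.update(page, 1.0)               # initial ranks
    pagerank_kernel(links, curr, nxt, partition=list(links))

In Piccolo itself many such kernel instances run concurrently on different machines and synchronize only at global barriers; the accumulator is what makes their overlapping writes safe.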


Page 4: Piccolo – Paper Discussion Big Data Reading Group

System Design


Page 5: Piccolo – Paper Discussion Big Data Reading Group

Table interface
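This slide shows the table interface only as a figure, which the transcript did not capture. As a rough outline of the operations the rest of the talk relies on, a Piccolo-style table could be sketched as follows; the method names are assumptions for illustration, not necessarily the paper's exact API.

    # Assumed outline of a Piccolo-style table interface (illustrative names).
    class TableInterface:
        def get(self, key): ...                 # read a value, possibly remote
        def put(self, key, value): ...          # overwrite a value
        def update(self, key, value): ...       # merge via the table's accumulator
        def flush(self): ...                    # push buffered remote updates
        def get_iterator(self, partition): ...  # scan a locally stored partition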


Page 6: Piccolo – Paper Discussion Big Data Reading Group

Optimization

• Ensure locality (see the partitioning sketch below)
  • Group kernel with partition
  • Group partitions
  • Guarantee: one partition resides completely on a single machine
• Reduce synchronization
  • Accumulation to avoid write/write conflicts
  • No pairwise kernel synchronization
  • Global barriers are sufficient
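To make the locality grouping concrete, here is a small assumed sketch (not Piccolo's implementation): keys hash to partitions, each partition lives entirely on one worker, and the kernel instance that processes a partition is placed on that same worker.

    # Illustrative partition-to-worker placement (assumed scheme).
    def partition_of(key, num_partitions):
        # a key always maps to one partition, and a partition is stored
        # completely on a single machine
        return hash(key) % num_partitions

    def assign_partitions(num_partitions, workers):
        # partition p and the kernel instance that works on p are grouped on
        # the same worker, so most table operations stay local
        return {p: workers[p % len(workers)] for p in range(num_partitions)}

    placement = assign_partitions(8, ["w0", "w1", "w2", "w3"])
    owner = placement[partition_of("some-key", 8)]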


Page 7: Piccolo – Paper Discussion Big Data Reading Group

Load balancing

• Assigning partitions
  • Round robin
  • Optimized for data location
• Work stealing (see the sketch below)
  • Biggest task first (the master estimates task size from the number of keys in a partition)
  • The master decides
• Restrictions
  • A running task cannot be killed (it modifies shared state; restoring would be very expensive)
  • Partitions need to be moved
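A minimal sketch of the "biggest task first" policy, with assumed bookkeeping rather than Piccolo's actual scheduler: the master orders unstarted partitions by key count and gives the largest one to whichever worker runs out of work.

    # Illustrative master-side bookkeeping for work stealing (assumed).
    import heapq

    def build_task_queue(partition_key_counts):
        # max-heap via negated sizes: the biggest partition is stolen first
        heap = [(-count, pid) for pid, count in partition_key_counts.items()]
        heapq.heapify(heap)
        return heap

    def steal_task(heap):
        # called by the master when a worker becomes idle
        if not heap:
            return None
        _, pid = heapq.heappop(heap)
        return pid

    queue = build_task_queue({0: 5_000, 1: 120_000, 2: 800})
    print(steal_task(queue))   # -> 1, the partition with the most keys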


Page 8: Piccolo – Paper Discussion Big Data Reading Group

Table migration

• Migrate a table partition from worker wa to worker wb (simulated below)
  • The master sends message M1 to all workers
  • All workers flush pending updates to wa
  • All workers send new requests to wb
  • wb buffers all incoming requests
  • wa sends its paused state to wb
  • When all workers acknowledge phase 1, the master sends M2 to wa and wb
  • wa flushes its remaining updates to wb and leaves the “paused” state
  • wb first works off the buffered requests, then resumes normal operation
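To make the ordering of the two phases concrete, here is a deliberately simplified single-process Python simulation; the dict-based stand-ins for wa and wb and all names are assumptions, not Piccolo's implementation.

    # Simplified simulation of the two-phase migration above (illustrative).
    def migrate_partition(wa, wb, pending_updates, new_requests):
        # Phase 1 (after M1): workers flush outstanding updates to wa, send
        # new requests to wb, and wb only buffers them for now
        for key, delta in pending_updates:
            wa[key] = wa.get(key, 0) + delta
        buffered = list(new_requests)
        wb.update(wa)                      # wa sends its paused state to wb

        # Phase 2 (after M2): wa hands over ownership; wb works off the
        # buffered requests and then serves the partition normally
        wa.clear()
        for key, delta in buffered:
            wb[key] = wb.get(key, 0) + delta

    owner_a, owner_b = {"url1": 3}, {}
    migrate_partition(owner_a, owner_b,
                      pending_updates=[("url1", 1)],
                      new_requests=[("url2", 5)])
    print(owner_b)                         # {'url1': 4, 'url2': 5}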


Page 9: Piccolo – Paper Discussion Big Data Reading Group

Fault tolerance

• User-assisted checkpoint / restore
  • Chandy-Lamport snapshots
  • Asynchronous -> periodic checkpoints
  • Synchronous -> barrier checkpoints
• Problem: when to start the barrier checkpoint
  • Started too early, the replay log might get very long
  • Started too late, the checkpoint might not use enough free CPU time before the barrier
• Solution: start when the first worker has finished all of its jobs (see the sketch below)
• No checkpoint during table migration, and no migration during a checkpoint
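A tiny sketch of the trigger described in the "Solution" bullet, with assumed data structures: the master starts the barrier checkpoint as soon as the first worker has no tasks left, so the idle time before the barrier is spent writing the checkpoint instead of being wasted.

    # Illustrative checkpoint trigger (assumed bookkeeping).
    def should_start_checkpoint(remaining_tasks_per_worker):
        # start as soon as any worker has finished all of its assigned tasks
        return any(len(tasks) == 0 for tasks in remaining_tasks_per_worker.values())

    remaining = {"w0": [], "w1": ["p3", "p7"]}
    if should_start_checkpoint(remaining):
        print("master: begin barrier checkpoint while w1 keeps computing")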


Page 10: Piccolo – Paper Discussion Big Data Reading Group

Applications

• PageRank, k-means, n-body, matrix multiplication
  • Parallel, iterative computations
  • Local reads + local/remote writes, or local/remote reads + local writes
  • Can be implemented as multiple MapReduce jobs
• Distributed web crawler
  • Idempotent operations (see the example below)
  • Cannot be realized in MapReduce
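As a small illustration of why idempotence matters here (an assumed example, not the paper's crawler code): if the crawl table's accumulator takes the maximum of the old and new state, replaying the same update after a failure leaves the table unchanged.

    # Illustrative idempotent table update for a crawler (assumed example).
    TO_FETCH, FETCHING, DONE = 0, 1, 2

    url_state = {}

    def update_state(url, state):
        # max() as the accumulator: applying the same update twice is a no-op
        url_state[url] = max(url_state.get(url, TO_FETCH), state)

    update_state("http://example.com/", FETCHING)
    update_state("http://example.com/", FETCHING)   # replayed after a failure
    print(url_state)                                # {'http://example.com/': 1}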


Page 11: Piccolo – Paper Discussion Big Data Reading Group

Scaling


(Two scaling figures: one with fixed input size, one with scaled input size.)

Page 12: Piccolo – Paper Discussion Big Data Reading Group

Comparison with Hadoop / MPI


• PageRank, k-means (vs. Hadoop)
  • Piccolo is 4x and 11x faster, respectively
  • For PageRank, Hadoop spends:
    • 50% of the time in sort (to join data streams)
    • 15% in (de)serialization (reading/writing HDFS)
• Matrix multiplication (vs. MPI)
  • Piccolo is 10% faster
  • MPI waits for the slowest node many times

Page 13: Piccolo – Paper Discussion Big Data Reading Group

Work stealing / slow worker / checkpoints


• Work stealing / slow worker
  • PageRank has skewed partitions
  • One slow worker (50% CPU)
• Checkpoints
  • Naïve: start after all workers have finished
  • Optimized: start after the first worker has finished

Page 14: Piccolo – Paper Discussion Big Data Reading Group

Checkpoint limits / scalability


• Hypothetical data center
  • Typical machine uptime of 1 year
  • Worst-case scenario
  • Optimistic?

• Looked different on some older slides

Page 15: Piccolo – Paper Discussion Big Data Reading Group

Distributed Crawler


• 32 machines saturate 100 Mbps
• There are single servers doing this
• Piccolo would scale higher

Page 16: Piccolo – Paper Discussion Big Data Reading Group

Summary

• Piccolo provides an easy-to-use distributed shared memory model
• It applies many restrictions
  • Simple interface
  • Reduced synchronization
  • Relaxed consistency
  • Accumulation
  • Locality
• But it performs well
  • Iterative computations
  • Saves going to disk compared to MapReduce
• A specialized tool for data-intensive in-memory computing
