Introduction to Distributed Systems

Distributed Cluster Computing Platforms

Outline
- What is the purpose of Data Intensive Super Computing?
- MapReduce
- Pregel
- Dryad
- Spark/Shark
- Distributed Graph Computing

Why DISC
DISC stands for Data Intensive Super Computing.
A lot of applications: scientific data, web search engines, social networks, economics, GIS.
New data are continuously generated, and people want to understand the data.
Big Data analysis is now considered a very important method for scientific research.

What are the required features for a platform to handle DISC?
- Application specific: it is very difficult, or even impossible, to construct one system that fits them all. One example is the POSIX-compatible file system. Each system should be re-configured or even re-designed for a specific application. Think about the motivation for building the Google File System for the Google search engine.
- Programmer-friendly interfaces: the application programmer should not have to consider how to handle the infrastructure, such as machines and networks.
- Fault tolerance: the platform should handle faulty components automatically, without any special treatment from the application.
- Scalability: the platform should run on top of at least thousands of machines and harness the power of all the components. Load balancing should be achieved by the platform instead of the application itself.
Try to keep these four features in mind during the introduction of the concrete platforms below.

Google MapReduce
- Programming Model
- Implementation
- Refinements
- Evaluation
- Conclusion

Motivation: Large-Scale Data Processing
Process lots of data to produce other derived data.
- Input: crawled documents, web request logs, etc.
- Output: inverted indices, web page graph structure, top queries in a day, etc.
Want to use hundreds or thousands of CPUs but only focus on the functionality.
MapReduce hides the messy details in a library:
- Parallelization
- Data distribution
- Fault tolerance
- Load balancing

Motivation: Large-Scale Data Processing
- Want to process lots of data (> 1 TB)
- Want to parallelize across hundreds/thousands of CPUs
- Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html

MapReduce
- Automatic parallelization & distribution
- Fault-tolerant
- Provides status and monitoring tools
- Clean abstraction for programmers

Programming Model
Borrows from functional programming.
Users implement an interface of two functions:

map (in_key, in_value) -> (out_key, intermediate_value) list

reduce (out_key, intermediate_value list) -> out_value list

map
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line).
map() produces one or more intermediate values along with an output key from the input.

reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

Architecture

Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Bottleneck: the reduce phase can't start until the map phase is completely finished

Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Example vs. Actual Source Code
- The example is written in pseudo-code
- The actual implementation is in C++, using a MapReduce library
- Bindings for Python and Java exist via interfaces
- True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.); a runnable sketch follows below

Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.

Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1).
Worker 2: (today 1), (is 1), (good 1).
Worker 3: (good 1), (weather 1), (is 1), (good 1).

Reduce input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)

Reduce output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)

Some Other Real Examples
- Term frequencies through the whole Web repository
- Count of URL access frequency
- Reverse web-link graph

Implementation Overview
- Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
- Limited bisection bandwidth
- Storage is on local IDE disks
- GFS: distributed file system manages data (SOSP'03)
- Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
- Implementation is a C++ library linked into user programs

Architecture
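Returning to the word-count example above: here is a minimal, self-contained Python sketch (an illustration only, not Google's C++ library) that simulates the map, shuffle, and reduce phases and reproduces the reduce output shown above.

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, 1) for every word, like EmitIntermediate(w, "1") above.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all partial counts for one output key.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each input record is processed independently.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)   # shuffle: group by key
    # Reduce phase: one call per distinct intermediate key.
    results = {}
    for out_key, values in intermediate.items():
        for final_key, final_value in reduce_fn(out_key, values):
            results[final_key] = final_value
    return results

pages = [
    ("Page 1", "the weather is good"),
    ("Page 2", "today is good"),
    ("Page 3", "good weather is good"),
]
print(run_mapreduce(pages, map_fn, reduce_fn))
# {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}

In a real deployment the intermediate dictionary is replaced by a distributed shuffle and the map/reduce calls run on different machines; the two function signatures are the only part the application programmer writes.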

Execution

Parallel Execution

Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines.
- Minimizes time for fault recovery
- Can pipeline shuffling with map execution
- Better dynamic load balancing
Often use 200,000 map tasks and 5,000 reduce tasks with 2,000 machines.

Locality
The master program divvies up tasks based on the location of the data: it asks GFS for the locations of the replicas of the input file blocks and tries to schedule map() tasks on the same machine as the physical file data, or at least in the same rack.
map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.
Without this, rack switches limit the read rate.
Effect: thousands of machines read input at local disk speed.

Fault Tolerance
The master detects worker failures:
- Re-executes completed & in-progress map() tasks
- Re-executes in-progress reduce() tasks
The master notices if particular input key/values cause crashes in map(), and skips those values on re-execution.
Effect: can work around bugs in third-party libraries!

Fault Tolerance
On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion is committed through the master
Master failure: could be handled, but isn't yet (master failure is unlikely).
Robust: once lost 1600 of 1800 machines, but finished fine.

Optimizations
No reduce can start until map is complete:
- A single slow disk controller can rate-limit the whole process
The master redundantly executes slow-moving map tasks and uses the results of whichever copy finishes first.
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
Slow workers significantly lengthen completion time:
- Other jobs consuming resources on the machine
- Bad disks with soft errors transfer data very slowly
- Weird things: processor caches disabled (!!)

Optimizations
Combiner functions can run on the same machine as a mapper.
This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth.
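To illustrate (a sketch under the word-count assumptions above, not Google's implementation): a combiner is essentially the reduce function applied locally to one mapper's output before the shuffle. That is sound here because addition is associative and commutative, so local partial sums do not change the final totals.

from collections import defaultdict

def combine(map_output):
    # Pre-aggregate the (word, 1) pairs produced by ONE mapper before they
    # are sent over the network.
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

# One mapper's raw output for "good weather is good":
print(combine([("good", 1), ("weather", 1), ("is", 1), ("good", 1)]))
# [('good', 2), ('weather', 1), ('is', 1)] -- 3 pairs instead of 4 cross the network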

Under what conditions is it sound to use a combiner?

Refinement
- Sorting guarantees within each reduce partition
- Compression of intermediate data
- Combiner: useful for saving network bandwidth
- Local execution for debugging/testing
- User-defined counters

Performance
Tests run on a cluster of 1800 machines:
- 4 GB of memory
- Dual-processor 2 GHz Xeons with Hyperthreading
- Dual 160 GB IDE disks
- Gigabit Ethernet per machine
- Bisection bandwidth approximately 100 Gbps
Two benchmarks:

MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records).

MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark).

MR_Grep
Locality optimization helps:
- 1800 machines read 1 TB of data at a peak of ~31 GB/s
- Without this, rack switches would limit the read rate to 10 GB/s
Startup overhead is significant for short jobs.

MR_Sort
- Backup tasks reduce job completion time significantly
- The system deals well with failures

(Figure: sort timelines for normal execution, execution with no backup tasks, and execution with 200 processes killed.)

More and more MapReduce

MapReduce Programs in the Google Source Tree
Example uses:
- distributed grep
- distributed sort
- web link-graph reversal
- term-vector per host
- web access log stats
- inverted index construction
- document clustering
- machine learning
- statistical machine translation

Real MapReduce: Rewrite of the Production Indexing System
- Rewrote Google's production indexing system using MapReduce
- Set of 10, 14, 17, 21, 24 MapReduce operations
- New code is simpler, easier to understand
- MapReduce takes care of failures and slow machines
- Easy to make indexing faster by adding more machines

MapReduce Conclusions
- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations at Google
- The functional programming paradigm can be applied to large-scale applications
- Fun to use: focus on the problem, let the library deal with the messy details

MapReduce Programs
- Sorting
- Searching
- Indexing
- Classification
- TF-IDF
- Breadth-First Search / SSSP
- PageRank
- Clustering

MapReduce for PageRank

PageRank: Random Walks Over The Web
If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that he or she will arrive at a given page?
The PageRank of a page captures this notion: more popular or worthwhile pages get a higher rank.

PageRank: Visually

PageRank: Formula
Given page A, and pages T1 through Tn linking to A, PageRank is defined as:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

C(P) is the cardinality (out-degree) of page P.
d is the damping (random URL) factor.

PageRank: Intuition
The calculation is iterative: the PageRank at iteration i+1 is based on the PageRank at iteration i.
Each page distributes its current PR to all pages it links to. Linkees add up their awarded rank fragments to find their new PR.
d is a tunable parameter (usually d = 0.85) encapsulating the random jump factor.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PageRank: First Implementation
- Create two tables, 'current' and 'next', holding the PageRank for each page. Seed 'current' with initial PR values.
- Iterate over all pages in the graph, distributing PR from 'current' into 'next' of linkees.
- current := next; next := fresh_table();
- Go back to the iteration step, or end if converged.
(A runnable sketch of this loop follows below.)

Distribution of the Algorithm
Key insights allowing parallelization:
- The 'next' table depends on 'current', but not on any other rows of 'next'
- Individual rows of the adjacency matrix can be processed in parallel
- Sparse matrix rows are relatively small
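A minimal Python sketch of the two-table ('current'/'next') loop just described, on a tiny hand-made link graph (an illustration only; it assumes every page appears as a key in the graph and lets dangling pages simply leak rank):

def pagerank(links, d=0.85, iterations=20):
    # links: page -> list of pages it points to.
    pages = list(links)
    current = {p: 1.0 for p in pages}          # seed 'current' with initial PR values
    for _ in range(iterations):
        next_pr = {p: 1.0 - d for p in pages}  # the (1-d) term for every page
        for page, outlinks in links.items():
            if not outlinks:
                continue                            # dangling page: rank leaks away
            share = current[page] / len(outlinks)   # PR(T)/C(T)
            for target in outlinks:
                next_pr[target] += d * share        # linkees add up rank fragments
        current = next_pr                      # current := next
    return current

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))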

Distribution of the Algorithm
Consequences of the insights:
- We can map each row of 'current' to a list of PageRank fragments to assign to linkees
- These fragments can be reduced into a single PageRank value for a page by summing
- The graph representation can be even more compact: since each element is simply 0 or 1, only transmit the column numbers where it is 1

Phase 1: Parse HTML
The map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls)):
- PR_init is the seed PageRank for URL
- list-of-urls contains all pages pointed to by URL

The reduce task is just the identity function.

Phase 2: PageRank Distribution
The map task takes (URL, (cur_rank, url_list)):
- For each u in url_list, emit (u, cur_rank/|url_list|)
- Emit (URL, url_list) to carry the points-to list along through the iterations

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Phase 2: PageRank Distribution
The reduce task gets (URL, url_list) and many (URL, val) values:
- Sum the vals and fix up with d
- Emit (URL, (new_rank, url_list))
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
(A sketch of this phase appears below.)

Finishing up...
A non-parallelizable component determines whether convergence has been achieved (a fixed number of iterations? comparison of key values?).
If so, write out the PageRank lists: done!
Otherwise, feed the output of Phase 2 into another Phase 2 iteration.

PageRank Conclusions
- MapReduce isn't the greatest at iterated computation, but it still helps run the heavy lifting
- The key element in parallelization is the independent PageRank computation in a given step
- Parallelization requires thinking about the minimum data partitions to transmit (e.g., compact representations of graph rows)
- Even the implementation shown today doesn't actually scale to the whole Internet, but it works for intermediate-sized graphs
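For concreteness, here is a self-contained Python sketch of the Phase 2 map and reduce described above, with a toy in-memory shuffle standing in for the MapReduce framework (an illustration only; names such as phase2_map and one_iteration are invented here):

from collections import defaultdict

def phase2_map(url, value):
    cur_rank, url_list = value                # state from the previous iteration
    for u in url_list:
        yield (u, cur_rank / len(url_list))   # a PageRank fragment
    yield (url, url_list)                     # carry the points-to list along

def phase2_reduce(url, values, d=0.85):
    url_list, total = [], 0.0
    for v in values:
        if isinstance(v, list):
            url_list = v                      # the carried points-to list
        else:
            total += v                        # sum the rank fragments
    return (url, ((1 - d) + d * total, url_list))

def one_iteration(ranks):
    # One MapReduce job: map, shuffle (group by key), reduce.
    grouped = defaultdict(list)
    for url, value in ranks.items():
        for k, v in phase2_map(url, value):
            grouped[k].append(v)
    return dict(phase2_reduce(url, vals) for url, vals in grouped.items())

ranks = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
for _ in range(20):                           # Phase 2 is re-run until convergence
    ranks = one_iteration(ranks)
print(ranks)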

So, do you think that MapReduce is suitable for PageRank? (Homework: give concrete reasons why and why not.)

Dryad

Outline
- Dryad Design
- Implementation
- Policies as Plug-ins
- Building on Dryad

Design Space
(Figure: workloads placed along throughput vs. latency and Internet / private data-center / data-parallel / shared-memory axes, locating Dryad next to search, HPC, grid, and transaction processing.)
Dryad is optimized for throughput and for data-parallel computation in a private data-center.

Data Partitioning
(Figure: a dataset much larger than one machine's RAM, partitioned across machines.)
A common scenario: too much data to process. Instead of trying to be clever, just use more machines and a brute-force algorithm.

2-D Piping
Unix pipes are 1-D:
grep | sed | sort | awk | perl

Dryad pipelines are 2-D:
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50


Dryad is a generalization of the Unix piping mechanism: instead of uni-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.

Dryad = Execution Layer
(Figure: Job (Application) / Dryad / Cluster, analogous to Pipeline / Shell / Machine.)
In the same way that the Unix shell does not understand the pipeline running on top of it but manages its execution (i.e., killing processes when one exits), Dryad does not understand the job running on top of it.

Outline
- Dryad Design
- Implementation
- Policies as Plug-ins
- Building on Dryad

Virtualized 2-D Pipelines

This is a possible schedule of a Dryad job using 2 machines.

Virtualized 2-D Pipelines
(Figures: successive steps of the same schedule as vertices complete and the two machines are reused.)

The Unix pipeline is generalized in three ways:
- 2-D instead of 1-D
- spans multiple machines
- resources are virtualized: you can run the same large job on many or few machines

Dryad Job Structure
(Figure: a DAG of replicated grep, sed, sort, awk, and perl vertices.)

(Figure labels: input files, vertices (processes), channels, output files, one stage.)

grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
This is the basic Dryad terminology.
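To make the stage / vertex / channel terminology concrete, here is a small, hypothetical Python description of such a job (an illustration only, not Dryad's actual API): each stage names a vertex program and a replication count, and channels connect consecutive stages.

from dataclasses import dataclass

@dataclass
class Stage:
    program: str       # the vertex program, e.g. "grep"
    replicas: int      # how many vertices (processes) run this program

# The 2-D pipeline grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
job = [
    Stage("grep", 1000),
    Stage("sed", 500),
    Stage("sort", 1000),
    Stage("awk", 500),
    Stage("perl", 50),
]

total_vertices = sum(s.replicas for s in job)
# Channels connect the vertices of consecutive stages; how they are wired
# (point-to-point, all-to-all, ...) is a per-connection policy.
print(f"{len(job)} stages, {total_vertices} vertices")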

Channels
Finite streams of items:
- distributed filesystem files (persistent)
- SMB/NTFS files (temporary)
- TCP pipes (inter-machine)
- memory FIFOs (intra-machine)
Channels are very abstract, enabling a variety of transport mechanisms. The performance and fault-tolerance of these mechanisms vary widely.

Architecture
(Figure: the job manager (JM), a name server (NS), per-machine daemons (PD), and vertices (V); the control plane carries the job schedule while the data plane moves data over files, TCP, FIFOs, and the network.)
The brain of a Dryad job is a centralized Job Manager, which maintains the complete state of the job. The JM controls the processes running on the cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)

Computation Staging
(Figure: the application executable, containing JM code and vertex code, is staged onto the cluster.)
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources (cluster services)
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution

Fault Tolerance

Vertex failures and channel failures are handled differently.

Outline
- Dryad Design
- Implementation
- Policies and Resource Management
- Building on Dryad

Policy Managers
(Figure: stage R and stage X with their vertices; the Job Manager hosts an R manager, an X manager, and an R-X connection manager for the connection R-X.)
Each stage has a stage manager. Each inter-stage set of edges has a connection manager.

The managers get upcalls for all important events in the corresponding vertices, and can make policy decisions. The user can change the managers. The managers can even rewrite the graph at run-time.

Duplicate Execution Manager
(Figure: completed vertices X[0], X[1], X[3]; a slow vertex X[2] and its duplicate X[2].)
Duplication Policy = f(running times, data volumes)
The handling of apparently very slow computations, by duplicating vertices, is performed by a stage manager.
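The slides give the duplication policy only as f(running times, data volumes); the following is a purely hypothetical sketch of what such an upcall-driven policy might look like, with made-up thresholds:

def should_duplicate(elapsed, median_elapsed, fraction_read, min_elapsed=60.0):
    # Hypothetical policy: duplicate a vertex that has run well past the
    # median of its completed peers while still far from finishing its input.
    if elapsed < min_elapsed:          # don't react to very young vertices
        return False
    if fraction_read >= 0.9:           # almost done anyway, duplication won't help
        return False
    return elapsed > 2.0 * median_elapsed

# Example upcall from a stage manager for a suspect vertex:
print(should_duplicate(elapsed=300.0, median_elapsed=90.0, fraction_read=0.4))  # True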

Aggregation Manager
(Figure: static vs. dynamic aggregation trees, with intermediate aggregators placed per rack.)
Aggregating data with associative operators can be done in a bandwidth-preserving fashion if the intermediate aggregations are placed close to the source data.

Data Distribution (Group By)
(Figure: m source vertices redistributing data to n destination vertices over m x n channels.)
Redistributing data is an important step for load balancing or when changing keys.

Range-Distribution Manager
(Figure: static vs. dynamic range partitioning; a sampled histogram splits the key range [0-100) into [0-30) and [30-100).)
Using a connection manager, one can load-balance the data distribution at run-time based on statistics obtained from sampling the data stream. In this case the number of destination vertices and the key ranges for each vertex are decided dynamically.
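A minimal sketch of how sampled statistics can pick range boundaries so that each destination vertex receives a similar amount of data (assumptions: keys are plain integers and a uniform random sample of the stream is available):

def range_boundaries(sample, num_destinations):
    # Choose split points from a sorted sample so that each destination
    # receives roughly the same number of keys (an equi-depth histogram).
    sample = sorted(sample)
    return [sample[(i * len(sample)) // num_destinations]
            for i in range(1, num_destinations)]

# A skewed sample: most keys fall in the low end of [0, 100).
sample = [1, 2, 3, 5, 7, 9, 11, 14, 18, 22, 25, 28, 35, 47, 62, 88]
print(range_boundaries(sample, 2))   # [18]: keys < 18 go to one vertex, the rest to the other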

Goal: Declarative Programming
(Figure: a static plan vs. a dynamically expanded plan of S, T, and X vertices.)
We are evolving towards a programming model in which resources are always allocated dynamically, based on demand.

Outline
- Dryad Design
- Implementation
- Policies as Plug-ins
- Building on Dryad

Software Stack
(Figure: the stack runs from Windows Server, cluster services, and a distributed filesystem (CIFS/NTFS), through Dryad with job queueing and monitoring, up to layers such as a distributed shell, PSQL, DryadLINQ, SSIS, SQL Server, Perl, legacy code (sed, awk, grep, etc.), C++, and C# libraries for queries, vectors, and machine learning.)
There is a rich software ecosystem built around Dryad. In this talk I will focus on a few of the layers developed at Microsoft Research SVC.

SkyServer Query 18
(Figure: the Dryad query plan, with vertices labeled D, M, S, Y, H, X, U, N, and L.)

select distinct P.ObjID
into results
from photoPrimary U, neighbors N, photoPrimary L
where U.ObjID = N.ObjID
  and L.ObjID = N.NeighborObjID
  and P.ObjID < L.ObjID
  and abs((U.u-U.g)-(L.u-L.g))