
Page 1:

Programming Models and Frameworks

Advanced Cloud Computing 15-719/18-847b
Garth Gibson, Greg Ganger, Majd Sakr

Jan 30, 2017

Page 2:

Advanced Cloud Computing Programming Models

•  Ref 1: MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI '04, 2004.
   http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf

•  Ref 2: Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. USENIX Hot Topics in Cloud Computing (HotCloud '10), 2010.
   http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf


Page 3:

Advanced Cloud Computing Programming Models

•  Optional
•  Ref 3: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. OSDI '08.
   http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf

•  Ref 4: GraphLab: A New Parallel Framework for Machine Learning. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein. Conf. on Uncertainty in Artificial Intelligence (UAI), 2010.
   http://www.select.cs.cmu.edu/publications/scripts/papers.cgi


Page 4:

Advanced Cloud Computing Programming Models

•  Optional
•  Ref 5: TensorFlow: A System for Large-Scale Machine Learning. Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeff Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. OSDI '16.
   https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf


Page 5:

Recall the SaaS, PaaS, IaaS Taxonomy

•  Software, Platform, or Infrastructure as a Service
o  SaaS: the service is a complete application (client-server computing)
   •  Not usually a programming abstraction
o  PaaS: high-level (language) programming model for the cloud computer
   •  E.g., rapid-prototyping languages
   •  Turing complete, but resource management is hidden
o  IaaS: low-level (language) computing model for the cloud computer
   •  E.g., assembler as a language
   •  Basic hardware model with all (virtual) resources exposed

•  For PaaS and IaaS, cloud programming is needed
o  How is this different from CS 101? Scale, fault tolerance, elasticity, …


Page 6:

Embarrassingly parallel “Killer app:” Web servers

•  Online retail stores (like amazon.com, for example)
o  Most of the computational demand is for browsing product marketing, forming and rendering web pages, and managing customer session state
   •  Actual order taking and billing is not as demanding, and is handled by separate specialized services (Amazon bookseller backend)
o  One customer session needs a small fraction of one server
o  No interaction between customers (unless inventory is near exhaustion)

•  Parallelism is just more cores running identical copies of the web server
•  Load balancing, maybe in the name service, is the parallel programming
o  Elasticity needs a template service, load monitoring, and cluster allocation
o  These need not require user programming, just configuration


Page 7:

E.g., Obama for America Elastic Load Balancer


Page 8:

What about larger apps?

•  Parallel programming is hard – how can cloud frameworks help?

•  Collection-oriented languages (Sipelstein & Blelloch, Proc. IEEE v79, n4, 1991)
o  Also known as data-parallel
o  Specify a computation on an element; apply it to each element in a collection
   •  Analogy to SIMD: single instruction on multiple data
o  Specify an operation on the collection as a whole
   •  Union/intersection, permute/sort, filter/select/map
   •  Reduce-reorderable (A) / reduce-ordered (B)
   –  (A) E.g., ADD(1,7,2) = (1+7)+2 = (2+1)+7 = 10
   –  (B) E.g., CONCAT("the ", "lazy ", "fox ") = "the lazy fox "

•  Note the link to MapReduce … it's no accident (see the sketch below)
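A minimal Python sketch of the collection-oriented style (the data is hypothetical): per-element map/filter, plus one reorderable and one order-sensitive reduction.

    from functools import reduce

    records = [1, 7, 2]                                  # hypothetical collection
    doubled = list(map(lambda x: 2 * x, records))        # per-element map
    evens = list(filter(lambda x: x % 2 == 0, doubled))  # filter/select

    # (A) Reduce-reorderable: addition is associative and commutative,
    # so a runtime may combine partial results in any order.
    total = reduce(lambda a, b: a + b, records)          # (1+7)+2 = 10

    # (B) Reduce-ordered: concatenation is associative but not commutative,
    # so operand order must be preserved.
    sentence = reduce(lambda a, b: a + b, ["the ", "lazy ", "fox "])  # "the lazy fox "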


Page 9:

High Performance Computing Approach

•  HPC was almost the only home for parallel computing in the 90s
•  Physical simulation was the killer app – weather, vehicle design, explosions/collisions, etc. – replace "wet labs" with "dry labs"
o  Physics is the same everywhere, so define a mesh on a set of particles, code the physics you want to simulate at one mesh point as a property of the influence of nearby mesh points, and iterate
o  Bulk Synchronous Processing (BSP): run all updates of mesh points in parallel based on the values at the last time step, form the new set of values, and repeat (see the sketch below)

•  Defined "weak scaling" for bigger machines – rather than making a fixed-size problem go faster (strong scaling), make a bigger problem go the same speed
o  The most demanding users set the problem size to match total available memory
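A toy BSP superstep in Python (a sketch of the style, not any particular HPC code): every mesh point is updated from its neighbors' values at the previous time step, with an implicit barrier between supersteps.

    # Toy 1-D diffusion mesh updated in BSP supersteps. Every update reads
    # only the previous step's values ("old"), so within one superstep all
    # points could be computed in parallel, followed by a barrier.
    def bsp_superstep(old):
        new = old[:]
        for i in range(1, len(old) - 1):
            new[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
        return new

    mesh = [0.0] * 10
    mesh[5] = 100.0                  # hypothetical initial condition
    for step in range(100):          # each iteration = one superstep
        mesh = bsp_superstep(mesh)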


Page 10:

High Performance Computing Frameworks

•  Machines cost O($10-100) million, so
o  emphasis was on maximizing utilization of the machine (congress checks)
o  low-level speed and hardware-specific optimizations (esp. the network)
o  preference for expert programmers following established best practices

•  Developed the MPI (Message Passing Interface) framework (e.g., MPICH)
o  Launch N threads with library routines for everything you need (see the sketch below):
   •  Naming, addressing, membership, messaging, synchronization (barriers)
   •  Transforms, physics modules, math libraries, etc.
o  Resource allocators and schedulers space-share jobs on the physical cluster
o  Fault tolerance by checkpoint/restart, requiring programmer save/restore
o  Proto-elasticity: kill an N-node job & reschedule from a past checkpoint on M nodes

•  Very manual, deep learning curve, few commercial runaway successes
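A minimal mpi4py sketch of the primitives listed above (assumes mpi4py is installed and the program is launched with something like mpiexec -n 4 python prog.py):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()        # naming/addressing: this thread's id
    size = comm.Get_size()        # membership: how many threads were launched

    if rank == 0:
        comm.send({"work": 42}, dest=1, tag=0)   # messaging
    elif rank == 1:
        msg = comm.recv(source=0, tag=0)

    comm.Barrier()                # synchronization: all ranks wait here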


Page 11:

Broadening HPC: Grid Computing

•  Grid computing started with commodity servers (predates cloud)
o  1989 concept of "killer micros" that would kill off supercomputers

•  Frameworks were less specialized, easier to use (& less efficient)
o  Beowulf, PVM (parallel virtual machine), Condor, Rocks, Sun Grid Engine

•  For funding reasons, grid emphasized multi-institution sharing
o  So authentication, authorization, single sign-on, parallel FTP
o  Heterogeneous workflow (run job A on machine B, then job C on machine D)

•  Basic model: jobs selected from a batch queue take over the cluster
•  Simplified "pile of work": when a core comes free, take a task from the run queue and run it to completion


Page 12:

Cloud Programming, back to the future

•  HPC demanded too much expertise, too many details, and too much tuning

•  Cloud frameworks are all about making parallel programming easier
o  Willing to sacrifice efficiency (too willing, perhaps)
o  Willing to specialize to the application (rather than to the machine)

•  The canonical Big Data user has data & processing needs that require lots of computing, but doesn't have CS or HPC training & experience
o  Wants to learn the least amount of computer science that gets results this week
o  Might later want to learn more if the same jobs become a personal bottleneck


Page 13:


2005 NIST Arabic-English Competition

•  Task: translate 100 articles
•  2005: Google wins!
o  Qualitatively better on its first entry
o  Not the most sophisticated approach – no one on the team knew Arabic; brute-force statistics
o  But more data & compute!! 200M words from UN translations, 1 billion words of Arabic docs, a 1000-processor cluster
•  Can't compete without big data

[Chart: BLEU scores (0.0-0.7) for Google, ISI, IBM+CMU, UMD, JHU+CU, Edinburgh, Systran, Mitre, FSC, with qualitative bands from "Useless" through "Topic identification", "Usable translation", and "Human-editable translation" up to "Expert human translator"]

Page 14:

Cloud Programming Case Studies

•  MapReduce
o  Packages two Sipelstein91 operators, filter/map and reduce, as the base of a data-parallel programming model built around Java libraries

•  DryadLinq
o  Compiles workflows of different data-processing programs into schedulable processes

•  Spark
o  Works to keep partial results in memory, and declarative programming

•  TensorFlow
o  Specializes to iterative machine learning


Page 15:

MapReduce (Majd)


Page 16:

DryadLinq

•  Simplify efficient data-parallel code
o  Compiler support for imperative and declarative (e.g., database) operations
o  Extends MapReduce to workflows that can be collectively optimized

•  Data flows on edges between processes at vertices (workflows)
•  Coding is processes at vertices plus expressions representing the workflow
•  The interesting part of the compiler operates on the expressions
o  Inspired by traditional database query optimization – rewrite the execution plan into an equivalent plan that is expected to execute faster


Page 17:

DryadLinq

•  Data flowing through a graph abstraction
o  Vertices are programs (possibly different at each vertex)
o  Edges are data channels (pipe-like)
o  Requires programs to have no side effects (no changes to shared state)
o  Apply function similar to MapReduce reduce – open-ended user code

•  Compiler operates on expressions, rewriting execution sequences
o  Exploits prior work on compiling workflows on sets (LINQ)
o  Extends traditional database query planning with less type-restrictive code
   •  Unlike traditional plans, virtualizes resources (so it might spill to storage)
o  Knows how to partition sets (hash, range, and round robin) over nodes
o  Doesn't always know what processes do, so its optimizer is less powerful than a database's – where it can't infer what is happening, it takes hints from users
o  Can auto-pipeline, remove redundant partitioning, reorder partitionings, etc.


Page 18:

Spark: Optimize/generalize MR for iterative apps

•  MapReduce uses disks for input, tmp, & output; successive stages communicate through files (disk)

•  Want to use memory, mostly

•  Machine learning apps iterate over the same data to "solve" something
o  Way too much use of disk when the data is not giant

•  Spark is an MR rewrite: more general (Dryad-like graphs of work), more interactive (Scala interpreter) & more efficient (in-memory)

Page 19:

Spark Resilient Distributed Datasets (RDD)

•  Spark programs are functional and deterministic => same input means same result
o  This is the basis of selective re-execution and automated fault tolerance

•  Spark is a set/collection (called an RDD) oriented system
o  Splits a set into partitions, and assigns them to workers to parallelize operations

•  Stores the invocation (code & args) with the inputs as a closure
o  Treats this as a "future" – compute now or later at the system's choice (lazy; see the sketch below)
o  If the code & inputs are already at node X, "args" are faster to send than results
   •  Futures can be used as compression on the wire & in replica nodes
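A small PySpark sketch of the future/lazy behavior (the file path is hypothetical): transformations only record lineage, and nothing executes until an action forces it.

    from pyspark import SparkContext

    sc = SparkContext("local", "rdd-demo")
    lines = sc.textFile("hdfs:///data/log.txt")       # hypothetical input
    errors = lines.filter(lambda l: "ERROR" in l)     # lazy: lineage only
    codes = errors.map(lambda l: l.split()[0])        # still lazy

    codes.cache()        # ask Spark to keep this partial result in memory
    n = codes.count()    # action: forces the whole chain to compute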


Page 20:

Spark Resilient Distributed Datasets (RDDs), cont.

•  Many operators are builtins (with well-known properties, like Dryad)
o  Spark automates transforms when pipelining multiple builtin operations

•  Spark is lazy – only specific operators force computation
o  E.g., materialize in the file system
o  Build programs interactively, computing only when and what the user needs
o  Lineage is the chain of invocations: future on future … delayed compute

•  Replication/FT: ship & cache RDDs on other nodes
o  Can recompute everything there if needed, but mostly don't
o  Saves space in memory on replicas and network bandwidth
o  Need the entire lineage to be replicated in non-overlapping fault domains


Page 21:

Spark “combining Python functions” example

•  rdd_x.map(foo).map(bar)

•  Function foo() takes in a record x and outputs a record y
•  Function bar() takes in a record y and outputs a record z
•  Spark automatically creates a function foo_bar() that takes in a record x and outputs a record z
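The pipelining idea in plain Python (foo, bar, and foo_bar are the slide's illustrative names): since both operators are per-record maps, Spark can run their composition in a single pass over each partition instead of materializing the intermediate records.

    def foo(x):          # record x -> record y
        return x * 2

    def bar(y):          # record y -> record z
        return y + 1

    def foo_bar(x):      # fused: record x -> record z in one pass
        return bar(foo(x))

    partition = [1, 2, 3]                    # hypothetical partition
    out = [foo_bar(x) for x in partition]    # no intermediate list of y records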


Page 22:

Next day plan

•  Encapsulation and virtual machines
o  Guest lecturer: Michael Kozuch, Intel Labs


Page 23:

Programming Models: MapReduce

15-719/18-847b Advanced Cloud Computing Spring 2017

Majd Sakr, Garth Gibson, Greg Ganger

January 30, 2017


Page 24:

Motivation

• How do you perform batch processing of large data sets using low-cost clusters of thousands of commodity machines that frequently experience partial failures or slowdowns?

Page 25:

Batch Processing of Large Datasets

• Challenges
– Parallel programming
– Job orchestration
– Scheduling
– Load balancing
– Communication
– Fault tolerance
– Performance
– …

Page 26:

Google MapReduce

• Data-parallel framework for processing big data on large clusters of commodity hardware

• Transparently tolerates
– Data faults
– Computation faults

• Achieves scalability and fault tolerance

Page 27:

Commodity Clusters

• MapReduce is designed to efficiently process big data using regular commodity computers

• A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core commodity machines

• Premise: MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster to solve big data problems


Page 28:

Three Strategies

• Parallelism
– Break jobs down into distributed, independent tasks to exploit parallelism

• Scheduling
– Consider data locality and variations in overall system workload when scheduling

• Fault tolerance
– Transparently tolerate data and task failures

Page 29:

Hadoop MapReduce

• Hadoop is an open-source implementation of MapReduce (~2006)

• Hadoop presents MapReduce as an analytics engine, and under the hood uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS)
– HDFS mimics the Google File System (GFS)

• Applications in MapReduce are represented as jobs
– Each job encompasses several map and reduce tasks
– Map and reduce tasks operate on data independently and in parallel


Page 30:

MapReduce In a Nutshell

• MapReduce incorporates two phases
– Map phase
– Reduce phase

[Diagram: a dataset stored as HDFS blocks is divided into splits (Split 0-3), each consumed by a map task; each map task's output is divided into partitions; the shuffle stage routes partitions to reduce tasks, which merge & sort them and write final output to HDFS. Stages: Map Phase -> Shuffle Stage -> Merge & Sort Stage -> Reduce Stage (Reduce Phase)]

Page 31:

Data Distribution

• In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded

• An underlying distributed file system (e.g., GFS, HDFS) splits large data files into chunks which are managed by different nodes in the cluster

• Even though the file chunks are distributed across several machines, they form a single namespace


[Diagram: one large input file is split into chunks stored across Node 1, Node 2, and Node 3]

Page 32:

Network Topology In MapReduce

• MapReduce assumes a tree-style network topology

• Nodes are spread over different racks contained in one or many data centers

• The bandwidth between two nodes depends on their relative locations in the network topology

• The assumption is that nodes on the same rack have higher bandwidth between them than nodes that are off-rack

Page 33:

Computing Units: Tasks

• MapReduce divides the workload into multiple independent tasks and automatically schedules them on cluster nodes

• The work performed by each task is done in isolation from the others

• The amount of communication that tasks can perform is limited, mainly for scalability and fault-tolerance reasons


Page 34:

MapReduce Phases

• In MapReduce, splits are processed in isolation by tasks called Mappers

• The output from the mappers is denoted as intermediate output and brought into a second set of tasks called Reducers

• The process of reading intermediate output into a set of Reducers is known as shuffling

• The Reducers produce the final output

• Overall, MapReduce breaks the data flow into two phases, map phase and reduce phase

[Diagram: splits C0-C3 feed mappers M0-M3, whose intermediate outputs (IO) are shuffled, then merged & sorted into reducers R0-R1, which produce final outputs (FO); the left half is the map phase and the right half the reduce phase]

Page 35:

Keys and Values

• The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program

• In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs

• The map and reduce functions receive and emit (K, V) pairs (see the sketch below)

Input splits: (K, V) pairs -> Map function -> intermediate (K', V') pairs -> Reduce function -> final (K'', V'') pairs
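A minimal word-count sketch in Python showing the (K, V) contract: map emits (word, 1) pairs from its split, and reduce sums the values grouped under each key. The small driver below stands in for the framework's shuffle/grouping step.

    from collections import defaultdict

    def map_fn(offset, line):           # input (K, V): byte offset, line text
        for word in line.split():
            yield (word, 1)             # intermediate (K', V') pairs

    def reduce_fn(word, counts):        # one K' with all of its V' values
        yield (word, sum(counts))       # final (K'', V'') pair

    groups = defaultdict(list)          # stand-in for the shuffle
    for k, v in map_fn(0, "the lazy fox the fox"):
        groups[k].append(v)
    result = [kv for w, cs in groups.items() for kv in reduce_fn(w, cs)]
    # result: [('the', 2), ('lazy', 1), ('fox', 2)]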

Page 36:

Splits

• A split can contain references to one or more HDFS blocks
– Configurable parameter

• Each map task processes one split
– The # of splits dictates the # of map tasks
– By default, one split contains a reference to one HDFS block

• Map tasks are scheduled in the vicinity of HDFS blocks to reduce network traffic

Page 37:

Partitions

• Map tasks store intermediate output on local disk (not HDFS)

• A subset of the intermediate key space is assigned to each Reducer: hash(key) mod R (see the sketch below)
– These subsets are known as partitions
– Different keys (potentially from different Mappers) land in different partitions
– Partitions are the input to Reducers
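The default rule fits in one line; a sketch (Python's built-in hash stands in for Hadoop's key hash):

    R = 4                                    # hypothetical number of reducers

    def default_partition(key, num_reducers=R):
        # hash(key) mod R: the same key always lands in the same partition,
        # so exactly one reducer sees all of that key's values.
        # (Hadoop's hash is deterministic; Python's is salted per process.)
        return hash(key) % num_reducers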

Page 38:

Hadoop MapReduce: A Closer Look

[Diagram: on each node, files loaded from the local HDFS store pass through an InputFormat that defines Splits; RecordReaders (RR) turn splits into input (K, V) pairs for Map tasks; a Partitioner assigns intermediate (K, V) pairs to partitions, which the shuffling process exchanges among all nodes; each node then Sorts, Reduces, and writes final (K, V) pairs back to HDFS through an OutputFormat]

Page 39:

Input Files

• Input files are where the data for a MapReduce task is initially stored

• The input files typically reside in a distributed file system (e.g., HDFS)

• The format of input files is arbitrary
– Line-based log files
– Binary files
– Multi-line input records
– Or something else entirely


Page 40:

InputFormat

• How the input files are split up and read is defined by the InputFormat

• InputFormat is a class that does the following:
– Selects the files that should be used for input
– Defines the InputSplits that break up a file
– Provides a factory for RecordReader objects that read the file


Page 41:

InputFormat Types

• Several InputFormats are provided with Hadoop:

InputFormat             | Description                                      | Key                            | Value
TextInputFormat         | Default format; reads lines of text files       | The byte offset of the line    | The line contents
KeyValueInputFormat     | Parses lines into (K, V) pairs                   | Everything up to the first tab | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined                   | user-defined
MyInputFormat           | A user-specified input format                    | user-defined                   | user-defined

Page 42:

Input Splits

• An input split describes a unit of data that a single map task in a MapReduce program will process

• By dividing the file into splits, we allow several map tasks to operate on a single file in parallel
– If the file is very large, this can improve performance significantly through parallelism

• Each map task corresponds to a single input split


Page 43:

RecordReader

• The input split defines a slice of data but does not describe how to access it

• The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers

• The RecordReader is invoked repeatedly on the input until the entire split is consumed

• Each invocation of the RecordReader leads to another call of the map function defined by the programmer (see the sketch below)
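A sketch of a TextInputFormat-style record reader in Python (the file interface is illustrative): it walks one split and yields (byte offset, line) pairs, each of which drives one call of the user's map function.

    def line_record_reader(f, split_start, split_len):
        # f: a file object opened in binary mode. Yields (K, V) =
        # (byte offset, line) pairs until the split is consumed.
        # (A real reader also handles lines that span split boundaries.)
        f.seek(split_start)
        end = split_start + split_len
        while f.tell() < end:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            yield (offset, line.rstrip(b"\n").decode("utf-8"))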


Page 44:

Mapper and Reducer

• The Mapper performs the user-defined work of the first phase of the MapReduce program
– A new instance of Mapper is created for each split

• The Reducer performs the user-defined work of the second phase of the MapReduce program
– A new instance of Reducer is created for each partition
– For each key in the partition assigned to a Reducer, the Reducer is called once


Page 45:

Partitioner

• Each mapper may emit (K, V) pairs to any partition

• Therefore, the map nodes must all agree on where to send different pieces of intermediate data

• The partitioner class determines which partition a given (K, V) pair will go to

• The default partitioner computes a hash value for a given key and assigns it to a partition based on this result


Page 46:

Sort (Merge)

• Each Reducer is responsible for reducing the values associated with (several) intermediate keys

• The set of intermediate keys on a single node is automatically sorted (merged) by MapReduce before being presented to the Reducer (see the sketch below)
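A sketch in Python of what the merge & sort stage hands to a reducer: intermediate pairs sorted by key, then grouped so each key is presented once with all of its values.

    from itertools import groupby
    from operator import itemgetter

    intermediate = [("fox", 1), ("the", 1), ("fox", 1)]   # hypothetical pairs
    intermediate.sort(key=itemgetter(0))                  # sort by key

    for key, pairs in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in pairs]
        # the reducer is invoked once per key: reduce_fn(key, values)
        print(key, values)          # fox [1, 1] / the [1]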


Page 47:

OutputFormat

• The OutputFormat class defines the way (K, V) pairs produced by Reducers are written to output files

• The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS

• Several OutputFormats are provided by Hadoop:

OutputFormat             | Description
TextOutputFormat         | Default; writes lines in "key \t value" format
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat         | Generates no output files

Page 48:

Combiner Functions

• MapReduce applications are limited by the bandwidth available on the cluster
• It pays to minimize the data shuffled between map and reduce tasks
• Hadoop allows the user to specify a combiner function (just like the reduce function) to be run on a map task's output; this is valid only if the reduce function is commutative and associative (see the sketch below)
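A sketch of the slide's temperature example in Python: the combiner is the same max-per-year function as the reducer, applied locally to one map task's output before the shuffle. It works only because max is commutative and associative.

    from collections import defaultdict

    def combine_max(pairs):
        # Local, per-map-task reduce: collapse (year, temp) pairs to one
        # pair per year before anything crosses the network.
        best = defaultdict(lambda: float("-inf"))
        for year, temp in pairs:
            best[year] = max(best[year], temp)
        return sorted(best.items())

    map_output = [(1950, 0), (1950, 20), (1950, 10)]
    print(combine_max(map_output))   # [(1950, 20)] -- one pair shuffled, not three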

[Diagram (R = rack, N = node, MT = map task, RT = reduce task, Y = year, T = temperature): map tasks on nodes in two racks emit (Y, T) pairs; one map task's output (1950, 0), (1950, 20), (1950, 10) is combined locally to (1950, 20) before being shuffled to the reduce task]

Page 49:

MapReduce In a Nutshell

• MapReduce incorporates two phases, Map and Reduce

Page 50:

The Shuffle in MapReduce

Page 51:

Job Scheduling in MapReduce

• In MapReduce, an application is represented as a job

• A job encompasses multiple map and reduce tasks

• Job schedulers in MapReduce are pluggable

• Hadoop MapReduce uses a FIFO scheduler for jobs by default
– Schedules jobs in order of submission
– Starvation with long-running jobs
– No job preemption
– No evaluation of job priority or size


Page 52:

Multi-user Job Scheduling in MapReduce

• Fair scheduler (Facebook)
– Pools of jobs; each pool is assigned a set of shares
– Jobs get (on average) an equal share of slots over time
– Across pools, use the Fair scheduler; within a pool, FIFO or the Fair scheduler

• Capacity scheduler (Yahoo!)
– Creates job queues
– Each queue is configured with a # (capacity) of slots
– Within a queue, scheduling is priority based

Page 53:

Task Scheduling in MapReduce

• MapReduce adopts a master-slave architecture
– The master node in MapReduce is referred to as the Job Tracker (JT)
– Each slave node in MapReduce is referred to as a Task Tracker (TT)

• MapReduce adopts a pull scheduling strategy rather than a push one
– I.e., the JT does not push map and reduce tasks to TTs; rather, TTs pull them by making requests


[Diagram: the JT holds a queue of tasks T0-T2; each TT has task slots and pulls tasks (T0, T1) from the JT's queue]

Page 54:

Map and Reduce Task Scheduling

• Every TT periodically sends a heartbeat message to the JT encompassing a request for a map or a reduce task to run (see the sketch below)

I. Map task scheduling:
• The JT satisfies requests for map tasks by attempting to schedule mappers in the vicinity of their input splits (i.e., it considers locality)

II. Reduce task scheduling:
• The JT simply assigns the next yet-to-run reduce task to a requesting TT, regardless of the TT's network location and its implied effect on the reducer's shuffle time (i.e., it does not consider locality)
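A sketch of the pull protocol in Python (names and structures are illustrative, not Hadoop's actual classes): a TT heartbeat doubles as a task request; the JT prefers a map task whose split is local to the requester, while reduce tasks are handed out with no locality check.

    def assign_task(tt_host, map_queue, reduce_queue):
        for t in map_queue:                      # prefer a data-local map task
            if tt_host in t["split_locations"]:
                map_queue.remove(t)
                return t
        if map_queue:                            # otherwise any map task
            return map_queue.pop(0)
        if reduce_queue:                         # reduce: locality ignored
            return reduce_queue.pop(0)
        return None

    map_queue = [{"id": "m0", "split_locations": {"node1", "node3"}}]
    reduce_queue = [{"id": "r0"}]
    print(assign_task("node2", map_queue, reduce_queue))  # m0, non-local fallback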


Page 55:

Task Scheduling

Page 56:

Task Scheduling in Hadoop

• A golden principle adopted by Hadoop: "Moving computation towards data is cheaper than moving data towards computation"
– Hadoop applies this principle to Map task scheduling

• With map task scheduling, once a slave (or TaskTracker, TT) polls for a map task, M, at the master node (or JobTracker, JT), the JT attempts to assign the TT an M that has its input data local to that TT

[Diagram: TaskTracker1 requests a map task from the JobTracker across the core and rack switches; the JT schedules a map task (MT1-MT3) at an empty map slot on TaskTracker1, choosing one whose input is local]

Page 57:

Task Scheduling in Hadoop

• Hadoop does not apply the locality principle to Reduce task scheduling

• With reduce task scheduling, once a slave (TT) polls for a reduce task, R, at the master node (JT), the JT assigns the TT any R

[Diagram (CS = core switch, RS = rack switch): TT1 requests a reduce task; the JT assigns R to TT1 and the partitions are shuffled to it – a locality problem, since R is scheduled at TT1 while its partitions exist at TT4]


Page 58:

Fault Tolerance in Hadoop

• Data redundancy
– Achieved at the storage layer through replicas (default is 3)
– Replicas are stored on physically separate machines
– Can tolerate corrupted files and faulty nodes
– HDFS computes checksums for all data written to it, and verifies them when reading

• Task resiliency (task slowdown or failure)
– Monitors to detect faulty or slow tasks
– Replicates tasks


Page 59:

Task Resiliency

• MapReduce can guide jobs toward successful completion even when they run on a large cluster, where the probability of failures increases

• The primary way that MapReduce achieves fault tolerance is by restarting tasks

• If a TT fails to communicate with the JT for a period of time (by default, 1 minute in Hadoop), the JT will assume that the TT in question has crashed
– If the job is still in the map phase, the JT asks another TT to re-execute all Mappers that previously ran at the failed TT
– If the job is in the reduce phase, the JT asks another TT to re-execute all Reducers that were in progress on the failed TT


Page 60:

Speculative Execution

• A MapReduce job is dominated by its slowest task

• MapReduce attempts to locate slow tasks (stragglers) and run redundant (speculative) copies that will, optimistically, commit before the corresponding stragglers

• This process is known as speculative execution

• Only one copy of a straggler is allowed to be speculated

• Whichever copy of a task commits first becomes the definitive copy, and the other copy is killed by the JT

Page 61:

Locating Stragglers

• How does Hadoop locate stragglers?
• Hadoop monitors each task's progress using a progress score between 0 and 1, based on the amount of data processed
• If a task's progress score is less than (average - 0.2) and the task has run for at least 1 minute, it is marked as a straggler (see the sketch below)

[Diagram: over the same elapsed time, task T1 has progress score 2/3 (not a straggler) while task T2 has progress score 1/12 (a straggler)]

Page 62:

Issues with Speculative Execution

• Susceptible in heterogeneous environments
– If there is transient congestion, lots of speculative tasks are launched
– Launches speculative tasks without checking the speed of the TT or the load imposed by the speculative task; a slow TT will become slower

• Locality always trumps progress
– Given 2 speculative tasks ST1 & ST2, with stragglers T1 at 70% and T2 at 20% progress, if the free task slot is local to ST2's HDFS block, ST2 gets scheduled

• The three reduce stages are treated equally
– Yet the shuffle stage is typically slower than the merge & sort and reduce stages

Page 63:

MapReduce Applications

[Diagram: Input Data -> Map (read from local disk or network) -> Shuffled Data (network) -> Reduce -> Output Data (network)]

Data pattern (Shuffle Data / Map Input ratio) | Example
0    | Sobel edge detection
<< 1 | Grep
≈ 1  | Sort
>> 1 | Some IR apps

Page 64:

Grep Example

[Chart: Gantt view of Grep search map & reduce task details – 15 map tasks, each running roughly 8-13 seconds, followed by one reduce task whose shuffling and sorting stages take roughly 35-37 seconds each]

Page 65:

TeraSort Example

[Chart: Gantt view of TeraSort map & reduce task details – 16 map tasks, mostly 9-31 seconds each, followed by one reduce task with a long shuffling stage (~2:54) that dominates sorting (~1:28)]

Page 66:

What Makes MapReduce Popular?

• MapReduce is characterized by:
1. Its simplified programming model, which allows the user to quickly write and test distributed systems
2. Its efficient and automatic distribution of data and workload across machines
3. Its flat scalability curve. Specifically, after a MapReduce program is written and functioning on 10 nodes, very little (if any) work is required to make that same program run on 1000 nodes
4. Its fault-tolerance approach
