Survey of Programming Models for Data Oriented Grid Computing

Douglas Thain, [email protected], University of Notre Dame, 1 November 2007

Page 1: Survey of Programming Models for Data Oriented Grid Computing

Survey of Programming Models for Data Oriented Grid Computing

Douglas Thain
[email protected]

University of Notre Dame
1 November 2007

Page 2: Survey of Programming Models for Data Oriented Grid Computing

Data Oriented Programming Models

Overview and Challenges
Examples of Languages
– DAG Oriented
– Database Oriented
– Abstraction/Pattern Oriented

Current Work at Notre Dame
– Assembly: Chirp
– Abstraction: All Pairs
– Language: DataLab

Ruminations on Language Requirements

Page 3: Survey of Programming Models for Data Oriented Grid Computing

Overview

Survey of models for expressing large data-intensive workloads, typically constructed by assembling sequential components together.

Some challenges are similar to CPU parallelism:
– How does the user express parallelism?
– Can the system discover and exploit parallelism?
– What is the optimal decomposition of a problem?

Some challenges are particular to data:
– System state persists across executions.
– Component behavior is not well specified.
– Bad decisions can result in 1000x slowdown.
– Thus, the goal is usually to avoid awful cases.

Page 4: Survey of Programming Models for Data Oriented Grid Computing

Commonalities

Most data-oriented languages are declarative rather than imperative.
Why is this necessary?
– Enormous number of failure modes.
  User doesn’t want to know the whole ugly story.
– Primitive operations are persistent.
  Batch job has lifetime independent of submitter.
– Probability of coordinator failing is high.
  Need transactions, leases, and logging to recover.
– System bears the responsibility of cleanup.
  Cannot simply Ctrl-C the coordinator cleanly.

Page 5: Survey of Programming Models for Data Oriented Grid Computing

From Previous Talks

Languages
– The system provides a set of primitive operations that the user may combine together in many ways.
– The system may optimize certain cases, but cannot predict all uses, so the programmer must be careful.

Abstractions or Patterns
– The system provides a very restricted interface and the user can only solve problems that fit.
– The system can provide a very good implementation of the restricted case, so the user can be naive.

Obviously, there is a continuum between the two extremes. Most grid languages tend closer to the abstraction side of the graph.

Page 6: Survey of Programming Models for Data Oriented Grid Computing

DAG Oriented Languages

Page 7: Survey of Programming Models for Data Oriented Grid Computing

DAGMan

Douglas Thain and Miron Livny, “Condor and the Grid”, in Berman, Hey, and Fox, “Grid Computing: Making the Global Infrastructure a Reality”, John Wiley, 2003.

[Figure: example DAG in which A precedes B, and B precedes both C and D.]

JOB A a.submit
JOB B b.submit
JOB C c.submit
JOB D d.submit
PARENT A CHILD B
PARENT B CHILD C
PARENT B CHILD D

[Figure: the DAG is submitted to DAGMan, which submits jobs to Condor; Condor executes the jobs on CPUs, and DAGMan follows job status through a log.]
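The ordering rule implied by the PARENT/CHILD lines can be sketched in a few lines of Python. This is only an illustration of dependency-ordered execution, not how DAGMan itself is implemented: the `run_dag` function and its `run` callback are hypothetical names, and jobs run one at a time here rather than being submitted to a batch system.

```python
from collections import deque

def run_dag(jobs, edges, run):
    """Run every job only after all of its parents; return the execution order."""
    children = {j: [] for j in jobs}
    pending_parents = {j: 0 for j in jobs}
    for parent, child in edges:
        children[parent].append(child)
        pending_parents[child] += 1

    # Jobs with no unfinished parents are ready to run.
    ready = deque(j for j in jobs if pending_parents[j] == 0)
    order = []
    while ready:
        job = ready.popleft()
        run(job)                      # stand-in for submitting to a batch system
        order.append(job)
        for child in children[job]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                ready.append(child)
    if len(order) != len(jobs):
        raise ValueError("dependency cycle detected")
    return order

# The DAG from the slide: A before B, B before C and D.
jobs = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("B", "D")]
order = run_dag(jobs, edges, run=lambda job: None)
```

In this sketch C and D become ready at the same moment, which is the point where a real scheduler could run them in parallel.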

Page 8: Survey of Programming Models for Data Oriented Grid Computing

Data Dependencies

[Figure: the same DAG (A, then B, then C and D), with each edge labeled as data passed between jobs.]

Control dependencies are almost always expressible as data dependencies.

If the system is aware of the data interactions, it can protect limited resources, make better scheduling decisions, and be more robust to failures.

Example: Don’t stage out intermediate files, leave them in place for next execution; if lost, re-execute the creator.
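The "if lost, re-execute the creator" rule can be sketched as a small recursive check. This is a toy under stated assumptions: the `ensure` function, the `producers` table, and the `concat_to` job are all hypothetical, and every file in the example has a known creator.

```python
import os
import tempfile

def ensure(path, producers, log):
    """Make sure `path` exists, re-running its creator if it is missing."""
    if os.path.exists(path):
        return
    inputs, create = producers[path]
    for dep in inputs:
        ensure(dep, producers, log)   # recover lost inputs first
    create(path, inputs)
    log.append(os.path.basename(path))

def concat_to(path, inputs):
    # Hypothetical job: copy the inputs, then append this file's own name.
    with open(path, "w") as out:
        for name in inputs:
            with open(name) as src:
                out.write(src.read())
        out.write(os.path.basename(path) + "\n")

workdir = tempfile.mkdtemp()
a, b, c = (os.path.join(workdir, n) for n in ("a.out", "b.out", "c.out"))
producers = {a: ([], concat_to), b: ([a], concat_to), c: ([b], concat_to)}

log = []
ensure(c, producers, log)        # first run creates a.out, b.out, c.out
os.remove(b)                     # simulate losing an intermediate file
os.remove(c)
ensure(c, producers, log)        # only b.out and c.out are re-created
```

Because a.out was left in place, the second call does not re-run its creator; only the lost files are regenerated.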

Page 9: Survey of Programming Models for Data Oriented Grid Computing

BAD-FS

John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny, "Explicit Control in a Batch Aware Distributed File System", NSDI 2004.

[Figure: two pipelines, A-B-C with scratch volume S1 and D-E-F with scratch volume S2; both read the shared INPUT volume and each produces out.dat.]

JOB A a.submit
JOB B b.submit
JOB C c.submit
PARENT A CHILD B
PARENT B CHILD C

VOLUME INPUT source-url
VOLUME S1 scratch
VOLUME S2 scratch
MOUNT INPUT A /data
MOUNT INPUT D /data
MOUNT S1 A /tmp
MOUNT S2 D /tmp
EXTRACT S1 out.dat target-url
EXTRACT S2 out.dat target-url

Page 10: Survey of Programming Models for Data Oriented Grid Computing

Pegasus

Ewa Deelman et al., “Pegasus: Mapping Scientific Workflows onto the Grid”, Scientific Programming Journal, Volume 13, Number 3, 2005.

[Figure: an abstract DAG (jobs A, B, C, D with input in1, temp files between jobs, and outputs out1 and out2) is converted by a single-pass translation into a concrete DAG that is handed to DAGMan. The concrete DAG adds explicit transfers (to gridftp://server/data, gridftp://other/data, gridftp://home/out1, gridftp://home/out2) and concrete paths (/tmp/data1, /tmp/data2).]

Page 11: Survey of Programming Models for Data Oriented Grid Computing

Dryad

M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks”, EuroSys 2007.

[Figure: inputs A and B each pass through grep and are combined by cmp; input C passes through sort; fold combines the two branches into the output.]

Iterative construction in C++:

GraphBuilder X = grep^A;
GraphBuilder Y = grep^B;
GraphBuilder Z = X <= Y;
GraphBuilder S = sort^C;
GraphBuilder F = fold(Z,S);

Page 12: Survey of Programming Models for Data Oriented Grid Computing

Database Oriented Languages

In a given field, many researchers exploit a common toolchain with many different inputs in order to explore a parameter space.

Idea: Represent programs as standard transformations from one data space to another. Store all results in a database.

Virtual Data: User simply performs a query in the target space, and doesn’t care whether results are computed or stored.

Page 13: Survey of Programming Models for Data Oriented Grid Computing

Chimera

Ian Foster, Jens Vöckler, Michael Wilde, Yong Zhao, “Chimera: A Virtual Data System For Representing, Querying, and Automating Data Derivation”, 2002.

Transformation Database:

TR simulate( input p, output a ) {
    exec "sim.exe -temp $p >$a";
}

TR analyze( input b, output c ) {
    exec "analyze.exe $b >$c";
}

TR runexpt( input p, output d ) {
    file temp;
    simulate(p,temp);
    analyze(temp,d);
}

Note: This is not the real Chimera syntax; it has been simplified for clarity.

Derivation Database:

DV runexpt( 10, file )
DV runexpt( 20, file )
DV runexpt( 30, file )

DV runexpt( 10, file ) would return existing data.
DV runexpt( 15, file ) would execute code and then return the data.
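The virtual data idea, where a query either returns a stored result or triggers the computation, can be sketched as a memoizing store. This is a toy model and not Chimera's interface: the `VirtualData` class and the arithmetic stand-ins for sim.exe and analyze.exe are hypothetical.

```python
class VirtualData:
    """Toy derivation store: compute a result once, then serve it from storage."""

    def __init__(self):
        self.transformations = {}   # name -> function
        self.derivations = {}       # (name, args) -> stored result
        self.executions = 0         # how many times code actually ran

    def define(self, name, func):
        self.transformations[name] = func

    def derive(self, name, *args):
        key = (name, args)
        if key not in self.derivations:
            self.derivations[key] = self.transformations[name](*args)
            self.executions += 1
        return self.derivations[key]

def simulate(p):
    return [p * i for i in range(3)]   # placeholder for sim.exe

def analyze(data):
    return sum(data)                   # placeholder for analyze.exe

vd = VirtualData()
vd.define("runexpt", lambda p: analyze(simulate(p)))

for p in (10, 20, 30):
    vd.derive("runexpt", p)            # populates the derivation store
stored = vd.derive("runexpt", 10)      # already stored: no execution
fresh = vd.derive("runexpt", 15)       # not stored: executes, then returns
```

The caller sees the same interface in both cases; only the `executions` counter reveals whether code ran.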

Page 14: Survey of Programming Models for Data Oriented Grid Computing

GridDB

David Liu, Michael Franklin, “GridDB: A Data Centric Overlay for Scientific Grids”, VLDB 2004.

Same NSF project, except:
– SQL is the interaction language.
– Separate pushing of inputs from output query.

[Figure: INSERT into an input table, sim.exe computes, SELECT from an output table.]
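The separation between pushing inputs and querying outputs can be sketched with Python's built-in sqlite3 module. This is not GridDB's schema or API: the table names, the `process_pending` function, and the arithmetic stand-in for sim.exe are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sim_input (p INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE sim_output (p INTEGER PRIMARY KEY, result REAL)")

def sim(p):
    return p * 0.5          # placeholder for sim.exe

def process_pending():
    """Run the simulation for every input row that has no output row yet."""
    rows = db.execute(
        "SELECT p FROM sim_input WHERE p NOT IN (SELECT p FROM sim_output)"
    ).fetchall()
    for (p,) in rows:
        db.execute("INSERT INTO sim_output VALUES (?, ?)", (p, sim(p)))

# The user pushes inputs with INSERT ...
db.executemany("INSERT INTO sim_input VALUES (?)", [(10,), (20,)])
process_pending()           # ... the system computes in between ...
# ... and the user later reads results with SELECT.
results = db.execute("SELECT p, result FROM sim_output ORDER BY p").fetchall()
```

Calling `process_pending` a second time does nothing, since every input row already has a matching output row.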

Page 15: Survey of Programming Models for Data Oriented Grid Computing

Swift

(Run or) reorientRun (Run ir, string dir, string ov ) {
    foreach Volume iv, i in ir.v {
        or.v[i] = reorient(iv,dir,ov);
    }
}

Y. Zhao et al., “Swift: Fast, Reliable, Loosely Coupled Parallel Computation”, IEEE International Workshop on Scientific Workflows, 2007.

Derived from Chimera, with three key differences:
- Much improved syntax (IMHO).
- Complex data types.
- Full program text and state stored in file system.

Page 16: Survey of Programming Models for Data Oriented Grid Computing

Abstraction/Pattern Languages

A single system structure is suitable for solving a wide variety of problems.

Often, the code for the system structure is far more complicated than the application.

Solution: Let the user provide a few snippets of code to embed in a larger class or pattern.

Page 17: Survey of Programming Models for Data Oriented Grid Computing

Master-Worker

[Figure: the MW master adds work units to a work queue and hands work assignments to a pool of workers; the workers send completed results back to the master.]

Used to attack brute-force optimization problems. 100,000s of CPUs in BOINC, Folding@Home, etc.

Page 18: Survey of Programming Models for Data Oriented Grid Computing

Master-Worker

Goux et al., “An enabling framework for Master-Worker applications on the Computational Grid”, HPDC 2000.

void master() {
    queueWorkUnit( base_case );
    while( r = getNextResult() ) {
        if( appl condition ) {
            queueWorkUnit( more );
        } else {
            printResult(r);
        }
    }
}

void worker() {
    while(1) {
        u = getNextWorkUnit();
        r = application work;
        transmitResult(r);
    }
}

Implemented on Condor/Condor-G using PVM/files/sockets for communication.
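For readers who want something runnable, the same master and worker loops can be sketched with Python threads and queues. This is a single-machine illustration, not the MW framework: squaring a number stands in for the application work, `unit < limit` stands in for the application condition, and a `None` sentinel is an added convention for shutting workers down.

```python
import queue
import threading

work_units = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        unit = work_units.get()
        if unit is None:                    # sentinel: no more work
            return
        results.put((unit, unit * unit))    # placeholder for application work

def master(base_case, limit, num_workers=4):
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    work_units.put(base_case)
    outstanding = 1                         # work units queued but not yet returned
    finished = {}
    while outstanding:
        unit, value = results.get()
        outstanding -= 1
        finished[unit] = value
        if unit < limit:                    # application condition: queue more work
            work_units.put(unit + 1)
            outstanding += 1

    for _ in threads:                       # tell every worker to exit
        work_units.put(None)
    for t in threads:
        t.join()
    return finished

finished = master(base_case=1, limit=5)
```

The master keeps its own count of outstanding work units so that it knows when to stop, a detail the slide's `while( r = getNextResult() )` loop leaves to the framework.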

Page 19: Survey of Programming Models for Data Oriented Grid Computing

Map-Reduce

Sample Application:
Identify all unique nouns and verbs in 1M documents

[Figure: each doc feeds a map task that emits nouns and verbs; one reduce task produces the unique nouns and another the unique verbs. Inputs: (file,word); intermediates: (word,count); output: (word,count).]

Page 20: Survey of Programming Models for Data Oriented Grid Computing

Map-Reduce

Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004.

class MyMR : MapReduce {
    void map( Enum tuples ) {
        foreach (k,v) in tuples {
            kind = NounOrVerb(v);
            EmitIntermediate(kind,v);
        }
    }

    void reduce( Key key, Enum vals ) {
        total = 0;
        foreach v in vals { total++; }
        Emit( key, total );
    }
};

Implemented on the Google infrastructure, hiding problems such as data co-location, failure, stragglers.
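The sample application can be sketched as a sequential Python program with the same map, group, and reduce phases. This is not Google's API: `map_reduce`, `map_doc`, and `reduce_kind` are hypothetical names, the tiny `LEXICON` stands in for a real noun/verb classifier, and the reducer here collects unique words rather than counting them.

```python
from collections import defaultdict

# Hypothetical stand-in for a real part-of-speech classifier.
LEXICON = {"dog": "noun", "cat": "noun", "runs": "verb", "sleeps": "verb"}

def map_doc(name, text):
    """Emit (kind, word) for each recognized word in one document."""
    for word in text.split():
        kind = LEXICON.get(word)
        if kind is not None:
            yield kind, word

def reduce_kind(kind, words):
    """Collapse all words of one kind into a sorted list of unique words."""
    return kind, sorted(set(words))

def map_reduce(docs, mapper, reducer):
    groups = defaultdict(list)
    for name, text in docs.items():        # map phase (parallel in a real system)
        for key, value in mapper(name, text):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reducer(k, v) for k, v in groups.items())  # reduce phase

docs = {"doc1": "dog runs", "doc2": "cat sleeps dog", "doc3": "the cat runs"}
unique = map_reduce(docs, map_doc, reduce_kind)
```

The user writes only `map_doc` and `reduce_kind`; everything in `map_reduce` is the part a real system would distribute and make fault tolerant.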

Page 21: Survey of Programming Models for Data Oriented Grid Computing

All-Pairs Image Comparison

[Figure: F compares every pair of images and fills a matrix of results (upper triangle shown):]

1   .8  .1  0   0   .1
    1   0   .1  .1  0
        1   0   .1  .3
            1   0   0
                1   .1
                    1

Current Workload:
4000 images
256 KB each
10s per F

Future Workload:
60000 images
1 MB each
1s per F
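A back-of-the-envelope calculation shows the scale of these two workloads. The sketch assumes the full n by n matrix is computed (a symmetric F would roughly halve the work), perfect speedup with no overhead, and the 500 CPUs mentioned elsewhere in the talk; the `workload` function is a hypothetical helper.

```python
def workload(images, bytes_each, seconds_per_f, cpus=500):
    comparisons = images * images                 # full n x n matrix
    cpu_seconds = comparisons * seconds_per_f
    return {
        "comparisons": comparisons,
        "data_gb": images * bytes_each / 1e9,     # total size of the image set
        "cpu_years": cpu_seconds / (365 * 24 * 3600),
        "days_on_cpus": cpu_seconds / cpus / (24 * 3600),
    }

current = workload(4000, 256 * 1024, 10)
future = workload(60000, 1024 * 1024, 1)

for name, w in (("current", current), ("future", future)):
    print(name, {k: round(v, 2) for k, v in w.items()})
```

Under these assumptions the current workload is 16 million comparisons over about 1 GB of data, roughly 5 CPU-years or under 4 days on 500 CPUs; the future workload is 3.6 billion comparisons over about 63 GB, roughly 114 CPU-years or about 83 days on 500 CPUs.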

Page 22: Survey of Programming Models for Data Oriented Grid Computing

Non-Expert User Using 500 CPUs

Try 1: Each F is a batch job.
Failure: Dispatch latency >> F runtime.

Try 2: Each row is a batch job.
Failure: Too many small ops on FS.

Try 3: Bundle all files into one package.
Failure: Everyone loads 1GB at once.

Try 4: User gives up and attempts to solve an easier or smaller problem.

[Figure, one panel per attempt: HN dispatching F jobs to the CPUs.]

Page 23: Survey of Programming Models for Data Oriented Grid Computing

All Pairs Abstraction

Chris Moretti, Jared Bulosan, Douglas Thain, and Patrick Flynn, “All-Pairs: An Abstraction for Data Intensive Computing”, under review, 2007.

Invocation: M = AllPairs(F,S), where S is a set of files and F is a binary function.

[Figure: F applied to every pair of members of S to fill the matrix M.]
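The interface itself is small enough to state as a Python function. This sketch shows only what AllPairs computes, sequentially and in memory; it says nothing about how the production engine partitions or distributes the work. The `similarity` function is a hypothetical stand-in for an image comparison.

```python
def all_pairs(F, S):
    """Return matrix M with M[i][j] = F(S[i], S[j])."""
    return [[F(a, b) for b in S] for a in S]

def similarity(a, b):
    # Hypothetical comparison: fraction of positions where the strings agree.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

S = ["abcd", "abcf", "wxyz"]
M = all_pairs(similarity, S)
```

Because the user supplies only F and S, the system is free to choose batch sizes and data placement, which is exactly the set of decisions the non-expert attempts got wrong.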

Page 24: Survey of Programming Models for Data Oriented Grid Computing

All Pairs Production System

1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.

[Figure: a web portal holding functions (F, G, H) and sets (S, T) feeds the All-Pairs engine, which drives 300 active storage units with 500 CPUs and 40TB of disk.]

Page 25: Survey of Programming Models for Data Oriented Grid Computing

Initial Results on Real Workload

Page 26: Survey of Programming Models for Data Oriented Grid Computing

Current Work on Programming Active Storage

Page 27: Survey of Programming Models for Data Oriented Grid Computing

Layers of Language Design

A programming environment consists of several layers of concepts:
– Assembly language: A fundamental set of operations that define and constrain the possible programs. (load, store, add...)
– Abstractions: Groupings of operations that express the most common idioms employed by end users. (stacks, functions, arrays)
– Language: A concrete syntax that compactly represents the abstractions of the language. a[x]*f(x);

Page 28: Survey of Programming Models for Data Oriented Grid Computing

Assembly Language

Array of active storage servers that combine basic data storage with remote execution.

Data Operations:
– open, read, write, close, getdir, unlink, stat, ...
– getacl, setacl, getfile, putfile

CPU Operations:
– job_begin: create a new job, return the txn #
– job_commit: enable the job to execute
– job_wait: wait for the job to reach a final state
– job_kill: force the job into a final state
– job_remove: remove state associated with the job

Using our own implementation gives us very precise control over the system semantics.
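The job lifecycle described by the CPU operations can be modeled as a small state machine. This is an in-memory toy, not the Chirp implementation: the operation names come from the slide, but the `ToyJobServer` class, the state names (BEGUN, COMMITTED, COMPLETE, KILLED), and running the job synchronously inside `job_wait` are all assumptions.

```python
import itertools

class ToyJobServer:
    """In-memory model of the job operations; not the Chirp implementation."""

    FINAL = {"COMPLETE", "KILLED"}

    def __init__(self):
        self._next_id = itertools.count(1)
        self.jobs = {}

    def job_begin(self, command):
        job_id = next(self._next_id)
        self.jobs[job_id] = {"command": command, "state": "BEGUN", "result": None}
        return job_id                       # the transaction number

    def job_commit(self, job_id):
        if self.jobs[job_id]["state"] == "BEGUN":
            self.jobs[job_id]["state"] = "COMMITTED"   # safe to repeat on retry

    def job_wait(self, job_id):
        job = self.jobs[job_id]
        if job["state"] == "COMMITTED":
            job["result"] = job["command"]()           # run synchronously here
            job["state"] = "COMPLETE"
        return job["state"], job["result"]

    def job_kill(self, job_id):
        if self.jobs[job_id]["state"] not in self.FINAL:
            self.jobs[job_id]["state"] = "KILLED"

    def job_remove(self, job_id):
        del self.jobs[job_id]

server = ToyJobServer()
txn = server.job_begin(lambda: 6 * 7)
server.job_commit(txn)
server.job_commit(txn)                      # a retried commit changes nothing
state, result = server.job_wait(txn)
server.job_remove(txn)
```

Splitting creation (job_begin) from activation (job_commit) means a coordinator that crashes between the two leaves behind a job that never ran, and a retried commit is harmless.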

Page 29: Survey of Programming Models for Data Oriented Grid Computing

http://www.cse.nd.edu/~ccl/viz

Page 30: Survey of Programming Models for Data Oriented Grid Computing

Abstractions

[Figure: three abstractions built on chirp servers, each of which exports a unix filesys:
– file system: tcsh, emacs, and perl reach the chirp servers through parrot.
– distributed data structures: a set S whose members A, B, C are spread across chirp servers; a file F.
– function evaluation: Y = F(X), carried out with job_start, job_commit, job_wait, job_remove.]

Page 31: Survey of Programming Models for Data Oriented Grid Computing

Language Syntax: DataLab

apply F on S into T

[Figure: set S (members A, B, C) is spread across chirp servers; F runs on each server to produce the corresponding member of set T.]
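One reading of `apply F on S into T` is that F is evaluated where each member of S is stored, and each output stays on the same server. The sketch below models that reading with nested dictionaries; it is not DataLab's implementation, and the `apply` function, the server names, and the use of `str.upper` as F are hypothetical.

```python
def apply(F, S):
    """Evaluate F on every member of S, keeping each output on the same server."""
    T = {}
    for server, files in S.items():
        # In a real system this loop body would run as a job on `server`.
        T[server] = {name: F(data) for name, data in files.items()}
    return T

# Set S: members A, B, C, each stored on a different server.
S = {
    "server1": {"A": "alpha"},
    "server2": {"B": "beta"},
    "server3": {"C": "gamma"},
}
T = apply(str.upper, S)
```

Under this reading, no member of S ever crosses the network; only the function F has to be shipped to each server.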

Page 32: Survey of Programming Models for Data Oriented Grid Computing

Ruminations

What is unique about a programming language for large scale data intensive computing?
– Manipulates remote persistent state.
– Likely to compete with others for resources.
– Encounters an insane set of failure modes.

Two very distinct purposes:
– Constructing new kinds of systems with novel concurrency and data access patterns? (Imperative)
– Harnessing existing systems within certain well known patterns of interactions? (Declarative)

Either way, need to choose the assembly language very carefully!

Page 33: Survey of Programming Models for Data Oriented Grid Computing

Properties of Assembly Language

Need a transactional interface for manipulating remote persistent state.
– Recover from network failures.
– Recover from coordinator failure.
– Precise cancellation of long-running ops.

Persistent storage for program state.
– Need a place to store transaction #s.
– Allows for fast failure recovery without scanning all participants. (i.e. avoid fsck.)
– Simplifies debugging, monitoring, auditing.

Precise semantics under failure conditions.
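The point about storing transaction numbers can be sketched as an append-only journal that a restarted coordinator replays. This is a minimal illustration under stated assumptions: the `Journal` class, the JSON line format, and the "begun"/"done" status values are hypothetical.

```python
import json
import os
import tempfile

class Journal:
    """Append-only record of transaction numbers, kept on persistent storage."""

    def __init__(self, path):
        self.path = path

    def record(self, txn, status):
        with open(self.path, "a") as f:
            f.write(json.dumps({"txn": txn, "status": status}) + "\n")
            f.flush()
            os.fsync(f.fileno())        # make the entry durable before moving on

    def unfinished(self):
        """Replay the journal and return transactions with no 'done' entry."""
        latest = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    latest[entry["txn"]] = entry["status"]
        return sorted(t for t, s in latest.items() if s != "done")

path = os.path.join(tempfile.mkdtemp(), "coordinator.log")
journal = Journal(path)
journal.record(101, "begun")
journal.record(102, "begun")
journal.record(101, "done")

# A restarted coordinator reads the same file instead of scanning every server.
recovered = Journal(path).unfinished()
```

After a crash, the coordinator needs to contact only the servers holding the transactions listed in `recovered`, which is the "avoid fsck" property named above.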

Page 34: Survey of Programming Models for Data Oriented Grid Computing

Discussion Topics?

Assertion: Getting the semantics of the assembly language right is more important than getting the syntax of the language correct. ???

Creating robust algorithms is too much to ask of the end user. Therefore: declarative for end users, imperative for system builders. ???

Creating robust algorithms is too hard to solve in the general case. Therefore: expose sophisticated controls that allow the end user to make the right decisions. ???