
Page 1

Compute Grid: Parallel Processing

RCS Lunch & Learn Training Series

Bob Freeman, PhD, Director, Research Technology Operations

HBS

8 November 2017

Page 2

Overview
• Q&A
• Introduction
• Serial vs parallel
• Approaches to parallelization
• Submitting parallel jobs on the compute grid
• Parallel tasks
• Parallel code

Page 3

Serial vs Parallel Work

Page 4

Traditionally, software has been written for serial computers:
• To be run on a single computer having a single Central Processing Unit (CPU)
• The problem is broken into a discrete set of instructions
• Instructions are executed one after the other
• Only one instruction can be executed at any moment in time

Serial vs Multicore Approaches

Page 5

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts (either by you or by the application itself) that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs or different machines

Serial vs Multicore Approaches

Page 6

There are many different parallelization approaches, which we won't discuss:
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory

Serial vs Multicore Approaches

Page 7

So we are going to briefly touch on two approaches:

• Parallel tasks
  • Tasks in the background
  • gnu_parallel
  • Pleasant parallelization

• Parallel code
  • Considerations for parallelizing
  • Parallel frameworks & examples

We will not discuss parallelized frameworks such as Hadoop, Apache Spark, MongoDB, ElasticSearch, etc.

Parallel Processing…

Page 8

Nota Bene!!
• In order to run in parallel, programs (code) must be explicitly programmed to do so.
• And you must ask the scheduler to reserve those cores for your program/work to use.

Thus, requesting cores from the scheduler does not automagically parallelize your code!

# SAMPLE JOB FILE
#!/bin/bash
#BSUB -q normal   # Queue to submit to (comma separated)
#BSUB -n 8        # Number of cores
...
blastn -query seqs.fasta -db nt -out seqs.nt.blastn                                        # WRONG!!
blastn -query seqs.fasta -db nt -out seqs.nt.blastn -num_threads $LSB_MAX_NUM_PROCESSORS  # YES!!

# SAMPLE PARALLELIZED CODE
bsub -q normal -n 4 -W 24:00 -R "rusage[mem=4000]" stata-mp4 -d myfile.do

# SAMPLE PARALLEL TASKS
bsub -q normal -n 4 -W 24:00 -R "rusage[mem=4000]" \
    parallel --joblog .log --files -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt

# SAMPLE PLEASANT PARALLELIZATION
for file in folder/*.txt; do
    echo $file
    bsub -q normal -W 24:00 -R "rusage[mem=1000]" python process_input_data.py $file
done

Parallel Jobs on the Compute Grid…

Page 9

Parallel Tasks

Page 10

Shells, by default, have the ability to multitask: doing more than one thing at a time. In bash, this can be accomplished by sending a command to the background (see the sketch below):
• Explicitly, with &
• After the fact, with ^Z and bg

When you put a task in the background:
• The task keeps running while you continue to work at the shell in the foreground
• If any output is produced, it appears on your screen immediately
• If input is required, the process prints a message and stops
• When it is done, a message will be printed
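As an illustration (not from the slides; long_analysis.sh is a placeholder script name), the two routes to the background look like this in bash:

# start a task in the background explicitly; the shell prints its job number and PID
./long_analysis.sh &
# or: start it in the foreground, press ^Z to suspend it, then resume it in the background
./long_analysis.sh
# ^Z
bg          # resume the suspended job in the background
jobs        # list background jobs
fg %1       # bring job 1 back to the foreground if needed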

Background Tasks

From Processes & Job Control: http://slideplayer.com/slide/4592906/

Page 11


GNU parallel is a shell tool for executing jobs in parallel using one or more computers.
• A single command or small script that has to be run for each of the lines in the input
• Typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables
• Many options for working with control and output of results
• Can specify the degree of parallelization

# create list of files to unzip

for index in `seq 1 100`; do echo "unzip myfile$index.zip" >> tasklist.txt; done

# Ask the compute cluster to do this for me in parallel, using 4 CPUs/cores
bsub -q normal -n 4 -W 2:00 -R "rusage[mem=4000]" \
    parallel --joblog .log --files -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt
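For reference (not on the slide): parallel reads its task list from a file with :::: and takes arguments directly on the command line with :::. A quick local equivalent of the loop above, assuming the zip files already exist in the current directory, would be:

# run up to 4 unzips at a time on the current machine
parallel -j 4 unzip ::: myfile{1..100}.zip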

GNU Parallel Approach

Page 12

Problem: How do I BLAST 200,000 transcripts against NR?
Solution: Fake a parallel BLAST. But how?
• Divide your input file into n separate files
• BLAST each smaller input file on a separate core
• Running on n cores will be almost exactly n times faster!

Why?
• The cores don't need to talk to one another
• You could submit n jobs individually, but this is not recommended
• Use more sophisticated techniques: job arrays, gnu_parallel, GridRunner
• This shouldn't be confused with truly parallel mpiBLAST

The efficiency of your work depends on how parallelized you make your task:
• You want to ensure that your jobs spend most of their time computing, and not in the queue or doing compute prep

[Diagram: the overhead steps (module load, schedule, BLAST, job finish) for one large job versus the same steps repeated x100 for many small jobs. What would you choose?]

Concept of Pleasant Parallelization

Page 13

• Split the input file into N files that each run 1 to 6 hrs
• Can be done with a perl or python script, unix split, etc.
• The user script parses the data file whose name is passed as the command parameter

for file in my*.dat
do
    echo $file
    bsub -q normal -W 6:00 -R "rusage[mem=1000]" \
        python process_data_file.py $file
    sleep 1
done

For advanced users, one can submit this as one job in a job array, a feature on most schedulers.

# create script for job array (process_data_file_array.py)
# and now submit file as job array
num_files=$(ls -1 my*.dat | wc -l)
bsub -J myarray[1-$num_files] -q normal -W 6:00 -R "rusage[mem=1000]" \
    python process_data_file_array.py
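For each array element, LSF sets $LSB_JOBINDEX to that element's index, and the script needs some way to map that index to its input file. One hedged way to wire this up (the file list and quoting scheme are illustrative, not from the slide):

# build a list of inputs, then let each array element pick its own line by index
ls -1 my*.dat > filelist.txt
num_files=$(wc -l < filelist.txt)
bsub -J myarray[1-$num_files] -q normal -W 6:00 -R "rusage[mem=1000]" \
    'python process_data_file_array.py $(sed -n "${LSB_JOBINDEX}p" filelist.txt)'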

This process is ideal for serially numbered files, parameter sweeps, & optimization routines!!

Manual (Script) Approach

Page 14

Parallel Code

Page 15

Can my code be parallelized?

• Does it have large loops that repeat the same operations? (see the sketch below)
• Does your code do multiple tasks that are not dependent on one another? If so, is the dependency weak?
• Can any dependencies or information sharing be overlapped with computation? If not, is the amount of communication small?
• Do multiple tasks depend on the same data?
• Does the order of operations matter? If so, how strict does it have to be?
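A quick way to see the difference (an illustrative sketch, not from the slides; process.sh and the stepN.sh scripts are placeholders): loop iterations that don't feed each other can be fanned out, while a chain of dependent steps cannot.

# parallelizable: each iteration touches only its own input and output
for f in data_*.csv; do
    ./process.sh "$f" > "${f%.csv}.out" &    # independent, so safe to run concurrently
done
wait

# not (easily) parallelizable: each step consumes the previous step's output
./step1.sh            > stage1.out
./step2.sh stage1.out > stage2.out            # depends on step1
./step3.sh stage2.out > final.out             # depends on step2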

Page 16

Basic guidance for efficient parallelization:

• Is it even worth parallelizing my code?
  • Does your code take an intractably long amount of time to complete?
  • Do you run a single large model, or do statistics on multiple small runs?
  • Would the amount of time it takes to parallelize your code be worth the gain in speed?
• Parallelizing established code vs. starting from scratch
  • Established code: may be easier/faster to parallelize, but may not give good performance or scaling
  • Starting from scratch: takes longer, but will give better performance and accuracy, and gives the opportunity to turn a "black box" into code you understand

Page 17

Basic guidance for efficient parallelization:

• Increase the fraction of your program that can be parallelized. Identify the most time-consuming parts of your program and parallelize them. This could require modifying your intrinsic algorithm and your code's organization. (See the note below.)
• Balance the parallel workload
• Minimize time spent in communication
• Use simple arrays instead of user-defined derived types
• Partition data: distribute arrays and matrices, allocating specific memory for each MPI process
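To make the first point concrete (not from the slides, but the standard way to quantify it): Amdahl's law bounds the achievable speedup by the serial fraction of the program,

    speedup(N) = 1 / ((1 - P) + P / N)

where P is the fraction of the runtime that can be parallelized and N is the number of cores. With P = 0.90, for example, even an unlimited number of cores gives at most a 10x speedup, which is why increasing P matters more than adding cores.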

Page 18

Designing parallel programs - partitioning:


One of the first steps in designing a parallel program is to break the problem into discrete “chunks” that can be distributed to multiple parallel tasks.

Domain Decomposition: the data associated with the problem is partitioned, and each parallel task works on a portion of the data.

There are different ways to partition the data.
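On the grid, the simplest form of domain decomposition is to split the input data and give each chunk its own task, using the unix split mentioned earlier. A minimal sketch (file names and chunk size are assumptions, not from the slides):

# split the data into chunks of 100,000 lines: chunk_aa, chunk_ab, ...
split -l 100000 big_input.txt chunk_
# each chunk becomes an independent piece of work
for c in chunk_*; do
    bsub -q normal -W 4:00 -R "rusage[mem=1000]" python process_input_data.py "$c"
done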

Page 19

Designing parallel programs - partitioning:


One of the first steps in designing a parallel program is to break the problem into discrete “chunks” that can be distributed to multiple parallel tasks.

Functional Decomposition: Problem is decomposed according to the work that must be done. Each parallel task performs a fraction of the total computation.
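A grid-level sketch of functional decomposition (the script names are placeholders, not from the slides): instead of splitting the data, split the kinds of work and run them as separate concurrent jobs against the same input.

# different analyses of the same data, submitted as independent jobs
bsub -q normal -W 4:00 -R "rusage[mem=2000]" python summarize_stats.py full_data.csv
bsub -q normal -W 4:00 -R "rusage[mem=2000]" python fit_model.py       full_data.csv
bsub -q normal -W 4:00 -R "rusage[mem=2000]" python make_figures.py    full_data.csv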

Page 20

Designing parallel programs - communication:


Most parallel applications require tasks to share data with each other.
• Cost of communication: computational resources are used to package and transmit data. Communication frequently requires synchronization, so some tasks will wait instead of doing work, and it can saturate network bandwidth.
• Latency vs. bandwidth: latency is the time it takes to send a minimal message between two tasks; bandwidth is the amount of data that can be communicated per unit of time. Sending many small messages can cause latency to dominate communication overhead.
• Synchronous vs. asynchronous communication: synchronous communication is referred to as blocking, because other work stops until the communication is completed. Asynchronous communication is referred to as non-blocking, since other work can be done while the communication is taking place.
• Scope of communication: point-to-point communication is data transmission between two tasks; collective communication involves all tasks (in a communication group).
This is only a partial list of things to consider!

Page 21

Designing parallel programs – load balancing:


Load balancing is the practice of distributing approximately equal amounts of work so that all tasks are kept busy all the time.

How to achieve load balance?
• Equally partition the work given to each task: for array/matrix operations, equally distribute the data set among the parallel tasks; for loop iterations where the work done in each iteration is equal, evenly distribute the iterations among the tasks.
• Use dynamic work assignment: certain classes of problems result in load imbalance even if data is distributed evenly among tasks (sparse matrices, adaptive grid methods, many-body simulations, etc.). Use a scheduler / task-pool approach (see the sketch below): as each task finishes, it queues to get a new piece of work. Modify your algorithm to handle imbalances dynamically.
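On the compute grid, the task-pool idea is essentially what GNU parallel already provides with a task list: it hands the next line of tasklist.txt to whichever slot finishes first, so uneven task lengths balance themselves out. Reusing the earlier example as a sketch:

# dynamic work assignment: whichever of the allocated cores is free takes the next task
bsub -q normal -n 4 -W 2:00 -R "rusage[mem=4000]" \
    parallel -j \$LSB_MAX_NUM_PROCESSORS :::: tasklist.txt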

Page 22

Designing parallel programs – I/O:


The Bad News:
• I/O operations are inhibitors of parallelism
• I/O operations are orders of magnitude slower than memory operations
• Parallel file systems may be immature or not available on all systems
• I/O that must be conducted over a network can cause severe bottlenecks

The Good News:
• Parallel file systems are available (e.g., Lustre)
• The MPI parallel I/O interface has been available since 1996 as part of MPI-2

I/O Tips:
• Reduce overall I/O as much as possible
• If you have access to a parallel file system, use it
• Writing large chunks of data rather than small ones is significantly more efficient
• Fewer, larger files perform much better than many small files
• Have a subset of the parallel tasks perform the I/O instead of using all tasks, or
• Confine I/O to a single task and then broadcast (gather) data to (from) the other tasks

Page 23

• C/C++
• Fortran
• MATLAB
• Python
• R
• Perl
• Julia
• Scala
• ….

Languages that Use Parallel Computing

Page 24

• By default, R, Python, Perl, and MATLAB* are not multithreaded… so do not ask for or try to use more than 1 core/CPU!!

• For all these programs, you cannot use the drop-down GUI menus, and you must set the # of CPUs/cores dynamically! DO NOT USE STATIC VALUES!

• For R, you can use appropriate routines with R parallel
  • Now part of base R
  • Includes R foreach, R doMC, or R snow

• For Python, you can use the multiprocessing library (or many others)
• For Perl, there's threads or Parallel::ForkManager
• MATLAB has parpool; do not set the worker thread count in the GUI settings

# R example (parallel.R)
library(parallel)   # provides mclapply
library(doMC)
mclapply(seq_len(), run2, mc.cores = as.integer(Sys.getenv('LSB_MAX_NUM_PROCESSORS')))

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel.R   # custom submission command

# MATLAB example (parallel.m)
hPar = parpool( 'local' , str2num( getenv('LSB_MAX_NUM_PROCESSORS') ) );
…

matlab-5g -n4 parallel.m   # uses command-line wrapper

Parallel Options in R, Python, & MATLAB

See more info on our website at http://grid.rcs.hbs.org/parallel-processing

Page 25

Stata/MP Performance Report Summary (1)

1 Summary

Stata/MP1 is the version of Stata that is programmed to take full advantage of multicore and multiprocessor computers. It is exactly like Stata/SE in all ways except that it distributes many of Stata's most computationally demanding tasks across all the cores in your computer and thereby runs faster, much faster.

In a perfect world, software would run 2 times faster on 2 cores, 3 times faster on 3 cores, and so on. Stata/MP achieves about 75% efficiency. It runs 1.7 times faster on 2 cores, 2.4 times faster on 4 cores, and 3.2 times faster on 8 cores (see figure 1). Half the commands run faster than that. The other half run slower than the median speedup, and some of those commands are not sped up at all, either because they are inherently sequential (most time-series commands) or because they have not been parallelized (graphics, mixed).

In terms of evaluating average performance improvement, commands that take longer to run, such as estimation commands, are of greater importance. When estimation commands are taken as a group, Stata/MP achieves an even greater efficiency of approximately 85%. Taken at the median, estimation commands run 1.9 times faster on 2 cores, 3.1 times faster on 4 cores, and 4.1 times faster on 8 cores. Stata/MP supports up to 64 cores.

This paper provides a detailed report on the performance of Stata/MP. Command-by-command performance assessments are provided in section 8.

[Figure 1. Performance of Stata/MP: speed on multiple cores relative to speed on a single core, for 1, 2, 4, and 8 cores. The possible performance region lies between the theoretical upper bound and the lower bound (no improvement); curves show median performance for estimation commands, median performance for all commands, and logistic regression.]

1. Support for this effort was partially provided by the U.S. National Institutes of Health, National Institute on Aging grants 1R43AG019542-01A1, 2R44AG019542-02, and 5R44AG019542-03. We also thank the Cornell Institute for Social and Economic Research (CISER) at Cornell University for graciously providing access to several highly parallel SMP platforms. CISER staff, in particular John Abowd, Kim Burlingame, Janet Heslop, and Lars Vilhuber, were exceptionally helpful in scheduling time and helping with configuration. The views expressed here do not necessarily reflect those of any of the parties thanked above.

Revision 3.0.1 30jan2016

Stata offers a 293-page report on its parallelization efforts. They are pretty impressive. However:

Example: Stata Parallelization

With multiple cores, one might expect to achieve the theoretical upper bound of doubling the speed by doubling the number of cores: 2 cores run twice as fast as 1, 4 run twice as fast as 2, and so on. However, there are three reasons why such perfect scalability cannot be expected: 1) some calculations have parts that cannot be partitioned into parallel processes; 2) even when there are parts that can be partitioned, determining how to partition them takes computer time; and 3) multicore/multiprocessor systems only duplicate processors and cores, not all the other system resources.

Stata/MP achieved 75% efficiency overall and 85% efficiency among estimation commands.

Speed is more important for problems that are quantified as large in terms of the size of the dataset or some other aspect of the problem, such as the number of covariates. On large problems, Stata/MP with 2 cores runs half of Stata's commands at least 1.7 times faster than on a single core. With 4 cores, the same commands run at least 2.4 times faster than on a single core.

This parallelization benefit is mostly realized in batch mode… most of interactive Stata is waiting for user input (or left idle), as CPU efficiency is typically < 5-10%.

Page 26

• By default, R, Python, Perl, and MATLAB* are not multithreaded… so do not ask for or try to use more than 1 core/CPU!!

• For R, you can use appropriate routines with R parallel
  • Now part of base R
  • Includes R foreach, R doMC, or R snow
  • multicore in base R enables parallelization through the apply() functions, but will not work on Windows systems due to how the parallelization is achieved (no fork())

# R example (parallel.R)
library(parallel)   # provides mclapply
library(doMC)
mclapply(seq_len(), run2, mc.cores = as.integer(Sys.getenv('LSB_MAX_NUM_PROCESSORS')))

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel.R   # custom submission command

Parallel Processing in R

See more info on our website at http://grid.rcs.hbs.org/parallel-r

Page 27

# library(parallel): snow single-node parallel cluster

library(parallel)
# makeCluster() wraps the makeSOCKcluster() function and launches the specified number
# of R processes on the local machine
cluster <- makeCluster(as.integer(Sys.getenv('LSB_MAX_NUM_PROCESSORS')))
# one must explicitly make vars/functions available in the sub-processes
clusterExport(cluster, list('myProc'))
# now run the work, collect the results, and shut the cluster down
result <- clusterApply(cluster, 1:10, function(i) myProc())
result
stopCluster(cluster)

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel_snow.R   # custom submission command

# library(parallel): foreach + multicore

library(foreach)
library(doMC)

registerDoMC(cores = as.integer(Sys.getenv('LSB_MAX_NUM_PROCESSORS')))
result <- foreach(i = 1:10, .combine = c) %dopar% {
    myProc()
}
result

bsub -q normal -n 4 -app R-5g R CMD BATCH parallel_foreach.R   # custom submission command

Parallel Processing in R

See more info on our website at http://grid.rcs.hbs.org/parallel-r

Page 28

By default, R, Python, Perl, and MATLAB* are not multithreaded… so do not ask for or try to use more than 1 core/CPU!!
• Python has the 'multiprocessing' module
  • Evolved from the threading module
  • Uses subprocesses, instead of threads, to bypass Python's Global Interpreter Lock
  • Rich set of classes, including Pool, which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism)
  • Runs on both Unix & Windows systems

'Multiprocessing' in Python

Page 29

import multiprocessing, os

def worker(num):
    """thread worker function"""
    print('Worker:', num)
    return

if __name__ == '__main__':
    jobs = []
    # number of cores granted by the scheduler (comes in as a string)
    cores = int(os.environ['LSB_MAX_NUM_PROCESSORS'])
    for i in range(cores):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()

$ python multiprocessing_simpleargs.py

Worker: 0

Worker: 1

Worker: 2

Worker: 3

Worker: 4

bsub -q normal -n 5 -W 1:00 -R "rusage[mem=1000]" python parallel_workers.py

'Multiprocessing' in Python

See more info on our website at http://grid.rcs.hbs.org/parallel-processing

Page 30

By default, R, Python, Perl, and MATLAB* are not multithreaded… so do not ask for or try to use more than 1 core/CPU!!
• MATLAB has parpool, part of the Parallel Computing Toolbox (PCT), standard on all installations
  • This operates on a single machine!
  • On some systems (e.g. FASRC's Odyssey), workers can be spawned across multiple machines for large-scale work via DCS, the Distributed Computing Toolbox

% MATLAB example (parallel.m)
hPar = parpool( 'local' , str2num( getenv('LSB_MAX_NUM_PROCESSORS') ) );

R = 1; darts = 1e7; count = 0;   % Prepare settings
parfor i = 1:darts
    x = R * rand(1);
    y = R * rand(1);
    if x^2 + y^2 <= R^2
        count = count + 1;
    end
end
myPI = 4 * count / darts;

% Log results & close down parallel pool
fprintf( 'The computed value of pi is %2.7f\n' , myPI );
delete(gcp);

matlab-5g -n4 par_compute_pi.m   # command-line wrapper
bsub -n 4 -q normal -W 2:00 -R "rusage[mem=1000]" matlab par_compute_pi.m   # custom submission

Multicore Options in MATLAB

See more info on our website at http://grid.rcs.hbs.org/parallel-processing

Page 31

Other Important Points & Troubleshooting

Page 32

Choosing the core count can be difficult, especially if there's a mix of serial and parallel steps….

• Think about how long your code will be in either mode
• Determine the fractional resource use across the whole job
• If < 20% is in multicore use, then split the work into two separate jobs
• Can use job dependencies to make submission easier (see the sketch below)

[Diagram: a serial segment plus a multicore segment. OK, can run as one long job.]
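A minimal sketch of the job-dependency route (job names and scripts are placeholders, not from the slides): submit the serial step, then submit the multicore step so the scheduler only starts it once the first job has finished successfully.

# serial preparation step on 1 core
bsub -J prep_step -q normal -n 1 -W 4:00 -R "rusage[mem=2000]" python prepare_data.py
# multicore step; -w "done(prep_step)" holds it until the prep job completes successfully
bsub -J mc_step -w "done(prep_step)" -q normal -n 4 -W 4:00 -R "rusage[mem=4000]" \
    stata-mp4 -d analyze.do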

Mixed Multicore and Serial Workflows

Page 33

Not all programs scale well. This is due to:
• Overhead of program start
• Overhead of communication between processes (threads) within the program
• (worse:) Waiting to write to the network or disk (I/O)
• Other, serial parts of the program (parts that cannot be parallelized)

Scaling tests are important to help you determine the optimal # of cores to use!!

Scaling Tests Ensure Efficiency

Page 34


# Create a SLURM script for an analysis that can be used for multiple CPU (core) values

# Input seqs.fa file has 350 FASTA sequences so we can get good parallelization values:

-- file: blast_scale_test.slurm ---
#!/bin/bash
#
#SBATCH -p serial_requeue       # Partition to submit to (comma separated)
#SBATCH -J blastx               # Job name
#SBATCH -N 1                    # Ensure that all cores are on one machine
#SBATCH -t 0-4:00               # Runtime in D-HH:MM (or use minutes)
#SBATCH --mem 10000             # Memory pool in MB for all cores
#SBATCH --mail-type=END,FAIL    # Type of email notification: BEGIN,END,FAIL,ALL

source new-modules.sh; module load ncbi-blast/2.2.31+-fasrc01
export BLASTDB=/n/regal/informatics_public/

blastx -query seqs.fa -db $BLASTDB/custom/other/model_chordate_proteins \
    -out sk_shuffle_seqs.n${1}.modelchordate.blastx -num_threads $1
-----

# and now submit the file multiple times with different core values
for i in 1 2 4 8 16; do
    echo $i
    # sbatch flags here will override those in the SLURM submission script
    sbatch -n $i -J blastx$i -o blastx_n$i.out -e blastx_n$i.err blast_scale_test.slurm $i
    sleep 1
done

Your Own Scaling Tests!

Page 35


[bfreeman@rclogin04 ~]$ sacct -u bfreeman -S 4/6/16 --format=jobid,\
    elapsed,alloccpus,cputime,totalcpu,state

JobID         Elapsed     AllocCPUS  CPUTime     TotalCPU    State
------------  ----------  ---------  ----------  ----------  ----------
59817008      16:12:26    1          16:12:26    16:03:34    COMPLETED
59817008.ba+  16:12:26    1          16:12:26    16:03:34    COMPLETED
59817024      10:49:16    2          21:38:32    17:53:07    COMPLETED
59817024.ba+  10:49:16    2          21:38:32    17:53:07    COMPLETED
59817026      06:03:38    4          1-00:14:32  15:56:55    COMPLETED
59817026.ba+  06:03:38    4          1-00:14:32  15:56:55    COMPLETED
59817028      04:55:44    8          1-15:25:52  21:27:30    COMPLETED
59817028.ba+  04:55:44    8          1-15:25:52  21:27:30    COMPLETED
59817043      03:01:51    16         2-00:29:36  1-01:33:03  COMPLETED
59817043.ba+  03:01:51    16         2-00:29:36  1-01:33:03  COMPLETED
59847485      02:04:58    32         2-18:38:56  1-11:42:36  COMPLETED
59847485.ba+  02:04:58    32         2-18:38:56  1-11:42:36  COMPLETED

Cores      1         2         4          8          16         32
Elapsed    16:12:26  10:49:16  6:03:38    4:55:44    3:01:51    2:04:58
Ideal      16:12:26  8:06:13   4:03:07    2:01:33    1:00:47    0:30:23
CPUTime    16:12:26  21:38:32  24:14:32   39:25:52   48:29:36   66:38:56
Ideal      16:12:26  16:12:26  16:12:26   16:12:26   16:12:26   16:12:26
NoGain     16:12:26  32:24:52  64:49:44   129:39:28  259:18:56  518:37:52
TotalCPU   16:03:34  17:53:07  15:56:55   21:27:30   25:33:03   35:42:36
Ideal      16:03:34  16:03:34  16:03:34   16:03:34   16:03:34   16:03:34
NoGain     16:03:34  32:07:08  64:14:16   128:28:32  256:57:04  513:54:08
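Reading the table (a worked example, computed from the elapsed times above): going from 1 to 16 cores cuts the wall time from 16:12:26 to 3:01:51, a measured speedup of about 5.3x against an ideal of 16x, or roughly 33% efficiency; at 32 cores the speedup is only about 7.8x (roughly 24% efficiency). For this particular job the efficient choice is a modest core count, which is exactly what a scaling test is meant to reveal.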

Your Own Scaling Test Results!

Page 36


Cores      1         2         4          8          16         32
Elapsed    16:12:26  10:49:16  6:03:38    4:55:44    3:01:51    2:04:58
Ideal      16:12:26  8:06:13   4:03:07    2:01:33    1:00:47    0:30:23
CPUTime    16:12:26  21:38:32  24:14:32   39:25:52   48:29:36   66:38:56
Ideal      16:12:26  16:12:26  16:12:26   16:12:26   16:12:26   16:12:26
NoGain     16:12:26  32:24:52  64:49:44   129:39:28  259:18:56  518:37:52
TotalCPU   16:03:34  17:53:07  15:56:55   21:27:30   25:33:03   35:42:36
Ideal      16:03:34  16:03:34  16:03:34   16:03:34   16:03:34   16:03:34
NoGain     16:03:34  32:07:08  64:14:16   128:28:32  256:57:04  513:54:08

[Charts (log-scale time axis): Elapsed vs. Ideal; CPUTime vs. Ideal and NoGain; TotalCPU vs. Ideal and NoGain; each plotted against 1, 2, 4, 8, 16, and 32 cores.]

Your Own Scaling Test Results!

Page 37

RCS Website & Documentation -- only authoritative source: https://grid.rcs.hbs.org/

Submit a help request: [email protected]

Best way to help us to help you? Give us...
• Description of the problem
• Additional info (login/batch? queue? JobIDs?)
• Steps to reproduce (1., 2., 3...)
• Actual results
• Expected results

Getting Help

Page 38

• Please talk to your peers, and…
• We wish you success in your research!

• http://intranet.hbs.edu/dept/research/
• https://grid.rcs.hbs.org/
• https://training.rcs.hbs.org/

• @hbs_rcs

Research Computing Services