
What MARCC Does

1

• Maryland Advanced Research Computing Center

Jaime E. Combariza, PhD, Director

Slides available online

2

•www.marcc.jhu.edu/training

• MARCC Help
• [email protected]
• Include as much information as possible, for example:
  • The job ID of the job with problems
  • Full path to the batch submission script
  • Any specific error messages
  • If possible, a snapshot with errors
• Frequently Asked Questions
  • https://www.marcc.jhu.edu/getting-started/faqs/

High Performance Computing Analogy

3

Analogy diagram: Ingredients/Recipe = Scientific Applications; Oven = MARCC; Users; Research Project

MARCC/Bluecrab

4

Model & Funding

5

• Grant from the State of Maryland to Johns Hopkins University to build an HPC/big data facility
  • Building, IT stack, and networking
• Operational cost covered by 5 schools:
  • Krieger School of Arts & Sciences (JHU)
  • Whiting School of Engineering (JHU)
  • School of Medicine (JHU)
  • Bloomberg School of Public Health (JHU)
  • University of Maryland at College Park (UMCP)

Definitions

6

Definitions

7

Cluster (High Performance Compute Cluster): Aggregation of servers with high-speed connectivity and file systems attached to the servers
CPU: Central Processing Unit, aka "Processor"
GPU: Graphical Processing Unit
Node (login/compute/management/others): A server with some amount of memory, cores, CPUs/GPUs
Core: A "processing unit" within a node (24 or 28 cores)
Memory/RAM: Amount of memory per core and node (128/96)
Filesystem: Storage attached to the cluster
Network: Connectivity between nodes and file systems / communication
Software: Scientific applications (Python, Matlab, Samtools, Gaussian)

Compute Nodes

8

Nodes | Type | Description | Total cores | TFLOPs
648 | Regular compute nodes | Haswell 24-core, 128 GB RAM | 15,552 | 622
50 | Large Mem nodes | Ivy Bridge 48-core, 1024 GB RAM | 2,400 | 57.6
48 | GPU nodes | Haswell 24-core, 2 Nvidia K80s | 1,152 | 225.5
- | Lustre | 2 PetaByte filesystem | |
- | ZFS | 13 PB (formatted) | |
Original system | | | 19,104 cores | 905
48 | Regular compute nodes | Broadwell 28-core, 128 GB RAM | 1,344 | 55.91
24 | GPU nodes | Broadwell 28-core, 2 Nvidia K80s | 672 | 117.75
4 | GPU P100 | Broadwell 28-core plus 2 P100 per node | 112 | 4.65 + 4.7/GPU
2 | GPU V100 | 64-core and 28-core | |
28 | Condo | Haswell 24-core, 128 GB RAM | 672 | 26.88
8 | Condo | Broadwell 28-core, 128 GB RAM | 224 | 9.32
52 | Condo | Skylake Gold 6126, 24 cores | 1,152 |
Total resources as of 2/22/2019 | | | 23,300+ cores | 1.5+ PFLOPs

HPC Resources & Model

9

• Approx. 21,120 cores and 15 PB storage
• Allocations per quarter:
  KSAS: 13.4 M
  WSE: 13.4 M
  SOM: 6.8 M
  BSPH: 2.6 M
  UMCP: 6.4 M
  Reserve: 2.0 M

Allocations

10

• Deans requested applications from all faculty members
• Allocations granted according to available resources
• http://marcc.jhu.edu/request-access/marcc-allocation-request/

Remarks

11

• MARCC is free of cost to PIs. The schools pay for the operations.
• Authentication: via two-factor authentication
• Open data or any kind of confidential data; dbGaP is fine, in most cases
• Secure Research Environment (MSE) for HIPAA data
• If additional resources (allocation) are needed, plan to add a condo (compute nodes)

Storage

12

Storage

13

Directory | Quota | Backup (cost) | Additional storage
$HOME | 20 GBytes on ZFS | YES (no cost) | NO
~/scratch | 1 TB per group on Lustre, user access | NO | YES (>10 TB, Vice Dean)
~/work | Shared quota with ~/scratch, group access | |
~/data | 1 TB default per PI, up to 10 TB per group on ZFS, request MARCC | YES (no cost) | N/A
~/work-zfs | 50 TB per group on ZFS, request Vice Dean | YES ($40/TB/yr) | $40/TB/yr + backup cost
~/project | <6 months, upon request (Vice Dean) | NO | N/A

Temporary Files

14

• Temporary files go in ~/scratch
• Please do not use /tmp or /dev/shm
• If needed, please clean files after the job is completed
• Please, if at all possible, do not do heavy I/O to "data". Use scratch/work

Connecting

15

• Windows: PuTTY, bash, XSHELL
• Mac: Terminal, XQuartz
• VNC (limited); Open OnDemand (OOD)
• ssh [-YX] gateway2.marcc.jhu.edu -l userid
• ssh -Y login.marcc.jhu.edu -l userid
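For convenience, the login command above can be captured in an SSH client configuration. A minimal ~/.ssh/config sketch, assuming a hypothetical userid jdoe1 (any MARCC username works the same way):

# ~/.ssh/config entry for MARCC (hypothetical userid "jdoe1")
Host marcc
    HostName login.marcc.jhu.edu
    User jdoe1
    ForwardX11 yes           # X forwarding, like ssh -X (add ForwardX11Trusted yes for -Y)

# then connect with just:
#   ssh marcc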

File Transfer

16

1. Use Filezilla: https://www.marcc.jhu.edu/getting-started/faqs/
2. ssh dtn.marcc.jhu.edu -l userid
3. Use "aspera" (FAQ)
4. Use Globus Connect
   • Download client or use website
   • Create endpoint (if needed)
   • Authenticate using JHU single sign-on
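As a command-line alternative using the same dtn.marcc.jhu.edu data-transfer node, scp can copy files directly. A minimal sketch, assuming a hypothetical userid jdoe1 and a hypothetical file results.tar.gz:

# copy a local file to your MARCC scratch space over the data-transfer node
scp results.tar.gz [email protected]:scratch/

# pull a file back from MARCC to the current local directory
scp [email protected]:scratch/results.tar.gz .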

Data Transfer (globus.org)

17

sshfs Mounts (Basic)

18

• Fuse/sshfs is now enabled on the cluster
• It follows the two-factor authentication protocol
• Check with your local IT person to find out how to mount different file systems, or MARCC's website for an example:
  • https://www.marcc.jhu.edu/getting-started/basic/
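A minimal sshfs sketch from a local Linux/Mac machine, assuming a hypothetical userid jdoe1, the login.marcc.jhu.edu host from the Connecting slide, and the /home-0/<userid> home-directory layout seen in the PATH example later in this deck:

# create a local mount point and mount the MARCC home directory over ssh
mkdir -p ~/marcc
sshfs [email protected]:/home-0/jdoe1 ~/marcc -o reconnect

# when finished, unmount (fusermount -u on Linux; plain umount on macOS)
fusermount -u ~/marcc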

Software & Modules

19

• MARCC manages software availability using "environment modules" (Lmod from TACC)
• module --help
• module avail
• module spider python
• module list (ml)
• module load gaussian

More on Modules

20

• Module "load" changes the user's path, prepending the package being loaded to the user's environment.
• Example: python
  > which python
  /usr/bin/python
  (the Python version that comes with the OS)

module spider

21

[jcombar1@login-node03 ~]$ ml spider python

python:
--------------------------------------------------------------------
    Versions:
        python/2.7-anaconda
        python/2.7-anaconda53
        python/2.7
        python/3.6-anaconda
        python/3.6
        python/3.7-anaconda
        python/3.7
    Other possible modules matches:
        biopython
--------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*python.*'

--------------------------------------------------------------------
  For detailed information about a specific "python" module (including how to load the modules) use the module's full name. For example:

      $ module spider python/3.7

module show

22

[jcombar1@login-node03 ~]$ ml show python/2.7
--------------------------------------------------------------------
   /software/lmod/modulefiles/apps/python/2.7.lua:
--------------------------------------------------------------------
help([[ anaconda - loads the anaconda software & application environment

This adds /software/apps/python/2.7 to several of the environment variables.

After loading, to see system-installed packages for this Python installation please type: conda list

See MARCC's help page at: https://www.marcc.jhu.edu/getting-started/local-python-packages/ ]])
whatis("loads the Python 2.7 package")
always_load("centos7")
prepend_path("PATH","/software/apps/python/2.7/bin")
prepend_path("LD_LIBRARY_PATH","/software/apps/python/2.7/lib")
prepend_path("LD_LIBRARY_PATH","/software/apps/python/2.7/lib/python2.7")
prepend_path("LD_LIBRARY_PATH","/software/apps/python/2.7/lib/python2.7/site-packages")

Modules Examples

23

[jcombar1@login-node01 ~]$ module load python
[jcombar1@login-node01 ~]$ module list
Currently Loaded Modules:
  1) gcc/4.8.2   2) slurm/14.11.03   3) python/2.7.10
[jcombar1@login-node01 ~]$ which python
/software/apps/python/2.7
$ echo $PATH
/software/apps/python/2.7/bin:/software/apps/marcc/bin:/software/centos7/bin:/software/apps/slurm/current/sbin:/software/apps/slurm/current/bin:.:/software/apps/mpi/openmpi/3.1.3/intel/18.0/bin:/software/apps/compilers/intel/itac/2018.3.022/intel64/bin:/software/apps/compilers/intel/clck/2018.3/bin/intel64:/software/apps/compilers/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64:/software/apps/compilers/intel/compilers_and_libraries_2018.3.222/linux/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/software/apps/compilers/intel/parallel_studio_xe_2018.3.051/bin:/home-0/jcombar1/pdsh/bin:/home-0/jcombar1/bin

module help application

24

• module help gaussian

----------------------------- Module Specific Help for "gaussian/g09" -----------------------------
This is a computational chemistry application.
************************************************
** Email [email protected]
** Request access to the g09 group
************************************************
website: http://www.gaussian.com
Manual online: http://www.gaussian.com/g_tech/g_ur/g09_help.htm
----------------------------------------------------------------------------------------------------
-- On MARCC, Gaussian 09 runs using threads. It does not use Linda libraries.
-- Please do not run Gaussian over more than one node. Follow the example below.
----------------------------------------------------------------------------------------------------
To run it in batch mode use a script like this one:

#!/bin/bash -l
#SBATCH --time=1:0:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --partition=shared
#SBATCH --mem=120000MB
#### THE ABOVE REQUESTS 120GB RAM
module load gaussian
gscratch=/scratch/users/$USER
mkdir -p $gscratch/$SLURM_JOBID
export GAUSS_SCRDIR=$gscratch/$SLURM_JOBID
date
time g09 water(.com)

25

Job Management Tool

Queuing System

26

• MARCC allocates resources to users using a transparent and fair process by means of a "queueing system".
• SLURM (Simple Linux Utility for Resource Management)
• Open source, adopted at many HPC centers and HHPC (similar environments)

Queues / Partitions

28

• Link to website

Partition | Default/Max Time | Default/Max Cores | Default/Max Mem | Serial/Parallel | Backfilling
SHARED | 1 hr / 72 hr | 1/24 | 5 GB / 120 GB | Serial/Parallel | Shared
UNLIMITED | Unlimited | 1/24/48 | 5 GB / 120 GB | Serial/Parallel | Shared
PARALLEL | 1 hr / 72 hr | | 5 GB / 120 GB | Parallel | Exclusive
GPUK80/gpup100/gpuv100 | 1 hr / 72 hr | 6/24 | 5 GB / 120 GB | Serial/Parallel | Shared
LRGMEM | 1 hr / 72 hr | 48 | 21 GB / 1024 GB | Serial/Parallel | Shared
Scavenger | Max 6 hr | 1/24 | 5 GB / 120 GB | Serial/Parallel | Shared

Partitions / Shared

29

• May share compute nodes for jobs
• Serial or parallel jobs
• Time limit 1 hr to 100 hours
• 1-24 cores
• One node

• #SBATCH -N 1
• #SBATCH --ntasks-per-node=12
• #SBATCH --partition=shared

Partitions / Unlimited

30

• Unlimited time!!!
• Jobs that need to run for more than 100 hours
• If the system/node crashes the job will be killed
• One to 24 cores, one or multi-node
• Serial or parallel

• #SBATCH -N n (n = 1 or more)
• #SBATCH --partition=unlimited
• #SBATCH --time=15-00:00:00 (fifteen days)
• #SBATCH --ntasks-per-node=m
• #SBATCH --mem=0 !!!

Partitions / Parallel

31

• Dedicated queue
• Exclusive nodes
• Single and multi-node jobs
• 1 hr to 100 hours
• Parallel jobs only

• #SBATCH -N 4
• #SBATCH --ntasks-per-node=24 (96 cores)
• #SBATCH --partition=parallel
• #SBATCH --mem=0

Partitions / Scavenger

32

• Must use with qos=scavenger
• #SBATCH --qos=scavenger
• #SBATCH --partition=scavenger

• Low-priority jobs
• Time maximum 6 hours
• Use only if your allocation ran out

SLURM Flags

33

Description | Flag
Script directive | #SBATCH
Job name | #SBATCH --job-name=Any-name
Requested time | #SBATCH -t minutes  or  #SBATCH -t [days-hrs:min:sec]
Nodes requested | #SBATCH -N min-Max  or  #SBATCH --nodes=Number
Number of cores per node | #SBATCH --ntasks-per-node=12
Number of cores per task | #SBATCH --cpus-per-task=2
Mail | #SBATCH --mail-type=end
User's email address | #SBATCH --mail-user=[email protected]
Memory size | #SBATCH --mem=[mem|M|G|T]
Job arrays | #SBATCH --array=[array_spec]
Request specific resource | #SBATCH --constraint="XXX"

SLURM Environment Variables

34

Description | Variable
Job ID | $SLURM_JOBID
Submit Directory | $SLURM_SUBMIT_DIR
Submit Host | $SLURM_SUBMIT_HOST
Node List | $SLURM_JOB_NODELIST
Job Array Index | $SLURM_ARRAY_TASK_ID

> printenv | grep SLURM
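These variables are most useful inside the batch script itself. A minimal sketch, following the per-job scratch-directory pattern of the Gaussian example above, with a hypothetical executable ./my_code.x:

#!/bin/bash -l
#SBATCH --job-name=env-demo
#SBATCH --time=1:0:0
#SBATCH --partition=shared

# run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# per-job temporary directory in scratch, named after the job ID
tmpdir=~/scratch/$SLURM_JOBID
mkdir -p $tmpdir

echo "Job $SLURM_JOBID running on: $SLURM_JOB_NODELIST"
./my_code.x > output.$SLURM_JOBID.log   # hypothetical executable

# clean up temporary files when done
rm -rf $tmpdir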

SLURM Scripts

35

cp -r /scratch/public/scripts . (copy directory)

#!/bin/bash -l
#SBATCH --job-name=MyJob
#SBATCH --time=8:0:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --mail-type=end
#SBATCH --mail-user=[email protected]
#SBATCH --partition=shared
module load mvapich2/gcc/64/2.0b   #### load mvapich2 module
time mpiexec ./code-mvapich.x > OUT-24log

Running Jobs

36

• sbatch (qsub) script-name
• squeue (qstat -a) -u userid [[email protected]]
• [email protected] (sqme)

login-vmnode01.cm.cluster:
                                                       Req'd  Req'd   Elap
Jobid  Username  Queue   Name   SessID  NDS  TSK  Memory  Time   S  Time
------ --------- ------- ------ ------- ---- ---- ------- ------ -- ------
300    jcombar1  shared  MyJob  --      1    12   --      08:00  R  00:00

SLURM Commands

37

• scancel (qdel) jobid
• scontrol show job jobID
• sinfo
• sinfo -p shared
• sqme
• sacct

Interactive Work

38

• interact -usage
  usage: interact [-n cores] [-t walltime] [-m memory] [-p queue]
                  [-o outfile] [-X] [-f featurelist] [-h hostname] [-g ngpus]
  Interactive session on a compute node
  options:
    -n cores        (default: 1)
    -t walltime     as hh:mm:ss (default: 30:00)
    -m memory       as #[k|m|g] (default: 5GB)
    -p partition    (default: 'def')
    -o outfile      save a copy of the session's output to outfile (default: off)
    -X              enable X forwarding (default: no)
    -f featurelist  CCV-defined node features (e.g., 'e5-2600'), combined with '&' and '|' (default: none)
    -h hostname     only run on the specific node 'hostname' (default: none, use any available node)
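For example, a two-hour interactive session with 6 cores and 8 GB of memory might be requested as sketched below; the partition name shared is an assumption based on the partitions table, so adjust it to your allocation:

# request an interactive session: 6 cores, 2 hours, 8 GB, shared partition
interact -n 6 -t 2:00:00 -m 8g -p shared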

Job Arrays

39

#!/bin/bash -l
#SBATCH --job-name=job-array
#SBATCH --time=1:0:0
#SBATCH --array=1-240
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=shared
#SBATCH --mem=4.9GB
#SBATCH --mail-type=end
#SBATCH --mail-user=[email protected]

# run your job

echo "Start Job $SLURM_ARRAY_TASK_ID on $HOSTNAME"

...
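A common pattern is to let the array index select the input for each task. A minimal sketch, assuming hypothetical input files input_1.dat through input_240.dat and a hypothetical program ./my_code.x:

# each array task processes the input file matching its index
infile=input_${SLURM_ARRAY_TASK_ID}.dat
./my_code.x $infile > output_${SLURM_ARRAY_TASK_ID}.log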

GPUs and Interactive Jobs

40

• #SBATCH -p gpu --gres=gpu:4
• #SBATCH --ntasks-per-node=4
• #SBATCH --cpus-per-task=6

• interact -p debug -g 1 -n 1 -c 6
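Putting these directives together, a minimal GPU batch sketch might look like the following. The partition name gpuk80 is taken from the partitions table (check which GPU partition your allocation uses), and the module name cuda and executable ./gpu_code.x are assumptions for illustration:

#!/bin/bash -l
#SBATCH --job-name=gpu-job
#SBATCH --time=1:0:0
#SBATCH -p gpuk80            # GPU partition; see gpuk80/gpup100/gpuv100 in the table above
#SBATCH --gres=gpu:4         # request 4 GPUs on the node
#SBATCH --ntasks-per-node=4  # one task per GPU
#SBATCH --cpus-per-task=6    # 6 cores per task (24 cores total on a K80 node)

module load cuda             # assumed module name; check 'module spider cuda'
./gpu_code.x                 # hypothetical GPU-enabled executable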

Compilers / Compiling

41

• Intel compilers
  • module list
  • ifort -O3 -openmp -o my.exe my.f90
  • icc -g -o myc.x myc.c
• GNU
  • gfortran -O4 -o myg.x my.f90
  • gcc -O4 -o myc.x myc.c
• PGI compilers (ml pgi)
  • pgcc -help
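As a small worked example of the Intel OpenMP flag shown above, an OpenMP code could be compiled and run inside a job as sketched below. The source file hello_omp.f90 is hypothetical, and newer Intel compilers prefer -qopenmp over -openmp:

# compile a Fortran OpenMP code with the Intel compiler
ifort -O3 -openmp -o hello_omp.exe hello_omp.f90

# run with as many threads as cores requested from SLURM
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # set when #SBATCH --cpus-per-task is used
./hello_omp.exe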

MPI Jobs

42

• module spider mpi
• module load mvapich2
• mpif90 or mpicc code (.f90 or .c)
• mpiexec code.x (within a compute node)
• Use mpiicc or mpif90 (Intel MPI)
• mpif90 and mpicc (use gcc)
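Tying this to the batch system, a minimal multi-node MPI sketch is shown below, assuming the code was first compiled on a login node with "module load mvapich2; mpif90 -O3 -o code.x code.f90" (code.f90 is a hypothetical source file):

#!/bin/bash -l
#SBATCH --job-name=mpi-job
#SBATCH --time=2:0:0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --partition=parallel
#SBATCH --mem=0

module load mvapich2
mpiexec ./code.x > OUT.log    # should start one rank per task (48 total) within the SLURM allocation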

Warning

43

• No refunds, so make sure you are using MARCC resources effectively

Information

44

• [email protected]

• Website: marcc.jhu.edu