D2.4 The NEXTGenIO Architecture
WP2 "Architecture"

Project Acronym: NEXTGenIO
Project Name: Next Generation I/O for Exascale
Project Number: 671591
Instrument: Research and Innovation Action
Thematic Priority: FETHPC-1-2014 HPC Core Technologies, Programming Environments and Algorithms for Extreme Parallelism and Extreme Data Applications
Due date: 30/09/2017
Submission date: 29/09/2017
Project start date: 1st October 2015
Project duration: 36 months
Deliverable lead organisation: UEDIN
Status: Final
Version: 1.0
Author(s): Adrian Jackson, Iakovos Panourgias (UEDIN), Bernhard Homölle (FUJITSU), Alberto Miranda, Ramon Nou, Javier Conejero (BSC), Marcelo Cintra (INTEL), Simon Smart, Antonino Bonanni (ECMWF), Holger Brunst, Christian Herold, Muhammad Sarim Zafar (TUD)
Reviewer(s): Hans-Christian Hoppe (INTEL), Michèle Weiland (UEDIN), Bernhard Homölle (FUJITSU)
Dissemination level: PU - Public



Copyright © 2017 NEXTGenIO Consortium Partners

Version History

Version  Date      Comments, Changes, Status                                        Authors & contributors
0.1      14/08/17  Initial draft created                                            Adrian Jackson (UEDIN)
0.2      01/09/17  Review draft finished                                            Adrian Jackson (UEDIN)
0.3      04/09/17  Fixed layout and incorporated Christian's changes                Adrian Jackson (UEDIN), Christian Herold (TUD)
0.4      12/09/17  Update of systemware architectures, correcting some mistakes and re-ordering sections  Adrian Jackson (UEDIN)
1.0      22/09/17  Incorporating review comments                                    Adrian Jackson, Michèle Weiland (UEDIN), Bernhard Homölle (FUJITSU), Hans-Christian Hoppe, Marcelo Cintra (INTEL)


Table of Contents

1 Executive Summary
2 Introduction
2.1 Glossary
3 Non-volatile Memory
3.1 SCM modes of operation
3.2 Non-volatile memory software ecosystem
3.3 Alternatives to SCM
4 Requirements
5 Hardware Architecture
5.1 Hardware building blocks
5.1.1 Main server node
5.1.2 Ethernet switch
5.1.3 Network switch
5.1.4 Login nodes
5.1.5 Boot nodes
5.1.6 Gateway nodes
5.2 Prototype considerations
5.2.1 RDMA options for the prototype
5.2.2 Applicability of the architecture without access to SCM
5.3 System configuration
5.4 ExaFLOP vision
5.4.1 Scalability to an ExaFLOP and beyond
5.4.2 Scalability into 100 PFLOP/s
6 Systemware Architecture
6.1 Systemware Overview
6.2 Login Nodes
6.2.1 Communication Libraries
6.2.2 Compilers
6.2.3 I/O Libraries
6.2.4 Java
6.2.5 Job Launcher
6.2.6 Job Scheduler
6.2.7 Local Filesystem
6.2.8 Meta Scheduler
6.2.9 Object Store
6.2.10 Operating System
6.2.11 Performance Monitoring
6.2.12 Performance Visualisation and Debugging Tools
6.2.13 Python
6.2.14 Scheduling Tools
6.2.15 Scientific Libraries
6.2.16 Task-based programming model
6.3 Gateway Nodes
6.3.1 Operating System
6.4 Boot Nodes
6.4.1 Operating System
6.4.2 Management/Monitoring Software
6.4.3 PXE-Boot Server
6.4.4 DHCP Server
6.4.5 DNS Server
6.4.6 Network Management
6.4.7 Local Filesystem
6.5 Complete Compute Nodes
6.5.1 Automatic Check-pointer
6.5.2 Communication Libraries
6.5.3 Data Movers
6.5.4 Data Scheduler
6.5.5 I/O Libraries
6.5.6 Java
6.5.7 Job Scheduler
6.5.8 Management/Monitoring Software
6.5.9 Multi-node SCM Filesystem
6.5.10 Object Store
6.5.11 Operating System
6.5.12 Performance Monitoring
6.5.13 Persistent Memory Access Library
6.5.14 Python
6.5.15 SCM Driver
6.5.16 SCM Local Filesystem
6.5.17 Task-based Runtime Environment
6.5.18 Volatile Memory Access Library
7 Usage Patterns
7.1 General Purpose HPC system using multi-node SCM filesystem
7.1.1 Description
7.1.2 Components
7.1.3 Data lifecycle
7.2 General Purpose HPC system improving throughput
7.2.1 Description
7.2.2 Components
7.2.3 Data lifecycle
7.3 Resilience in specialist HPC system
7.3.1 Description
7.3.2 Components
7.3.3 Data lifecycle
7.4 Large Memory
7.4.1 Description
7.4.2 Components
7.4.3 Data lifecycle
7.5 Workflows
7.5.1 Description
7.5.2 Components
7.5.3 Data lifecycle
7.6 Data Analytics
7.6.1 Description
7.6.2 Components
7.6.3 Data lifecycle
7.7 Live process pausing and migration
7.7.1 Description
7.7.2 Components
7.7.3 Data lifecycle
7.8 Object Store
7.8.1 Description
7.8.2 Components
7.8.3 Data lifecycle
7.9 Weather forecasting component overview
7.9.1 Description
7.9.2 Components
7.9.3 Data lifecycle
7.10 Distributed workflow management
7.10.1 Description
7.10.2 Components
7.10.3 Data lifecycle
8 Conclusions
9 References
10 Appendix

Table of Figures

Figure 1: One-level memory (1LM)
Figure 2: Two-level memory (2LM)
Figure 3: pmem.io software architecture (source: Red Hat)
Figure 4: Main node DRAM/NVM (SCM) configuration
Figure 5: Unified NEXTGenIO server node (source: FUJITSU)
Figure 6: Server node block diagram
Figure 7: Local and RDMA SCM latency with hardware RDMA
Figure 8: Local and remote SCMs with RDMA emulation
Figure 9: The NEXTGenIO prototype rack
Figure 10: Prototype rack details (NOTE: capacities/sizes of memory and other hardware component specifications are based on current expectations, rather than exact specifications of the prototype)
Figure 11: Scaling the NEXTGenIO architecture beyond one rack
Figure 12: Intra-rack vs. inter-rack connectivity
Figure 13: Systemware architecture
Figure 14: Login node systemware components
Figure 15: Boot node systemware components
Figure 16: Complete compute node systemware components
Figure 17: General Purpose HPC system using multi-node SCM filesystem
Figure 18: Resilience in specialist HPC system
Figure 19: Large memory
Figure 20: Workflows

Table of Tables

Table 1: NEXTGenIO scalability (3D torus projections)


1 Executive Summary

NEXTGenIO is developing a prototype high performance computing (HPC) and high performance data analytics (HPDA) system that integrates byte-addressable storage class memory (SCM) into a standard compute cluster to provide greatly increased I/O performance for computational simulation and data analytics tasks.

To enable us to develop a prototype that can be used by a wide range of computational simulation applications and data analytics tasks, we have undertaken a requirements-driven design process to create hardware and software architectures for the system. These architectures both outline the components and integration of the prototype system, and define our vision of what is required to integrate and exploit SCM to enable a generation of Exascale systems with sufficient I/O performance to ensure a wide range of workloads can be supported.

The hardware architecture, which is designed to scale up to an ExaFLOP system, uses high performance processors coupled with SCM in NVRAM (non-volatile random access memory) form, traditional DRAM memory, and an Omni-Path high performance network, to provide a set of complete compute nodes that can undertake both high performance computation and high performance data analysis.

The systemware (system software), which supports the hardware in the system, will enable parallel I/O using the SCM technology, provide a multi-node filesystem for users to exploit, enable use of object storage techniques, and provide automatic check-pointing if desired by a user. These features, along with other systemware components, will enable the system to support traditional parallel applications with high efficiency, as well as newer computing modes such as high performance data analytics.

The architectures defined in this document are also demonstrated with descriptions of relevant use cases that illustrate which systemware components will be used to undertake various actions on the hardware by user applications.


2 Introduction

The NEXTGenIO project is developing a prototype HPC system that utilises byte-addressable SCM hardware to provide greatly improved I/O performance for HPC and HPDA applications. A key part of developing the prototype is the design of the different components of the system.

This document outlines the architectures we have designed for the NEXTGenIO prototype and future Exascale HPC/HPDA systems, building on a detailed requirements capture undertaken in the project.

Our aim in this process has been to design a prototype system that can be used to evaluate and exploit SCM for large scale computations and data analysis, and to design a hardware and software architecture that could be exploited to integrate SCM with Exascale-sized HPC and HPDA systems.

NEXTGenIO is building a hardware prototype using Intel's 3D XPoint™ memory, and we have designed a software architecture that is applicable to other instances of SCM once they become available. The functionality outlined is sufficiently generic that it can exploit a range of different hardware to provide the persistent and high performance storage functionality that 3D XPoint™ offers.

The remainder of this document outlines the key features of SCM that we are exploiting in NEXTGenIO, the general requirements that we have identified for our architectures, and the hardware and software architectures that we have designed. It concludes with a discussion of common usage patterns we envisage will be used to exploit the performance benefits SCM can bring.

2.1 Glossary

1LM  1 level memory. This represents the mode where DRAM and SCM are used as separate random-access memory systems, with their own memory address space. This means that it is possible to address all the DRAM and SCM from an application, resulting in an available memory space of the total DRAM plus the total SCM installed in a node. 1LM allows persistent memory instructions to be issued.

2LM  2 level memory. This represents the mode where the DRAM is used as a transparent cache for the SCM in the system. Applications do not see the DRAM memory address space, only the SCM memory address space. This means the total memory available to applications is the size of the total SCM memory in the node. 2LM cannot issue persistent memory instructions; it can only be used in load/store (volatile) mode.

1GE  1 Gigabit Ethernet

10GE  10 Gigabit Ethernet

100G  100 Gigabit (dual simplex)

3D XPoint™  Intel's emerging SCM technology that will be available in NVMe SSD and NVDIMM forms

AEP  Apache Pass, code name for Intel NVDIMMs based on 3D XPoint™ memory

API  Application programming interface

CPU  Central processing unit

CCN  Complete compute node

DDR4  Double data rate DRAM access protocol (version 4)


DHCP  Dynamic Host Configuration Protocol

DIMM  Dual in-line memory module, the mainstream pluggable form factor for DRAM

DMA  Direct memory access (using a hardware state machine to copy data)

DNS  Domain Name System, a standard Internet address resolution system

DP  Dual processor

DPA  DIMM physical address

DP Xeon  DP Intel® Xeon® platform

DRAM  Dynamic random access memory

DMTF  Distributed Management Task Force, an industry standards body for computer system management topics

ECMWF  European Centre for Medium-Range Weather Forecasts

ExaFLOP  10¹⁸ FLOP/s

FLOP  Floating point operation (in our context 64 bit / double precision)

Gb, GB  Gigabit, Gigabyte (1024³ bytes, JEDEC standard)

GIOPS  10⁹ IOPS

GPU  Graphics processing unit

HDD  Hard disk drive. Primarily used to refer to spinning disks

HPC  High Performance Computing

IB  InfiniBand

IB-EDR  IB "enhanced data rate" (100 Gb/s per link)

IFT card  Internal faceplate transition card (interfaces internal iOPA cables coming from CPUs to the server faceplate)

IFS  Integrated Forecasting System, a simulation framework from ECMWF

InfiniBand  Industry standard HPC communication network

I/O  Input/output

iOPA  Integrated OPA (with the HFI on the CPU package, as opposed to having it on a PCIe card)

IOPS  I/O operations per second

IPMI  Intelligent Platform Management Interface

JEDEC  JEDEC Solid State Technology Association, formerly known as the Joint Electron Device Engineering Council; governs DRAM specifications

Lustre  Parallel distributed filesystem; open source

LNET  Lustre network protocol

MIOPS  10⁶ IOPS

MPI  Message Passing Interface, standard for inter-process communication heavily used in parallel computing

MW  Megawatt (10⁶ watts)


NAND flash  Mainstream NVM used in SSDs today

NUMA  Non-uniform memory access

NVM  Non-volatile memory (generic category, used in SSDs and SCM)

NVMe  NVM Express, a transport interface for PCIe attached SSDs

NVMF  NVM over fabrics, the remote version of NVMe

NVML  NVM library, see [5]

NVRAM  Non-volatile RAM (DIMM based). Implemented by NVDIMMs, being one kind of NVM, that supports both non-volatile RAM access and persistent storage

NVDIMM  Non-volatile memory in DIMM form factor that supports both non-volatile RAM access and persistent storage. Intel NVDIMM, also known as AEP.

OEM  Original equipment manufacturer

Omni-Path  High-performance interconnect fabric developed by Intel for use in HPC systems

OPA  Omni-Path

O/S  Operating system

PB  Petabyte (1024⁵ bytes, JEDEC standard)

PCI-Express  Peripheral Component Interconnect Express. Communication bus between processors and expansion cards or peripheral devices.

PCIe  PCI-Express

Persistent  Data that is retained beyond power failure

PFLOP  Peta-FLOP, 10¹⁵ FLOP/s

PMEM, pmem  Persistent memory, another name for SCM

POSIX  Portable Operating System Interface

POSIX filesystem  A filesystem that adheres to the POSIX standard, specifically for file and directory operations

PXE  Pre-boot Execution Environment, run before the O/S is booted

QPI  QuickPath Interconnect

QuickPath Interconnect  Intel term for the communication link between Intel® Xeon® CPUs on a DP system

RAM  Random access memory

ramdisk  DRAM based block storage emulation

RDMA  Remote DMA, DMA over a network

SCM  Byte-addressable storage class memory, referring to memory technologies that can bridge the huge performance gap between DRAM and HDDs while providing large data capacities and persistence of data

SNIA  Storage Networking Industry Association

SNMP  Simple Network Management Protocol

SSD  Solid state disk drive

SHDD  Solid state HDD, using an NVM buffer inside a HDD

TB  Terabyte (1024⁴ bytes, JEDEC standard)

TFTP  Trivial File Transfer Protocol

TOP500  TOP500 supercomputer list, http://www.top500.org

U  Rack height unit (1.75")

Xeon®  Intel's brand name for x86 server processors

Xeon Phi™  Intel's many-core x86 line of CPUs for HPC systems


3 Non-volatile Memory

The aim of the NEXTGenIO project is to exploit a new generation of memory technologies that we see bringing great performance and functionality benefits for future computer systems. Non-volatile memory, memory that retains any data stored within it even when it does not have power, has been exploited by consumer electronics and computer systems for many years. The flash memory cards used in cameras and mobile phones are an example of such hardware, used for data storage. More recently, flash memory has been used for high performance I/O in the form of Solid State Disk (SSD) drives, providing higher bandwidth and lower latency than traditional Hard Disk Drives (HDD), but typically with lower capacities.

Whilst flash memory can provide fast I/O performance for computer systems, there are some drawbacks. It has limited endurance when compared to HDD technology, limiting the number of modifications of memory cells and thus the effective lifetime of the flash storage. It is often also more expensive than other storage technologies. Nevertheless, SSD storage, particularly enterprise-level SSD drives, is heavily used for I/O intensive functionality in large scale computer systems.

Byte-addressable storage class memory takes a new generation of non-volatile memory that is directly accessible via CPU load/store operations, has higher durability than standard flash memory, and higher read and write performance, and packages it in the same form factor (i.e. the same connector and size) as the main memory used in computers (DRAM). This allows the memory to be installed and used alongside DRAM-based main memory, controlled by the same memory controller. This means that applications running on the system can access it directly as if it were main memory, including true random data access at byte or cache line granularity. This is very different from block-based flash storage, which requires intensive I/O operations for such data access.

The performance of byte-addressable SCM is projected to be lower than main memory (with a latency of ~5-10x that of DDR4 memory when connected to the same memory channels), but much faster than SSDs or HDDs. It is also projected to be of much larger capacity than DRAM, around 2-5x denser (i.e. 2-5x more capacity in the same form factor).

This new class of memory offers very large memory capacity for servers, long term very highperformancepersistentstoragewithintheNVRAMmemoryspaceoftheservers,andtheabilitytoundertakeI/O(readingandwritingdata)inanewway.DirectaccessfromapplicationstoindividualbytesofdataintheSCMisverydifferentfromtheblock-orientedwayI/Oiscurrentlyimplemented.

SCM has the potential to enable synchronous, byte-level I/O, moving away from the asynchronous block-based file I/O applications currently rely on. In current asynchronous I/O, the driver software issues an I/O command by handing an I/O request over into a queue for an active controller. Processing of that command may happen after some delay if there are other commands already being processed.

Next, the I/O command is actively started (as with storage read/write or networking transmit), or it is recognised as a receiving descriptor for an asynchronous I/O command coming from an external source (e.g. a networking receive). Once the I/O operation is finished by the controller, the driver is notified (usually by an interrupt) to complete the operation in software.

With SCM providing much lower latencies than external disk drives, this traditional I/O model, using interrupts, is inefficient because of the overhead of context switches between user and kernel mode (which can typically take thousands of CPU instructions).

Furthermore, with SCM it becomes possible to implement remote access to data stored in the memory using RDMA technology over a suitable interconnect. Using high performance networks can enable access to data stored in SCM in remote nodes faster than accessing local high performance SSDs via traditional I/O interfaces and stacks inside a node.

Therefore, it is possible to use SCM to greatly improve I/O performance within a server, increase the memory capacity of a server, or provide a remote data store with high performance access for a group of servers to share. Such storage hardware can also be scaled up by adding more SCM memory in a server, or adding more nodes to the remote data store, allowing the I/O performance of a system to scale as required.

Copyright © 2017 NEXTGenIO Consortium Partners

However, if SCM is provisioned in the servers in a compute system, there must be software support for managing data within the SCM. This includes moving data as required for the jobs running on the system, and providing the functionality to let applications run on any server and still utilise the SCM for fast I/O and storage (i.e. applications should be able to access SCM in remote nodes if the system is configured with SCM only in a subset of all nodes).

As SCM is persistent, it also has the potential to be used for resiliency, providing backup for data from active applications, or providing long term storage for databases or data stores required by a range of applications. With support from the systemware, servers could be enabled to handle power loss without experiencing data loss, efficiently and transparently recovering from power failure and resuming applications from their latest running state, while maintaining data with little overhead in terms of performance.

3.1 SCMmodesofoperationOngoing developments in non-volatile / storage class memory have impacted the discussion ofmemoryhierarchies.AcommonmodelthathasbeenproposedincludestheabilitytoconfiguremainmemoryandSCMintwodifferentmodes:1-levelmemoryand2-levelmemory[2].

1-levelmemory,or1LM,hasmainmemory (DRAM)andNVRAMas twoseparatememoryspaces,both accessibleby applications. This is very similar to the FlatMode [3] configurationof thehighbandwidth, on-package, MCDRAM in current Intel® Xeon PhiTM processors (code name “KnightsLanding”orKNL).TheDRAMwillbehandledviastandardmemoryAPI’ssuchasmallocandrepresenttheO/Svisiblemainmemorysize.TheNVRAMwillbemanagedbystandardfilesystemAPIsuchasmmapandpresent thenon-volatilepartof the systemmemory.BothallowdirectCPU load/storeoperation.InordertotakeadvantageofSCMin1LMmode,thesystemwareorapplicationshavetobeadaptedtousethesetwodistinctaddressspaces.
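The two 1LM address spaces can be mimicked with commodity mechanisms. The sketch below is purely illustrative (a temporary file and its name stand in for an NVRAM region): it contrasts a volatile malloc-style allocation with a file-backed mapping that both accept byte-granularity loads and stores, only one of which survives teardown.

```python
import mmap
import os
import tempfile

# Volatile "DRAM" space: an ordinary heap allocation, as malloc would give.
dram = bytearray(64)
dram[0:4] = b"temp"

# Non-volatile "NVRAM" space: a file-backed mapping; the temporary file
# stands in for a region of byte-addressable SCM.
path = os.path.join(tempfile.mkdtemp(), "nvdimm_region")
with open(path, "wb") as f:
    f.truncate(64)

with open(path, "r+b") as f:
    nvram = mmap.mmap(f.fileno(), 64)   # mapping stays valid after close

nvram[0:4] = b"kept"    # direct byte-granularity store
nvram.flush()           # make the store durable
nvram.close()

# Only the mapped "NVRAM" contents survive the mapping being torn down.
with open(path, "rb") as f:
    survived = f.read(4)
print(survived)         # b'kept'
```

On real 1LM hardware both spaces would be ordinary memory to the CPU; here the file system merely emulates the persistence property.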

Figure 1: One-level memory (1LM).

2-level memory, or 2LM, configures DRAM as a cache in front of the NVRAM. Applications only see the memory space of the SCM; data being used is stored in DRAM, and moved to SCM by the memory controller when no longer immediately required (as in standard CPU caches). This is very similar to the Cache Mode [3] configuration of MCDRAM on KNL processors.

This mode of operation does not require applications to be altered to exploit the capacity of SCM, and aims to give memory access performance at main memory speeds whilst providing the large memory space of SCM. However, how well the main memory cache performs will depend on the specific memory requirements and access pattern of a given application. Furthermore, persistence of the NVRAM contents can no longer be guaranteed, due to the volatile DRAM cache in front of the NVRAM, so the non-volatile characteristics of SCM are not exploited.

Figure 2: Two-level memory (2LM).

3.2 Non-volatile memory software ecosystem

In recent years the Storage Networking Industry Association (SNIA) has been working on a software architecture for SCM with persistent load/store access. These operating system independent concepts have now materialised into the Linux pmem.io [5] library.

This approach re-uses the naming scheme of files as traditional persistent entities and maps the SCM regions into the address space of a process. Once the mapping has been done, the file descriptor is no longer needed and can be closed.
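The open-map-close pattern described here can be illustrated with plain POSIX mmap semantics; the real pmem.io/PMDK library adds persistence primitives on top of this, and the file name below is invented for the example.

```python
import mmap
import os
import tempfile

# A regular file stands in for a file on an SCM-aware (DAX) filesystem.
path = os.path.join(tempfile.mkdtemp(), "persistent_entity")
with open(path, "wb") as f:
    f.truncate(4096)

fd = os.open(path, os.O_RDWR)
region = mmap.mmap(fd, 4096)   # map the region into the process address space
os.close(fd)                   # descriptor no longer needed once mapped

region[0:11] = b"hello, scm!"  # load/store-style access through the mapping
greeting = bytes(region[0:5])
region.flush()                 # on real SCM, pmem.io flushes CPU caches instead
region.close()
print(greeting)                # b'hello'
```

Note that after `os.close(fd)` the mapping remains fully usable, which is exactly the property the pmem.io naming scheme relies on.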

Figure 3: pmem.io software architecture (source: Red Hat).

3.3 Alternatives to SCM

There are existing technological solutions that offer similar functionality to SCM and that can also be exploited for high performance I/O. One example is NVMe devices: SSDs that are attached to the PCIe bus and support the NVM Express interface. Indeed, Intel already has a line of NVMe devices on the market that use 3D XPoint™ memory technology, called Intel® Optane™ [6]. Other vendors have a large range of NVMe devices on the market, most of them based on different variations of Flash technology.



NVMe devices have the potential to provide byte-level storage access, using the pmem.io libraries. A file can be opened and presented as a memory space for an application, and then can be used directly as memory by that application.

NVMe devices offer the potential to remove the overhead of file access (i.e. data access through file reads and writes) when performing I/O and enable the development of applications that exploit SCM functionality. However, given that NVMe devices are connected via the PCIe bus, and have a disk controller on the device through which access is managed, NVMe devices do not provide the same level of performance that SCM offers. Indeed, as these devices still use block-based data access, fine-grained memory operations can require whole blocks of data to be loaded from or stored to the device, rather than individual bytes.

Memory mapped files are another example of SCM-like functionality using existing technology. Memory mapped files allow I/O storage to be accessed as if it was memory, through the virtual memory manager. However, the data still resides on a traditional storage device and it can therefore only be accessed at the performance rate that this storage device allows.

Memory mapped files allow some of the block storage costs to be avoided (i.e. those costs caused by asynchronous data access and interrupts), but are limited to the size of DRAM available to an application, and have coarser-grained persistency (requiring full blocks to be flushed to disk to ensure persistency), with much higher persistency costs than byte-addressable SCM.
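The coarser persistence granularity can be seen in the mmap flush interface itself. The sketch below (file name invented) dirties a single byte of a mapped file, yet the offset and size handed to `flush` must be page-aligned, so making one byte durable costs at least a whole-page writeback, whereas byte-addressable SCM can persist at cache-line granularity.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE                 # flush granularity for mapped files

path = os.path.join(tempfile.mkdtemp(), "mapped_store")
with open(path, "wb") as f:
    f.truncate(4 * PAGE)

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4 * PAGE)

m[2 * PAGE] = 0xFF                   # a single-byte store...
# ...but durability is only available in page-aligned, page-sized units,
# so one dirty byte costs (at least) a whole-page writeback.
m.flush(2 * PAGE, PAGE)
m.close()

with open(path, "rb") as f:
    data = f.read()
print(data[2 * PAGE])                # 255
```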


4 Requirements

To design the hardware and systemware architectures that will enable us to construct a prototype system exploiting SCM for HPC and HPDA applications, we identified and examined the usage scenarios for such a system and used them to generate a set of requirements.

Whilst detailed reporting of those requirements is not the aim of this document, we will outline the general requirements we considered key when designing the system:

1. The system should be as close as possible to a standard, production, large scale compute system.
2. Applications must be able to run on the system without changes, other than possibly requiring re-compilation, and be able to use the SCM to provide performance benefits.
3. It must be possible to modify applications to directly use the SCM and exploit the full performance the hardware provides.
4. The hardware and systemware must be able to support a heterogeneous SCM environment. We consider that systems where some nodes have SCM and some do not may be desirable for a range of users or system providers; we therefore want our hardware and systemware architectures to be able to support such configurations.
5. The system must be able to support multiple users and multiple applications running at the same time.
6. The system must be flexible enough to allow any application to run on any compute node.
7. It should be possible for multiple applications to share a single compute node, or for user jobs to specify that exclusive node access is required.
8. The system must support the sharing of data between applications or user jobs. In particular we recognise that workflows of applications are an important use case for SCM in such systems.
9. The system should support I/O from applications using non-filesystem based approaches. One candidate could be an object store.
10. The system should provide data access to SCM between compute nodes to allow applications to access remote data and to allow the system to manage data in SCM.
11. The architectures should enable the design of an ExaFLOP range system with an I/O performance sufficient to ensure I/O is not a performance limiter when scaling applications on such a system.
12. The system should allow different storage targets to be configured for a given application. This requires software to be configured for the set of compute nodes allocated to a given user job.
13. The system must support profiling and debugging tools to enable users to understand the performance and behaviour of their programs.


5 Hardware Architecture

Developing a prototype platform to test and evaluate the performance and functionality benefits of SCM is one of the key objectives of the NEXTGenIO project. The SCM technology we are using, 3D XPoint™ memory, is a new type of memory that requires specific hardware support; 3D XPoint™ memory cannot simply be installed in an existing system.

However, one of the major benefits of SCM (NVRAM) is that it comes in the same hardware form factor as standard memory (DRAM; DDR4 for 3D XPoint™). This means we can design and build a system that can take both standard memory and SCM in the same node and provide both to users.

SCM does require specific processor support, primarily because memory controllers in modern computing systems are directly integrated into the processor itself. This means that if we want to use a new memory technology we need processors with support for that new memory in their memory controllers. For instance, for 3D XPoint™, Intel has announced a new generation of Intel® Xeon® processors with the code name "Cascade Lake" for 2018 release that will include such support.

Furthermore, as we are combining two types of memory in the servers (or nodes) in the system, we need to ensure that there is capacity to install sufficient amounts of standard memory and SCM. As we are looking to provide a prototype system where large scale computational simulation and data analytics tasks can be undertaken, we need to provide sufficient hardware capacity to install large amounts of standard DRAM memory and SCM at the same time. In other words, we need a system that has space for a large number of memory modules, or DIMMs, to be installed.

The NEXTGenIO prototype system is designed to provide very high SCM capacities and I/O bandwidth, combining 3D XPoint™ technology with Intel's new Omni-Path interconnect to provide extremely high I/O performance for applications.

A single NEXTGenIO rack holds up to 32 Xeon® servers, Complete Compute Nodes (CCNs), connected by the Omni-Path network, with two login nodes to support local compilation, visualisation, and rack management. Furthermore, two boot servers are provided to store and serve the operating system and software required by the CCNs, and two gateway servers to enable access to an external filesystem for long term storage, and for performance comparison with the I/O provided by the SCM.

The architecture is designed with scalability in mind, facilitating the connection of NEXTGenIO racks into larger systems, reaching out well into the ExaFLOP range. We enable software to access local SCM almost as fast as DRAM, and to access any other CCN's SCM in the cluster faster than accessing a fast storage device (i.e. a fast SSD) inside the node itself.

This SCM-focussed system architecture allows fundamentally new designs for scalable HPC software and provides the functionality required to make data centre power consumption, rather than the system architecture, the final scaling limiter.

Using future processor and network generations, we expect an ExaFLOP configuration based on the NEXTGenIO architecture to become practical within five years.

5.1 Hardware building blocks

5.1.1 Main server node

Since NEXTGenIO is centred on using SCM as a new, ultra-fast storage layer between memory (DRAM) and disk, we have to understand the goals of, and constraints on, how to configure DRAM and NVM in the servers being used.

As shown in Figure 4, the NEXTGenIO server node is a dual processor node and supports SCM. As we are looking for high performance nodes, we need to have all 12 memory channels (as provided by Intel's current "Skylake" generation of Intel® Xeon® processors and announced for "Cascade Lake") populated with some DRAM, to enable utilising the total available DRAM bandwidth in the node.


Figure 4: Main node DRAM/NVM (SCM) configuration.

Since we are also interested in maximum SCM bandwidth and capacity, we also need to populate all 12 memory channels in the node with SCM. Together, that makes 12 DRAM DIMMs plus 12 NVDIMMs, pairing up in each channel. Of course, devices on the same channel share the available DDR4 channel bandwidth.

The server hardware is implemented in a 1U form factor (Figure 5), which is a good compromise between configurability and density.

Figure 5: Unified NEXTGenIO server node (source: FUJITSU).

Figure 6 shows the NEXTGenIO server node implementation block diagram. While most details are not relevant here, some are important. In total, there are three PCIe x16 slots available for add-in cards in the system. We will develop support for processors with an integrated network connection, i.e. one channel coming off each CPU; this currently requires a PCIe slot to carry an "IFT Card", holding the modules to receive external cables (shown on the upper right in the picture). Consequently, that leaves two PCIe x16 slots for other I/O cards.


Figure 6: Server node block diagram.

5.1.2 Ethernet switch

For system management and general-purpose inter-node communication, two Ethernet networks are included: 1GE for the management LAN and slow data connections, and 10GE to connect to the data centre LAN as well as to provide fast connections within the rack.

5.1.3 Network switch

The hardware architecture supports the integration of most common high performance networks to provide the fast, high bandwidth interconnect between the nodes in the system. This includes InfiniBand-based networks, Omni-Path Architecture and fast Ethernet, connected via PCI Express or (for Omni-Path) integrated into a multi-chip CPU package.

One feature that is necessary for a system to support all the project's requirements is remote data access over the network. RDMA offers the potential for high performance access to remote SCM, which is important to enable some of the use cases we are considering.

5.1.4 Login nodes

As with any HPC cluster, we need login nodes to access the system, perform compilations and support remote visualisation. Login nodes have different requirements from the compute nodes, and as such will be provisioned with high memory and processing capabilities, along with local storage to enable fast compilation of applications.

This local storage can be provided by SCM in the nodes or by local SSDs, depending on the cost and availability of these components. SCM could provide the benefit of a large memory space for the login nodes (i.e. if it supported 2LM functionality).

5.1.5 Boot nodes

To enable booting of the compute nodes via the network, for easy administration of the operating system, and to install software on the compute nodes, we provide a number of boot nodes. These have 1GE network connections to the compute servers, connected through 10GE switches (to reduce the risk of becoming saturated during system initialisation).


Boot nodes typically have low requirements on compute performance, i.e. even a mainstream single CPU is expected to be sufficient. The storage capacity requirements are also moderate. However, to avoid bottlenecks during boot storms, SSDs or SHDDs (HDDs with an internal NAND flash buffer) seem to be appropriate.

5.1.6 Gateway nodes

If an Omni-Path network is used as the main interconnect within the system, and we need to connect the system to an external Lustre filesystem based on InfiniBand, we will need some gateway servers that bridge between the two interconnects. Intel has techniques for building such gateways using standard Intel® Xeon® servers and the Linux operating system.

5.2 Prototype considerations

The choice of network for the prototype is between InfiniBand and Omni-Path based solutions. Whilst InfiniBand offers hardware RDMA, Omni-Path provides the potential for a network integrated onto the processor (integrated OPA or iOPA). For the NEXTGenIO prototype we decided to evaluate the integrated network functionality of Omni-Path.

5.2.1 RDMA options for the prototype

There are a number of ways we can provide remote data access in the prototype system. If we utilised an InfiniBand-based network we could enable RDMA (supported by the network adapter hardware), providing very low latency (1 microsecond or less) and high bandwidth access, even with small I/O operations. This hardware RDMA also does not use much of the compute resource in a node, as the network adapter has its own processing hardware to support the RDMA operations.

Figure 7: Local and RDMA SCM latency with hardware RDMA.

However, the current generation of Omni-Path network adapters does not support hardware accelerated RDMA (future generations of Omni-Path are scheduled to have such support). In this scenario, parts of the RDMA operations need to be executed by the host CPU. Assuming an optimised RDMA emulation target using polling, we expect a remote SCM block read/write latency of around 4 to 5 microseconds (Figure 8). Even though this is significantly higher than with hardware accelerated RDMA, it will still be in the range for effective use of synchronous I/O.

Figure 8: Local and remote SCM with RDMA emulation.

5.2.2 Applicability of the architecture without access to SCM

Whilst this hardware architecture is designed around SCM, it is more generally applicable. For instance, local (in-node) SCM could be replaced by local non-volatile disk drives, using NVM Express (NVMe) devices. These provide byte-level access to persistent memory from applications, and are compatible with the pmem.io library functionality. It would also be possible to replace in-node SCM with remote NVMe devices, using NVMe over Fabrics (NVMF).

NVMe latency and bandwidth will not match the performance of the SCM we will use in our prototype architecture (i.e. 3D XPoint™ memory): the best sustained NVMe latencies for current technology are around 100-300 µs, with bandwidth of around 2.5-3.5 GB/s (for a x4 PCIe connection), compared to 300-500 ns and 8-10 GB/s for the SCM (per DIMM). The NVMF latency will be higher than with remote SCM accessed by RDMA. We expect around 10-20 µs (read/write) with 3D XPoint™ SSDs (Optane) and around 10 µs (read/write) with an SCM-based NVMF target.

5.3 System configuration

The NEXTGenIO prototype rack provides approximately 100 TB of NVRAM capacity based on Intel's 3D XPoint™ technology [1]. It is accessible from any (and to any) CCN in less than 5 microseconds, with very high I/O bandwidth (almost 400 MIOPs).

The rack holds up to 32 Intel® Xeon® servers, also known as complete compute nodes (CCNs), connected by an Intel® Omni-Path Architecture fabric. Two login nodes support local compilation, visualisation, and rack management. Two boot servers enable network boot, and two gateways connect to an external file system. The node configuration details and network connectivity are shown in Figure 10.


Figure 9: The NEXTGenIO prototype rack.

Figure 10: Prototype rack details (NOTE: capacities/sizes of memory and other hardware component specifications are based on current expectations, rather than exact specifications of the prototype).


Assuming 2 to 4 TFLOPS per node, the NEXTGenIO rack with 32 nodes achieves 64-128 TFLOPS and approximately 384 MIOPS, i.e. around 1.5 TB/s of local block storage bandwidth. For the purposes of the prototype this scale is sufficient, but one key aspect of this work is to show how this could scale into the ExaFLOP range.
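The rack-level figures follow directly from the per-node assumptions; a quick back-of-envelope check (the 4 kB I/O size is the block size assumed throughout this section):

```python
# Sanity-check the single-rack figures quoted above.
nodes = 32
tflops_per_node = (2, 4)
rack_tflops = tuple(t * nodes for t in tflops_per_node)  # (64, 128)

miops = 384e6                  # ~384 million I/O operations per second
block = 4096                   # 4 kB per operation
local_bw_tb = miops * block / 1e12
print(rack_tflops, round(local_bw_tb, 2))  # (64, 128) 1.57 -> "around 1.5 TB/s"
```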

5.4 ExaFLOP vision

Since the architecture supports nodes using the two integrated network ports of the CPUs, we have two PCIe x16 slots available per node. Using two single-channel network cards, we can extend the single rack system to many racks, but still preserve the same hardware/software architecture. It would just add a second level of "slower" connectivity, i.e. between racks. "Slower" in this context means higher latency and lower bandwidth per node, but still keeping the concept of synchronous I/O, at least up to a range of 10 to 20 microseconds latency.

With a 5-hop network we can link 864 racks with 50 to 100 PFLOPs, 83 PB NVM, and 332 GIOPs:

Figure 11: Scaling the NEXTGenIO architecture beyond one rack.

With cubical 3D torus configurations, we can project scaling to the ExaFLOP range, still using the same NEXTGenIO hardware/software architecture (Table 1).

Total # Nodes        PFLOP           PFLOP           Total SCM       Total SCM I/O   Total Power
(Intel DP Skylake)   (Min Estimate)  (Max Estimate)  Capacity (PB)   B/W (TB/s)      (MW)
768                  1.5             3               2               36              0.4
3,072                6               12              9               144             1.5
24,576               48              96              72              1,152           12
82,944               162             324             243             3,888           41
196,608              384             768             576             9,216           98
384,000              750             1,500           1,125           18,000          192

Table 1: NEXTGenIO scalability (3D torus projections).

At 750 to 1,500 PFLOPs, the total SCM capacity would be 1.1 Exabytes. The total I/O bandwidth would be 18 PB/s, which corresponds to the speed of approximately 6 million current-day PCIe-based SSDs.
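These equivalences can be checked against the last row of Table 1; the ~3 GB/s per-SSD figure is an assumption for a current-day PCIe SSD, consistent with the NVMe bandwidths quoted in Section 5.2.2:

```python
total_io_bw = 18e15                 # 18 PB/s total SCM I/O bandwidth (Table 1)
ssd_bw = 3e9                        # ~3 GB/s assumed per current-day PCIe SSD
equivalent_ssds = total_io_bw / ssd_bw
print(round(equivalent_ssds / 1e6, 1))   # 6.0 -> "approximately 6 million"

scm_capacity_pb = 1125              # total SCM capacity from Table 1, in PB
print(scm_capacity_pb / 1000)       # 1.125 -> "1.1 Exabytes"
```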


The FLOP to I/O bandwidth ratio would be 1:50, i.e. at a similar level to the FLOP to memory bandwidth ratio that HPC systems aspire to.

The power needed for a 1 ExaFLOP system based on current Intel processors would be impractical: almost 200 MW. However, based on future CPUs and interconnects, an ExaFLOP configuration within an affordable 20-30 MW should become feasible early in the next decade.

5.4.1 Scalability to an ExaFLOP and beyond

Technically, the OPA link layer can address more than 10 million nodes [4]. Such a cluster size would be far beyond any practical power limit (5 GW). Early next decade, however, a dual processor (DP) node may have 50 TFLOPs or more in the same power envelope. Then, 50,000 nodes would achieve 2.5 ExaFLOPs in a much more practical power budget of 25 MW.

5.4.2 Scalability into 100 PFLOP/s

According to Intel, a 5-hop OPA cluster can link up to 27,648 nodes. With our NEXTGenIO rack and a single OPA inter-rack link per node, this would lead to the following configuration (see also Figure 12):

- 864 racks with 27,648 NEXTGenIO nodes
- 50 to 100 PFLOPs
- 83 PB of SCM with 332 GIOPs @ 4 kB = 1.36 PB/s local I/O
- 14 MW power (assuming 500 W/node including infrastructure)
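The configuration listed above is internally consistent, as a short check shows:

```python
racks, nodes_per_rack = 864, 32
nodes = racks * nodes_per_rack       # 27,648 nodes in total
giops = 332e9                        # 332 GIOPs across the cluster
block = 4096                         # @ 4 kB per operation
local_io_pb = giops * block / 1e15   # aggregate local I/O in PB/s
power_mw = nodes * 500 / 1e6         # 500 W/node incl. infrastructure
print(nodes, round(local_io_pb, 2), round(power_mw))  # 27648 1.36 14
```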

The inter-rack bandwidth would depend on the network topology. Not knowing the configuration details, we expect the bandwidth per node to be lower than the 2x100G available in our prototype rack. The latency, however, would be similar. Assuming 1 microsecond or less round-trip per hop, the MPI latency should stay below 5 microseconds, and remote SCM access via RDMA emulation below 10 microseconds. This would enable synchronous remote SCM storage access, but at somewhat lower efficiency than within the racks.

Figure 12: Intra-rack vs. inter-rack connectivity.

Since the bandwidth per node may be limited by these inter-rack bandwidth restrictions, the software should be aware of this extra network level. The scheduler and applications should therefore ensure that inter-rack accesses are less frequent than intra-rack accesses. This is very similar to today's NUMA memory access optimisation within compute nodes.


6 Systemware Architecture

The systemware for the NEXTGenIO architecture includes all the software that runs on the system (excluding user applications). As described in Section 5, there are four main components in the prototype:

• Login nodes
• Boot/Management nodes
• Gateway nodes
• Complete compute nodes

For large scale versions of the system we are designing, we would expect there to be a fifth type of node, Job Launcher/Scheduler nodes, which would run the systemware components that schedule and run user jobs, and collect performance data from the system as a whole. For full-scale production systems, it would not be desirable to run the scheduling and monitoring systems from the login nodes, as user behaviour could compromise the scheduling components and interrupt the running of jobs on the system, and the small number of login nodes might be overwhelmed by the load created by monitoring, orchestrating and scheduling a very large system. Furthermore, separating out the login and job scheduling/launching nodes allows the enforcement of different levels of security on those components, thus facilitating a secure system overall. However, given the scale of our prototype, this functionality has been assigned to the login nodes instead of requiring a separate set of nodes.

There are also other hardware components that are important to the prototype system we are developing, but are not part of the core systemware architecture, specifically:

• Network switches: these will require management functionality in our prototype system, but they are not part of what we consider the core NEXTGenIO systemware, as they are simply a component of the network they belong to. We will use standard components for the selected network fabric.

We have added these additional components to the systemware diagrams below, but they are not described in the following sections; we view them as standalone components that are not key to the architectures we are designing in this project (i.e. they could be replaced with other components and still provide the same functionality to our system).

6.1 Systemware Overview

The systemware provides the software functionality necessary to fulfil the requirements we have identified. It provides a number of different sets of functionality, related to different ways to exploit SCM.

Briefly, from a user's perspective the system enables the following scenarios:

1. Users can request systemware components to load/store data in SCM prior to a job starting, or after a job has completed. This can be thought of as similar to current burst buffer technology. This can address the requirement for users to be able to exploit SCM without changing their applications.
2. Users can directly exploit SCM by modifying their applications to implement direct memory access and management. This offers users the ability to exploit the best performance SCM can provide, but requires application developers to undertake the task of programming for SCM themselves, and ensure they are using it in an efficient manner.
3. Provide a filesystem built on the SCM in the CCNs. This allows users to exploit SCM for I/O operations without having to fundamentally change how I/O is implemented in their applications. However, it does not enable the benefit of moving away from filesystem operations that SCM can provide.
4. Provide an object store that exploits the SCM to enable users to explore different mechanisms for storing and accessing data from their applications.
5. Enable the sharing of data between applications through SCM. For example, this may be sharing data between different components of the same computational workflow, or the sharing of a common data set between a group of users.
6. Ensure data access is restricted to those authorised to access that data, and enable deletion or encryption of data to make sure those access restrictions are maintained.
7. Provision of profiling and debugging tools to allow application developers to understand the performance characteristics of SCM, investigate how their applications are utilising it, and identify any bugs that occur during development.
8. Efficient check-pointing for applications, if requested by users.
9. Provide both 1LM and 2LM modes if they are supported by the SCM hardware.
10. Enable or disable systemware components as required for a given user application, to reduce the performance impact of the systemware on a running application if it is not using those systemware components.

We have not designed the systemware to support the following functionality:

1. Automatic copying or mirroring of application data between nodes without a job explicitly requesting it.

The systemware architecture we have defined is outlined in the following diagram. We will describe it in more detail in the following sections. Each distinct type of node in the system has its own part in the systemware architecture.

Figure 13: Systemware architecture.

6.2 Login Nodes

The primary purpose of the login nodes is to provide resources for users to perform data movement, source code compilation, job submission, and job management. Users will be able to log in to these nodes, and will have access to a local filesystem on the login nodes as well as to a global parallel filesystem, which is also accessible from the CCNs.

The systemware components for the login nodes are shown in Figure 14, which is a subsection of the overall systemware architecture shown in Figure 13.


Figure 14: Login node systemware components.

6.2.1 Communication Libraries

Compatible MPI libraries/functionality are essential to support the HPC applications that will be using the system. As many parallel applications use MPI-I/O functionality, the MPI library should support at least the MPI-2.0 APIs and functionality, and MPI-I/O should be optimised to work with SCM.

6.2.2 Compilers

Applications need to be compiled for particular hardware platforms. Compute nodes for HPC systems generally are not set up for direct user access or for compilation, so login nodes are required on an HPC system to let users build/compile their applications, and launch their jobs through the batch system.

Compilation can require reading and writing of large numbers of small data files, which can be slow on parallel filesystems. SCM could speed up such compilation by providing a faster filesystem for it.

HPC applications are generally C, C++, or Fortran programs. As such, it is essential to have support for those three languages in the compilers on the system. There may be varying levels of standards usage and compliance, particularly in the C++ and Fortran codes (i.e. C++11 or Fortran 2015 features could be used by some codes), so it is necessary to understand the C++ and Fortran language support that the available compilers provide.


HPC applications generally use some parallel functionality, primarily MPI and OpenMP (although other parallel functionality is used, such as OpenACC, OpenCL, CAF or UPC). Therefore the compilers also need to support linking to communication libraries for MPI-based (and other distributed memory parallelisation) programs, and to node-local parallelisation libraries/functionality such as OpenMP.

6.2.3 I/O Libraries

Many applications use I/O libraries, such as HDF5 or NetCDF, to simplify I/O for particular application areas, to provide metadata for datasets, or to integrate data with other applications/tools.

I/O libraries may be built on top of other I/O libraries (i.e. HDF5 may build on MPI-I/O for parallel functionality, and NetCDF on top of HDF5).

6.2.4 Java

Java is a general-purpose, concurrent, class-based and object-oriented programming language which has gained considerable popularity, and consequently is being widely adopted within the scientific community. The latest Java version is 8.

Some new task-based programming environments, such as COMPSs, are implemented in Java; consequently, Java support must be provided on the login nodes in order to enable such environments there (e.g. to submit a PyCOMPSs application to the queuing system and set up the environment within the assigned CCNs).

6.2.5 Job Launcher

For performance and security reasons the CCNs cannot be accessed directly by users. Jobs are submitted using a job scheduler, which takes input from the user (via a job scheduling/batch script) specifying the application to be run, the resources to be used, and other data that is required for successful running of the job. The batch system then schedules and runs the job on the CCNs when resources are available, following defined scheduling policies.

Often the application is actually run by a job launcher, a program specified by the user in the batch script which starts the parallel application on the CCNs that the batch system has allocated to the job. The job launcher is generally part of the parallel libraries on the system, specifically the MPI libraries and associated ecosystem.

6.2.6 Job Scheduler

A job scheduler is responsible for allocating nodes to run parallel jobs. In order to provide this basic functionality, a job scheduler must:

• Provide authentication and authorisation for users and groups
• Provide accounting and historical record keeping
• Implement several scheduling algorithms
• Allow interactions by providing a set of helper applications (and a programmatic API)
• Provide a backup job scheduler for a fault-tolerant configuration
• Allocate and run jobs in a timely manner
• Manage and track data associated with user jobs

The job scheduler must be aware of SCM resources in the system, allow users to specify SCM requirements and data movement requirements, understand the configuration of CCNs, and allow users to specify configuration requirements of CCNs for their jobs.

The job scheduler will communicate with, and control, the job scheduling and data scheduling components on the CCNs, to enable the system to run in as efficient a manner as possible and to ensure user data is secure and safe.
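As an illustration only, a batch script for such an SCM-aware scheduler might express SCM and data movement requirements through directives alongside the usual resource requests. The `#NXIO` directive syntax below is entirely hypothetical (no scheduler in the project is confirmed to use these keywords), and the paths are placeholders:

```shell
#!/bin/bash
# Conventional resource requests (syntax modelled on common batch systems):
#SBATCH --nodes=4
#SBATCH --time=01:00:00

# Hypothetical SCM-aware directives a NEXTGenIO-style scheduler could accept:
#NXIO scm-capacity=1TB                   # per-node SCM the job needs
#NXIO scm-mode=1LM                       # request 1LM rather than 2LM
#NXIO stage-in=/lustre/proj/input.dat    # copy into SCM before the job starts
#NXIO stage-out=/lustre/proj/results/    # drain from SCM after the job ends

mpirun ./my_application
```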


There will be multiple scheduling components in the job scheduler, including components that can schedule jobs based on data location or energy characteristics (the data aware job scheduler and the energy aware job scheduler).

6.2.7 Local Filesystem

For performance reasons, users of HPC systems should have access to both a high performance parallel filesystem (/work), such as Lustre, and a standard backed-up NFS filesystem (/home).

CCNs may have access only to the high performance filesystem, while login nodes have access to both, so that users can securely store data on a backed-up system while not putting unnecessary strain on the high performance filesystem.

Compilation can create a large number of small files, especially when compiling large programs, and this can have a significant impact on the performance of parallel filesystems. Having a more local filesystem on the login nodes, especially one that can be backed up if necessary, is essential to enable efficient compilation and to maintain high application performance on the CCNs.

6.2.8 Meta Scheduler

The meta scheduler coordinates the scheduling decisions between the Data Aware Job Scheduler and the Energy Aware Job Scheduler. For example, instead of moving data it can schedule jobs where the data is located, or decide to stage-in and stage-out data during the execution of the job if there is spare network capacity. This functionality will use the job scheduler, and through it the data schedulers on CCNs, providing dynamic, adaptive scheduling to maximise the efficient use of the machine.
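The "schedule jobs where the data is located" decision can be sketched as a toy placement function. Everything here (function and node names, the data structures) is invented for illustration and is not the project's actual scheduler interface:

```python
def place_job(job_data, free_nodes, scm_contents):
    """Return (node, needs_staging) for one job.

    job_data:     identifier of the dataset the job needs
    free_nodes:   list of nodes currently available, in preference order
    scm_contents: mapping node -> set of datasets resident in its SCM
    """
    for node in free_nodes:
        if job_data in scm_contents.get(node, set()):
            return node, False          # data-local: no movement needed
    return free_nodes[0], True          # stage the data in over the fabric

# Example: dsetB already resides in ccn02's SCM, so the job lands there.
scm = {"ccn01": {"dsetA"}, "ccn02": {"dsetB"}}
node, staging = place_job("dsetB", ["ccn01", "ccn02"], scm)
print(node, staging)                    # ccn02 False
```

A real meta scheduler would weigh data locality against energy and queueing costs rather than applying locality as an absolute preference.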

6.2.9 Object Store

Any object store we have running in the system may require tools to be available on the login nodes for users to access, query, update, and remove data from the object store. Therefore, we should have the ability to run object store access software on the login nodes.

Furthermore, some object stores have management and control functionality that can be used to administer and monitor the object store. The login node would be the appropriate location for that functionality as well.

6.2.10 Operating System

The operating system on the login nodes requires the functionality to provide external access to the system and to undertake compilations and visualisation/graphical operations. Furthermore, the meta and job schedulers will run on the login nodes.

The login nodes may have their own local boot media as well as the potential to PXE-boot from the boot node. Besides standard functions, the O/S must support the CPUs that support SCM, a GPU, and the high performance network interface.

6.2.11 Performance Monitoring

Studying system and application performance requires a performance-monitoring component. These performance monitoring interfaces will be responsible for performance data collection during the runtime of an HPC application. This covers both data about the application and data on the actions of the systemware.

In many cases the performance monitoring infrastructure is integrated into, or attached to, the applications to be monitored. For certain performance aspects, e.g. the scheduling of jobs, it is necessary to provide a self-contained monitoring component that runs independently as part of the systemware.


6.2.12 Performance Visualisation and Debugging Tools

Performance profiling uses statistical analysis to study the overall performance of HPC applications. Performance tracing allows the recording of individual events from an application. The resulting trace file enables in-depth analysis of the application's runtime behaviour and is typically visualised using a time series. In contrast, debugging tools provide access to an application while it is running to address correctness issues as they happen.

All three forms of performance tracking require a sophisticated graphical interface and various systemware components to provide their service. In this project the tools Allinea MAP, Vampir/Score-P, and Allinea DDT will be deployed. Together, they provide comprehensive in-depth analysis to aid in resolving errors and bottlenecks and in performing optimisations.

Vampir/Score-P is a framework for performance analysis, which enables developers to quickly study application behaviour at a fine-grained level and perform in-depth root-cause analysis. This level of detail is achieved by means of recording performance events and storing them to the trace file. An important feature of Vampir is the intuitive graphical representation of detailed performance events on a time axis and of aggregated profiles.

These tools differ from the monitoring component of Section 6.2.11 in that they are used by application developers during code development and optimisation, rather than being the data collection components of performance monitoring/debugging.

6.2.13 Python

Python is a free, interpreted, object-oriented, high-level programming language with dynamic semantics. It is characterised by enabling rapid application development, as well as being used as a scripting or glue language to connect existing components together. It is becoming widely adopted within the scientific community due to its quick learning and development process, as well as for the existing scientific libraries, such as NumPy and SciPy.

In order to run Python applications, a Python interpreter is needed on the login nodes. Thus, in order to provide support for scientific applications, some libraries may be of interest (e.g. NumPy or SciPy).

Given that the applications that will be executed in NEXTGenIO that take advantage of PyCOMPSs are Python applications, we need Python support on the login nodes.
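As an illustration of the kind of scientific Python usage expected on the login nodes, a minimal sketch using NumPy (assuming NumPy is installed alongside the interpreter):

```python
import numpy as np

# Build a small dataset and compute summary statistics -- the kind of
# glue/analysis work Python is commonly used for on login nodes.
samples = np.linspace(0.0, 1.0, num=101)
mean = samples.mean()
total = samples.sum()
```

Heavier processing would normally be submitted to the CCNs through the batch system rather than run on the login nodes themselves.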

6.2.14 Scheduling Tools

Scheduling tools are used to interact with the job scheduler (either by invoking a command/application or programmatically using a well-defined API). Interactions with a job scheduler include:

• Submitting a batch script or single application for a parallel run
• Transmitting a file to nodes allocated for a particular job
• Querying the scheduler about the current state of jobs/nodes
• Cancelling a running or pending job
• Getting/updating the state of jobs/nodes
• Displaying accounting/historical data about nodes/jobs/users
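A minimal sketch of a programmatic wrapper over such a command-line interface, assuming a SLURM-like scheduler (the command names `sbatch`, `squeue`, and `scancel` are an assumption; the deliverable does not mandate a particular scheduler). The sketch only constructs the argument lists; a real tool would pass them to `subprocess.run()`:

```python
# Build (but do not execute) scheduler command lines for the
# interactions listed above, assuming SLURM-style commands.
def submit(script, nodes=1, time_limit="01:00:00"):
    return ["sbatch", f"--nodes={nodes}", f"--time={time_limit}", script]

def query(job_id):
    return ["squeue", "--job", str(job_id)]

def cancel(job_id):
    return ["scancel", str(job_id)]

cmd = submit("run.sh", nodes=4)
```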

6.2.15 Scientific Libraries

Many of the operations carried out in scientific applications can be composed from standard components. These have been developed and optimised extensively by domain specialists as well as hardware and compiler vendors. High-performance computational libraries are often optimised for the particular systems they are to be used on.


6.2.16 Task-based programming model

New programming models for the exploitation of large scale HPC systems aim to tackle programmability and performance for very large numbers of program processes or threads. One such approach is to use a task-based programming model, where developers specify the tasks required for their application and the runtime environment of the programming model distributes and runs these tasks as efficiently as possible.

One example of such a model is PyCOMPSs, which is the Python binding of COMP Superscalar (COMPSs), a task-based programming model that aims to ease the development of applications for distributed infrastructures.

It offers a simple programming model based on sequential development: a PyCOMPSs application is a plain sequential Python script with annotations that define tasks and their data dependencies. In the model, the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks and (ii) annotating them with a standard Python decorator.
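A toy illustration of this annotation style. This is not the real PyCOMPSs API (a real application would import `@task` from `pycompss.api.task`, and the runtime, not this sketch, would schedule tasks asynchronously across nodes):

```python
import functools

def task(fn):
    """Mark a function as a candidate asynchronous task (toy version).

    A real task-based runtime would record data dependencies here and
    defer execution; this sketch simply runs the function inline.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapper.is_task = True
    return wrapper

@task
def increment(x):
    return x + 1

# The application remains a plain sequential script.
result = increment(41)
```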

The login nodes must have all the components of the task-based programming model required for the development of applications and the submission/management of jobs based on this technology.

6.3 Gateway Nodes

The primary purpose of the gateway nodes is to provide access to external storage systems. The nodes provide high bandwidth gateway functionality from InfiniBand (IB) to Omni-Path (OPA) and vice versa.

6.3.1 Operating System

The operating system on the Gateway nodes requires the functionality to provide external access to storage systems. For the prototype system, the bridging software running on the Gateway nodes will be the LNET Router software from Intel. Besides standard functions, this component must support the LNET router software, the new processors to be used in the prototype, and the network interfaces required to bridge between the main compute network and the external storage system network.

6.4 Boot Nodes

The boot nodes provide management and software provisioning functionality for the system. They will provide the CCNs with software images or downloads to enable CCNs to operate without local storage.

They will only be accessible by system administrators. The boot node systemware components are outlined in Figure 15.


Figure 15: Boot node systemware components

6.4.1 Operating System

The operating system on the Boot nodes requires the functionality to host the Pre-boot Execution Environment (PXE) to provide boot services via the Trivial File Transfer Protocol (TFTP) to other nodes.

The boot nodes will also offer Domain Name System (DNS) and Dynamic Host Configuration Protocol (DHCP) services within the prototype domain. The boot node will host management and monitoring software and is the central hub of the NEXTGenIO prototype management system.

This includes CCN, Ethernet network, and Omni-Path fabric management. The boot nodes will be used for management access to all other nodes, the Ethernet switches, and the Omni-Path switches. Therefore the operating system will need to support the management, monitoring, DNS, DHCP, and PXE-boot software that will be run on it.

6.4.2 Management/Monitoring Software

The CCN and login nodes within the system require some software to manage their operation, start-up, and shutdown. This includes monitoring the health of individual nodes, provisioning software images to boot compute nodes from, and managing the lifecycle of the whole system. Several different interfaces to the board management controllers (BMC) on each node are available.

The traditional Intelligent Platform Management Interface (IPMI) over LAN, the Simple Network Management Protocol (SNMP), or the newer RESTful Redfish (DMTF) can be used to retrieve and set management data and functions via a remote out-of-band (Management LAN) interface.

6.4.3 PXE-Boot Server

The boot server provides functionality for CCNs, or login nodes, to boot via the network rather than having to have an operating system installed locally. This, in turn, means that the CCNs can operate without local storage/disks attached.


6.4.4 DHCP Server

The Dynamic Host Configuration Protocol (DHCP) is a client/server protocol that automatically provides a host with its IP address and other related configuration information, such as the subnet mask and default gateway.

This component provides IP addresses on the normal network for the CCN, login, and boot nodes in the system.

6.4.5 DNS Server

The DNS server is any computer that provides a name service for a set of computers on a network. A DNS server runs special-purpose networking software, has a public IP address, and contains a database of network names and addresses for other hosts.

6.4.6 Network Management

Software that configures and monitors the switches and networks used in the system is essential to ensure the correct running of the prototype. This will involve managing all of the networks we will deploy in the system through the management network.

6.4.7 Local Filesystem

The services on the boot node, such as the PXE-boot service, need a place to store data (such as the boot images).

6.5 Complete Compute Nodes

The CCNs provide the computational resource for the system. They will not be directly accessible by users; user jobs will be run through scheduling components installed on the nodes.

There will be a large number of CCNs in the system and they will all have the same software configuration(s). It is expected that users will be able to reconfigure CCNs for different modes of hardware operation (rebooting between different SCM configuration modes), and potentially to enable or disable different systemware components (such as the object store or SCM filesystem if those are not in use).

Therefore, whilst the defined systemware architecture for the CCNs will be uniform, different components may be active or enabled on different CCNs at any given time, depending on the jobs that are being run on the system.

The NEXTGenIO architectures are also designed to allow some level of heterogeneity in the configuration of CCNs, with the potential to either have the same amount of SCM across all nodes, or to have systems where some nodes have no SCM. This is to allow systems to be deployed where SCM is provided in a subset of the CCNs, with the benefits of SCM available across the system using the functionality of the systemware NEXTGenIO is designing and implementing.


Figure 16: Complete compute node systemware components

6.5.1 Automatic Check-pointer

Applications that run for long periods of time (over a few hours) generally have to check-point (write some data to a persistent storage system, like a filesystem, that can be used to restart the application if necessary) to cope with hardware failures and job queue limits.

Currently, applications perform their own check-pointing; however, with SCM on node it could be possible for the automatic check-pointing systemware component to periodically check-point whole application instances into SCM, removing the need for applications to check-point themselves.
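A minimal sketch of application-level check-pointing to a persistent path. On a NEXTGenIO node the target directory would be an SCM-backed (e.g. DAX-mounted) filesystem; here a temporary directory stands in so the sketch is self-contained:

```python
import os
import pickle
import tempfile

CHECKPOINT_DIR = tempfile.mkdtemp()  # stand-in for an SCM mount point

def checkpoint(state, step):
    """Atomically persist application state for a given step."""
    path = os.path.join(CHECKPOINT_DIR, f"ckpt_{step}.pkl")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # ensure the data reaches persistent media
    os.replace(tmp, path)      # atomic rename: never a half-written file
    return path

def restore(path):
    with open(path, "rb") as f:
        return pickle.load(f)

saved = checkpoint({"iteration": 128, "residual": 1e-6}, step=128)
state = restore(saved)
```

The write-to-temporary-then-rename pattern means a crash mid-checkpoint leaves the previous valid check-point intact, which is the property a restart needs.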

6.5.2 Communication Libraries

Communication libraries must be available on the CCNs for use by HPC and HPDA applications if they have been compiled dynamically (i.e. the operating system looks for the library object files when the application runs rather than including them at compile time).


6.5.3 Data Movers

Data movers are used by the data scheduler to undertake specific data operations. This could be copying data between different filesystems, from local to remote CCNs, or between different data storage targets (for instance between an object store and a filesystem).

There will be different data movers for different operations (e.g. one for moving files around, another for moving data to a remote CCN).

6.5.4 Data Scheduler

The data scheduler can migrate data between different hardware levels within a single node, independently from an application, and between different nodes in the system, specifically to and from SCM on other nodes over the high performance network (probably using RDMA). This functionality can be requested by other software components (including applications).

The data scheduler also keeps track of data and what hardware that data is located in. On the CCN, the data scheduler component provides the local functionality to move data between storage levels (e.g. filesystem, SCM, DRAM) as instructed by the higher level scheduling component (the job scheduler on the CCN in coordination with the master job scheduler).

The data scheduler is primarily associated with moving data for a given job or application; this may be staging data in prior to a job starting or staging data out after a job has finished.

The data scheduler is also responsible for handling remote data accesses.

6.5.5 I/O Libraries

Many applications use I/O libraries, such as HDF5 or NetCDF, to simplify I/O for particular application areas, to provide metadata for datasets, or to integrate data with other applications/tools.

I/O libraries may be built on top of other I/O libraries (e.g. HDF5 may build on MPI-I/O for parallel functionality, and NetCDF on top of HDF5).

6.5.6 Java

See subsection 6.2.4.

6.5.7 Job Scheduler

A job scheduler agent running on the CCNs is responsible for running an application which is part of a parallel job. In order to provide this basic functionality, a job scheduler agent must:

• Be able to update the current configuration environment of a node;
• Execute a parallel application;
• Monitor all tasks running on a node;
• Kill/cancel tasks running on a node;
• Communicate with the master job scheduler (in order to provide updates);
• Accept commands from the master job scheduler;
• Be able to act if communication with the master job scheduler is lost;
• Be able to communicate with the data scheduler to understand the current state of data in the system;
• Pass commands to the data scheduler.
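The responsibilities listed above can be sketched as the state a per-node agent must track. The class and method names below are illustrative assumptions, not a defined NEXTGenIO API:

```python
# Minimal sketch of a per-node job scheduler agent's state.
class JobSchedulerAgent:
    def __init__(self):
        self.node_config = {}      # current node configuration (e.g. SCM mode)
        self.tasks = {}            # task_id -> state
        self.master_reachable = True

    def update_config(self, key, value):
        self.node_config[key] = value

    def launch(self, task_id):
        self.tasks[task_id] = "running"

    def cancel(self, task_id):
        if self.tasks.get(task_id) == "running":
            self.tasks[task_id] = "cancelled"

    def on_master_lost(self):
        # If the master is unreachable, keep existing tasks running but
        # stop accepting new work until contact is re-established.
        self.master_reachable = False

agent = JobSchedulerAgent()
agent.update_config("scm_mode", "1LM")
agent.launch("job42.task0")
agent.cancel("job42.task0")
```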

6.5.8 Management/Monitoring Software

The compute and login nodes within the system require some software to manage their operation, start-up, and shutdown. This includes monitoring the health of individual nodes, provisioning software images to boot CCNs from, and managing the lifecycle of the whole system.


This component runs on each of the CCNs and is used by the management/monitoring software components on the boot nodes to undertake management or monitoring operations on the CCNs.

6.5.9 Multi-node SCM Filesystem

The multi-node SCM filesystem provides a lightweight POSIX filesystem that has two main goals: first, to hide the complexity of the different levels of local storage present in a computing node (i.e. DRAM, SCM, and the high performance parallel filesystem) by aggregating them into a virtual storage device; and second, to allow computing nodes to export SCM regions so that they can be used transparently by other nodes, thus allowing for a larger storage capacity and the improved efficiency of collaborative, multi-node I/O operations.

The rationale behind this software component is that there are many legacy HPC applications which may be hard or costly to modify to use SCM technology; it would be desirable for them to at least get some of the benefit of the NEXTGenIO architecture without any modifications.

Additionally, the multi-node SCM filesystem offers the following opportunities and advantages: (1) policies can be added to automatically control the level at which data is stored (DRAM, SCM, or the high performance parallel filesystem) and when it should be migrated; (2) the opportunity to distribute I/O between the SCM of a set of nodes and keep the results in a "cooperative burst buffer", which could be reused between jobs or written to long-term storage when the cluster load is appropriate; and (3) an API for schedulers and applications so that they can better communicate their needs to the filesystem.
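A toy sketch of a data-placement policy of the kind described in point (1), choosing a storage level from the size and access frequency of a dataset. The thresholds and tier names are invented for illustration only:

```python
# Choose a storage tier from dataset size and access frequency.
# Thresholds are illustrative, not part of the NEXTGenIO design.
def place(size_bytes, accesses_per_hour):
    GiB = 1024 ** 3
    if size_bytes < 1 * GiB and accesses_per_hour > 100:
        return "DRAM"             # small and hot: keep closest to compute
    if accesses_per_hour > 1:
        return "SCM"              # warm data: node-local persistent memory
    return "parallel-filesystem"  # cold data: external long-term storage

tier_hot = place(256 * 1024 ** 2, accesses_per_hour=500)
tier_cold = place(10 * 1024 ** 3, accesses_per_hour=0)
```

A real policy engine would also consider migration cost and current tier occupancy before moving data.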

6.5.10 Object Store

Recently some HPC applications have started using object stores to manage data instead of, or as a complement to, traditional filesystems. Object stores provide applications with variable granularity data in the form of individually identified objects instead of files in a directory hierarchy. Such object stores appear to be a good match for distributed storage, especially storage that can provide user-level byte-granularity access (as SCM technology offers).

Intel has been working on one such object store, DAOS. Another example is dataClay, a data store developed by BSC that allows accessing persistent objects as if they were in memory. dataClay can store functions or code-snippets, as well as data, that can be used by the PyCOMPSs task-based programming system.

6.5.11 Operating System

The operating system on the CCNs must be configured to allow high performance computations and to support new non-volatile main memory functionality. It must allow the implementation of new storage paradigms such as direct load/store access to persistent main memory, without page caches, but also distributed shared storage based on SCM via a high-speed, low latency, high bandwidth interconnect.

To avoid traditional local storage devices such as HDDs or SSDs, the O/S must support a memory disk to boot straight from the boot loader on the boot node. As well as the standard functions, the O/S must support the SCM hardware, processors that can support SCM, and the network interface used for the main compute network.

6.5.12 Performance Monitoring

This performance monitoring component is responsible for collecting performance information from an application execution. In the first case, the performance monitoring component attaches itself to the execution of the application. Different states of the program under observation are inspected by periodically interrupting the running program. Examples would be setitimer() or overflow triggers of hardware counters that fire periodically to observe the application execution.


In order to inspect the execution state of an application, call-path information and information from the hardware performance counters are utilised. Call paths provide information regarding the functions and regions within the application, whereas performance counters provide information regarding hardware related activities. The performance monitoring component directly attaches itself to the interfaces which can provide information regarding the application execution. These interfaces include PAPI, hwloc, netloc, procfs, and NVML.
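A minimal sketch of the sampling approach described above: a timer periodically interrupts the running program and the handler records where execution currently is. Real tools use setitimer() or hardware-counter overflow in the same way; this pure-Python stand-in assumes a Unix system:

```python
import signal
import time

samples = []

def sample_handler(signum, frame):
    # Record the function that was executing when the timer fired;
    # a real tool would walk the whole call path and read counters.
    samples.append(frame.f_code.co_name)

signal.signal(signal.SIGALRM, sample_handler)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10 ms

def busy_work():
    # Stand-in for the application under observation.
    end = time.time() + 0.2
    total = 0
    while time.time() < end:
        total += 1
    return total

busy_work()
signal.setitimer(signal.ITIMER_REAL, 0, 0)  # disable the timer
```

Aggregating such samples by function name yields exactly the statistical profile that Section 6.2.12 describes for tools like Allinea MAP.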

In some cases, this component will query hardware resources on the node. For this purpose, systemware monitoring interfaces, such as interfaces to the scheduler, are used. The scheduling API provides a monitoring mechanism to interpret the overall resource consumption of the application on a node.

Moreover, the scheduler will also be responsible for managing the optimal utilisation of SCM resources, requiring it to maintain up to date information on the status of SCM. Consequently, the information collected by the scheduler can be passed to the performance monitoring component.

6.5.13 Persistent Memory Access Library

One way to provide simpler and reusable SCM access and management methods is to use a library suite, which can then be used by both user applications and systemware. Among the core expected functionality, such libraries should provide mechanisms for applications or systemware to create and manage persistent memory regions and their keys, including across power cycles.

The library should also provide mechanisms to facilitate storage of persistent data, such as mechanisms for reliable transactional updates. Optionally, the libraries could provide methods for the manipulation of common higher-level data structures in persistent memory, such as lists and trees, and even simple key-value stores.
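A minimal sketch of a "persistent memory region" built on a memory-mapped file. On real SCM hardware such a library would map persistent memory directly (e.g. via a file on a DAX filesystem); here a temporary file stands in so the sketch runs anywhere:

```python
import mmap
import os
import tempfile

REGION_SIZE = 4096

def create_region(path, size=REGION_SIZE):
    """Create (or reopen) a named region and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, size)
    return fd, mmap.mmap(fd, size)

path = os.path.join(tempfile.mkdtemp(), "region0")

# Write through load/store-style access, then flush for durability.
fd, region = create_region(path)
region[0:5] = b"hello"
region.flush()          # analogous to a persist/flush primitive
region.close()
os.close(fd)

# Reopen the region: the data survived, as it would across power
# cycles on genuinely persistent media.
fd2, region2 = create_region(path)
recovered = bytes(region2[0:5])
region2.close()
os.close(fd2)
```

The region's path plays the role of the "key" that lets an application find its data again after a restart.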

6.5.14 Python

See subsection 6.2.13.

6.5.15 SCM Driver

Just like any hardware device, the SCM must be exposed to the rest of the system and the O/S through a specific device driver. We expect this to be provided by the SCM vendor (in this case Intel). We do not expect to need to make changes to the device driver.

6.5.16 SCM Local Filesystem

One of the most straightforward methods to exploit SCM for persistent storage in each compute node is to use a local filesystem. While any filesystem can be used with the SCM, in order to exploit byte-level load/store access to SCM it is necessary that the local filesystem supports the DAX extensions. These extensions allow users to bypass all the traditional block-oriented structures in the I/O stack, such as the page cache.

6.5.17 Task-based Runtime Environment

Task-based programming models generally require a runtime component that executes and controls the tasks associated with a task-based program on compute nodes.

The PyCOMPSs runtime manages the user application, exploiting the inherent concurrency of the script, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources, which in this scenario are considered able to handle persistent objects across the infrastructure.


6.5.18 Volatile Memory Access Library

While the system can transparently provide large volatile memory to applications on top of combined DRAM and SCM (in 2LM mode), some applications may wish to manage their own volatile data and DRAM.

One example would be applications explicitly storing frequently accessed data in DRAM and less frequently accessed data in NVRAM, but not requiring persistency of the data in NVRAM. Another example would be applications implementing a software cache in NVRAM for data mirrored in external storage devices. In such cases, a mechanism is required to allow applications to specify where (either DRAM or NVRAM) memory should be allocated.
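A toy sketch of the placement-aware allocation such a library might offer. The kind names, and the use of a file-backed mapping to stand in for NVRAM, are illustrative assumptions:

```python
import mmap
import os
import tempfile

def allocate(size, kind="DRAM"):
    """Allocate a buffer in the requested memory kind (toy version)."""
    if kind == "DRAM":
        return bytearray(size)  # ordinary volatile memory
    if kind == "NVRAM":
        # Stand-in for an NVRAM-backed mapping: on real hardware the
        # library would map SCM instead of a temporary file.
        fd = os.open(os.path.join(tempfile.mkdtemp(), "buf"),
                     os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(fd, size)
        return mmap.mmap(fd, size)
    raise ValueError(f"unknown memory kind: {kind}")

hot = allocate(1024, kind="DRAM")    # frequently accessed data
cold = allocate(1024, kind="NVRAM")  # less frequently accessed data
hot[0:3] = b"abc"
cold[0:3] = b"xyz"
```

Both buffers present the same byte-addressable interface, which is what lets an application place data by access frequency without changing its access code.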


7 Usage Patterns

To allow a complete understanding of how a system developed from our architectures can be used, we outline some of the possible usage scenarios we have developed during the requirements gathering and architecture design processes in the project.

The following sections provide details on the functionality that the hardware and systemware can offer applications. We build on the use cases, and associated usage scenarios, we have identified to outline the systemware and hardware components used by a given application, and the lifecycle of the data in those components.

For each usage scenario, we describe which systemware components are active, how data flows through these components and the user application, where data resides, and the state of the system and data after the application has finished. Sequence diagrams that illustrate the data lifecycle can be found in the Appendix (see page 51).

7.1 General Purpose HPC system using multi-node SCM filesystem

7.1.1 Description

Many HPC systems host a wide range of jobs, from small single node jobs to full system jobs. For the UK national HPC service that UEDIN operates, we see a breakdown of jobs that is roughly as follows:

• Breakdown by compute time:
o 15% < 96 cores
o 80% 96 – 12000 cores
o 5% > 12000 cores

• Breakdown by number of jobs submitted:
o 70% < 96 cores
o 29% 96 – 12000 cores
o 1% > 12000 cores

As there is a wide range of jobs on the system, we see a wide range of I/O patterns, from applications that do very little I/O to those that are dominated by I/O. We estimate the I/O cost is less than 10% for the majority of applications. Most applications will perform check-pointing to cope with the maximum runtime limit on the system.

Such a system generally uses a parallel filesystem and there are no disks in the compute nodes. All I/O is done to the parallel filesystem over the high performance communication network fabric used for inter-process communications.

As this is a general purpose usage scenario, there is potential for applications running on a system like this to use all the different components available in the NEXTGenIO systemware and hardware architecture. However, for this example we will restrict ourselves to applications that have not been modified, are simply re-compiled, and write their data to files in a standard filesystem. Traditionally such applications would use the parallel filesystem attached to an HPC system to perform their I/O, would be run using a batch system, and would operate within the resource limits placed on them by that batch system.

We assume that applications use the multi-node SCM filesystem for I/O whilst the application is running, and the external high performance filesystem for data storage after an application has finished.

For this usage scenario we are using SCM in 1LM mode.

7.1.2 Components

• Job Scheduler
• Multi-node SCM filesystem
• I/O Libraries
• Communication Libraries

7.1.3 Data lifecycle

1. Scheduling requests submitted.
   a. At this point the executable to be run is either stored on the external high performance filesystem or on the local filesystem on the login nodes, and the data for the application is stored on the external high performance filesystem.
2. Job scheduler allocates resources for an application.
   a. The job scheduler ensures that nodes are in 1LM mode and that the multi-node SCM filesystem component is operational on the nodes being used by this application.
3. Once the nodes that the job has been allocated to are available, the data scheduler copies input data from the external high performance filesystem to the multi-node SCM filesystem.
   a. This step is optional; an application could read its input data directly from the external high performance filesystem, but using the multi-node SCM filesystem will deliver better performance.
4. Job Launcher starts the user application on the CCNs.
5. Application reads data from the multi-node SCM filesystem.
6. Application writes data to the multi-node SCM filesystem.
7. Application finishes.
8. Data Scheduler is triggered and moves data from the multi-node SCM filesystem to the external high performance filesystem.
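The stage-in, compute, stage-out flow above can be sketched end-to-end. Temporary directories stand in for the external parallel filesystem and the multi-node SCM filesystem so the sketch is self-contained:

```python
import os
import shutil
import tempfile

parallel_fs = tempfile.mkdtemp()  # external high performance filesystem
scm_fs = tempfile.mkdtemp()       # multi-node SCM filesystem

# Input data initially lives on the external filesystem.
with open(os.path.join(parallel_fs, "input.dat"), "w") as f:
    f.write("1 2 3 4")

# Stage-in: the data scheduler copies input onto the SCM filesystem.
shutil.copy(os.path.join(parallel_fs, "input.dat"), scm_fs)

# Compute: the application reads and writes only the SCM filesystem.
with open(os.path.join(scm_fs, "input.dat")) as f:
    values = [int(v) for v in f.read().split()]
with open(os.path.join(scm_fs, "output.dat"), "w") as f:
    f.write(str(sum(values)))

# Stage-out: results are moved back to the external filesystem.
shutil.move(os.path.join(scm_fs, "output.dat"), parallel_fs)

with open(os.path.join(parallel_fs, "output.dat")) as f:
    result = f.read()
```

The point of the pattern is that all I/O during the run hits node-local SCM, with the slower external filesystem touched only at the start and end of the job.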

7.2 General Purpose HPC system improving throughput

7.2.1 Description

This is a general purpose usage scenario, meaning there is potential for applications to use all the different components available in the NEXTGenIO systemware and hardware architecture. However, for this example we are considering how to use the NEXTGenIO systemware to optimise the overall utilisation or energy efficiency of the system.

We are assuming that CCNs are in varied states, with various data loaded onto SCM on different CCNs. The schedulers are used by the meta scheduler to ensure that efficient use is made of the whole machine or of particular resources that are of interest (e.g. total energy consumed, power limits, or time for throughput of jobs).

7.2.2 Components

• Data Aware Job Scheduler
• Energy Aware Job Scheduler
• Meta Scheduler
• Multi-node SCM filesystem
• I/O Libraries
• Communication Libraries

7.2.3 Data lifecycle

1. Scheduling requests submitted.
   a. At this point the executable to be run is either stored on the external high performance filesystem or on the local filesystem on the login nodes, and the data for the application may be on the external filesystem or on SCM somewhere within the system.
2. Meta scheduler is informed of the job scheduling requests and undertakes an analysis of what resources would be most efficient to use for this job, taking into account the constraints the meta scheduler has been assigned.
3. Meta scheduler generates job and data scheduler instructions.
4. Job scheduler allocates resources for an application.
5. Once the nodes that the job has been allocated to are available, the data scheduler copies input data from the external high performance filesystem to the multi-node SCM filesystem.
   a. This step is optional; the data may already be on-node or may still be on the external filesystem.
6. Job Launcher starts the user application on the CCN.
7. Application reads data from the multi-node SCM filesystem.
8. Application writes data to the multi-node SCM filesystem.
9. Application finishes.
10. Data Scheduler may be triggered to move data from SCM storage to the external high performance filesystem.

7.3 Resilience in specialist HPC system

7.3.1 Description

Another category of HPC system is those that are used for a single academic or industrial discipline. This type of system can still be very large, but will typically run a smaller number of longer jobs, and each job will be larger than is usually seen on a shared HPC system. Check-pointing is critical for the safe operation of jobs, to ensure that a hardware failure does not result in losing a large job.

Applications currently manually check-point their running state and datasets to enable restart on failure of jobs or on jobs reaching batch system time limits. SCM can allow for O/S-level check-pointing, rather than application level check-pointing, with the O/S triggering a dump of an application to SCM memory for future use, independently of the application itself. It can also allow applications to check-point to SCM manually.

For this usage scenario, we are using the SCM for check-pointing of running applications to ensure that applications can be recovered after hardware failures. In this context, we are using the SCM in 1LM mode, although 2LM mode with an App-Direct region is also possible.

7.3.2 Components

• Job Scheduler
• Automatic Check-pointer
• Communication Libraries
• I/O Libraries

7.3.3 Data lifecycle

1. Job scheduling request submitted.
   a. Scheduling request includes automatic check-pointing requirements.
2. Job scheduler allocates resources for an application.
3. Once the nodes the job has been allocated to are available, the job launcher starts the user application.
4. The user application notifies the automatic check-pointer that check-pointing is required, and at what frequency.
   a. Some applications may notify the automatic check-pointer when check-pointing can occur.
   b. Others may let the check-pointing occur as decided by the check-pointer.
5. Two scenarios are now possible:
   a. Application completes successfully.
      i. Scheduler notifies the automatic check-pointer that check-points are no longer required.
      ii. Check-pointer removes check-points.
   b. Application failed for some reason.
      i. Re-run job submitted with a check-point ID provided to resume the run.
      ii. Application runs to completion as above.

If check-point data was written to the multi-node SCM filesystem instead of straight to SCM, then it is possible that jobs could be restarted without the failed nodes, provided the multi-node SCM filesystem has provided data backup/replication/migration functionality to ensure the check-point data is available on multiple nodes, not just the node it was written on.

7.4 Large Memory

7.4.1 Description

Applications in areas such as bio-informatics often rely on large scale shared memory systems to enable very large data sets to be loaded and processed. For these applications, reading and writing data from/to storage systems takes a long time, so reducing this cost is important.

For this example, we are considering using the SCM as a very large memory space to perform in-memory processing of full datasets, reading the data into memory once and then writing the data back to the high performance filesystem at the end of the computation. In this context, we are using the SCM in 2LM mode and a single job will only use one node in the system.

7.4.2 Components

• Job Scheduler
• I/O Libraries

7.4.3 Data lifecycle

1. Job scheduling request submitted.
   a. Scheduling request includes information that 2LM mode is required.
2. Job scheduler allocates a node for the application.
3. Once the node that the job has been allocated to is available, the job scheduler ensures it is in 2LM mode.
   a. Either this is simply selecting a node already in that mode, or;
   b. The scheduler initiates a node reboot as required to get the node into 2LM mode.
4. The job scheduler starts the user application on the node.
5. The user application loads data from the external high performance filesystem into memory.
6. The memory controller hardware loads data into SCM and uses the DRAM as a cache for that SCM.
7. Application processes data.
8. Application writes data back to the external high performance filesystem.
9. Application finishes and the job scheduler releases the node used.

7.5 Workflows

7.5.1 Description

Jobs from an increasing number of scientific areas are often characterised by a workflow/job chaining approach, where one application may read data, then produce output/write data, which is in turn used by another application. Reading and writing data to be passed between applications can take a long time, therefore reducing this cost is important.

In this context we are using SCM in 1LM mode.


7.5.2 Components

• Job Scheduler
• I/O Libraries
• Data Scheduler
• Data Movers

7.5.3 Data lifecycle

1. Job scheduling request submitted.
   a. Scheduling request includes information that 1LM mode is required.
2. Job scheduler allocates nodes for the application.
   a. This may include restarting nodes to ensure they are in 1LM mode.
3. The job scheduler starts the first user application in the workflow on the nodes.
4. The user application loads data from the external high performance filesystem into memory.
   a. This may be SCM or DRAM.
5. Application processes data.
6. Application writes data to SCM.
7. Application finishes and informs the Data Scheduler of data to be retained.
8. Data Scheduler takes control of the data.
   a. This ensures data is not lost once the owning application finishes.
   b. The Data Scheduler or Scheduling Daemons will generate a key for future applications to use, to ensure the security of that data, and will return that key to the finishing application or user script.
9. Next application in the user workflow submitted to the job scheduler with notification that it requires previously written data.
   a. The application or script will use the previously generated key to take control of the data stored in SCM.
10. Job scheduler allocates nodes for the application.
    a. It is likely these will be the same nodes as those used by the previous application, but they may not be. If they are different nodes, the job and data schedulers are responsible for ensuring workflow data is available when the next workflow application starts.
11. The job scheduler starts the next user application in the workflow on the nodes allocated to that application.
12. Application processes data.
13. Application writes data to SCM.
14. Application finishes and informs the Data Scheduler of data to be retained.
15. Data Scheduler takes control of the data.
    a. This ensures data is not lost once the owning application finishes.
    b. The Data Scheduler or Scheduling Daemons will generate a key for future applications to use, to ensure the security of that data, and will return that key to the finishing application or user script.
16. Steps 9–15 repeated as required.
17. Final application in the workflow finishes.
    a. Data Scheduler uses Data Movers to write the data to be stored back to the external high performance filesystem.
    b. Nodes released by the job scheduler.
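The key-based handoff between workflow steps can be sketched as a small registry: when an application finishes, the data scheduler retains its SCM-resident data under a generated key, and the next application claims the data with that key. The class and method names are illustrative assumptions:

```python
import uuid

class DataScheduler:
    """Toy registry for SCM-resident data retained between workflow steps."""

    def __init__(self):
        self._retained = {}

    def retain(self, data):
        """Take control of data; return a key for future applications."""
        key = uuid.uuid4().hex
        self._retained[key] = data
        return key

    def claim(self, key):
        """Hand the data over to the application presenting the key."""
        return self._retained.pop(key)

scheduler = DataScheduler()

# First workflow application finishes and hands over its output.
key = scheduler.retain({"field": [0.1, 0.2, 0.3]})

# The next application in the workflow claims the data with the key.
data = scheduler.claim(key)
```

Using an unguessable key is what prevents an unrelated job from taking control of data left in SCM by another user's workflow.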

7.6 Data Analytics

7.6.1 Description

Dataanalyticsusagecasesgenerallyhaveverylargedatasetsthatareanalysedoften,andtheanalysisperformssmallamountsofworkperdatapoint.Therearethereforelargenumbersofdatareads,and

Copyright©2017NEXTGenIOConsortiumPartners 44

smallamountsofcomputeordatawrites.Oftenthemainoverheadcomesfromloadingthedatapriorto analysis, or from converting the data into a format that can be queried, rather than from theanalysisorqueryingitself.

Once data has been loaded it is often queried or analysed many times. Persistence of large datasets in SCM is beneficial as it enables data to be available after reboots and hardware faults. Persistence of the analysis data (i.e. individual analysis runs) is not important: if an individual analysis run is lost it can be re-run.

7.6.2 Components

• Job Scheduler

7.6.3 Data lifecycle

1. Request submitted to job scheduler for dataset to be loaded into SCM on CCNs from the high performance filesystem.
2. The job scheduler instructs specific CCNs' data schedulers to load the specified data into the appropriate place in the system.
   a. This could be local SCM on individual nodes.
   b. This could be the multi-node SCM filesystem.
   c. This could be the object store.
3. The data scheduler loads the data and stores a reference to what data has been stored and on which CCN.
4. A user submits a request to the job scheduler to run an application.
5. Job scheduler allocates resources for an application.
   a. The scheduling takes into account the data required by the job when allocating resources by liaising with the data scheduler.
   b. If the data is not present on any resources, or the requested resources do not match where the data is available, then the job is rejected.
6. Once the nodes that the job has been allocated are available, the job launcher starts the user application.
7. The application uses the data available on the nodes.
8. The application writes any data it produces back to the high performance filesystem.
9. The application finishes.
10. Compute resources are freed up for use.
11. Steps 4–10 are repeated as required.
    a. Multiple jobs can run using the pre-loaded data at the same time as required.
12. Once the data is no longer required on the system, the job scheduler is notified that the data can be removed.
13. The job scheduler notifies the data schedulers that they can remove the data.
14. The data scheduler removes the data.
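The data-aware allocation check in steps 5a–5b could look roughly like the following. The `placement` table and the function name are hypothetical stand-ins for the references the data schedulers keep (step 3), not the real scheduler interface.

```python
# Illustrative placement records (step 3): dataset name -> CCNs that
# hold a copy of the pre-loaded data in SCM.
placement = {
    "dataset_A": {"ccn01", "ccn02"},
    "dataset_B": {"ccn03"},
}

def allocate(dataset, requested_nodes):
    """Steps 5a-5b: allocate only nodes that hold the required data,
    rejecting the job if no requested node has it."""
    nodes_with_data = placement.get(dataset, set())
    usable = nodes_with_data & set(requested_nodes)
    if not usable:
        raise RuntimeError(
            "job rejected: data not present on requested resources")
    return sorted(usable)

allocate("dataset_A", ["ccn01", "ccn05"])  # allocates only "ccn01"
```

Rejecting the job outright, rather than silently re-staging data, matches the lifecycle above: step 1 makes data loading an explicit, separately scheduled request.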

7.7 Live process pausing and migration

7.7.1 Description

Batch systems for HPC are designed to allow a range of job lengths and sizes to be run. However, if there are urgent or large scale jobs waiting in the batch queue, resources must be allocated for these future jobs and blocked until the jobs are ready to run (this is often called backfilling or queue draining).

SCM gives the possibility to pause live jobs by check-pointing their state from DRAM to SCM, freeing up resources to run urgent jobs quickly. This cuts down the backfilling/draining time for large scale or urgent jobs. It is also possible for applications to be migrated between nodes through RDMA access to SCM.

To achieve this functionality, the automatic check-pointer and job scheduler will work together to migrate a running application into a stored state in SCM, run a new application, and then restore the application from the check-point once the resources are once again free.

This relies on having a portion of the SCM available, regardless of whether 1LM or 2LM mode is being used, to ensure that there is sufficient space for a running application to be check-pointed.

7.7.2 Components

• Job Scheduler
• Automatic Check-pointer
• Job Launcher

7.7.3 Data lifecycle

1. Application is running on a set of CCNs.
2. A high priority job is submitted to the job scheduler.
3. Job scheduler allocates a set of nodes to the high priority job.
4. Job scheduler notifies the automatic check-pointer on each of the allocated nodes to check-point the jobs.
5. Check-pointer takes a check-point and stores it in SCM.
6. Check-pointer notifies the job scheduler that the check-point has been taken and where it is.
7. Job scheduler kills the running application(s).
8. Job launcher starts the high priority application.
9. High priority application runs to completion.
10. Job scheduler reallocates the nodes to the previous jobs and resubmits their job scripts with restore from automatic check-pointing enabled.
11. Application(s) restart.
12. Job scheduler notifies automatic check-pointer that checkpoint data can be removed.
13. Application(s) run to completion.
14. Nodes are released for future use.
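A minimal sketch of the pause/restore mechanism (steps 4–5 and 10), assuming the reserved SCM portion is exposed as a filesystem mount. The mount path, function names, and serialisation choice are illustrative assumptions, not the project's automatic check-pointer.

```python
import pickle
import pathlib

def pause_job(job_id, state, scm_dir="/mnt/scm/checkpoints"):
    # Steps 4-5: serialise the job's DRAM state into the reserved SCM
    # area so the nodes can be handed to the high priority job.
    # (The mount path is an assumption for illustration.)
    target_dir = pathlib.Path(scm_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{job_id}.ckpt"
    target.write_bytes(pickle.dumps(state))
    return target  # step 6: report where the check-point is stored

def restore_job(ckpt_path):
    # Step 10: the resubmitted job script restores the saved state.
    return pickle.loads(pathlib.Path(ckpt_path).read_bytes())
```

Because the check-point lives in SCM rather than on the external filesystem, the pause and restore costs are bounded by memory bandwidth, which is what makes pre-empting live jobs practical.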

7.8 Object Store

7.8.1 Description

Object stores are a data storage approach that manages data as objects, in contrast to traditional storage architectures like filesystems, where data is managed as a file hierarchy, and block storage, where data is dealt with in chunks called blocks. As object stores provide storage that can mirror standard data access in most applications, they offer the potential for fine grained, low overhead data storage and access.

Matching the low overheads of object stores, compared to file access, with the high performance of SCM, compared to traditional I/O devices, offers extremely high I/O performance and new ways of operating for HPC and HPDA applications.

This scenario involves an application that has been modified to use an object store, running on a system with SCM, with at least part of the SCM available for direct access from the object store.

7.8.2 Components

• Object store
• Job Scheduler


7.8.3 Data lifecycle

1. Object store service is initiated through the job scheduler.
2. Job is submitted requesting nodes with object store enabled.
3. Job scheduler checks which nodes have object store available.
4. Job scheduler allocates required nodes.
   a. Job submission fails if node request cannot be matched.
5. Job scheduler runs job on required nodes.
6. Job runs and uses object store.
7. Job finishes.
8. Steps 3–7 repeated as required.
9. Object store service is closed through job scheduler.
10. Data scheduler(s) can persist object store to external high performance filesystem if required.
    a. If not persisted then the object store is lost at the end of a job.
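The lifecycle above, including the optional persistence in step 10, can be sketched with a toy in-memory store. The class name and its API are purely illustrative; the real object store would place objects directly in SCM.

```python
import json
import pathlib

class ScmObjectStore:
    """Toy stand-in for an object store whose objects live in
    (assumed) SCM-resident memory for the duration of a job."""

    def __init__(self):
        self._objects = {}

    def put(self, oid, value):
        # Step 6: the job stores data as named objects, with no
        # filesystem hierarchy or block layout involved.
        self._objects[oid] = value

    def get(self, oid):
        return self._objects[oid]

    def persist(self, directory):
        # Step 10: data scheduler(s) flush objects to the external
        # high performance filesystem; skipping this loses the
        # objects at the end of the job (step 10a).
        out = pathlib.Path(directory)
        out.mkdir(parents=True, exist_ok=True)
        for oid, value in self._objects.items():
            (out / f"{oid}.json").write_text(json.dumps(value))

store = ScmObjectStore()
store.put("field-001", [3.2, 1.5])
```

The point of the sketch is the separation of concerns: applications only ever see `put`/`get`, while persistence is a systemware decision made at service shutdown.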

7.9 Weather forecasting component overview

7.9.1 Description

The operational workflow at ECMWF is complex, involving many (approx. 10,000) tasks tied together by an external control system, with substantial inter-stage data flows.

Within NEXTGenIO, the focus is on adapting the flow of data between the various components, to avoid data transfers to and from the high-performance filesystem within the time-critical pathway. This is likely to involve some form of object store, on nodes containing SCM, which can retain the relevant data in memory throughout the critical time periods. How this functionality fits into the overall workflow is yet to be explored, and in particular it is not clear to what extent this data storage should be collocated with other operational components.

The overall weather forecasting workflow, although it is purely an application, expects to operate with substantial reserved resources and very short queue times, and thus to have a very high level of control of how the overall workflow operates on the machine. As a result, some of the components appear to have systemware scope, despite being part of the application, not the systemware.

There are two pseudo-systemware components that are part of the ECMWF operational workflow:

• Workflow Controller: ECMWF uses a custom workflow controller that runs continuously on a node outside the HPC system. It interrogates and submits jobs to the scheduler, and receives status updates directly from the HPC jobs over the IP network.

• Hot Object Store: A domain specific multi-node object store for meteorological data that can be addressed using the same requests as the existing indexed data stores at ECMWF. This may, or may not, be collocated with other application components, or operate as its own task on the HPC prototype. It makes use of the Persistent Memory Access Library to exploit the large memory capacity available on SCM. It communicates with other HPC tasks over the high-performance network.

Within this workflow, SCM will be explicitly managed in 1LM mode.

7.9.2 Components

• Job Scheduler
• Persistent Memory Access Library
• Communication Libraries (MPI)
• High-performance filesystem
• I/O libraries (GRIB)


7.9.3 Data lifecycle

Note the description above of the non-systemware components called the “workflow controller” and “hot object store”.

1. Data arrives from a wide variety of external sources on an ongoing basis and is pre-processed before being stored on the high-performance parallel filesystem.
2. During the time-critical window, the Workflow Controller monitors the status of the operational tasks, ensuring that the correct tasks are submitted at the correct time.
3. At the start of the time-critical window, the pre-processed data is read from the high-performance parallel filesystem into the 52 primary computational tasks.
4. At intervals, output fields are stored to the Hot Object Store, storing the data in SCM using the Persistent Memory Access Library. In a non-time-critical manner, this data may also be duplicated onto the high performance filesystem at this time.
5. Post-processing processes are triggered at the appropriate moment by the workflow controller. They retrieve the relevant data from the Hot Object Store over the high-performance network.
6. The output of post-processing is stored on the high-performance filesystem for later retrieval.
7. Data archiving tasks, outside the time-critical pathway, retrieve meteorological input data and output fields from the high performance filesystem and the hot object store for external archiving.
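The request-based addressing of the Hot Object Store (steps 4–5) can be sketched as a keyed lookup. The function names are hypothetical, and the request keys shown merely follow the style of ECMWF's meteorological requests for illustration.

```python
# Toy sketch: fields are archived and retrieved by meteorological
# request keys, mirroring how ECMWF's existing indexed data stores
# are addressed. The store itself is a plain dict standing in for
# SCM-resident storage.
hot_store = {}

def archive(field, **keys):
    # Step 4: computational tasks store output fields at intervals.
    hot_store[frozenset(keys.items())] = field

def retrieve(**keys):
    # Step 5: post-processing retrieves fields using the same
    # request keys over the high-performance network.
    return hot_store[frozenset(keys.items())]

archive([1.5, 2.0], param="2t", step=6, date="2017-09-29")
retrieve(param="2t", step=6, date="2017-09-29")  # [1.5, 2.0]
```

Addressing fields by request rather than by file path is what lets the Hot Object Store slot in behind the existing indexed data store interface without changing the post-processing tasks.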

7.10 Distributed workflow management

7.10.1 Description

PyCOMPSs is the Python binding of COMP Superscalar (COMPSs), which is a task-based programming model that aims to ease the development of applications for distributed infrastructures.

PyCOMPSs manages the user application and interacts with dataClay in order to handle persistent objects across the infrastructure. In the model, the user is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks, (ii) annotating them with a standard Python decorator and (iii) determining the persistent objects.

An example application for this use case is the K-means algorithm, or a life science (or similar) application that uses the K-means algorithm in its analysis with persistent objects.
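The user's three responsibilities can be illustrated schematically. Note the `@task` decorator below is a stand-in that runs the function inline, not the real PyCOMPSs decorator (which requires a deployed COMPSs runtime), and the K-means fragment is simplified to one dimension.

```python
# Stand-in for the PyCOMPSs @task decorator: in PyCOMPSs the decorated
# function would be scheduled asynchronously on a worker node; here it
# simply runs inline so the structure of user code can be shown.
def task(returns=None):
    def wrapper(fn):
        return fn
    return wrapper

@task(returns=list)  # (i)/(ii): mark the function as a parallel task
def assign_points(points, centres):
    # One K-means step: assign each 1-D point to its nearest centre.
    return [min(range(len(centres)), key=lambda c: abs(p - centres[c]))
            for p in points]

# (iii): in the full use case, points and centres would be persistent
# dataClay objects rather than plain Python lists.
labels = assign_points([0.1, 0.9, 5.2], [0.0, 5.0])  # [0, 0, 1]
```

In real PyCOMPSs code each call would return a future, with the runtime building the task graph and placing tasks across the workers set up in step 4 below.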

7.10.2 Components

• Task-based programming model and environment: PyCOMPSs
• Object store: dataClay

7.10.3 Data lifecycle

1. User requests the execution of the application with a set of parameters using COMPSs launching scripts.
   a. At this point, the executable can be on the external high performance filesystem or on the local filesystem on the login nodes.
   b. If a dataset is requested, it is stored on the external high performance filesystem.
   c. The user requests to use dataClay in combination with PyCOMPSs (this step may be optional for other applications).
2. COMPSs launching scripts create a job template which is submitted to the job scheduler.
3. The job scheduler allocates resources for PyCOMPSs.
4. Once the nodes that the job scheduler has allocated are available, COMPSs starts and configures the required processes within each node in order to set up a master-worker structure.
   a. The first node of the allocation will be considered the master and the rest of the nodes will be considered workers.
   b. If the user has requested to use dataClay, this deployment process also triggers the dataClay deployment before starting the user application execution.


5. When the entire deployment has finished, PyCOMPSs starts the user application execution on the assigned CCNs.
6. Dataset source:
   a. Option 1: the application creates a dataset and makes it persistent within dataClay;
   b. Option 2: the dataset is already within dataClay.
7. The application starts computing in order to achieve the result.
   a. The tasks are deployed across the CCNs.
   b. Each task can interact with dataClay if it has to read or write data.
8. The application finishes.
9. Data scheduler(s) can persist the object store to the external high performance filesystem if required.
   a. If not persisted then the object store is lost at the end of a job.


8 Conclusions

We have outlined the hardware and systemware architectures we have designed to exploit Storage Class Memory for HPC and HPDA applications. We have demonstrated that such an architecture has the potential to scale up to Exascale-class systems in the near future, offering extremely high I/O performance and memory capacity.

Efficiently utilising SCM for computational simulation and data analytics applications will require either application modifications, to enable applications to directly target and manage SCM, or intelligent systemware to support applications using SCM for I/O, memory, and check-pointing.

We have described a range of systemware components that the NEXTGenIO project is developing and deploying to evaluate SCM for large scale applications. We will be evaluating this systemware, along with direct access applications and the raw performance of SCM, using a prototype HPC system that NEXTGenIO is developing.


9 References

[1] “3D XPoint™ Technology Revolutionizes Storage Memory”, http://www.intel.com/content/www/us/en/architecture-and-technology/3D-XPoint™-technology-animation.html
[2] https://www.google.com/patents/US20150178204
[3] https://colfaxresearch.com/knl-mcdram/
[4] Intel Omni-Path Whitepaper from Hot Interconnects 2015: https://plan.seek.intel.com/LandingPage-IntelFabricsWebinarSeries-Omni-PathWhitePaper-3850
[5] pmem.io: http://pmem.io/
[6] https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/optane-ssd-dc-p4800x-brief.pdf


10 Appendix

The following pages contain sequence diagrams for usage patterns described in Section 7.

Figure 17: General Purpose HPC system using multi-node SCM filesystem.

Figure 18: Resilience in specialist HPC system.

Figure 19: Large memory.

Figure 20: Workflows.