
Legion-based Scientific Data Analytics on Heterogeneous Processors

Lina Yu, Hongfeng Yu

Department of Computer Science & Engineering

University of Nebraska-Lincoln, Lincoln, Nebraska


Outline

•  Motivation
•  Contributions
•  Framework
•  Examples
•  Experiments and Results
•  Conclusion

Motivation

•  It is challenging to efficiently use today's supercomputers
    –  Deep, distributed memory hierarchies
    –  Heterogeneous processing units

•  Communication costs are a critical issue for parallel system and software designers to consider
    –  A scientific analytics workflow consists of multiple operations that intrinsically incur different communication or data movement requirements between compute nodes

Motivation

•  Legion: programming model + runtime system
    –  Describe hierarchical organizations of both data and computation at an abstract level

•  Legion assists a programmer with common programming burdens
    –  Discover/verify the correctness of parallel execution
    –  Manage communication

•  At a high level, mapping a Legion program requires making two kinds of decisions
    –  For each task, select a processor on which to run the task through the mapping interface
    –  For each logical region, a task needs to select a memory in which to create and use a physical instance of the logical region

Our Contribution

•  Investigate the feasibility of using Legion to perform analytics for large-scale scientific data on heterogeneous processors

•  Help users simplify programming of data partition, data organization, and data movement for distributed-memory heterogeneous architectures

•  Facilitate simultaneous execution of multiple analytics operations on modern and future supercomputers

•  Demonstrate the scalability and usability of our approach using several representative analytics operations on a heterogeneous supercomputer

Mapper Interface

•  We design a custom mapper based on Legion's mapper interface
    –  Map operations onto target processors
    –  Specify which memories are used to host the physical instances of the logical regions requested by such operations

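Diagram: the mapper interface takes the operations OP = {op_1, ..., op_v}, the GPUs GPU = {gpu_1, ..., gpu_n}, and the CPUs CPU = {cpu_1, ..., cpu_m}, and produces one tuple <op_i, CPUs_i, GPUs_i> per operation: <op_1, CPUs_1, GPUs_1>, ..., <op_v, CPUs_v, GPUs_v>.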
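The tuples in the mapper diagram can be sketched in plain Python. This is an illustration only, not the Legion mapper API: the `Assignment` class, the `build_mapping` helper, the request format, and the choice to give each operation a disjoint set of processors are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    """One <op_i, CPUs_i, GPUs_i> tuple (hypothetical representation)."""
    op: str
    cpus: tuple
    gpus: tuple


def build_mapping(requests, cpus, gpus):
    """Hand out consecutive, disjoint CPU and GPU subsets to each operation.

    requests: list of (op_name, n_cpus, n_gpus); an assumed input format.
    """
    assignments = []
    next_cpu = next_gpu = 0
    for op, n_cpus, n_gpus in requests:
        if next_cpu + n_cpus > len(cpus) or next_gpu + n_gpus > len(gpus):
            raise ValueError(f"not enough processors for {op}")
        assignments.append(Assignment(
            op,
            tuple(cpus[next_cpu:next_cpu + n_cpus]),
            tuple(gpus[next_gpu:next_gpu + n_gpus]),
        ))
        next_cpu += n_cpus
        next_gpu += n_gpus
    return assignments


# Example: one node with 16 CPU cores and 1 GPU.
cpus = [f"cpu{i}" for i in range(1, 17)]
gpus = ["gpu1"]
mapping = build_mapping([("ray_casting", 4, 1), ("entropy", 12, 0)], cpus, gpus)
for a in mapping:
    print(a.op, len(a.cpus), "CPU cores,", len(a.gpus), "GPUs")
```

In an actual Legion program, these decisions would be made inside a custom mapper rather than in a standalone function like this one.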

Region Construction and Task Scheduling

•  Main steps of our approach
    –  Make an operation op_i run on heterogeneous processors

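Diagram: an index space and a field space define a logical region, which is split into a logical partition = {lp_1, ..., lp_p} and backed by a physical region. The task scheduler runs Op_i on the logical partition: the GPUs (GPUs_i)_1, ..., (GPUs_i)_k, ..., (GPUs_i)_u run Opg_i(lp_1), ..., Opg_i(lp_k), ..., Opg_i(lp_u), and the CPUs (CPUs_i)_1, ..., (CPUs_i)_j, ..., (CPUs_i)_v run Opc_i(lp_{u+1}), ..., Opc_i(lp_{u+j}), ..., Opc_i(lp_p).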

    –  Construct a field space of the logical region, and allocate the field space for each portion of data.
    –  Construct an index space of the logical region for the input data of each operation.
    –  Create a logical region using the index space and the field space defined in the previous two steps.
    –  Use coloring to partition a logical region (colorings are objects that describe an intended partition of an index space).
    –  Create a corresponding physical region to hold the physical instances (i.e., the real values for the input data).
    –  Execute operations on GPUs and CPUs according to the mapper interface we designed.
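The steps above can be mimicked with a toy NumPy model. This is a sketch of the concepts only and does not use the Legion C++ API: `color_partition`, `run_op`, the small grid shape, and the rule that the first `n_gpu_parts` partitions go to GPUs are all hypothetical choices made for illustration.

```python
import numpy as np

# Field space: the named, typed fields stored for every element.
field_space = np.dtype([("value", np.float32)])

# Index space: the set of element indices (a small 3D grid, flattened).
shape = (4, 4, 8)
index_space = np.arange(int(np.prod(shape)))

# Logical region: index space x field space, with no storage attached yet.
logical_region = {"index_space": index_space, "field_space": field_space}


def color_partition(indices, n_colors):
    """Coloring: map each color to the block of indices it owns."""
    return dict(enumerate(np.array_split(indices, n_colors)))


# Logical partition {lp_1, ..., lp_p}, here with p = 4.
coloring = color_partition(logical_region["index_space"], 4)

# Physical region: actual storage holding the values of the input data.
physical_region = np.zeros(index_space.size, dtype=field_space)
physical_region["value"] = np.linspace(0.0, 1.0, index_space.size)


def run_op(op, coloring, region, n_gpu_parts):
    """Apply op to every partition; the first n_gpu_parts are tagged 'gpu'."""
    results = {}
    for color, idx in coloring.items():
        kind = "gpu" if color < n_gpu_parts else "cpu"
        results[color] = (kind, op(region["value"][idx]))
    return results


results = run_op(np.mean, coloring, physical_region, n_gpu_parts=1)
for color, (kind, value) in results.items():
    print(f"lp_{color + 1} on {kind}: mean = {value:.3f}")
```

The tagging in `run_op` only labels where each partition would run; real dispatch to GPUs and CPUs is what the mapper and the Legion runtime provide.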



Examples

•  Sort-last parallel volume rendering with entropy analysis
    –  Mapper interface

Diagram: two configurations of the mapper interface for the operations ray_casting, entropy, and image_compositing, with GPU = {gpu_1, ..., gpu_n} and CPU = {cpu_1, ..., cpu_m}.

•  Configuration 1: ray_casting on GPUs_1, entropy on CPUs_1, image_compositing on CPUs_2
•  Configuration 2: ray_casting on GPUs_1 and CPUs_1, entropy on CPUs_2, image_compositing on CPUs_3

Examples

•  Sort-last parallel volume rendering with entropy analysis
    –  Region construction and task scheduling

Diagram: region construction and task scheduling for sort-last rendering.

•  3D volume logical region: index {vol_index_space}, field {vol_field_space}
    –  3D volume physical region: columns Voxel Index, Value
    –  3D volume logical partition: columns Logical Partition ID, Start, Offset
•  2D image logical region: index {img_index_space}, field {img_field_space}
    –  2D image physical region: columns Pixel Index, Value
    –  2D image logical partition: columns Logical Partition ID, Start, Offset
•  Mapper (Task ID: Type)
    –  RAY_CASTING_TASK1: GPUs
    –  RAY_CASTING_TASK2: CPUs
    –  ENTROPY_TASK: CPUs
    –  IMAGE_COMPOSITING_TASK: CPUs
•  Tasks: columns Logical Partition ID, Task ID
•  Scheduled tasks shown: ray_casting on GPU, ray_casting on CPU, entropy on CPU, image_compositing on CPU
•  Legend: markers denote GPUs and CPU cores
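The slides do not define the entropy measure used by the entropy task. As a minimal sketch, assuming it is the Shannon entropy of the value histogram of one volume partition (a common choice for volume data), a per-partition computation might look like the following. The function name `block_entropy` and the default bin count are hypothetical.

```python
import numpy as np


def block_entropy(values, bins=256):
    """Shannon entropy (in bits) of the value histogram of one data block."""
    hist, _ = np.histogram(values, bins=bins)
    # Drop empty bins so that log2 is only taken of positive probabilities.
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())


# Example: entropy of each of four partitions of a random test volume.
rng = np.random.default_rng(0)
volume = rng.random(4 * 16 * 16 * 16)
for i, block in enumerate(np.array_split(volume, 4)):
    print(f"partition {i}: {block_entropy(block):.2f} bits")
```

Because each partition is processed independently, a task like this needs no communication between partitions, which is what makes it easy to schedule on CPU cores alongside GPU ray casting.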

Examples

•  Sort-first parallel volume rendering with entropy analysis
    –  Mapper interface
        •  Ray casting task (GPUs)
        •  Entropy task (CPUs)
    –  Region construction and task scheduling
        •  Divide the 2D image into uniform 2D grids
        •  Each processor is responsible for rendering one image portion
        •  No need to divide the 3D volume data
        •  No need for image compositing

•  The sort-first and sort-last algorithms differ in their data partitioning and distribution requirements, but our solution provides a simple and feasible way to incorporate different operations in a unified framework using logical regions
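The sort-first image division described above can be sketched as a uniform grid of tiles, one per processor. The helper `tile_image`, the half-open rectangle format, and the `owner` naming are assumptions for illustration; the slides do not specify how the grid is computed.

```python
def tile_image(width, height, tiles_x, tiles_y):
    """Split a width x height image into a tiles_x x tiles_y grid.

    Returns (x0, y0, x1, y1) rectangles with half-open bounds, in row-major
    order. Integer division keeps the tiles gap-free when sizes do not divide
    evenly.
    """
    tiles = []
    for ty in range(tiles_y):
        y0 = ty * height // tiles_y
        y1 = (ty + 1) * height // tiles_y
        for tx in range(tiles_x):
            x0 = tx * width // tiles_x
            x1 = (tx + 1) * width // tiles_x
            tiles.append((x0, y0, x1, y1))
    return tiles


# Example: a 1024 x 1024 image rendered by 16 processors.
tiles = tile_image(1024, 1024, 4, 4)
owner = {i: f"processor{i}" for i in range(len(tiles))}
print(tiles[0], "->", owner[0])
```

Since every pixel belongs to exactly one tile, the rendered tiles can be placed directly into the final image, which is why no compositing step is needed.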


Experiments and Results

•  Conduct experiments on Titan, a Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility
    –  Each node of Titan contains one 16-core AMD Opteron CPU and an NVIDIA Tesla K20 GPU

•  Test sort-first and sort-last parallel rendering

•  Conduct scalability comparisons using a combustion data set with a resolution of 1600 x 1375 x 430

•  Test on 1 to 256 processors with two output image resolutions, 1024² and 2048²

Experiments and Results

•  The overall time breakdown, data partition time, rendering time, and data movement time on different numbers of nodes for sort-first rendering and sort-last rendering


Fig. 1: (a) The time breakdown of sort-first parallel volume rendering for different numbers of nodes. (b) The data partition time. (c) The rendering time. (d) The data movement time. Two output image resolutions, 1024² and 2048², are used.

Fig. 2: (a) The time breakdown of sort-last parallel volume rendering for different numbers of nodes. (b) The data partition time. (c) The rendering time. (d) The image compositing time. Two output image resolutions, 1024² and 2048², are used.

Experiments and Results

•  Interactive rendering time and data movement time of sort-first parallel rendering for 64 nodes with an image resolution of 1024²

Fig. 3: The rendering time and data movement time of sort-first rendering for 64 nodes from multiple view angles. The output image resolution is 1024².


•  The rendering time results of sort-first and sort-last parallel rendering on any number of nodes from 1 to 256 with an image resolution of 1024²

Fig. 4: The time results of sort-first (a) and sort-last (b) parallel rendering on any number of nodes from 1 to 256. The output image resolution is 1024².


Fig. 5: The time results of ray casting and entropy analysis with various allocation ratios. The output image resolution is 1024².

•  Legion job-stealing scheduling performance
    –  CPU ray casting time is 1.347 seconds (5%)
    –  CPU entropy time is 0.936 seconds
    –  GPU ray casting time is 2.833 seconds (95%)

•  Given that each node has a 16-core CPU, we tested different ratios between ray casting and entropy operations
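A back-of-the-envelope calculation shows why the allocation ratio matters. It rests on two assumptions that the slides do not state: that the percentages are each device's share of the ray casting work, and that a device's time grows linearly with its share. The function name `balanced_cpu_share` is hypothetical.

```python
def balanced_cpu_share(cpu_share, cpu_time, gpu_share, gpu_time):
    """CPU work share that would equalize CPU and GPU finish times.

    Assumes each device's time scales linearly with its share of the work,
    which is an idealization and not a property claimed in the slides.
    """
    cpu_rate = cpu_share / cpu_time  # fraction of work per second on the CPU
    gpu_rate = gpu_share / gpu_time  # fraction of work per second on the GPU
    return cpu_rate / (cpu_rate + gpu_rate)


# Timings reported above for ray casting under job stealing.
share = balanced_cpu_share(0.05, 1.347, 0.95, 2.833)
print(f"balanced CPU share under the linear assumption: {share:.1%}")
```

Under these assumptions the CPU could take on a larger share of ray casting before it became the bottleneck, but any extra CPU cores given to ray casting are taken away from the entropy operation, which is the trade-off the ratio experiments explore.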


Conclusion

•  A study of scientific data analytics on distributed heterogeneous architectures that leverages the Legion programming model and runtime system

•  Consider both scalability and usability in our design

•  Facilitate complex analytics operations with completely different data partitioning and distribution requirements in a nearly unified manner

•  Perform operations across CPUs and GPUs and balance workload with automatic or manual scheduling strategies


Acknowledgement

•  This research has been sponsored in part by the Department of Energy through the ExaCT Center for Exascale Simulation of Combustion in Turbulence and the National Science Foundation through grant IIS-1423487.

•  The allocation of supercomputing time on the Oak Ridge Leadership Computing Facility (OLCF) has been sponsored by the Department of Energy through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.

Thank You!
