Legion-based Scientific Data Analytics on Heterogeneous Processors
Lina Yu, Hongfeng Yu
Department of Computer Science & Engineering
University of Nebraska Lincoln, Lincoln, Nebraska
Outline
• Motivation
• Contributions
• Framework
• Examples
• Experiments and Results
• Conclusion
Motivation
• It is challenging to efficiently use today’s supercomputers
  – Deep, distributed memory hierarchies
  – Heterogeneous processing units
• Communication costs are a critical issue for parallel system and software designers to consider
  – A scientific analytics workflow consists of multiple operations that intrinsically incur different communication or data movement requirements between compute nodes
• Legion: programming model + runtime system
  – Describe hierarchical organizations of both data and computation at an abstract level
• Legion assists a programmer with common programming burdens
  – Discover/verify the correctness of parallel execution
  – Manage communication
• At a high level, mapping a Legion program requires making two kinds of decisions
  – For each task, select a processor on which to run the task through the mapping interface
  – For each logical region a task uses, select a memory in which to create and use a physical instance of the logical region
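
The two mapping decisions above can be illustrated with a small, self-contained sketch. It does not use Legion's mapper API: the `Machine` class, the `select_processor`, `select_memory`, and `map_task` functions, and the memory names are all hypothetical, chosen only to show that one decision picks a processor per task and the other picks a memory per logical region.

```python
from dataclasses import dataclass, field


# Hypothetical model of one node; none of these names come from Legion.
@dataclass
class Machine:
    cpus: list = field(default_factory=lambda: ["cpu0", "cpu1"])
    gpus: list = field(default_factory=lambda: ["gpu0"])
    # Assumed association between a processor kind and its closest memory.
    memory_of: dict = field(default_factory=lambda: {
        "cpu": "system_memory",
        "gpu": "gpu_framebuffer",
    })


def select_processor(machine, task_name, gpu_tasks):
    """Decision 1: choose a processor on which to run the task."""
    if task_name in gpu_tasks and machine.gpus:
        return machine.gpus[0]
    return machine.cpus[0]


def select_memory(machine, processor):
    """Decision 2: choose the memory that will hold a physical instance."""
    kind = "gpu" if processor in machine.gpus else "cpu"
    return machine.memory_of[kind]


def map_task(machine, task_name, regions, gpu_tasks):
    """Return the chosen processor and one memory per logical region."""
    proc = select_processor(machine, task_name, gpu_tasks)
    placement = {region: select_memory(machine, proc) for region in regions}
    return proc, placement
```

With this sketch, `map_task(Machine(), "ray_casting", ["volume"], {"ray_casting"})` places the task on the GPU and its region in the GPU-side memory, while a task outside `gpu_tasks` falls back to a CPU and system memory.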
Our Contribution
• Investigate the feasibility of using Legion to perform analytics for large-scale scientific data on heterogeneous processors
• Help users simplify the programming of data partitioning, data organization, and data movement for distributed-memory heterogeneous architectures
• Facilitate simultaneous execution of multiple analytics operations on modern and future supercomputers
• Demonstrate the scalability and usability of our approach using several representative analytics operations on a heterogeneous supercomputer
Mapper Interface
• We design a custom mapper based on Legion’s mapper interface
  – Map operations onto target processors
  – Specify which memories are used to host the physical instances of the logical regions requested by such operations
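[Figure: the mapper interface takes the operations OP = {op_1, ..., op_v}, the GPUs GPU = {gpu_1, ..., gpu_n}, and the CPUs CPU = {cpu_1, ..., cpu_m}, and produces one tuple <op_i, CPUs_i, GPUs_i> per operation, from <op_1, CPUs_1, GPUs_1> to <op_v, CPUs_v, GPUs_v>.]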
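
As a rough illustration of the tuples <op_i, CPUs_i, GPUs_i> produced by the mapper interface, the sketch below hands each operation its own slice of the CPU and GPU lists. The function name `build_mapping`, the request format, and the choice to make the subsets disjoint are assumptions made for this example; the slides do not say how the subsets are chosen.

```python
def build_mapping(requests, cpus, gpus):
    """Assign processor subsets to operations.

    requests: list of (op_name, n_cpus, n_gpus) in scheduling order
              (a hypothetical input format).
    Returns a list of (op_name, cpu_subset, gpu_subset) tuples, one per
    operation, mirroring the <op_i, CPUs_i, GPUs_i> entries.
    Assumption: subsets handed to different operations are disjoint.
    """
    mapping = []
    next_cpu, next_gpu = 0, 0
    for op_name, n_cpus, n_gpus in requests:
        if next_cpu + n_cpus > len(cpus) or next_gpu + n_gpus > len(gpus):
            raise ValueError(f"not enough processors for {op_name}")
        cpu_subset = cpus[next_cpu:next_cpu + n_cpus]
        gpu_subset = gpus[next_gpu:next_gpu + n_gpus]
        next_cpu += n_cpus
        next_gpu += n_gpus
        mapping.append((op_name, cpu_subset, gpu_subset))
    return mapping
```

For example, with 16 CPU cores and one GPU, the request list `[("ray_casting", 4, 1), ("entropy", 8, 0), ("image_compositing", 4, 0)]` yields three tuples whose CPU subsets do not overlap; the core counts here are illustrative only.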
Region Construction and Task Scheduling
• Main steps of our approach
  – Process an operation op_i on heterogeneous processors
[Figure: region construction and task scheduling for operation op_i. An index space and a field space define a logical region, which has a corresponding physical region and a logical partition {lp_1, ..., lp_p}. The task scheduler runs Opg_i on lp_1, ..., lp_u using (GPUs_i)_1, ..., (GPUs_i)_u, and Opc_i on lp_(u+1), ..., lp_p using (CPUs_i)_1, ..., (CPUs_i)_v.]
Construct a field space of the logical region, and allocate the field space for each portion of data.
Construct an index space of the logical region for the input data of each operation.
Create a logical region using the index space and the field space defined in the previous two steps.
Use coloring to partition the logical region (colorings are objects that describe an intended partition of an index space).
Create a corresponding physical region to hold the physical instances (i.e., the real values for the input data).
Execute operations on GPUs and CPUs according to the mapper interface we designed.
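
The steps listed here can be sketched end to end in plain Python. This is only a toy model of the concepts, not Legion code: every function name below is hypothetical, regions are ordinary dictionaries, and the "GPU" and "CPU" operations are just two Python callables. The split in which the first partitions go to the GPU variant and the rest to the CPU variant follows the figure for this slide.

```python
def make_field_space(fields):
    """A field space names the fields stored at every index."""
    return dict(fields)                      # e.g. {"value": float}


def make_index_space(n):
    """An index space enumerates the elements of the input data."""
    return range(n)


def make_logical_region(index_space, field_space):
    """A logical region pairs an index space with a field space."""
    return {"index_space": index_space, "field_space": field_space}


def color_partition(index_space, num_colors):
    """A coloring maps each color to the indices of one sub-region.

    Assumption: contiguous, near-equal chunks; other colorings are possible.
    """
    base, extra = divmod(len(index_space), num_colors)
    coloring, start = {}, index_space.start
    for color in range(num_colors):
        size = base + (1 if color < extra else 0)
        coloring[color] = range(start, start + size)
        start += size
    return coloring


def make_physical_region(region, data):
    """A physical region holds the actual values of each field."""
    n = len(region["index_space"])
    physical = {}
    for name in region["field_space"]:
        assert len(data[name]) == n, "one value per index is required"
        physical[name] = list(data[name])
    return physical


def execute(op_gpu, op_cpu, physical, field_name, coloring, num_gpu_parts):
    """Run the GPU variant on the first partitions, the CPU variant on the rest."""
    results = {}
    for color, indices in coloring.items():
        chunk = [physical[field_name][i] for i in indices]
        op = op_gpu if color < num_gpu_parts else op_cpu
        results[color] = op(chunk)
    return results
```

A short usage example: build a 10-element region with one `value` field, color it into 3 partitions, fill it with 0.0 through 9.0, and call `execute(sum, sum, physical, "value", coloring, 1)`; the first partition is then handled by the callable standing in for the GPU variant.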
Examples
• Sort-last parallel volume rendering with entropy analysis
  – Mapper interface
[Figure: two mapper configurations for the operations ray_casting, entropy, and image_compositing over GPU = {gpu_1, ..., gpu_n} and CPU = {cpu_1, ..., cpu_m}. In the first, ray_casting runs on GPUs_1, entropy on CPUs_1, and image_compositing on CPUs_2. In the second, ray_casting runs on both GPUs_1 and CPUs_1, entropy on CPUs_2, and image_compositing on CPUs_3.]
• Sort-last parallel volume rendering with entropy analysis
  – Region construction and task scheduling
[Figure: region construction and task scheduling for sort-last rendering. Recoverable labels follow.]
Mapper
  TaskID                    Type
  RAY_CASTING_TASK1         GPUs
  RAY_CASTING_TASK2         CPUs
  ENTROPY_TASK              CPUs
  IMAGE_COMPOSITING_TASK    CPUs
3D volume logical region: index space {vol_index_space}, field space {vol_field_space}
3D volume physical region: columns Voxel Index, Value
3D volume logical partition: columns LogicalPartitionID, Start, Offset
2D image logical region: index space {img_index_space}, field space {img_field_space}
2D image physical region: columns Pixel Index, Value
2D image logical partition: columns LogicalPartitionID, Start, Offset
Tasks: columns LogicalPartitionID, TaskID
Task instances shown: ray_casting (GPU and CPU), entropy (CPU), image_compositing (CPU). The legend marks CPU cores and GPUs.
• Sort-first parallel volume rendering with entropy analysis
  – Mapper interface
    • Ray casting task (GPUs)
    • Entropy task (CPUs)
  – Region construction and task scheduling
    • Divide the 2D image into uniform 2D grids
    • Each processor is responsible for rendering one image portion
    • No need to divide the 3D volume data
    • No need for image compositing
• The sort-first and sort-last algorithms differ in their data partitioning and distribution requirements, but our solution provides a simple and feasible way to incorporate different operations in a unified framework using logical regions
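
The sort-first image partition described above (a uniform 2D grid, one image portion per processor) can be sketched as follows. The function name `tile_image` and the rectangle convention are assumptions made for this example, and the code is plain Python rather than Legion.

```python
def tile_image(width, height, tiles_x, tiles_y):
    """Split a width x height image into a tiles_x x tiles_y uniform grid.

    Returns a list of (x0, y0, x1, y1) pixel rectangles with exclusive
    upper bounds. Tile k is the image portion rendered by processor k
    (row-major order, an assumption of this sketch).
    """
    tiles = []
    for ty in range(tiles_y):
        y0 = ty * height // tiles_y
        y1 = (ty + 1) * height // tiles_y
        for tx in range(tiles_x):
            x0 = tx * width // tiles_x
            x1 = (tx + 1) * width // tiles_x
            tiles.append((x0, y0, x1, y1))
    return tiles
```

For instance, `tile_image(1024, 1024, 8, 8)` gives 64 tiles of 128 × 128 pixels. The 8 × 8 layout is only one possible grid for 64 processors; the slides do not state which layout was used.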
Experiments and Results
• Conduct experiments on Titan, a Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility
  – Each node of Titan contains one 16-core AMD Opteron CPU and an NVIDIA Tesla K20 GPU
• Test sort-first and sort-last parallel rendering
• Conduct scalability comparisons using a combustion data set with a resolution of 1600 × 1375 × 430
• Test with 1 to 256 processors and two output image resolutions, 1024² and 2048²
• Overall time breakdown, data partition time, rendering time, and data movement time for different total numbers of nodes, for sort-first and sort-last rendering
Fig. 1: (a) the time breakdown of sort-first parallel volume rendering for different numbers of nodes; (b) the data partition time; (c) the rendering time; (d) the data movement time. Two output image resolutions, 1024² and 2048², are used.
Fig. 2: (a) the time breakdown of sort-last parallel volume rendering for different numbers of nodes; (b) the data partition time; (c) the rendering time; (d) the image compositing time. Two output image resolutions, 1024² and 2048², are used.
• Interactive rendering time and data movement time of sort-first parallel rendering for 64 nodes with an image resolution of 1024²
Fig. 3: The rendering time and data movement time of sort-first rendering for 64 nodes from multiple view angles. The output image resolution is 1024².
• Rendering time results of sort-first and sort-last parallel rendering on any number of nodes from 1 to 256 with an image resolution of 1024²
Fig. 4: The time results of sort-first (a) and sort-last (b) parallel rendering on any number of nodes from 1 to 256. The output image resolution is 1024².
Fig. 5: The time results of ray casting and entropy analysis with various allocation ratios. The output image resolution is 1024².
• Legion job-stealing scheduling performance
  – CPU ray casting time is 1.347 seconds (5%)
  – CPU entropy time is 0.936 seconds
  – GPU ray casting time is 2.833 seconds (95%)
• Given that each node has a 16-core CPU, we tested different ratios between ray casting and entropy operations
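
To make the idea of allocation ratios on a 16-core CPU concrete, the sketch below enumerates ways to divide the cores between ray casting and entropy. The function names are hypothetical, and the slides do not list which ratios were actually tested, so the values used here are illustrative only.

```python
def core_splits(total_cores=16):
    """List every (ray_casting_cores, entropy_cores) split of one CPU,
    keeping at least one core for each operation."""
    return [(k, total_cores - k) for k in range(1, total_cores)]


def split_for_ratio(ray_parts, entropy_parts, total_cores=16):
    """Split the cores as closely as possible to ray_parts:entropy_parts.

    Assumption: each operation always keeps at least one core.
    """
    ray = round(total_cores * ray_parts / (ray_parts + entropy_parts))
    ray = min(max(ray, 1), total_cores - 1)
    return ray, total_cores - ray
```

For example, `split_for_ratio(3, 1)` assigns 12 cores to ray casting and 4 to entropy, and `split_for_ratio(1, 1)` assigns 8 to each.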
Conclusion
• A study of conducting scientific data analytics on distributed heterogeneous architectures by leveraging the Legion programming model and runtime system
• Consider both scalability and usability in our design
• Facilitate complex analytics operations with completely different data partitioning and distribution requirements in a nearly unified manner
• Perform operations across CPUs and GPUs and balance the workload with automatic or manual scheduling strategies
Acknowledgement
• This research has been sponsored in part by the Department of Energy through the ExaCT Center for Exascale Simulation of Combustion in Turbulence and by the National Science Foundation through grant IIS-1423487.
• The allocation of supercomputing time on the Oak Ridge Leadership Computing Facility (OLCF) has been sponsored by the Department of Energy through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
Thank You!