Legion-based Scientific Data Analytics on Heterogeneous Processors
Lina Yu, Hongfeng Yu
Department of Computer Science & Engineering
University of Nebraska-Lincoln, Lincoln, Nebraska


Page 1: Lina Yu, Hongfeng Yu - cecsresearch.org/vcl/ASH/index_files/2016/legion_BIG_DATA_2016.pdf



Outline

•  Motivation
•  Contributions
•  Framework
•  Examples
•  Experiments and Results
•  Conclusion


Motivation

•  It is challenging to efficiently use today's supercomputers
–  Deep, distributed memory hierarchies
–  Heterogeneous processing units

•  Communication costs are a critical issue for parallel system and software designers to consider
–  A scientific analytics workflow consists of multiple operations that intrinsically incur different communication or data movement requirements between compute nodes


Motivation

•  Legion: programming model + runtime system
–  Describe hierarchical organizations of both data and computation at an abstract level

•  Legion assists a programmer in solving the common programming burdens
–  Discover/verify the correctness of parallel execution
–  Manage communication

•  At a high level, mapping a Legion program requires making two kinds of decisions
–  For each task, select a processor on which to run the task through the mapping interface
–  For each logical region that a task requests, select a memory in which to create and use a physical instance of the logical region
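The two mapping decisions can be illustrated with a small plain-Python sketch. It does not use the Legion API: `Processor`, `select_processor`, `select_memory`, and the memory names are hypothetical stand-ins, and the "place data in the memory closest to the chosen processor" policy is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical machine description; these names are illustrative, not Legion API.
@dataclass(frozen=True)
class Processor:
    name: str
    kind: str      # "CPU" or "GPU"
    memory: str    # memory assumed to be closest to this processor

def select_processor(task_kind, processors):
    """Decision 1: pick a processor of the kind the task prefers."""
    for proc in processors:
        if proc.kind == task_kind:
            return proc
    # Fall back to the first processor if the preferred kind is absent.
    return processors[0]

def select_memory(proc):
    """Decision 2: place a region's physical instance near the processor."""
    return proc.memory

def map_task(task_kind, regions, processors):
    """Return the chosen processor name and a region -> memory placement."""
    proc = select_processor(task_kind, processors)
    placement = {region: select_memory(proc) for region in regions}
    return proc.name, placement

machine = [
    Processor("cpu0", "CPU", "system_memory"),
    Processor("gpu0", "GPU", "gpu_framebuffer"),
]
result = map_task("GPU", ["volume", "image"], machine)
```

In this sketch a GPU task gets both of its regions placed in the GPU's memory, while a CPU task would get them in system memory.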


Our Contribution

•  Investigate the feasibility of using Legion to perform analytics for large-scale scientific data on heterogeneous processors

•  Help users simplify the programming of data partitioning, data organization, and data movement for distributed-memory heterogeneous architectures

•  Facilitate the simultaneous execution of multiple analytics operations on modern and future supercomputers

•  Demonstrate the scalability and the usability of our approach using several representative analytics operations on a heterogeneous supercomputer


Mapper Interface

•  We design a custom mapper based on Legion's mapper interface
–  Map operations onto target processors
–  Specify which memories are used to host the physical instances of the logical regions requested by such operations

Figure: the mapper interface takes the operations OP = {op1, ..., opv} and the processors GPU = {gpu1, ..., gpun} and CPU = {cpu1, ..., cpum}, and yields one tuple <opi, CPUsi, GPUsi> per operation: <op1, CPUs1, GPUs1>, ..., <opv, CPUsv, GPUsv>.
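A minimal sketch of what producing the <opi, CPUsi, GPUsi> tuples could look like. The allocation policy here (CPUs dealt round-robin across operations, all GPUs given to a caller-supplied set of GPU-capable operations) is an assumption for illustration, and `build_mapping` is a hypothetical helper, not part of Legion or of the authors' mapper.

```python
def build_mapping(ops, cpus, gpus, gpu_ops):
    """Return one (op, CPUs_i, GPUs_i) tuple per operation.

    gpu_ops is the assumed set of operations allowed to use GPUs.
    CPUs are dealt round-robin across all operations.
    """
    mapping = []
    for i, op in enumerate(ops):
        cpu_subset = cpus[i::len(ops)]                    # every len(ops)-th CPU
        gpu_subset = list(gpus) if op in gpu_ops else []  # GPUs only for GPU ops
        mapping.append((op, cpu_subset, gpu_subset))
    return mapping

ops = ["ray_casting", "entropy", "image_compositing"]
cpus = [f"cpu{i}" for i in range(1, 7)]
gpus = ["gpu1"]
mapping = build_mapping(ops, cpus, gpus, gpu_ops={"ray_casting"})
```

Each tuple then tells the runtime-facing code which processors an operation may be launched on; a different policy only changes how the subsets are chosen, not the tuple shape.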


Region Construction and Task Scheduling

•  Main steps of our approach to make an operation opi run on heterogeneous processors
–  Construct a field space of the logical region, and allocate the field space for each portion of data.
–  Construct an index space of the logical region for the input data of each operation.
–  Create a logical region using the index space and the field space defined in the previous two steps.
–  Use coloring to partition a logical region (colorings are objects that describe an intended partition of an index space).
–  Create a corresponding physical region to hold the physical instances (i.e., the real values for the input data).
–  Execute operations on GPUs and CPUs according to the previous mapper interface we designed.

Figure: a logical region (index space, field space) is split into a logical partition = {lp1, ..., lpp} and backed by a physical region. The task scheduler dispatches Opi over the logical partition: the GPUs (GPUsi)1, ..., (GPUsi)k, ..., (GPUsi)u run Opgi(lp1), ..., Opgi(lpk), ..., Opgi(lpu), and the CPUs (CPUsi)1, ..., (CPUsi)j, ..., (CPUsi)v run Opci(lpu+1), ..., Opci(lpu+j), ..., Opci(lpp).
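The steps on this slide can be mimicked with plain Python data structures. This is only an analogy under stated assumptions: none of the names below are Legion API calls, the coloring uses contiguous equal-sized blocks, and `op_gpu` / `op_cpu` are hypothetical stand-ins for the GPU and CPU variants Opgi and Opci.

```python
# Index space: the element indices of the input data.
index_space = range(12)

# Field space: the named fields stored for every index.
field_space = ("value",)

# Logical region: index space x field space, with no storage attached yet.
logical_region = {"index_space": index_space, "field_space": field_space}

def color_partition(index_space, num_colors):
    """Coloring: map each color (partition id) to a contiguous block of indices."""
    n = len(index_space)
    coloring = {}
    for color in range(num_colors):
        start = color * n // num_colors
        stop = (color + 1) * n // num_colors
        coloring[color] = index_space[start:stop]
    return coloring

# Logical partition {lp_1, ..., lp_p}, here with p = 4.
logical_partition = color_partition(logical_region["index_space"], 4)

# Physical region: the actual values backing the logical region.
physical_region = {i: float(i) for i in index_space}

def op_gpu(values):   # hypothetical GPU variant of the operation
    return sum(values)

def op_cpu(values):   # hypothetical CPU variant of the operation
    return sum(values)

def schedule(logical_partition, physical_region, num_gpu_pieces):
    """Run the first u pieces with the GPU variant and the rest with the CPU variant."""
    results = {}
    for color, indices in logical_partition.items():
        values = [physical_region[i] for i in indices]
        variant = op_gpu if color < num_gpu_pieces else op_cpu
        results[color] = (variant.__name__, variant(values))
    return results

results = schedule(logical_partition, physical_region, num_gpu_pieces=1)
```

With `num_gpu_pieces=1`, piece 0 goes to the GPU variant and pieces 1 to 3 go to the CPU variant, mirroring the lp1..lpu / lpu+1..lpp split in the figure.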


Continued List 1


Examples

•  Sort-last parallel volume rendering with entropy analysis
–  Mapper interface

Figure: two mapper-interface configurations over GPU = {gpu1, ..., gpun} and CPU = {cpu1, ..., cpum}, each taking ray_casting, image_compositing, and entropy as operations. In the first, ray_casting is mapped to GPUs1, entropy to CPUs1, and image_compositing to CPUs2. In the second, ray_casting is mapped to both GPUs1 and CPUs1, entropy to CPUs2, and image_compositing to CPUs3.
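The slides do not define the entropy operation. As a rough illustration, assuming it is the Shannon entropy of a histogram of data values within a block, it could be sketched as below; `shannon_entropy`, the bin count, and the value range are all assumptions for the example.

```python
import math
from collections import Counter

def shannon_entropy(values, num_bins=4, lo=0.0, hi=1.0):
    """Shannon entropy (in bits) of a histogram of values in [lo, hi]."""
    counts = Counter()
    for v in values:
        # Map the value to a bin; clamp the top edge into the last bin.
        b = min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1)
        counts[b] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform_block = [0.1, 0.3, 0.6, 0.9]   # one value in each of the four bins
flat_block = [0.5, 0.5, 0.5, 0.5]      # all values in a single bin
```

A block whose values spread evenly over the bins gets the maximum entropy (2 bits for four bins), while a constant block gets zero, which is why entropy is a common measure of how much information a data block carries.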


Examples

•  Sort-last parallel volume rendering with entropy analysis
–  Region construction and task scheduling

Figure: region construction and task scheduling for sort-last rendering. Recoverable labels:

3D volume logical region:     Index {vol_index_space}, Field {vol_field_space}
3D volume physical region:    columns Voxel Index, Value
3D volume logical partition:  columns Logical Partition ID, Start, Offset
2D image logical region:      Index {img_index_space}, Field {img_field_space}
2D image physical region:     columns Pixel Index, Value
2D image logical partition:   columns Logical Partition ID, Start, Offset
Tasks:                        columns Logical Partition ID, Task ID

Mapper:

Task ID                   Type
RAY_CASTING_TASK1         GPUs
RAY_CASTING_TASK2         CPUs
ENTROPY_TASK              CPUs
IMAGE_COMPOSITING_TASK    CPUs

Processors shown in the figure: ray_casting runs on the GPU and on CPU cores; entropy and image_compositing run on CPU cores.
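To make the sort-last data flow concrete, here is a small pure-Python sketch: the volume is split into slabs along z, each slab is ray-cast independently into a partial image, and the partial images are composited in depth order. The optical model (density used as both color and opacity, axis-aligned rays) is a simplifying assumption, and the function names are illustrative rather than taken from the authors' code.

```python
def ray_cast(slab):
    """Front-to-back accumulation along z for every pixel of one slab.

    slab[z][y][x] is a density in [0, 1], used here as both color and opacity.
    Returns image[y][x] = (premultiplied color, alpha).
    """
    height, width = len(slab[0]), len(slab[0][0])
    image = [[(0.0, 0.0)] * width for _ in range(height)]
    for z_slice in slab:
        for y in range(height):
            for x in range(width):
                color, alpha = image[y][x]
                d = z_slice[y][x]
                image[y][x] = (color + (1.0 - alpha) * d * d,
                               alpha + (1.0 - alpha) * d)
    return image

def composite(front, back):
    """'Over' operator on premultiplied (color, alpha) images, front over back."""
    return [[(fc + (1.0 - fa) * bc, fa + (1.0 - fa) * ba)
             for (fc, fa), (bc, ba) in zip(front_row, back_row)]
            for front_row, back_row in zip(front, back)]

def sort_last_render(volume, num_parts):
    """Split the volume into z-slabs, render each, then composite in depth order."""
    depth = len(volume)
    bounds = [i * depth // num_parts for i in range(num_parts + 1)]
    partials = [ray_cast(volume[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]
    image = partials[0]
    for partial in partials[1:]:
        image = composite(image, partial)
    return image

# Tiny 4 x 1 x 2 volume: four z-slices, one row, two pixels.
volume = [[[0.5, 0.0]], [[0.5, 1.0]], [[0.2, 0.3]], [[0.1, 0.0]]]
full_image = ray_cast(volume)
split_image = sort_last_render(volume, 2)
```

Because the over operator is associative, compositing the per-slab images in depth order gives the same picture as ray casting the whole volume at once, up to floating-point rounding. That is the property that lets the ray casting and image compositing tasks run on different processors.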


Examples

•  Sort-first parallel volume rendering with entropy analysis
–  Mapper interface
     •  Ray casting task (GPUs)
     •  Entropy task (CPUs)
–  Region construction and task scheduling
     •  Divide the 2D image into uniform 2D grids
     •  Each processor is responsible for rendering one image portion
     •  No need to divide the 3D volume data
     •  No need for image compositing

•  The sort-first and sort-last algorithms differ in their data partitioning and distribution requirements, but our solution provides a simple and feasible way to incorporate different operations in a unified framework using logical regions
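The sort-first partitioning described on this slide can be sketched in a few lines. This is an illustrative example, not the authors' implementation: `uniform_tiles`, `render_tile`, and the per-pixel `shade` callback (standing in for a ray caster that reads the whole volume) are hypothetical names.

```python
def uniform_tiles(width, height, tiles_x, tiles_y):
    """Divide a width x height image into a uniform grid of (x0, y0, x1, y1) tiles."""
    tiles = []
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            x0, x1 = tx * width // tiles_x, (tx + 1) * width // tiles_x
            y0, y1 = ty * height // tiles_y, (ty + 1) * height // tiles_y
            tiles.append((x0, y0, x1, y1))
    return tiles

def render_tile(tile, shade):
    """Render one tile; shade(x, y) is assumed to see the entire volume."""
    x0, y0, x1, y1 = tile
    return {(x, y): shade(x, y) for y in range(y0, y1) for x in range(x0, x1)}

def sort_first_render(width, height, tiles_x, tiles_y, shade):
    """Render every tile independently and merge the disjoint results."""
    image = {}
    for tile in uniform_tiles(width, height, tiles_x, tiles_y):
        image.update(render_tile(tile, shade))  # tiles are disjoint: no compositing
    return image

tiles = uniform_tiles(1024, 1024, 4, 4)
image = sort_first_render(8, 8, 2, 2, lambda x, y: x + y)
```

Since the tiles never overlap, the final image is just the union of the tile results, which is why no compositing step is needed in this scheme.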


Experiments and Results

•  Conduct experiments on Titan, a Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility
–  Each node of Titan contains one 16-core AMD Opteron CPU and an NVIDIA Tesla K20 GPU

•  Test sort-first and sort-last parallel rendering
•  Conduct scalability comparisons using a combustion data set with a resolution of 1600 x 1375 x 430

•  Test from 1 to 256 processors with two output image resolutions, 1024² and 2048²
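For a sense of scale, the problem sizes in this setup can be worked out directly. The sketch below assumes an even split across processors (the volume for sort-last, the image for sort-first); the actual partition sizes used in the experiments are not stated on the slides.

```python
# Data set and image sizes taken from the slide.
volume_dims = (1600, 1375, 430)
voxels = volume_dims[0] * volume_dims[1] * volume_dims[2]

pixels_1024 = 1024 ** 2
pixels_2048 = 2048 ** 2

def per_processor(total, num_processors):
    """Average share per processor, assuming an even split."""
    return total / num_processors

voxels_per_proc_256 = per_processor(voxels, 256)        # sort-last: volume is divided
pixels_per_proc_256 = per_processor(pixels_1024, 256)   # sort-first: image is divided
```

The volume has 946,000,000 voxels, so at 256 processors an even sort-last split gives about 3.7 million voxels each, while an even sort-first split of the 1024² image gives 4,096 pixels each.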


Experiments and Results

•  The overall time breakdown, data partition time, rendering time, and data movement time for different total numbers of nodes, for sort-first rendering and sort-last rendering

Fig. 1: (a) The time breakdown of sort-first parallel volume rendering for different numbers of nodes. (b) The data partition time. (c) The rendering time. (d) The data movement time. Two output image resolutions, 1024² and 2048², are used.

Fig. 2: (a) The time breakdown of sort-last parallel volume rendering for different numbers of nodes. (b) The data partition time. (c) The rendering time. (d) The image compositing time. Two output image resolutions, 1024² and 2048², are used.


Experiments and Results

•  Interactive rendering time and data movement time of sort-first parallel rendering for 64 nodes with an image resolution of 1024²

Fig. 3: The rendering time and data movement time of sort-first rendering for 64 nodes from multiple view angles. The output image resolution is 1024².


Experiments and Results

•  The rendering time results of sort-first and sort-last parallel rendering on any number of nodes from 1 to 256 with an image resolution of 1024²

Fig. 4: The time results of sort-first (a) and sort-last (b) parallel rendering on any number of nodes from 1 to 256. The output image resolution is 1024².


Experiments and Results

Fig. 5: The time results of ray casting and entropy analysis with various ratios on allocation. The output image resolution is 1024².

•  Legion job-stealing scheduling performance
–  CPU ray casting time is 1.347 seconds (5%)
–  CPU entropy time is 0.936 seconds
–  GPU ray casting time is 2.833 seconds (95%)

•  Given that each node has a 16-core CPU, we tested different ratios between ray casting and entropy operations
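The effect of dynamic scheduling across a fast and a slow processor can be illustrated with a toy simulation. This is not Legion's scheduler: it is a greedy pull model in which an idle worker takes the next work item, and the per-item costs (GPU 1.0, CPU 19.0) are hypothetical values picked for the example, not measurements from the slides. Reading the slide's 5% and 95% as work shares is also an assumption.

```python
import heapq

def simulate_pull_scheduling(num_items, cost_per_item):
    """Greedy pull scheduling: whenever a worker becomes idle it takes the next item.

    cost_per_item maps worker name -> time per work item (hypothetical values).
    Returns (items done per worker, overall finish time).
    """
    # Min-heap of (time at which the worker becomes free, worker name).
    free_at = [(0.0, name) for name in sorted(cost_per_item)]
    heapq.heapify(free_at)
    done = {name: 0 for name in cost_per_item}
    finish = 0.0
    for _ in range(num_items):
        t, name = heapq.heappop(free_at)   # earliest idle worker
        t += cost_per_item[name]           # it processes one item
        done[name] += 1
        finish = max(finish, t)
        heapq.heappush(free_at, (t, name))
    return done, finish

done, finish = simulate_pull_scheduling(100, {"gpu": 1.0, "cpu": 19.0})
```

With these made-up costs the faster worker ends up with 95 of the 100 items and the slower one with 5, and both finish at the same time, which is the kind of balance a stealing-style scheduler aims for.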


Conclusion

•  A study for conducting scientific data analytics on distributed heterogeneous architectures by leveraging the Legion programming model and runtime system

•  Consider both scalability and usability in our design

•  Facilitate complex analytics operations with completely different data partitioning and distribution requirements in a nearly unified manner

•  Perform operations across CPUs and GPUs and balance workload by automatic or manual scheduling strategies


Acknowledgement

•  This research has been sponsored in part by the Department of Energy through the ExaCT Center for Exascale Simulation of Combustion in Turbulence and by the National Science Foundation through grant IIS-1423487.

•  The allocation of supercomputing time on the Oak Ridge Leadership Computing Facility (OLCF) has been sponsored by the Department of Energy through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.


Thank You!