Quicksilver: A Proxy App for the Monte Carlo Transport Code
TRANSCRIPT
LLNL-PRES-743737
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Quicksilver: A Proxy App for the Monte Carlo Transport Code Mercury
Workshop on Representative Applications
David Richards, Ryan Bleile, Patrick Brantley, Shawn Dawson, Scott McKinley, Matthew O’Brien
September 5, 2017
Sierra will include IBM Power9 CPUs and Nvidia Volta GPUs
IBM Power9 CPU: <10% of Sierra flops
Nvidia Volta GPU: >90% of Sierra flops, 16 GB HBM per GPU
~4500 nodes. Each node has: 2 Power9 CPUs, 4 Volta GPUs
Mercury solves particle transport problems using the Monte Carlo method
§ Particles interact with matter by a variety of “reactions”
§ The probability of each reaction and its outcomes are captured in experimentally measured “cross sections” (latency-bound table lookups)
§ Follows many particles (millions or more) and uses random numbers to sample the probability distributions (very branchy, divergent code)
§ Particles contribute to diagnostic “tallies” (potential data races)
§ The result is a statistically correct representation of the physical system
[Figure: absorption, scattering, and fission reactions]
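The sampling step described on this slide can be illustrated with a minimal Python sketch. It is not Mercury's or Quicksilver's code: the cross section values, the `CROSS_SECTIONS` table, and the function names are all hypothetical, and a real code would look the values up in energy-dependent tables.

```python
import random

# Hypothetical cross sections (arbitrary units) for one material at one
# energy. A production code reads these from measured, energy-dependent tables.
CROSS_SECTIONS = {"absorption": 0.2, "scattering": 0.7, "fission": 0.1}


def sample_reaction(cross_sections, rng):
    """Pick a reaction with probability proportional to its cross section."""
    total = sum(cross_sections.values())
    threshold = rng.random() * total
    running = 0.0
    for name, sigma in cross_sections.items():
        running += sigma
        if threshold < running:
            return name
    return name  # guard against floating-point round-off at the upper edge


def run(num_particles, seed=0):
    """Sample one reaction per particle and count each kind in a tally."""
    rng = random.Random(seed)
    tally = {name: 0 for name in CROSS_SECTIONS}
    for _ in range(num_particles):
        tally[sample_reaction(CROSS_SECTIONS, rng)] += 1
    return tally


if __name__ == "__main__":
    print(run(100_000))
```

The branch taken inside `sample_reaction` depends on a random number, which is the source of the divergent control flow the slide mentions, and the shared `tally` dictionary is where a threaded version would need to guard against data races.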
Nuclear cross sections are not “nice” functions
[Figure: nuclear cross sections for 235U; labels: scattering, absorption, fission]
Creating Quicksilver required modeling choices
Proxy apps are models for one or more aspects of their parents
§ Three specific uses:
  — A nimble prototype code for testing design or refactoring options for Mercury
  — An open source vehicle for co-design with outside partners
  — A benchmark code to replace our previous Monte Carlo benchmark code
§ The overall goal was to approximate the application performance of Mercury
  — Control flow is dominated by branching due to the random sampling of reactions.
  — Memory access patterns associated with reading cross section tables tend to be latency-bound, small memory loads that are difficult or impossible to cache or coalesce.
  — Domain decomposition and internode communication to handle large problems.
§ Major data structures intentionally similar to Mercury
§ Flexible inputs to represent multiple common use modes
It is essential to identify the key features of the parent app the proxy is intended to represent and include faithful models of those features
Quicksilver omits many Mercury features, but keeps enough to represent critical computational patterns
Feature | Mercury | Quicksilver
Cross sections | Continuous energy & multi-group | Multi-group (synthetic data)
Mesh/Geometry | Multiple types & solid geometry | 3D polyhedral only
Reactions | 10-40 | 3 (uses replication)
Reaction physics | Physically based | Simplified (isotropic, etc.)
Tallies | Many built-in & user defined | Balance & scalar flux
Sources & population control | Realistic with variance reduction | Simplified
Load balancing | Sophisticated, application specific | Trivial, particle-count based
Input specification | Python scripting interface | Flexible problem setup (YAML)
MPI/OpenMP | Yes/Yes | Yes/Yes
Is Quicksilver a good representation of Mercury?
[Figure: balance tallies per time step]
Balance tallies count each kind of event. Data is for 100,000 particles.
Number of reactions drops sharply
Facet crossings nearly vanish
Number of particles reaching census increases
Big changes during first 10 time steps. Not the behavior we expected
The particle energy spectrum has an unexplained shift. Caused by unintended consequences of simplified physics & sourcing.
§ Sourced particles have random energy drawn from uniform distribution
§ Simplified sourcing rules create 10% of target simulation particles each step
§ Population control splits or kills particles to achieve target number
  — Adjusts particle statistical weight
§ In just a few time steps, population control magnifies any particles that survive to census
  — Rare, low energy particles less likely to be absorbed
[Figure: particle energy spectrum by energy group; energy labels 1 keV and 20 MeV]
Energy groups are logarithmic. Particles in group 135 are about 100x lower velocity than group 230
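The split-or-kill step named on this slide can be sketched as follows. This is one common weight-preserving scheme, written under assumptions: the slides do not say which scheme Quicksilver uses, and `population_control` and its parameters are hypothetical names.

```python
import random


def population_control(weights, target, rng):
    """Split or kill particles so the particle count lands near `target`.

    Each particle is replaced by n copies, where n has expected value
    `factor`. Every copy carries weight w / factor, so the expected total
    weight is unchanged. Assumes `weights` is non-empty.
    """
    factor = target / len(weights)
    out = []
    for w in weights:
        copies = int(factor)                 # guaranteed copies
        if rng.random() < factor - copies:   # one extra copy, by chance
            copies += 1
        out.extend([w / factor] * copies)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    survivors = [1.0] * 1000                 # particles that reached census
    grown = population_control(survivors, 4000, rng)
    print(len(grown), sum(grown))
```

When few particles survive a step and the target is large, `factor` is well above 1, so every survivor is copied several times. That is the magnification effect the slide describes for rare low energy particles.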
Deleting low-weight particles solves the problem
[Figure: particle energy spectrum after deleting low-weight particles]
There is no such capability in the parent application, but it makes Quicksilver a better model for Mercury
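A low-weight deletion step might look like the short sketch below. The slides do not give the deletion criterion, so the threshold used here (a fraction of the mean particle weight) and the name `drop_low_weight` are assumptions made only for illustration.

```python
def drop_low_weight(weights, cutoff_fraction):
    """Remove particles whose weight is below cutoff_fraction * mean weight.

    The relative-to-mean threshold is an assumption for this sketch.
    """
    if not weights:
        return []
    threshold = cutoff_fraction * sum(weights) / len(weights)
    return [w for w in weights if w >= threshold]


if __name__ == "__main__":
    print(drop_low_weight([1.0, 1.0, 1.0, 0.001], 0.1))
```

Unlike the split-or-kill population control, plain deletion does not preserve total weight, which fits the slide's remark that the parent application has no such capability.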
Quicksilver and Mercury are hostile to the typical GPU fine-grained threading approach
§ loop over cycles (time steps)
  — cycle_init
    • source in new particles
    • population control
  — cycle_tracking
    • loop over particles
      – until census
        • find distance to census (end of time step)
        • find distance to material boundary (mesh facet)
        • find distance to collision (reaction)
        • select reaction and update particle
  — cycle_finalize
This is 1000s (or 10,000s) of lines of code
Majority of cross section lookups are in here
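The inner part of cycle_tracking can be sketched for a single particle as below. This is a heavily simplified illustration, not the real tracking code: it assumes straight-line flight through evenly spaced facets, a single total cross section, and only two collision outcomes (absorb or scatter). All names and parameters are hypothetical.

```python
import math
import random


def track_particle(rng, time_left, speed, sigma_total, cell_width, absorb_prob):
    """Follow one particle until census or absorption; return event counts."""
    counts = {"segments": 0, "collisions": 0, "facet_crossings": 0,
              "census": 0, "absorbed": 0}
    to_facet = cell_width  # assumed: facets evenly spaced along the path
    while True:
        d_census = speed * time_left
        # Exponentially distributed distance to the next collision.
        d_collision = -math.log(1.0 - rng.random()) / sigma_total
        d = min(d_census, to_facet, d_collision)
        counts["segments"] += 1
        time_left -= d / speed
        if d == d_census:            # end of time step reached first
            counts["census"] += 1
            return counts
        if d == to_facet:            # crossed into the next mesh cell
            counts["facet_crossings"] += 1
            to_facet = cell_width
            continue
        counts["collisions"] += 1    # a reaction happened first
        to_facet -= d
        if rng.random() < absorb_prob:
            counts["absorbed"] += 1
            return counts
        # Otherwise the particle scatters and keeps flying (direction
        # changes are ignored in this sketch).


if __name__ == "__main__":
    print(track_particle(random.Random(0), 1.0, 10.0, 1.0, 0.5, 0.3))
```

Each pass through the loop picks the nearest of three events by comparing random and geometric distances, so neighboring particles quickly end up on different branches. That is the behavior that makes fine-grained GPU threading difficult here.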
How do you write this code for GPUs?
“Fat” threading strategy:
• Each thread gets its own “vault” of particles
• Tally and buffer data structures are replicated to avoid races
• Works great on CPU platforms!
Make this a kernel!
Can this “Big-Kernel” approach possibly perform well?
To build a big kernel, Mercury’s threading model must change
Fat thread
§ Dozens of active threads
§ Separate collection of particles for each thread
§ Data races managed with replication
§ MPI tightly integrated in tracking loop
§ Works well on CPUs
Thin thread
§ Thousands of active threads
§ All threads share a common collection of particles
§ Data races managed with atomics
§ No MPI in tracking loop
§ Works on GPUs and CPUs
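The two ways of managing tally races can be contrasted with a small Python sketch. It only illustrates the idea: Python has no atomic add, so a `threading.Lock` stands in for the atomic update, and the function names are hypothetical.

```python
import threading


def tally_replicated(events_per_thread):
    """Fat-thread style: each thread owns a private tally, merged at the end."""
    private = [0] * len(events_per_thread)

    def work(i):
        for _ in range(events_per_thread[i]):
            private[i] += 1          # only thread i ever touches slot i

    threads = [threading.Thread(target=work, args=(i,))
               for i in range(len(events_per_thread))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(private)              # reduction over the replicas


def tally_shared(events_per_thread):
    """Thin-thread style: one shared tally; a lock stands in for an atomic add."""
    total = [0]
    lock = threading.Lock()

    def work(n):
        for _ in range(n):
            with lock:
                total[0] += 1

    threads = [threading.Thread(target=work, args=(n,))
               for n in events_per_thread]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


if __name__ == "__main__":
    print(tally_replicated([1000, 2000, 3000]), tally_shared([1000, 2000, 3000]))
```

Replication costs memory in proportion to the thread count, which is affordable for dozens of threads but not for thousands. That is one reason the thin-thread model uses a shared tally with atomic updates.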
To test big-kernel we wrote a proxy app for our proxy app
§ Quicksilver was more complicated than we wanted to port to GPU
  — MPI, variable particle count, etc.
§ Quicksilver_lite is even more approximate than Quicksilver
  — Zero-D mesh, very simplified physics
§ Quicksilver_lite maintains features most likely to impair GPU performance
  — Random table look-ups
  — Call stack depth in nuclear data look-ups
  — Branchy control flow and divergence
QS_lite runtimes (lower is better)
Platform | Initialize | Compute
P8 CPU (10 threads) | 0.27 sec | 1.25 sec
P8 CPU (40 threads) | 0.45 sec | 0.72 sec
P100 GPU | 0.26 sec | 0.45 sec
QS_lite provided our first evidence that the big-kernel approach might actually work
Mercury is employed for a very wide variety of problems
No single sample problem will represent all use cases
§ By changing problem inputs we can adjust:
  — Fraction of particles that reach census
  — Ratio of facet crossings to reactions
  — Relative probabilities of different reaction types
Higher is better
Problem | Fat (CPU) seg/sec | Thin (GPU) seg/sec
Reaction Dominated | 8.52e+06 | 1.15e+07
Balanced | 1.35e+07 | 7.50e+06
Facet Dominated | 2.24e+07 | 2.90e+07
GPUs and CPUs are similar, but with different performance sensitivities. GPUs are slowest in the balanced case. Perhaps due to highest divergence?
Performance comparison to Mercury
§ Different code capabilities make apples-to-apples comparison impossible
§ Mercury test problem is a critical sphere in water
  — Tuned mesh size and time step to obtain same tally ratios as a similar Quicksilver problem
Higher is better
CPU Performance | Mercury (seg/sec) | Quicksilver (seg/sec)
Reaction Dominated | 1.62e+06 | 8.52e+06
Balanced | 2.66e+06 | 1.37e+07
Facet Dominated | 2.97e+06 | 2.24e+07
Quicksilver is roughly 10x faster than Mercury on CPUs, but captures same trends. Performance difference likely somewhat due to Mercury’s better physics.
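The speed ratio can be computed directly from the CPU performance figures on this slide. The short sketch below copies those six numbers; the `RATES` and `speedups` names are made up for illustration.

```python
# Segments per second on CPU as (Mercury, Quicksilver), copied from the slide.
RATES = {
    "Reaction Dominated": (1.62e06, 8.52e06),
    "Balanced": (2.66e06, 1.37e07),
    "Facet Dominated": (2.97e06, 2.24e07),
}


def speedups(rates):
    """Return Quicksilver's rate divided by Mercury's rate for each problem."""
    return {name: qs / mercury for name, (mercury, qs) in rates.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RATES).items():
        print(f"{name}: {ratio:.1f}x")
```

For these three problems the ratios fall between about 5x and 7.5x, so the "roughly 10x" on the slide is best read as an order-of-magnitude figure. Both codes rank the three problems the same way, which matches the point about capturing the same trends.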
Will big kernel work?
In spite of adverse algorithmic characteristics, we are hopeful that Mercury will perform as well on GPUs as on CPUs. A potential 3-5x speedup compared to CPUs only
[Figure: cycle tracking time [seconds], axis from 0 to 50, on 1 node, 2 nodes, and 4 nodes; series: Fat Thread Power 8, Thin Thread Power 8, Fat Thread KNL, Thin Thread KNL, Thin Thread P100]
Cycle tracking time for a reaction dominated problem (lower is better)
Conclusion and future work
§ We used Quicksilver to test design strategies for Monte Carlo Transport on GPUs
  — So far, a big kernel approach appears to be viable
§ Effort is shifting from prototyping with Quicksilver to refactoring Mercury.
  — Design role is mostly done
§ Modifications are planned to make Quicksilver more representative for photons
  — This will make Quicksilver a more flexible procurement benchmark
  — More on proxies vs benchmarks in Thursday keynote talk