optimizing for intel knl - nersc · performance is upper bound of what you get in multi-node •...
TRANSCRIPT
![Page 1: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/1.jpg)
Thorsten Kurth
Optimizing Code for Intel Xeon Phi 7250 (Knight’s Landing)
NUGTraining,06/09/2017
![Page 2: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/2.jpg)
2
Multicore vs. manycore• multicore(Edison)
‣ 5000nodes
‣ 12physicalcores/CPU
‣ 24HWthreads/CPU
‣ 2.4-3.2GHz
‣ 4DPops/cycle
‣ 30MBL3cache
‣ 64GB/node
‣ 100GB/smemorybandwidth
• manycore(Cori-KNL)
‣ 9600nodes
‣ 68physicalcores/CPU
‣ 272HWthreads/CPU
‣ 1.2-1.6GHz
‣ 2x8DPops/cycle
‣ noL3cache
‣ 16GB/node(fast) 96GB/node(slow)
‣ 450GB/smemorybandwidth(fast)
![Page 3: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/3.jpg)
Recompile and go?
3
• x86-64compatible:canusecodesforolderarchitecturesorrecompile
• self-hosted:noneedforoffloading
• medianspeedupvs.Edison:1.15x
• medianspeedupvs.Haswell:0.70x
![Page 4: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/4.jpg)
Why should I optimize my code?
• pros
• getmoreforyourbucks:makingefficientuseofexistingmanycoreHPCsystems
• fastsuccesspossible:manylowhangingfruitsinunoptimizedcodes
• investinginthefuture:heterogeneousarchitecturesareenergyefficientandthuswillstayaroundforawhile
• benefitsonmulticore:optimizationstargetingmanycorearchitecturesmostlyimproveperformanceonmulticoresystemsaswell
• cons
• effort:manymostbeneficialoptimizationsrequiresignificantcodechanges
• investinginthefuture:whatifIbetonthewronghorse?
4
![Page 5: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/5.jpg)
Optimization targets
• singlenodeperformance
• starthere:forrepresentativelocalproblemsize,singlenodeperformanceisupperboundofwhatyougetinmulti-node
• manyoptimizationopportunities,fastturnaroundtimes
• manyprofilingtoolsavailable
• multi-nodeperformance
• feweroptimizationopportunities,profiling/debuggingtedious
• IOperformance
• notmanyopportunitiesforimprovement
5
![Page 6: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/6.jpg)
Where do I start?
• gettoknowyourapplication:don’tassumeyoualreadydo!
• determinehotspots
• manualtimers:becarefulwiththreadsafety/syncbarriers
• profilingtools:NERSCoffersalot
• CrayPat(verylightweight)
• Advisor(findtime-consumingloops)
• VTune(candoalotofthingsbutalsoveryslow)
• MAP(comparablylightweight)
• foundhotspots,nowwhat?
6
![Page 7: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/7.jpg)
What architectural feature shall I target?
• KNLhasmanynewfeaturestoexplore
• manythreads
• biggervectorunits
• complexintrinsics(ISA)
• multiplememorytiers
• understandyourhotspots
• computebound:morethreads,vectorization,ISA
• memoryBWbound:memorytiers,morethreads
• memorylatencybound:morethreads,vectorization
7
![Page 8: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/8.jpg)
Prerequisites - compile and run
• recompileyourcodeforKNL:codeforolderCPUsissupportedbutthosedonotmakefulluseofnewarchitecture
• Cray(wrappers): module swap craype-haswell craype-mic-knl
• Intel: -xmic-avx512
• GNU:-march=knl
• useproperOpenMPsettings:export OMP_NUM_THREADS=64export OMP_PLACES=threadsexport OMP_PROC_BIND=spread
• usejob-script-generatoronmy.nersc.govorNERSCwebsite
• nodeconfiguration:use-C knl,quad,cacheasastart
8
![Page 9: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/9.jpg)
Prerequisites - #FLOPS
• :numberoffloatingpointoperations
• manualcalculation:
• floatadditionandmultiplication:+1
• complexmultiplication:+6(4multiplications+2additions)
• etc.
• measurewithSDE:
• usingSDEismoreprecise,becauseitaccountsformasking
9
#FLOPS
![Page 10: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/10.jpg)
Prerequisites - #BYTES
10
• :numberofbytestransferredfrommainmemory
• manualcalculation(notrecommended,butgoodcheck):
• countthebytesofdatatobereadandwritteninthekernel
• doesnotaccountfordatareusethroughcaching
• measurewithVTune:
• preciselyobtainuncorecounterevents
#BYTES
![Page 11: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/11.jpg)
What is limiting my performance?
• Rooflineperformancemodel
• arithmeticintensity
• performance
• plotPvs.AIwitharchitecturalrooflineR
11
AI =#FLOPS
#BYTES
P =#FLOPS
time[s]
R(AI) = min(memory bw ·AI, peak flops)
![Page 12: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/12.jpg)
Example roofline
12
![Page 13: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/13.jpg)
Example roofline
12
memory bandwidth bound
![Page 14: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/14.jpg)
Example roofline
12
use MCDRAM
![Page 15: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/15.jpg)
Example roofline
12
compute bounduse MCDRAM
![Page 16: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/16.jpg)
Example roofline
12
use MCDRAM
utilize vectorization
![Page 17: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/17.jpg)
Example roofline
12
(possibly) memory latency bounduse MCDRAM
utilize vectorization
![Page 18: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/18.jpg)
Example roofline
12
use MCDRAM
utilize vectorization
threading, vectorization
![Page 19: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/19.jpg)
Example roofline
12
use MCDRAM
utilize vectorization
threading, vectorization
improve AI
![Page 20: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/20.jpg)
How to improve AI?
• definitionofarithmeticintensity
• twopossibilities
• numberofflops⬆ numberofbytes➡
(notpossible/easy,choiceofalgorithmdeterminesflops)
• numberofflops ➡ numberofbytes ⬇
• reality:tradeoffbetweenboth
13
AI =#FLOPS
#BYTES
![Page 21: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/21.jpg)
Create more work/thread
• loop/kernelfusion:improvescachere-useandreduceoverhead
• collapsenestedloops:
• rearrangedatastructures:moveOpenMPout(coarsegrain)
14
![Page 22: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/22.jpg)
Loop transformations I
• looptiling:improvescachere-useandcansignificantlyimproveperformance
15
![Page 23: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/23.jpg)
Loop transformations I
• looptiling:improvescachere-useandcansignificantlyimproveperformance
• especiallyrelevantonKNLbecauseofmissingL3
• blockingtosharedL2(512KiB)usuallygood
• wasmytransformationsuccessful?checkL1,L2missrates,e.g.inVTune
15
![Page 24: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/24.jpg)
Loop transformations II
• shortloopunrolling:helpsthecompilervectorizingtherightloops
16
![Page 25: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/25.jpg)
Loop transformations II
• shortloopunrolling:helpsthecompilervectorizingtherightloops
• unrollingpragmasarehelpfultoo
• checkcompileroptimizationreports
• useIntelAdvisor
16
![Page 26: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/26.jpg)
Data alignment
• align(andpad)datato64bitwordstoimproveprefetching
• canbedoneeasilyinmajorprogramminglanguages
• FORTRAN:-align array64byte (ifort,gfortrandoesitautomagically)
• C/C++:aligned_alloc(64, <size>), __attribute__ ((aligned(64))), __declspec(align(64))
• C++trick:overloadnew operator
• advanced:manuallypaddataifarrayextentsarepowerof2tominimizecacheassociativityconflicts
17
![Page 27: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/27.jpg)
Make use of ISA
• helpthecompilertogenerateefficientintrinsics
18
runtime example for app with kernel: 1.2 sec
![Page 28: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/28.jpg)
Make use of ISA
• helpthecompilertogenerateefficientintrinsics
18
runtime example for app with kernel: 1.2 sec
if condition inside loop
![Page 29: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/29.jpg)
Make use of ISA
• helpthecompilertogenerateefficientintrinsics
18
runtime example for app with kernel: 0.8 sec
1.5x speedup
![Page 30: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/30.jpg)
Reduced precision math
• transcendentalfunctions,squareroots,etc.areexpensive
• use-fp-model fast=2 -no-prec-divduringcompilation
• replacedivisionsbyconstantswithmultiplicationswithinverse
• donotexpecttoomuch:benefitsusuallyonlyvisibleinheavilycompute-boundcodesections
• reducedprecisionmightnotalwaysbeacceptable
19
![Page 31: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/31.jpg)
Benefits of AVX-512
20
• medianspeedup:1.2x
• benefitscanbelargerthan2x(probablymoreefficientprefetching)
• automaticallyenabledwhencompilingforKNLarchitecture
![Page 32: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/32.jpg)
Use MCDRAM
• alwaysuse16GiBon-packagememory(MCDRAM)
• cacheworkswell:requestwith-C knl,cache
• codefitsinto16GiB:request-C knl,flatand prependexecutablewithnumactl -m 1
21
![Page 33: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/33.jpg)
A note on heap allocation
• KNLmemoryallocationiscomparablyslow
• avoidallocatingandde-allocatingmemoryfrequently
• removeallocations/deallocationsinloopbodiesorfunctionswhicharecalledmanytimes
• tooinvolved?poolallocatorlibraries(e.g.IntelTBBscalablememorypools)
• pros:
• overloadsnew/malloc,no/minimalsourcecodechangesnecessary
• cangivesignificantperformanceboostforcertaincodes
• takecareofthread-safety
• cons:
• memoryfootprintneedstobeknown/computedinadvance
• codemightbecomelessportable
22
![Page 34: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/34.jpg)
Multi-node optimizations
• singleKNLthreadcannotsaturateAriesinjectionrate
• usethread-levelcommunicationormultipleMPIrankspernode
• recommended:>4rankspernode
• dedicatecorestoOS:-S <ncores>insbatch(ncores=2goodchoice)
23
![Page 35: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/35.jpg)
Hugepages, DMAPP and hardware AMO
• hugepagescanreduceAriesTLBmisses
• loadcorrespondingmoduleatcompileandruntime
• canusedifferentmodulesatcompileandruntime
• MPI-collective-heavycodes:enableDMAPP(add-ldmapp) export MPICH_RMA_OVER_DMAPP=1export MPICH_USE_DMAPP_COLL=1export MPICH_NETWORK_BUFFER_COLL_OPT=1
• enablehardwareAMOforMPI-3RMAatomicsexport MPICH_RMA_USE_NETWORK_AMO=1
24
![Page 36: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/36.jpg)
Some notes on IO
25
SingleCoreI/OPerformanceonCori
Rela4vePerfo
rmance(KN
L/H
SW)
0.0
0.3
0.6
0.9
1.2
WriteBa
ndwidth(M
B/s)
0
300
600
900
1200
BufferedI/O SyncI/O DirectI/O
HSWKNLKNL/HSW
![Page 37: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/37.jpg)
Use multiple processes
• usemoreprocesses(e.g.withMPIIO)
• unfortunately,nogoodthreadedIOsolutionsavailableyet
• always:pool(writebigchunks),reducefileoperations(open,close)
• largefiles:burstbuffer
26
![Page 38: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/38.jpg)
Does it help?
27
• medianspeedupvs.Edison:1.15x
• medianspeedupvs.Haswell:0.70x
![Page 39: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/39.jpg)
Does it help?
27
• medianspeedupvs.Edison:1.8x
• medianspeedupvs.Haswell:1.0x
![Page 40: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/40.jpg)
Summary
• singlenodeperformance(goforthatonefirst)
• loopfusionandtiling
• ensuregoodvectorization
• useMCDRAM
• multi-nodeperformance
• hugepages
• DMAPP
• IOperformance
• usemultiplenodes,poolIO,reducefileoperationstominimum
28
![Page 41: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/41.jpg)
NERSC training material
• runningjobs
• process/threadbinding
• codeprofilingandtools
• measuringarithmeticintensity(AI)
• improvingOpenMPscaling
• vectorizationhelp
• howtouseMCDRAM
• NESAPcasestudies
29
![Page 42: Optimizing for Intel KNL - NERSC · performance is upper bound of what you get in multi-node • many optimization opportunities, fast turnaround times • many profiling tools available](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f61be4eba8b7f3f832eb54d/html5/thumbnails/42.jpg)
Thank you
30