
Page 1: Optimizing Code for Intel Xeon Phi 7250 (Knight's Landing)

Thorsten Kurth

NUG Training, 06/09/2017

Page 2: Multicore vs. manycore

• multicore (Edison)
  ‣ 5000 nodes
  ‣ 12 physical cores/CPU
  ‣ 24 HW threads/CPU
  ‣ 2.4-3.2 GHz
  ‣ 4 DP ops/cycle
  ‣ 30 MB L3 cache
  ‣ 64 GB/node
  ‣ 100 GB/s memory bandwidth

• manycore (Cori-KNL)
  ‣ 9600 nodes
  ‣ 68 physical cores/CPU
  ‣ 272 HW threads/CPU
  ‣ 1.2-1.6 GHz
  ‣ 2x8 DP ops/cycle
  ‣ no L3 cache
  ‣ 16 GB/node (fast) + 96 GB/node (slow)
  ‣ 450 GB/s memory bandwidth (fast)

Page 3: Recompile and go?

• x86-64 compatible: can use codes built for older architectures, or recompile

• self-hosted: no need for offloading

• median speedup vs. Edison: 1.15x

• median speedup vs. Haswell: 0.70x

Page 4: Why should I optimize my code?

• pros
  • get more for your buck: make efficient use of existing manycore HPC systems
  • fast success possible: many low-hanging fruits in unoptimized codes
  • investing in the future: heterogeneous architectures are energy efficient and thus will stay around for a while
  • benefits on multicore: optimizations targeting manycore architectures mostly improve performance on multicore systems as well

• cons
  • effort: many of the most beneficial optimizations require significant code changes
  • investing in the future: what if I bet on the wrong horse?

Page 5: Optimization targets

• single-node performance
  • start here: for a representative local problem size, single-node performance is an upper bound on what you get in multi-node runs
  • many optimization opportunities, fast turnaround times
  • many profiling tools available

• multi-node performance
  • fewer optimization opportunities, profiling/debugging tedious

• IO performance
  • not many opportunities for improvement

Page 6: Where do I start?

• get to know your application: don't assume you already do!

• determine hotspots
  • manual timers: be careful with thread safety / sync barriers (see the timer sketch after this list)
  • profiling tools: NERSC offers a lot
    • CrayPat (very lightweight)
    • Advisor (finds time-consuming loops)
    • VTune (can do a lot of things, but is also very slow)
    • MAP (comparably lightweight)

• found hotspots, now what?
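
A minimal manual-timer sketch illustrating the thread-safety point above (the kernel, names and sizes are illustrative, not from the slides): call omp_get_wtime() outside the parallel region, so only one thread reads the clock and no extra synchronization is added inside the hot loop.

    #include <cstdio>
    #include <omp.h>

    // Hypothetical hot kernel; the timing pattern around it is the point here.
    void kernel(double* x, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            x[i] = 2.0 * x[i] + 1.0;
        }
    }

    int main() {
        const int n = 1 << 24;
        double* x = new double[n]();

        // Time around the complete parallel region: the clock is read by the
        // master thread only, and the region's implicit barrier is included.
        double t0 = omp_get_wtime();
        kernel(x, n);
        double t1 = omp_get_wtime();
        std::printf("kernel time: %f s\n", t1 - t0);

        delete[] x;
        return 0;
    }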

Page 7: What architectural feature shall I target?

• KNL has many new features to explore
  • many threads
  • bigger vector units
  • complex intrinsics (ISA)
  • multiple memory tiers

• understand your hotspots
  • compute bound: more threads, vectorization, ISA
  • memory-bandwidth bound: memory tiers, more threads
  • memory-latency bound: more threads, vectorization

Page 8: Prerequisites - compile and run

• recompile your code for KNL: binaries built for older CPUs are supported, but they do not make full use of the new architecture
  • Cray (wrappers): module swap craype-haswell craype-mic-knl
  • Intel: -xmic-avx512
  • GNU: -march=knl

• use proper OpenMP settings:
  export OMP_NUM_THREADS=64
  export OMP_PLACES=threads
  export OMP_PROC_BIND=spread

• use the job-script generator on my.nersc.gov or the NERSC website

• node configuration: use -C knl,quad,cache as a start

Page 9: Prerequisites - #FLOPS

• #FLOPS: number of floating point operations

• manual calculation (see the example after this list):
  • float addition and multiplication: +1
  • complex multiplication: +6 (4 multiplications + 2 additions)
  • etc.

• measure with SDE (Intel Software Development Emulator):
  • using SDE is more precise, because it accounts for masking
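
A minimal example of the manual count (the kernel is illustrative, not from the slides): one multiplication and one addition per iteration, so #FLOPS = 2·n for the whole loop.

    // Illustrative axpy-style kernel: 2 floating point ops per iteration
    // (1 mul + 1 add), hence a manual count of #FLOPS = 2 * n.
    void axpy(int n, double a, const double* x, double* y) {
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }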

Page 10: Prerequisites - #BYTES

• #BYTES: number of bytes transferred from main memory

• manual calculation (not recommended, but a good check; see the example after this list):
  • count the bytes of data to be read and written in the kernel
  • does not account for data reuse through caching

• measure with VTune:
  • precisely obtains uncore counter events
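
Continuing the illustrative axpy kernel from the previous page, a manual byte count could look like this (caching and write-allocate traffic ignored):

    #include <cstddef>

    // Each iteration reads x[i] and y[i] and writes y[i]:
    // 3 * sizeof(double) = 24 bytes, so #BYTES = 24 * n for the whole loop.
    std::size_t axpy_bytes(std::size_t n) { return 3 * sizeof(double) * n; }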

Page 11: What is limiting my performance?

• Roofline performance model

• arithmetic intensity: AI = #FLOPS / #BYTES

• performance: P = #FLOPS / time [s]

• plot P vs. AI together with the architectural roofline R(AI) = min(memory bandwidth · AI, peak flops) (a worked example follows)
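
Worked example using the illustrative axpy kernel from the previous pages (an assumption for illustration, not from the slides), with the ~450 GB/s MCDRAM bandwidth quoted on page 2:

    AI    = #FLOPS / #BYTES = 2n / 24n ≈ 0.083 flops/byte
    R(AI) = min(450 GB/s · 0.083 flops/byte, peak flops) ≈ 37 GFLOP/s

This is far below the chip's DP peak, so such a kernel sits on the sloped part of the roofline: it is memory bandwidth bound, and getting its data into MCDRAM matters more than vectorization.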

Pages 12-19: Example roofline

[Roofline figure, with annotations built up across these pages: memory bandwidth bound → use MCDRAM; compute bound → utilize vectorization; (possibly) memory latency bound → threading, vectorization; improve AI to move a kernel to the right on the plot.]

Page 20: How to improve AI?

• definition of arithmetic intensity: AI = #FLOPS / #BYTES

• two possibilities
  • number of flops ⬆, number of bytes ➡ (not possible/easy: the choice of algorithm determines the flops)
  • number of flops ➡, number of bytes ⬇

• reality: a tradeoff between both

Page 21: Create more work/thread

• loop/kernel fusion: improves cache re-use and reduces overhead

• collapse nested loops (see the sketch after this list)

• rearrange data structures: move OpenMP out (coarse grain)
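
A minimal sketch of the collapse point, with illustrative array names and sizes (not from the slides): collapsing the two outer loops gives OpenMP ni·nj iterations to distribute instead of only ni, which matters when ni is small compared to the 64-272 threads on a KNL node.

    #include <cstddef>
    #include <vector>

    // Illustrative: ni alone may be too small to keep 64+ threads busy,
    // so fuse the two outer loops into one parallel iteration space.
    void scale(std::vector<double>& a, int ni, int nj, int nk, double s) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < ni; ++i) {
            for (int j = 0; j < nj; ++j) {
                for (int k = 0; k < nk; ++k) {
                    a[(static_cast<std::size_t>(i) * nj + j) * nk + k] *= s;
                }
            }
        }
    }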

Pages 22-23: Loop transformations I

• loop tiling: improves cache re-use and can significantly improve performance (see the sketch after this list)

• especially relevant on KNL because of the missing L3

• blocking to the shared L2 (512 KiB) is usually a good choice

• was my transformation successful? check L1/L2 miss rates, e.g. in VTune
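
A minimal tiling sketch for a 2D loop nest (tile size and names are illustrative, not from the slides); the tile size is chosen so that the tiles being worked on fit into the 512 KiB L2 slice:

    // Illustrative 2D blocking: process the n x n matrices in B x B tiles so
    // that the c and a tiles stay in L2 (there is no L3 to fall back to on KNL).
    constexpr int B = 64;  // 2 * 64 * 64 doubles = 64 KiB per tile pair

    void add_transpose(double* c, const double* a, int n) {
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += B) {
            for (int jj = 0; jj < n; jj += B) {
                for (int i = ii; i < ii + B && i < n; ++i) {
                    for (int j = jj; j < jj + B && j < n; ++j) {
                        // column-wise access to a is what makes tiling pay off
                        c[i * n + j] += a[j * n + i];
                    }
                }
            }
        }
    }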

Pages 24-25: Loop transformations II

• short-loop unrolling: helps the compiler vectorize the right loops (see the sketch after this list)

• unrolling pragmas are helpful too

• check compiler optimization reports

• use Intel Advisor
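
A minimal sketch of the idea, with illustrative names: when the innermost loop has a short, known trip count (here 3 vector components), fully unrolling it lets the compiler vectorize the long outer loop instead.

    // Illustrative: the 3-iteration inner loop is too short to vectorize well.
    // Unrolling it by hand (or with '#pragma unroll(3)' on the Intel compiler)
    // exposes the long i-loop as the vectorization candidate.
    void norms(const double* v, double* out, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            // unrolled: s += v[3*i + c] * v[3*i + c] for c = 0, 1, 2
            s += v[3 * i + 0] * v[3 * i + 0];
            s += v[3 * i + 1] * v[3 * i + 1];
            s += v[3 * i + 2] * v[3 * i + 2];
            out[i] = s;
        }
    }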

Page 26: Data alignment

• align (and pad) data to 64-byte boundaries to improve prefetching

• can be done easily in the major programming languages
  • Fortran: -align array64byte (ifort; gfortran does it automagically)
  • C/C++: aligned_alloc(64, <size>), __attribute__ ((aligned(64))), __declspec(align(64)) (see the sketch after this list)
  • C++ trick: overload the new operator

• advanced: manually pad data if array extents are powers of 2, to minimize cache-associativity conflicts
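
A minimal C++17 sketch of the aligned_alloc route (the size is illustrative); telling the compiler about the alignment, e.g. via the OpenMP aligned clause, is what actually enables aligned vector loads/stores:

    #include <cstdlib>   // std::aligned_alloc / std::free (C++17)

    int main() {
        const std::size_t n = 1024;  // illustrative; n * 8 B must be a multiple of 64
        // Allocate on a 64-byte boundary (cache line / AVX-512 vector width).
        double* x = static_cast<double*>(std::aligned_alloc(64, n * sizeof(double)));

        #pragma omp simd aligned(x : 64)  // promise 64-byte alignment to the compiler
        for (std::size_t i = 0; i < n; ++i) {
            x[i] = 0.0;
        }

        std::free(x);
        return 0;
    }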

Pages 27-29: Make use of ISA

• help the compiler to generate efficient intrinsics (see the sketch after this list)

• runtime example for an app whose kernel has an if condition inside the loop: 1.2 sec before the change, 0.8 sec after (a 1.5x speedup)
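
The kernel from the slides is not included in this transcript; as a stand-in, here is a generic illustration of the pattern: a data-dependent if inside a hot loop can often be rewritten as a branch-free select, which the compiler can map to AVX-512 masked/blend instructions.

    // Illustrative only (not the kernel from the slides).
    // Before: data-dependent branch inside the hot loop.
    void clamp_branchy(double* x, int n, double lo) {
        for (int i = 0; i < n; ++i) {
            if (x[i] < lo) {       // branch can block efficient vectorization
                x[i] = lo;
            }
        }
    }

    // After: branch-free select; the whole loop vectorizes with masking.
    void clamp_select(double* x, int n, double lo) {
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            x[i] = (x[i] < lo) ? lo : x[i];
        }
    }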

Page 30: Reduced precision math

• transcendental functions, square roots, etc. are expensive

• use -fp-model fast=2 -no-prec-div during compilation

• replace divisions by constants with multiplications by the inverse (see the sketch after this list)

• do not expect too much: benefits are usually only visible in heavily compute-bound code sections

• reduced precision might not always be acceptable
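
A minimal illustration of the division point (names illustrative): hoist the reciprocal of the constant out of the loop and multiply; since this changes rounding slightly, it is a reduced-precision transformation and must be acceptable for the code.

    // Illustrative: one reciprocal outside the loop instead of a divide
    // per iteration.
    void rate(double* r, const double* dx, int n, double dt) {
        const double inv_dt = 1.0 / dt;   // computed once
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            r[i] = dx[i] * inv_dt;        // instead of dx[i] / dt
        }
    }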

Page 31: Benefits of AVX-512

• median speedup: 1.2x

• benefits can be larger than 2x (probably due to more efficient prefetching)

• automatically enabled when compiling for the KNL architecture

Page 32: Use MCDRAM

• always use the 16 GiB of on-package memory (MCDRAM)

• cache mode works well: request it with -C knl,cache

• code fits into 16 GiB: request -C knl,flat and prepend the executable with numactl -m 1

Page 33: A note on heap allocation

• KNL memory allocation is comparably slow

• avoid allocating and de-allocating memory frequently

• remove allocations/deallocations from loop bodies or from functions which are called many times (see the sketch after this list)

• too involved? use pool-allocator libraries (e.g. Intel TBB scalable memory pools)
  • pros:
    • they overload new/malloc, so no/minimal source code changes are necessary
    • can give a significant performance boost for certain codes
    • take care of thread safety
  • cons:
    • the memory footprint needs to be known/computed in advance
    • the code might become less portable
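
A minimal sketch of the "remove allocations from hot functions" point, with illustrative names: let the caller own the scratch buffer and reuse it across calls instead of allocating and freeing it every time.

    #include <vector>

    // Before (illustrative): allocates and frees a scratch vector on every call.
    double work_slow(const double* x, int n) {
        std::vector<double> scratch(n);
        for (int i = 0; i < n; ++i) scratch[i] = 2.0 * x[i];
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += scratch[i];
        return s;
    }

    // After: the caller passes a reusable buffer; once its capacity is large
    // enough, no further allocations happen across many calls.
    double work_fast(const double* x, int n, std::vector<double>& scratch) {
        scratch.resize(n);
        for (int i = 0; i < n; ++i) scratch[i] = 2.0 * x[i];
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += scratch[i];
        return s;
    }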

Page 34: Multi-node optimizations

• a single KNL thread cannot saturate the Aries injection rate

• use thread-level communication or multiple MPI ranks per node (see the sketch after this list)

• recommended: >4 ranks per node

• dedicate cores to the OS: -S <ncores> in sbatch (ncores=2 is a good choice)
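
A minimal sketch of the thread-level-communication option (an assumption about the intended setup, not code from the slides): request MPI_THREAD_MULTIPLE so that several OpenMP threads per rank can issue MPI calls concurrently.

    #include <mpi.h>

    int main(int argc, char** argv) {
        // Ask for full thread support so multiple threads can drive
        // communication; a single thread cannot saturate the Aries NIC.
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            // Fall back: run more MPI ranks per node instead of threaded comms.
        }
        // ... threaded communication / compute ...
        MPI_Finalize();
        return 0;
    }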

Page 35: Hugepages, DMAPP and hardware AMO

• hugepages can reduce Aries TLB misses
  • load the corresponding module at compile and run time
  • different modules can be used at compile and run time

• MPI-collective-heavy codes: enable DMAPP (add -ldmapp)
  export MPICH_RMA_OVER_DMAPP=1
  export MPICH_USE_DMAPP_COLL=1
  export MPICH_NETWORK_BUFFER_COLL_OPT=1

• enable hardware AMO for MPI-3 RMA atomics:
  export MPICH_RMA_USE_NETWORK_AMO=1

Page 36: Some notes on IO

[Figure: single-core I/O performance on Cori. Write bandwidth (MB/s) for buffered I/O, sync I/O and direct I/O on Haswell (HSW) and KNL, together with the relative performance KNL/HSW.]

Page 37: Use multiple processes

• use more processes (e.g. with MPI-IO; see the sketch after this list)

• unfortunately, no good threaded IO solutions are available yet

• always: pool IO (write big chunks), reduce file operations (open, close)

• large files: burst buffer
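
A minimal MPI-IO sketch of the pooling advice (file name and sizes are illustrative): each rank gathers its output into one large buffer and all ranks write it with a single collective call to one shared file.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;                       // doubles per rank (illustrative)
        std::vector<double> buf(count, static_cast<double>(rank));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        // One collective write of a big contiguous chunk per rank,
        // instead of many small writes and open/close cycles.
        const MPI_Offset offset =
            static_cast<MPI_Offset>(rank) * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf.data(), count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }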

Page 38: Does it help?

• recompile only (cf. page 3):
  • median speedup vs. Edison: 1.15x
  • median speedup vs. Haswell: 0.70x

Page 39: Does it help?

• after optimization:
  • median speedup vs. Edison: 1.8x
  • median speedup vs. Haswell: 1.0x

Page 40: Summary

• single-node performance (go for that one first)
  • loop fusion and tiling
  • ensure good vectorization
  • use MCDRAM

• multi-node performance
  • hugepages
  • DMAPP

• IO performance
  • use multiple nodes, pool IO, reduce file operations to a minimum

Page 42: Thank you