
Page 1: Optimizing Code for Intel Xeon Phi 7250 (Knight's Landing)

Thorsten Kurth

NUG Training, 06/09/2017

Page 2: Multicore vs. manycore

• multicore (Edison)
  ‣ 5000 nodes
  ‣ 12 physical cores/CPU
  ‣ 24 HW threads/CPU
  ‣ 2.4-3.2 GHz
  ‣ 4 DP ops/cycle
  ‣ 30 MB L3 cache
  ‣ 64 GB/node
  ‣ 100 GB/s memory bandwidth

• manycore (Cori-KNL)
  ‣ 9600 nodes
  ‣ 68 physical cores/CPU
  ‣ 272 HW threads/CPU
  ‣ 1.2-1.6 GHz
  ‣ 2x8 DP ops/cycle
  ‣ no L3 cache
  ‣ 16 GB/node (fast) + 96 GB/node (slow)
  ‣ 450 GB/s memory bandwidth (fast)

Page 3: Recompile and go?

• x86-64 compatible: can use codes built for older architectures, or recompile

• self-hosted: no need for offloading

• median speedup vs. Edison: 1.15x

• median speedup vs. Haswell: 0.70x

Page 4: Why should I optimize my code?

• pros
  • get more for your buck: make efficient use of existing manycore HPC systems
  • fast success possible: many low-hanging fruits in unoptimized codes
  • investing in the future: heterogeneous architectures are energy efficient and thus will stay around for a while
  • benefits on multicore: optimizations targeting manycore architectures mostly improve performance on multicore systems as well

• cons
  • effort: many of the most beneficial optimizations require significant code changes
  • investing in the future: what if I bet on the wrong horse?

Page 5: Optimization targets

• single-node performance
  • start here: for a representative local problem size, single-node performance is an upper bound on what you get in multi-node runs
  • many optimization opportunities, fast turnaround times
  • many profiling tools available

• multi-node performance
  • fewer optimization opportunities, profiling/debugging tedious

• IO performance
  • not many opportunities for improvement

Page 6: Where do I start?

• get to know your application: don't assume you already do!

• determine hotspots
  • manual timers: be careful with thread safety / sync barriers (see the timer sketch after this list)
  • profiling tools: NERSC offers a lot
    • CrayPat (very lightweight)
    • Advisor (finds time-consuming loops)
    • VTune (can do a lot of things, but is also very slow)
    • MAP (comparably lightweight)

• found hotspots, now what?
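
A minimal manual-timer sketch illustrating the thread-safety point above (the kernel, names and sizes are illustrative, not from the slides): call omp_get_wtime() outside the parallel region, so only one thread reads the clock and no extra synchronization is added inside the hot loop.

    #include <cstdio>
    #include <omp.h>

    // Hypothetical hot kernel; the timing pattern around it is the point here.
    void kernel(double* x, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            x[i] = 2.0 * x[i] + 1.0;
        }
    }

    int main() {
        const int n = 1 << 24;
        double* x = new double[n]();

        // Time around the complete parallel region: the clock is read by the
        // master thread only, and the region's implicit barrier is included.
        double t0 = omp_get_wtime();
        kernel(x, n);
        double t1 = omp_get_wtime();
        std::printf("kernel time: %f s\n", t1 - t0);

        delete[] x;
        return 0;
    }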

Page 7: What architectural feature shall I target?

• KNL has many new features to explore
  • many threads
  • bigger vector units
  • complex intrinsics (ISA)
  • multiple memory tiers

• understand your hotspots
  • compute bound: more threads, vectorization, ISA
  • memory-bandwidth bound: memory tiers, more threads
  • memory-latency bound: more threads, vectorization

Page 8: Prerequisites - compile and run

• recompile your code for KNL: binaries built for older CPUs are supported, but they do not make full use of the new architecture
  • Cray (wrappers): module swap craype-haswell craype-mic-knl
  • Intel: -xmic-avx512
  • GNU: -march=knl

• use proper OpenMP settings:
  export OMP_NUM_THREADS=64
  export OMP_PLACES=threads
  export OMP_PROC_BIND=spread

• use the job-script generator on my.nersc.gov or the NERSC website

• node configuration: use -C knl,quad,cache as a start

Page 9: Prerequisites - #FLOPS

• #FLOPS: number of floating point operations

• manual calculation (see the example after this list):
  • float addition and multiplication: +1
  • complex multiplication: +6 (4 multiplications + 2 additions)
  • etc.

• measure with SDE (Intel Software Development Emulator):
  • using SDE is more precise, because it accounts for masking
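
A minimal example of the manual count (the kernel is illustrative, not from the slides): one multiplication and one addition per iteration, so #FLOPS = 2·n for the whole loop.

    // Illustrative axpy-style kernel: 2 floating point ops per iteration
    // (1 mul + 1 add), hence a manual count of #FLOPS = 2 * n.
    void axpy(int n, double a, const double* x, double* y) {
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }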

Page 10: Prerequisites - #BYTES

• #BYTES: number of bytes transferred from main memory

• manual calculation (not recommended, but a good check; see the example after this list):
  • count the bytes of data to be read and written in the kernel
  • does not account for data reuse through caching

• measure with VTune:
  • precisely obtains uncore counter events
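
Continuing the illustrative axpy kernel from the previous page, a manual byte count could look like this (caching and write-allocate traffic ignored):

    #include <cstddef>

    // Each iteration reads x[i] and y[i] and writes y[i]:
    // 3 * sizeof(double) = 24 bytes, so #BYTES = 24 * n for the whole loop.
    std::size_t axpy_bytes(std::size_t n) { return 3 * sizeof(double) * n; }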

Page 11: What is limiting my performance?

• Roofline performance model

• arithmetic intensity: AI = #FLOPS / #BYTES

• performance: P = #FLOPS / time [s]

• plot P vs. AI together with the architectural roofline R(AI) = min(memory bandwidth · AI, peak flops) (a worked example follows)
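
Worked example using the illustrative axpy kernel from the previous pages (an assumption for illustration, not from the slides), with the ~450 GB/s MCDRAM bandwidth quoted on page 2:

    AI    = #FLOPS / #BYTES = 2n / 24n ≈ 0.083 flops/byte
    R(AI) = min(450 GB/s · 0.083 flops/byte, peak flops) ≈ 37 GFLOP/s

This is far below the chip's DP peak, so such a kernel sits on the sloped part of the roofline: it is memory bandwidth bound, and getting its data into MCDRAM matters more than vectorization.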

Pages 12-19: Example roofline

[Roofline figure, with annotations built up across these pages: memory bandwidth bound → use MCDRAM; compute bound → utilize vectorization; (possibly) memory latency bound → threading, vectorization; improve AI to move a kernel to the right on the plot.]

Page 20: How to improve AI?

• definition of arithmetic intensity: AI = #FLOPS / #BYTES

• two possibilities
  • number of flops ⬆, number of bytes ➡ (not possible/easy: the choice of algorithm determines the flops)
  • number of flops ➡, number of bytes ⬇

• reality: a tradeoff between both

Page 21: Create more work/thread

• loop/kernel fusion: improves cache re-use and reduces overhead

• collapse nested loops (see the sketch after this list)

• rearrange data structures: move OpenMP out (coarse grain)
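
A minimal sketch of the collapse point, with illustrative array names and sizes (not from the slides): collapsing the two outer loops gives OpenMP ni·nj iterations to distribute instead of only ni, which matters when ni is small compared to the 64-272 threads on a KNL node.

    #include <cstddef>
    #include <vector>

    // Illustrative: ni alone may be too small to keep 64+ threads busy,
    // so fuse the two outer loops into one parallel iteration space.
    void scale(std::vector<double>& a, int ni, int nj, int nk, double s) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < ni; ++i) {
            for (int j = 0; j < nj; ++j) {
                for (int k = 0; k < nk; ++k) {
                    a[(static_cast<std::size_t>(i) * nj + j) * nk + k] *= s;
                }
            }
        }
    }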

Pages 22-23: Loop transformations I

• loop tiling: improves cache re-use and can significantly improve performance (see the sketch after this list)

• especially relevant on KNL because of the missing L3

• blocking to the shared L2 (512 KiB) is usually a good choice

• was my transformation successful? check L1/L2 miss rates, e.g. in VTune
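
A minimal tiling sketch for a 2D loop nest (tile size and names are illustrative, not from the slides); the tile size is chosen so that the tiles being worked on fit into the 512 KiB L2 slice:

    // Illustrative 2D blocking: process the n x n matrices in B x B tiles so
    // that the c and a tiles stay in L2 (there is no L3 to fall back to on KNL).
    constexpr int B = 64;  // 2 * 64 * 64 doubles = 64 KiB per tile pair

    void add_transpose(double* c, const double* a, int n) {
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += B) {
            for (int jj = 0; jj < n; jj += B) {
                for (int i = ii; i < ii + B && i < n; ++i) {
                    for (int j = jj; j < jj + B && j < n; ++j) {
                        // column-wise access to a is what makes tiling pay off
                        c[i * n + j] += a[j * n + i];
                    }
                }
            }
        }
    }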

Pages 24-25: Loop transformations II

• short-loop unrolling: helps the compiler vectorize the right loops (see the sketch after this list)

• unrolling pragmas are helpful too

• check compiler optimization reports

• use Intel Advisor
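
A minimal sketch of the idea, with illustrative names: when the innermost loop has a short, known trip count (here 3 vector components), fully unrolling it lets the compiler vectorize the long outer loop instead.

    // Illustrative: the 3-iteration inner loop is too short to vectorize well.
    // Unrolling it by hand (or with '#pragma unroll(3)' on the Intel compiler)
    // exposes the long i-loop as the vectorization candidate.
    void norms(const double* v, double* out, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            // unrolled: s += v[3*i + c] * v[3*i + c] for c = 0, 1, 2
            s += v[3 * i + 0] * v[3 * i + 0];
            s += v[3 * i + 1] * v[3 * i + 1];
            s += v[3 * i + 2] * v[3 * i + 2];
            out[i] = s;
        }
    }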

Page 26: Data alignment

• align (and pad) data to 64-byte boundaries to improve prefetching

• can be done easily in the major programming languages
  • Fortran: -align array64byte (ifort; gfortran does it automagically)
  • C/C++: aligned_alloc(64, <size>), __attribute__ ((aligned(64))), __declspec(align(64)) (see the sketch after this list)
  • C++ trick: overload the new operator

• advanced: manually pad data if array extents are powers of 2, to minimize cache-associativity conflicts
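
A minimal C++17 sketch of the aligned_alloc route (the size is illustrative); telling the compiler about the alignment, e.g. via the OpenMP aligned clause, is what actually enables aligned vector loads/stores:

    #include <cstdlib>   // std::aligned_alloc / std::free (C++17)

    int main() {
        const std::size_t n = 1024;  // illustrative; n * 8 B must be a multiple of 64
        // Allocate on a 64-byte boundary (cache line / AVX-512 vector width).
        double* x = static_cast<double*>(std::aligned_alloc(64, n * sizeof(double)));

        #pragma omp simd aligned(x : 64)  // promise 64-byte alignment to the compiler
        for (std::size_t i = 0; i < n; ++i) {
            x[i] = 0.0;
        }

        std::free(x);
        return 0;
    }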

Pages 27-29: Make use of ISA

• help the compiler to generate efficient intrinsics (see the sketch after this list)

• runtime example for an app whose kernel has an if condition inside the loop: 1.2 sec before the change, 0.8 sec after (a 1.5x speedup)
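
The kernel from the slides is not included in this transcript; as a stand-in, here is a generic illustration of the pattern: a data-dependent if inside a hot loop can often be rewritten as a branch-free select, which the compiler can map to AVX-512 masked/blend instructions.

    // Illustrative only (not the kernel from the slides).
    // Before: data-dependent branch inside the hot loop.
    void clamp_branchy(double* x, int n, double lo) {
        for (int i = 0; i < n; ++i) {
            if (x[i] < lo) {       // branch can block efficient vectorization
                x[i] = lo;
            }
        }
    }

    // After: branch-free select; the whole loop vectorizes with masking.
    void clamp_select(double* x, int n, double lo) {
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            x[i] = (x[i] < lo) ? lo : x[i];
        }
    }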

Page 30: Reduced precision math

• transcendental functions, square roots, etc. are expensive

• use -fp-model fast=2 -no-prec-div during compilation

• replace divisions by constants with multiplications by the inverse (see the sketch after this list)

• do not expect too much: benefits are usually only visible in heavily compute-bound code sections

• reduced precision might not always be acceptable
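
A minimal illustration of the division point (names illustrative): hoist the reciprocal of the constant out of the loop and multiply; since this changes rounding slightly, it is a reduced-precision transformation and must be acceptable for the code.

    // Illustrative: one reciprocal outside the loop instead of a divide
    // per iteration.
    void rate(double* r, const double* dx, int n, double dt) {
        const double inv_dt = 1.0 / dt;   // computed once
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            r[i] = dx[i] * inv_dt;        // instead of dx[i] / dt
        }
    }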

Page 31: Benefits of AVX-512

• median speedup: 1.2x

• benefits can be larger than 2x (probably due to more efficient prefetching)

• automatically enabled when compiling for the KNL architecture

Page 32: Use MCDRAM

• always use the 16 GiB of on-package memory (MCDRAM)

• cache mode works well: request it with -C knl,cache

• code fits into 16 GiB: request -C knl,flat and prepend the executable with numactl -m 1

Page 33: A note on heap allocation

• KNL memory allocation is comparably slow

• avoid allocating and de-allocating memory frequently

• remove allocations/deallocations from loop bodies or from functions which are called many times (see the sketch after this list)

• too involved? use pool-allocator libraries (e.g. Intel TBB scalable memory pools)
  • pros:
    • they overload new/malloc, so no/minimal source code changes are necessary
    • can give a significant performance boost for certain codes
    • take care of thread safety
  • cons:
    • the memory footprint needs to be known/computed in advance
    • the code might become less portable
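
A minimal sketch of the "remove allocations from hot functions" point, with illustrative names: let the caller own the scratch buffer and reuse it across calls instead of allocating and freeing it every time.

    #include <vector>

    // Before (illustrative): allocates and frees a scratch vector on every call.
    double work_slow(const double* x, int n) {
        std::vector<double> scratch(n);
        for (int i = 0; i < n; ++i) scratch[i] = 2.0 * x[i];
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += scratch[i];
        return s;
    }

    // After: the caller passes a reusable buffer; once its capacity is large
    // enough, no further allocations happen across many calls.
    double work_fast(const double* x, int n, std::vector<double>& scratch) {
        scratch.resize(n);
        for (int i = 0; i < n; ++i) scratch[i] = 2.0 * x[i];
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += scratch[i];
        return s;
    }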

Page 34: Multi-node optimizations

• a single KNL thread cannot saturate the Aries injection rate

• use thread-level communication or multiple MPI ranks per node (see the sketch after this list)

• recommended: >4 ranks per node

• dedicate cores to the OS: -S <ncores> in sbatch (ncores=2 is a good choice)
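
A minimal sketch of the thread-level-communication option (an assumption about the intended setup, not code from the slides): request MPI_THREAD_MULTIPLE so that several OpenMP threads per rank can issue MPI calls concurrently.

    #include <mpi.h>

    int main(int argc, char** argv) {
        // Ask for full thread support so multiple threads can drive
        // communication; a single thread cannot saturate the Aries NIC.
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            // Fall back: run more MPI ranks per node instead of threaded comms.
        }
        // ... threaded communication / compute ...
        MPI_Finalize();
        return 0;
    }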

Page 35: Hugepages, DMAPP and hardware AMO

• hugepages can reduce Aries TLB misses
  • load the corresponding module at compile and run time
  • different modules can be used at compile and run time

• MPI-collective-heavy codes: enable DMAPP (add -ldmapp)
  export MPICH_RMA_OVER_DMAPP=1
  export MPICH_USE_DMAPP_COLL=1
  export MPICH_NETWORK_BUFFER_COLL_OPT=1

• enable hardware AMO for MPI-3 RMA atomics:
  export MPICH_RMA_USE_NETWORK_AMO=1

Page 36: Some notes on IO

[Figure: single-core I/O performance on Cori. Write bandwidth (MB/s) for buffered I/O, sync I/O and direct I/O on Haswell (HSW) and KNL, together with the relative performance KNL/HSW.]

Page 37: Use multiple processes

• use more processes (e.g. with MPI-IO; see the sketch after this list)

• unfortunately, no good threaded IO solutions are available yet

• always: pool IO (write big chunks), reduce file operations (open, close)

• large files: burst buffer
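
A minimal MPI-IO sketch of the pooling advice (file name and sizes are illustrative): each rank gathers its output into one large buffer and all ranks write it with a single collective call to one shared file.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;                       // doubles per rank (illustrative)
        std::vector<double> buf(count, static_cast<double>(rank));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        // One collective write of a big contiguous chunk per rank,
        // instead of many small writes and open/close cycles.
        const MPI_Offset offset =
            static_cast<MPI_Offset>(rank) * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf.data(), count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }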

Page 38: Does it help?

• recompile only (cf. page 3):
  • median speedup vs. Edison: 1.15x
  • median speedup vs. Haswell: 0.70x

Page 39: Does it help?

• after optimization:
  • median speedup vs. Edison: 1.8x
  • median speedup vs. Haswell: 1.0x

Page 40: Summary

• single-node performance (go for that one first)
  • loop fusion and tiling
  • ensure good vectorization
  • use MCDRAM

• multi-node performance
  • hugepages
  • DMAPP

• IO performance
  • use multiple nodes, pool IO, reduce file operations to a minimum

Page 42: Thank you