Programming the new KNL Cluster at LRZ: Hardware Overview
Dr. Momme Allalen, January 24-25, 2018 @ LRZ
Agenda
● Introduction: multicore systems & accelerators in HPC
● Architecture overview of the Intel Xeon Phi products (MIC)
● KNL vs. KNC
● KNC and KNL programming models
● Hands-on
Multicore Processors
How many cores is too many?
● Intel/AMD: 2, 4, 8, 16, 40 cores
● IBM: Cell with 8-16 SPUs; POWER8 with 12 cores and 8 threads per core
● NVIDIA Pascal/Volta GPUs: 3584/5120 cores

Multicore is hardware and software together: the two challenge and inspire each other.

More transistors bring worse reliability: errors and faults require detection, correction and recovery.

Memory:
● Memory wall due to bandwidth (scalability?)
● Memory wall due to power (the interconnect needs power)
● Memory size grows, but data always grows even more
Why do we need “accelerators” in HPC?
• It is tempting to assume that the more cores your CPU contains, the faster it performs overall.
• In the past, computers got faster by increasing the clock frequency of the core, but this has now reached its limit, mainly due to power requirements (the supply voltage could no longer be decreased).
• Today, processor cores are not getting any faster; instead, the number of cores per chip increases and registers are getting wider.
• In HPC, we need chips that provide:
  • higher computing performance
  • high power efficiency: keep the power per core as low as possible
• Developing a new chip is incredibly expensive, so we must make maximum use of existing technology.
• One solution is a heterogeneous system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support.
• There are two main hardware options: the Intel Xeon Phi (KNC) and NVIDIA GPUs.
• Both can perform many parallel operations every clock cycle while keeping the power per core as low as possible.
Intel Multi-Core Architecture
● Intel Xeon processors are general-purpose processors.
● The current architectures are Haswell and Broadwell; Skylake is upcoming.
Architectures Comparison (CPU vs GPU)
• CPU: large caches and sophisticated flow control minimise latency for arbitrary memory accesses.
• GPU: simple flow control; more transistors are devoted to parallel computation (SIMD), up to 21 billion on the NVIDIA Volta GPU.
[Figure: block diagrams. The CPU devotes much of its die to control logic and cache, with a few ALUs and DRAM; the GPU consists of many small ALUs with little control logic and cache, plus DRAM.]
                    Intel Xeon E5-2697 v4 “Broadwell”   NVIDIA P100          NVIDIA V100
Cores @ clock       2 x 18 cores @ ≥ 2.3 GHz            56 SMs @ 1.4 GHz     80 SMs
SP perf. per core   ≥ 73.6 GFlop/s                      up to 166 GFlop/s    -
SP peak             ≥ 2.6 TFlop/s                       up to 10.6 TFlop/s   up to 15 TFlop/s
Transistors / TDP   2 x 7 billion / 2 x 145 W           14 billion / 300 W   21 billion / 300 W
Memory bandwidth    2 x 62.5 GB/s                       510 GB/s             up to 900 GB/s
Intel Xeon Phi Products: the Intel Many Integrated Core (MIC) Architecture
• Xeon Phi coprocessor: the first product, released in 2012, was named Knights Corner (KNC); it was the first architecture supporting 512-bit vectors.
• Xeon Phi processor: the 2nd generation, announced at ISC'16 in June 2016, is named Knights Landing (KNL); it also supports 512-bit vectors, with a new instruction set called Intel Advanced Vector Extensions 512 (Intel AVX-512).
• A specialised platform for highly demanding computing applications.
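To give a feel for these 512-bit vectors, here is a tiny AVX-512 intrinsics illustration (not from the slides; compile e.g. with icc -xMIC-AVX512):

#include <stdio.h>
#include <immintrin.h>   /* AVX-512 intrinsics */

int main(void) {
    /* One 512-bit register holds 8 doubles. */
    __m512d x = _mm512_set1_pd(1.5);
    __m512d y = _mm512_set1_pd(2.5);
    __m512d z = _mm512_add_pd(x, y);   /* 8 additions in a single instruction */

    double out[8];
    _mm512_storeu_pd(out, z);
    printf("out[0] = %f\n", out[0]);   /* prints 4.000000 */
    return 0;
}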
Xeon Phi Performance in Practice: e.g. WRF (Weather Research and Forecasting) Code Performance Timeline (on a single node)
[Figure: WRF single-precision performance (GF/s) per node, 2011-2016:
 Sandy Bridge (16 cores): 31 GF/s initial, 41 GF/s optimised
 KNC (60 cores): 16 GF/s initial; KNC (61 cores): 42 GF/s optimised
 Haswell: 63 GF/s; Broadwell (36 cores): 73 GF/s
 Pre-production KNL: 89 GF/s; KNL (68 cores): 125 GF/s with OpenMP, 2 threads per core
 Source: http://www2.mmm.ucar.edu/wrf/WG2/benchv3/]
Intel Xeon Phi KNC Architecture

In common with Intel multi-core Xeon CPUs:
• x86 architecture
• C, C++ and Fortran
• Standard parallelization libraries
• Similar optimization methods

Specific to the coprocessor:
• PCIe bus connection
• IP-addressable
• Runs its own Linux OS
• 6, 8 or 16 GB of GDDR5 memory
• 57 up to 61 x86 64-bit cores
• 1.1, 1.053 or 1.238 GHz
• 1 to 2 TFlop/s DP performance
• 512-bit SIMD vector registers
• 4-way hyper-threading
• The SSE & AVX SIMD instruction sets are not supported
• Instead: Intel Initial Many Core Instructions (IMCI)
• Each core has a private L2 cache.
• A bidirectional ring interconnect connects all the cores, L2 caches, PCIe client logic, GDDR5 memory controllers, etc.
Network Access on KNC
• Network access is possible using TCP/IP tools such as ssh.
• NFS mounts on the Xeon Phi are supported.
• Proxy console / file I/O.
First experiences with the Intel MIC architecture at LRZ, Volker Weinberg and Momme Allalen, inSiDE Vol. 11 No. 2, Autumn 2013
Architectures Comparison
[Figure: block diagrams of CPU, MIC and GPU.
 CPU: general-purpose architecture (control logic, cache, a few ALUs, DRAM).
 MIC: power-efficient multiprocessor x86 design.
 GPU: massively data-parallel (many ALUs, DRAM).]
Intel Xeon Phi Knights Landing (KNL)
• 2nd generation Xeon Phi
• Successor to the 1st generation Knights Corner
• New cores
• New processor architecture
• New operation modes
• New memory systems
Intel Xeon Phi Knights Landing (KNL) Architecture
• Bootable CPU
• Up to 72 cores based on Intel Atom cores (Silvermont microarchitecture)
• 4-way hyper-threading, running @ 1.3 to 1.5 GHz
• 3+ TFlop/s DP (FMA)
• 6+ TFlop/s SP (FMA)
• ~384 GB DDR4 (> 90 GB/s)
• 16 GB HBM (MCDRAM): > 450 GB/s STREAM performance
• Binary-compatible with Xeon
• Common operating systems (SUSE, Windows, RHEL, …)
KNC vs KNL
KNC                                        KNL
● Co-processor (on PCIe)                   ● No PCIe: self-hosted
● Binary-incompatible with other           ● Binary-compatible with prior Xeon
  architectures                              architectures (non-Phi)
● 61 in-order cores                        ● Up to 72 out-of-order cores
● 1.1 GHz processor                        ● 1.4 GHz processor
● Up to 16 GB RAM                          ● Up to 400 GB RAM (with MCDRAM)
● 22 nm process                            ● 14 nm process
● One 512-bit VPU                          ● Two 512-bit VPUs
● No support for branch prediction and     ● Support for branch prediction and
  fast unaligned memory access               fast unaligned memory access
Despite the hardware improvements, KNL still performs poorly on non-optimised code.
KNL core
• 8-way 32 KB instruction cache
• 8-way 32 KB data cache
• 2 VPUs per core
  • Only one of them can run legacy (pre-AVX-512) operations
  • Compile with -xMIC-AVX512 to use both
Hyperthreading on Xeon Phi
● KNC required at least 2 threads per core for sensible compute performance.
● KNL does not: it can run up to 4 threads per core efficiently.
● Several applications do not need any hyper-threads at all.
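Thread count and placement are typically controlled through standard OpenMP environment variables; a minimal sketch, assuming a 68-core KNL run with 2 threads per core and a hypothetical OpenMP binary ./myapp:

$ export OMP_NUM_THREADS=136      # 68 cores x 2 hyper-threads
$ export OMP_PLACES=cores         # one place per physical core
$ export OMP_PROC_BIND=close      # pack threads onto neighbouring places
$ ./myapp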
Invocation of the Intel MPI compiler
Language   MPI compiler   Compiler
C          mpiicc         icc
C++        mpiicpc        icpc
Fortran    mpiifort       ifort
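For example, a hypothetical MPI source file mympi.c (the file name is illustrative, not from the course material) would be built for KNL with:

$ mpiicc -xMIC-AVX512 mympi.c -o mympi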
Intel fabric integrated into the KNL processor
● Intel released KNL-F, a KNL with an integrated fabric, in November 2016:
  ✓ Intel Omni-Path Architecture
  ✓ High bandwidth and low latency
  ✓ The Omni-Path technology allows building clusters such as the LRZ KNL cluster
● Other Xeon Phi based systems: servers, workstations, …
dap.xeonphi.com
KNL: Cores and threads
• Up to 36 tiles connected by a 2D mesh interconnect, each with 2 physical cores (up to 72 cores with out-of-order instruction execution)
• The L2 cache is distributed across the mesh interconnect
• Performance is sensitive to data locality
KNL Tile (Cores and threads)
• 2 cores per tile, each with 2 VPUs (2x AVX-512)
• Up to 72 cores with 4-way hyper-threading: up to 288 logical processors
[Figure: one KNL tile. Two cores, each with a decode unit, two vector ALUs and a 32 KB L1 data cache, share a 1 MB 16-way L2 cache.]
• Up to 36 MB of L2 per KNL
Parallelism on Xeon Phi
[Figure: several cores, each with logical cores and a vector unit, attached to a shared memory; threading spans the cores, vectorisation works within each core.]

Shared-memory parallelism: OpenMP
• C/C++/Fortran, Python/Java, …
• Porting is easy.
• Two parallelisation modes are required: shared memory and vectorisation.
• Run multiple threads/processes and have each thread issue vector instructions (SIMD), as in the sketch below.
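A minimal sketch of this combination in C (array names and sizes are illustrative, not from the course material), to be compiled e.g. with icc -qopenmp -xMIC-AVX512:

#include <stdio.h>
#define N 100000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* OpenMP threads split the iteration space across cores, and the
     * 'simd' clause asks the compiler to vectorise each thread's chunk. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}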
KNL and Vector Instruction Sets
● Xeon binaries run on KNL without recompilation.
● KNC binaries require recompilation.
SNB E5-2600   HSW E5-2600   KNL
x87/MMX       x87/MMX       x87/MMX
SSE*          SSE*          SSE*
AVX           AVX           AVX
              AVX2          AVX2
                            AVX-512F
                            AVX-512CD
                            AVX-512PF
                            AVX-512ER
● AVX-512F and AVX-512CD will also be used in future Xeon architectures such as Skylake (SKX).
  - Conflict Detection (AVX-512CD): improves vectorisation
  - Prefetch (AVX-512PF): gather and scatter prefetch
  - Exponential and Reciprocal instructions (AVX-512ER)
Memory on KNL
● Two levels of memory on KNL:
1. Main memory
   ● KNL has direct access to all of main memory
   ● Latency and bandwidth are similar to a standard Xeon processor
   ● 6 DDR channels
2. Multi-Channel DRAM (MCDRAM)
   ● High-bandwidth memory on the chip: 16 GB
   ● Slightly higher latency than main memory
   ● 8 MCDRAM controllers, 16 channels
Using MCDRAM on KNL
● One memory mode of operation must be chosen at boot time:

Flat mode
- MCDRAM is treated as a NUMA node, i.e. as separately addressable memory
- Users control what goes to MCDRAM

Cache mode
- MCDRAM is treated as a transparent last-level cache (LLC)
- MCDRAM is used automatically

Hybrid mode
- A combination of flat and cache modes
- The ratio can be chosen in the BIOS
● Flat mode offers the best performance for applications, but requires changes to the code (memory allocation) or to the execution environment (NUMA binding):
numactl --membind 0 ./exec    # allocate in DDR4
numactl --membind 1 ./exec    # allocate in MCDRAM
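Within the code, MCDRAM can be requested explicitly via the memkind library's hbwmalloc interface; a minimal sketch (array size and names are illustrative), linked with -lmemkind:

#include <stdio.h>
#include <hbwmalloc.h>   /* high-bandwidth memory allocator from memkind */

int main(void) {
    size_t n = 1000000;
    /* Allocate the bandwidth-critical array in MCDRAM. */
    double *a = hbw_malloc(n * sizeof(double));
    if (a == NULL) {
        fprintf(stderr, "hbw_malloc failed\n");
        return 1;
    }
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    printf("a[0] = %f\n", a[0]);
    hbw_free(a);
    return 0;
}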
Programming Models on KNC
• Native mode
  • Programs start directly on the Xeon Phi (KNC)
  • Cross-compile using -mmic
  • User access to the Xeon Phi is necessary
• Offload to the MIC (KNC)
  • Offload using OpenMP extensions
  • Automatic offload of some routines using MKL
    • MKL compiler-assisted offload (CAO)
    • MKL automatic offload (AO)
• MPI tasks on host and MIC
  • Treat the coprocessor like another host
  • MPI-only, or MPI + X (X may be OpenMP, TBB, Cilk, OpenCL, etc.)
main() {
    ...
    #pragma offload target(mic)
    myFunction();     /* myFunction() runs on the MIC; the rest of main() runs on the host */
    ...
}
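A slightly fuller sketch of the offload model with explicit data-transfer clauses (array names and sizes are illustrative, not from the slides):

#include <stdio.h>
#define N 1024

int main(void) {
    float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    /* Copy 'a' in to the coprocessor, run the loop there, copy 'b' back out. */
    #pragma offload target(mic) in(a) out(b)
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i];

    printf("b[%d] = %f\n", N - 1, b[N - 1]);
    return 0;
}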
Native Mode on KNC
• First ensure that the application is suitable for native execution.
• The application runs entirely on the MIC coprocessor, without offload from a host system.
• Compile the application for native execution using the flag -mmic.
• Build the required libraries for native execution as well.
• Copy the executable and any dependencies, such as runtime libraries, to the coprocessor.
• Mount file shares on the coprocessor for accessing input data sets and saving output data sets.
• Log in to the coprocessor via console, set up the environment and run the executable.
• You can debug the native application via a debug server running on the coprocessor.
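A minimal command-line sketch of this workflow from the host (the coprocessor host name mic0 and the paths are conventional examples, not LRZ-specific):

$ icc -mmic myapp.c -o myapp.mic      # cross-compile for native KNC execution
$ scp myapp.mic mic0:/tmp/            # copy the binary to the coprocessor
$ ssh mic0                            # log in to the coprocessor
mic0$ /tmp/myapp.mic                  # run natively on the MIC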
Programming on KNL
• C/C++/Fortran/Python/Java, …
• Feels like a standard node, but:
  • a many-core approach
  • the cores are relatively slow
  • intra-node parallelisation is required for good performance
• Binary-compatible with previous Xeons, but not vice versa
Lab 1: KNL Cluster @ LRZ
KNL Cluster @ LRZ - CoolMUC3
CoolMUC3 Documentation
Documentation: https://www.lrz.de/services/compute/linux-cluster/coolmuc3/
Setup:
- The front-end node lxlogin8.lrz.de must be used for development work on the CooLMUC-3 system:
  user@host~$ ssh lxlogin8.lrz.de -l a2c06aa
  (enter your password; the login node is a Broadwell machine)
- Take a look at the different queues:
  $ sinfo                              # prints a lot of details
  $ sinfo -o "%20P %5a %.10l %16F"     # prints the queues and permitted job sizes and runtimes
- We will use the reservation KNL_Course.
- Copy the folder: cp -r /lrz/sys/courses/KNL .
- We can compile for the KNL architecture on Broadwell and Haswell machines.
user@mpp3-login8~$ salloc --ntasks=1 --reservation=KNL_Course -t 02:00:00
user@mpp3-login8~$ squeue -u $USER
There is a hello-world program, hello.c; a minimal version might look like the sketch below. Compile and run it:
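(A minimal sketch; the course's hello.c may differ.)

#include <stdio.h>

int main(void) {
    printf("Hello world!\n");
    return 0;
}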
$ icc hello.c -o hello
$ ./hello

Try to use the “-xmic-avx512” flag:

$ icc -xmic-avx512 hello.c -o hello
$ ./hello

What happens?
Try now:

$ icc -xcore-avx2 -axmic-avx512 hello.c -o hello
$ ./hello

Why does this run?

Now execute:
user@mpp3-login8~$ salloc --ntasks=1 --reservation=KNL_Course -t 02:00:00
user@mpp3-login8~$ squeue -u $USER
user@mpp3-login8~$ srun --pty bash --reservation=KNL_Course
user@mpp3-login8~$ ssh mpp3r03c05s02
Execute
$ numactl -H

From mpp3-login8, execute:

$ srun ./hello

Now log in to mcct03.cos and mcct04.cos, execute “numactl -H”, and compare.
Submitting Jobs on CoolMUC3
I_MPI_FABRICS
● The following network fabrics are available for the Intel Xeon Phi processor and coprocessor (see the table below).
● The I_MPI_FABRICS variable tells Intel MPI which fabric to use at runtime.
Fabric      Network hardware and software used
shm         Shared memory
tcp         TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)
ofa         OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl        DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
ofi or tmi  OFI- or TMI-capable network fabrics, such as Intel True Scale Fabric, Intel Omni-Path, InfiniBand and Ethernet
● The default can be changed by setting the I_MPI_FABRICS environment variable: I_MPI_FABRICS=<fabric> or I_MPI_FABRICS=<intra-node fabric>:<inter-nodes fabric>
● Intel OPA MPI parameters (TMI fabric):
  export I_MPI_FABRICS=shm:tmi
● Intra-node shared memory, inter-node DAPL (the default on SuperMIC/SuperMUC):
  export I_MPI_FABRICS=shm:dapl
● Intra-node shared memory, inter-node TCP (can be used in case of InfiniBand problems):
  export I_MPI_FABRICS=shm:tcp
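For example, to run a hypothetical MPI binary ./mympi over Omni-Path (the binary name is illustrative):

$ export I_MPI_FABRICS=shm:tmi
$ mpirun -n 4 ./mympi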