Accelerators in Abel
Ole W. Saastad, Dr.Scient UiO/USIT / UAV/ ITF/FI
March 28th 2014
Universitetets senter for informasjonsteknologi
Background: what is an accelerator?
The short explanation is that it is a device where most of the transistors are used for computation.
It is a calculating device more than a general-purpose processor.
By using nearly all of the transistors for calculations, very high performance can be achieved.
Accelerators are not new
Top 500 supercomputers
• Of the top 500 systems, 53 now use accelerators
• 4 of the top 10 use accelerators
• HPL performance
Benchmark – user fortran code
[Figure: MxM offloading – Fortran 90 code, double precision. Performance [Gflops/s] vs. memory footprint of the matrices (2288, 5149, 5859 and 6614 MiB), host processors vs. co-processor.]
Accelerators in Abel
• NVIDIA K20x
– 16 nodes with two each – 32 GPUs in total
• Intel Xeon Phi, 5110P
– 4 nodes with two each – 8 MIC systems in total
NVIDIA Kepler K20
Kepler K20, processor GK110
Kepler K20 architecture
Kepler K20 performance
GPU performance K20Xm
[Figure: DGEMM performance, GPU vs. CPU – Tesla K20X vs. Intel SB. CUDA BLAS vs. MKL BLAS; performance [Gflops/s] vs. total matrix footprint (91–5149 MiB).]
[Figure: SGEMM performance, GPU vs. CPU – Tesla K20X vs. Intel SB. CUDA BLAS vs. MKL BLAS; performance [Gflops/s] vs. total matrix footprint (45–5538 MiB).]
Double precision, 64 bit: 1 Tflops/s
Single precision, 32 bit: 2.6 Tflops/s
Accelerators: hype or production?
Exploiting the GPUs
• Pre compiled applications
– NAMD, MrBayes, Beagle, LAMMPS, etc.
• CUDA libraries
– BLAS, Sparse matrices, FFT
• Compiler supporting accelerator directives
– PGI supports accelerator directives
NAMD 2.8 and 2.9
• GPU enabled
• Easy to run
charmrun namd2 +idlepoll +p 2 ++local +devices 0,1 input.inp
[Figure: NAMD apoa1 benchmark, single node performance. Wall time [secs] by node type: standard compute node vs. GPU node.]
Speedup: 122/39 = 3.1x
LAMMPS
• GPU enabled
• Easy to run
mpirun lmp_cuda.double.x -sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000 in.lj.gpu
[Figure: Lennard-Jones potential benchmark, single node performance. Run time [secs] by node type: standard compute node vs. accelerated node.]
Speedup: 720/250 = 2.9x
Running applications with GPUs
Example using LAMMPS:

#SBATCH --job-name=lammps --account=proj --nodes=2
#SBATCH --ntasks-per-node=8 --mem-per-cpu=7800M
#SBATCH --partition=accel --gres=gpu:2 --time=01:00:00

. /cluster/bin/jobsetup
module load lammps/2013.08.16
module load cuda/5.0

EXE=lmp_cuda.double.x
OPT="-sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000"
INPUT=in.lj.gpu
mpirun $EXE $OPT < $INPUT
CUDA libraries – easy access
• Precompiled, just linking
– BLAS
– Sparse
– FFT
– Random
– Some extras, ref. doc.
CUDA libraries
From Fortran 90:
call cublas_dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
Same syntax as standard dgemm
Compile and link :
gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90
The interface hides the CUDA syntax.
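As a concrete illustration, here is a minimal sketch of what dgemmdriver.f90 could look like; the random test matrices and the printout are illustrative assumptions, only the cublas_dgemm call and the file name come from the slides:

program dgemmdriver
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  double precision :: alpha, beta

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)   ! illustrative test data
  call random_number(b)
  c = 0.0d0
  alpha = 1.0d0
  beta  = 0.0d0

  ! Same argument order as the reference DGEMM; the thunking wrapper
  ! from fortran_thunking.o moves A and B to the GPU, runs cublasDgemm
  ! there and copies C back.
  call cublas_dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)

  print *, 'c(1,1) =', c(1,1)
  deallocate(a, b, c)
end program dgemmdriver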
CUDA libraries
Performance in Gflops/s:

N       Footprint [MiB]   CUDA BLAS   MKL BLAS
2000    91                3.29        34.14
4000    366               24.94       61.96
8000    1464              159.7       71.44
12000   3295              345.55      72.09
15000   5149              482.15      72.56

[Figure: DGEMM performance, GPU vs. CPU – Tesla K20X vs. Intel SB. CUDA BLAS vs. MKL BLAS; performance [Gflops/s] vs. total matrix footprint (91–5149 MiB).]

Speedup: 482/73 = 6.6x
OpenACC – very easy to get started
Open accelerator initiative info
• www.openacc-standard.org
• www.pgroup.com
• en.wikipedia.org/wiki/OpenACC
• developer.nvidia.com/openacc
Open ACCelerator initiative
Directives inserted into your old code
Compilers supporting OpenACC
• Portland (PGI), pgcc, pgfortran, pgf90
– Installed on Abel
• CAPS HMPP
– Not installed on Abel
– Commercial, rather expensive
• GCC (soon)
– in version 5.0
Compilers supporting OpenACC
Fortran 90 code:

SUBROUTINE DGEMM_acc
!$acc region
      DO 90 J = 1,N
          IF (BETA.EQ.ZERO) THEN
              DO 50 I = 1,M
                  C(I,J) = ZERO
   50         CONTINUE
          .........
   90 CONTINUE
!$acc end region
Compile and link :
pgfortran -o dgemmtest.x -ta=nvidia,kepler dgemm.f dgemmtest.f90
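For a self-contained picture of the directives, a small matrix-multiply routine under the same PGI accelerator model might look as follows; the name mxm_acc and the plain triple loop are illustrative assumptions, not the DGEMM source from the slide:

subroutine mxm_acc(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  integer :: i, j, k
!$acc region
  ! The compiler turns this loop nest into a GPU kernel and takes
  ! care of moving a, b and c between host and device.
  do j = 1, n
     do i = 1, n
        c(i,j) = 0.0d0
     end do
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
!$acc end region
end subroutine mxm_acc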
Running accelerated code
Performance in Gflops/s:

N       Footprint [MiB]   PGI Accel   PGI
2000    91                2.33        2.51
4000    366               9.48        2.17
8000    1464              14.03       2.21
12000   3295              14.09       2.2
15000   5149              11.85       1.79

[Figure: dgemm – accelerated F77 code vs. plain F77, Portland accelerator directives. PGI Accel vs. PGI; performance [Gflops/s] vs. total matrix footprint (91–5149 MiB).]
[Figure: sgemm – accelerated F77 code vs. plain F77, Portland accelerator directives. PGI Accel vs. PGI; performance [Gflops/s] vs. total matrix footprint (45–5538 MiB).]
CUDA language - NVIDIA
• CUDA stands for «Compute Unified Device Architecture». CUDA is a parallel computing architecture and a C-based programming language for general purpose computing on NVIDIA GPUs
• Programming from scratch, special syntax for GPU
• Works only with NVIDIA
CUDA - NVIDIA
Programming with CUDA
__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y*width + x] = c;
    }
}

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim((width  + blockDim.x - 1) / blockDim.x,
             (height + blockDim.y - 1) / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);
CUDA 6.0 – the new version
OpenCL
OpenCL language
• Open Computing Language
– Support for a range of processors incl x86-64
• An open standard supported by multiple vendors
• Complexity comparable to CUDA
• Performance comparable to CUDA
OpenCL

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024", NULL);

// set the args values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);  // coalesced global reads
    fftRadix16Pass(data);       // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

    fftRadix16Pass(data);               // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);      // radix-4 function number 1
    fftRadix4Pass(data + 4);  // radix-4 function number 2
    fftRadix4Pass(data + 8);  // radix-4 function number 3
    fftRadix4Pass(data + 12); // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
}
Running jobs / SLURM - GPUs
• Request both GPUs
– qlogin --account xx --partition=accel --gres=gpu:2 --nodes=1 --ntasks-per-node=8
• #SBATCH --nodes=1 --ntasks-per-node=8
– Reserve all resources for your job
• #SBATCH --partition=accel --gres=gpu:2
Intel Xeon Phi – MIC architecture
Outstanding performance
Theoretical performance:
Clock frequency 1.05 GHz
60 cores (x60)
8 dp entries wide vector unit (x8)
FMA instruction (x2)
1.05*60*8*2 = 1008 Gflops/s
1 Tflops/s on a single PCIe card
MIC architecture
• 60 physical cores, x86-64, in-order execution
• 240 hardware threads
• 512 bits wide vector unit (8 x 64-bit or 16 x 32-bit floats)
• 8 GiB GDDR5 main memory in 4 banks
• Cache coherent memory (directory based, TD)
• Limited hardware prefetch
• Software prefetch important – see the sketch below
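Since the hardware prefetcher is limited, software prefetching matters. A hedged sketch of what that can look like, assuming Intel Fortran's !dir$ prefetch directive; the routine, array name, hint and distance values are all illustrative:

subroutine sum_prefetch(a, n, s)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: a(n)
  double precision, intent(out) :: s
  integer :: i
  s = 0.0d0
  ! Intel-specific directive (assumption): prefetch a() with hint 1,
  ! 16 iterations ahead; both values chosen purely for illustration.
  !dir$ prefetch a:1:16
  do i = 1, n
     s = s + a(i)
  end do
end subroutine sum_prefetch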
MIC architecture
Simple to program – X86-64 arch.
8 x double vector unit and FMA
Vector and FMA for M x M
Matrix multiplication
A typical line computes A = B * C + D
Easy to map to FMA and vector since:

A1 = B1 * C1 + D1
A2 = B2 * C2 + D2
A3 = B3 * C3 + D3
...
A8 = B8 * C8 + D8
All this in one instruction, VFMADDPD!
do i = iminloc, imaxloc
   uold(i,j,in) = u(i,in) + (flux(i-2,in) - flux(i-1,in))*dtdx
end do
Benchmarks – Matmul using MKL
[Figure: MKL dgemm automatic offload – two SB processors, one Phi card. Performance [Gflops/s] vs. percentage offloaded to the MIC (auto, 0, 50, 80, 90, 100), for matrix footprints of 2288, 20599 and 57220 MiB.]
Benchmark – user fortran code
[Figure: MxM offloading – Fortran 90 code, double precision. Performance [Gflops/s] vs. memory footprint of the matrices (2288, 5149, 5859 and 6614 MiB), host processors vs. co-processor.]
Accelerators: hype or production?
Easy to program, hard to fully exploit
• Same source code – no changes, same compiler
• 60 physical cores – one vector unit per core
• 240 hardware threads – at least 120 are needed for fp work
• 8/16-number wide vector unit – try to fill it all the time
• Fused multiply-add instruction – when can you use it?
• Cache coherent memory – nice, but has a cost
• OpenMP – threads – cc-memory
• MPI – uses shared memory communication
Easy to program - native
• Compile using Intel compilers
– icc -mmic -openmp
– ifort -mmic -openmp
– Other flags are like for Sandy Bridge
• Compile on the host node and launch on the MIC node – see the sketch below
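A minimal native sketch, assuming nothing beyond the compile lines above: an OpenMP hello-world that could be built with ifort -mmic -openmp on the host and then run directly on the MIC:

program hello_mic
  use omp_lib
  implicit none
  !$omp parallel
  ! Each of the (up to 240) hardware threads reports its id.
  print *, 'thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$omp end parallel
end program hello_mic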
Easy to program - offload
• Use MKL calls
• Compile using Intel compilers
• Set flags to use offload
– export MKL_MIC_ENABLE=1
– export OFFLOAD_DEVICES=0,1
– export MIC_KMP_AFFINITY=explicit,granularity=fine
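Besides MKL's automatic offload, user code can be offloaded explicitly. A hedged sketch, assuming Intel's offload directives (!dir$ offload begin / !dir$ end offload) and an illustrative matmul; this is not the benchmark code from the later slides:

subroutine mxm_offload(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  ! The block below is shipped to MIC device 0: a and b are copied in,
  ! c is copied back when the block completes.
  !dir$ offload begin target(mic:0) in(a, b) out(c)
  c = matmul(a, b)
  !dir$ end offload
end subroutine mxm_offload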
Experience - native
[Figure: STREAM memory bandwidth benchmark (vector update). Size 4.5 GiB, affinity=scatter. Bandwidth [GiB/s] for Copy, Scale, Add and Triad vs. number of threads (60, 120, 180, 240); all curves fall in the 125–155 GiB/s range.]
Experience - native
[Figure: Xeon Phi vs. single Sandy Bridge, NPB OpenMP. Relative Phi performance (0–160%) for the benchmarks BT.C, CG.C, EP.C, FT.B, IS.C, LU.C, MG.B and SP.C.]
NAS Parallel Benchmarks (NPB).
Serial, threaded (OpenMP) and MPI versions.
Run natively on Xeon Phi.
Hard to beat Sandy Bridge :(
Experience – offload user function
[Figure: MxM offloading – Fortran 90 code, double precision. Performance [Gflops/s] vs. memory footprint of the matrices (2288, 5149, 5859 and 6614 MiB), host processors vs. co-processor.]
Running jobs on the Xeon Phis
• Request both MICs
– qlogin --account xx --partition=accel --gres=mic:2 --nodes=1 --ntasks-per-node=16
• #SBATCH --partition=accel --gres=mic:2
• #SBATCH --nodes=1 --ntasks-per-node=16
– Reserve all resources for your job
Running jobs on the Xeon Phis
• This is still an evaluation resource
• Log in to the host, like ssh c19-20 (17, 18, 19 & 20)
• Log onto one of the Phis, ssh mic0 / ssh mic1
• The software is available at /phi
• A user directory is at /phi/users/<username>
– Request one to be made for you
• The /phi directory is also mounted on the host