Accelerators in Abel
Ole W. Saastad, Dr.Scient UiO/USIT / UAV/ ITF/FI
March 28th 2014
Universitetets senter for informasjonsteknologi
Background: what is an accelerator?
The short explanation is that it is a device where most of the transistors are used for computation.
It is a calculating device more than a general-purpose processor.
By using nearly all of the transistors for calculations, very high performance can be achieved.
Accelerators are not new
Top 500 supercomputers
• Of the top 500 systems, 53 now use accelerators
• 4 of the top 10 use accelerators
• HPL performance
Benchmark – user fortran code
[Figure: MxM offloading – Fortran 90 code, double precision. Performance [Gflops/s] vs. memory footprint of the matrices (2288, 5149, 5859 and 6614 MiB), host processors vs. co-processor.]
Accelerators in Abel
• NVIDIA K20x
– 16 nodes with two each – 32 GPUs in total
• Intel Xeon Phi, 5110P
– 4 nodes with two each – 8 MIC systems in total
NVIDIA Kepler K20
Kepler K20, processor GK110
Kepler K20 architecture
Kepler K20 performance
GPU performance K20Xm
[Figure: DGEMM performance, GPU vs. CPU – Tesla K20X vs. Intel SB. CUDA BLAS vs. MKL BLAS; performance [Gflops/s] vs. total matrix footprint (91–5149 MiB).]
[Figure: SGEMM performance, GPU vs. CPU – Tesla K20X vs. Intel SB. CUDA BLAS vs. MKL BLAS; performance [Gflops/s] vs. total matrix footprint (45–5538 MiB).]
Double precision, 64 bit: 1 Tflops/s
Single precision, 32 bit: 2.6 Tflops/s
Accelerators: hype or production?
Exploiting the GPUs
• Pre compiled applications
– NAMD, MrBayes, Beagle, LAMMPS, etc.
• CUDA libraries
– BLAS, Sparse matrices, FFT
• Compiler supporting accelerator directives
– PGI supports accelerator directives
NAMD 2.8 and 2.9
• GPU enabled
• Easy to run
charmrun namd2 +idlepoll +p 2 ++local +devices 0,1 input.inp
[Figure: NAMD apoa1 benchmark, single node performance. Wall time [secs] by node type: standard compute node vs. GPU node.]
Speedup: 122/39 = 3.1x
LAMMPS
• GPU enabled
• Easy to run
mpirun lmp_cuda.double.x -sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000 in.lj.gpu
[Figure: Lennard-Jones potential benchmark, single node performance. Run time [secs] by node type: standard compute node vs. accelerated node.]
Speedup: 720/250 = 2.9x
Running applications with GPUs
Example using LAMMPS:

#SBATCH --job-name=lammps --account=proj --nodes=2
#SBATCH --ntasks-per-node=8 --mem-per-cpu=7800M
#SBATCH --partition=accel --gres=gpu:2 --time=01:00:00

. /cluster/bin/jobsetup
module load lammps/2013.08.16
module load cuda/5.0

EXE=lmp_cuda.double.x
OPT="-sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000"
INPUT=in.lj.gpu
mpirun $EXE $OPT < $INPUT
CUDA libraries – easy access
• Precompiled, just linking
– BLAS
– Sparse
– FFT
– Random
– Some extras, ref. doc.
CUDA libraries
From Fortran 90:
call cublas_dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
Same syntax as standard dgemm
Compile and link :
gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90
The interface hides the CUDA syntax.
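As a concrete illustration, here is a minimal sketch of what dgemmdriver.f90 could look like; the random test matrices and the printout are illustrative assumptions, only the cublas_dgemm call and the file name come from the slides:

program dgemmdriver
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  double precision :: alpha, beta

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)   ! illustrative test data
  call random_number(b)
  c = 0.0d0
  alpha = 1.0d0
  beta  = 0.0d0

  ! Same argument order as the reference DGEMM; the thunking wrapper
  ! from fortran_thunking.o moves A and B to the GPU, runs cublasDgemm
  ! there and copies C back.
  call cublas_dgemm('n', 'n', n, n, n, alpha, a, n, b, n, beta, c, n)

  print *, 'c(1,1) =', c(1,1)
  deallocate(a, b, c)
end program dgemmdriver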
CUDA libraries
Performance in Gflops/s:

N       Footprint [MiB]   CUDA BLAS   MKL BLAS
2000    91                3.29        34.14
4000    366               24.94       61.96
8000    1464              159.7       71.44
12000   3295              345.55      72.09
15000   5149              482.15      72.56

[Figure: DGEMM performance, GPU vs. CPU – Tesla K20X vs. Intel SB. CUDA BLAS vs. MKL BLAS; performance [Gflops/s] vs. total matrix footprint (91–5149 MiB).]

Speedup: 482/73 = 6.6x
OpenACC – very easy to get started
Open accelerator initiative info
• www.openacc-standard.org
• www.pgroup.com
• en.wikipedia.org/wiki/OpenACC
• developer.nvidia.com/openacc
Open ACCelerator initiative
Directives inserted into your old code
Compilers supporting OpenACC
• Portland (PGI), pgcc, pgfortran, pgf90
– Installed on Abel
• CAPS HMPP
– Not installed on Abel
– Commercial, rather expensive
• GCC (soon)
– in version 5.0
Compilers supporting OpenACC
Fortran 90 code:

SUBROUTINE DGEMM_acc
!$acc region
      DO 90 J = 1,N
          IF (BETA.EQ.ZERO) THEN
              DO 50 I = 1,M
                  C(I,J) = ZERO
   50         CONTINUE
          .........
   90 CONTINUE
!$acc end region
Compile and link :
pgfortran -o dgemmtest.x -ta=nvidia,kepler dgemm.f dgemmtest.f90
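For a self-contained picture of the directives, a small matrix-multiply routine under the same PGI accelerator model might look as follows; the name mxm_acc and the plain triple loop are illustrative assumptions, not the DGEMM source from the slide:

subroutine mxm_acc(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  integer :: i, j, k
!$acc region
  ! The compiler turns this loop nest into a GPU kernel and takes
  ! care of moving a, b and c between host and device.
  do j = 1, n
     do i = 1, n
        c(i,j) = 0.0d0
     end do
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
!$acc end region
end subroutine mxm_acc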
Running accelerated code
Performance in Gflops/s:

N       Footprint [MiB]   PGI Accel   PGI
2000    91                2.33        2.51
4000    366               9.48        2.17
8000    1464              14.03       2.21
12000   3295              14.09       2.2
15000   5149              11.85       1.79

[Figure: dgemm – accelerated F77 code vs. plain F77, Portland accelerator directives. PGI Accel vs. PGI; performance [Gflops/s] vs. total matrix footprint (91–5149 MiB).]
[Figure: sgemm – accelerated F77 code vs. plain F77, Portland accelerator directives. PGI Accel vs. PGI; performance [Gflops/s] vs. total matrix footprint (45–5538 MiB).]
CUDA language - NVIDIA
• CUDA stands for «Compute Unified Device Architecture». CUDA is a parallel computing architecture and a C-based programming language for general purpose computing on NVIDIA GPUs
• Programming from scratch, special syntax for GPU
• Works only with NVIDIA
CUDA - NVIDIA
Programming with CUDA
__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y*width + x] = c;
    }
}

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim((width  + blockDim.x - 1) / blockDim.x,
             (height + blockDim.y - 1) / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);
CUDA 6.0 – the new version
OpenCL
OpenCL language
• Open Computing Language
– Support for a range of processors incl x86-64
• An open standard supported by multiple vendors
• Complexity comparable to CUDA
• Performance comparable to CUDA
OpenCL

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024", NULL);

// set the args values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);  // coalesced global reads
    fftRadix16Pass(data);       // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

    fftRadix16Pass(data);               // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);      // radix-4 function number 1
    fftRadix4Pass(data + 4);  // radix-4 function number 2
    fftRadix4Pass(data + 8);  // radix-4 function number 3
    fftRadix4Pass(data + 12); // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
}
Running jobs / SLURM - GPUs
• Request both GPUs
– qlogin --account xx --partition=accel --gres=gpu:2 --nodes=1 --ntasks-per-node=8
• #SBATCH --nodes=1 --ntasks-per-node=8
– Reserve all resources for your job
• #SBATCH --partition=accel --gres=gpu:2
Intel Xeon Phi – MIC architecture
Outstanding performance
Theoretical performance:
Clock frequency 1.05 GHz
60 cores (x60)
8 dp entries wide vector unit (x8)
FMA instruction (x2)
1.05*60*8*2 = 1008 Gflops/s
1 Tflops/s on a single PCIe card
MIC architecture
• 60 physical cores, x86-64, in-order execution
• 240 hardware threads
• 512 bits wide vector unit (8 x 64-bit or 16 x 32-bit floats)
• 8 GiB GDDR5 main memory in 4 banks
• Cache coherent memory (directory based, TD)
• Limited hardware prefetch
• Software prefetch important – see the sketch below
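Since the hardware prefetcher is limited, software prefetching matters. A hedged sketch of what that can look like, assuming Intel Fortran's !dir$ prefetch directive; the routine, array name, hint and distance values are all illustrative:

subroutine sum_prefetch(a, n, s)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: a(n)
  double precision, intent(out) :: s
  integer :: i
  s = 0.0d0
  ! Intel-specific directive (assumption): prefetch a() with hint 1,
  ! 16 iterations ahead; both values chosen purely for illustration.
  !dir$ prefetch a:1:16
  do i = 1, n
     s = s + a(i)
  end do
end subroutine sum_prefetch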
MIC architecture
Simple to program – X86-64 arch.
8 x double vector unit and FMA
Vector and FMA for M x M
Matrix multiplication
A typical line computes A = B * C + D
Easy to map to FMA and vector since:

A1 = B1 * C1 + D1
A2 = B2 * C2 + D2
A3 = B3 * C3 + D3
...
A8 = B8 * C8 + D8
All this in one instruction, VFMADDPD!
do i = iminloc, imaxloc
   uold(i,j,in) = u(i,in) + (flux(i-2,in) - flux(i-1,in))*dtdx
end do
Benchmarks – Matmul using MKL
[Figure: MKL dgemm automatic offload – two SB processors, one Phi card. Performance [Gflops/s] vs. percentage offloaded to the MIC (auto, 0, 50, 80, 90, 100), for matrix footprints of 2288, 20599 and 57220 MiB.]
Benchmark – user fortran code
[Figure: MxM offloading – Fortran 90 code, double precision. Performance [Gflops/s] vs. memory footprint of the matrices (2288, 5149, 5859 and 6614 MiB), host processors vs. co-processor.]
Accelerators: hype or production?
Easy to program, hard to fully exploit
• Same source code – no changes, same compiler
• 60 physical cores – one vector unit per core
• 240 hardware threads – at least 120 are needed for fp work
• 8/16-number wide vector unit – try to fill it all the time
• Fused multiply-add instruction – when can you use it?
• Cache coherent memory – nice, but has a cost
• OpenMP – threads – cc-memory
• MPI – uses shared memory communication
Easy to program - native
• Compile using Intel compilers
– icc -mmic -openmp
– ifort -mmic -openmp
– Other flags are like for Sandy Bridge
• Compile on the host node and launch on the MIC node – see the sketch below
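A minimal native sketch, assuming nothing beyond the compile lines above: an OpenMP hello-world that could be built with ifort -mmic -openmp on the host and then run directly on the MIC:

program hello_mic
  use omp_lib
  implicit none
  !$omp parallel
  ! Each of the (up to 240) hardware threads reports its id.
  print *, 'thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$omp end parallel
end program hello_mic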
Easy to program - offload
• Use MKL calls
• Compile using Intel compilers
• Set flags to use offload
– export MKL_MIC_ENABLE=1
– export OFFLOAD_DEVICES=0,1
– export MIC_KMP_AFFINITY=explicit,granularity=fine
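Besides MKL's automatic offload, user code can be offloaded explicitly. A hedged sketch, assuming Intel's offload directives (!dir$ offload begin / !dir$ end offload) and an illustrative matmul; this is not the benchmark code from the later slides:

subroutine mxm_offload(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: a(n,n), b(n,n)
  double precision, intent(out) :: c(n,n)
  ! The block below is shipped to MIC device 0: a and b are copied in,
  ! c is copied back when the block completes.
  !dir$ offload begin target(mic:0) in(a, b) out(c)
  c = matmul(a, b)
  !dir$ end offload
end subroutine mxm_offload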
Experience - native
[Figure: STREAM memory bandwidth benchmark (vector update). Size 4.5 GiB, affinity=scatter. Bandwidth [GiB/s] for Copy, Scale, Add and Triad vs. number of threads (60, 120, 180, 240); all curves fall in the 125–155 GiB/s range.]
Experience - native
[Figure: Xeon Phi vs. single Sandy Bridge, NPB OpenMP. Relative Phi performance (0–160%) for the benchmarks BT.C, CG.C, EP.C, FT.B, IS.C, LU.C, MG.B and SP.C.]
NAS Parallel Benchmarks (NPB).
Serial, threaded (OpenMP) and MPI versions.
Run natively on Xeon Phi.
Hard to beat Sandy Bridge :(
Experience – offload user function
[Figure: MxM offloading – Fortran 90 code, double precision. Performance [Gflops/s] vs. memory footprint of the matrices (2288, 5149, 5859 and 6614 MiB), host processors vs. co-processor.]
Running jobs on the Xeon Phis
• Request both MICs
– qlogin --account xx --partition=accel --gres=mic:2 --nodes=1 --ntasks-per-node=16
• #SBATCH --partition=accel --gres=mic:2
• #SBATCH --nodes=1 --ntasks-per-node=16
– Reserve all resources for your job
Running jobs on the Xeon Phis
• This is still an evaluation resource
• Log in to the host, like ssh c19-20 (17, 18, 19 & 20)
• Log onto one of the Phis, ssh mic0 / ssh mic1
• The software is available at /phi
• A user directory is at /phi/users/<username>
– Request one to be made for you
• The /phi directory is also mounted on the host