
Page 1

Accelerators in Abel

Ole W. Saastad, Dr.Scient UiO/USIT / UAV/ ITF/FI

March 28th 2014

Page 2

Background: what is an accelerator?

The short explanation is that it is a device where most of the transistors are used for computation: a calculating device rather than a general-purpose processor.

By devoting nearly all of its transistors to calculation, very high performance can be achieved.

Page 3

Accelerators are not new

Page 4

Top 500 supercomputers

• Of the top 500 systems, 53 now use accelerators

• 4 of the top 10 use accelerators

• HPL (Linpack) performance is the ranking metric

Page 5

Benchmark – user Fortran code

[Figure: MxM offloading, Fortran 90 code, double precision. Performance [Gflops/s] (0–40) vs. memory footprint of the matrices (2288, 5149, 5859, 6614 MiB), host processors vs. co-processor.]

Page 6

Accelerators in Abel

• NVIDIA K20x

– 16 nodes with two each – 32 GPUs in total

• Intel Xeon Phi, 5110P

– 4 nodes with two each – 8 MIC systems in total

Page 7

NVIDIA Kepler K20

Page 8

Kepler K20, processor GK110

Page 9

Kepler K20 architecture

Page 10

Kepler K20 architecture

Page 11

Kepler K20 architecture

Page 12

Kepler K20 performance

Page 13

GPU performance K20Xm

[Figure: DGEMM performance, GPU vs. CPU (Tesla K20X vs. Intel SB). Performance [Gflops/s] (0–600) vs. total matrix footprint [MiB] (91–5149), CUDA BLAS vs. MKL BLAS.]

[Figure: SGEMM performance, GPU vs. CPU (Tesla K20X vs. Intel SB). Performance [Gflops/s] (0–1400) vs. total matrix footprint [MiB] (45–5538), CUDA BLAS vs. MKL BLAS.]

Double precision (64 bit): 1 Tflops/s
Single precision (32 bit): 2.6 Tflops/s

Page 14

Accelerators: hype or production?

Page 15

Exploiting the GPUs

• Precompiled applications

– NAMD, MrBayes, Beagle, LAMMPS, etc.

• CUDA libraries

– BLAS, Sparse matrices, FFT

• Compiler supporting accelerator directives

– PGI supports accelerator directives

Page 16

NAMD 2.8 and 2.9

• GPU enabled
• Easy to run

charmrun namd2 +idlepoll +p 2 ++local +devices 0,1 input.inp

[Figure: NAMD apoa1 benchmark, single-node performance. Wall time [secs] (0–140) per node type: compute node vs. GPU node.]

Speedup: 122/39 = 3.1x

Page 17

LAMMPS

• GPU enabled
• Easy to run

mpirun lmp_cuda.double.x -sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000 in.lj.gpu

[Figure: Lennard-Jones potential benchmark, single-node performance. Run time [secs] (0–800) per node type: standard compute node vs. accelerated node.]

Speedup: 720/250 = 2.9x

Page 18

Running applications with GPUs

Example using LAMMPS:

#SBATCH --job-name=lammps --account=proj --nodes=2
#SBATCH --ntasks-per-node=8 --mem-per-cpu=7800M
#SBATCH --partition=accel --gres=gpu:2 --time=01:00:00

. /cluster/bin/jobsetup
module load lammps/2013.08.16
module load cuda/5.0

EXE=lmp_cuda.double.x
OPT="-sf gpu -c off -v g 2 -v x 128 -v y 128 -v z 128 -v t 1000"
INPUT=in.lj.gpu

mpirun $EXE $OPT < $INPUT

Page 19

CUDA libraries – easy access

• Precompiled, just linking

– BLAS
– Sparse
– FFT
– Random
– Some extras, ref. doc.

Page 20

CUDA libraries

From Fortran 90:

call cublas_dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)

Same syntax as the standard dgemm.

Compile and link:

gfortran -o dgemmdriver.x -L/usr/local/cuda/lib64 /usr/local/cuda/lib64/fortran_thunking.o -lcublas dgemmdriver.f90

The interface hides the CUDA syntax.
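The thunking interface above is the route the slides take from Fortran. Purely as a hedged illustration, the sketch below shows roughly the same DGEMM through the cuBLAS v2 C API; the matrix size, fill values and the omitted error handling are assumptions, not from the slides.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 2000;                  /* illustrative matrix size */
    const double alpha = 1.0, beta = 0.0;
    size_t bytes = (size_t)n * n * sizeof(double);

    /* Host matrices in column-major order, as BLAS expects */
    double *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);
    for (size_t i = 0; i < (size_t)n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Device copies */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    /* C = alpha*A*B + beta*C, computed on the GPU */
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", c[0]);         /* expect n * 1.0 * 2.0 */

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return 0;
}

Built with something like nvcc -lcublas; the point is that the library call replaces any hand-written kernel.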

Page 21

CUDA libraries

Performance in Gflops/s:

N      Footprint [MiB]   CUDA BLAS   MKL BLAS
2000   91                3.29        34.14
4000   366               24.94       61.96
8000   1464              159.70      71.44
12000  3295              345.55      72.09
15000  5149              482.15      72.56

[Figure: DGEMM performance, GPU vs. CPU (Tesla K20X vs. Intel SB). Performance [Gflops/s] vs. total matrix footprint [MiB], CUDA BLAS vs. MKL BLAS.]

Speedup: 482/73 = 6.6x

Page 22

OpenACC – very easy to get started

Page 23

Open accelerator initiative info

• www.openacc-standard.org

• www.pgroup.com

• en.wikipedia.org/wiki/OpenACC

• developer.nvidia.com/openacc

Page 24

Open ACCelerator initiative

Directives inserted into your old code (a minimal sketch follows below)
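The slides show this with Fortran on page 26; as a minimal C sketch of the same idea (the DAXPY loop, the array names and the data clauses are illustrative assumptions), a single directive is enough:

/* DAXPY, y = a*x + y, offloaded with one OpenACC directive.
   Compiled without OpenACC support the pragma is ignored and
   the loop runs on the host unchanged. */
void daxpy(int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Compiled with the PGI flags the slides use, e.g. pgcc -acc -ta=nvidia,kepler daxpy.c.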

Page 25

Compilers supporting OpenACC

• Portland (PGI), pgcc, pgfortran, pgf90

– Installed on Abel

• CAPS HMPP

– Not installed on Abel

– Commercial, rather expensive

• GCC (soon)

– in version 5.0

Page 26

Compilers supporting OpenACC

Fortran 90 code:

SUBROUTINE DGEMM_acc

!$acc region
   DO 90 J = 1,N
     IF (BETA.EQ.ZERO) THEN
       DO 50 I = 1,M
         C(I,J) = ZERO
50     CONTINUE
   .........
90 CONTINUE
!$acc end region

Compile and link:

pgfortran -o dgemm-test.x -ta=nvidia,kepler dgemm.f dgemmtest.f90

Page 27

Running accelerated code

Performance in Gflops/s:

N      Footprint [MiB]   PGI Accel   PGI
2000   91                2.33        2.51
4000   366               9.48        2.17
8000   1464              14.03       2.21
12000  3295              14.09       2.20
15000  5149              11.85       1.79

[Figure: Accelerated F77 code vs. plain F77, Portland accelerator directives, dgemm. Performance [Gflops/s] (0–16) vs. total matrix footprint [MiB] (91–5149), PGI Accel vs. PGI.]

[Figure: Accelerated F77 code vs. plain F77, Portland accelerator directives, sgemm. Performance [Gflops/s] (0–30) vs. total matrix footprint [MiB] (45–5538), PGI Accel vs. PGI.]

Page 28

CUDA language - NVIDIA

• CUDA stands for «Compute Unified Device Architecture». CUDA is a parallel computing architecture and a C-based programming language for general-purpose computing on NVIDIA GPUs

• Programming from scratch, special syntax for GPU

• Works only with NVIDIA

Page 29

CUDA - NVIDIA

Page 30

Programming with CUDA

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y*width+x] = c;
    }
}

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim((width + blockDim.x - 1) / blockDim.x,
             (height + blockDim.y - 1) / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);

Page 31

CUDA 6.0 – the new version

Page 32

OpenCL

Page 33

OpenCL language

• Open Computing Language

– Support for a range of processors, including x86-64

• An open standard supported by multiple vendors

• Complexity comparable to CUDA

• Performance comparable to CUDA

Page 34

OpenCL

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024", NULL);

// set the args values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

__kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                         __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);            // coalesced global reads
    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);   // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);                  // radix-4 function number 1
    fftRadix4Pass(data + 4);              // radix-4 function number 2
    fftRadix4Pass(data + 8);              // radix-4 function number 3
    fftRadix4Pass(data + 12);             // radix-4 function number 4

Page 35

Running jobs / SLURM - GPUs

• Request both GPUs

– qlogin --account xx --partition=accel --gres=gpu:2 --nodes=1 --ntasks-per-node=8

• #SBATCH --nodes=1 --ntasks-per-node=8

– Reserve all resources for your job

• #SBATCH --partition=accel --gres=gpu:2

Page 36

Intel Xeon Phi – MIC architecture

Page 37

Outstanding performance

Theoretical performance:

Clock frequency 1.05 GHz
60 cores (×60)
8-entry wide double precision vector unit (×8)
FMA instruction (×2)

1.05 × 60 × 8 × 2 = 1008 Gflops/s

1 Tflops/s on a single PCIe card

Page 38

MIC architecture

• 60 physical cores, x86-64, in-order execution
• 240 hardware threads
• 512-bit wide vector unit (8 × 64-bit or 16 × 32-bit floats)
• 8 GiB GDDR5 main memory in 4 banks
• Cache-coherent memory (directory based, TD)
• Limited hardware prefetch
• Software prefetch important

Page 39

MIC architecture

Page 40

Simple to program – x86-64 architecture

Page 41

8 x double vector unit and FMA

Page 42

Vector and FMA for M x M

Matrix multiplication

Typical line to compute A = B * C + D

Easy to map to FMA and vector since:

A1 = B1 * C1 + D1
A2 = B2 * C2 + D2
A3 = B3 * C3 + D3
..
A8 = B8 * C8 + D8

All this in one instruction, VFMADDPD!

do i=iminloc,imaxloc
   uold(i,j,in)=u(i,in)+(flux(i-2,in)-flux(i-1,in))*dtdx
end do
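In C the same pattern is just a plain loop (the array names here are illustrative); a vectorizing compiler such as icc can then cover eight doubles per VFMADDPD:

/* a[i] = b[i]*c[i] + d[i] -- with the 512-bit vector unit one
   VFMADDPD handles eight doubles at a time, provided the loop
   stays this simple (unit stride, no aliasing surprises). */
void fma_loop(int n, const double *b, const double *c,
              const double *d, double *a)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}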

Page 43

Benchmarks – Matmul using MKL

[Figure: MKL dgemm automatic offload, two SB processors and one Phi card. Performance [Gflops/s] (0–1200) vs. percent offloaded to the MIC (auto, 0, 50, 80, 90, 100), for total matrix footprints of 2288, 20599 and 57220 MiB.]

Page 44

Benchmark – user Fortran code

[Figure: MxM offloading, Fortran 90 code, double precision. Performance [Gflops/s] (0–40) vs. memory footprint of the matrices (2288, 5149, 5859, 6614 MiB), host processors vs. co-processor.]

Page 45

Accelerators: hype or production?

Page 46

Easy to program, hard to fully exploit

• Same source code – no changes, same compiler
• 60 physical cores – one vector unit per core
• 240 hardware threads – at least 120 are needed for floating-point work
• 8/16-wide vector unit – try to fill it all the time
• Fused multiply-add instruction – when can you use it?
• Cache-coherent memory – nice, but has a cost
• OpenMP – threads – cc-memory
• MPI – uses shared-memory communication
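As a minimal OpenMP C sketch of the thread side of this list (the problem size N and the dot-product kernel are illustrative assumptions): the reduction gives every hardware thread a private partial sum, and the multiply-add inside can still be vectorized per thread.

#include <omp.h>
#include <stdio.h>

#define N (1 << 24)                /* illustrative problem size */
static double x[N], y[N];

int main(void)
{
    double sum = 0.0;

    /* Spread the iterations over the hardware threads; each partial
       sum stays private, so the per-core vector unit can stay busy. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];

    printf("threads: %d  dot: %f\n", omp_get_max_threads(), sum);
    return 0;
}

Built natively with the icc -mmic -openmp line from the next slide and run with OMP_NUM_THREADS set to, say, 120 or 240.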

Page 47

Easy to program - native

• Compile using Intel compilers

– icc -mmic -openmp
– ifort -mmic -openmp
– Other flags are as for Sandy Bridge

• Compile on the host node and launch on the MIC node

Page 48

Easy to program - offload

• Use MKL calls

• Compile using Intel compilers

• Set flags to use offload

– export MKL_MIC_ENABLE=1

– export OFFLOAD_DEVICES=0,1

– export MIC_KMP_AFFINITY=explicit,granularity=fine
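With those variables set, automatic offload needs no source changes at all. A minimal C sketch (matrix size and fill values are illustrative assumptions): an ordinary MKL dgemm call, with MKL deciding at run time how much of the work to ship to the Phi.

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const int n = 8000;                          /* illustrative size */
    size_t bytes = (size_t)n * n * sizeof(double);
    double *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);

    for (size_t i = 0; i < (size_t)n * n; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    /* Plain MKL call; with MKL_MIC_ENABLE=1 in the environment MKL
       splits the work between host and coprocessor automatically. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);                 /* expect 2.0 * n */

    free(a); free(b); free(c);
    return 0;
}

Compiled with, e.g., icc -mkl dgemm_ao.c.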

Page 49

Experience - native

[Figure: STREAM benchmark, size 4.5 GiB, affinity=scatter. Bandwidth [GiB/s] (125–155) vs. number of threads (60, 120, 180, 240) for Copy, Scale, Add and Triad.]

STREAM memory bandwidth benchmark: vector update.

Page 50

Experience - native

[Figure: Xeon Phi vs. a single Sandy Bridge, NPB OpenMP. Relative Phi performance (0–160 %) for the BT.C, CG.C, EP.C, FT.B, IS.C, LU.C, MG.B and SP.C kernels.]

NAS Parallel Benchmarks (NPB): serial, threaded (OpenMP) and MPI versions, run natively on the Xeon Phi.

Hard to beat Sandy Bridge :(

Page 51

Experience – offload user function

[Figure: MxM offloading, Fortran 90 code, double precision. Performance [Gflops/s] (0–40) vs. memory footprint of the matrices (2288, 5149, 5859, 6614 MiB), host processors vs. co-processor.]

Page 52

Running jobs on the Xeon Phis

• Request both MICs

– qlogin --account xx --partition=accel --gres=mic:2 --nodes=1 --ntasks-per-node=16

• #SBATCH --partition=accel --gres=mic:2

• #SBATCH --nodes=1 --ntasks-per-node=16

– Reserve all resources for your job

Page 53

Running jobs on the Xeon Phis

• This is still an evaluation resource

• Log in to the host, e.g. ssh c19-20 (nodes 17, 18, 19 & 20)
• Log onto one of the Phis: ssh mic0 / mic1
• The software is available at /phi
• A user directory is at /phi/users/<username>

– Request one to be made for you

• The /phi directory is also mounted on the host