
1

Xeon Phi: Architecture

General information

Philipp Bartels, Thomas Lange

2

Tianhe-2

32,000 CPUs: Xeon E5-2692 v2

48,000 accelerators: Xeon Phi 31S1P

Theoretical peak: 54,902.4 TFlop/s (double precision)

Linpack performance: 33,862.7 TFlop/s
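
The two figures above can be related through the Linpack efficiency, i.e. measured performance divided by theoretical peak. A minimal sketch of that arithmetic; the function name `linpack_efficiency` is illustrative and does not come from the slides:

```c
/* Linpack efficiency: measured performance divided by theoretical peak.
 * Both arguments must use the same unit (here TFlop/s). */
double linpack_efficiency(double measured_tflops, double peak_tflops)
{
    return measured_tflops / peak_tflops;
}
```

Called with the Tianhe-2 numbers (33862.7 and 54902.4), this yields roughly 0.62, so about 62% of the theoretical peak is reached in Linpack.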

5

- Accelerator / coprocessor

- 57 to 61 general-purpose cores

- Embedded Linux

6

More characteristics

Can be assigned its own IP address

x86-64 instruction set

Extension: Initial Many Core Instructions (IMCI)

4-way Hyper-Threading (four hardware threads per core)

512-bit vector registers

7

Xeon Phi 7120D

Recommended customer price (RCP): $4,235 (Amazon.com: $3,507.82)

61 cores (1.238 GHz each)

30.5 MB L2 cache in total

Main memory: 16 GB GDDR5

TDP: 300 W
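
The core count, clock rate and 512-bit vector width together determine the theoretical peak performance. The following sketch shows that arithmetic. It assumes one 512-bit fused multiply-add per core per cycle and 512 KB of L2 cache per core; neither assumption is stated on the slide, and the function names are illustrative:

```c
/* Peak double-precision GFlop/s.
 * ASSUMPTION (not from the slide): each core issues one 512-bit
 * fused multiply-add per cycle. */
double peak_dp_gflops(int cores, double clock_ghz)
{
    const int lanes = 512 / 64;   /* doubles per 512-bit vector register */
    const int flops_per_fma = 2;  /* one multiply plus one add */
    return cores * clock_ghz * lanes * flops_per_fma;
}

/* Total L2 cache in MB.
 * ASSUMPTION (not from the slide): 512 KB of L2 cache per core. */
double total_l2_mb(int cores)
{
    return cores * 512.0 / 1024.0;
}
```

With 61 cores at 1.238 GHz, `peak_dp_gflops` gives about 1208 GFlop/s, and `total_l2_mb(61)` reproduces the 30.5 MB listed above.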

10

Parallel programming models

OpenMP

OpenACC

Intel Cilk Plus

Intel TBB

OpenCL

11

Pragma example

    #pragma offload target(mic) in(...) inout(...)
    {
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            c[i] = 2 * a[i] + b[i];
        }
    }
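
The pragma example above leaves the data clauses elided. One possible complete version is sketched below. The function name `scaled_add`, the choice of `float` arrays and the concrete `in`/`out` clauses are assumptions for illustration. The offload pragma is only emitted when the compiler defines `__INTEL_OFFLOAD`, so the same code also builds and runs on the host with other compilers:

```c
/* Computes c[i] = 2 * a[i] + b[i] for n elements.
 * With an offload-capable Intel compiler the block runs on the
 * coprocessor; with any other compiler it runs on the host. */
void scaled_add(const float *a, const float *b, float *c, int n)
{
    int i;
#ifdef __INTEL_OFFLOAD
    /* in(): copied to the coprocessor before the block runs.
     * out(): copied back to the host afterwards.
     * These clauses are an assumed example, not the slide's originals. */
    #pragma offload target(mic) in(a, b : length(n)) out(c : length(n))
#endif
    {
        /* Distributes the iterations over the available threads
         * when OpenMP is enabled; ignored otherwise. */
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            c[i] = 2 * a[i] + b[i];
        }
    }
}
```

Because the result does not depend on where the loop executes, the function can be checked on the host first and offloaded later.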

12

Intel Cilk Plus

3 simple keywords: cilk_for, cilk_spawn, cilk_sync

Array notation

SIMD-enabled functions

#pragma simd

13

Example

    cilk_for (int i = 0; i < 8; ++i)
    {
        do_work(i);
    }

    int fib(int n)
    {
        if (n < 2)
            return n;
        int x = cilk_spawn fib(n-1);
        int y = fib(n-2);
        cilk_sync;
        return x + y;
    }
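
A Cilk Plus program stays a valid sequential program when the three keywords are removed (its "serial elision"). The sketch below uses this to keep the example above buildable without Cilk Plus support. It assumes the compiler defines `__cilk` when Cilk Plus is enabled; the helper `fill_fib` is a hypothetical stand-in for `do_work`:

```c
#ifdef __cilk
#include <cilk/cilk.h>   /* provides cilk_for, cilk_spawn, cilk_sync */
#else
/* Serial elision: without Cilk Plus the keywords expand to nothing
 * (or to a plain for), leaving an ordinary sequential program. */
#define cilk_spawn
#define cilk_sync
#define cilk_for for
#endif

int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n - 1);  /* may run in parallel with the next call */
    int y = fib(n - 2);
    cilk_sync;                      /* wait until the spawned call has finished */
    return x + y;
}

/* Hypothetical helper: stores fib(0) .. fib(count - 1) in out.
 * The iterations are independent, so cilk_for may run them in parallel. */
void fill_fib(int *out, int count)
{
    cilk_for (int i = 0; i < count; ++i) {
        out[i] = fib(i);
    }
}
```

The parallel and the serial build compute the same values; only the scheduling differs.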

14

Vectorization

15

Vectorization

Perform the same operation on multiple data elements in a single instruction

    #pragma omp simd
    for (i = 0; i < 1024; i++)
        C[i] = A[i] * B[i];

    // array notation in Intel Cilk Plus: a section is written array[start:length]
    for (i = 0; i < 1024; i += 4)
        C[i:4] = A[i:4] * B[i:4];
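
The OpenMP variant above can be wrapped into a complete function as follows. This is a sketch: the function name `multiply` and the use of `restrict` are additions for illustration, not part of the slides:

```c
/* Element-wise product C[i] = A[i] * B[i].
 * restrict promises the compiler that the three arrays do not overlap,
 * which removes a common obstacle to vectorization. */
void multiply(const double *restrict A, const double *restrict B,
              double *restrict C, int n)
{
    int i;
    /* Asks the compiler to vectorize the loop; without OpenMP support
     * the pragma is ignored and the loop runs as plain scalar code. */
    #pragma omp simd
    for (i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}
```

With 512-bit vector registers, one vector instruction can process eight `double` elements of such a loop at once.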

16

Vectorization of a loop

Autovectorization

Execute more than one iteration of the loop at the same time

Requirements:

- straight-line code

- number of iterations must be known

- no loop-carried dependencies

- no special operators

- must be the inner loop

17

Example

Can be vectorized by compiler

    for (i = 1; i < MAX; i++) {
        a[i] = b[i] + c[i];
        d[i] = e[i] - a[i-1];
    }

Cannot be vectorized by compiler

    for (i = 1; i < MAX; i++) {
        d[i] = e[i] - a[i-1];
        a[i] = b[i] + c[i];
    }
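
In the second loop above, each iteration reads a[i-1] before the vector of new a values would have been stored, which is what blocks vectorization. One common workaround, not shown on the slides, is to split the loop in two so that neither part carries a dependency between iterations. A sketch, assuming the arrays do not overlap; the function names are illustrative:

```c
/* Original form: d[i] reads a[i-1], which the previous iteration wrote. */
void dependent_loop(float *a, const float *b, const float *c,
                    float *d, const float *e, int max)
{
    for (int i = 1; i < max; i++) {
        d[i] = e[i] - a[i - 1];
        a[i] = b[i] + c[i];
    }
}

/* Split form: the first loop finishes all writes to a before the second
 * loop reads them, so each loop on its own has independent iterations. */
void split_loops(float *a, const float *b, const float *c,
                 float *d, const float *e, int max)
{
    for (int i = 1; i < max; i++)
        a[i] = b[i] + c[i];
    for (int i = 1; i < max; i++)
        d[i] = e[i] - a[i - 1];
}
```

Both functions produce identical results: a[0] is never written, and every other a[i-1] holds its new value by the time it is read in either version.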

18

Comparison with Tesla K40

[Bar chart: Xeon Phi 7120A vs. Tesla K40. Categories: price ($), number of cores, base core clock (MHz), single-precision GFlops, double-precision GFlops, amount of main memory, memory bandwidth, TDP. Value axis from 0 to 5000.]

19

Who did what

Thomas Lange: slides 9 to 17

Philipp Bartels: slides 1 to 8 and 18
