1
Xeon Phi: Architecture
General information
Philipp Bartels
Thomas Lange
2
TIANHE-2
32,000 CPUs: Xeon E5-2692 v2
48,000 accelerators: Xeon Phi 31S1P
Theoretical peak: 54,902.4 TFlop/s (double precision)
Linpack performance: 33,862.7 TFlop/s
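The two performance figures above can be related by a simple ratio. The following is a minimal sketch of that arithmetic; the function and constant names are made up for illustration and do not come from the slides.

```c
#include <assert.h>

/* Fraction of the theoretical peak reached in a benchmark run.
   Both arguments must use the same unit (here: TFlop/s). */
double efficiency(double measured, double peak)
{
    assert(peak > 0.0);
    return measured / peak;
}

/* Figures taken from the Tianhe-2 slide (illustrative constant names). */
static const double TIANHE2_PEAK_TFLOPS    = 54902.4;
static const double TIANHE2_LINPACK_TFLOPS = 33862.7;
```

Calling efficiency(TIANHE2_LINPACK_TFLOPS, TIANHE2_PEAK_TFLOPS) gives roughly 0.617, so the Linpack run reaches a little under 62% of the theoretical peak.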
3
4
5
- Accelerator / coprocessor
- General-purpose cores (57-61)
- Embedded Linux
6
More characteristics
Can be assigned its own IP address
x86-64 instruction set
Extension: Initial Many Core Instructions (IMCI)
4-way hyper-threading (four hardware threads per core)
512-bit vector registers
7
Xeon Phi 7120D
RCP: $4,235 (Amazon.com: $3,507.82)
61 cores (1.238 GHz each)
30.5 MB L2 cache in total
Main memory: 16 GB GDDR5
TDP: 300 W
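The core count, clock rate and 512-bit vector width can be combined into a rough theoretical peak. The sketch below assumes one fused multiply-add per vector lane per cycle (2 flops), which the slides do not state; the function names are hypothetical.

```c
#include <assert.h>

/* Number of elements of a given width that fit into one vector register. */
int simd_lanes(int register_bits, int element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Theoretical peak in GFlop/s:
   cores * clock (GHz) * lanes * flops per lane per cycle.
   flops_per_lane = 2 is an assumption (one fused multiply-add per cycle). */
double peak_gflops(int cores, double clock_ghz, int lanes, int flops_per_lane)
{
    return cores * clock_ghz * lanes * flops_per_lane;
}
```

With the 7120D figures, peak_gflops(61, 1.238, simd_lanes(512, 64), 2) comes to about 1208 GFlop/s in double precision. Using simd_lanes(512, 32) for single precision doubles that value.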
8
10
Parallel programming models
OpenMP
OpenACC
Intel Cilk Plus
Intel TBB
OpenCL
11
Pragma example
#pragma offload target (mic) in(...) inout(...)
{
    #pragma omp parallel for
    for (i=0; i<n; i++) {
        c[i] = 2 * a[i] + b[i];
    }
}
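To try the loop from the pragma example without a coprocessor, the kernel can be wrapped in an ordinary function. This is only a sketch: the function name scale_add and the float element type are assumptions, and the Intel-specific offload pragma is left out because it needs the Intel compiler and its data clauses are elided on the slide.

```c
#include <assert.h>

/* Computes c[i] = 2 * a[i] + b[i] for all i in [0, n).
   The OpenMP pragma is ignored when OpenMP is not enabled, so the
   function also builds as plain serial C. The "#pragma offload"
   line from the slide is deliberately omitted here. */
void scale_add(const float *a, const float *b, float *c, int n)
{
    assert(a && b && c && n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = 2 * a[i] + b[i];
    }
}
```

Each iteration writes a different element of c and reads only a and b, which is why the iterations can be distributed over threads without synchronization.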
12
Intel Cilk Plus
3 simple keywords: cilk_for, cilk_spawn, cilk_sync
Array notation
SIMD-enabled functions
#pragma simd
13
Example
cilk_for (int i = 0; i < 8; ++i)
{
    do_work(i);
}

int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}
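A Cilk program keeps its meaning when the three keywords are removed (its serial elision), which makes the example easy to check with any C compiler. The sketch below assumes no Cilk Plus compiler is available and defines the keywords away with macros; fill_fib is a hypothetical stand-in for the do_work loop.

```c
#include <assert.h>

/* Assumption: no Cilk Plus compiler at hand, so the keywords are defined
   away. With a Cilk Plus compiler, include <cilk/cilk.h> instead. */
#define cilk_spawn
#define cilk_sync
#define cilk_for for

int fib(int n)
{
    assert(n >= 0);
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n - 1);  /* may run in parallel with the caller */
    int y = fib(n - 2);
    cilk_sync;                      /* wait for the spawned call to finish */
    return x + y;
}

/* Hypothetical helper: every iteration writes its own element of out,
   so the iterations are independent and suitable for cilk_for. */
void fill_fib(int *out, int count)
{
    cilk_for (int i = 0; i < count; ++i) {
        out[i] = fib(i);
    }
}
```

In this serial form fib(10) returns 55. With real Cilk Plus keywords the result is the same and only the scheduling changes.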
14
Vectorization
15
Vectorization
Perform the same operation on multiple data elements in a single instruction
#pragma omp simd
for (i = 0; i < 1024; i++)
    C[i] = A[i]*B[i];
// array notation in Intel Cilk Plus (section syntax is [start:length])
for (i = 0; i < 1024; i+=4)
    C[i:4] = A[i:4]*B[i:4];
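What the vectorized loop does can be mimicked in plain C by processing a fixed number of elements per outer iteration. This is a sketch for illustration only: VEC_WIDTH is a hypothetical width, and the inner loop merely stands in for one vector instruction rather than generating one.

```c
#include <assert.h>

#define VEC_WIDTH 4  /* hypothetical vector width in elements */

/* One element per iteration. */
void mul_scalar(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

/* VEC_WIDTH elements per outer iteration; the inner loop stands in for
   a single vector instruction. n must be a multiple of VEC_WIDTH. */
void mul_chunked(const float *A, const float *B, float *C, int n)
{
    assert(n % VEC_WIDTH == 0);
    for (int i = 0; i < n; i += VEC_WIDTH)
        for (int lane = 0; lane < VEC_WIDTH; lane++)
            C[i + lane] = A[i + lane] * B[i + lane];
}
```

Both functions produce identical results. The chunked form just makes explicit that the outer loop runs n / VEC_WIDTH times.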
16
Vectorization of a loop
Autovectorization
Execute more than one iteration of the loop at the same time
Requirements:
- straight-line code
- number of iterations must be known
- no loop-carried dependencies
- no special operators
- must be the innermost loop
17
Example
Can be vectorized by the compiler
for (i=1; i<MAX; i++) {
    a[i] = b[i] + c[i];
    d[i] = e[i] - a[i-1];
}
Cannot be vectorized by the compiler
for (i=1; i<MAX; i++) {
    d[i] = e[i] - a[i-1];
    a[i] = b[i] + c[i];
}
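The difference between the two loops shows up when each statement is executed for a whole chunk of iterations before the next statement starts, which is roughly what a vectorized loop does. The sketch below simulates that order in plain C; the chunk width W and all function names are assumptions made for illustration.

```c
#include <assert.h>

#define W 4  /* hypothetical vector width in iterations */

/* Original order: both statements per iteration.
   a_first != 0 selects the first loop, a_first == 0 the second. */
void run_scalar(int a_first, float *a, const float *b, const float *c,
                float *d, const float *e, int max)
{
    for (int i = 1; i < max; i++) {
        if (a_first) {
            a[i] = b[i] + c[i];
            d[i] = e[i] - a[i - 1];
        } else {
            d[i] = e[i] - a[i - 1];
            a[i] = b[i] + c[i];
        }
    }
}

/* Simulated vector order: the first statement runs for a whole chunk
   of W iterations, then the second statement runs for the same chunk. */
void run_chunked(int a_first, float *a, const float *b, const float *c,
                 float *d, const float *e, int max)
{
    for (int i = 1; i < max; i += W) {
        int end = (i + W < max) ? i + W : max;
        for (int pass = 0; pass < 2; pass++) {
            int do_a = (pass == 0) == (a_first != 0);
            for (int k = i; k < end; k++) {
                if (do_a)
                    a[k] = b[k] + c[k];
                else
                    d[k] = e[k] - a[k - 1];
            }
        }
    }
}

/* Returns 1 if x and y agree in all n elements, otherwise 0. */
int same(const float *x, const float *y, int n)
{
    for (int i = 0; i < n; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

For the first loop both orders yield the same d, because every a[i-1] has already been written when it is read. For the second loop the chunked order reads stale values of a[i-1] inside a chunk, so d differs from the scalar result whenever the old contents of a are not already equal to b + c.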
18
Comparison with Tesla K40
[Bar chart: Xeon Phi 7120A vs. Tesla K40, compared by price ($), number of cores, base core clock (MHz), single-precision GFlops, double-precision GFlops, amount of main memory, memory bandwidth and TDP; axis from 0 to 5000]
19
Who did what
Thomas Lange: slides 9 to 17
Philipp Bartels: slides 1 to 8 and 18