From multicore to manycore: configuration, compilation, and execution practices on the coprocessor...
DESCRIPTION
Talk delivered by Luciano Palma at the Intel Software Conference on August 6 (NCC/UNESP/SP) and August 12 (COPPE/UFRJ/RJ).

TRANSCRIPT
© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor
Leo Borges ([email protected])
Intel - Software and Services Group
iStep-Brazil, August 2013
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism
• Conclusions & References
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
Efficient vectorization, threading, and parallel execution drives higher performance for many applications
[Chart: modeled performance (1.0x to 7.0x) as a function of fraction parallel (0 to 0.80) and % vector code (0% to 100%); gains from vectorization and parallelization multiply]
Big Gains for Selected Applications (vectorize, parallelize, scale to manycore):
• Medical imaging and biophysics
• Computer Aided Design & Manufacturing
• Climate modeling & weather prediction
• Financial analyses, trading
• Energy & oil exploration
• Digital content creation
Evaluating Your Applications for Intel® Xeon Phi™

Ask three questions; a YES to any of them suggests the coprocessor is a fit:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?

Use Intel® Xeon Phi™ coprocessors for applications that scale with:
• Threads • Vectors • Memory Bandwidth
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism
• Conclusions & References
Intel® Xeon Phi™ Product Family, Based on the Intel MIC Architecture

Intel Many Integrated Core (MIC, pronounced "Mike"): a product family/architecture for highly parallel applications
• Based on a large number of smaller, low-power Intel Architecture cores
• 512-bit wide vector engine
• Complements the Intel Xeon processor product line
• Provides breakthrough performance for highly parallel apps
  – Familiar x86 programming model
  – Same source code supports both Intel Xeon processors and Intel Xeon Phi coprocessors
  – Initially a coprocessor with PCI Express form factor

First products announced at SC12, code-named Knights Corner (KNC):
• Up to 61 cores, 4 threads per core
• Up to 16GB GDDR5 memory (up to 352 GB/s)
• 225–300W (both passive- and active-cooling SKUs)
• x16 PCIe form factor (requires an IA host)
Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-threaded execution unit

• >50 in-order cores
• Ring interconnect
• 64-bit addressing
• Scalar unit based on the Intel® Pentium® processor family
• Two pipelines
  – Dual issue with scalar instructions
  – One-per-clock scalar pipeline throughput
  – 4-clock latency from issue to resolution
• 4 hardware threads per core
  – Each thread issues instructions in turn
  – Round-robin execution hides scalar unit latency

[Diagram of a core: Instruction Decode feeding a Scalar Unit (with scalar registers) and a Vector Unit (with vector registers); 32K L1 I-cache and 32K L1 D-cache; 512K L2 cache; ring stop]
Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-threaded vector unit

• All-new vector unit, optimized for single and double precision
• 512-bit SIMD instructions (not Intel® SSE, MMX™, or Intel® AVX)
• 32 vector registers, each 512 bits wide
  – Each holds 16 singles or 8 doubles
• Fully coherent L1 and L2 caches

Takeaway: vectorization is important.
Individual cores are tied together via fully coherent caches into a bidirectional ring

• L1: 32K I-cache + 32K D-cache per core; 3-cycle access; up to 8 concurrent accesses
• L2: 512K cache per core; 11-cycle best access; up to 32 concurrent accesses
• GDDR5 memory: 16 memory channels at up to 5.5 GT/s each; 16 GB; ~300ns access
• Bidirectional ring: 115 GB/s
• Distributed Tag Directory (DTD) reduces ring snoop traffic
• The PCIe port has its own ring stop

Takeaway: parallelization and data placement are important.
Each Intel® Xeon Phi™ coprocessor can be addressed as an individual node in the cluster, with 6 to 16 GB of GDDR5 memory on board.
Intel® Xeon Phi™ Coprocessors

3 Family: outstanding parallel computing solution; performance/$ leadership
• 3120P, 3120A: 6GB GDDR5, 240 GB/s, >1 TFlops DP

5 Family: optimized for high-density environments; performance/watt leadership
• 5120P, 5120D: 8GB GDDR5, >300 GB/s, >1 TFlops DP

7 Family: highest level of features; performance leadership
• 7120P, 7120X: 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance Considerations
• Performance and Thread Parallelism
• Conclusions & References
Reminder: Vectorization, What Is It?

Scalar: one instruction performs one mathematical operation (C = A + B).

Vector: one instruction performs eight mathematical operations¹, e.g. adding a[i..i+7] and b[i..i+7] into c[i..i+7] for the loop:

for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

1. The number of operations per instruction varies with which SIMD instruction is used and the width of the operands.

• Vectorization is core-level parallelism.
SIMD Vector Instructions per Family

Family        Instruction  Instruction Width  Operand Width  Ops per Instruction
Westmere      SSE          128-bit            32-bit (SP)    4
Westmere      SSE          128-bit            64-bit (DP)    2
Sandy Bridge  AVX          256-bit            32-bit (SP)    8
Sandy Bridge  AVX          256-bit            64-bit (DP)    4
Xeon Phi      MIC ISA      512-bit            32-bit (SP)    16
Xeon Phi      MIC ISA      512-bit            64-bit (DP)    8

Each doubling of instruction width (128 to 256 to 512 bits) doubles the operations per instruction (the 2X steps in the original figure).
Theoretical Peak Flops on Xeon and Xeon Phi

Sandy Bridge/Ivy Bridge: two 256-bit SIMD pipes per cycle
• 8 MUL (32b) and 8 ADD (32b): 16 single-precision flops/cycle
• 4 MUL (64b) and 4 ADD (64b): 8 double-precision flops/cycle

Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 Gflops/sec SP
8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 Gflops/sec DP

Xeon Phi: one 512-bit SIMD FMA per cycle
• 16 MUL (32b) and 16 ADD (32b): 32 single-precision flops/cycle
• 8 MUL (64b) and 8 ADD (64b): 16 double-precision flops/cycle

Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 Gflops/sec SP
16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 Gflops/sec DP
Theoretical Memory Bandwidth on Xeon and Xeon Phi

Basic rule for theoretical memory BW [bytes/second]:
[bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets

Sandy Bridge/Ivy Bridge: 4 channels, 2 sockets, and 1600/1866 MHz memory
• 8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
• 8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP

Xeon Phi: 16 channels, 5.5 GT/s memory
• 4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s*) on KNC 7120X
*ECC enabled
Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL (higher is better)

Benchmark            2S Intel® Xeon®   Intel® Xeon Phi™   Speedup      Note
STREAM Triad (GB/s)  75                171                up to 2.2X   ECC on
SMP Linpack (GF/s)   330               802                up to 2.4X   75% efficient
DGEMM (GF/s)         347               887                up to 2.5X   83% efficient
SGEMM (GF/s)         728               1,796              up to 2.4X   84% efficient

Notes:
1. Intel® Xeon® processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold" SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster++
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin
++ Measured on the TACC+ Stampede cluster
Coprocessor results: benchmark run 100% on the coprocessor, with no help from the Intel® Xeon® processor host (aka native)
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Native, Offload and Variations
• Performance and Thread Parallelism
• Conclusions & References
Wide Spectrum of Execution Models

Range of models to meet application needs, from multicore-centric (Intel® Xeon® processors) to many-core-centric (Intel® Many Integrated Core coprocessors):
• Multi-core-hosted: general-purpose serial and parallel computing, entirely on the host
• Offload: codes with highly parallel phases offload those phases to the coprocessor
• Symmetric: codes with balanced needs run ranks on both host and coprocessor
• Many-core-hosted: highly parallel codes run entirely on the coprocessor
The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor

[Diagram: the host processor and the Intel® Xeon Phi™ coprocessor each run a Linux* OS with user-level and system-level code, connected over the PCI-E bus. The host side carries the coprocessor support libraries, tools, and drivers; the coprocessor side carries the communication and application-launch support.]
Runs either as an accelerator for offloaded host computation…

[Diagram: a host-side offload application (user code plus offload libraries, user-level driver, and user-accessible APIs and libraries) drives a target-side offload application (user code plus offload libraries and user-accessible APIs and libraries) across the PCI-E bus.]

Advantages:
• More memory available
• Better file access
• Host better on serial code
• Better use of resources
…Or runs as a native or MPI* compute node via IP or OFED

[Diagram: a target-side "native" application (user code plus standard OS libraries and any 3rd-party or Intel libraries) is reached through an ssh or telnet connection to the coprocessor IP address (a virtual terminal session), or over the IB fabric.]

Advantages:
• Simpler model: no directives, easier port
• Good kernel test

Use if:
• Not serial
• Modest memory
• Complex code
Intel® Xeon Phi™ Coprocessor Becomes a Network Node

[Diagram: each Intel® Xeon® processor and its Intel® Xeon Phi™ coprocessor pair is joined by a virtual network connection, repeated across the cluster.]

Intel® Xeon Phi™ architecture + Linux enables IP addressability.
Flexible: Enables Multiple Programming Models

• Coprocessor only: a homogeneous network of many-core CPUs; MPI ranks and their data live entirely on the coprocessors
• Host + Offload: a homogeneous network of heterogeneous nodes; MPI ranks on the hosts offload work to the coprocessors
• Symmetric: a heterogeneous network of homogeneous CPUs; MPI ranks and data are distributed across both hosts and coprocessors
The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor

Xeon Phi can work as a node:
• Authenticated users can treat it like another node
• Add -mmic to compiles to create native programs
• Intel MPSS supplies a virtual FS and native execution

Build a native binary on the host, copy the needed libraries and the binary to the card, and run it:

icc -O3 -g -mmic -o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 "/tmp/native.exe <my-args>"

The card can be inspected directly:

ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
 PID  PPID USER     STAT   VSZ  %MEM CPU %CPU COMMAND
7265  7264 fdkew    R     7060   0.0  14  0.3 top
  43     2 root     SW       0   0.0  13  0.0 [ksoftirqd/13]
5748     1 root     S     119m   1.5 226  0.0 ./sep_mic_server3.8
5670     1 micuser  S    97872   1.2   0  0.0 /bin/coi_daemon --coiuser=micuser
Xeon Phi can work as a coprocessor: compiler-assisted offload examples

• Offload a section of code to the coprocessor:

#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) \
    in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) \
    out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

• Offload any function call to the coprocessor:

float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
    float t = (float)((i+0.5f)/count);
    pi += 4.0f/(1.0f+t*t);
}
pi /= count;
Compiler Assisted Offload: Example

An example in Fortran:

!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
Offload directives are independent of function boundaries

Execution:
• If, at the first offload, the target is available, the target program is loaded
• At each offload, if the target is available the statement runs on the target; otherwise it runs on the host
• At program termination the target program is unloaded

Source as written for the host (Intel® Xeon® processor):

f() {
    #pragma offload
    a = b + g();
    h();
}

__attribute__ ((target(mic)))
g() {
    ...
}

h() {
    ...
}

What the compiler builds for the target (Intel® Xeon Phi™ coprocessor):

f_part1() {
    a = b + g();
}

__attribute__ ((target(mic)))
g() {
    ...
}
Example: share work between coprocessor and host using OpenMP*

The top level runs on the host; the first section offloads its loop to the coprocessor, while the second section runs on the host:

omp_set_nested(1);
#pragma omp parallel private(ip)
{
    #pragma omp sections
    {
        #pragma omp section
        /* use pointer to copy back only part of potential array,
           to avoid overwriting host */
        #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
        #pragma omp parallel for private(ip)
        for (i=0;i<np1;i++) {
            ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
        }

        #pragma omp section
        #pragma omp parallel for private(ip)
        for (i=0;i<np2;i++) {
            pot[i+np1] =
                threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
        }
    }
}
Pragmas and directives mark data and code to be offloaded and executed

C/C++ syntax:
• Offload pragma: #pragma offload <clauses> <statement>
  – Allow the next statement to execute on the coprocessor or the host CPU
• Variable/function offload properties: __attribute__((target(mic)))
  – Compile a function for, or allocate a variable on, both host CPU and coprocessor
• Entire blocks of data/code definitions:
  #pragma offload_attribute(push, target(mic))
  ...
  #pragma offload_attribute(pop)
  – Mark entire files or large blocks of code to compile for both host CPU and coprocessor

Fortran syntax:
• Offload directive: !dir$ omp offload <clauses> <statement>
  – Execute an OpenMP* parallel block on the coprocessor
• !dir$ offload <clauses> <statement>
  – Execute the next statement or function on the coprocessor
• Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,…>
  – Compile a function or variable for CPU and coprocessor
• Entire code blocks: !dir$ offload begin <clauses> … !dir$ end offload
Options on offloads can control data copying and manage coprocessor dynamic allocation

Clauses:
• target(mic[:unit]) – select specific coprocessors
• if (condition) / mandatory – conditional offload: select coprocessor or host compute
• in(var-list [modifiers]) – copy from host to coprocessor
• out(var-list [modifiers]) – copy from coprocessor to host
• inout(var-list [modifiers]) – copy host to coprocessor and back when the offload completes
• nocopy(var-list [modifiers]) – data is local to the target

Modifiers:
• length(N) – copy N elements of the pointer's type
• alloc_if(bool) – allocate coprocessor space on this offload (default: TRUE)
• free_if(bool) – free coprocessor space at the end of this offload (default: TRUE)
• align(N bytes) – specify minimum memory alignment on the coprocessor
• alloc(array-slice) / into(var-expr) – enable partial array allocation and data copy into other vars & ranges
Data Persistence with Compiler Offload

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to the coprocessor; do not de-allocate A and B
#pragma offload target(mic) \
    in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) free_if(0)) \
    in(B:length(NCOLB * LDB) free_if(0)) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to the coprocessor and reuse matrices A and B
#pragma offload target(mic) \
    in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(1)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(1))
{ }
Data Persistence with Compiler Offload (with helper macros)

#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to the coprocessor; do not de-allocate A and B
#pragma offload target(mic) \
    in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) ALLOC) \
    in(B:length(NCOLB * LDB) ALLOC) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to the coprocessor and reuse matrices A and B
#pragma offload target(mic) \
    in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) REUSE) \
    nocopy(B:length(NCOLB * LDB) REUSE) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
    nocopy(A:length(NCOLA * LDA) FREE) \
    nocopy(B:length(NCOLB * LDB) FREE)
To handle more complex data structures on the coprocessor, use Virtual Shared Memory

An identical range of virtual addresses is reserved on both host and coprocessor; changes are shared at offload points, allowing:
• Seamless sharing of complex data structures, including linked lists
• Elimination of manual data marshaling and shared-array management
• Freer use of new C++ features and standard classes

[Diagram: the host VM and the coprocessor VM map the same virtual address range for the offload code in a C/C++ executable.]
Example: Virtual Shared Memory

Data shared between host and Xeon Phi:

// Shared variable declaration
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];

_Cilk_shared void compute_sum()
{
    int i;
    for (i=0; i<SIZE; i++) {
        res[i] = in1[i] + in2[i];
    }
}

(...)

// Call compute_sum on the target
_Cilk_offload compute_sum();
Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries

• Declare virtual shared data using the _Cilk_shared allocation specifier
• Allocate virtual dynamic shared data using these special functions:
  _Offload_shared_malloc(), _Offload_shared_aligned_malloc(),
  _Offload_shared_free(), _Offload_shared_aligned_free()
• Shared data copying occurs automatically around offload sections
  – Memory is only synchronized on entry to or exit from an offload call
  – Only modified data blocks are transferred between host and coprocessor
• Allows transfer of C++ objects
  – Pointers are transportable when they point to "shared" data addresses
• Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code (e.g., locks, critical sections)
• This model is integrated with the Intel® Cilk™ Plus parallel extensions

Note: not supported in Fortran; available for C/C++ only.
Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax:

• Function: int _Cilk_shared f(int x){ return x+1; }
  – Code emitted for host and target; may be called from either side
• Global: _Cilk_shared int x = 0;
  – Datum is visible on both sides
• File/function static: static _Cilk_shared int x;
  – Datum visible on both sides, only to code within the file/function
• Class: class _Cilk_shared x {…};
  – Class methods, members, and operators available on both sides
• Pointer to shared data: int _Cilk_shared *p;
  – p is local (not shared) and can point to shared data
• A shared pointer: int *_Cilk_shared p;
  – p is shared; should only point at shared data
• Entire blocks of code:
  #pragma offload_attribute(push, _Cilk_shared)
  ...
  #pragma offload_attribute(pop)
  – Mark entire files or blocks of code _Cilk_shared using this pragma
Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor:

• Offloading a function call:
  x = _Cilk_offload func(y);
  – func executes on the coprocessor if possible
  x = _Cilk_offload_to (card_num) func(y);
  – func must execute on the specified coprocessor, or an error occurs
• Offloading asynchronously:
  x = _Cilk_spawn _Cilk_offload func(y);
  – func executes on the coprocessor; the continuation is available for stealing
• Offloading a parallel for-loop:
  _Cilk_offload _Cilk_for(i=0; i<N; i++){
      a[i] = b[i] + c[i];
  }
  – The loop executes in parallel on the coprocessor; it is implicitly "un-inlined" as a function call
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism
• Conclusions & References
Comprehensive set of SW tools for Xeon and Xeon Phi programming

• Code analysis: Advisor XE, VTune™ Amplifier XE, Inspector XE, Trace Analyzer
• Programming models: Intel Cilk Plus, Threading Building Blocks, OpenMP, OpenCL, MPI, Offload/Native/MYO
• Libraries & compilers: Math Kernel Library, Integrated Performance Primitives, Intel Compilers
Options for Thread Parallelism, from greatest ease of use / code maintainability to greatest programmer control:

• Intel® Math Kernel Library
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenCL*
• Pthreads* and other threading libraries

A choice of unified programming models targets both the Intel® Xeon® and Intel® Xeon Phi™ architectures!
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism: OpenMP
• Conclusions & References
OpenMP* on the Coprocessor
• The basics work just like on the host CPU
• For both native and offload models
• Need to specify -openmp
• There are 4 hardware thread contexts per core
• Need at least 2 x ncore threads for good performance
– For all except the most memory-bound workloads
– Often, 3x or 4x (number of available cores) is best
– Very different from hyperthreading on the host!
– -opt-threads-per-core=n advises compiler how many threads to optimize for
• If you don’t saturate all available threads, be sure to set KMP_AFFINITY to control thread distribution
Thread Affinity Interface
Allows OpenMP threads to be bound to physical or logical cores
• export environment variable KMP_AFFINITY=
– physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
– compact: assign threads to consecutive h/w contexts on the same physical core (e.g., to benefit from a shared cache)
– scatter: assign consecutive threads to different physical cores (e.g., to maximize access to memory)
– balanced: a blend of compact & scatter (currently only available for the Intel® MIC Architecture)
• Helps optimize access to memory or cache
• Particularly important if all available h/w threads not used
– else some physical cores may be idle while others run multiple threads
• See compiler documentation for (much) more detail
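As a concrete sketch, a native run on a hypothetical 61-core card that uses every hardware thread might be set up like this (the values are illustrative, not recommendations; set the variables in the shell on the coprocessor, e.g. inside an ssh mic0 session, before launching the native binary):

```shell
# 61 cores x 4 hardware thread contexts = 244 threads.
export OMP_NUM_THREADS=244
# 'balanced' spreads threads across cores before filling contexts;
# 'verbose' prints the resulting thread-to-core binding at startup.
export KMP_AFFINITY=balanced,verbose
```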
OpenMP defaults
• OMP_NUM_THREADS defaults to
– 1 x ncore for the host (or 2 x ncore if hyperthreading is enabled)
– 4 x ncore for native coprocessor applications
– 4 x (ncore-1) for offload applications (one core is reserved for offload daemons and the OS)
• Defaults may be changed via environment variables or via API calls on either the host or the coprocessor
Target OpenMP environment (offload)
Use target-specific APIs to set values for the coprocessor target only, e.g.
omp_set_num_threads_target() (called from the host)
omp_set_nested_target(), etc.
• Protect with #ifdef __INTEL_OFFLOAD, which is undefined when compiling with -no-offload
• Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of environment variables using
MIC_ENV_PREFIX=MIC (no underscore)
• Values on MIC no longer default to the values on the host
• Set values specific to MIC using
export MIC_OMP_NUM_THREADS=120 (all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV="OMP_NUM_THREADS=240|KMP_AFFINITY=balanced"
Introduction
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Performance and Thread Parallelism: MKL
Conclusions & References
MKL Usage Models on the Intel® Xeon Phi™ Coprocessor
• Automatic Offload
– No code changes required
– Automatically uses both host and target
– Transparent data transfer and execution management
• Compiler Assisted Offload
– Explicit control of data transfer and remote execution using compiler offload pragmas/directives
– Can be used together with Automatic Offload
• Native Execution
– Uses the coprocessor as an independent node
– Input data is copied to targets in advance
MKL Execution Models, from multicore centric to many-core centric:
• Multicore Hosted (Intel® Xeon® only): general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Many Core Hosted (Intel® Xeon Phi™ only): highly-parallel codes
Work Division Control in MKL Automatic Offload
Example (API call):
mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)
Offloads 50% of the computation to the 1st card only.
Example (environment variable):
MKL_MIC_0_WORKDIVISION=0.5
Offloads 50% of the computation to the 1st card only.
How to Use MKL with Compiler Assisted Offload
• The same way you would offload any function call to the coprocessor.
• An example in C:
#pragma offload target(mic) \
in(transa, transb, N, alpha, beta) \
in(A:length(matrix_elements)) \
in(B:length(matrix_elements)) \
in(C:length(matrix_elements)) \
out(C:length(matrix_elements) alloc_if(0))
{
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
&beta, C, &N);
}
Introduction
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Conclusions
Intel® Xeon Phi™ coprocessor advantages:
• Comparable performance potential to other accelerators
• Faster time to solution due to reduced development effort
• Better investment protection with a single code base for processors and coprocessors
Flexible and wide range of programming models: from pure native to offloaded, and all variants in between
All with the familiar Intel development environment
One Stop Shop for:
Tools & Software Downloads
Getting Started Development Guides
Video Workshops, Tutorials, & Events
Code Samples & Case Studies
Articles, Forums, & Blogs
Associated Product Links
Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer
Thank you.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice