From multicore to manycore: configuration, compilation, and execution practices on the coprocessor...
DESCRIPTION
Talk delivered by Luciano Palma at the Intel Software Conference on August 6 (NCC/UNESP/SP) and August 12 (COPPE/UFRJ/RJ).

TRANSCRIPT
© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
HW and SW Architecture of the Intel® Xeon Phi™ Coprocessor
Leo Borges ([email protected])
Intel - Software and Services Group
iStep-Brazil, August 2013
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism
• Conclusions & References
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
Efficient vectorization, threading, and parallel execution drives higher performance for many applications
[Chart: modeled performance (1.0x to 7.0x) as a function of fraction parallel (0 to 0.80) and % vector code (0% to 100%); gains from vectorization and parallelization multiply]
Big Gains for Selected Applications (vectorize, parallelize, scale to manycore):
• Medical imaging and biophysics
• Computer Aided Design & Manufacturing
• Climate modeling & weather prediction
• Financial analyses, trading
• Energy & oil exploration
• Digital content creation
Evaluating Your Applications for Intel® Xeon Phi™

Ask three questions; a YES to any of them suggests the coprocessor is a fit:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?

Use Intel® Xeon Phi™ coprocessors for applications that scale with:
• Threads • Vectors • Memory Bandwidth
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism
• Conclusions & References
Intel® Xeon Phi™ Product Family, Based on the Intel MIC Architecture

Intel Many Integrated Core (MIC, pronounced "Mike"): a product family/architecture for highly parallel applications
• Based on a large number of smaller, low-power Intel Architecture cores
• 512-bit wide vector engine
• Complements the Intel Xeon processor product line
• Provides breakthrough performance for highly parallel apps
  – Familiar x86 programming model
  – Same source code supports both Intel Xeon processors and Intel Xeon Phi coprocessors
  – Initially a coprocessor with PCI Express form factor

First products announced at SC12, code-named Knights Corner (KNC):
• Up to 61 cores, 4 threads per core
• Up to 16GB GDDR5 memory (up to 352 GB/s)
• 225–300W (both passive- and active-cooling SKUs)
• x16 PCIe form factor (requires an IA host)
Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-threaded execution unit

• >50 in-order cores
• Ring interconnect
• 64-bit addressing
• Scalar unit based on the Intel® Pentium® processor family
• Two pipelines
  – Dual issue with scalar instructions
  – One-per-clock scalar pipeline throughput
  – 4-clock latency from issue to resolution
• 4 hardware threads per core
  – Each thread issues instructions in turn
  – Round-robin execution hides scalar unit latency

[Diagram of a core: Instruction Decode feeding a Scalar Unit (with scalar registers) and a Vector Unit (with vector registers); 32K L1 I-cache and 32K L1 D-cache; 512K L2 cache; ring stop]
Each Intel® Xeon Phi™ coprocessor core is a fully functional multi-threaded vector unit

• All-new vector unit, optimized for single and double precision
• 512-bit SIMD instructions (not Intel® SSE, MMX™, or Intel® AVX)
• 32 vector registers, each 512 bits wide
  – Each holds 16 singles or 8 doubles
• Fully coherent L1 and L2 caches

Takeaway: vectorization is important.
Individual cores are tied together via fully coherent caches into a bidirectional ring

• L1: 32K I-cache + 32K D-cache per core; 3-cycle access; up to 8 concurrent accesses
• L2: 512K cache per core; 11-cycle best access; up to 32 concurrent accesses
• GDDR5 memory: 16 memory channels at up to 5.5 GT/s each; 16 GB; ~300ns access
• Bidirectional ring: 115 GB/s
• Distributed Tag Directory (DTD) reduces ring snoop traffic
• The PCIe port has its own ring stop

Takeaway: parallelization and data placement are important.
Each Intel® Xeon Phi™ coprocessor can be addressed as an individual node in the cluster, with 6 to 16 GB of GDDR5 memory on board.
Intel® Xeon Phi™ Coprocessors

3 Family: outstanding parallel computing solution; performance/$ leadership
• 3120P, 3120A: 6GB GDDR5, 240 GB/s, >1 TFlops DP

5 Family: optimized for high-density environments; performance/watt leadership
• 5120P, 5120D: 8GB GDDR5, >300 GB/s, >1 TFlops DP

7 Family: highest level of features; performance leadership
• 7120P, 7120X: 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance Considerations
• Performance and Thread Parallelism
• Conclusions & References
Reminder: Vectorization, What Is It?

Scalar: one instruction performs one mathematical operation (C = A + B).

Vector: one instruction performs eight mathematical operations¹, e.g. adding a[i..i+7] and b[i..i+7] into c[i..i+7] for the loop:

for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

1. The number of operations per instruction varies with which SIMD instruction is used and the width of the operands.

• Vectorization is core-level parallelism.
SIMD Vector Instructions per Family

Family        Instruction  Instruction Width  Operand Width  Ops per Instruction
Westmere      SSE          128-bit            32-bit (SP)    4
Westmere      SSE          128-bit            64-bit (DP)    2
Sandy Bridge  AVX          256-bit            32-bit (SP)    8
Sandy Bridge  AVX          256-bit            64-bit (DP)    4
Xeon Phi      MIC ISA      512-bit            32-bit (SP)    16
Xeon Phi      MIC ISA      512-bit            64-bit (DP)    8

Each doubling of instruction width (128 to 256 to 512 bits) doubles the operations per instruction (the 2X steps in the original figure).
Theoretical Peak Flops on Xeon and Xeon Phi

Sandy Bridge/Ivy Bridge: two 256-bit SIMD pipes per cycle
• 8 MUL (32b) and 8 ADD (32b): 16 single-precision flops/cycle
• 4 MUL (64b) and 4 ADD (64b): 8 double-precision flops/cycle

Theoretical peak for a 2-socket E5-2697 v2 (12 cores @ 2.7 GHz):
16 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 1036.8 Gflops/sec SP
8 [flops/cycle] * 2 [sockets] * 12 [cores] * 2.7 [Gcycles/sec] = 518.4 Gflops/sec DP

Xeon Phi: one 512-bit SIMD FMA per cycle
• 16 MUL (32b) and 16 ADD (32b): 32 single-precision flops/cycle
• 8 MUL (64b) and 8 ADD (64b): 16 double-precision flops/cycle

Theoretical peak for a KNC 7120X (61 cores @ 1.24 GHz):
32 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 2420.5 Gflops/sec SP
16 [flops/cycle] * 61 [cores] * 1.24 [Gcycles/sec] = 1210.2 Gflops/sec DP
Theoretical Memory Bandwidth on Xeon and Xeon Phi

Basic rule for theoretical memory BW [bytes/second]:
[bytes/channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets

Sandy Bridge/Ivy Bridge: 4 channels, 2 sockets, and 1600/1866 MHz memory
• 8 * 1.600 * 4 * 2 = 102 GB/s peak (ST: 80 GB/s) on SNB-EP
• 8 * 1.866 * 4 * 2 = 120 GB/s peak (ST: 90 GB/s) on IVB-EP

Xeon Phi: 16 channels, 5.5 GT/s memory
• 4 [bytes/channel] * 5.5 [GT/s] * 16 [channels] = 352 GB/s peak (ST: 170 GB/s*) on KNC 7120X
*ECC enabled
Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL (higher is better)

Benchmark            2S Intel® Xeon®   Intel® Xeon Phi™   Speedup      Note
STREAM Triad (GB/s)  75                171                up to 2.2X   ECC on
SMP Linpack (GF/s)   330               802                up to 2.4X   75% efficient
DGEMM (GF/s)         347               887                up to 2.5X   83% efficient
SGEMM (GF/s)         728               1,796              up to 2.4X   84% efficient

Notes:
1. Intel® Xeon® processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with "Gold" SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* cluster++
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin
++ Measured on the TACC+ Stampede cluster
Coprocessor results: benchmark run 100% on the coprocessor, with no help from the Intel® Xeon® processor host (aka native)
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Native, Offload and Variations
• Performance and Thread Parallelism
• Conclusions & References
Wide Spectrum of Execution Models

Range of models to meet application needs, from multicore-centric (Intel® Xeon® processors) to many-core-centric (Intel® Many Integrated Core coprocessors):
• Multi-core-hosted: general-purpose serial and parallel computing, entirely on the host
• Offload: codes with highly parallel phases offload those phases to the coprocessor
• Symmetric: codes with balanced needs run ranks on both host and coprocessor
• Many-core-hosted: highly parallel codes run entirely on the coprocessor
The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor

[Diagram: the host processor and the Intel® Xeon Phi™ coprocessor each run a Linux* OS with user-level and system-level code, connected over the PCI-E bus. The host side carries the coprocessor support libraries, tools, and drivers; the coprocessor side carries the communication and application-launch support.]
Runs either as an accelerator for offloaded host computation…

[Diagram: a host-side offload application (user code plus offload libraries, user-level driver, and user-accessible APIs and libraries) drives a target-side offload application (user code plus offload libraries and user-accessible APIs and libraries) across the PCI-E bus.]

Advantages:
• More memory available
• Better file access
• Host better on serial code
• Better use of resources
…Or runs as a native or MPI* compute node via IP or OFED

[Diagram: a target-side "native" application (user code plus standard OS libraries and any 3rd-party or Intel libraries) is reached through an ssh or telnet connection to the coprocessor IP address (a virtual terminal session), or over the IB fabric.]

Advantages:
• Simpler model: no directives, easier port
• Good kernel test

Use if:
• Not serial
• Modest memory
• Complex code
Intel® Xeon Phi™ Coprocessor Becomes a Network Node

[Diagram: each Intel® Xeon® processor and its Intel® Xeon Phi™ coprocessor pair is joined by a virtual network connection, repeated across the cluster.]

Intel® Xeon Phi™ architecture + Linux enables IP addressability.
Flexible: Enables Multiple Programming Models

• Coprocessor only: a homogeneous network of many-core CPUs; MPI ranks and their data live entirely on the coprocessors
• Host + Offload: a homogeneous network of heterogeneous nodes; MPI ranks on the hosts offload work to the coprocessors
• Symmetric: a heterogeneous network of homogeneous CPUs; MPI ranks and data are distributed across both hosts and coprocessors
The Intel® Manycore Platform Software Stack (Intel® MPSS) provides Linux* on the coprocessor

Xeon Phi can work as a node:
• Authenticated users can treat it like another node
• Add -mmic to compiles to create native programs
• Intel MPSS supplies a virtual FS and native execution

Build a native binary on the host, copy the needed libraries and the binary to the card, and run it:

icc -O3 -g -mmic -o nativeMIC myNativeProgram.c
sudo scp /opt/intel/composerxe/lib/mic/libiomp5.so root@mic0:/lib64
scp native.exe mic0:/tmp
ssh mic0 "/tmp/native.exe <my-args>"

The card can be inspected directly:

ssh mic0 top
Mem: 298016K used, 7578640K free, 0K shrd, 0K buff, 100688K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.6% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 1.00 1.04 1.01 1/2234 7265
 PID  PPID USER     STAT   VSZ  %MEM CPU %CPU COMMAND
7265  7264 fdkew    R     7060   0.0  14  0.3 top
  43     2 root     SW       0   0.0  13  0.0 [ksoftirqd/13]
5748     1 root     S     119m   1.5 226  0.0 ./sep_mic_server3.8
5670     1 micuser  S    97872   1.2   0  0.0 /bin/coi_daemon --coiuser=micuser
Xeon Phi can work as a coprocessor: compiler-assisted offload examples

• Offload a section of code to the coprocessor:

#pragma offload target(mic) \
    in(transa, transb, N, alpha, beta) \
    in(A:length(matrix_elements)) \
    in(B:length(matrix_elements)) \
    in(C:length(matrix_elements)) \
    out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

• Offload any function call to the coprocessor:

float pi = 0.0f;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:pi)
for (i=0; i<count; i++) {
    float t = (float)((i+0.5f)/count);
    pi += 4.0f/(1.0f+t*t);
}
pi /= count;
Compiler Assisted Offload: Example

An example in Fortran:

!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
            A, LDA, B, LDB, BETA, C, LDC )
Offload directives are independent of function boundaries

Execution:
• If, at the first offload, the target is available, the target program is loaded
• At each offload, if the target is available the statement runs on the target; otherwise it runs on the host
• At program termination the target program is unloaded

Source as written for the host (Intel® Xeon® processor):

f() {
    #pragma offload
    a = b + g();
    h();
}

__attribute__ ((target(mic)))
g() {
    ...
}

h() {
    ...
}

What the compiler builds for the target (Intel® Xeon Phi™ coprocessor):

f_part1() {
    a = b + g();
}

__attribute__ ((target(mic)))
g() {
    ...
}
Example: share work between coprocessor and host using OpenMP*

The top level runs on the host; the first section offloads its loop to the coprocessor, while the second section runs on the host:

omp_set_nested(1);
#pragma omp parallel private(ip)
{
    #pragma omp sections
    {
        #pragma omp section
        /* use pointer to copy back only part of potential array,
           to avoid overwriting host */
        #pragma offload target(mic) in(xp) in(yp) in(zp) out(ppot:length(np1))
        #pragma omp parallel for private(ip)
        for (i=0;i<np1;i++) {
            ppot[i] = threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i],yp[i],zp[i]);
        }

        #pragma omp section
        #pragma omp parallel for private(ip)
        for (i=0;i<np2;i++) {
            pot[i+np1] =
                threed_int(x0,xn,y0,yn,z0,zn,nx,ny,nz,xp[i+np1],yp[i+np1],zp[i+np1]);
        }
    }
}
Pragmas and directives mark data and code to be offloaded and executed

C/C++ syntax:
• Offload pragma: #pragma offload <clauses> <statement>
  – Allow the next statement to execute on the coprocessor or the host CPU
• Variable/function offload properties: __attribute__((target(mic)))
  – Compile a function for, or allocate a variable on, both host CPU and coprocessor
• Entire blocks of data/code definitions:
  #pragma offload_attribute(push, target(mic))
  ...
  #pragma offload_attribute(pop)
  – Mark entire files or large blocks of code to compile for both host CPU and coprocessor

Fortran syntax:
• Offload directive: !dir$ omp offload <clauses> <statement>
  – Execute an OpenMP* parallel block on the coprocessor
• !dir$ offload <clauses> <statement>
  – Execute the next statement or function on the coprocessor
• Variable/function offload properties: !dir$ attributes offload:<mic> :: <ret-name> OR <var1,var2,…>
  – Compile a function or variable for CPU and coprocessor
• Entire code blocks: !dir$ offload begin <clauses> … !dir$ end offload
Options on offloads can control data copying and manage coprocessor dynamic allocation

Clauses:
• target(mic[:unit]) – select specific coprocessors
• if (condition) / mandatory – conditional offload: select coprocessor or host compute
• in(var-list [modifiers]) – copy from host to coprocessor
• out(var-list [modifiers]) – copy from coprocessor to host
• inout(var-list [modifiers]) – copy host to coprocessor and back when the offload completes
• nocopy(var-list [modifiers]) – data is local to the target

Modifiers:
• length(N) – copy N elements of the pointer's type
• alloc_if(bool) – allocate coprocessor space on this offload (default: TRUE)
• free_if(bool) – free coprocessor space at the end of this offload (default: TRUE)
• align(N bytes) – specify minimum memory alignment on the coprocessor
• alloc(array-slice) / into(var-expr) – enable partial array allocation and data copy into other vars & ranges
Data Persistence with Compiler Offload

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to the coprocessor; do not de-allocate A and B
#pragma offload target(mic) \
    in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) free_if(0)) \
    in(B:length(NCOLB * LDB) free_if(0)) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to the coprocessor and reuse matrices A and B
#pragma offload target(mic) \
    in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(1)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(1))
{ }
Data Persistence with Compiler Offload (with helper macros)

#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to the coprocessor; do not de-allocate A and B
#pragma offload target(mic) \
    in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) ALLOC) \
    in(B:length(NCOLB * LDB) ALLOC) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to the coprocessor and reuse matrices A and B
#pragma offload target(mic) \
    in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) REUSE) \
    nocopy(B:length(NCOLB * LDB) REUSE) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload_transfer target(mic) \
    nocopy(A:length(NCOLA * LDA) FREE) \
    nocopy(B:length(NCOLB * LDB) FREE)
To handle more complex data structures on the coprocessor, use Virtual Shared Memory

An identical range of virtual addresses is reserved on both host and coprocessor; changes are shared at offload points, allowing:
• Seamless sharing of complex data structures, including linked lists
• Elimination of manual data marshaling and shared-array management
• Freer use of new C++ features and standard classes

[Diagram: the host VM and the coprocessor VM map the same virtual address range for the offload code in a C/C++ executable.]
Example: Virtual Shared Memory

Data shared between host and Xeon Phi:

// Shared variable declaration
_Cilk_shared T in1[SIZE];
_Cilk_shared T in2[SIZE];
_Cilk_shared T res[SIZE];

_Cilk_shared void compute_sum()
{
    int i;
    for (i=0; i<SIZE; i++) {
        res[i] = in1[i] + in2[i];
    }
}

(...)

// Call compute_sum on the target
_Cilk_offload compute_sum();
Virtual Shared Memory uses special allocation to manage data sharing at offload boundaries

• Declare virtual shared data using the _Cilk_shared allocation specifier
• Allocate virtual dynamic shared data using these special functions:
  _Offload_shared_malloc(), _Offload_shared_aligned_malloc(),
  _Offload_shared_free(), _Offload_shared_aligned_free()
• Shared data copying occurs automatically around offload sections
  – Memory is only synchronized on entry to or exit from an offload call
  – Only modified data blocks are transferred between host and coprocessor
• Allows transfer of C++ objects
  – Pointers are transportable when they point to "shared" data addresses
• Well-known methods can be used to synchronize access to shared data and prevent data races within offloaded code (e.g., locks, critical sections)
• This model is integrated with the Intel® Cilk™ Plus parallel extensions

Note: not supported in Fortran; available for C/C++ only.
Data sharing between host and coprocessor can be enabled using this Intel® Cilk™ Plus syntax:

• Function: int _Cilk_shared f(int x){ return x+1; }
  – Code emitted for host and target; may be called from either side
• Global: _Cilk_shared int x = 0;
  – Datum is visible on both sides
• File/function static: static _Cilk_shared int x;
  – Datum visible on both sides, only to code within the file/function
• Class: class _Cilk_shared x {…};
  – Class methods, members, and operators available on both sides
• Pointer to shared data: int _Cilk_shared *p;
  – p is local (not shared) and can point to shared data
• A shared pointer: int *_Cilk_shared p;
  – p is shared; should only point at shared data
• Entire blocks of code:
  #pragma offload_attribute(push, _Cilk_shared)
  ...
  #pragma offload_attribute(pop)
  – Mark entire files or blocks of code _Cilk_shared using this pragma
Intel® Cilk™ Plus syntax can also specify the offloading of computation to the coprocessor:

• Offloading a function call:
  x = _Cilk_offload func(y);
  – func executes on the coprocessor if possible
  x = _Cilk_offload_to (card_num) func(y);
  – func must execute on the specified coprocessor, or an error occurs
• Offloading asynchronously:
  x = _Cilk_spawn _Cilk_offload func(y);
  – func executes on the coprocessor; the continuation is available for stealing
• Offloading a parallel for-loop:
  _Cilk_offload _Cilk_for(i=0; i<N; i++){
      a[i] = b[i] + c[i];
  }
  – The loop executes in parallel on the coprocessor; it is implicitly "un-inlined" as a function call
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism
• Conclusions & References
Comprehensive set of SW tools for Xeon and Xeon Phi programming

• Code analysis: Advisor XE, VTune™ Amplifier XE, Inspector XE, Trace Analyzer
• Programming models: Intel Cilk Plus, Threading Building Blocks, OpenMP, OpenCL, MPI, Offload/Native/MYO
• Libraries & compilers: Math Kernel Library, Integrated Performance Primitives, Intel Compilers
Options for Thread Parallelism, from greatest ease of use / code maintainability to greatest programmer control:

• Intel® Math Kernel Library
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenCL*
• Pthreads* and other threading libraries

A choice of unified programming models targets both the Intel® Xeon® and Intel® Xeon Phi™ architectures!
Agenda:
• Introduction
• High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
• Performance and Thread Parallelism: OpenMP
• Conclusions & References
OpenMP* on the Coprocessor
• The basics work just like on the host CPU
• For both native and offload models
• Need to specify -openmp
• There are 4 hardware thread contexts per core
• Need at least 2 x ncore threads for good performance
– For all except the most memory-bound workloads
– Often, 3x or 4x (number of available cores) is best
– Very different from hyperthreading on the host!
– -opt-threads-per-core=n advises compiler how many threads to optimize for
• If you don’t saturate all available threads, be sure to set KMP_AFFINITY to control thread distribution
Thread Affinity Interface
Allows OpenMP threads to be bound to physical or logical cores
• export environment variable KMP_AFFINITY=
– physical: use all physical cores before assigning threads to other logical cores (other hardware thread contexts)
– compact: assign threads to consecutive h/w contexts on the same physical core (e.g., to benefit from a shared cache)
– scatter: assign consecutive threads to different physical cores (e.g., to maximize access to memory)
– balanced: a blend of compact & scatter (currently only available for the Intel® MIC Architecture)
• Helps optimize access to memory or cache
• Particularly important if all available h/w threads not used
– else some physical cores may be idle while others run multiple threads
• See compiler documentation for (much) more detail
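As a concrete sketch, a native run on a hypothetical 61-core card that uses every hardware thread might be set up like this (the values are illustrative, not recommendations; set the variables in the shell on the coprocessor, e.g. inside an ssh mic0 session, before launching the native binary):

```shell
# 61 cores x 4 hardware thread contexts = 244 threads.
export OMP_NUM_THREADS=244
# 'balanced' spreads threads across cores before filling contexts;
# 'verbose' prints the resulting thread-to-core binding at startup.
export KMP_AFFINITY=balanced,verbose
```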
OpenMP defaults
• OMP_NUM_THREADS defaults to
– 1 x ncore for the host (or 2 x ncore if hyperthreading is enabled)
– 4 x ncore for native coprocessor applications
– 4 x (ncore-1) for offload applications (one core is reserved for offload daemons and the OS)
• Defaults may be changed via environment variables or via API calls on either the host or the coprocessor
Target OpenMP environment (offload)
Use target-specific APIs to set values for the coprocessor target only, e.g.
omp_set_num_threads_target() (called from the host)
omp_set_nested_target(), etc.
• Protect with #ifdef __INTEL_OFFLOAD, which is undefined when compiling with -no-offload
• Fortran: USE MIC_LIB and OMP_LIB; C: #include <offload.h>
Or define MIC-specific versions of environment variables using
MIC_ENV_PREFIX=MIC (no underscore)
• Values on MIC no longer default to the values on the host
• Set values specific to MIC using
export MIC_OMP_NUM_THREADS=120 (all cards)
export MIC_2_OMP_NUM_THREADS=180 (for card #2, etc.)
export MIC_3_ENV="OMP_NUM_THREADS=240|KMP_AFFINITY=balanced"
Introduction
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Performance and Thread Parallelism: MKL
Conclusions & References
MKL Usage Models on the Intel® Xeon Phi™ Coprocessor
• Automatic Offload
– No code changes required
– Automatically uses both host and target
– Transparent data transfer and execution management
• Compiler Assisted Offload
– Explicit control of data transfer and remote execution using compiler offload pragmas/directives
– Can be used together with Automatic Offload
• Native Execution
– Uses the coprocessor as an independent node
– Input data is copied to targets in advance
MKL Execution Models, from multicore centric to many-core centric:
• Multicore Hosted (Intel® Xeon® only): general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Many Core Hosted (Intel® Xeon Phi™ only): highly-parallel codes
Work Division Control in MKL Automatic Offload
Example (API call):
mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)
Offloads 50% of the computation to the 1st card only.
Example (environment variable):
MKL_MIC_0_WORKDIVISION=0.5
Offloads 50% of the computation to the 1st card only.
How to Use MKL with Compiler Assisted Offload
• The same way you would offload any function call to the coprocessor.
• An example in C:
#pragma offload target(mic) \
in(transa, transb, N, alpha, beta) \
in(A:length(matrix_elements)) \
in(B:length(matrix_elements)) \
in(C:length(matrix_elements)) \
out(C:length(matrix_elements) alloc_if(0))
{
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
&beta, C, &N);
}
Introduction
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Performance and Thread Parallelism
Conclusions & References
Conclusions
Intel® Xeon Phi™ coprocessor advantages:
• Comparable performance potential to other accelerators
• Faster time to solution due to reduced development effort
• Better investment protection with a single code base for processors and coprocessors
Flexible and wide range of programming models: from pure native to offloaded, and all variants in between
All with the familiar Intel development environment
One Stop Shop for:
Tools & Software Downloads
Getting Started Development Guides
Video Workshops, Tutorials, & Events
Code Samples & Case Studies
Articles, Forums, & Blogs
Associated Product Links
Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer
Thank you.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice