TRANSCRIPT
Advanced Research Computing
Why Do Parallel Computing?
• Limits of single-CPU computing:
– performance
– available memory
– I/O rates
• Parallel computing allows one to:
– solve problems that don't fit on a single CPU
– solve problems that can't be solved in a reasonable time
• We can solve:
– larger problems
– faster
– more cases
Parallelism is the New Moore’s Law
• Power and energy efficiency impose a key constraint on the design of micro-architectures
• Clock speeds have plateaued
• Hardware parallelism is increasing rapidly to make up the difference
Cluster System Architecture
[Diagram: login nodes and compute nodes connected by an InfiniBand switch hierarchy (TopSpin 270, TopSpin 120) and a GigE switch hierarchy; I/O nodes serve the WORK file system over InfiniBand; a home server (RAID 5) provides the HOME file system over Fibre Channel; login nodes connect to the internet via GigE.]
Blade : Rack : System
• 1 node: 2 x 8 cores = 16 cores
• 1 chassis: 10 nodes = 120 cores
• 1 rack (frame): 4 chassis = 480 cores
• system: 10 racks = 4,800 cores
HPC Trends
Architecture    Code
Single core     Serial
Multicore       OpenMP
GPU             CUDA
Cluster         MPI
Multi-core systems
• Current processors place multiple processor cores on a die
• Communication details are increasingly complex:
– Cache access
– Main memory access
– QuickPath / HyperTransport socket connections
– Node-to-node connection via network
Accelerator-based Systems
• Calculations made in both CPUs and graphics processing units (GPUs)
• No longer limited to single-precision calculations
• Load balancing is critical for performance
• Requires specific libraries and compilers (CUDA, OpenCL)
• Co-processor from Intel: MIC (Many Integrated Core)
Motivation
• Where is unrealized performance, and how do we extract it?
• How broad is the performance impact?
• Hierarchical parallelism
– Increased importance of fine-grained and data parallelism
– More cores available per processor
Where is the Parallelism?
• Level 1: Single instruction, multiple data (SIMD) vector registers within individual CPU cores
• Level 2: Increasing number of cores per CPU
• Level 3: Accelerator-equipped systems
– General-purpose graphics processors (GPGPU)
– Intel Xeon Phi / Many Integrated Core (MIC)
• Level 4: Supercomputing resources
– Large number of compute nodes
– Multiple levels of parallelism
– Increasing heterogeneity in hardware components
Motivations for Multi-threading and Vectorization
• Expose parallelism that is inaccessible using MPI alone
– Fine-grained parallelism
– Task parallelism
• Automatic vectorization (single instruction, multiple data)
– Vector processors are more prevalent and getting wider
– Compilers will vectorize automatically if possible
– Accelerators such as GPUs and the Intel Xeon Phi
• Multi-threaded code is important to efficiently use multi-core processors
– Multi-core CPUs are present in laptops, desktops, and supercomputers
Multi-threaded Programs
• OpenMP: most widely used for CPU-based parallelization and for targeting the Intel Xeon Phi
• OpenACC: primarily used in the development of GPU-based codes
• pthreads; C++11 multithreading features (in the C++ standard, but not yet fully supported by all compilers)
• CUDA
• OpenCL
• Intel Threading Building Blocks (TBB), Cilk++
What is OpenMP?
• API for parallel programming on shared-memory systems
– Parallel "threads"
• Implemented through the use of:
– Compiler directives
– Runtime library
– Environment variables
• Supported in C, C++, and Fortran
• Maintained by the OpenMP Architecture Review Board (http://www.openmp.org/)
Shared Memory
• Your laptop
• Multicore, multiple-memory (NUMA) systems
– HokieOne (SGI UV)
• One node on BlueRidge
OpenMP Constructs
The OpenMP language extensions fall into five groups:
• Parallel control structures: govern the flow of control in the program (parallel directive)
• Work sharing: distributes work among threads (do/parallel do and section directives)
• Data environment: specifies variables as shared or private (shared and private clauses)
• Synchronization: coordinates thread execution (critical and atomic directives, barrier directive)
• Runtime environment: runtime functions and environment variables (omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE)
Factors Affecting Multi-thread Performance
• Avoid overhead of initializing new threads wherever possible
– Bind threads to physical hardware cores
• Cache coherence issues can cause serious performance degradation when memory is written by different cores
– Data for a calculation performed by a particular core should be local to that core
• Avoid synchronization; try to enforce thread safety without serializing code
Single Instruction Multiple Data (SIMD)
• Each clock cycle a processor loads instructions and data on which those instructions operate
• SIMD processors can apply a single instruction to multiple pieces of data in a single clock cycle
• Modern processors increasingly enable or rely on SIMD to achieve high performance:
– Intel Sandy Bridge / Ivy Bridge / Haswell
– AMD Opteron
– IBM Blue Gene/Q
– Accelerators such as GPUs and the Intel Xeon Phi
Auto-Vectorization Summary
• Performance gains from auto-vectorization are not guaranteed:
– Certain algorithms vectorize while others do not
– Problem details can also impact performance
– The compiler and hardware combination affects the efficiency of vectorization
• However:
– SIMD is becoming more prevalent and the speedup can be significant
– SIMD data-structure optimizations provide benefits on both CPUs and accelerators (GPU, Intel Xeon Phi)
Software Challenges for Multi-threading
• Programming models for multi-threading are actively evolving
• Compiler support and performance for different implementations can vary widely
• Tradeoffs between portability and performance
– C++11, OpenMP
– Architecture-specific programming models: Intel Threading Building Blocks, Cilk++, CUDA, OpenCL, etc.
Compiler Auto-Vectorization
• Many compilers can automatically generate vector instructions:
– Intel 13.0
– gcc 4.7
– llvm 3.4
– pgi 14.0
– IBM XL
• How you write your code has a huge impact on whether the compiler will generate vector instructions (and how efficient they will be)
• The performance of the various compilers will vary
Programming Practices that Inhibit Auto-Vectorization
• Loops without a single point of entry and exit
• Branching prevents vectorization
• Data dependencies
– Read after write
– Write after read
– Aliasing may cause the compiler to assume data dependencies exist, for safety!
• Non-contiguous memory accesses
• Function calls within loops
Data Structures and Auto-Vectorization
• Structure of arrays is preferred over array of structures
• Memory alignment has a big impact on how efficiently vectorization is performed
• Example task: add two vectors together to obtain a third vector: C[i] = A[i] + B[i]
Data Structures and Auto-Vectorization
struct ArrayOfStruct {
    double A, B, C;
    void add() { C = A + B; }
};

/* ... some code ... */
ArrayOfStruct *AOS;
AOS = new ArrayOfStruct[SIZE];

for (i = 0; i < SIZE; i++)
    AOS[i].add();
Data Structures and Auto-Vectorization
struct StructOfArrays {
    /* . . . */
    double *A, *B, *C;
    void add() {
        for (i = 0; i < SIZE; i++)
            C[i] = A[i] + B[i];
    }
};

/* ... some code ... */
StructOfArrays SOA(SIZE);

SOA.add();  // Same calculation,
            // different data layout
Data Structures and Auto-Vectorization
• Compilers can often be prompted to print out information about whether vectorization is performed:
    icc -vec-report2 -restrict VecAdd.cpp
• For the "array of structures" loop:
    for (i = 0; i < SIZE; i++)
        AOS[i].add();
• The compiler prints the following:
    remark: loop was not vectorized: vectorization possible but seems inefficient.
Data Structures and Auto-Vectorization
• For the "structure of arrays" loop:
    for (i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
• The compiler prints the following:
    remark: LOOP WAS VECTORIZED
(A structure of arrays is preferred for SIMD computations, including on accelerators like GPUs.)
Data Structures and Auto-Vectorization
// Memory alignment and auto-vectorization
// Little things can make a big difference...
double *A = new double[SIZE];
double *B = new double[SIZE];
double *C = new double[SIZE];

// Explicitly aligning memory is advantageous!
__declspec(align(16)) double A[SIZE];
__declspec(align(16)) double B[SIZE];
__declspec(align(16)) double C[SIZE];
Data Structures and Auto-Vectorization
• Compare the performance:
– Intel Sandy Bridge CPU
– Intel 13.0 compiler
– 256-bit SIMD register (4 x double per instruction)
• Aligned structure of arrays is a clear winner:
– Array of structures = 2.1 seconds
– Structure of arrays = 0.99 seconds (~2x speedup)
– Aligned structure of arrays = 0.6 seconds (~3.5x)