TRANSCRIPT
Advanced Research Computing
Why Do Parallel Computing?
• Limits of single-CPU computing:
– performance
– available memory
– I/O rates
• Parallel computing allows one to:
– solve problems that don't fit on a single CPU
– solve problems that can't be solved in a reasonable time
• We can solve:
– larger problems
– faster
– more cases
Parallelism is the New Moore’s Law
• Power and energy efficiency impose a key constraint on the design of micro-architectures
• Clock speeds have plateaued
• Hardware parallelism is increasing rapidly to make up the difference
Cluster System Architecture
[Diagram: login nodes and compute nodes connected by an InfiniBand switch hierarchy (TopSpin 270, TopSpin 120) and a GigE switch hierarchy; I/O nodes serve the WORK file system over InfiniBand; a home server (RAID 5) provides the HOME file system over Fibre Channel; login nodes connect to the internet via GigE.]
Blade : Rack : System
• 1 node: 2 x 8 cores = 16 cores
• 1 chassis: 10 nodes = 120 cores
• 1 rack (frame): 4 chassis = 480 cores
• system: 10 racks = 4,800 cores
HPC Trends
Architecture    Code
Single core     Serial
Multicore       OpenMP
GPU             CUDA
Cluster         MPI
Multi-core systems
• Current processors place multiple processor cores on a die
• Communication details are increasingly complex:
– Cache access
– Main memory access
– QuickPath / HyperTransport socket connections
– Node-to-node connection via network
Accelerator-based Systems
• Calculations made in both CPUs and graphics processing units (GPUs)
• No longer limited to single-precision calculations
• Load balancing is critical for performance
• Requires specific libraries and compilers (CUDA, OpenCL)
• Co-processor from Intel: MIC (Many Integrated Core)
Motivation
• Where is unrealized performance, and how do we extract it?
• How broad is the performance impact?
• Hierarchical parallelism
– Increased importance of fine-grained and data parallelism
– More cores available per processor
Where is the Parallelism?
• Level 1: Single instruction, multiple data (SIMD) vector registers within individual CPU cores
• Level 2: Increasing number of cores per CPU
• Level 3: Accelerator-equipped systems
– General-purpose graphics processors (GPGPU)
– Intel Xeon Phi / Many Integrated Core (MIC)
• Level 4: Supercomputing resources
– Large number of compute nodes
– Multiple levels of parallelism
– Increasing heterogeneity in hardware components
Motivations for Multi-threading and Vectorization
• Expose parallelism that is inaccessible using MPI alone
– Fine-grained parallelism
– Task parallelism
• Automatic vectorization (single instruction, multiple data)
– Vector processors are more prevalent and getting wider
– Compilers will vectorize automatically if possible
– Accelerators such as GPUs and the Intel Xeon Phi
• Multi-threaded code is important to efficiently use multi-core processors
– Multi-core CPUs are present in laptops, desktops, and supercomputers
Multi-threaded Programs
• OpenMP: most widely used for CPU-based parallelization and for targeting the Intel Xeon Phi
• OpenACC: primarily used in the development of GPU-based codes
• pthreads; C++11 multithreading features (in the C++ standard, but not yet fully supported by all compilers)
• CUDA
• OpenCL
• Intel Threading Building Blocks (TBB), Cilk++
What is OpenMP?
• API for parallel programming on shared-memory systems
– Parallel "threads"
• Implemented through the use of:
– Compiler directives
– Runtime library
– Environment variables
• Supported in C, C++, and Fortran
• Maintained by the OpenMP Architecture Review Board (http://www.openmp.org/)
Shared Memory
• Your laptop
• Multicore, multiple-memory (NUMA) systems
– HokieOne (SGI UV)
• One node on BlueRidge
OpenMP Constructs
The OpenMP language extensions fall into five groups:
• Parallel control structures: govern the flow of control in the program (parallel directive)
• Work sharing: distributes work among threads (do/parallel do and section directives)
• Data environment: specifies variables as shared or private (shared and private clauses)
• Synchronization: coordinates thread execution (critical and atomic directives, barrier directive)
• Runtime environment: runtime functions and environment variables (omp_set_num_threads(), omp_get_thread_num(), OMP_NUM_THREADS, OMP_SCHEDULE)
Factors Affecting Multi-thread Performance
• Avoid overhead of initializing new threads wherever possible
– Bind threads to physical hardware cores
• Cache coherence issues can cause serious performance degradation when memory is written by different cores
– Data for a calculation performed by a particular core should be local to that core
• Avoid synchronization; try to enforce thread safety without serializing code
Single Instruction Multiple Data (SIMD)
• Each clock cycle a processor loads instructions and data on which those instructions operate
• SIMD processors can apply a single instruction to multiple pieces of data in a single clock cycle
• Modern processors increasingly enable or rely on SIMD to achieve high performance:
– Intel Sandy Bridge / Ivy Bridge / Haswell
– AMD Opteron
– IBM Blue Gene/Q
– Accelerators such as GPUs and the Intel Xeon Phi
Auto-Vectorization Summary
• Performance gains from auto-vectorization are not guaranteed:
– Certain algorithms vectorize while others do not
– Problem details can also impact performance
– The compiler and hardware combination affects the efficiency of vectorization
• However:
– SIMD is becoming more prevalent and the speedup can be significant
– SIMD data-structure optimizations provide benefits on both CPUs and accelerators (GPU, Intel Xeon Phi)
Software Challenges for Multi-threading
• Programming models for multi-threading are actively evolving
• Compiler support and performance for different implementations can vary widely
• Tradeoffs between portability and performance
– C++11, OpenMP
– Architecture-specific programming models: Intel Threading Building Blocks, Cilk++, CUDA, OpenCL, etc.
Compiler Auto-Vectorization
• Many compilers can automatically generate vector instructions:
– Intel 13.0
– gcc 4.7
– llvm 3.4
– pgi 14.0
– IBM XL
• How you write your code has a huge impact on whether the compiler will generate vector instructions (and how efficient they will be)
• The performance of the various compilers will vary
Programming Practices that Inhibit Auto-Vectorization
• Loops without a single point of entry and exit
• Branching prevents vectorization
• Data dependencies
– Read after write
– Write after read
– Aliasing may cause the compiler to assume data dependencies exist, for safety!
• Non-contiguous memory accesses
• Function calls within loops
Data Structures and Auto-Vectorization
• Structure of arrays is preferred over array of structures
• Memory alignment has a big impact on how efficiently vectorization is performed
• Example task: add two vectors together to obtain a third vector: C[i] = A[i] + B[i]
Data Structures and Auto-Vectorization
struct ArrayOfStruct {
    double A, B, C;
    void add() { C = A + B; }
};

/* ... some code ... */
ArrayOfStruct *AOS;
AOS = new ArrayOfStruct[SIZE];

for (i = 0; i < SIZE; i++)
    AOS[i].add();
Data Structures and Auto-Vectorization
struct StructOfArrays {
    /* . . . */
    double *A, *B, *C;
    void add() {
        for (i = 0; i < SIZE; i++)
            C[i] = A[i] + B[i];
    }
};

/* ... some code ... */
StructOfArrays SOA(SIZE);

SOA.add();  // Same calculation,
            // different data layout
Data Structures and Auto-Vectorization
• Compilers can often be prompted to print out information about whether vectorization is performed:
    icc -vec-report2 -restrict VecAdd.cpp
• For the "array of structures" loop:
    for (i = 0; i < SIZE; i++)
        AOS[i].add();
• The compiler prints the following:
    remark: loop was not vectorized: vectorization possible but seems inefficient.
Data Structures and Auto-Vectorization
• For the "structure of arrays" loop:
    for (i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
• The compiler prints the following:
    remark: LOOP WAS VECTORIZED
(A structure of arrays is preferred for SIMD computations, including on accelerators like GPUs.)
Data Structures and Auto-Vectorization
// Memory alignment and auto-vectorization
// Little things can make a big difference...
double *A = new double[SIZE];
double *B = new double[SIZE];
double *C = new double[SIZE];

// Explicitly aligning memory is advantageous!
__declspec(align(16)) double A[SIZE];
__declspec(align(16)) double B[SIZE];
__declspec(align(16)) double C[SIZE];
Data Structures and Auto-Vectorization
• Compare the performance:
– Intel Sandy Bridge CPU
– Intel 13.0 compiler
– 256-bit SIMD register (4 x double per instruction)
• Aligned structure of arrays is a clear winner:
– Array of structures = 2.1 seconds
– Structure of arrays = 0.99 seconds (~2x speedup)
– Aligned structure of arrays = 0.6 seconds (~3.5x)