
Page 1:

ADVANCED SCIENTIFIC COMPUTING

Dr.-Ing. Morris Riedel
Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany

Introduction to High Performance Computing
August 23, 2017, Room TG-227

High Performance Computing

Part Two

Page 2:

Outline

Part Two – Introduction to High Performance Computing 2 / 70

Page 3:

Outline

High Performance Computing (HPC) Basics
  Four basic building blocks of HPC
  TOP500 and Performance Benchmarks
  Shared Memory and Distributed Memory Architectures
  Hybrid and Emerging Architectures

HPC Ecosystem Technologies
  Software Environments & Scheduling
  System Architectures & Network Topologies
  Data Access & Large-scale Infrastructures

Parallel Programming Basics
  Message Passing Interface (MPI)
  OpenMP
  GPGPUs
  Selected Programming Challenges

Page 4:

High Performance Computing (HPC) Basics

Page 5:

What is High Performance Computing?

Wikipedia redirects ‘HPC’ to ‘Supercomputer’, which already gives a hint of what the field is generally about:

HPC includes work on ‘four basic building blocks’ in this course:
  Theory (numerical laws, physical models, speed-up performance, etc.)
  Technology (multi-core, supercomputers, networks, storages, etc.)
  Architecture (shared-memory, distributed-memory, interconnects, etc.)
  Software (libraries, schedulers, monitoring, applications, etc.)


A supercomputer is a computer at the frontline of contemporary processing capacity – particularly speed of calculation

[1] Wikipedia ‘Supercomputer’ Online

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 6:

HPC vs. High Throughput Computing (HTC) Systems

High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware, such as high-performance CPU/core interconnections (network interconnection important). These are compute-oriented systems.

High Throughput Computing (HTC) is based on commonly available computing resources, such as commodity PCs and small clusters, that enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPUs/cores (network interconnection less important). These are data-oriented systems.

Page 7:

Parallel Computing

All modern supercomputers depend heavily on parallelism

Often known as ‘parallel processing’ of some problem space
  Tackle problems in parallel to enable the ‘best performance’ possible

‘The measure of speed’ in High Performance Computing matters
  A common measure for parallel computers is established by the TOP500 list
  Based on a benchmark for ranking the best 500 computers worldwide


We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way

[2] Introduction to High Performance Computing for Scientists and Engineers

[3] TOP 500 supercomputing sites

Page 8:

TOP 500 List (June 2017)

(Figure: TOP500 list of June 2017; annotations mark the ‘power challenge’ and the top EU system)

[3] TOP 500 supercomputing sites

Page 9:

LINPACK benchmarks and Alternatives

TOP500 ranking is based on the LINPACK benchmark

  LINPACK covers only a single architectural aspect (‘critics exist’)
  Measures ‘peak performance’: all involved ‘supercomputer elements’ operate at maximum performance
  Available through a wide variety of ‘open source implementations’
  Success via ‘simplicity & ease of use’, thus used for over two decades

Realistic application benchmark suites might be alternatives
  HPC Challenge benchmarks (includes 7 tests)
  JUBE benchmark suite (based on real applications)


LINPACK solves a dense system of linear equations of unspecified size.

[4] LINPACK Benchmark implementation

[5] HPC Challenge Benchmark Suite

[6] JUBE Benchmark Suite

The top 10 systems in the TOP500 list are dominated by companies, e.g. IBM, CRAY, Fujitsu, etc.

Page 10:

Dominant Architectures of HPC Systems

Traditionally two dominant types of architectures
  Shared-Memory Computers
  Distributed-Memory Computers

Often hierarchical (hybrid) systems of both in practice
  The community has been dominated in the last couple of years by x86-based commodity clusters running the Linux OS on Intel/AMD processors

More recently, both of the above are also considered as ‘programming models’


Shared-memory parallelization with OpenMP
Distributed-memory parallel programming with MPI

Page 11:

Shared-Memory Computers

Two varieties of shared-memory systems:
  1. Uniform Memory Access (UMA)
  2. Cache-coherent Nonuniform Memory Access (ccNUMA)

The problem of ‘cache coherence’ (in UMA/ccNUMA)
  Different CPUs use caches to ‘modify the same cache values’
  Consistency between cached data and data in memory must be guaranteed
  ‘Cache coherence protocols’ ensure a consistent view of memory

A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 12:

Shared-Memory with UMA

A socket is a physical package (with multiple cores), typically a replaceable component

Two dual-core chips (2 cores/socket)
  P = processor core
  L1D = Level 1 cache – data (fastest)
  L2 = Level 2 cache (fast)
  Memory = main memory (slow)
  Chipset = enforces cache coherence and mediates connections to memory

UMA systems use ‘flat memory model’: Latencies and bandwidth are the same for all processors and all memory locations.

Also called Symmetric Multiprocessing (SMP)

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 13:

Shared-Memory with ccNUMA

Eight cores (4 cores/socket); L3 = Level 3 cache
Memory interface = establishes a coherent link to enable one ‘logical’ single address space over ‘physically distributed memory’

ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)

Network logic makes the aggregated memory appear as one single address space

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 14:

Programming with Shared Memory using OpenMP

OpenMP is a set of compiler directives to ‘mark parallel regions’
  Bindings are defined for the C, C++, and Fortran languages
  Threads TX are ‘lightweight processes’ that mutually access data

(Figure: threads T1–T5 accessing a shared memory)

Shared-memory programming enables immediate access to all data from all processors without explicit communication

OpenMP is the dominant shared-memory programming standard today (v3)

[7] OpenMP API Specification

Page 15:

Distributed-Memory Computers

Processors communicate via Network Interfaces (NI)
  The NI mediates the connection to a communication network
  From a programming-model view, this pure setup is rarely used today

A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 16:

Programming with Distributed Memory using MPI

No remote memory access on distributed-memory systems
  Requires ‘sending messages’ back and forth between processes PX
  Many free Message Passing Interface (MPI) libraries are available
  Programming is tedious and complicated, but the most flexible method

(Figure: processes P1–P5 exchanging messages)

Distributed-memory programming enables explicit message passing as communication between processors

MPI is dominant distributed-memory programming standard today (v2.2)

[8] MPI Standard

Page 17:

Hierarchical Hybrid Computers

Shared-memory nodes (here ccNUMA) with local NIs
  The NI mediates connections to other remote ‘SMP nodes’

A hierarchical hybrid parallel computer is neither a purely shared-memory nor a purely distributed-memory type system but a mixture of both

Large-scale ‘hybrid’ parallel computers have shared-memory building blocks interconnected with a fast network today

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 18:

Programming Hybrid Systems

Experience from HPC practice
  Most parallel applications still take no notice of the hardware structure
  Use of pure MPI for parallelization remains the dominant programming model (historical reason: old supercomputers were all of the distributed-memory type)

Challenges with the ‘mapping problem’
  The performance of hybrid (as well as pure MPI) codes depends crucially on factors not directly connected to the programming model
  It largely depends on the association of threads and processes to cores

Emerging ‘hybrid programming models‘ use GPGPUs and CPUs

Hybrid systems programming uses MPI as explicit internode communication and OpenMP for parallelization within the node

Page 19:

Emerging HPC System Architecture Developments

Increasing number of other ‘new’ emerging system architectures

General Purpose Computation on Graphics Processing Units (GPGPUs)
  Use of GPUs for general computing instead of computer graphics
  Programming models are OpenCL and Nvidia CUDA
  Getting more and more adopted in many application fields

Field Programmable Gate Arrays (FPGAs)
  Integrated circuits designed to be configured by a user after shipping
  Enable updates of functionality and reconfigurable ‘wired’ interconnects

Cell processors
  Enable the combination of general-purpose cores with co-processing elements that accelerate dedicated forms of computation

These are often in a state of flux or vendor-specific; details are quickly outdated; not tackled in depth in this course

Page 20:

HPC Ecosystem Technologies

Page 21:

HPC Software Environment

Operating System
  In former times often a ‘proprietary OS’, nowadays often a (reduced) ‘Linux’

Scheduling Systems
  Manage the concurrent access of users on supercomputers
  Different scheduling algorithms can be used with different ‘batch queues’
  Examples: SLURM @ JOTUNN cluster, LoadLeveler @ JUQUEEN, etc.

Monitoring Systems
  Monitor and test the status of the system (‘system health checks/heartbeat’)
  Enable a view of the usage of the system per node/rack (‘system load’)
  Examples: LLView, INCA, Ganglia @ JOTUNN cluster, etc.

Performance Analysis Systems
  Measure the performance of an application and recommend improvements
  Examples: SCALASCA, VAMPIR, etc.

HPC systems typically provide a software environment that supports the processing of parallel applications

Page 22:

Example: Ganglia @ Jötunn Cluster


[9] Jötunn Cluster Ganglia Monitoring Online

Page 23:

Scheduling Principles

HPC systems are typically not used in an interactive fashion
  A program application starts ‘processes‘ on processors (‘do a job for a user‘)
  Users of HPC systems send ‘job scripts‘ to schedulers to start programs
  Scheduling enables the sharing of the HPC system with other users
  Closely related to operating systems, with a wide variety of algorithms

E.g. First Come First Serve (FCFS)
  Queues processes in the order that they arrive in the ready queue

E.g. Backfilling
  Enables maximizing cluster utilization and throughput
  The scheduler searches for jobs that can fill gaps in the schedule
  Smaller jobs farther back in the queue run ahead of a job waiting at the front of the queue (but this job should not be delayed by backfilling!)

Scheduling is the method by which user processes are given access to processor time (shared)

Page 24:

Example: Concurrent Usage of a Supercomputer

[10] LLView Tool

Page 25:

System Architectures

HPC systems are very complex ‘machines‘ with many elements
  CPUs & multi-cores with ‘multi-threading‘ capabilities
  Data access levels with different levels of caches
  Network topologies and various interconnects (example: IBM BlueGene/Q)

HPC faced a significant change in practice with respect to performance increases
  Getting more speed for free by waiting for new CPU generations does not work any more
  Multicore processors emerged that require using those multiple resources efficiently in parallel

Page 26:

Example: Supercomputer BlueGene/Q

[10] LLView Tool

Page 27:

Multi-core CPU Processors

Significant advances in CPUs (or microprocessor chips)
  Multi-core architecture with dual, quad, six, or n processing cores
  Processing cores are all on one chip

Multi-core CPU chip architecture
  Hierarchy of caches (on/off chip)
  L1 cache is private to each core (on-chip)
  L2 cache is shared (on-chip)
  L3 cache or dynamic random access memory (DRAM) (off-chip)

Clock rates of single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
Clock-rate increases toward 5 GHz unfortunately reached a limit due to power limitations / heat
Multi-core CPU chips have quad, six, or n processing cores on one chip and use cache hierarchies

[11] Distributed & Cloud Computing Book

Page 28:

Example: BlueGene Architecture Evolution


BlueGene/P

BlueGene/Q

Page 29:

Network Topologies

Large-scale HPC systems have special network setups
  Dedicated I/O nodes, fast interconnects, e.g. InfiniBand (IB)
  Different network topologies, e.g. tree, 5D torus, mesh, etc. (raising challenges in task mappings and communication patterns)

[2] Introduction to High Performance Computing for Scientists and Engineers

Source: IBM

Page 30:

Data Access

P = processor core elements
  Compute: floating-point or integer operations
  Arithmetic units (compute operations)
  Registers (feed those units with operands)

‘Data access‘ for applications/levels
  Registers: ‘accessed without any delay‘
  L1D = Level 1 cache – data (fastest, used normally)
  L2 = Level 2 cache (fast, used often)
  L3 = Level 3 cache (still fast, used less often)
  Main memory (slow, but larger in size)
  Storage media like hard disks, tapes, etc. (too slow to be used in direct computing)

The DRAM gap is the large discrepancy between main memory and cache bandwidths

(Figure: memory hierarchy, faster toward the registers and cheaper toward the storage; storage is too slow for direct computing)

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 31:

HPC Relationship to ‘Big Data‘

[12] F. Berman: ‘Maximising the Potential of Research Data’

Page 32:

Large-scale Computing Infrastructures

Large computing systems are often embedded in infrastructures
  Grid computing for distributed data storage and processing via middleware
  The success of Grid computing became renowned when it was mentioned by Prof. Rolf-Dieter Heuer, CERN Director General, in the context of the Higgs boson discovery:

Other large-scale distributed infrastructures exist
  Partnership for Advanced Computing in Europe (PRACE), EU HPC
  Extreme Science and Engineering Discovery Environment (XSEDE), US HPC

‘Results today only possible due to extraordinary performance of Accelerators – Experiments – Grid computing’.

[13] Grid Computing YouTube Video

Page 33:

Towards new HPC Architectures – DEEP-EST EU Project


(Figure: sketch of the DEEP-EST modular architecture with its node labels: CN nodes with general-purpose CPUs and memory, BN nodes with many-core CPUs, DN nodes with general-purpose CPUs, FPGAs, memory, and NVRAM, a GCE with FPGA, and NAM, mapped to a possible application workload)

[14] DEEP-EST EU Project

Page 34:

Parallel Programming Basics

Page 35:

Distributed-Memory Computers Reviewed

Processors communicate via Network Interfaces (NI)
  The NI mediates the connection to a communication network
  From a programming-model view, this pure setup is rarely used today

A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

Modified from [2] Introduction to High Performance Computing for Scientists and Engineers

Programming Model: Message Passing

Page 36:

Programming with Distributed Memory using MPI

No remote memory access on distributed-memory systems
  Requires ‘sending messages’ back and forth between processes PX
  Many free Message Passing Interface (MPI) libraries are available
  Programming is tedious and complicated, but the most flexible method

(Figure: processes P1–P5 exchanging messages)

Distributed-memory programming enables explicit message passing as communication between processors

MPI is dominant distributed-memory programming standard today (v3.1)

[8] MPI Standard

Page 37:

What is MPI?

‘Communication library’ abstracting from the low-level network view
  Offers 500+ available functions to communicate between computing nodes
  Practice reveals: parallel applications often require just ~12 (!) functions
  Includes routines for efficient ‘parallel I/O’ (using underlying hardware)

Supports ‘different ways of communication’
  ‘Point-to-point communication’ between two computing nodes (P to P)
  Collective functions involve N computing nodes in useful communication

Deployment on supercomputers
  Installed on (almost) all parallel computers
  Bindings for different languages: C, Fortran, Python, R, etc.
  Careful: different versions might be installed

Recall ‘computing nodes’ are independent computing processors (that may also have N cores each) and that are all part of one big parallel computer

Page 38:

Message Passing: Exchanging Data with Send/Receive

Each processor has its own data in its memory that cannot be seen/accessed by other processors

(Figure: an HPC machine with compute nodes P1–P6, each with a processor P and memory M; point-to-point communications move DATA values, e.g. 17 and 06, from one node’s memory into another node’s memory, where they arrive as NEW values)

Page 39:

Collective Functions: Broadcast (one-to-many)

(Figure: broadcast: the root’s DATA value 17 is copied to the other processors, which each store NEW: 17; their local DATA values 06, 19, 80 remain)

Broadcast distributes the same data to many or even all other processors

Page 40:

Collective Functions: Scatter (one-to-many)

Scatter distributes different data to many or even all other processors

(Figure: scatter: the root holds DATA values 10, 20, 30 and sends a different one to each processor, which store NEW: 10, NEW: 20, NEW: 30; local DATA values 06, 19, 80 remain)

Page 41:

Collective Functions: Gather (many-to-one)

Gather collects data from many or even all other processors to one specific processor

(Figure: gather: DATA values 06, 19, 80 from the other processors are collected at one processor as NEW: 06, NEW: 19, NEW: 80)

Page 42:

Collective Functions: Reduce (many-to-one)

Reduce combines collection with computation based on data from many or even all other processors

(Figure: reduce: the DATA values 17, 06, 19, 80 are combined with ‘+’ into NEW: 122 at one processor, a global sum as example)

Usage of reduce includes finding a global minimum or maximum, sum, or product of the different data located at different processors

Page 43:

Is MPI yet another network library?

TCP/IP and socket programming libraries are plentifully available
  Do we need a dedicated communication & network protocol library?
  Goal: simplify parallel programming, focus on applications

Selected reasons
  Designed for performance within large parallel computers (e.g. no security)
  Supports various interconnects between ‘computing nodes’ (hardware)
  Offers various benefits like ‘reliable messages’ or ‘in-order arrivals’

MPI is not designed to handle arbitrary communication in computer networks and is thus very special
  Not good for clients that constantly establish/close connections again and again (this would have very slow performance in MPI)
  Not good for internet chat clients or web service servers in the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)

Page 44:

(MPI) Basic Building Blocks: A main() function

The main() function is automatically started when launching a C program

Normally the ‘return code’ denotes whether the program exit was OK (0) or problematic (-1)

Practice view: resiliency (e.g. automatic restart and error handling) is not part of MPI, therefore the return code is rarely used in practice

‘standard C programming…’

Page 45:

(MPI) Basic Building Blocks: Variables & Output

Libraries can be used by including C header files, here for example the library for screen outputs

Two integer variables that are later useful for working with specific data obtained from the MPI library

Output with printf using the stdio library: ‘Hello World’ and which process out of all n processes is printing

‘standard C programming…’

Page 46:

MPI Basic Building Blocks: Header & Init/Finalize

‘standard C programming including MPI library use…’

Libraries can be used by including C header files, here the MPI library is included

The MPI_Init() function initializes the MPI environment and can take inputs via the main() function arguments

MPI_Finalize() shuts down the MPI environment (after this statement no parallel execution of the code can take place)

Page 47:

MPI Basic Building Blocks: Rank & Size Variables

‘standard C programming including MPI library use…’

MPI_COMM_WORLD communicator constantdenotes the ‘region of communication’, here all processes

The MPI_Comm_size()function determines the overall number of n processes in the parallel program: stores it in variable size

The MPI_Comm_rank() function determines the unique identifier for each process and stores it in the variable rank, with values 0 … n-1

47 / 70


Compiling & Executing an MPI program

Compilers and linkers need information about where include files and libraries can be found, e.g. C header files like 'mpi.h' or Fortran modules via 'use MPI'. Compiling differs for each programming language.

Executing the MPI program on 4 processors: normally via batch system allocations (cf. SLURM on the JÖTUNN cluster); manual start-up example below.

The order of the program's outputs can vary, because the screen I/O is a 'serial resource'.

[Figure: four processes (P), each with its own memory (M), each printing 'hello']

$> mpirun -np 4 ./hello

creates 4 processes that produce output in parallel

48 / 70


Practice: Our 4 CPU Program alongside many other Programs

[10] LLView Tool

Maybe our program!

49 / 70


MPI Communicators

Each MPI activity specifies the context in which a corresponding function is performed: MPI_COMM_WORLD (region/context of all processes)

Create (sub-)groups of the processes / virtual groups of processes

Perform communications easily within these sub-groups only, with well-defined processes

Using communicators wisely in collective functions can reduce the number of affected processors

[15] LLNL MPI Tutorial

50 / 70


Shared-Memory Computers: Reviewed

Two varieties of shared-memory systems: 1. Uniform Memory Access (UMA); 2. Cache-coherent Nonuniform Memory Access (ccNUMA)

A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space

[2] Introduction to High Performance Computing for Scientists and Engineers

Programming model: work on shared address space (‘local access to memory’)

51 / 70


Programming with Shared Memory using OpenMP

OpenMP is a set of compiler directives to 'mark parallel regions'; bindings are defined for the C, C++, and Fortran languages; threads TX are 'lightweight processes' that mutually access data (the shared-memory concept itself is very old, cf. POSIX Threads)

[Figure: threads T1–T5 accessing shared memory]

Shared-memory programming enables immediate access to all data from all processors without explicit communication

OpenMP is the dominant shared-memory programming standard today (v3)

[7] OpenMP API Specification

52 / 70


What is OpenMP?

OpenMP is a standard for specifying 'parallel regions in serial code': defined by major computer hardware/software vendors, hence portable; enables scalability with parallelization constructs without fixed thread numbers; offers a suitable data environment for easier parallel processing of data; uses specific environment variables for clever decoupling of code and problem; included in standard C compiler distributions (e.g. gcc)

Threads are the central entity in OpenMP: threads enable 'work-sharing' and share an address space (where the data resides); threads can be synchronized if needed; a thread is a lightweight process that shares a common address space with other threads; initiating (aka 'spawning') n threads is less costly than n processes (e.g. variable space)

Recall that 'computing nodes' are independent computing processors (each possibly with N cores) that are all part of one big parallel computer

Threads are lightweight processes that work with data in memory

53 / 70


Parallel and Serial Regions

modified from [2] Introduction to High Performance Computing for Scientists and Engineers

fork(), initiated by the master thread (which always exists), creates a team of threads

The team of threads concurrently works on shared-memory data, actively in parallel regions

join() initiates the 'shutdown' of the parallel region and terminates the team of threads

The team of threads may also be put to sleep until the next parallel region begins

The number of threads can be different in each parallel region

OpenMP program

54 / 70


Number of Threads & Scalability

The real number of threads is normally not known at compile time (there are methods for setting it in the program; do not use them!). The number is set in scripts and/or environment variables before executing.

Parallel programming is done without knowing the number of threads.

OpenMP programs should always be written in a way that does not assume a specific number of threads: a scalable program.

int main()
{
  #pragma omp parallel
  printf("Hello World\n");
}

compile & execute:

./helloworld.exe
Hello World
Hello World
Hello World
Hello World

[Figure: master thread mT becomes T0; team of threads T0, T1, T2, T3]

55 / 70


OpenMP Basic Building Blocks: Library & Sentinel

‘standard fortran programming…’

‘standard C/C++ programming…’

The OpenMP library contains OpenMP API definitions

The Sentinel is a special string that starts an OpenMP compiler directive

Practice view: programming OpenMP in C/C++ and Fortran is slightly different, but provides the same basic concepts (e.g. no explicit end of a parallel region in C/C++, local variables, etc.)

56 / 70


OpenMP Basic Building Blocks: Unique Thread IDs

‘standard fortran programming…’

‘standard C/C++ programming…’

do_work_package() routine code is now executed in parallel by each thread

BUT also sub-routines of that routine are now executed in parallel

omp_get_thread_num() function provides unique Thread ID (0…n-1)

omp_get_num_threads() function obtains number of active threads in the current parallel region

57 / 70


OpenMP Basic Building Blocks: Private Variables (Fortran)

PRIVATE defines local variables for each thread

Each thread works independently and thus needs space to ‘store’ local results

Practice view: the real parallelization idea here is in the loop, the simple sum of two arrays. For each value of i, we can compute and store the array values independently from each other.

Same code executed n times with n threads, BUT tid is unique and thus different for each thread

58 / 70


Traditional HelloWorld Example (C/C++)

#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int nthreads, tid;
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads in parallel region = %d\n", nthreads);
    }
  }
  return 0;
}

Simple parallel program: only the master (tid=0) prints how many threads exist in the parallel region.

Shared variable: nthreads; local variable: tid

59 / 70


OpenMP Basic Building Blocks: Loops (do, for in C/C++)

FIRSTPRIVATE() copies the initial value of a shared variable to a local variable (simple initialization here; otherwise problems in the loop)

The DO directive (placed in front of the usual do loop) automatically distributes loop iterations among the threads, as a specifically supported 'work-sharing' construct

Smart programming support by OpenMP: loops are very often part of scientific applications! Less burden for the programmer: no manual definition of local variables (e.g. i is automatically localized)

local sum exists, but where is the global sum?

60 / 70


OpenMP Basic Building Blocks: Critical Regions

A local sum exists in each of the different threads: we now have n local values of the variable sum

Race condition in shared memory: the shared variable pi will be set concurrently by the different threads

The value of pi depends on the exact order in which the threads access pi, so wrong values may be assigned

Critical regions define a region within a parallel region where at most one thread at a time executes code (e.g. sum of new pi based on pi)

61 / 70


OpenMP Basic Building Blocks: Reduction

Several operations are common in scientific applications: +, *, -, &, |, ^, &&, ||, max, min

REDUCTION() with operator + on variable s enables the following: starting with a local copy of s for each thread; during the progress of the parallel region, each local copy of s is accumulated separately by each thread; at the end of the parallel region, the copies are automatically synchronized and accumulated into the resulting master-thread variable

Reduction operations are a smart alternative to manual critical-region definitions around operations on variables

Reduction operation automatically localizes variable

62 / 70


Many-core GPUs

Use of very many simple cores; an architecture oriented towards high-throughput computing; uses massive parallelism by executing a lot of concurrent threads slowly; handles an ever-increasing number of instruction threads; CPUs instead typically execute a single long thread as fast as possible

Many-core GPUs are used in large clusters and within massively parallel supercomputers today; this is named General-Purpose Computing on GPUs (GPGPU)


The Graphics Processing Unit (GPU) is great for data parallelism and task parallelism. Compared to multi-core CPUs, GPUs consist of a many-core architecture with hundreds to even thousands of very simple cores executing threads rather slowly.

[11] Distributed & Cloud Computing Book

63 / 70


GPU Acceleration

GPU accelerator architecture example (e.g. an NVIDIA card): GPUs can have 128 cores on one single GPU chip; each core can work with eight threads of instructions, so the GPU is able to concurrently execute 128 * 8 = 1024 threads. The interaction between CPU and GPU happens via memory transfers and is thus the major (bandwidth) bottleneck.

E.g. applications that use matrix-vector multiplication


GPU acceleration means that GPUs accelerate computing due to massive parallelism with thousands of threads, compared to only a few threads used by conventional CPUs

GPUs are designed to compute large numbers of floating point operations in parallel

[11] Distributed & Cloud Computing Book

64 / 70


NVIDIA Fermi GPU Example


[11] Distributed & Cloud Computing Book

65 / 70


Challenges: Domain Decomposition & Load Imbalance

66 / 70

[16] Map Analysis - Understanding Spatial Patterns and Relationships, Book

Modified from [2] Introduction to High Performance Computing for Scientists and Engineers

unused resources

Load imbalance hampers performance, because some resources are underutilized

boundary halo


Challenges: Ghost/Halo Regions & Stencil Methods

67 / 70

[2] Introduction to High Performance Computing for Scientists and Engineers

3 * 16 = 48    4 * 8 = 32

Stencil-based iterative methods update array elements according to a fixed pattern called ‘stencil‘

The key to stencil methods is their regular structure, mostly implemented using arrays in codes


Lecture Bibliography

68 / 70


Lecture Bibliography

[1] Wikipedia 'Supercomputer', Online: http://en.wikipedia.org/wiki/Supercomputer
[2] Georg Hager & Gerhard Wellein, 'Introduction to High Performance Computing for Scientists and Engineers', Chapman & Hall/CRC Computational Science, ISBN 143981192X, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
[3] TOP500 Supercomputing Sites, Online: http://www.top500.org/
[4] LINPACK Benchmark, Online: http://www.netlib.org/benchmark/hpl/
[5] HPC Challenge Benchmark Suite, Online: http://icl.cs.utk.edu/hpcc/
[6] JUBE Benchmark Suite, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html
[7] The OpenMP API specification for parallel programming, Online: http://openmp.org/wp/openmp-specifications/
[8] The MPI Standard, Online: http://www.mpi-forum.org/docs/
[9] Jötunn HPC Cluster, Online: http://ihpc.is/jotunn/
[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
[11] K. Hwang, G. C. Fox, J. J. Dongarra, 'Distributed and Cloud Computing', Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049
[12] Fran Berman, 'Maximising the Potential of Research Data'
[13] How EMI Contributed to the Higgs Boson Discovery, YouTube Video, Online: http://www.youtube.com/watch?v=FgcoLUys3RY&list=UUz8n-tukF1S7fql19KOAAhw
[14] DEEP-EST EU Project, Online: http://www.deep-projects.eu/
[15] LLNL MPI Tutorial, Online: https://computing.llnl.gov/tutorials/mpi/
[16] Joseph K. Berry, 'Map Analysis: Understanding Spatial Patterns and Relationships', Online: http://www.innovativegis.com/basis/Books/MapAnalysis/Default.htm

69 / 70


70 / 70