
Page 1:

ADVANCED SCIENTIFIC COMPUTING

Dr.-Ing. Morris Riedel
Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany

Introduction to High Performance Computing
August 23, 2017, Room TG-227

High Performance Computing

Part Two

Page 2:

Outline

Part Two – Introduction to High Performance Computing 2 / 70

Page 3:

Outline

High Performance Computing (HPC) Basics
  Four basic building blocks of HPC
  TOP500 and Performance Benchmarks
  Shared Memory and Distributed Memory Architectures
  Hybrid and Emerging Architectures

HPC Ecosystem Technologies
  Software Environments & Scheduling
  System Architectures & Network Topologies
  Data Access & Large-scale Infrastructures

Parallel Programming Basics
  Message Passing Interface (MPI)
  OpenMP
  GPGPUs
  Selected Programming Challenges

Page 4:

High Performance Computing (HPC) Basics

Page 5:

What is High Performance Computing?

Wikipedia redirects ‘HPC’ to ‘Supercomputer’, which already gives a hint of what the field is generally about:

HPC includes work on ‘four basic building blocks’ in this course:
  Theory (numerical laws, physical models, speed-up performance, etc.)
  Technology (multi-core, supercomputers, networks, storages, etc.)
  Architecture (shared-memory, distributed-memory, interconnects, etc.)
  Software (libraries, schedulers, monitoring, applications, etc.)


A supercomputer is a computer at the frontline of contemporary processing capacity – particularly speed of calculation

[1] Wikipedia ‘Supercomputer’ Online

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 6:

HPC vs. High Throughput Computing (HTC) Systems

High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware, such as high-performance CPU/core interconnections (network interconnection important). These are compute-oriented systems.

High Throughput Computing (HTC) is based on commonly available computing resources, such as commodity PCs and small clusters, that enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPUs/cores (network interconnection less important). These are data-oriented systems.

Page 7:

Parallel Computing

All modern supercomputers depend heavily on parallelism

Often known as ‘parallel processing’ of some problem space
  Tackle problems in parallel to enable the ‘best performance’ possible

‘The measure of speed’ in High Performance Computing matters
  A common measure for parallel computers is established by the TOP500 list
  Based on a benchmark for ranking the best 500 computers worldwide


We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way

[2] Introduction to High Performance Computing for Scientists and Engineers

[3] TOP 500 supercomputing sites

Page 8:

TOP 500 List (June 2017)

(Figure: TOP500 list of June 2017; annotations mark the ‘power challenge’ and the top EU system)

[3] TOP 500 supercomputing sites

Page 9:

LINPACK benchmarks and Alternatives

TOP500 ranking is based on the LINPACK benchmark

  LINPACK covers only a single architectural aspect (‘critics exist’)
  Measures ‘peak performance’: all involved ‘supercomputer elements’ operate at maximum performance
  Available through a wide variety of ‘open source implementations’
  Success via ‘simplicity & ease of use’, thus used for over two decades

Realistic application benchmark suites might be alternatives
  HPC Challenge benchmarks (includes 7 tests)
  JUBE benchmark suite (based on real applications)


LINPACK solves a dense system of linear equations of unspecified size.

[4] LINPACK Benchmark implementation

[5] HPC Challenge Benchmark Suite

[6] JUBE Benchmark Suite

The top 10 systems in the TOP500 list are dominated by companies, e.g. IBM, CRAY, Fujitsu, etc.

Page 10:

Dominant Architectures of HPC Systems

Traditionally two dominant types of architectures
  Shared-Memory Computers
  Distributed-Memory Computers

Often hierarchical (hybrid) systems of both in practice
  The community has been dominated in the last couple of years by x86-based commodity clusters running the Linux OS on Intel/AMD processors

More recently, both of the above are also considered as ‘programming models’


Shared-memory parallelization with OpenMP
Distributed-memory parallel programming with MPI

Page 11:

Shared-Memory Computers

Two varieties of shared-memory systems:
  1. Uniform Memory Access (UMA)
  2. Cache-coherent Nonuniform Memory Access (ccNUMA)

The problem of ‘cache coherence’ (in UMA/ccNUMA)
  Different CPUs use caches to ‘modify the same cache values’
  Consistency between cached data and data in memory must be guaranteed
  ‘Cache coherence protocols’ ensure a consistent view of memory

A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 12:

Shared-Memory with UMA

A socket is a physical package (with multiple cores), typically a replaceable component

Two dual-core chips (2 cores/socket)
  P = processor core
  L1D = Level 1 cache – data (fastest)
  L2 = Level 2 cache (fast)
  Memory = main memory (slow)
  Chipset = enforces cache coherence and mediates connections to memory

UMA systems use ‘flat memory model’: Latencies and bandwidth are the same for all processors and all memory locations.

Also called Symmetric Multiprocessing (SMP)

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 13:

Shared-Memory with ccNUMA

Eight cores (4 cores/socket); L3 = Level 3 cache
Memory interface = establishes a coherent link to enable one ‘logical’ single address space over ‘physically distributed memory’

ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)

Network logic makes the aggregated memory appear as one single address space

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 14:

Programming with Shared Memory using OpenMP

OpenMP is a set of compiler directives to ‘mark parallel regions’
  Bindings are defined for the C, C++, and Fortran languages
  Threads TX are ‘lightweight processes’ that mutually access data

(Figure: threads T1–T5 accessing a shared memory)

Shared-memory programming enables immediate access to all data from all processors without explicit communication

OpenMP is the dominant shared-memory programming standard today (v3)

[7] OpenMP API Specification

Page 15:

Distributed-Memory Computers

Processors communicate via Network Interfaces (NI)
  The NI mediates the connection to a communication network
  From a programming-model view, this pure setup is rarely used today

A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 16:

Programming with Distributed Memory using MPI

No remote memory access on distributed-memory systems
  Requires ‘sending messages’ back and forth between processes PX
  Many free Message Passing Interface (MPI) libraries are available
  Programming is tedious and complicated, but the most flexible method

(Figure: processes P1–P5 exchanging messages)

Distributed-memory programming enables explicit message passing as communication between processors

MPI is dominant distributed-memory programming standard today (v2.2)

[8] MPI Standard

Page 17:

Hierarchical Hybrid Computers

Shared-memory nodes (here ccNUMA) with local NIs
  The NI mediates connections to other remote ‘SMP nodes’

A hierarchical hybrid parallel computer is neither a purely shared-memory nor a purely distributed-memory type system but a mixture of both

Large-scale ‘hybrid’ parallel computers have shared-memory building blocks interconnected with a fast network today

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 18:

Programming Hybrid Systems

Experience from HPC practice
  Most parallel applications still take no notice of the hardware structure
  Use of pure MPI for parallelization remains the dominant programming model (historical reason: old supercomputers were all of the distributed-memory type)

Challenges with the ‘mapping problem’
  The performance of hybrid (as well as pure MPI) codes depends crucially on factors not directly connected to the programming model
  It largely depends on the association of threads and processes to cores

Emerging ‘hybrid programming models‘ use GPGPUs and CPUs

Hybrid systems programming uses MPI as explicit internode communication and OpenMP for parallelization within the node

Page 19:

Emerging HPC System Architecture Developments

Increasing number of other ‘new’ emerging system architectures

General Purpose Computation on Graphics Processing Units (GPGPUs)
  Use of GPUs for general computing instead of computer graphics
  Programming models are OpenCL and Nvidia CUDA
  Getting more and more adopted in many application fields

Field Programmable Gate Arrays (FPGAs)
  Integrated circuits designed to be configured by a user after shipping
  Enable updates of functionality and reconfigurable ‘wired’ interconnects

Cell processors
  Enable the combination of general-purpose cores with co-processing elements that accelerate dedicated forms of computation

These are often in a state of flux or vendor-specific; details are quickly outdated; not tackled in depth in this course

Page 20:

HPC Ecosystem Technologies

Page 21:

HPC Software Environment

Operating System
  In former times often a ‘proprietary OS’, nowadays often a (reduced) ‘Linux’

Scheduling Systems
  Manage the concurrent access of users on supercomputers
  Different scheduling algorithms can be used with different ‘batch queues’
  Examples: SLURM @ JOTUNN cluster, LoadLeveler @ JUQUEEN, etc.

Monitoring Systems
  Monitor and test the status of the system (‘system health checks/heartbeat’)
  Enable a view of the usage of the system per node/rack (‘system load’)
  Examples: LLView, INCA, Ganglia @ JOTUNN cluster, etc.

Performance Analysis Systems
  Measure the performance of an application and recommend improvements
  Examples: SCALASCA, VAMPIR, etc.

HPC systems typically provide a software environment that supports the processing of parallel applications

Page 22:

Example: Ganglia @ Jötunn Cluster


[9] Jötunn Cluster Ganglia Monitoring Online

Page 23:

Scheduling Principles

HPC systems are typically not used in an interactive fashion
  A program application starts ‘processes‘ on processors (‘do a job for a user‘)
  Users of HPC systems send ‘job scripts‘ to schedulers to start programs
  Scheduling enables the sharing of the HPC system with other users
  Closely related to operating systems, with a wide variety of algorithms

E.g. First Come First Serve (FCFS)
  Queues processes in the order that they arrive in the ready queue

E.g. Backfilling
  Enables maximizing cluster utilization and throughput
  The scheduler searches for jobs that can fill gaps in the schedule
  Smaller jobs farther back in the queue run ahead of a job waiting at the front of the queue (but this job should not be delayed by backfilling!)

Scheduling is the method by which user processes are given access to processor time (shared)

Page 24:

Example: Concurrent Usage of a Supercomputer

[10] LLView Tool

Page 25:

System Architectures

HPC systems are very complex ‘machines‘ with many elements
  CPUs & multi-cores with ‘multi-threading‘ capabilities
  Data access levels with different levels of caches
  Network topologies and various interconnects (example: IBM BlueGene/Q)

HPC faced a significant change in practice with respect to performance increases
  Getting more speed for free by waiting for new CPU generations does not work any more
  Multicore processors emerged that require using those multiple resources efficiently in parallel

Page 26:

Example: Supercomputer BlueGene/Q

[10] LLView Tool

Page 27:

Multi-core CPU Processors

Significant advances in CPUs (or microprocessor chips)
  Multi-core architecture with dual, quad, six, or n processing cores
  Processing cores are all on one chip

Multi-core CPU chip architecture
  Hierarchy of caches (on/off chip)
  L1 cache is private to each core (on-chip)
  L2 cache is shared (on-chip)
  L3 cache or dynamic random access memory (DRAM) (off-chip)

Clock rates of single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
Clock-rate increases toward 5 GHz unfortunately reached a limit due to power limitations / heat
Multi-core CPU chips have quad, six, or n processing cores on one chip and use cache hierarchies

[11] Distributed & Cloud Computing Book

Page 28:

Example: BlueGene Architecture Evolution


BlueGene/P

BlueGene/Q

Page 29:

Network Topologies

Large-scale HPC systems have special network setups
  Dedicated I/O nodes, fast interconnects, e.g. InfiniBand (IB)
  Different network topologies, e.g. tree, 5D torus, mesh, etc. (raising challenges in task mappings and communication patterns)

[2] Introduction to High Performance Computing for Scientists and Engineers

Source: IBM

Page 30:

Data Access

P = processor core elements
  Compute: floating-point or integer operations
  Arithmetic units (compute operations)
  Registers (feed those units with operands)

‘Data access‘ for applications/levels
  Registers: ‘accessed without any delay‘
  L1D = Level 1 cache – data (fastest, used normally)
  L2 = Level 2 cache (fast, used often)
  L3 = Level 3 cache (still fast, used less often)
  Main memory (slow, but larger in size)
  Storage media like hard disks, tapes, etc. (too slow to be used in direct computing)

The DRAM gap is the large discrepancy between main memory and cache bandwidths

(Figure: memory hierarchy, faster toward the registers and cheaper toward the storage; storage is too slow for direct computing)

[2] Introduction to High Performance Computing for Scientists and Engineers

Page 31:

HPC Relationship to ‘Big Data‘

[12] F. Berman: ‘Maximising the Potential of Research Data’

Page 32:

Large-scale Computing Infrastructures

Large computing systems are often embedded in infrastructures
  Grid computing for distributed data storage and processing via middleware
  The success of Grid computing became renowned when it was mentioned by Prof. Rolf-Dieter Heuer, CERN Director General, in the context of the Higgs boson discovery:

Other large-scale distributed infrastructures exist
  Partnership for Advanced Computing in Europe (PRACE), EU HPC
  Extreme Science and Engineering Discovery Environment (XSEDE), US HPC

‘Results today only possible due to extraordinary performance of Accelerators – Experiments – Grid computing’.

[13] Grid Computing YouTube Video

Page 33:

Towards new HPC Architectures – DEEP-EST EU Project


(Figure: sketch of the DEEP-EST modular architecture with its node labels: CN nodes with general-purpose CPUs and memory, BN nodes with many-core CPUs, DN nodes with general-purpose CPUs, FPGAs, memory, and NVRAM, a GCE with FPGA, and NAM, mapped to a possible application workload)

[14] DEEP-EST EU Project

Page 34:

Parallel Programming Basics

Page 35:

Distributed-Memory Computers Reviewed

Processors communicate via Network Interfaces (NI)
  The NI mediates the connection to a communication network
  From a programming-model view, this pure setup is rarely used today

A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

Modified from [2] Introduction to High Performance Computing for Scientists and Engineers

Programming Model: Message Passing

Page 36:

Programming with Distributed Memory using MPI

No remote memory access on distributed-memory systems
  Requires ‘sending messages’ back and forth between processes PX
  Many free Message Passing Interface (MPI) libraries are available
  Programming is tedious and complicated, but the most flexible method

(Figure: processes P1–P5 exchanging messages)

Distributed-memory programming enables explicit message passing as communication between processors

MPI is dominant distributed-memory programming standard today (v3.1)

[8] MPI Standard

Page 37:

What is MPI?

‘Communication library’ abstracting from the low-level network view
  Offers 500+ available functions to communicate between computing nodes
  Practice reveals: parallel applications often require just ~12 (!) functions
  Includes routines for efficient ‘parallel I/O’ (using underlying hardware)

Supports ‘different ways of communication’
  ‘Point-to-point communication’ between two computing nodes (P to P)
  Collective functions involve N computing nodes in useful communication

Deployment on supercomputers
  Installed on (almost) all parallel computers
  Bindings for different languages: C, Fortran, Python, R, etc.
  Careful: different versions might be installed

Recall ‘computing nodes’ are independent computing processors (that may also have N cores each) and that are all part of one big parallel computer

Page 38:

Message Passing: Exchanging Data with Send/Receive

Each processor has its own data in its memory that cannot be seen/accessed by other processors

(Figure: an HPC machine with compute nodes P1–P6, each with a processor P and memory M; point-to-point communications move DATA values, e.g. 17 and 06, from one node’s memory into another node’s memory, where they arrive as NEW values)

Page 39:

Collective Functions: Broadcast (one-to-many)

(Figure: broadcast: the root’s DATA value 17 is copied to the other processors, which each store NEW: 17; their local DATA values 06, 19, 80 remain)

Broadcast distributes the same data to many or even all other processors

Page 40:

Collective Functions: Scatter (one-to-many)

Scatter distributes different data to many or even all other processors

(Figure: scatter: the root holds DATA values 10, 20, 30 and sends a different one to each processor, which store NEW: 10, NEW: 20, NEW: 30; local DATA values 06, 19, 80 remain)

Page 41:

Collective Functions: Gather (many-to-one)

Gather collects data from many or even all other processors to one specific processor

(Figure: gather: DATA values 06, 19, 80 from the other processors are collected at one processor as NEW: 06, NEW: 19, NEW: 80)

Page 42:

Collective Functions: Reduce (many-to-one)

Reduce combines collection with computation based on data from many or even all other processors

(Figure: reduce: the DATA values 17, 06, 19, 80 are combined with ‘+’ into NEW: 122 at one processor, a global sum as example)

Usage of reduce includes finding a global minimum or maximum, sum, or product of the different data located at different processors

Page 43:

Is MPI yet another network library?

TCP/IP and socket programming libraries are plentifully available
  Do we need a dedicated communication & network protocol library?
  Goal: simplify parallel programming, focus on applications

Selected reasons
  Designed for performance within large parallel computers (e.g. no security)
  Supports various interconnects between ‘computing nodes’ (hardware)
  Offers various benefits like ‘reliable messages’ or ‘in-order arrivals’

MPI is not designed to handle arbitrary communication in computer networks and is thus very special
  Not good for clients that constantly establish/close connections again and again (this would have very slow performance in MPI)
  Not good for internet chat clients or web service servers in the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)

Page 44:

(MPI) Basic Building Blocks: A main() function

The main() function is automatically started when launching a C program

Normally the ‘return code’ denotes whether the program exit was OK (0) or problematic (-1)

Practice view: resiliency (e.g. automatic restart and error handling) is not part of MPI, therefore the return code is rarely used in practice

‘standard C programming…’

Page 45:

(MPI) Basic Building Blocks: Variables & Output

Libraries can be used by including C header files, here for example the library for screen outputs

Two integer variables that are later useful for working with specific data obtained from the MPI library

Output with printf using the stdio library: ‘Hello World’ and which process out of all n processes is printing

‘standard C programming…’

Page 46:

MPI Basic Building Blocks: Header & Init/Finalize

‘standard C programming including MPI library use…’

Libraries can be used by including C header files, here the MPI library is included

The MPI_Init() function initializes the MPI environment and can take inputs via the main() function arguments

MPI_Finalize() shuts down the MPI environment (after this statement no parallel execution of the code can take place)

Page 47:

MPI Basic Building Blocks: Rank & Size Variables

‘standard C programming including MPI library use…’

MPI_COMM_WORLD communicator constantdenotes the ‘region of communication’, here all processes

The MPI_Comm_size()function determines the overall number of n processes in the parallel program: stores it in variable size

The MPI_Comm_rank() function determines the unique identifier for each process and stores it in the variable rank, with values 0 … n-1

47 / 70


Compiling & Executing an MPI program

Compilers and linkers need information about where include files and libraries can be found, e.g. C header files like 'mpi.h' or Fortran modules via 'use MPI'. Compiling differs for each programming language.

Executing the MPI program on 4 processors: normally via batch system allocations (cf. SLURM on the JÖTUNN cluster); manual start-up example below.

The order of the program's outputs can vary, because the screen I/O is a 'serial resource'.

[Figure: four processes (P), each with its own memory (M), each printing 'hello']

$> mpirun -np 4 ./hello

creates 4 processes that produce output in parallel

48 / 70


Practice: Our 4 CPU Program alongside many other Programs

[10] LLView Tool

Maybe our program!

49 / 70


MPI Communicators

Each MPI activity specifies the context in which a corresponding function is performed: MPI_COMM_WORLD (region/context of all processes)

Create (sub-)groups of the processes / virtual groups of processes

Perform communications easily within these sub-groups only, with well-defined processes

Using communicators wisely in collective functions can reduce the number of affected processors

[15] LLNL MPI Tutorial

50 / 70


Shared-Memory Computers: Reviewed

Two varieties of shared-memory systems: 1. Uniform Memory Access (UMA); 2. Cache-coherent Nonuniform Memory Access (ccNUMA)

A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space

[2] Introduction to High Performance Computing for Scientists and Engineers

Programming model: work on shared address space (‘local access to memory’)

51 / 70


Programming with Shared Memory using OpenMP

OpenMP is a set of compiler directives to 'mark parallel regions'; bindings are defined for the C, C++, and Fortran languages; threads TX are 'lightweight processes' that mutually access data (the shared-memory concept itself is very old, cf. POSIX Threads)

[Figure: threads T1–T5 accessing shared memory]

Shared-memory programming enables immediate access to all data from all processors without explicit communication

OpenMP is the dominant shared-memory programming standard today (v3)

[7] OpenMP API Specification

52 / 70


What is OpenMP?

OpenMP is a standard for specifying 'parallel regions in serial code': defined by major computer hardware/software vendors, hence portable; enables scalability with parallelization constructs without fixed thread numbers; offers a suitable data environment for easier parallel processing of data; uses specific environment variables for clever decoupling of code and problem; included in standard C compiler distributions (e.g. gcc)

Threads are the central entity in OpenMP: threads enable 'work-sharing' and share an address space (where the data resides); threads can be synchronized if needed; a thread is a lightweight process that shares a common address space with other threads; initiating (aka 'spawning') n threads is less costly than n processes (e.g. variable space)

Recall that 'computing nodes' are independent computing processors (each possibly with N cores) that are all part of one big parallel computer

Threads are lightweight processes that work with data in memory

53 / 70


Parallel and Serial Regions

modified from [2] Introduction to High Performance Computing for Scientists and Engineers

fork(), initiated by the master thread (which always exists), creates a team of threads

The team of threads concurrently works on shared-memory data, actively in parallel regions

join() initiates the 'shutdown' of the parallel region and terminates the team of threads

The team of threads may also be put to sleep until the next parallel region begins

The number of threads can be different in each parallel region

OpenMP program

54 / 70


Number of Threads & Scalability

The real number of threads is normally not known at compile time (there are methods for setting it in the program; do not use them!). The number is set in scripts and/or environment variables before executing.

Parallel programming is done without knowing the number of threads.

OpenMP programs should always be written in a way that does not assume a specific number of threads: a scalable program.

int main()
{
  #pragma omp parallel
  printf("Hello World\n");
}

compile & execute:

./helloworld.exe
Hello World
Hello World
Hello World
Hello World

[Figure: master thread mT becomes T0; team of threads T0, T1, T2, T3]

55 / 70


OpenMP Basic Building Blocks: Library & Sentinel

‘standard fortran programming…’

‘standard C/C++ programming…’

The OpenMP library contains OpenMP API definitions

The Sentinel is a special string that starts an OpenMP compiler directive

Practice view: programming OpenMP in C/C++ and Fortran is slightly different, but provides the same basic concepts (e.g. no explicit end of a parallel region in C/C++, local variables, etc.)

56 / 70


OpenMP Basic Building Blocks: Unique Thread IDs

‘standard fortran programming…’

‘standard C/C++ programming…’

do_work_package() routine code is now executed in parallel by each thread

BUT also sub-routines of that routine are now executed in parallel

omp_get_thread_num() function provides unique Thread ID (0…n-1)

omp_get_num_threads() function obtains number of active threads in the current parallel region

57 / 70


OpenMP Basic Building Blocks: Private Variables (Fortran)

PRIVATE defines local variables for each thread

Each thread works independently and thus needs space to ‘store’ local results

Practice view: the real parallelization idea here is in the loop, the simple sum of two arrays. For each value of i, we can compute and store the array values independently from each other.

Same code executed n times with n threads, BUT tid is unique and thus different for each thread

58 / 70


Traditional HelloWorld Example (C/C++)

#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int nthreads, tid;
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads in parallel region = %d\n", nthreads);
    }
  }
  return 0;
}

Simple parallel program: only the master (tid=0) prints how many threads exist in the parallel region.

Shared variable: nthreads; local variable: tid

59 / 70


OpenMP Basic Building Blocks: Loops (do, for in C/C++)

FIRSTPRIVATE() copies the initial value of a shared variable to a local variable (simple initialization here; otherwise problems in the loop)

The DO directive (placed in front of the usual do loop) automatically distributes loop iterations among the threads, as a specifically supported 'work-sharing' construct

Smart programming support by OpenMP: loops are very often part of scientific applications! Less burden for the programmer: no manual definition of local variables (e.g. i is automatically localized)

local sum exists, but where is the global sum?

60 / 70


OpenMP Basic Building Blocks: Critical Regions

A local sum exists in each of the different threads: we now have n local values of the variable sum

Race condition in shared memory: the shared variable pi will be set concurrently by the different threads

The value of pi depends on the exact order in which the threads access pi, so wrong values may be assigned

Critical regions define a region within a parallel region where at most one thread at a time executes code (e.g. sum of new pi based on pi)

61 / 70


OpenMP Basic Building Blocks: Reduction

Several operations are common in scientific applications: +, *, -, &, |, ^, &&, ||, max, min

REDUCTION() with operator + on variable s enables the following: starting with a local copy of s for each thread; during the progress of the parallel region, each local copy of s is accumulated separately by each thread; at the end of the parallel region, the copies are automatically synchronized and accumulated into the resulting master-thread variable

Reduction operations are a smart alternative to manual critical-region definitions around operations on variables

Reduction operation automatically localizes variable

62 / 70


Many-core GPUs

Use of very many simple cores; an architecture oriented towards high-throughput computing; uses massive parallelism by executing a lot of concurrent threads slowly; handles an ever-increasing number of instruction threads; CPUs instead typically execute a single long thread as fast as possible

Many-core GPUs are used in large clusters and within massively parallel supercomputers today; this is named General-Purpose Computing on GPUs (GPGPU)


The Graphics Processing Unit (GPU) is great for data parallelism and task parallelism. Compared to multi-core CPUs, GPUs consist of a many-core architecture with hundreds to even thousands of very simple cores executing threads rather slowly.

[11] Distributed & Cloud Computing Book

63 / 70


GPU Acceleration

GPU accelerator architecture example (e.g. an NVIDIA card): GPUs can have 128 cores on one single GPU chip; each core can work with eight threads of instructions, so the GPU is able to concurrently execute 128 * 8 = 1024 threads. The interaction between CPU and GPU happens via memory transfers and is thus the major (bandwidth) bottleneck.

E.g. applications that use matrix-vector multiplication


GPU acceleration means that GPUs accelerate computing due to massive parallelism with thousands of threads, compared to only a few threads used by conventional CPUs

GPUs are designed to compute large numbers of floating point operations in parallel

[11] Distributed & Cloud Computing Book

64 / 70


NVIDIA Fermi GPU Example


[11] Distributed & Cloud Computing Book

65 / 70


Challenges: Domain Decomposition & Load Imbalance

66 / 70

[16] Map Analysis - Understanding Spatial Patterns and Relationships, Book

Modified from [2] Introduction to High Performance Computing for Scientists and Engineers

unused resources

Load imbalance hampers performance, because some resources are underutilized

boundary halo


Challenges: Ghost/Halo Regions & Stencil Methods

67 / 70

[2] Introduction to High Performance Computing for Scientists and Engineers

3 * 16 = 48    4 * 8 = 32

Stencil-based iterative methods update array elements according to a fixed pattern called ‘stencil‘

The key to stencil methods is their regular structure, mostly implemented using arrays in codes


Lecture Bibliography

68 / 70


Lecture Bibliography

[1] Wikipedia 'Supercomputer', Online: http://en.wikipedia.org/wiki/Supercomputer
[2] Georg Hager & Gerhard Wellein, 'Introduction to High Performance Computing for Scientists and Engineers', Chapman & Hall/CRC Computational Science, ISBN 143981192X, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
[3] TOP500 Supercomputing Sites, Online: http://www.top500.org/
[4] LINPACK Benchmark, Online: http://www.netlib.org/benchmark/hpl/
[5] HPC Challenge Benchmark Suite, Online: http://icl.cs.utk.edu/hpcc/
[6] JUBE Benchmark Suite, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html
[7] The OpenMP API specification for parallel programming, Online: http://openmp.org/wp/openmp-specifications/
[8] The MPI Standard, Online: http://www.mpi-forum.org/docs/
[9] Jötunn HPC Cluster, Online: http://ihpc.is/jotunn/
[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
[11] K. Hwang, G. C. Fox, J. J. Dongarra, 'Distributed and Cloud Computing', Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049
[12] Fran Berman, 'Maximising the Potential of Research Data'
[13] How EMI Contributed to the Higgs Boson Discovery, YouTube Video, Online: http://www.youtube.com/watch?v=FgcoLUys3RY&list=UUz8n-tukF1S7fql19KOAAhw
[14] DEEP-EST EU Project, Online: http://www.deep-projects.eu/
[15] LLNL MPI Tutorial, Online: https://computing.llnl.gov/tutorials/mpi/
[16] Joseph K. Berry, 'Map Analysis: Understanding Spatial Patterns and Relationships', Online: http://www.innovativegis.com/basis/Books/MapAnalysis/Default.htm

69 / 70


70 / 70