
Page 1: Introduction to parallel programming with the Message

Introduction to parallel programming with

the Message Passing Interface

Szymon Winczewski

Faculty of Applied Physics and Mathematics

Gdansk University of Technology

Gdansk, Poland

Firenze, 2016.

Page 2: Introduction to parallel programming with the Message

Course details

● Two sessions, each three hours long.

● Questions are welcome: please ask during (or after) the lecture.

● You may also contact me later via e-mail: [email protected].

● A PDF copy of the slides (and course materials) will be available to download from:

http://www.mif.pg.gda.pl/homepages/swinczew/MPI_Firenze

● Two parts: theoretical (3 hours) and practical (also 3 hours).

Page 3: Introduction to parallel programming with the Message

Course outline

● Part 1 (theoretical, rather general) - basics of parallel processing:

a) why (and where) do we need parallel processing?

b) computer performance: how is it measured and how has it changed over the last decades?

c) serial vs parallel computer: similarities and differences,

d) examples of parallel computers,

e) four typical ways to compute – Flynn's taxonomy,

f) memory models (shared, distributed),

g) how to decompose the computational problem: trivial, functional and data decomposition.

Page 4: Introduction to parallel programming with the Message

Course outline

● Part 2 (more practical) - basics of the Message Passing Interface:

a) structure of the message,

b) blocking point-to-point communication,

c) initialization and finalization,

d) collective communication,

e) computer exercises.

Page 5: Introduction to parallel programming with the Message

Part 1 – basics of parallel processing

Page 6: Introduction to parallel programming with the Message

Why do scientists need (parallel) computers?

● Many theoretical models cannot be solved analytically.

● Therefore they are often solved numerically, with the aid of computers.

● Examples:
  - computer-aided design (CAD),
  - computational fluid dynamics,
  - computational nanotechnology,
  - quantum chemistry,
  - and many more.

Page 7: Introduction to parallel programming with the Message

How to measure and compare the performance of computers?

Page 8: Introduction to parallel programming with the Message

Performance of cars

● With a car it's easy – you can usually judge its performance just by looking at it.

Page 9: Introduction to parallel programming with the Message

Performance of cars

● When in doubt, you may consult the speedometer.

● For a car the maximum velocity is a good measure of overall performance.

● Of course there are also some other characteristics that may be important (acceleration, fuel consumption, etc.).

(Figure: two cars, one labelled "medium performance", the other "high performance".)

Page 10: Introduction to parallel programming with the Message

Performance of computers

● Can we do the same with a computer?

● With similar computers: perhaps yes, by comparing the clock speed (actually, a frequency, in MHz).

● Not a good idea if the computers have different architectures!

Page 11: Introduction to parallel programming with the Message

How to measure performance of computers?

● MIPS – Million Instructions Per Second.

● MIPS ≠ MHz:

a) usually computers need several clock ticks to execute an instruction, so usually MIPS < MHz,

b) newer computers can execute several instructions simultaneously (pipelining and superscalar execution) and thus can often process more than one instruction in one clock tick; for example, a Pentium 4 can (on average) process 5 instructions in 2 clock ticks, so in this case MIPS > MHz (see the relation below).

● MIPS measures the performance of the CPU.

● Obviously: the more, the better!
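A rough way to relate the two measures (my own back-of-the-envelope note, not from the slides), writing IPC for the average number of instructions completed per clock tick:

    MIPS ≈ clock frequency [MHz] × IPC

For the Pentium 4 case above, assuming a 3000 MHz clock: 3000 MHz × (5 / 2) = 7500 MIPS, so indeed MIPS > MHz.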

Page 12: Introduction to parallel programming with the Message

Drawbacks of the MIPS index

● Measures only the performance of the CPU.

● Does not take into account the performance of other components, such as memory, FPU, HDD, etc.

● Hardware producers (at marketers' insistence) usually specify peak performance, not average performance – that's cheating a little.

● As a consequence, this measure has fallen into disgrace:

MIPS = Meaningless Indicator of Processor Speed.

Page 13: Introduction to parallel programming with the Message

How to measure performance of computers?

● FLOPS – FLoating-point Operations Per Second.

● Another common measure. Similar to MIPS, but focused on measuring the performance of FPU (floating-point unit). The FLOPS index counts floating-point instructions.

● Since scientific computations are usually FPU-intensive, the FLOPS index is better suited to measuring the performance of computers in scientific computations.

CPU – controls program execution, performs integer calculations

FPU – performs floating-point (non-integer) calculations

Page 14: Introduction to parallel programming with the Message

How does processing power change with time?

● Performance of personal computers.

Page 15: Introduction to parallel programming with the Message

How does processing power change with time?

● Performance of personal computers and supercomputers.

Page 16: Introduction to parallel programming with the Message

How does processing power change with time?

● Two main conclusions:

a) performance increases exponentially with time (Moore's Law),

b) aside from personal computers, there exist supercomputers, whose processing power is approx. 10 000 times greater!

● How do supercomputers obtain this greater processing power?

Page 17: Introduction to parallel programming with the Message

Moore's law

● Complexity of processors doubles every 2 years (complexity ≈ number of transistors).

● The law has held since 1965 – in theory it even works backwards!

● Another view of Moore's law: if f is the number of computations per second per $100 paid for the hardware, then f(t) grows exponentially with time t (see the formula below).
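Written as a formula (my restatement of the law, not from the slides), with N(t) the number of transistors at time t and T ≈ 2 years the doubling period:

    N(t) = N(t0) · 2^((t − t0) / T)

The same exponential form applies to the computations-per-second-per-$100 measure f(t) mentioned above.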

Page 18: Introduction to parallel programming with the Message

Vector computers

● Initially supercomputers were vector computers.

● Main difference: vector computers are able to execute one instruction on many data simultaneously:

a) serial computer: multiply a by 3,

b) vector computer: multiply a and the next 31 numbers by 3, all at once (compare the loop sketches below).
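A minimal illustration of the difference in C (my own sketch, not from the slides; on a vector machine the loop below would be executed as a single vector instruction – here it is ordinary code that a vectorizing compiler could map onto such an instruction):

#include <stdio.h>

int main(void)
{
    double x = 1.0;
    double a[32] = { 1.0, 2.0, 3.0 };   /* remaining elements are 0.0 */

    /* serial computer: one number multiplied per instruction */
    x = x * 3.0;

    /* vector computer: the same operation applied to a[0] ... a[31] at once;
       written here as a loop, but a vector unit performs it in one go */
    for (int i = 0; i < 32; ++i)
        a[i] = a[i] * 3.0;

    printf("x = %g, a[0] = %g\n", x, a[0]);
    return 0;
}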

Page 19: Introduction to parallel programming with the Message

Example of a vector computer: Cray-1 ('76).

● It had long, vector registers. Each register could store sixty-four 64-bit numbers. Owing to this, it could add, multiply, subtract, etc. 64 large numbers at once.

● It had separate pipelines for different instructions, e.g. addition and subtraction were realized in separate circuits. That way it could add and subtract numbers at the same time (and 64 of them!).

● In '76 it was a big success and – in fact – the first successful vector computer.

Page 20: Introduction to parallel programming with the Message

Vectorization is a way to improve performance

(Figure: vector computers.)

Page 21: Introduction to parallel programming with the Message

Other ways to improve performance

● Vector computers were popular in '70s and '80s, but they are not very common today.

● Main drawback of the vector computers was that their multiple processing units could only work with a single instruction, albeit with different data.

● They were also very expensive and difficult to build.

● In the '90s parallel computers replaced vector computers.

● Parallel computer: can simultaneously process multiple data with different instructions.

Page 22: Introduction to parallel programming with the Message

Parallelization is another way to improve performance

(Figure: parallel computers.)

Page 23: Introduction to parallel programming with the Message

Parallel computer

● A system of many (usually identical) processing units, working together.

● The main idea: it is easier to produce 100 slow computers and combine them together than to produce one computer that would be 100 times faster.

troubles expected

Page 24: Introduction to parallel programming with the Message

Parallel computer vs distributed system

● Parallel computer – a collection of at least two processors capable of jointly solving a complex computational task, usually having the same architecture and controlled by the same operating system.

● Distributed system – a collection of independent computers, linked together via a network, with distributed operating software, often having different architectures and operating systems.

Page 25: Introduction to parallel programming with the Message

Parallel computer vs distributed system

● Typical parallel computers are geographically tight (one room, one building) and use very fast networks (e.g. Gigabit Ethernet or even faster, dedicated networks).

● Typical distributed systems are usually spread geographically and linked with a slow network (e.g. desktop computers of many users in several countries communicating via the Internet).

Page 26: Introduction to parallel programming with the Message

An example of a parallel computer

● Tryton cluster at the TASK Supercomputing Centre (Gdansk, Poland),

● 1607 2-processor computers („nodes”) combined into one parallel computer with 3214 processors (Intel Xeon E5 v3 @2.3 GHz, 12-core) and 38 568 cores.

● Each node has 128/256 GB of memory, the whole computer has 2018 TB of RAM.

● All this linked by InfiniBand FDR 56 Gb/s.

● Theoretical performance: 1.48 PFLOPS (1 PFLOPS = 10^15 FLOPS) – see the estimate below.

● Put to use in March 2015, ranked 163rd on the top500.org list (November 2015).
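As a rough cross-check (my own estimate, not taken from the slides), the theoretical peak can be approximated as

    peak FLOPS ≈ number of cores × clock frequency × floating-point operations per core per clock tick.

Assuming 16 double-precision operations per core per tick (AVX2 with fused multiply-add): 38 568 × 2.3 · 10^9 × 16 ≈ 1.42 · 10^15 FLOPS ≈ 1.4 PFLOPS, close to the quoted 1.48 PFLOPS (the published figure may assume a slightly different per-core rate or clock).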

Page 27: Introduction to parallel programming with the Message

Another example of a parallel computer

● Tianhe-2 (Milky Way 2), located in the National Super Computer Center in Guangzhou (China).

● Ranked 1st on top500.org list (November 2015).

● Number of cores: 3 120 000.

● Theoretical performance: 54.9 PFLOPS.

● Linpack Performance: 33.86 PFLOPS.

● Power: 17 808 kW.

● Memory: 1 024 000 GB.

● Processors: 12-core Intel Xeon E5-2692 v2 @2.2 GHz.

Page 28: Introduction to parallel programming with the Message

An example of a distributed system: the SETI@home project

● SETI - Search for Extraterrestrial Intelligence.

● Hundreds of thousands of computers belonging to independent users.

● Mostly PCs, connected via the Internet.

● Execute the same program in their idle time (usually as a screensaver).

● With over 145 000 active computers, it had the ability to compute over 670 TFLOPS (June 2013).

Page 29: Introduction to parallel programming with the Message

Flynn's taxonomy

● Four types of computers (depending on how they process data):

a) SISD (single-instruction, single-data) – in each step one instruction is executed, on one datum => this is a typical serial machine.

b) SIMD (single-instruction, multiple-data) – in each step one instruction is executed, but on many data simultaneously => this is a typical vector computer.

c) MISD (multiple-instruction, single-data) – in each step many instructions are executed, on one datum => this is extremely rare, only experimental computers try this (wavefront processors, resembling the human brain).

d) MIMD (multiple-instruction, multiple-data) – in each step many instructions are executed, on many data simultaneously => this is a typical parallel computer.

Page 30: Introduction to parallel programming with the Message

MIMD architecture

● The approach that we will be interested in (parallel computers).

● In the MIMD architecture all processors usually execute the same program, but the path that each processor takes in the program depends on the number of the processor.

● In other words: processors execute different portions of the program, as in the pseudocode below.

if ( i_am_processor_1 )
    process_task_1();
else if ( i_am_processor_2 )
    process_task_2();
...
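A minimal sketch of the same pattern in C with MPI (process_task_1 and process_task_2 are hypothetical placeholders; the MPI routines used here – MPI_Init, MPI_Comm_rank, MPI_Finalize – are introduced in Part 2):

#include <mpi.h>
#include <stdio.h>

/* hypothetical per-processor tasks */
static void process_task_1(void) { printf("doing task 1\n"); }
static void process_task_2(void) { printf("doing task 2\n"); }

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* the "number of the processor" */

    if (rank == 0)
        process_task_1();
    else if (rank == 1)
        process_task_2();
    /* ... further ranks would take other paths ... */

    MPI_Finalize();
    return 0;
}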

Page 31: Introduction to parallel programming with the Message

Parallel computers

● From now on we will focus on parallel computers, that is, the MIMD architecture.

● These are often „Linux clusters”, created by joining tens of PC-class computers with a high-speed network, and installing specialized („cluster”) software.

Page 32: Introduction to parallel programming with the Message

How is memory organized?

● One technique – shared memory (SM).

● All processors have access to the same (large) operating memory, which is shared (common) for all.

● All processors are said to work in the same address space.

● Processors can communicate using this shared memory – communication is thus easy and fast, but...

Page 33: Introduction to parallel programming with the Message

Shared memory - difficulties

● What happens when two processors want to simultaneously write to the same address?

● What happens when one processor tries to write, and another tries to read? Should the old value or the new value be read?

● These are synchronization hazards (a small illustration follows below).
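A minimal shared-memory illustration in C (my own example, not from the slides), using POSIX threads: two threads increment the same counter without any synchronization, so updates can be lost and the final value is usually smaller than the expected 2 000 000:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared: both threads use the same address */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; ++i)
        counter = counter + 1;         /* read-modify-write, not atomic! */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %ld\n", counter);    /* a synchronization hazard in action */
    return 0;
}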

Page 34: Introduction to parallel programming with the Message

Shared memory - difficulties

● Problems of technological nature – specialized hardware is needed to join (cross-over) many processors with one memory.

● The complexity (and thus the cost) of this hardware grows drastically for nproc > 20.

● With more processors the performance suffers because of the synchronization required: only one processor can access the memory at a given time, the others have to wait. The more processors, the longer the waiting times.

● To summarize: this is very convenient for the programmer, but a nightmare for the engineer designing a computer.

● Very expensive.

Page 35: Introduction to parallel programming with the Message

How is memory organized?

● Another technique – distributed memory (DM).

● Each processor has its own, separate memory (own address space).

● Because memories are private, processor n „does not see” what resides in memory of processor m (if n ≠ m).

● This is rather easy for the engineer: you take separate computers and link them with a fast network.

● This is difficult for the programmer: now the processors cannot exchange data through the memory. A way to exchange messages (via the network) is needed!

Page 36: Introduction to parallel programming with the Message

MIMD+DM parallel processing

● A parallel computer, executing different instructions on each processor, processing many data simultaneously, built with distributed memory.

● We will be interested in this architecture.

● Most parallel computers (clusters) fall into this category.

Page 37: Introduction to parallel programming with the Message

MIMD+DM parallel processing

● Most important: such a parallel machine is not a monolith, but an ensemble of separate units („nodes”), each with its own memory, linked with a network.

● These nodes cannot work on a common task until the programmer distributes the work onto the nodes. This is called problem decomposition.

● If a program is to be run on a parallel computer, it must be written so as to take the parallel architecture into account (passing messages, etc. must be explicitly coded).

Page 38: Introduction to parallel programming with the Message

A trivial example

● The task is to add up all elements of a 200 x 200 matrix.

● On a serial computer:

read matrix from a file;
sum = 0;
for all rows
{
    for all columns
    {
        sum = sum + matrix[row][column];
    }
}
output the sum;

Page 39: Introduction to parallel programming with the Message

A trivial example

● On a parallel computer (say, 8 processors):

a) on one processor read matrix from a file,

b) split the matrix into 8 smaller parts, each with 25 x 200 elements <= problem decomposition,

c) send one part to each processor <= communication,

d) let each processor find the sum of elements that belong to its part of the matrix,

e) send computed partial sums back to one processor <= communication,

f) add the sums from the 8 processors and output the total sum (a minimal MPI sketch of this scheme follows below).
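A minimal sketch of this scheme in C with MPI (my own illustration; it assumes the number of rows is divisible by the number of processes and uses the collective routines MPI_Scatter and MPI_Reduce, which belong to the material of Part 2):

#include <mpi.h>
#include <stdio.h>

#define N 200

static double matrix[N][N];   /* full matrix, used only on rank 0 */
static double part[N * N];    /* local block of rows (oversized for simplicity) */

int main(int argc, char **argv)
{
    int rank, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);         /* e.g. 8 */

    if (rank == 0) {
        /* a) "read matrix from a file" – here simply filled with ones */
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                matrix[i][j] = 1.0;
    }

    /* b) + c) split into blocks of rows and send one block to each process */
    int rows = N / nproc;                          /* e.g. 25 rows per process */
    MPI_Scatter(matrix, rows * N, MPI_DOUBLE,
                part,   rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* d) each process sums its own block */
    double partial = 0.0;
    for (int k = 0; k < rows * N; ++k)
        partial += part[k];

    /* e) + f) collect and add the partial sums on rank 0 */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);               /* 40000.0 for the all-ones matrix */

    MPI_Finalize();
    return 0;
}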

Page 40: Introduction to parallel programming with the Message

Problem decomposition

● Dividing one large task into a set of many smaller tasks.

● We, the programmers, must decompose the problem on our own, so that it is ready for parallel processing.

● Three types of decomposition are most often encountered:

a) trivial decomposition,

b) functional decomposition,

c) data decomposition:
   - geometric,
   - scattered spatial decomposition,
   - task farm.

Page 41: Introduction to parallel programming with the Message

Trivial decomposition

● The easiest one.

● Can be applied if the data can be divided into independent parts, and each of these parts can be treated with the same algorithm.

● Example: add up all elements of a 200 x 200 matrix.

● Each processor gets a portion of the data to compute and takes care of it. There are no dependencies, so the processors do not need to communicate (apart from distributing the data at the beginning, and collecting the results at the end).

● The total time to compute = the time it takes to process the largest part.

Page 42: Introduction to parallel programming with the Message

Trivial decomposition

● Occurs quite frequently.

● One example is numerical integration.

● Another typical application: Monte Carlo (a sketch of a trivially decomposed integration follows below).
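A minimal sketch of a trivially decomposed numerical integration in C with MPI (my own illustration): each process integrates f(x) = x^2 over its own share of [0, 1] with the midpoint rule, and the only communication is collecting the partial results at the end:

#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x * x; }        /* exact integral over [0,1] is 1/3 */

int main(int argc, char **argv)
{
    const long n = 1000000;                        /* total number of sub-intervals */
    int rank, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* each process handles an independent subset of sub-intervals;
       no communication is needed inside the loop */
    double h = 1.0 / (double)n;
    double partial = 0.0;
    for (long i = rank; i < n; i += nproc)
        partial += f((i + 0.5) * h) * h;

    /* the only communication: collect the partial results at the end */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("integral ~= %.6f (exact: 0.333333)\n", total);

    MPI_Finalize();
    return 0;
}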

Page 43: Introduction to parallel programming with the Message

Functional decomposition

● This time we divide not the data, but the algorithm.

● We divide the algorithm into independent stages, each processor gets one stage to take care of.

● Similar to an assembly line in a factory.

Page 44: Introduction to parallel programming with the Message

Functional decomposition

● An analogy: guitar factory. Imagine that the production of a (single) guitar can be divided into five stages:

a) create guitar body,

b) add tuning knobs,

c) paint guitar body,

d) put on the strings,

e) tune the guitar.

● Assign one person (processor) to each stage.

● Each guitar goes through all five stages before it is finally produced.

● We can work on 5 guitars at once.

Page 45: Introduction to parallel programming with the Message

Functional decomposition

● The same concept may be applied to a program.

● This method does not allow us to use more processors than the number of stages.

● However, some (more demanding) stages can be split into smaller parts or assigned a larger number of workers (processors).

● It is important to balance work: we need to wait for the slowest stage.

Page 46: Introduction to parallel programming with the Message

Data decomposition

● Another type: data is divided across processors.

● Dependencies may be present => need to communicate.

● An example (from molecular dynamics): there are atoms in a box, and we need to compute the forces acting between pairs of atoms.

● Atoms are distributed across processors according to some geometrical criterion.

(Figure: the simulation box divided into four domains, #1–#4, one per processor.)

Page 47: Introduction to parallel programming with the Message

Data decomposition

● When calculating the force between atoms assigned to the same processor, there is no problem.

● When calculating the force between atoms assigned to different processors, communication must take place, because the particles reside in different memories.

● The need to communicate is what distinguishes the data decomposition from the trivial decomposition.

Page 48: Introduction to parallel programming with the Message

Task farm

● This is the last type of data decomposition.

● Closely related to the master-slave concept.

● Processors are split into two groups: masters and workers (slaves).

● In the simplest case there is only one master and all other processors are slaves.

(Figure: a master with its workers.)

Page 49: Introduction to parallel programming with the Message

Task farm

● The master splits the computational task into smaller tasks (called grains) and puts them in the „to do" pool.

● Each worker, until all work is done:

a) asks the master to assign it a grain of work,

b) gets a grain of work,

c) processes this grain of work,

d) sends the results back to the master (see the sketch after this list).
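A minimal sketch of a task farm in C with MPI (my own illustration under simplifying assumptions: grains are just the indices 0..NGRAINS−1, a result is a single double, plain blocking MPI_Send / MPI_Recv are used, and there are at most NGRAINS workers):

#include <mpi.h>
#include <stdio.h>

#define NGRAINS  100
#define TAG_WORK 1                         /* message carries a grain index    */
#define TAG_STOP 2                         /* tells a worker that work is done */

static double process_grain(int grain) { return 2.0 * grain; }   /* hypothetical work */

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (rank == 0) {                       /* the master */
        double result, total = 0.0;
        int next = 0, active = nproc - 1;

        /* hand out the first grain to every worker */
        for (int w = 1; w < nproc; ++w) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            ++next;
        }

        while (active > 0) {
            /* a worker sends back a result and thereby asks for more work */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            total += result;

            if (next < NGRAINS) {          /* grains left in the „to do" pool */
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {                       /* pool empty: tell this worker to stop */
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
        printf("total = %f\n", total);
    } else {                               /* a worker */
        int grain;
        for (;;) {
            MPI_Recv(&grain, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP)
                break;
            double result = process_grain(grain);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}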

Page 50: Introduction to parallel programming with the Message

Problem decomposition - summary

● It is essential: if we want to solve a problem in parallel, we first need to decompose it into smaller problems.

● It is us, programmers, who need to decompose the problem in one way or another before it is ready for parallel processing.

● There are many ways to decompose problems. We have studied only the basic ones.

● Often a problem may be tackled by a combination of decomposition strategies.