
Parallel programs to multi-processor computers!

Author: Igor Oreschenkov

Date: 04.02.2009

Abstract

For instance, when a man cannot manage everything himself, he creates a double for himself - brainless, irresponsible, able only to solder pins, or carry heavy loads, or write from dictation, but doing this very well.

Arkady and Boris Strugatsky

This article is an introduction to parallel programming for beginners. It was published in "PC World" magazine (http://www.viva64.com/go.php?url=429).

Parallel program - a particular case of division of labor

We would hardly see the world around us as it is if mankind had not invented the principle of division of labor. Thanks to it, there are professionals who specialize in a particular field and perform their work perfectly. Applying this approach in industry increased output tenfold. So it is no wonder that, when computers appeared, the question arose of applying the principle of division of labor to the operation of computer programs.

Research in this field also revealed a pleasant fact: complicated mathematical calculations - those involving matrix operations, for example - can be split into independent subtasks. Each subtask can be executed separately, and the solution of the original task can then be obtained by combining the subtasks' results. Since the subtasks are independent, they can be solved simultaneously by several executors. When computers take the role of these executors, this approach makes it possible to solve extremely computation-intensive tasks: realistic modeling of real-world objects, cryptanalysis, real-time control of technological processes. It is on this principle that all supercomputers - the pride of scientific laboratories and computing centers - operate.

Having run into the fundamental physical limits on single-processor performance, computer manufacturers began to increase the number of computational nodes in their products. Inside the case of an ordinary PC you may now find what used to be called a supercomputer: a multi-core computing device. Operating systems have been ready for this technology for a long time. Unix and Linux, OS/2, Mac OS and modern versions of Microsoft Windows all show a performance gain when several programs execute simultaneously. For example, while you watch a film, an antivirus scan of your hard disk can run without your noticing it. But there are situations when a single task demands a great deal of computational resources. Leaving aside controlled nuclear fusion or cracking the password to an archive, think of such common things as audio and video encoding and computer games. Programs implementing these tasks fully load the CPU. Yet adding processors to the system will bring no marked performance gain in these examples unless special methods were used while developing the programs. Indeed, how would the operating system know the implementation details of an encoding algorithm well enough to distribute its calculations between processors? How could it know how to parallelize the calculation of a game scene?

Processes and threads - twins

So, we have a multi-processor computer, we have analyzed the task, and we have singled out the subtasks that can be solved simultaneously. How do we implement the solution of these subtasks in a program? Modern operating systems offer two forms of code execution: processes and threads.

A process is a program that has been loaded into main memory and is ready for execution. When you do two things at once - surf news sites on the Internet and burn files to a CD - two processes execute simultaneously: an Internet browser and a CD-burning program. A process consists of program code, the data it operates on, and various resources belonging to the program, such as files or system queues. Each process executes in its own address space, i.e. it has access only to its own area of main memory. On the one hand this is good, because one process cannot interfere with another (until they both access a non-shareable resource such as the disk drive, but that is a different matter), and errors in one program will in no way affect the operation of another. On the other hand, if the processes are meant to solve one common task, a question arises: how will they exchange information and interact at all?

Launching processes in Unix and Windows

Interestingly, the developers of Microsoft Windows and of Unix (Linux) took different approaches to launching a child process. From the viewpoint of a programmer using the WIN32 API, everything is quite natural: at the point where an auxiliary process must be launched, the CreateProcess() function is called, one of whose arguments is the path to the EXE file with the program to be executed.
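As an illustration, here is a minimal sketch of launching a child process through the WIN32 API; the program name "child.exe" is a placeholder, and error handling is kept to the essentials:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        STARTUPINFO si;
        PROCESS_INFORMATION pi;
        ZeroMemory(&si, sizeof(si));
        si.cb = sizeof(si);
        ZeroMemory(&pi, sizeof(pi));

        /* the command line must be writable, so it lives in an array */
        char cmd[] = "child.exe";
        if (!CreateProcess(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL,
                           &si, &pi)) {
            fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
            return 1;
        }
        WaitForSingleObject(pi.hProcess, INFINITE); /* wait for the child */
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
        return 0;
    }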

In Unix-like operating systems this is done differently. At the point in the code where a new process is needed, the fork() system call is used. As a result the running program "forks": the state of the program at the moment of the fork() call (by state we mean the values of the CPU registers, the stack, the data area and the list of open files) is copied, and from this copy a new process is launched possessing the same information as the parent. Both the parent and the child process continue executing the same program, starting with the instruction that follows the fork() system call. Since it is pointless for two processes to perform exactly the same work, a question arises: how do we make each of the resulting processes solve its own independent subtask? The point is that to the parent process fork() returns the identifier of the child process, while in the child process fork() simply returns zero. Each process can identify itself by this code, and all the rest is a technical matter: the child process only has to prepare an environment for executing a new program, load its code with the exec() system call and transfer control to it.
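A minimal sketch of this fork()-then-exec() pattern is shown below; /bin/ls merely stands in for the program that would solve the child's subtask:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();          /* the program "forks" here */
        if (pid == -1) {
            perror("fork");
            exit(EXIT_FAILURE);
        }
        if (pid == 0) {
            /* child: fork() returned zero, so load a new program */
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            perror("execl");         /* reached only if exec failed */
            _exit(EXIT_FAILURE);
        }
        /* parent: fork() returned the identifier of the child */
        printf("parent: spawned child %d\n", (int)pid);
        waitpid(pid, NULL, 0);       /* wait for the child to finish */
        return 0;
    }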

A thread is a means of forking a program inside a process. A process can contain several threads, each executed independently of the others and keeping its own register values and its own stack. Threads have access to all the global resources of the owning process, for example its open files or memory segments. Every process has at least one thread, the main thread. When the solution of the task reaches a section that allows parallel execution, the program spawns the necessary number of additional threads, which execute simultaneously, solve their subtasks and return the results to the main thread.
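With POSIX threads this scheme - the main thread spawning workers and then collecting their results - can be sketched as follows (squaring a number stands in for a real subtask):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* each thread solves its own subtask and returns the result */
    static void *subtask(void *arg)
    {
        long n = (long)arg;
        return (void *)(n * n);          /* stand-in for real work */
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, subtask, (void *)i);

        long total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            void *res;
            pthread_join(tid[i], &res);  /* the main thread collects results */
            total += (long)res;
        }
        printf("combined result: %ld\n", total);
        return 0;
    }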


To solve complex tasks it is not enough simply to execute different code sections in parallel; you also need to arrange the exchange of their results, that is, to organize the interaction of these sections. At first glance, with threads this is easily done by giving them access to an agreed common area of main memory. But this simplicity is deceptive: writes to main memory by different threads can interleave quite unpredictably, corrupting the integrity of the information kept in that memory area.

An example of thread interaction

Imagine the billing system of a mobile operator running on a powerful multi-processor server. It is quite natural that charging a client's account for services rendered is performed by one program thread, while crediting payments to the account is performed by another. The first operation can be written as S := S - A, where S is the account balance at the moment of the operation and A is the cost of the phone calls made. The processor executes this operation in three steps: first the current value of the variable S is read, then A is subtracted from it, and finally the result is written back into the cell S. The second operation, S := S + B, where B is the sum paid by the client, is performed similarly. Now suppose the client makes a call shortly after paying through a bank office. Since payments are credited to the account with some delay for objective reasons, it is possible that both operations - the debit and the credit - will be executed in the billing system at nearly the same time. Suppose the debit operation starts first and the credit operation follows one step behind (figure 1). After the three steps of the debit operation the account balance equals S - A, but the fourth step overwrites this value with S + B.


Figure 1 - An example of a billing system's operation

Thus the final balance of the account equals S + B instead of the correct value S - A + B. In a different scenario such a collision could just as easily work against the client. To avoid such problems, special measures must be taken to protect individual program sections from simultaneous execution.
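The billing scenario is easy to reproduce with POSIX threads. In the sketch below both operations perform the same unprotected read-modify-write on the balance; whether the lost update actually shows up in a particular run depends on how the threads happen to be scheduled:

    #include <pthread.h>
    #include <stdio.h>

    static long balance = 1000;          /* S: the account balance */

    static void *charge_off(void *arg)   /* S := S - A */
    {
        long a = (long)arg;
        long s = balance;                /* step 1: read S */
        s = s - a;                       /* step 2: subtract A */
        balance = s;                     /* step 3: write the result back */
        return NULL;
    }

    static void *charge_on(void *arg)    /* S := S + B */
    {
        long b = (long)arg;
        long s = balance;                /* if this read happens before the
                                            write in charge_off(), the
                                            charge-off is lost */
        s = s + b;
        balance = s;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, charge_off, (void *)100L);
        pthread_create(&t2, NULL, charge_on, (void *)300L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("balance: %ld (the correct value is 1200)\n", balance);
        return 0;
    }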

In the following sections on synchronization and interaction of simultaneously executed program sections, the terms "process" and "thread" are used together, as the described mechanisms in most cases apply to both notions.

Synchronization...

Modern operating systems offer a wide range of means for synchronizing parallel processes. The simplest way to synchronize simultaneously executed code sections is for one thread to wait for another to finish. The same mechanism underlies operating system support for asynchronous operations such as file input/output or network data exchange. Besides, wait functions can be used for mutual notification about events that have occurred (figure 2).


Figure 2 - An event is an operating-system object that can switch its state from "normal" (nonsignaled) to "signaled"

Thus, for example, when developing a bank's trading-day system, an event can be registered to announce the arrival of funds in a client's account. The thread that executes the client's payment orders can wait for this event.
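With the WIN32 API the bank example might be sketched like this: the payment thread blocks on an event object until another thread announces the arrival of funds (the one-second Sleep() simulates the delay):

    #include <windows.h>
    #include <stdio.h>

    static HANDLE hFundsArrived;         /* the event object */

    static DWORD WINAPI payment_thread(LPVOID param)
    {
        (void)param;
        /* wait until the event switches to the "signaled" state */
        WaitForSingleObject(hFundsArrived, INFINITE);
        printf("funds received, executing the payment order\n");
        return 0;
    }

    int main(void)
    {
        /* auto-reset event, initially nonsignaled */
        hFundsArrived = CreateEvent(NULL, FALSE, FALSE, NULL);
        HANDLE hThread = CreateThread(NULL, 0, payment_thread,
                                      NULL, 0, NULL);
        Sleep(1000);                     /* the funds arrive... */
        SetEvent(hFundsArrived);         /* ...and the event is announced */
        WaitForSingleObject(hThread, INFINITE);
        CloseHandle(hThread);
        CloseHandle(hFundsArrived);
        return 0;
    }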

If different program sections may modify some variable simultaneously, that variable must be protected. For example, if one thread calculates a game scene and another repaints it, the repainting should be performed only after all the necessary calculations are done. For this purpose operating systems introduce a special object, the mutex (from "mutual exclusion").

Figure 3 - A mutex is a system object that can be in one of two internal states: "free" or "busy"


Before a thread begins to change the variable, it must capture the mutex associated with it. If it succeeds, the mutex changes its internal state and the thread continues execution. If another thread tries to capture the same mutex while the first thread is working with the variable, it will be refused and will have to wait until the first thread releases the mutex (figure 3). This guarantees that at any given moment only one code section works with the variable.
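Applied to the billing example above, protection with a POSIX mutex might look as follows; the three-step read-modify-write sequence is now indivisible from the other thread's point of view:

    #include <pthread.h>
    #include <stdio.h>

    static long balance = 1000;
    static pthread_mutex_t balance_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *update(void *arg)
    {
        long delta = (long)arg;
        pthread_mutex_lock(&balance_mutex);   /* capture: state is "busy" */
        balance = balance + delta;            /* protected update */
        pthread_mutex_unlock(&balance_mutex); /* release: state is "free" */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, update, (void *)-100L); /* charge-off */
        pthread_create(&t2, NULL, update, (void *)300L);  /* charge-on */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("balance: %ld\n", balance);    /* always 1200 now */
        return 0;
    }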

Sometimes you need to modify several global variables in one code section. If this code may be executed by several threads simultaneously, it too must be protected. Specifically for such cases, operating systems let you arrange the corresponding code into a critical section.

Figure 4 - A critical section is a program section that can be executed by only one thread at a time

A critical section can be executed by only one thread at a time. Another thread can ask whether the critical section is available and, if the answer is negative, either wait until it becomes free or do something else (figure 4). Critical sections also differ from mutexes in that one and the same thread may re-enter a critical section it has already captured as many times as it likes, whereas a thread that tries to re-capture a mutex it already holds will block, producing a so-called "deadlock".
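A sketch with the WIN32 API: two global variables are always updated together inside a critical section, and, unlike with a mutex, the thread that owns the section may enter it again without blocking:

    #include <windows.h>
    #include <stdio.h>

    static CRITICAL_SECTION cs;
    static int shared_a, shared_b;       /* must always change together */

    static DWORD WINAPI worker(LPVOID param)
    {
        (void)param;
        EnterCriticalSection(&cs);       /* one thread at a time from here */
        shared_a++;
        shared_b--;
        EnterCriticalSection(&cs);       /* re-entry by the owner is legal */
        LeaveCriticalSection(&cs);
        LeaveCriticalSection(&cs);       /* ...to here */
        return 0;
    }

    int main(void)
    {
        InitializeCriticalSection(&cs);
        HANDLE h[2];
        for (int i = 0; i < 2; i++)
            h[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        WaitForMultipleObjects(2, h, TRUE, INFINITE);
        for (int i = 0; i < 2; i++)
            CloseHandle(h[i]);
        DeleteCriticalSection(&cs);
        printf("a = %d, b = %d\n", shared_a, shared_b);
        return 0;
    }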

Now suppose we have a set of identical resources, for example several windows for displaying information or a pool of network printers. To keep track of how such resources are distributed among threads, semaphores are used.


Figure 5 - A semaphore keeps track of the distribution of identical resources among threads

A semaphore is a system object that acts as a counter. Before a thread starts working with a resource from the set, it must address the semaphore associated with that set. If the semaphore's counter is greater than zero, the thread is allowed to work with the resource and the counter is decremented by one (figure 5). If all the resources of the set are in use, the thread is placed in a queue. When a resource becomes free, the thread that releases it notifies the semaphore, the counter is incremented, and the resource can be used again.
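The printer-pool example might be sketched with POSIX semaphores (Linux) like this; sleep() stands in for actual printing:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NPRINTERS 2        /* a pool of two identical printers */
    #define NJOBS     5

    static sem_t printers;

    static void *print_job(void *arg)
    {
        long job = (long)arg;
        sem_wait(&printers);   /* counter > 0: take a printer, decrement;
                                  counter == 0: wait in the queue */
        printf("job %ld is printing\n", job);
        sleep(1);              /* stand-in for actual printing */
        sem_post(&printers);   /* return the printer: increment the counter */
        return NULL;
    }

    int main(void)
    {
        sem_init(&printers, 0, NPRINTERS); /* counter = size of the pool */
        pthread_t t[NJOBS];
        for (long i = 0; i < NJOBS; i++)
            pthread_create(&t[i], NULL, print_job, (void *)i);
        for (int i = 0; i < NJOBS; i++)
            pthread_join(t[i], NULL);
        sem_destroy(&printers);
        return 0;
    }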

Sometimes the execution of a process must be interrupted so that some urgent actions can be performed. Some operating systems use signals for this purpose: the process that needs a special operation executed sends a signal to the other process. That other process should implement a signal-handling procedure: on receiving a signal it can suspend execution or terminate, execute a special subprogram created for such a case, or simply ignore the signal.
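A minimal sketch with POSIX signals: the process installs a handler for SIGUSR1 and sleeps; sending it the signal (for example with the kill command) wakes it and triggers the special subprogram:

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    /* the signal-processing procedure */
    static void handler(int signo)
    {
        (void)signo;
        got_signal = 1;
    }

    int main(void)
    {
        signal(SIGUSR1, handler);        /* install the handler */
        printf("send me a signal: kill -USR1 %d\n", (int)getpid());
        while (!got_signal)
            pause();                     /* sleep until a signal arrives */
        printf("signal received, executing the special subprogram\n");
        return 0;
    }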

... and interaction

As was said above, to solve a common task together, processes may need not only to coordinate their work but also to exchange information - for example, intermediate results. For this you can use shared main memory, information exchange through a file, data transfer through unidirectional pipes, or an imitation of network communication (figure 6).


Figure 6 - Ways of exchanging data between processes

The quickest and most obvious way for simultaneously executed code sections to exchange information is through a shared area of main memory: one process writes data into memory, the other reads it, and vice versa. But in this case the processes must be synchronized by one of the means described above. And while threads can share memory directly, processes must first request a shared memory segment from the operating system and agree on the procedure for using it.
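A sketch of requesting such a segment from the operating system with POSIX calls; the segment name "/demo_shm" is a placeholder, and a second process would shm_open() the same name and mmap() it to read the data (on older Linux systems, link with -lrt):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* request a named memory segment from the operating system */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);             /* set the segment size */

        /* map the segment into this process's address space */
        char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        strcpy(shared, "intermediate result");  /* one process writes... */
        printf("%s\n", shared);          /* ...another would read it here */

        munmap(shared, 4096);
        close(fd);
        shm_unlink("/demo_shm");         /* remove the segment */
        return 0;
    }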

The next method of information exchange is pipes. A pipe is a system object that, as a rule, transfers information in one direction. The best-known examples of pipes are the standard input/output streams (stdin, stdout): the stdout of one process can be redirected into the stdin of another, after which everything written to stdout by the first process can be read by the second. Operating systems also allow you to create additional pipes, and the functions for working with them are similar in syntax to the file-operation functions.
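A minimal sketch of a pipe between a parent and a child process; note that read() and write() here are literally the same functions used for files:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        pipe(fds);                /* fds[0] - read end, fds[1] - write end */

        if (fork() == 0) {
            /* child: writes its result into the pipe */
            close(fds[0]);
            const char *msg = "intermediate result";
            write(fds[1], msg, strlen(msg) + 1);
            close(fds[1]);
            _exit(0);
        }

        /* parent: reads from the other end */
        close(fds[1]);
        char buf[64];
        if (read(fds[0], buf, sizeof(buf)) > 0)
            printf("received: %s\n", buf);
        close(fds[0]);
        return 0;
    }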

You can also exchange data through files. Modern operating systems have built-in mechanisms for buffering the information involved in file operations, so this method is rather efficient. From the security viewpoint, however, it is inferior to the approaches described above, since an unauthorized application could gain access to the file used for inter-process exchange. Before using this method, you should study how file operations are implemented in your particular operating system and take measures to protect the data.

Duplex data exchange between processes can be implemented with the operating system's networking facilities. Using sockets, two processes can establish a channel between themselves and transfer data over it in the same way as a browser and an HTTP server do. The format of the transferred data can be absolutely anything - the point is that both processes follow the same exchange conventions, the protocol. Moreover, nothing prevents these processes from running on different computers. And with this last remark we pass on to the next section.

Supercomputers - everywhere!

Speaking about parallel execution of programs, we have assumed a computer with several processors. But such a computer is not the only platform for executing parallel programs. Consider the simplest computational network: ten ordinary single-processor computers connected by Fast Ethernet over twisted pair. Looks familiar, doesn't it? What is each computer, in essence? A processor with main and disk memory at its disposal - a perfect place to execute one process. What if we simultaneously launch, on each of the ten computers, programs that solve subtasks of cracking the password to an archive? Obviously, the task will take much less time to solve than on a single computer. And what if the network consists not of 10 but of 100 computers? And what if this network is connected to the Internet?

This example shows that even small local networks have great potential computing power. And while you can hardly use this power for computer games, it will serve a CAD system well. A group of computers united by a data-transfer medium into a single computational system is called a cluster system, or cluster. All you need is a special program that implements information exchange between its simultaneously executed copies (processes) over the local network. Generally speaking, such a program may have a traditional client-server architecture, i.e. be able both to send messages to its neighbors and to receive messages from them. But there are also ready-made solutions that simplify the development of programs for cluster systems - for example, MPI and PVM - which give the programmer means for both point-to-point and collective interaction between processes executed on the nodes of a cluster system, as well as methods for synchronizing them.

The MPI (Message Passing Interface) specification offers a programming model in which a program spawns several processes that interact by calling message-sending and message-receiving subprograms. MPI implementations are libraries of subprograms usable from the C/C++ and Fortran languages. When an MPI program is launched, a fixed number of processes is created. The MPI specification was created as an industry standard, so all its facilities are focused on achieving high performance on symmetric multi-processor systems and homogeneous cluster systems (supercomputers). There is a free MPI implementation, MPICH (MPI CHameleon), whose Windows and Linux versions are available for download here: http://www.viva64.com/go.php?url=405. Installed on a PC, MPICH lets you develop and debug programs that can later be used without any modification on clusters and supercomputers.
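For a taste of the MPI model, here is a minimal sketch in which process 0 combines partial results received from all the others (build with mpicc, run with something like mpiexec -n 4 ./a.out; squaring the rank stands in for a real subtask):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);      /* a fixed number of processes starts */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank != 0) {
            int result = rank * rank;    /* stand-in for a real subtask */
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
            int sum = 0;
            for (int i = 1; i < size; i++) {
                int result;
                MPI_Recv(&result, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum += result;           /* combine the partial results */
            }
            printf("combined result: %d\n", sum);
        }

        MPI_Finalize();
        return 0;
    }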

Unlike MPI, the PVM (Parallel Virtual Machine) system for developing and executing parallel programs was created within a research project aimed at heterogeneous computational complexes. It lets you quickly unite a heterogeneous set of network-connected computers into a single computational resource called a "parallel virtual machine". The computers may have different architectures and run different operating systems. PVM is a set of libraries and utilities for developing and debugging parallel programs and for managing the configuration of the virtual machine; the C/C++ and Fortran languages are supported. The configuration of the computational complex can change dynamically as some nodes are excluded and new ones added. This universality is paid for by some loss of performance compared with MPI.

Thus both MPI and PVM let you turn your organization's local network into a powerful computing machine capable of solving complex tasks without much effort.

We must admit that parallel computational systems are very common nowadays. Operating systems provide developers with the necessary low-level services, while for applied tasks there are ready, proven tools in the form of specialized utilities and libraries for popular high-level programming languages. A programmer who wants to keep up with modern trends should therefore master the methods of parallel program development.