
Mat-2.108 Independent Research Project in Applied Mathematics

APPLYING PARALLEL COMPUTING IN THE NUMERICAL SOLUTION OF OPTIMAL CONTROL PROBLEMS

Sampsa Ruutu
Systems Analysis Laboratory

Helsinki University of Technology


Table of Contents

1 Introduction
2 Air Combat Duel
   2.1 Missile Support Time Game
   2.2 Solution Methods
3 Parallel Computing
   3.1 Overview
   3.2 When to Use Parallel Computing
   3.3 Parallel Computer Architectures
      3.3.1 Flynn's Taxonomy
      3.3.2 Memory Architectures
   3.4 Parallel Programming Models
      3.4.1 Message Passing Interface (MPI)
   3.5 Designing a Parallel Program
      3.5.1 Partitioning
      3.5.2 Load Balancing
      3.5.3 Communications and Granularity
4 Performance Analysis
   4.1 Amdahl's Law
   4.2 Observed Speedup
   4.3 Efficiency
5 Numerical Examples
   5.1 The IBMSC Parallel Environment
   5.2 An MPI Program
      5.2.1 Static Load Balancing
      5.2.2 Dynamic Load Balancing
   5.3 Results
      5.3.1 Fixed Problem Analysis
      5.3.2 Analysis with a Variable Problem Size
6 Summary and Conclusions
7 Appendix
   7.1 MPI Communication Costs
   7.2 Code Excerpts
      7.2.1 Static Load Balancing
      7.2.2 Dynamic Load Balancing
8 References




1 Introduction

In this study, parallel computing is applied to the numerical solution of optimal control problems. Situations often arise where a large number of independent optimal control problems must be solved for various initial states, and instead of using only one processor for the computation, the workload can be distributed to several processors that operate simultaneously.

Here, the optimal control problems are part of an air combat model in which the pilots must decide the optimal support times of launched missiles. The support time refers to the time during which the launching aircraft gives target information to the missile. Although the computational problem considered here comprises the numerical solution of a set of optimal control problems, the same parallel programming methods can be used in various other applications where independent computation must be done for a set of data pieces.

The numerical solution of optimal control problems generally requires considerable computational resources, even though quite efficient algorithms designed for this purpose exist. An optimal control problem can be solved with either direct or indirect methods. Indirect methods rely on solving the first order necessary conditions of optimality and finding a solution with a root finding algorithm. The disadvantage of indirect methods is that the region of convergence is small, so a good initial guess is required. This makes them less robust than direct methods. In direct methods the control and state variables are parameterized and the problem is solved with an efficient nonlinear programming solver.

2 Air Combat Duel

The air combat setting studied in this paper is a duel between two fighter aircraft. Both are equipped with medium range air-to-air missiles that they launch towards each other simultaneously. The initial states of the aircraft are chosen so that the aircraft are too close to each other to avoid the confrontation.

In order to reach its target, a missile requires target information from either the launching aircraft or its own radar. After an aircraft has launched a missile, it typically tracks the target aircraft with its radar for a specific time and transmits the target information to the missile. After this, the launching aircraft evades and the missile has to extrapolate the position of the target using prior information. Finally, the missile tries to lock on to the target by switching on its own radar. This happens when the missile has reached a certain lock-on distance to the estimated target position. The lock-on distance is the maximum distance at which the missile radar is able to lock on to its target with a given probability.

Point-mass models with three degrees of freedom are used to model the dynamics of the aircraft and missile. The aircraft can be controlled by adjusting the angle of attack, bank angle and thrust. The missile autopilot uses a feedback guidance law called Proportional Navigation.

2.1 Missile Support Time Game

In the setting above, the pilots must decide the support times of their missiles, i.e. the times during which they transmit target information to the missile. A longer support time shortens the extrapolation time of the missile, thus increasing the probability that the missile is able to lock on to its target and eventually make a successful hit. However, prolonging the support time also increases the probability of hit of the enemy missile, since during this phase the aircraft must fly towards the enemy aircraft.

In Refs. [1] and [2] the launch and evasion times are computed using artificial intelligence techniques combined with pursuit-evasion game solutions, based on the assumption that the adversary follows a specific feedback rule. However, with these techniques only suboptimal controls can be obtained. The system studied in this paper is based on a static game in which the pay-off functions of both players are weighted averages of the probabilities of hit and of their own survival. The game and its solution methods are described in detail in Ref. [3].

2.2 Solution Methods

To be able to solve the optimal support time for an arbitrary initial state in real time, a set of optimal control problems must be solved beforehand for various initial states. In these problems, the players try to minimize the closing velocity of the approaching missile in order to minimize the probability of hit. The support times of both players are restricted to a set of discrete times, denoted by

$T^B = \{t^B_{\min}, t^B_2, \ldots, t^B_n\}$,    (2.1)

$T^R = \{t^R_{\min}, t^R_2, \ldots, t^R_n\}$,    (2.2)

where B denotes the Blue player and R the Red player.


For each initial state, the solution of the optimal control problem is computed for a set of support time pairs, denoted by

$(t^B_s, t^R_s) \in T^B \times T^R$.    (2.3)

Once these optimal control problems have been solved, the values of the pay-off functions can be calculated. The solution related to an arbitrary initial state can then be obtained in real time by linearly interpolating from the previously computed results.
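The real-time lookup step can be illustrated with a minimal one-dimensional sketch of linear interpolation in a precomputed table. The function name, the argument names and the one-dimensional setting are illustrative only; the actual implementation interpolates over several variables of the initial state and the support times.

REAL FUNCTION interpolate_payoff(x, x_grid, payoff_table, n)
  ! Linear interpolation in a precomputed pay-off table (illustrative sketch).
  ! x_grid(1:n) holds the grid points in increasing order and payoff_table(1:n)
  ! the pay-off values computed off-line at those points.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  REAL, INTENT(IN)    :: x, x_grid(n), payoff_table(n)
  INTEGER :: i
  REAL    :: w

  ! Clamp queries outside the table to its end points.
  IF (x <= x_grid(1)) THEN
    interpolate_payoff = payoff_table(1)
    RETURN
  ELSE IF (x >= x_grid(n)) THEN
    interpolate_payoff = payoff_table(n)
    RETURN
  END IF

  ! Find the interval [x_grid(i), x_grid(i+1)] that contains x.
  DO i = 1, n - 1
    IF (x <= x_grid(i+1)) EXIT
  END DO

  ! Interpolate linearly between the two neighbouring grid values.
  w = (x - x_grid(i)) / (x_grid(i+1) - x_grid(i))
  interpolate_payoff = (1.0 - w) * payoff_table(i) + w * payoff_table(i+1)
END FUNCTION interpolate_payoff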

As such, it is not crucial that the optimal control problems solved off-line are computed fast. However, since the optimal control problems related to the different initial states are independent, parallel computation makes it possible to use existing computing resources more efficiently.

3 Parallel Computing

3.1 Overview

Computer programs are traditionally written to be executed by a single processor. A serial program comprises a series of instructions that the processor executes sequentially, one at a time. The paradigm of parallel computing is different, since the idea is to use multiple computing resources simultaneously. Whereas a serial computer such as an ordinary PC typically has only one processor, a parallel computer consists of multiple processors that are able to work together. Parallel computers range from a single computer containing several processors to multiple computers in a network; a combination of these is also possible.

The main reason for using parallel computing is to optimize the use of time and computational resources. The same problem can be solved faster if the workload is divided among several processors, or alternatively a larger problem can be solved in a given timeframe. Another reason to use parallel computing is to overcome memory constraints, since the lack of sufficient memory can be an issue when solving a large problem on a single serial computer.

The computing power of computers has increased rapidly over time, but the demand for greater computing power has increased as well, because greater computing power allows complex systems to be modelled more accurately. Traditionally, parallel computing has been seen as the high end of computing, with science and engineering as its main application areas. Important research areas that use parallel programming extensively range from climate models to the study of chemical reactions and electronic circuits, to name but a few. However, science and engineering are no longer the only areas that employ parallel computing, since nowadays commercial applications such as multimedia applications and parallel databases also require more and more computing resources.

With current technology the performance of a single processor cannot increase indefinitely, since physical limitations such as the speed of light ultimately constrain processor clock cycle times. There are also limits to miniaturization, and the number of transistors on a chip cannot increase forever. Moreover, even though it might be theoretically possible to build faster processors, it is more economical to use multiple moderately fast processors to achieve the same performance. The factors that affect the performance of a computer are the time required to perform a basic operation and the number of basic operations that the computer can perform at the same time, and with parallel computing it is possible to increase the latter. In summary, parallelism seems to be the future of computing, judging from the multiple processor architectures that have emerged and the faster networks that allow distributed computing.

3.2 When to Use Parallel Computing

A parallel computing approach is feasible when the computation can be divided into separate work pieces that can be computed simultaneously. Parallelization is not a good alternative if the different tasks depend heavily on each other's results and must be computed in a sequential order. Good results are achieved especially when the tasks are independent of each other and therefore require little or no communication. These kinds of problems are sometimes called embarrassingly parallel, since their parallelization is fairly easy. A parallel supercomputer is not always required for the implementation, since the development of computer networks has made it possible to distribute the computational work to multiple ordinary computers over an Ethernet network or even the internet. SETI@home [4] is a good example of an embarrassingly parallel problem in which the workload is distributed over millions of home computers via the internet. Another example is the solution of the set of optimal control problems described above, where the computations for the different initial states do not depend on each other.


3.3 Parallel Computer Architectures

A computer system consists of one or more processors, memory, input/output devices and communication channels. In the von Neumann architecture both program instructions and data are stored in the memory. The CPU (central processing unit) fetches the instructions and data from the memory and executes the instructions sequentially. Such computers are called stored-program computers and comprise virtually all computers today.

3.3.1 Flynn's Taxonomy

Computer architectures can be classified in a number of ways. Flynn's Taxonomy [5][6], which classifies architectures according to the number of instruction and data streams they have, is one of the most widely used. The four classes of Flynn's Taxonomy are described below.

• SISD (Single Instruction, Single Data): The most common type of computer, which includes serial computers such as most PCs. Since the CPU operates on only one instruction and data stream, the execution is deterministic.

• SIMD (Single Instruction, Multiple Data): A type of parallel computer, including processor arrays and vector pipelines, where each processor executes the same set of instructions but with different data elements. It is well suited to problems with a high degree of regularity, and its advantages are that the execution is deterministic and the amount of logic required is small.

• MISD (Multiple Instruction, Single Data): A type of parallel computer in which each processor performs different operations on the same data elements. Usually it is more efficient to have multiple data streams if there are multiple instruction streams, and because of this the MISD architecture is very rare.

• MIMD (Multiple Instruction, Multiple Data): The most common type of parallel computer nowadays, including most supercomputers, networked computer grids and some PCs. Each processor has a distinct set of instructions and data.

One addition to Flynn's taxonomy is SPMD (Single Program, Multiple Data), which is physically a MIMD computer. However, each processor executes the same program code, which makes programming simpler compared to MIMD. SPMD is also more flexible than SIMD, since the tasks executed by different processors can be at different parts of the program at a given time. This is taken care of by the programmer with the use of control structures. In this sense SPMD is more a style of computing than a distinct architecture. Nevertheless, SPMD has become very popular.

3.3.2 Memory Architectures

Computer architectures can also be classified according to the type of memory they have. Parallel computers can have shared memory, distributed memory or hybrid architectures combining elements of both.

In shared memory computers there is one global memory that all processors can access. The advantage of shared memory is that data sharing between the tasks is fast. The main disadvantage is poor scalability, which results from the increase in traffic between the CPUs and the memory as the number of CPUs grows. Scalability is also limited by the fact that it is expensive to build a shared-memory computer with a very large number of processors. The programmer must also take care of synchronization to avoid undesired results when the global memory is accessed.

Distributed memory computers, on the other hand, have local memory for each processor; there is no global memory, nor can the processors access each other's memories directly. A communication network is used to send data between the tasks with message passing, and this communication must be handled explicitly by the programmer. The advantages of distributed memory are scalability and the fact that other CPUs do not interfere when a CPU accesses its own memory. Distributed memory computers are also cheaper to build, since multiple ordinary von Neumann computers can be joined together in a network. The drawback is that communication between the tasks is more complicated than in shared memory computers.

The most powerful supercomputers today have a hybrid architecture of shared and distributed memory. They typically consist of multiple nodes connected in a network, with each node containing multiple processors and a memory that these processors share. Figure 1 depicts an example of a hybrid distributed-shared memory architecture.

Figure 1: Hybrid distributed-shared memory architecture



3.4 Parallel Programming Models

Parallel programming models can be divided into three categories: shared memory, message passing and data parallel. These models are not necessarily specific to the machine or memory architectures described above, but some are better suited to certain types of architectures. The shared-memory programming model, in which the tasks have a global address space, is easy for the programmer to use on a shared-memory computer. An example of a shared-memory programming model implementation is OpenMP. In the data parallel model there are different tasks, each of which operates on a part of a larger data structure. HPF (High Performance Fortran) is an example of an implementation of the data parallel model.

In the message passing model each task has its own memory and, as the name implies, messages are exchanged to send and receive data between the tasks. As such, the model is well suited to parallel computers with a distributed memory architecture. However, message passing can also be used on a shared-memory computer and is thus more flexible than the shared-memory programming model. Message passing can be seen as a lower level model than the other two, and in hybrid models on a distributed-memory computer message passing can be the backbone of the communication between the tasks, yet remain invisible to the programmer.

3.4.1 Message Passing Interface (MPI)

MPI (Message Passing Interface) is a specification for message passing libraries that has become the industry standard for message passing. It is supported on many platforms, so portability is good. MPI specifications have been defined for the C/C++ and Fortran languages. The MPI standard can be found in Ref. [7].

The computing tasks communicate by calling library routines to send and receive messages. When an MPI program is run, a fixed number of tasks specified by the user are created, one task per processor. Figure 2 shows communication between two tasks.

For the tasks to be able to communicate with each other, a communicator is required. The default communicator is MPI_COMM_WORLD. Each task has a rank (process ID) within the communicator, which can be used to control the program execution and to specify the source and destination of the messages between the tasks.
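The use of the communicator, ranks and point-to-point messages can be illustrated with the following minimal Fortran sketch, in the spirit of Figure 2. The program is not part of the software studied here; it simply has task 0 send one integer to task 1 and assumes it is run with at least two tasks.

PROGRAM send_recv_sketch
  ! Minimal MPI example: task 0 sends one integer to task 1.
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER :: ntask, my_id, rc, msg
  INTEGER :: status(MPI_STATUS_SIZE)

  CALL MPI_INIT(rc)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, ntask, rc)   ! number of tasks in the communicator
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_id, rc)   ! rank of this task

  IF (my_id == 0) THEN
    msg = 42
    CALL MPI_SEND(msg, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, rc)
  ELSE IF (my_id == 1) THEN
    CALL MPI_RECV(msg, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, rc)
    WRITE (*,*) 'Task', my_id, 'of', ntask, 'received', msg
  END IF

  CALL MPI_FINALIZE(rc)
END PROGRAM send_recv_sketch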


Figure 2: Send and receive between MPI tasks (Task 0 on Machine A sends data to Task 1 on Machine B)

3.5 Designing a Parallel Program

As described earlier, the main motivation for employing parallel computing is to optimize the use of computational resources and thereby save time. However, minimizing the computation time is not the only criterion when designing a parallel program; other aspects such as portability and scalability must also be taken into consideration. Scalability refers to the capability to increase performance as the number of processors increases. For example, if the execution of a given parallel program is restricted to a fixed number of processors, its scalability will be poor even though the performance with that fixed processor count might be good.

There are tools, such as parallelizing compilers and pre-processors, that automatically convert a serial program into a parallel program. However, automatic parallelization is much less flexible than manual parallelization and the results may not be as desired, which is why it is rarely a good alternative.

3.5.1 Partitioning

Breaking down the problem at hand is one of the first steps when designing a parallel program. The computational problem can be partitioned into smaller parts by either domain decomposition or functional decomposition.

• Domain decomposition: The data associated with the computational problem is divided between the tasks. Each task can then perform similar operations on its portion of the data.

• Functional decomposition: The computation is distributed among the tasks, and each task works on the same set of data.


3.5.2 Load Balancing

Once the problem has been decomposed into smaller parts, the work must be assigned to the different tasks. If load balancing is not efficient, tasks will be idle some of the time, which decreases performance. There are two main ways to balance the workload among the tasks. The first and most straightforward way is to divide the workload evenly. This method is efficient if the workload of the different data pieces is equal, or at least known beforehand, in which case each task can be assigned the correct amount of computational work.

However, in many cases the computational work required for the data pieces varies and cannot be known accurately beforehand. For these kinds of problems a more efficient way is to allocate the work to the tasks dynamically. This can be done with a scheduler (task pool) in which one task acts as a master while the others act as slave tasks. In this approach the master task distributes the data pieces one at a time to the slaves, which do the computational work.

The advantage of the latter method is that the load can be balanced efficiently even though the work pieces allocated to the tasks are not of the same size. Dynamic work allocation can also prove advantageous if there are performance deviations between the different processors. However, since the master task only schedules the work pieces and does not participate in the actual computation, it will inevitably be idle some of the time. This can be an issue if only a few processors are available for the execution of the parallel program, but when the processor count increases, one processor can easily be sacrificed to achieve a better load balance for the whole system. A master-slave hierarchy is shown in Figure 3.

Figure 3: A master-slave hierarchy

3.5.3 Communications and Granularity

Communication between the tasks always causes overhead, i.e. it takes away time that could be spent on the actual computation of the problem. Some communication is still required for load balancing.


In some cases communication is also required because there is data dependence between the tasks and results must be sent between the tasks regularly.

Granularity is defined as the ratio of computation to communication. A high granularity (coarse grain) reduces the communication overhead, thus increasing performance. However, efficient load balancing is easier in a fine-grain program, where the granularity is lower, because the work pieces allocated to the tasks are smaller.
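As a hypothetical illustration with assumed figures (not measurements from this study): if computing one work piece takes roughly 100 s and dispatching it requires messages costing roughly 1 ms in total, the computation-to-communication ratio is about $100 / 10^{-3} = 10^5$, so the program is very coarse grained and the scheduling overhead is negligible. Splitting each work piece into 1000 smaller pieces would make the load easier to balance but would lower the ratio to about $10^2$.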

4 Performance Analysis

The scalability of a parallel program can be analysed by investigating how the execution time of the program varies as the number of processors increases. Scalability is influenced by several factors which are independent of each other, including the memory-CPU bandwidth and network communication of the hardware as well as the efficiency of the parallel load balancing. The performance of the program's algorithms is of course also very important: the underlying serial program must perform well for the parallel performance to be good.

4.1 Amdahl’s Law

A fixed problem analysis can be done with Amdahl's law. The problem size is kept constant and the speed-up is the ratio of the execution times on a single processor and on multiple processors. It is limited by communication costs, the idle time of processors and the sequential components of the algorithm. Speed-up is defined as

$S_p = \frac{p}{\lambda + p(1-\lambda)}$,    (4.1)

where $\lambda$ is the parallelization ratio ($0 \le \lambda \le 1$) and $p$ is the number of processors. When $p \to \infty$,

$S_p \to \frac{1}{1-\lambda}$.    (4.2)
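As a worked example with an assumed parallelization ratio (not one measured for this program): with $\lambda = 0.95$ and $p = 16$, equation (4.1) gives

$S_{16} = \frac{16}{0.95 + 16 \cdot 0.05} = \frac{16}{1.75} \approx 9.1$,

and by (4.2) the speed-up can never exceed $1/(1-0.95) = 20$, no matter how many processors are used.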

Amdahl's law does not always convey the scalability of a parallel program well. The reason is that a good speed-up can be achieved when the problem size is increased along with the number of processors. Moreover, it is not always possible to split the program neatly into a serial part and a parallel part and thereby obtain a parallelization ratio.


4.2 Observed Speedup

Observed speedup is defined as

$S_p = \frac{W_1}{W_p}$,    (4.3)

where $W_1$ is the wall-clock time of execution on one processor and $W_p$ is the wall-clock time of execution on $p$ processors.

4.3 Efficiency

Efficiency is defined as

$e = \frac{S_p}{p}$,    (4.4)

where $S_p$ is the speed-up and $p$ is the number of processors.

The efficiency conveys how well the program is parallelized. When scaling is linear the efficiency is 1, but generally the efficiency decreases as the number of processors increases due to parallel overhead. Changes in cache performance or in the effectiveness of processor pipelining may also affect the efficiency of the program.
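As a worked example with assumed timings (not the measurements of Section 5): if a program runs in $W_1 = 1000$ s on one processor and in $W_8 = 150$ s on eight processors, equations (4.3) and (4.4) give

$S_8 = \frac{1000}{150} \approx 6.7, \qquad e = \frac{6.7}{8} \approx 0.83$,

i.e. about 83 % of the ideal linear speed-up is realized.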

5 Numerical Examples

5.1 The IBMSC Parallel Environment

The parallel program studied in this paper has been designed to run on the IBM eServer Cluster 1600 (IBMSC) at CSC, the Finnish IT centre for science. The IBMSC is a supercomputer with a hybrid memory architecture containing 16 nodes, each with 32 CPUs. Each node is a symmetric multiprocessor (SMP) with shared memory, and the nodes are connected by a communication network. For more information on the IBMSC, see Ref. [8]; for more information on CSC, see Ref. [9].


5.2 An MPI Program

The message passing model is a good choice for computers with distributed memory, such as the IBMSC. The shared-memory programming model could also have been used, but this would have restricted the execution of the program to the processors within a single node, limiting the scalability of the program. The data-parallel model would also have been an option had HPF (High Performance Fortran) been available on the IBMSC. MPI (Message Passing Interface) has been used for the implementation of message passing. PVM (Parallel Virtual Machine) was another possible technology, but it is older and in many ways inferior to MPI. The main advantage of MPI is portability.

The parallel program is written in Fortran using the SPMD model. All tasks execute the same program file, and MPI together with control structures is used to identify the task number and control the flow of the program accordingly.

At the start of the program, MPI must be initialized by all tasks with the subroutine MPI_INIT. After this the tasks call the MPI_COMM_SIZE and MPI_COMM_RANK subroutines to determine the size of the communicator and their own rank, respectively.

As described earlier, the workload of the parallel program can be divided using either equal partitioning or dynamic work allocation with a scheduler. Both methods have been studied here. The implementation in Fortran using the MPI library can be seen in the Appendix.

5.2.1 Static Load Balancing

The optimal control problems described in Section 2 can be divided equally among the computing tasks. One task first reads the required parameters from an input file, and MPI_BCAST is then used to broadcast them to all the other tasks.

With this method the parallel program is very coarse grained, since the only communication occurs at the start of the program when the parameters, including the total data size, are broadcast to the tasks. Once the tasks know their share of the workload, no more communication is required, and each task writes the results of its computation to specified files. In fact, even the broadcast could be omitted if each task were to read the parameters from the input file itself; however, I/O operations generally cost more computational resources than intertask communication.
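The broadcast step can be sketched as follows. The routine, the input file name and the assumption that the parameters fit into a single integer array are illustrative; the actual program reads its own input format.

SUBROUTINE broadcast_parameters(params, num_params, my_id, root_id)
  ! Sketch of the parameter broadcast: the root task reads the input file
  ! and MPI_BCAST delivers the parameters to all tasks in the communicator.
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER, INTENT(IN)    :: num_params, my_id, root_id
  INTEGER, INTENT(INOUT) :: params(num_params)
  INTEGER :: rc

  IF (my_id == root_id) THEN
    OPEN (10, FILE='input.dat')   ! only the root task touches the input file
    READ (10,*) params
    CLOSE (10)
  END IF

  ! Every task calls MPI_BCAST; on return all tasks hold the same parameters.
  CALL MPI_BCAST(params, num_params, MPI_INTEGER, root_id, MPI_COMM_WORLD, rc)
END SUBROUTINE broadcast_parameters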


5.2.2 Dynamic Load Balancing

With the dynamic load balancing method the execution of the program branches into two parts depending on whether the task is the master or a slave.

• Slave task: A slave receives work pieces from the master one at a time. For each piece it performs the desired computation and notifies the master upon completion. This is repeated until there are no more work pieces left and a stop code is received from the master.

• Master task: First, the master sends one work piece to each slave (assuming the number of work pieces is greater than the number of tasks), after which it goes into a receiving mode. Every time it receives a notification that a slave has finished a work piece, it sends out the next work piece that has not yet been sent. After all work pieces have been sent, the master sends stop codes to the slaves.

Messages between the tasks are sent and received with the subroutines MPI_SEND and MPI_RECV, respectively.

5.3 Results

5.3.1 Fixed Problem Analysis

The execution times of the static and dynamic load balancing methods were examined with varying processor counts and a fixed problem size. The results are listed in Table 1 and plotted in Figure 4. Equations (4.3) and (4.4) are used to calculate the observed speedup and efficiency, shown in Figures 5 and 6.

Number of processors   Wall-clock time / s, dynamic   Wall-clock time / s, static
 1                     11203,9
 2                     11004,8                        5942,5
 4                      4062,5                        3254,3
 8                      1724,8                        1609,5
16                       839,2                         890,2

Table 1: Wall-clock times with variable processor counts


Figure 4: Wall-clock times for the dynamic and static methods as a function of the number of processors

Figure 5: Observed speedup for the dynamic and static methods as a function of the number of processors

As expected, when the number of processors is small, the wall-clock times are lower and the observed speedup is greater with static load balancing. However, the gap narrows quickly, and with a processor count of 16 the dynamic load balancing method yields better results.


Figure 6: Efficiency for the dynamic and static methods as a function of the number of processors

The efficiency varies considerably between the two methods. The efficiency of the dynamic load balancing method is poor with small processor counts but rises as the number of processors increases. The efficiency of the static load balancing method, on the other hand, decreases as the number of processors goes up. Perhaps surprisingly, the efficiency of the static method with 8 processors is relatively high; the most probable explanation is that the load balance happened to be good with this particular number of processors. Note also that the efficiency is always below 1, which is virtually always the case in practice.

5.3.2 Analysis with a Variable Problem Size

Scalability was also examined with a variable problem size analysis. Here the problem size could be changed by choosing the number of initial states to be computed. In this analysis dynamic load balancing was used and the program was executed on 32 processors. The results are shown in Figure 7. Note that the wall-clock times obtained in this analysis are not comparable with those of the fixed problem analysis due to differences in compiler options.


Figure 7: Execution times with variable problem sizes (problem size 32: 917,8 s; problem size 64: 1975,5 s; problem size 128: 3777,1 s)

From the results it can be seen that as the problem size is doubled, the execution time also roughly doubles, as expected.

6 Summary and Conclusions

In this paper parallel programming has been applied to the numerical solution of independent optimal control problems. The computational problem can be classified as embarrassingly parallel, and the scalability tests showed that the computation time can be decreased efficiently by using multiple processors at once. However, even though the wall-clock time is lower, it should be noted that the total CPU time, i.e. the sum of the wall-clock times over all processors, tends to be higher with multiple processors, since the efficiency is below 1. This is the result of parallel overhead and the impossibility of ideal load balancing.

In this paper domain decomposition was used in partitioning the computational problem, because an existing algorithm was available that solves one optimal control problem at a time. However, functional decomposition could also have been used by decomposing the solution of each optimal control problem and distributing the computational work to different processors. This kind of approach to a similar computational problem has been studied in Ref. [10], where the trajectory of an aircraft is broken into phases that are simulated in parallel.

Here, the parallel program was developed to compute the optimal control problems described in Section 2, but with only minor changes the same program code can be used in other cases where independent computations must be done for a set of data pieces. However, if there is data dependence between the tasks, a different and more complex parallel program is needed.


Further development of the parallel program would include modelling the MPI communication costs, which would require measuring the environment dependent parameters in Equation (7.1) and Table 2. If the communication costs were known, it would be possible to adjust the granularity of the program to achieve a better balance between efficient load balancing and parallel overhead.


7 Appendix

7.1 MPI Communication Costs

The communication cost $T$ of the point-to-point operations MPI_SEND and MPI_RECV can be modelled with the equation

$T = t_s + t_w N$,    (7.1)

where $t_s$ is the start-up time, $t_w$ is the bandwidth dependent transfer time per word and $N$ is the message length in words.
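For example, with assumed hardware parameters $t_s = 50\ \mu\mathrm{s}$ and $t_w = 0.1\ \mu\mathrm{s}$ per word (illustrative values, not measured on the IBMSC), sending a message of $N = 1000$ words would cost

$T = 50\ \mu\mathrm{s} + 0.1\ \mu\mathrm{s} \cdot 1000 = 150\ \mu\mathrm{s}$,

whereas a four-word work piece would cost about $50.4\ \mu\mathrm{s}$; for the short messages used in this program the start-up time dominates.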

The communication costs of global operations using hypercube communication algorithms on the idealized multicomputer architecture are listed in Table 2, where $P$ is the number of processors and $t_{op}$ is the cost of a single reduction operation. Since global operations involve internal communication, they cannot be modelled with a simple linear model like the point-to-point operations. For large messages, bandwidth costs can increase the cost of communication. See Ref. [11] for more details.

Operation       Cost (small N)                     Cost (large N)
MPI_BARRIER     $t_s \log P$                       $t_s \log P$
MPI_BCAST       $(t_s + t_w N)\log P$              $2(t_s \log P + t_w N)$
MPI_SCATTER     $t_s \log P + t_w N$               $t_s \log P + t_w N$
MPI_GATHER      $t_s \log P + t_w N$               $t_s \log P + t_w N$
MPI_REDUCE      $(t_s + (t_w + t_{op})N)\log P$    $2 t_s \log P + (2 t_w + t_{op})N$
MPI_ALLREDUCE   $(t_s + (t_w + t_{op})N)\log P$    $2 t_s \log P + (2 t_w + t_{op})N$

Table 2: MPI communication costs

The costs of MPI_INIT and MPI_FINALIZE depend on the implementation and can be high. However, this is not significant since they are executed only once during the program.


7.2 Code Excerpts

7.2.1 Static Load Balancing

SUBROUTINE balance_even()
  ! Variables such as num_states, first_state, ntask, my_id, the loop counters
  ! ctr0-ctr3 and the loop bounds are assumed to be declared at module level.
  INTEGER :: &
    my_size, &
    my_start, &
    my_end

  ! Number of initial states to be distributed among the tasks.
  data_size = num_states - (first_state - 1)
  IF (data_size < ntask) THEN
    WRITE(*,*) 'Error: Data size < number of tasks!'
    RETURN
  END IF

  ! Divide the states as evenly as possible: the first MOD(data_size,ntask)
  ! tasks get one extra state.
  IF (my_id < MOD(data_size, ntask)) THEN
    my_size  = data_size / ntask + 1
    my_start = first_state + my_id * my_size
  ELSE
    my_size  = data_size / ntask
    my_start = first_state + data_size - (ntask - my_id) * my_size
  END IF
  my_end = my_start + my_size - 1

  WRITE (*,*) '[', my_id, '] data_size=', data_size, ' ntask=', ntask, &
    ' my_size=', my_size, ' my_start=', my_start, ' my_end=', my_end

  ! Each task solves the optimal control problems of its own range of
  ! initial states for all players and support time pairs.
  loop_state: DO ctr0 = my_start, my_end
    loop_player: DO ctr1 = first_player, num_players
      loop_support1: DO ctr2 = first_support1, num_support1
        loop_support2: DO ctr3 = first_support2, num_support2
          CALL doWork(ctr0, ctr1, ctr2, ctr3)
        END DO loop_support2
      END DO loop_support1
    END DO loop_player
  END DO loop_state
END SUBROUTINE balance_even

7.2.2 Dynamic Load Balancing

SUBROUTINE balance_dynamic()
  ! Shared variables (my_id, root_id, ntask, rc, data_size, the loop counters
  ! ctr0-ctr3, i, and the problem dimensions) are assumed to be declared at
  ! module level.
  INTEGER :: &
    sendcount, &
    recvcount, &
    source_id, &
    dest_id, &
    num_slaves, &
    res, &
    workpiece_number
  INTEGER, PARAMETER :: &
    stop_task_code = -1, &
    message_tag = 50
  INTEGER, DIMENSION(4) :: workpiece
  INTEGER, DIMENSION(:), ALLOCATABLE :: status
  INTEGER, DIMENSION(:,:,:,:), ALLOCATABLE :: work_done

  ALLOCATE(status(MPI_STATUS_SIZE))
  sendcount = 0
  recvcount = 0
  res = 0   ! result code, used only as an acknowledgement message

  IF (my_id /= root_id) THEN
    ! Slave task: receive work pieces one at a time until a stop code arrives.
    DO
      CALL MPI_RECV(workpiece, 4, MPI_INTEGER, root_id, message_tag, &
                    MPI_COMM_WORLD, status, rc)
      recvcount = recvcount + 1
      IF (workpiece(1) == stop_task_code) THEN
        WRITE (*,*) my_id, ' - stop code received'
        EXIT
      END IF
      CALL doWork(workpiece(1), workpiece(2), workpiece(3), workpiece(4))
      ! Notify the master that this work piece is finished.
      CALL MPI_SEND(res, 1, MPI_INTEGER, root_id, message_tag, &
                    MPI_COMM_WORLD, rc)
      sendcount = sendcount + 1
    END DO
  ELSE
    ! Master task: distribute the work pieces to the slaves.
    num_slaves = ntask - 1
    data_size = num_states * num_support1 * num_support2 * num_players
    IF (data_size < num_slaves) THEN
      WRITE(*,*) 'Error: Data_size < num_slaves'
      GOTO 50
    END IF
    WRITE(*,*) my_id, ' - master loop starting'
    ALLOCATE(work_done(num_states, num_players, num_support1, &
                       num_support2))
    work_done(:,:,:,:) = 0
    workpiece_number = 0
    loop_support2: DO ctr3 = first_support2, num_support2
      loop_support1: DO ctr2 = first_support1, num_support1
        loop_player: DO ctr1 = first_player, num_players
          loop_state: DO ctr0 = first_state, num_states
            workpiece_number = workpiece_number + 1
            IF (workpiece_number > num_slaves) THEN
              ! All slaves are busy: wait for any slave to finish and send
              ! the next work piece to it.
              CALL MPI_RECV(res, 1, MPI_INTEGER, MPI_ANY_SOURCE, &
                            message_tag, MPI_COMM_WORLD, status, rc)
              recvcount = recvcount + 1
              source_id = status(MPI_SOURCE)
              dest_id = source_id
            ELSE
              ! The first num_slaves work pieces go to slaves 1..num_slaves.
              dest_id = workpiece_number
            END IF
            workpiece = (/ ctr0, ctr1, ctr2, ctr3 /)
            CALL MPI_SEND(workpiece, 4, MPI_INTEGER, dest_id, &
                          message_tag, MPI_COMM_WORLD, rc)
            sendcount = sendcount + 1
            work_done(ctr0, ctr1, ctr2, ctr3) = dest_id
          END DO loop_state
        END DO loop_player
      END DO loop_support1
    END DO loop_support2
50  CONTINUE
    ! All work pieces have been sent: send a stop code to every slave.
    DO i = 1, num_slaves
      workpiece(1) = stop_task_code
      CALL MPI_SEND(workpiece, 4, MPI_INTEGER, i, message_tag, &
                    MPI_COMM_WORLD, rc)
      sendcount = sendcount + 1
    END DO
  END IF

  DEALLOCATE(status)
  IF (my_id == root_id) THEN
    WRITE (*,*) work_done(:,:,:,:)
    DEALLOCATE(work_done)
  END IF
END SUBROUTINE balance_dynamic


8 References

[1] Shinar, J., Siegel, A. W., Gold, Y. I., On the Analysis of a Complex Differential Game Using Artificial Intelligence Techniques, Proceedings of the 27th IEEE Conference on Decision and Control, IEEE, 1988.
[2] Le Ménec, S., Bernhard, P., Decision Support System for Medium Range Aerial Duels Combining Elements of Pursuit-Evasion Game Solutions with AI Techniques, Annals of the International Society of Dynamic Games, 1995.
[3] Karelahti, J., Virtanen, K., Raivio, T., Game Optimal Support Time of a Medium Range Air-to-Air Missile, Systems Analysis Laboratory, Helsinki University of Technology, 2005.
[4] University of California, SETI@home, 2005, http://setiathome.ssl.berkeley.edu/
[5] Flynn, M. J., Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, Vol. C-21, No. 9, 1972, pp. 948-960.
[6] Duncan, R., A Survey of Parallel Computer Architectures, IEEE Computer, 1990, pp. 5-16.
[7] Message-Passing Interface Forum, MPI: A Message-Passing Interface Standard, 1995, http://www.mpi-forum.org/docs/docs.html
[8] Uusvuori, R., Kupila-Rantala, T., IBMSC User's Guide, CSC - Scientific Computing, 2003.
[9] Kupila-Rantala, T., CSC User's Guide, 2000, http://www.csc.fi/oppaat/cscuser/cscuser.pdf
[10] Betts, J. T., Huffman, W. P., Trajectory Optimization on a Parallel Processor, Journal of Guidance, Control, and Dynamics, 1991.
[11] Foster, I., Designing and Building Parallel Programs, 1995, http://www-unix.mcs.anl.gov/dbpp/
