Mat-2.108 Independent Research Project in Applied Mathematics
APPLYING PARALLEL COMPUTING IN THE NUMERICAL SOLUTION OF OPTIMAL CONTROL PROBLEMS
Sampsa Ruutu
Systems Analysis Laboratory
Helsinki University of Technology
Table of Contents

1 Introduction
2 Air Combat Duel
  2.1 Missile Support Time Game
  2.2 Solution Methods
3 Parallel Computing
  3.1 Overview
  3.2 When to Use Parallel Computing
  3.3 Parallel Computer Architectures
    3.3.1 Flynn's Taxonomy
    3.3.2 Memory Architectures
  3.4 Parallel Programming Models
    3.4.1 Message Passing Interface (MPI)
  3.5 Designing a Parallel Program
    3.5.1 Partitioning
    3.5.2 Load Balancing
    3.5.3 Communications and Granularity
4 Performance Analysis
  4.1 Amdahl's Law
  4.2 Observed Speedup
  4.3 Efficiency
5 Numerical Examples
  5.1 The IBMSC Parallel Environment
  5.2 An MPI Program
    5.2.1 Static Load Balancing
    5.2.2 Dynamic Load Balancing
  5.3 Results
    5.3.1 Fixed Problem Analysis
    5.3.2 Analysis with a Variable Problem Size
6 Summary and Conclusions
7 Appendix
  7.1 MPI Communication Costs
  7.2 Code Excerpts
    7.2.1 Static Load Balancing
    7.2.2 Dynamic Load Balancing
8 References
1 Introduction
In this study, parallel computing is applied to the numerical solution of optimal control problems.
Situations often arise where a large number of independent optimal control problems must be
solved for various initial states; instead of using only one processor for the computation, the
workload can be distributed to several processors that operate simultaneously.
Here, the optimal control problems are a part of an air combat model in which the pilots must
decide the optimal support times of launched missiles. The support time refers to the time the
launching aircraft gives target information to the missile. Although the computational problem
considered here comprises the numerical solution of a set of optimal control problems, the same
methods of parallel programming can be used in various other applications where independent
computation must be done for a set of data pieces.
The numerical solution of optimal control problems generally requires considerable computational
resources even though quite efficient algorithms designed for this purpose exist. An optimal
control problem can be solved with either direct or indirect methods. Indirect methods rely on
solving the first order necessary conditions of optimality and finding a solution with a root
finding algorithm. Their disadvantage is that the region of convergence is small, so a good
initial guess is required; this makes them less robust than direct methods. In direct methods
the control and state variables are parameterized and the problem is solved with an efficient
non-linear programming solver.
2 Air Combat Duel
The air combat setting studied in this paper considers a duel of two fighter aircraft. Both are
equipped with medium range air-to-air missiles that they launch towards each other simultaneously.
The initial states of the aircraft are chosen such that they are too close to each other to avoid the
confrontation.
In order to reach its target, a missile requires target information from either the launching aircraft or
its own radar. After an aircraft has launched a missile, it typically tracks the target aircraft with its
radar for a specific time and transmits the target information to the missile. After this, the launching
aircraft evades and the missile has to extrapolate the position of the target using prior information.
Finally, the missile tries to lock on the target by switching on its own radar. This happens when the
missile has reached a certain lock-on distance to the estimated target position. The lock-on distance
is the maximum distance at which the missile radar is able to lock on to its target with a given probability.
Point-mass models with three degrees of freedom are used to model the dynamics of the aircraft and
missile. The aircraft can be controlled by adjusting the angle of attack, bank angle and thrust. The
missile autopilot uses a feedback guidance law called Proportional Navigation.
2.1 Missile Support Time Game
In the setting above, the pilots must decide the support times of their missiles, i.e. the times during which
they transmit target information to the missile. A longer support time will shorten the extrapolation
time of the missile, thus increasing the probability of the missile being able to lock on to its target
and eventually making a successful hit. However, prolonging the support time also increases the
probability of hit of the enemy missile since during this phase the aircraft must fly towards the
enemy aircraft.
In Refs. [1] and [2] the launch and evasion times are computed using artificial intelligence
techniques combined with pursuit-evasion game solutions based on the assumption that the
adversary follows a specific feedback rule. However, with these techniques only suboptimal
controls can be obtained. The system studied in this paper is based on a static game in which the
pay-off functions of both players are weighted averages of the probabilities of hit and their own
survival. The game and its solution methods are described in detail in Ref. [3].
2.2 Solution Methods
To be able to solve the optimal support time for an arbitrary initial state in real time, a set of optimal
control problems must be solved beforehand for various initial states. In these problems, the players
try to minimize the closing velocity of the approaching missile to minimize the probability of hit.
The support times of both players are restricted to a discrete set of times, denoted by

T^B = {t^B_min, t^B_2, …, t^B_n},  (2.1)

T^R = {t^R_min, t^R_2, …, t^R_n},  (2.2)
where
B = Blue player,
R = Red player.
For each initial state, the solution of the optimal control problem is computed for a set of support
time pairs, denoted by
(t^B_s, t^R_s) ∈ T^B × T^R.  (2.3)
Once these optimal control problems have been computed, the values of the pay-off functions can
be calculated. The solution related to an arbitrary initial state can then be obtained in real time by
linearly interpolating from the previously computed results.
As such, it is not crucial that the optimal control problems solved off-line are computed fast.
However, since the optimal control problems related to the different initial states are independent,
with parallel computation it is possible to use existing computing resources more efficiently.
3 Parallel Computing
3.1 Overview
Computer programs are traditionally intended to be executed by a single processor. Serial programs
comprise a series of instructions that the processor executes sequentially one at a time. The
paradigm of parallel computing is different, since the idea is to use multiple computing resources
simultaneously. Whereas a serial computer such as an ordinary PC typically has only one processor,
a parallel computer consists of multiple processors that are able to work together. There are
different types of parallel computers ranging from a single computer containing several processors
to multiple computers in a network. A combination of these is also possible.
The main reason for using parallel computing is to optimize the use of time and computational
resources. The same problem can be solved faster if the workload is divided among several processors,
or alternatively a larger problem can be solved in a given timeframe. Another reason to use parallel
computing is to overcome memory constraints, since the lack of sufficient memory can be an
issue when solving a large problem on a single serial computer.
Over time the computing power of computers has increased rapidly, but the demand for greater
computing power has grown as well, because more computing power allows complex systems to be
modelled more accurately. Traditionally parallel computing has been seen as the high end of
computing, with science and engineering in particular as its main application areas. Important
research areas that use parallel programming extensively range from
climate modelling to the study of chemical reactions and electronic circuits, to name but a few.
However, science and engineering are no longer the only areas that employ parallel computing,
since nowadays commercial applications such as multimedia and parallel databases also require
more and more computing resources.
With current technology the performance of a single processor cannot continue to increase
indefinitely, since physical limitations such as the speed of light ultimately constrain processor
clock cycle times. There are also limits to miniaturization, and the number of transistors on a
chip cannot grow forever. Moreover, even if it were theoretically possible to build faster
processors, it is more economical to use multiple moderately fast processors to achieve the same
performance. The factors that affect the performance of a computer are the time required to
perform a basic operation and the number of basic operations that the computer can perform at
the same time; parallel computing increases the latter. In summary, parallelism seems to be the
future of computing, judging from the multiprocessor architectures that have emerged and the
faster networks that allow distributed computing.
3.2 When to Use Parallel Computing
A parallel computing approach is feasible when the computation can be divided into separate work
pieces that can be computed simultaneously. Parallelization is not a good alternative if the
different tasks depend heavily on each other's results and must be computed in a sequential order.
Good results are achieved especially in situations where the tasks are independent of each other
and little if any communication between them is required. These kinds of problems are sometimes
called embarrassingly parallel, since their parallelization is fairly easy. A parallel
supercomputer is not always required for the implementation, since the development of computer
networks has made it possible to distribute the computational work to multiple ordinary computers
over an Ethernet network or even the internet. SETI@home [4] is a good example of an
embarrassingly parallel problem, in which the workload is distributed over millions of home
computers via the internet. Another example is the solution of the set of the aforementioned
optimal control problems, where the computations for the different initial states do not depend
on each other.
3.3 Parallel Computer Architectures
A computer system consists of one or more processors, memory, input/output devices and
communication channels. In the von Neumann architecture both program and data instructions are
stored in the memory. The CPU (central processing unit) gets the instructions and data from the
memory and executes them sequentially. These kinds of computers are called stored-program
computers and comprise virtually all computers today.
3.3.1 Flynn's Taxonomy
Computer architectures can be classified in a number of ways. Flynn's Taxonomy [5] [6], which
classifies the architectures depending on the number of instruction and data streams they have, is
one of the most widely used. The four classes according to Flynn's Taxonomy are described below.
• SISD (Single Instruction, Single Data): The most common type of computer which includes
serial computers such as most PCs. Since the CPU operates on only one instruction and data
stream, the execution is deterministic.
• SIMD (Single Instruction, Multiple Data): A type of parallel computer, including processor
arrays and vector pipelines, where each processor executes the same set of instructions but with
different data elements. It is well suited for problems with a high degree of regularity and its
advantages are that the execution is deterministic and the amount of logic required is small.
• MISD (Multiple Instruction, Single Data): A type of parallel computer in which each
processor does different operations on the same data elements. Usually it is more efficient to
have multiple data streams if there are multiple instruction streams, and because of this the
MISD architecture is very rare.
• MIMD (Multiple Instruction, Multiple Data): The most common type of parallel computer
nowadays including most supercomputers, networked computer grids and some PCs. Each
processor has a distinct set of instructions and data.
One addition to Flynn's taxonomy is SPMD (single program, multiple data) which is physically a
MIMD computer. However, each processor executes the same program code making the
programming simpler compared to MIMD. SPMD is also more flexible than SIMD since the
different tasks which are executed by different processors can be at different parts of the program at
a given time. This can be taken care of by the programmer with the use of control structures. In this
sense SPMD is more of a style of computing than a distinct architecture. Nevertheless, SPMD has
become very popular.
3.3.2 Memory Architectures
Computer architectures can also be classified according to the type of memory they have. Parallel
computers can have shared memory, distributed memory or hybrid architectures combining
elements of both.
In shared memory computers there is one global memory that all processors can access. The
advantage of shared memory is that data sharing between the tasks is fast. The main disadvantage is
poor scalability which results from the increase in traffic between the CPUs and the memory as the
number of CPUs increases. Scalability is also limited due to the fact that it is expensive to build a
shared-memory computer with a very large number of processors. The programmer must also take
care of synchronization to avoid undesired results when accessing the global memory.
Distributed memory computers, on the other hand, have local memory for each processor; there is
no global memory, nor can the processors access each other's memories directly. A communication
network is used to send data between the tasks with message passing, and this communication must
be handled explicitly by the programmer. The advantages of distributed memory are scalability and
the fact that other CPUs do not interfere when a CPU accesses its own memory. Also, distributed
memory computers are cheaper to build, since multiple ordinary von Neumann computers can be
joined together in a network. The drawback is that communication between the tasks is more
complicated than in shared memory computers.
The most powerful supercomputers today have a hybrid architecture of shared and distributed
memory. They typically consist of multiple nodes connected in a network, with each node
containing multiple processors and a memory that these processors share. Figure 1 depicts an
example of a hybrid distributed-shared memory architecture.
Figure 1: Hybrid distributed-shared memory architecture
3.4 Parallel Programming Models
Parallel programming models can be divided into three categories: shared memory, message passing
and data parallel. These models are not necessarily specific to the machine or memory architectures
described above, but some are better suited for certain types of architectures. The shared-memory
programming model, in which the tasks have a global address space, is easy to use for the
programmer on a shared-memory computer. An example of a shared-memory programming model
implementation is OpenMP. In the data parallel model there are different tasks each of which
operates on a part of a larger data structure. HPF (High Performance Fortran) is an example of an
implementation of the data parallel model.
In the message passing model each task has its own memory, and as the name implies, messages are
exchanged to send and receive data between the tasks. As such, it is well suited for parallel
computers with distributed memory architecture. However, message passing can also be used on a
shared-memory computer and is thus more flexible to use than the shared-memory programming
model. Message passing can be seen as a lower level model than the other two models mentioned,
and in hybrid models on a distributed-memory computer message passing can be the backbone of
the communications between the tasks, yet invisible to the programmer.
3.4.1 Message Passing Interface (MPI)
MPI (Message Passing Interface) is a specification for message passing libraries, which has become
the industry standard for message passing. It is supported on many platforms and portability is
therefore good. MPI specifications have been defined for C/C++ and Fortran languages. The MPI
standard can be found in Ref. [7].
The computing tasks communicate by calling library routines to send and receive messages. When an
MPI program is run, a fixed number of tasks specified by the user is created, one task per
processor. Figure 2 shows communication between two tasks.
For the tasks to be able to communicate with each other, a communicator is required. The default
communicator is MPI_COMM_WORLD. Each task has a rank (process ID) within the
communicator, which can be used to control the program execution and to specify the source and
destination of the messages between the tasks.
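As a minimal illustration of these concepts (a sketch only, not part of the program studied in this
paper), the following Fortran fragment initializes MPI, queries the size of MPI_COMM_WORLD and the
task's own rank, and passes a single integer from rank 0 to rank 1. The program and variable names
are chosen here purely for illustration.

PROGRAM rank_demo
  USE mpi
  IMPLICIT NONE
  INTEGER :: rc, ntask, my_id, value
  INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status

  ! Initialize MPI and query the communicator size and this task's rank.
  CALL MPI_INIT(rc)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, ntask, rc)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, my_id, rc)

  ! Rank 0 sends one integer to rank 1, which receives and prints it.
  IF (my_id == 0 .AND. ntask > 1) THEN
    value = 42
    CALL MPI_SEND(value, 1, MPI_INTEGER, 1, 99, MPI_COMM_WORLD, rc)
  ELSE IF (my_id == 1) THEN
    CALL MPI_RECV(value, 1, MPI_INTEGER, 0, 99, MPI_COMM_WORLD, status, rc)
    WRITE (*,*) 'Task', my_id, 'of', ntask, 'received', value
  END IF

  CALL MPI_FINALIZE(rc)
END PROGRAM rank_demo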
Figure 2: Send - Receive between MPI tasks
3.5 Designing a Parallel Program
As described earlier, the main motivation for employing parallel computing is to optimize the use
of computational resources in order to save time. However, minimizing the computation time is not
the only criterion when designing a parallel program; other aspects such as portability and
scalability must also be taken into consideration. Scalability refers to the capability to increase
performance as the number of processors increases. For example, if the execution of a given parallel
program is restricted to a fixed number of processors, its scalability is poor even though the
performance with that fixed processor count might be good.
There are tools available to automatically convert a serial program into a parallel one, such as
parallelizing compilers and pre-processors. However, automatic parallelization is much less flexible
than manual parallelization and the results may not be as desired, which is why it is rarely a good
alternative.
3.5.1 Partitioning
Breaking down the problem at hand is one of the first steps when designing a parallel program. The
computational problem can be partitioned into smaller parts either by domain decomposition or
functional decomposition.
• Domain decomposition: The data associated with the computational problem is divided
between the tasks. Each task can then perform similar operations on its own portion of the data.
• Functional decomposition: The computation itself is divided among the tasks, and each task
works on the same set of data.
3.5.2 Load Balancing
Once the problem has been decomposed into smaller parts, the work must be assigned to the
different tasks. If load balancing is not efficient, tasks will be idle some of the time thus decreasing
performance. There are two main ways to balance the workload among the tasks. The first and most
straightforward way is to divide the workload evenly. This method is efficient if the workload
specific to the different data pieces is equal or at least can be known beforehand, in which case each
task can be assigned the correct amount of computational work.
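As an illustration of even workload division, the index range of the work pieces can be split so
that no two tasks differ by more than one piece. The sketch below is not part of the program
described later; the names are hypothetical, though similar in spirit to the appendix code.

SUBROUTINE block_partition(n_pieces, n_tasks, task_id, my_start, my_end)
  ! Assign work pieces 1..n_pieces to tasks 0..n_tasks-1 in contiguous blocks.
  ! The first MOD(n_pieces, n_tasks) tasks receive one extra piece.
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: n_pieces, n_tasks, task_id
  INTEGER, INTENT(OUT) :: my_start, my_end
  INTEGER :: base, extra

  base  = n_pieces / n_tasks
  extra = MOD(n_pieces, n_tasks)
  IF (task_id < extra) THEN
    my_start = task_id * (base + 1) + 1
    my_end   = my_start + base
  ELSE
    my_start = extra * (base + 1) + (task_id - extra) * base + 1
    my_end   = my_start + base - 1
  END IF
END SUBROUTINE block_partition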
However, in many cases the computational work required for the data pieces varies and cannot be
known accurately beforehand. For these kinds of problems a more efficient way is to allocate the
work to the tasks dynamically. This can be done with a scheduler, or task pool, in which one task
acts as a master while the others act as slaves. In this approach the purpose of the master task is to
distribute the data pieces one at a time to the slaves, which do the computational work.
The advantage of the latter method is that the load balancing can be done efficiently even though the
work pieces allocated to the tasks are not of the same size. Also, if there are performance deviations
between the different processors, the method of dynamic work allocation can prove to be
advantageous. However, since the master task only schedules the work pieces and doesn't
participate in the actual computation, it will undoubtedly be idle some of the time. This can be an
issue if there are only a few processors available for the execution of the parallel program. However,
when the processor count increases one processor can easily be sacrificed to achieve a better load
balance for the whole system. A master-slave hierarchy can be seen in Figure 3.
3.5.3 Communications and Granularity
Communication between the tasks always causes overhead, i.e. takes away time that could be spent
on the actual computation of the problem. Some communication is still required for load balancing.
Figure 3: A master-slave hierarchy
In some cases communication is also required because there is data dependence between the tasks
and the tasks must send results to each other regularly.
Granularity is defined as the ratio between computation and communication. A high granularity
(coarse-grain) reduces the communication overhead thus increasing performance. However,
efficient load balancing is easier in a fine-grain program where the granularity is lower because the
work pieces allocated to the tasks are smaller.
4 Performance Analysis
The scalability of a parallel program can be analysed by investigating how the execution times of
the program vary as the number of processors increases. Scalability is influenced by several factors
which are independent of each other. These include the memory-CPU bandwidths and network
communications of the hardware as well as the efficiency of parallel load balancing. The
performance of the program's algorithms is of course also very important when considering the
performance of a parallel program, and the respective serial program must demonstrate good
performance for the parallel performance to be good.
4.1 Amdahl’s Law
A fixed problem analysis can be done with Amdahl's law. The problem size is kept constant and the
speed-up is the ratio of the execution times on a single processor and on multiple processors. It is
limited by communication costs, idle time of processors and the sequential components of the
algorithm. Speed-up is defined as
S_p = p / (λ + p(1 − λ)),  (4.1)

where λ is the parallelization ratio, 0 ≤ λ ≤ 1, and p is the number of processors.

When p → ∞, S_p → 1 / (1 − λ).  (4.2)
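For example, with a parallelization ratio of λ = 0.95 (a value chosen here purely for illustration),
Equation (4.1) gives S_16 = 16 / (0.95 + 16 · 0.05) = 16 / 1.75 ≈ 9.1 on 16 processors, and by
Equation (4.2) the speed-up can never exceed 1 / 0.05 = 20 regardless of the number of processors.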
Amdahl's law does not always convey the scalability of a parallel program well. One reason is that
a good speed-up can still be achieved if the problem size is increased as the number of processors
increases. Also, a program cannot always be split easily into a serial and a parallel part so that a
parallelization ratio could be obtained.
4.2 Observed Speedup
Observed speedup is defined as
S_p = W_1 / W_p,  (4.3)

where W_1 is the wall-clock time of execution on one processor and W_p is the wall-clock time of
execution on p processors.
4.3 Efficiency
Efficiency is defined as
e = S_p / p,  (4.4)

where S_p is the speed-up and p is the number of processors.
The efficiency conveys how well the program is parallelized. When scaling is linear, the efficiency
is 1, but generally the efficiency decreases as the number of processors increases due to parallel
overhead. Cache performance and the effectiveness of processor pipelining may also change and
affect the efficiency of the program.
5 Numerical Examples
5.1 The IBMSC Parallel Environment
The parallel program studied in this paper has been designed to run on the IBM eServer Cluster
1600 (IBMSC) at the Finnish IT centre for science (CSC). The IBMSC is a supercomputer with a
hybrid memory architecture containing 16 nodes each with 32 CPUs. Each node is a symmetric
multiprocessor (SMP) containing shared memory and there is a communication network between
the nodes. For more information on IBMSC, see Ref. [8]. For more information on CSC, see Ref.
[9].
5.2 An MPI Program
The message passing model is a good choice for computers that have distributed memory, such as
the IBMSC. The shared-memory programming model could also have been used, but this would have
restricted the execution of the program to the processors within a single node, thus limiting the
scalability of the program. The data-parallel model would also have been an option had HPF (High
Performance Fortran) been available on the IBMSC. MPI (Message Passing Interface) has been used
for the implementation of message passing. PVM (Parallel Virtual Machine) was another possible
technology, but it is older and in many ways inferior to MPI. The main advantage of MPI is
portability.
The parallel program is written in Fortran using the SPMD model. All tasks execute the same
program file, and MPI calls together with control structures are used to identify the task number
and control the flow of the program accordingly.
At the start of the program, MPI must be initialized by all tasks with the subroutine MPI_INIT.
After this the tasks call the MPI_COMM_SIZE and MPI_COMM_RANK subroutines to determine
the size of the communicator and their own rank, respectively.
As described earlier, the workload of the parallel program can be divided using either equal
partitioning or dynamic work allocation with a scheduler. Both methods have been studied here.
The implementation in Fortran using the MPI library can be seen in the Appendix.
5.2.1 Static Load Balancing
The optimal control problems described in Section 2 can be divided equally among the computing tasks.
One task first reads the required parameters from an input file, and MPI_BCAST is then used to
broadcast them to all other tasks.

With this method the parallel program is very coarse-grained, since the only communication occurs at
the start of the program when the parameters, including the total data size, are broadcast to the tasks.
Once the tasks know their share of the workload, no more communication is required, and
each task writes the results of its computation to specified files. In fact, even the broadcast
could be omitted if each task read the parameters from the input file itself; however, I/O
operations generally cost more computational resources than intertask communication.
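The broadcast step can be sketched as follows. This fragment is an illustration only: the number and
layout of the parameters are hypothetical, and my_id and root_id denote the task's own rank and the
root task's rank as in the appendix code.

! The root task reads the run parameters from the input file and
! broadcasts them to all tasks in MPI_COMM_WORLD.
INTEGER, PARAMETER :: num_params = 8          ! hypothetical parameter count
INTEGER, DIMENSION(num_params) :: params
INTEGER :: rc

IF (my_id == root_id) THEN
  OPEN (UNIT=10, FILE='input.dat', STATUS='OLD')
  READ (10, *) params
  CLOSE (10)
END IF
! Every task, including the root, makes the same MPI_BCAST call.
CALL MPI_BCAST(params, num_params, MPI_INTEGER, root_id, MPI_COMM_WORLD, rc)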
5.2.2 Dynamic Load Balancing
With the dynamic load balancing method, the execution of the program branches into two parts
depending on whether the task is the master or a slave.
• Slave task: The slaves receive a work piece from the master one at a time. After this they do
the desired computation and notify the master upon completion. This process is repeated
until there are no more work pieces left and a stop code is received from the master.
• Master task: First, the master sends out one work piece to each slave task (assuming the number
of work pieces is greater than the number of slaves), after which it goes into a receiving mode.
Every time it receives a notification that a slave has finished a work piece, it sends out the
next work piece that has not yet been sent. After all work pieces have been sent, the master
sends stop codes to the slaves.
Sending and receiving messages between the tasks are done with the subroutines MPI_SEND and
MPI_RECV, respectively.
5.3 Results
5.3.1 Fixed Problem Analysis
The execution times using the static and dynamic load balancing methods were examined using
variable processor counts with a fixed problem size. The results can be seen in Table 1 and the
respective graphs in Figure 4. Equations (4.3) and (4.4) are used to calculate the observed speedup
and efficiency, seen in Figures 5 and 6.
Number of processors    Wall-clock time / s
                        Dynamic      Static
1                       11203,9
2                       11004,8      5942,5
4                        4062,5      3254,3
8                        1724,8      1609,5
16                        839,2       890,2

Table 1: Wall-clock times with variable processor counts
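As a worked example of Equations (4.3) and (4.4), take the 16-processor times from Table 1 and use
the single-processor time 11203,9 s as W_1 for both methods (an assumption, since Table 1 reports
only one value for one processor). For dynamic load balancing, S_16 = 11203,9 / 839,2 ≈ 13,4 and
e ≈ 13,4 / 16 ≈ 0,83; for static load balancing, S_16 = 11203,9 / 890,2 ≈ 12,6 and e ≈ 0,79. These
values are consistent with the trends shown in Figures 5 and 6.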
Figure 4: Wall-clock times
Figure 5: Observed speedup
As expected, when the number of processors is small, the wall-clock times are lower and the
observed speedup is greater using static load balancing. However, the gap narrows quickly, and at a
processor count of 16 the dynamic load balancing method yields better results.
Figure 6: Efficiency
The efficiency varies dramatically between the two methods. The efficiency of dynamic load
balancing method is poor with small processor counts, but rises as the number of processors
increases. On the other hand, the efficiency with the static load balancing method decreases as the
number of processors goes up. Perhaps surprisingly the efficiency of the static method with 8
processors is relatively high. The most probable reason for this is the fact that the load balancing
with this number of processors happened to be efficient by coincidence. Another thing to note is that
the efficiency is always below 1, as is virtually always the case in practice.
5.3.2 Analysis with a Variable Problem Size
Scalability was also examined using a variable problem size analysis. Here, it was possible to
change the problem size by determining the number of initial states to be computed. In this analysis
dynamic load balancing was used and the program was executed using 32 processors. The results
can be seen in Figure 7. Note that the wall-clock times obtained in this analysis are not comparable
with those of the fixed problem analysis due to differences in compiler options.
From the results it can be seen that as the problem size is doubled, the execution time also tends to
roughly double, as expected: the times in Figure 7 grow by factors of about 2,2 and 1,9 when the
problem size is increased from 32 to 64 and from 64 to 128 initial states.
6 Summary and Conclusions
In this paper parallel programming has been applied to the numerical solution of independent
optimal control problems. The computational problem can be classified as embarrassingly
parallel, and the scalability tests showed that the computation time can be decreased efficiently
by using multiple processors at once. However, even though the wall-clock times may be lower, it
should be noted that the total CPU time, i.e. the sum of the wall-clock times over all processors,
tends to be higher when multiple processors are used, since the efficiency is below 1. This is the
result of parallel overhead and the impossibility of ideal load balancing.
In this paper domain decomposition was used in the partitioning of the computational problem,
because an existing algorithm was available that solves one optimal control problem at a time.
However, functional decomposition could also have been used by
decomposing the solution of each optimal control problem and distributing the computational work
to different processors. This kind of approach to a similar computational problem has been studied
in Ref. [10], where the trajectory of an aircraft has been broken into phases and simulated in
parallel.
Here, the parallel program was developed to compute the optimal control problems described in
Section 2, but with only minor changes the same program code can be used in other cases where
independent computations must be done for a set of data pieces. However, if there is data
dependence between the tasks, a different and more complex parallel program is needed.
Figure 7: Execution times with variable problem sizes (917,8 s, 1975,5 s and 3777,1 s for problem sizes of 32, 64 and 128, respectively)
Further development of the parallel program would include modelling the MPI communication
costs. This would require measuring the environment-dependent parameters in Equation (7.1) and
Table 2. If the communication costs were known, the granularity of the program could be adjusted
accurately to achieve a better trade-off between efficient load balancing and parallel overhead.
7 Appendix
7.1 MPI Communication Costs
The communication cost T of the point-to-point operations MPI_SEND and MPI_RECV can be
modelled using the equation
T = t_s + t_w N,  (7.1)

where t_s is the start-up time, t_w is the bandwidth-dependent transfer time per word and N is the
message length in words.
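For illustration, consider the message sizes used by the dynamic load balancing code in Section
7.2.2: scheduling one work piece involves a four-integer message from the master and a one-integer
result message back, so by Equation (7.1) the communication cost per work piece is roughly
(t_s + 4 t_w) + (t_s + t_w) = 2 t_s + 5 t_w. This remains negligible as long as the computation time
of a single optimal control problem is much larger than the start-up time t_s.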
The communication costs of global operations using hypercube communication algorithms on the
idealized multicomputer architecture are listed in Table 2, where P is the number of processors and
t_op is the cost of a single reduction operation. Since global operations involve internal
communication, they cannot be modelled with a simple linear model like point-to-point operations,
and for large messages bandwidth costs can increase the cost of communication. See Ref. [11] for
more details.
Operation        Cost (small N)                 Cost (large N)
MPI_BARRIER      t_s log P                      t_s log P
MPI_BCAST        log P (t_s + t_w N)            2 (t_s log P + t_w N)
MPI_SCATTER      t_s log P + t_w N              t_s log P + t_w N
MPI_GATHER       t_s log P + t_w N              t_s log P + t_w N
MPI_REDUCE       log P (t_s + (t_w + t_op) N)   2 t_s log P + (t_w + 2 t_op) N
MPI_ALLREDUCE    log P (t_s + (t_w + t_op) N)   2 t_s log P + (t_w + 2 t_op) N

Table 2: MPI communication costs
The costs of MPI_INIT and MPI_FINALIZE depend on their implementation and can be high.
However, this is not significant, since they are executed only once during the program.
7.2 Code Excerpts
7.2.1 Static Load Balancing

SUBROUTINE balance_even()
  INTEGER :: &
    my_size, &
    my_start, &
    my_end

  ! Divide the work pieces as evenly as possible among the ntask tasks.
  data_size = num_states - (first_state - 1)
  IF (data_size < ntask) THEN
    WRITE (*,*) 'Error: Data size < number of tasks!'
    RETURN
  END IF
  ! Tasks with my_id < MOD(data_size, ntask) get one extra work piece.
  IF (my_id < MOD(data_size, ntask)) THEN
    my_size  = data_size / ntask + 1
    my_start = first_state + my_id * my_size
  ELSE
    my_size  = data_size / ntask
    my_start = first_state + data_size - (ntask - my_id) * my_size
  END IF
  my_end = my_start + my_size - 1

  WRITE (*,*) '[', my_id, '] data_size=', data_size, ' ntask=', ntask, &
    ' my_size=', my_size, ' my_start=', my_start, ' my_end=', my_end

  ! Each task loops only over its own range of initial states.
  loop_state: DO ctr0 = my_start, my_end
    loop_player: DO ctr1 = first_player, num_players
      loop_support1: DO ctr2 = first_support1, num_support1
        loop_support2: DO ctr3 = first_support2, num_support2
          CALL doWork(ctr0, ctr1, ctr2, ctr3)
        END DO loop_support2
      END DO loop_support1
    END DO loop_player
  END DO loop_state
END SUBROUTINE balance_even
7.2.2 Dynamic Load Balancing

SUBROUTINE balance_dynamic()
  INTEGER :: &
    sendcount, &
    recvcount, &
    source_id, &
    dest_id, &
    num_slaves, &
    res, &
    workpiece_number
  INTEGER, PARAMETER :: &
    stop_task_code = -1, &
    message_tag = 50
  INTEGER, DIMENSION(4) :: workpiece
  INTEGER, DIMENSION(:), ALLOCATABLE :: status
  INTEGER, DIMENSION(:,:,:,:), ALLOCATABLE :: work_done

  ALLOCATE(status(MPI_STATUS_SIZE))
  sendcount = 0
  recvcount = 0

  IF (my_id /= root_id) THEN
    ! Slave task: receive work pieces until the stop code arrives.
    DO
      CALL MPI_RECV(workpiece, 4, MPI_INTEGER, root_id, message_tag, &
                    MPI_COMM_WORLD, status, rc)
      recvcount = recvcount + 1
      IF (workpiece(1) == stop_task_code) THEN
        WRITE (*,*) my_id, ' - stop code received'
        EXIT
      END IF
      CALL doWork(workpiece(1), workpiece(2), workpiece(3), workpiece(4))
      ! Notify the master that this work piece is done.
      CALL MPI_SEND(res, 1, MPI_INTEGER, root_id, message_tag, &
                    MPI_COMM_WORLD, rc)
      sendcount = sendcount + 1
    END DO
  ELSE
    ! Master task: distribute the work pieces one at a time to the slaves.
    num_slaves = ntask - 1
    data_size = num_states * num_support1 * num_support2 * num_players
    IF (data_size < num_slaves) THEN
      WRITE (*,*) 'Error: Data_size < num_slaves'
      GOTO 50
    END IF
    WRITE (*,*) my_id, ' - master loop starting'
    ALLOCATE(work_done(num_states, num_players, num_support1, &
                       num_support2))
    work_done(:,:,:,:) = 0
    workpiece_number = 0
    loop_support2: DO ctr3 = first_support2, num_support2
      loop_support1: DO ctr2 = first_support1, num_support1
        loop_player: DO ctr1 = first_player, num_players
          loop_state: DO ctr0 = first_state, num_states
            workpiece_number = workpiece_number + 1
            IF (workpiece_number > num_slaves) THEN
              ! All slaves are busy: wait for any slave to finish and
              ! send the next work piece to that slave.
              CALL MPI_RECV(res, 1, MPI_INTEGER, MPI_ANY_SOURCE, &
                            message_tag, MPI_COMM_WORLD, status, rc)
              recvcount = recvcount + 1
              source_id = status(MPI_SOURCE)
              dest_id = source_id
            ELSE
              ! Initially, send one work piece to each slave in turn.
              dest_id = workpiece_number
            END IF
            workpiece = (/ ctr0, ctr1, ctr2, ctr3 /)
            CALL MPI_SEND(workpiece, 4, MPI_INTEGER, dest_id, &
                          message_tag, MPI_COMM_WORLD, rc)
            sendcount = sendcount + 1
            work_done(ctr0, ctr1, ctr2, ctr3) = dest_id
          END DO loop_state
        END DO loop_player
      END DO loop_support1
    END DO loop_support2

50  CONTINUE
    ! Send the stop code to every slave; the first element of workpiece
    ! carries the code.
    DO i = 1, num_slaves
      workpiece(1) = stop_task_code
      CALL MPI_SEND(workpiece, 4, MPI_INTEGER, i, message_tag, &
                    MPI_COMM_WORLD, rc)
      sendcount = sendcount + 1
    END DO
  END IF

  DEALLOCATE(status)
  IF (my_id == root_id) THEN
    WRITE (*,*) work_done(:,:,:,:)
    DEALLOCATE(work_done)
  END IF
END SUBROUTINE balance_dynamic
8 References
[1] Shinar, J., Siegel, A. W., Gold, Y. I., On the Analysis of a Complex Differential Game Using Artificial Intelligence Techniques, Proceedings of the 27th IEEE Conference on Decision and Control, IEEE, 1988.
[2] Le Ménec, S., Bernhard, P., Decision Support System for Medium Range Aerial Duels Combining Elements of Pursuit-Evasion Game Solutions with AI Techniques, Annals of the International Society of Dynamic Games, 1995.
[3] Karelahti, J., Virtanen, K., Raivio, T., Game Optimal Support Time of a Medium Range Air-to-Air Missile, Systems Analysis Laboratory, Helsinki University of Technology, 2005.
[4] University of California, SETI@home, 2005, http://setiathome.ssl.berkeley.edu/
[5] Flynn, M. J., Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, Vol. C-21, No. 9, 1972, pp. 948-960.
[6] Duncan, R., A Survey of Parallel Computer Architectures, IEEE Computer, 1990, pp. 5-16.
[7] Message-Passing Interface Forum, MPI: A Message-Passing Interface Standard, 1995, http://www.mpi-forum.org/docs/docs.html
[8] Uusvuori, R., Kupila-Rantala, T., IBMSC User's Guide, CSC - Scientific Computing, 2003.
[9] Kupila-Rantala, T., CSC User's Guide, 2000, http://www.csc.fi/oppaat/cscuser/cscuser.pdf
[10] Betts, J. T., Huffman, W. P., Trajectory Optimization on a Parallel Processor, J. Guidance, 1991.
[11] Foster, I., Designing and Building Parallel Programs, 1995, http://www-unix.mcs.anl.gov/dbpp/