Reconfiguration of MPI Processes at Runtime in a Numerical PDE Solver


Reconfiguration of MPI Processes at Runtime in a Numerical PDE Solver

IMRAN AHMED

Master's Thesis in Numerical Analysis (30 ECTS credits)

    at the Scientific Computing International Master Program

    Royal Institute of Technology year 2010

Supervisor was Jarmo Rantakokko, Uppsala University

Examiner was Michael Hanke

    TRITA-CSC-E 2010:147

    ISRN-KTH/CSC/E--10/147--SE

    ISSN-1653-5715

    Royal Institute of Technology

    School of Computer Science and Communication

    KTH CSC

    SE-100 44 Stockholm, Sweden

    URL: www.kth.se/csc


    Abstract

The simulation of complex phenomena, described by partial differential equations, requires adaptive numerical methods and parallel computers. In adaptive methods the computational grid is automatically refined or coarsened to meet accuracy requirements in the solution. This leads to a dynamic change of workload. In a parallel computing context, the data must therefore be redistributed between the processors at run time.

Process Creation and Management is a feature of the MPI-2 standard that makes it possible to create and terminate processes after an MPI application has started. In this thesis we design and implement a PDE solver based on this advanced feature of MPI, where processes are created dynamically and data is redistributed among the spawned processes at run time. Further, we analyze the performance of the PDE solver on different computer systems.


    Referat

Adaptation of MPI processes at run time in a numerical PDE solver

Simulation of complex phenomena, described by partial differential equations, requires adaptive numerical methods and parallel computers. In adaptive methods the computational grid is refined automatically to reach prescribed tolerances, which leads to a dynamic workload. In parallel computations the data must therefore be repartitioned at run time.

"Process Creation and Management", available in the MPI-2 standard, facilitates the creation and termination of processes after an MPI program has started. In this thesis a PDE solver is developed, based on this advanced MPI feature, with dynamic load balancing and dynamic control over the processes at run time. We analyze the performance of this solver on different systems.


    Contents

1 Introduction

2 Preliminaries
  2.1 Parallel Computing
    2.1.1 Speed up
    2.1.2 Efficiency
  2.2 Message Passing Interface
    2.2.1 Process Creation

3 PDE Solver
  3.1 Parallel Implementation
  3.2 Parent Program
  3.3 Child Program
  3.4 Child Processes Communication
  3.5 Parent-Child Communication

4 Practical Results
  4.1 Static Process
  4.2 Dynamic Process
  4.3 Fixed Processes Vs Varying Processes

5 Conclusion

Bibliography


    Chapter 1

    Introduction

In science and engineering, physical systems with more than one independent variable are modeled using partial differential equations (PDEs), for example heat conduction, electrostatics, electrodynamics and fluid flow. Simple equations can be solved analytically, that is, we can obtain an expression that describes how the variables evolve and the characteristics of the equation. In most cases, however, this is not possible: the analytical solution may not exist or may be difficult to obtain. Such equations are therefore solved numerically, and a consequence is that the solution is an approximation of the exact solution. The most widely used numerical methods are the finite difference method (FDM) and the finite element method (FEM). In this thesis we consider the FDM.

In the FDM, the solution of the differential equation is approximated by replacing the derivative expressions with finite difference approximations. The discretization of the domain plays a significant role in the accuracy of the FDM. A high grid resolution is valuable for the convergence of the numerical solution, but it requires more computational time and computer resources. Refinement of the grid is only needed in areas where the accuracy is too low and the solution does not converge for a particular step size. Consequently there is a need for adaptive mesh refinement (AMR), which refines the mesh from a coarse to a fine grid where required.

Parallelization of such an algorithm is not simple. Because of the dynamic nature of adaptive numerical methods, their parallel implementation is harder than that of static grid computations. In the static case the grid is evenly distributed among the processes. In the adaptive case the size of the grid changes frequently, which leads to a load balancing problem and requires the grid to be redistributed.

In an MPI application where processes are created statically, the number of processes is fixed throughout the computation. If the mesh is refined from a coarse to a fine grid during the computation, the workload per process increases and consequently the total computational time rises. This time could be reduced by creating more processes at run time to share the workload, which in turn requires a redistribution of data. We therefore need dynamic creation of processes at run time. MPI-2 provides an advanced feature where processes can be created and terminated after an MPI application has been started. We use this feature to develop a PDE solver where processes are created and the data is redistributed among them as the mesh size increases. Chapter 3 explains the implementation of such a PDE solver, including the functionality of, and the connection between, the parent and child programs. Chapter 4 describes the performance of the PDE solver on two different computer systems and how computer architecture and operating system influence it.


Further, we emphasize that it is beneficial to use a varying number of processes rather than a fixed number of processes as the mesh size increases during run time. Here, a varying number of processes means changing the number of processes as the mesh grows, while fixed means that the number of processes does not change throughout the computation. We also analyze the proposed idea on two computer architectures that differ in the number of processors per node.


    Chapter 2

    Preliminaries

    2.1 Parallel Computing

Numerical simulation of scientific and engineering problems requires immense computational speed. One way to increase the computational speed is to use machines with multiple processors instead of a single processor. The sequential program then has to be modified for multiple processors, and writing a program for this form of computation is known as parallel programming. The basic idea is to split the computational workload among the processors in such a way that it provides a significant increase in performance. The following factors are considered when evaluating the performance of a parallel program [1].

    2.1.1 Speed up

Speedup is defined by the following formula:

\[ S_p = \frac{T_s}{T_p} \]

where p is the number of processors, T_s is the execution time of the sequential algorithm, and T_p is the execution time of the parallel algorithm with p processors. When S_p = p we obtain linear speedup, which indicates good scalability of the parallel program.

    2.1.2 Efficiency

It is useful to know for how long the processors are actually used in the computation, which can be found from the efficiency. Efficiency is defined as

\[ E_p = \frac{S_p}{p} \]

Typically its value lies between zero and one. The efficiency of the algorithm depends heavily on the architecture of the computer system, i.e. the number of cores per node. More cores per node require less of the interconnection network, and consequently less communication time, so the system efficiency increases.
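For example, with the Isis measurements reported later in Table 4.1 (a sequential run time of 80.86 sec and 11.95 sec on 8 processes), these definitions give

\[ S_8 = \frac{80.86}{11.95} \approx 6.8, \qquad E_8 = \frac{S_8}{8} \approx 0.85 . \]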

    2.2 Message Passing Interface

Message passing is a common technique for parallel processing on distributed memory multiprocessors. Processes execute tasks on individual processors and communicate with each other by sending messages. In this way, processes can operate in a semi-autonomous manner, performing distinct computations that form part of a larger job, sharing data and synchronizing with each other when required.

Message passing is a distributed memory paradigm in which each process executes in a different memory space from the other processes. This scheme works equally well when the processors are part of the same computer or spread across a range of heterogeneous machines spanning a network. Message-passing systems can be highly scalable, but to make the best use of message passing, applications must be designed to exploit the available parallelism.

Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another. MPI primarily addresses the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process. MPI is not a language; all MPI operations are expressed as functions, subroutines, or methods, according to the appropriate language bindings, which for C, C++, Fortran-77 and Fortran-95 are part of the MPI standard [3].

The intent of MPI is to provide a safe communication environment. In point-to-point communication one process sends a message and a second process receives it; in contrast, collective communication takes place within a group of processes. Communicators are used in MPI message-passing communication. A communicator is a communication domain that defines a set of processes that are allowed to communicate with one another. Each process has a unique rank within the communicator, an integer from 0 to p-1, where p is the number of processes. Two types of communicators are available: an intracommunicator for communication within a group, and an intercommunicator for communication between groups; both support point-to-point and collective communication. A process can be part of several communicators and is therefore identified by the pair of a group and its rank in that group. Besides communication, MPI also provides routines for process topology creation, a convenient naming mechanism for the processes of an intracommunicator. The two main types of topologies supported by MPI are Cartesian (grid) and graph. There need be no relation between the physical structure of the parallel machine and the process topology. To create a simple virtual topology, the commonly used MPI routines are MPI_Cart_create and MPI_Cart_coords. Figure 2.1 shows the mapping of processes onto a two-dimensional Cartesian topology, where the integers 0, 1, ..., 15 represent the process ranks and (0,0), ..., (3,3) the coordinates in the topology [1, 2].
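As an illustration of these routines, the following C sketch (ours, not code from the thesis) builds the two-dimensional Cartesian topology of Figure 2.1 when run on 16 processes and lets each process query its own coordinates.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI factor the processes into a 2D grid, e.g. 16 -> 4 x 4. */
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    /* Create the Cartesian communicator (no wrap-around, reordering allowed). */
    int periods[2] = {0, 0};
    MPI_Comm cart_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

    /* Each process looks up its (row, column) coordinates in the grid. */
    int cart_rank, coords[2];
    MPI_Comm_rank(cart_comm, &cart_rank);
    MPI_Cart_coords(cart_comm, cart_rank, 2, coords);
    printf("rank %d -> coords (%d,%d)\n", cart_rank, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}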

    2.2.1 Process Creation

MPI is primarily concerned with communication rather than process or resource management. However, it is necessary to address these issues to some degree in order to define a useful framework for communication. The MPI-2 standard provides routines for process creation. There are two methods by which MPI processes can be created: statically and dynamically.

    Static Process Creation

All the processes are specified before execution, and the system executes a fixed number of processes. The programmer usually specifies the number of processes explicitly prior to execution, via command-line options.


    Figure 2.1: Cartesian Topology

Dynamic Process Creation

In dynamic process creation, processes are created during the execution of other processes. The MPI-2 standard provides portable routines for dynamic creation; portable means that they can run under a variety of job-scheduling and process-management environments. One of the ways in which MPI achieves scalability and performance is through the use of collective operations that concisely describe an operation involving a large number of processes. Instead of creating new processes through individual requests to a job manager and process creator, MPI allows the programmer to make a single request for a large number of processes that will belong to a single group.

    MPI Spawn Processes

In MPI, new processes are spawned when an existing MPI process executes the MPI_COMM_SPAWN routine [4, 3]. As a result of this call another COMM_WORLD is established, consisting of the spawned processes. Here we describe the essential parameters required for a successful execution of MPI_COMM_SPAWN.

MPI_COMM_SPAWN(command, argv, maxprocs, info, root, comm, intercomm, array_of_errcodes)

Input parameters:

command: name of the program to be spawned
maxprocs: maximum number of processes to start
root: rank of the parent process
comm: intracommunicator containing the group of spawning processes

Output parameter:

intercomm: intercommunicator between the original group and the newly spawned group

MPI_COMM_SPAWN starts maxprocs identical copies of the MPI program specified by command, establishing communication with them and returning an intercommunicator. The spawned processes are referred to as children. The children have their own MPI_COMM_WORLD, which is separate from that of the parents. Figure 2.2 illustrates this, where P0 executes the call to MPI_COMM_SPAWN, resulting in the creation of the children's COMM_WORLD.

The parent process creates new child processes by calling MPI_COMM_SPAWN.


    Figure 2.2: Spawning in MPI

The child processes call MPI_INIT() and create their own MPI_COMM_WORLD.

An intercommunicator is formed between the parent and children COMM_WORLDs.

The children call MPI_COMM_GET_PARENT to obtain the intercommunicator for communication with the parent.

Figure 2.3 shows the sequence of MPI routines carried out by the parent and child processes in order to establish communication.

In the parent code, the variable num_child means that we create 4 child processes, child_comm is the intercommunicator and child is the name of the child program; these parameters are passed to the MPI_COMM_SPAWN routine. The MPI_RECV call receives data from the child process with rank 0, using child_comm as the communicator. In the child program, all child processes call MPI_COMM_GET_PARENT, which returns the communicator for the parent process as parent. After this the parallel code of the child processes is added. In the MPI_SEND call the child processes use parent as the communicator.
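As a concrete version of the pattern that Figure 2.3 illustrates, the following minimal C sketch uses the variable names from the text (num_child, child_comm, parent); the message contents, counts and error handling are our own assumptions.

/* parent.c -- spawns child processes and receives one value from child rank 0 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int num_child = 4;               /* number of child processes to create */
    MPI_Comm child_comm;             /* intercommunicator to the children   */
    MPI_Comm_spawn("child", MPI_ARGV_NULL, num_child, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);

    double result;
    MPI_Recv(&result, 1, MPI_DOUBLE, 0, 0, child_comm, MPI_STATUS_IGNORE);
    printf("parent received %f from child 0\n", result);

    MPI_Finalize();
    return 0;
}

/* child.c -- obtains the parent intercommunicator and sends one value back */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);            /* intercommunicator to the parent */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* children's own COMM_WORLD */

    /* ... parallel computation of the children goes here ... */

    if (rank == 0) {
        double result = 42.0;                /* placeholder value */
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, parent);
    }

    MPI_Finalize();
    return 0;
}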


    Figure 2.3: Parent and Child Programs


    Chapter 3

    PDE Solver

    3.1 Parallel Implementation

We implement two programs: a parent program, which instantiates one MPI process (the parent process) to manage the creation of the other processes, and a child program, which solves the PDE on the processes spawned by the parent process. The PDE we choose is the 2D advection equation (3.1),

\[ u_t - u_x - u_y = F. \qquad (3.1) \]

The leapfrog scheme is used to solve equation (3.1) numerically. Therefore equation (3.1) becomes

\[ \frac{u^{k+1}_{i,j} - u^{k-1}_{i,j}}{2h_t} = \frac{u^{k}_{i+1,j} - u^{k}_{i-1,j}}{2h_x} + \frac{u^{k}_{i,j+1} - u^{k}_{i,j-1}}{2h_y} + F, \qquad (3.2) \]

where k = 2, 3, ..., N, i = 1, 2, ..., Nx and j = 1, 2, ..., Ny.
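To make the update rule (3.2) concrete, the following C sketch performs one leapfrog time step; it is our own illustration, and the row-major array layout, the helper macro U and the handling of boundary/ghost values are assumptions rather than the thesis code.

/* One leapfrog step for u_t - u_x - u_y = F on an (Nx+2) x (Ny+2) grid
 * with one layer of boundary/ghost points; uold, u, unew hold the time
 * levels k-1, k, k+1. Indexing: U(a, i, j) = a[i*(Ny+2) + j]. */
#define U(a, i, j) ((a)[(i) * (Ny + 2) + (j)])

void leapfrog_step(double *unew, const double *u, const double *uold,
                   int Nx, int Ny, double ht, double hx, double hy, double F)
{
    for (int i = 1; i <= Nx; i++) {
        for (int j = 1; j <= Ny; j++) {
            double ux = (U(u, i + 1, j) - U(u, i - 1, j)) / (2.0 * hx);
            double uy = (U(u, i, j + 1) - U(u, i, j - 1)) / (2.0 * hy);
            U(unew, i, j) = U(uold, i, j) + 2.0 * ht * (ux + uy + F);
        }
    }
}

In the parallel child program the same loop runs over the local block, with the ghost points supplying the i-1, i+1, j-1 and j+1 values at the block edges.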

    3.2 Parent Program

The intent of this PDE solver is to share the workload over new processes if the mesh size increases during the computation, which leads to a redistribution of data. Since the parent process creates the new processes, it is convenient to manage the redistribution of data at the parent process, but it first has to collect the previously computed data from the child processes. Besides the collection and redistribution of data, the parent process has to perform the interpolation between the previous mesh and the new, finer mesh.

Pseudo code for the parent program:

mesh_size [513x513, 1025x1025]     /* mesh sizes */

child_processes [2, 4]             /* number of child processes for each mesh size */

for(i, mesh_size)                  /* for each mesh size */


    create_child_processes(spawn(child_processes(i)))

    if (continuation of computation)
        Scatter inner points to the child processes

    Gather inner points from the child processes

    Perform linear interpolation (mesh_size(i), mesh_size(i+1))

end

The parent program executes in the following way. For each mesh size it spawns a group of child processes, and the number of child processes may vary. There is no need to scatter data the first time, since it is the beginning of the numerical computation. The parent process waits at the Gather operation in order to receive data from the child processes. After collecting the data, the parent process performs the linear interpolation on the received data for the next mesh size. In subsequent iterations of the for loop the interpolated data is scattered over the new group of spawned processes. Figure 3.1 illustrates this execution.

Figure 3.1: Flow Diagram
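The interpolation step above is not spelled out in the thesis; a plausible C sketch, assuming the finer mesh roughly doubles the resolution (as in the mesh sequence 513, 1025, 2049 used in Chapter 4) and using injection plus midpoint averaging (linear interpolation), is:

/* Interpolate a coarse (nc x nc) grid onto a fine (nf x nf) grid with
 * nf = 2*(nc-1)+1, e.g. 513 -> 1025: copy coincident points (injection)
 * and average neighbours for the new midpoints. */
void refine_grid(const double *coarse, int nc, double *fine, int nf)
{
    /* coincident points */
    for (int i = 0; i < nc; i++)
        for (int j = 0; j < nc; j++)
            fine[2*i*nf + 2*j] = coarse[i*nc + j];

    /* new points on coarse rows: average the left/right neighbours */
    for (int i = 0; i < nf; i += 2)
        for (int j = 1; j < nf; j += 2)
            fine[i*nf + j] = 0.5 * (fine[i*nf + j - 1] + fine[i*nf + j + 1]);

    /* remaining rows: average the rows above and below */
    for (int i = 1; i < nf; i += 2)
        for (int j = 0; j < nf; j++)
            fine[i*nf + j] = 0.5 * (fine[(i - 1)*nf + j] + fine[(i + 1)*nf + j]);
}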


    Figure 3.2: Mesh Distribution

    3.3 Child Program

The child program is a conventional MPI program, which solves the PDE on parallel processes by dividing the mesh grid among them. The division of the mesh depends on the topology of the processes, so it is convenient to know the topology information before dividing the mesh over the processes. A two-dimensional Cartesian grid of processes, for an arbitrary number of processes, is created using the MPI_DIMS_CREATE routine; once the topology information is known, partitioning the mesh is straightforward. Let P x Q be the two-dimensional process topology and Nx x Ny the mesh size. The local mesh size nnx x nny on each process is then

\[ nnx = \frac{N_x}{P} + 2, \qquad nny = \frac{N_y}{Q} + 2, \]

where the addition of 2 is due to the ghost points in each dimension (Figure 3.2).

Ghost points appear when a numerical method is parallelized, because the difference operator reaches outside the local region and its computation depends on points lying on an adjacent process. Figure 3.3 shows the decomposition of the grid over two processes P0 and P1: the computation of u_{i,j} at process P0 depends on the points i-1, i+1, j-1 and j+1, and all of them except i+1 lie on the same process P0. The point i+1 lies on the adjacent process P1. Therefore an extra column of points, called ghost points, is introduced at each process to hold the values of i+1 from the adjacent process, resulting in communication among the processes.

Figure 3.3: Difference operator reaches out of the local region

Pseudo code for the child program:

calculate local mesh size nx, ny

instantiate local arrays u(nx+2, ny+2)

if (continuation of computation)
    Receive inner points from the Parent process as the counterpart of its Scatter operation

if (start of computation)
    initialize the mesh

for (k = 2, N)     /* for each time step */

    Evaluate the difference operator away from the boundaries

    Transfer ghost points

    Evaluate the difference operator near the boundaries

end

Send inner points to the Parent process as the counterpart of its Gather operation.
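As a small illustration of the first two set-up steps in this pseudocode, the following C sketch (ours; it assumes Nx divides evenly by P and Ny by Q) creates the process grid with MPI_Dims_create and computes the local array extents including the two ghost layers.

/* Determine a P x Q process grid and the local mesh size on each process. */
#include <mpi.h>

void local_mesh_size(int Nx, int Ny, int *nnx, int *nny, int dims[2])
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    dims[0] = dims[1] = 0;
    MPI_Dims_create(nprocs, 2, dims);   /* factor nprocs into P x Q */

    *nnx = Nx / dims[0] + 2;            /* +2 for the ghost points in x */
    *nny = Ny / dims[1] + 2;            /* +2 for the ghost points in y */
}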

3.4 Child Processes Communication

The transfer of ghost points leads to communication among the child processes. Processes lying at the edge of the process topology require less communication than processes lying in the middle. In Figure 3.4 the arrows show such communication patterns.

Figure 3.4: Child Processes Communication Pattern

Pseudo code for communication:

dims_x    x dimension of the 2-dimensional process topology
dims_y    y dimension of the 2-dimensional process topology
cordx     x coordinate of the process in the topology
cordy     y coordinate of the process in the topology
p_left    process to the left
p_right   process to the right
p_down    process in the down direction
p_up      process in the up direction

if (cordx > 0)     /* Initiate send to left neighbor */
    Send(p_left)
if (cordx > 0)     /* Initiate receive from left */
    Recv(p_left)
if (cordy > 0)     /* Initiate send to up neighbor */
    Send(p_up)
if (cordy > 0)     /* Initiate receive from up */
    Recv(p_up)
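In practice this neighbor bookkeeping is often delegated to MPI itself. The following C sketch (ours, not the thesis code) exchanges one column of ghost points with the left and right neighbors using MPI_Cart_shift and MPI_Sendrecv; MPI_PROC_NULL at the edges makes explicit coordinate tests unnecessary. The up/down exchange is analogous, with contiguous rows instead of a strided column type.

/* Exchange ghost columns with the left and right neighbors in a 2D
 * Cartesian communicator; u is the local (nnx x nny) block stored
 * row-major, column 0 and column nny-1 being the ghost columns. */
#include <mpi.h>

void exchange_ghost_columns(double *u, int nnx, int nny, MPI_Comm cart_comm)
{
    int p_left, p_right;
    MPI_Cart_shift(cart_comm, 0, 1, &p_left, &p_right);

    MPI_Datatype column;                  /* one value per row, stride nny */
    MPI_Type_vector(nnx, 1, nny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* send first inner column left, receive the right ghost column */
    MPI_Sendrecv(&u[1], 1, column, p_left, 0,
                 &u[nny - 1], 1, column, p_right, 0,
                 cart_comm, MPI_STATUS_IGNORE);

    /* send last inner column right, receive the left ghost column */
    MPI_Sendrecv(&u[nny - 2], 1, column, p_right, 1,
                 &u[0], 1, column, p_left, 1,
                 cart_comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
}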

    3.5 Parent-Child Communication

The parent process has to collect data from, and redistribute data to, the child processes. The child processes have their own COMM_WORLD, which is not part of the parent's COMM_WORLD. If, for communication between the two COMM_WORLDs, we use the communicator established between the parent and the child processes at the time of spawning, it allows us to perform point-to-point communication. In that case the transfer of data would occur in the following manner.


The root process in the child COMM_WORLD collects the data from the other child processes using MPI_Gather as a collective operation.

Then the root process sends the collected data to the parent process as a point-to-point communication operation.

But if we can manage to perform a collective communication between the two separate COMM_WORLDs, meaning that all child processes send their data directly to the parent process in one collective operation, then we get rid of the point-to-point communication between the parent and the root child process. The established communicator does not support the MPI_Gather and MPI_Scatter operations, because rank 0 exists in both COMM_WORLDs, and for a collective operation all participating processes must have unique ranks. Therefore we use MPI_Intercomm_merge, which merges the two groups of the intercommunicator and returns an intracommunicator in which all processes have unique ranks. An integer argument is used to order the groups when the new communicator is created; the group that passes 0 has its processes ordered first in the new communicator. Figure 3.5 illustrates both scenarios with code, where comm1 and comm2 are two separate communicators merged together, resulting in R_comm as the new communicator. In the first scenario (3.5a) comm1 sets its integer value to 0 and comm2 sets it to 1; the second scenario shows comm1 set to 1 and comm2 set to 0.
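As a concrete version of this merge-then-gather idea, the following minimal C sketch is our own; the function names, buffer sizes and the dummy block for the parent's own contribution are assumptions, not the code of Figure 3.5.

/* parent side: merge the parent-child intercommunicator so that a single
 * collective can move data between the two COMM_WORLDs. */
#include <mpi.h>
#include <stdlib.h>

void gather_from_children(MPI_Comm child_comm, int n_per_proc, int num_child)
{
    MPI_Comm merged;
    /* high = 0: the parent group is ordered first, so the parent gets rank 0 */
    MPI_Intercomm_merge(child_comm, 0, &merged);

    /* one block per process in the merged communicator (parent + children) */
    double *recvbuf = malloc((size_t)(num_child + 1) * n_per_proc * sizeof(double));
    double *dummy   = calloc((size_t)n_per_proc, sizeof(double)); /* parent's block */

    MPI_Gather(dummy, n_per_proc, MPI_DOUBLE,
               recvbuf, n_per_proc, MPI_DOUBLE, 0, merged);

    /* recvbuf[n_per_proc ...] now holds the children's data, ordered by rank */
    /* ... interpolate and later redistribute the data ... */

    free(dummy);
    free(recvbuf);
    MPI_Comm_free(&merged);
}

/* child side: the matching call made by every child process */
void send_to_parent(double *local_data, int n_per_proc)
{
    MPI_Comm parent, merged;
    MPI_Comm_get_parent(&parent);
    MPI_Intercomm_merge(parent, 1, &merged);  /* high = 1: ordered after the parent */

    MPI_Gather(local_data, n_per_proc, MPI_DOUBLE,
               NULL, 0, MPI_DOUBLE, 0, merged);
    MPI_Comm_free(&merged);
}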


Figure 3.5: MPI_Intercomm_merge

    (a) Scenario 1

    (b) Scenario 2


    Chapter 4

    Practical Results

In this chapter we analyze the performance of the PDE solver on different computer systems. The specifications of the computer systems are as follows.

    Isis Cluster

Each node has two dual-core AMD Opteron 2220 processors, so each node comprises 4 processors.

    Grad Cluster

Each node has two quad-core Intel 2.66 GHz processors, so each node comprises 8 processors.

    4.1 Static Process

We analyze the speedup on the Isis and Grad clusters with statically created processes. For the analysis, we run the child program, which contains the parallel implementation of the PDE, on 2, 4, 6 and 8 processes with a mesh size of 1025 x 1025. When performing the test on the Isis cluster we use 2 nodes for 6 and 8 processes, since each node on Isis has 4 processors; the Grad cluster has 8 processors per node, so there we use only a single node. Table 4.1 shows the speedup analysis.

Figure 4.1 shows the linear speedup together with the speedup on the Isis and Grad clusters. Up to 4 processes both machines show almost the same speedup, but above 4 processes the speedup on Isis starts degrading compared to the Grad cluster, because we are then using 2 nodes on Isis and node-to-node communication introduces additional delays.

Table 4.1: Static Processes

Isis Cluster (sequential runtime = 80.86 sec)

Processes   Time (sec)   Speedup
2           41.86        1.94
4           22.58        3.58
6           15.99        5.05
8           11.95        6.76

Grad Cluster (sequential runtime = 114.51 sec)

Processes   Time (sec)   Speedup
2           59.22        1.93
4           30.97        3.69
6           19.99        5.72
8           15.12        7.56


    Figure 4.1: Static Processes

    4.2 Dynamic Process

Here we analyze the speedup when processes are created dynamically. We do not need to run the program separately for 2, 4 and 6 processes from the command console; we define this information in the parent program, which then creates the processes at run time. The mesh size is the same, 1025 x 1025. We also have to acquire the resources prior to the execution of the program, so that each processor executes a single process. In this case we acquire at most 7 processors, since 6 is the maximum number of child processes we create, plus 1 for the parent process.

On the Isis cluster we acquire 2 nodes, which gives 8 processors. Figure 4.2 shows a notable degradation in speedup when 6 processes are created, which did not happen when the processes were created statically. This problem required some investigation of the behavior of the job scheduler or process creator when processes are created dynamically. The investigation revealed that the job scheduler was not allocating the resources to the processes efficiently. When 2 and 4 processes are created on 2 nodes it performs well: resources are allocated efficiently and neither node is overloaded with processes. But in the case of 6 processes it overloads one node with five processes; 4 children and the parent process are scheduled on one node, and the remaining 2 child processes are scheduled on the second node. Due to the excess number of processes and the smaller number of processors on that node, process switching occurs, which introduces additional delays. We rectified this problem by allocating more resources.

Figure 4.2: Dynamic Processes over 2 Isis Nodes

The speedup results over 3 nodes are presented in Table 4.2 and shown visually in Figure 4.3; they are quite similar to the speedup we achieve in the static case. However, the same job scheduler problem appears, now only when 8 processes are created.

Figure 4.4 illustrates the placement made by the operating system when creating 2, 4, 6, 8 and 10 child processes dynamically over 3 Isis nodes. P represents the parent process and C a child process; when 8 processes are created, node 1 gets overloaded with a set of processes consisting of 1 parent and 4 child processes.

    4.3 Fixed Processes Vs Varying Processes

Now we analyze the efficiency of the system when the mesh size is increased, and whether it is beneficial to use a fixed or a varying number of processes. A fixed number of processes means that we do not change the number of processes at run time while the mesh grows; with a varying number of processes we change it for each mesh size.


    Figure 4.3: Dynamic Processes over 3 Isis Nodes

    Figure 4.4: Creation of Dynamic Process Over 3 Isis Nodes


Table 4.2: Dynamic Processes

Isis Cluster (sequential runtime = 80.86 sec)

Processes   Time (sec)   Speedup
2           41.98        1.92
4           22.16        3.64
6           15.79        5.12
8           71.05        1.13
10           9.56        8.45

We define the number of child processes and the mesh sizes in the parent program, e.g.

no_child_processes [2, 4, 6]
mesh_sizes [2^9, 2^10, 2^11]

so that for each dynamically created group of child processes there is a different mesh size; the new mesh is obtained by halving the previous mesh. The linear interpolation between two meshes is performed by the parent process.

First we consider the Isis cluster. In Table 4.3, T is the time taken by a group with a fixed number of child processes to solve the PDE for the respective grid size; summing these times gives the total computation time (Total_T) of the PDE solver. The last column of the table presents the efficiency, and E_av is the average efficiency.

Figure 4.5 shows that using fewer processes, such as 2, gives better efficiency because of less communication, but at the cost of a large total run time; to reduce the total run time we have to use a larger set of processes, such as 4 or 6. When we use more processes we successfully reduce the total run time, but with a degraded average efficiency. What would be the optimal solution to achieve good average efficiency with minimum total run time? One solution is to change the number of processes gradually as the mesh size grows.

Table 4.4 shows the results for a varying number of processes: for the grid sizes 513 x 513, 1025 x 1025 and 2049 x 2049 we use 2, 4 and 6 processes, respectively.

In Figure 4.5 the last column, 2,4,6, represents the result achieved by varying the number of processes during run time. We attain 91.8% average efficiency with a 93.5 sec total run time. Compared with 2 fixed processes there is a significant reduction in total run time, but with a 6% degradation in average efficiency. In contrast, there is a 4.8% gain in average efficiency with a 13.5 sec increase in time compared with 6 fixed processes. Therefore, varying the number of processes as the mesh size increases is advantageous for reducing the total run time compared to using fewer processes, and gainful in average efficiency compared to using a large set of processes.

But what if we change the computer system to one with more processors per node; do we achieve a significant gain in average efficiency on such a system? To answer this we perform the same analysis on the Grad cluster.

In the case of the Grad cluster we attain above 94% average efficiency in all cases with a fixed number of processes (Table 4.5). With a varying number of processes we attain 95.8% average efficiency (Table 4.6). Figure 4.6 illustrates that we did not achieve a significant gain in average efficiency with a varying number of processes compared to 6 fixed processes, where we achieve 94.8% average efficiency. The reason is the Grad architecture, which has 8 processors per node, so that we use only one node in contrast to the Isis cluster. Consequently there are fewer communication delays, due to the absence of node-to-node communication.


Table 4.3: Fixed Processes, Isis Cluster (times in seconds)

(a) 2 Processes
Grid Size     T (1 proc)   T (2 procs)   Speedup   Efficiency (%)
513 x 513     20.40        10.48         1.94      97.3
1025 x 1025   80.89        41.65         1.94      97.3
2049 x 2049   322.43       165.64        1.94      97.3
Total_T = 217.77, E_av = 97.3

(b) 4 Processes
Grid Size     T (1 proc)   T (4 procs)   Speedup   Efficiency (%)
513 x 513     20.40        5.62          3.62      90.7
1025 x 1025   80.89        21.85         3.70      92.5
2049 x 2049   322.43       88.98         3.62      90.58
Total_T = 116.46, E_av = 91.27

(c) 6 Processes
Grid Size     T (1 proc)   T (6 procs)   Speedup   Efficiency (%)
513 x 513     20.40        3.95          5.16      86
1025 x 1025   80.89        15.62         5.17      86.2
2049 x 2049   322.43       60.48         5.33      88.8
Total_T = 80.06, E_av = 87.06

Table 4.4: Varying Processes, Isis Cluster (2, 4, 6 processes; times in seconds)

Grid Size     Processes   T        Speedup   Efficiency (%)
513 x 513     2           10.46    1.94      97.47
1025 x 1025   4           22.75    3.55      88.86
2049 x 2049   6           60.30    5.34      89.11
Total_T = 93.5, E_av = 91.82


    Figure 4.5: Isis Cluster


Table 4.5: Fixed Number of Processes, Grad Cluster (times in seconds)

(a) 2 Processes
Grid Size     T (1 proc)   T (2 procs)   Speedup   Efficiency (%)
513 x 513     28.87        14.49         1.99      99.6
1025 x 1025   115.24       58.17         1.98      99
2049 x 2049   459.54       233.93        1.96      98.2
Total_T = 306.59, E_av = 98.96

(b) 4 Processes
Grid Size     T (1 proc)   T (4 procs)   Speedup   Efficiency (%)
513 x 513     28.87        7.38          3.91      97.7
1025 x 1025   115.24       29.7          3.88      97
2049 x 2049   459.54       119.4         3.84      96.18
Total_T = 156.52, E_av = 96.99

(c) 6 Processes
Grid Size     T (1 proc)   T (6 procs)   Speedup   Efficiency (%)
513 x 513     28.87        5.00          5.77      96.23
1025 x 1025   115.24       20.01         5.75      95.98
2049 x 2049   459.54       83.08         5.53      92.18
Total_T = 108.09, E_av = 94.8

Table 4.6: Varying Processes, Grad Cluster (2, 4, 6 processes; times in seconds)

Grid Size     Processes   T        Speedup   Efficiency (%)
513 x 513     2           14.52    1.98      99.41
1025 x 1025   4           29.47    3.91      97.76
2049 x 2049   6           84.88    5.41      90.23
Total_T = 128.87, E_av = 95.80


    Figure 4.6: Grad Cluster Average Efficiency


    Chapter 5

    Conclusion

We have successfully designed and implemented a PDE solver using the Process Creation and Management feature of the MPI-2 standard, where the data is redistributed among the processes as the mesh size increases at run time. The job scheduler showed unexpected behavior when processes were created dynamically: it did not allocate the resources efficiently, which affected the performance of the PDE solver; however, the problem was resolved by allocating more resources. Using a varying number of processes is beneficial compared to a fixed number of processes, but the benefit also depends on the computer system; on the Isis cluster we achieve a significant difference in efficiency compared to the Grad cluster.

In the future we will analyze the performance of the same system by opening OpenMP threads within the child processes. For example, consider the Isis cluster with 4 processors per node: we would use 3 nodes, with 1 child process on each node, which in turn creates OpenMP threads. The OpenMP threads solve the PDE over the shared memory within each node. To guide the operating system to create 1 process per node we will use an MPI Info object, which holds key-value pairs in which we can define on which host we want to create the processes.
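A minimal sketch of this placement mechanism, using the reserved "host" info key, could look as follows; the node name isis-node1 is hypothetical, and how the value is interpreted can differ between MPI implementations.

/* Ask the runtime to place a spawned child on a specific node via the
 * reserved "host" info key; node names here are hypothetical. */
#include <mpi.h>

void spawn_one_child_on(const char *hostname, MPI_Comm *child_comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", hostname);   /* e.g. "isis-node1" */

    MPI_Comm_spawn("child", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, child_comm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
}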


    Bibliography

[1] Barry Wilkinson & Michael Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Pearson Education International, 2005.

[2] Ananth Grama, Anshul Gupta, George Karypis & Vipin Kumar. Introduction to Parallel Computing. Addison Wesley, 2003.

[3] MPI Forum. MPI Standard Version 2.2. http://www.mpi-forum.org/docs/mpi-2.2.

[4] William Gropp, Ewing Lusk & Rajeev Thakur. Using MPI-2: Advanced Features of the Message-Passing Interface. The MIT Press, Cambridge, 1999.
