Reconfiguration of MPI Processes at Runtime in a Numerical PDE Solver


Reconfiguration of MPI Processes at Runtime in a Numerical PDE Solver

IMRAN AHMED

Master's Thesis in Numerical Analysis (30 ECTS credits)

    at the Scientific Computing International Master Program

    Royal Institute of Technology year 2010

Supervisor was Jarmo Rantakokko, Uppsala University

Examiner was Michael Hanke

    TRITA-CSC-E 2010:147

    ISRN-KTH/CSC/E--10/147--SE

    ISSN-1653-5715

    Royal Institute of Technology

    School of Computer Science and Communication

    KTH CSC

    SE-100 44 Stockholm, Sweden

    URL: www.kth.se/csc


    Abstract

The simulation of complex phenomena, described by partial differential equations, requires adaptive numerical methods and parallel computers. In adaptive methods the computational grid is automatically refined or coarsened to meet accuracy requirements in the solution. This leads to a dynamic change of workload. In a parallel computing context, the data must therefore be redistributed between the processors at run time.

Process Creation and Management is a feature of the MPI-2 standard that makes it possible to create and terminate processes after an MPI application has started. In this thesis we design and implement a PDE solver based on this advanced feature of MPI, where processes are created dynamically and data is redistributed among the spawned processes at run time. Further, we analyze the performance of the PDE solver on different computer systems.


    Referat

Adaptation of MPI processes at run time in a numerical PDE solver

Simulation of complex phenomena, described by partial differential equations, requires adaptive numerical methods and parallel computers. In adaptive methods the computational grid is refined automatically to reach prescribed tolerances, which leads to a dynamic workload. In parallel computations the data must therefore be repartitioned at run time.

"Process Creation and Management", available in the MPI-2 standard, facilitates the creation and termination of processes after an MPI program has started. In this thesis a PDE solver is developed, based on this advanced MPI feature, with dynamic load balancing and dynamic control over the processes at run time. We analyze the performance of this solver on different systems.


    Contents

1 Introduction

2 Preliminaries
  2.1 Parallel Computing
    2.1.1 Speed up
    2.1.2 Efficiency
  2.2 Message Passing Interface
    2.2.1 Process Creation

3 PDE Solver
  3.1 Parallel Implementation
  3.2 Parent Program
  3.3 Child Program
  3.4 Child Processes Communication
  3.5 Parent-Child Communication

4 Practical Results
  4.1 Static Process
  4.2 Dynamic Process
  4.3 Fixed Processes Vs Varying Processes

5 Conclusion

Bibliography


    Chapter 1

    Introduction

In science and engineering, physical systems with more than one independent variable are modeled using partial differential equations (PDEs), for example heat conduction, electrostatics, electrodynamics and fluid flow. Simple equations can be solved analytically, that is, we can obtain an expression that describes how the variables evolve and the characteristics of the equation. In most cases, however, this is not possible: the analytical solution may not exist or may be difficult to obtain. Such equations are therefore solved numerically, and a consequence is that the solution is an approximation of the exact solution. The most widely used numerical methods are the finite difference method (FDM) and the finite element method (FEM). In this thesis we consider the FDM.

In the FDM, the solution of the differential equation is approximated by replacing the derivative expressions with finite difference approximations. The discretization of the domain plays a significant role in the accuracy of the FDM. A high grid resolution is valuable for the convergence of the numerical solution, but it requires more computational time and computer resources. Refinement of the grid is only needed in areas where the accuracy is too low and the solution does not converge for a particular step size. Consequently there is a need for adaptive mesh refinement (AMR), which refines the mesh from a coarse to a fine grid where required.

Parallelization of such an algorithm is not simple. Because of the dynamic nature of adaptive numerical methods, their parallel implementation is harder than that of static grid computations. In the static case the grid is evenly distributed among the processes. In the adaptive case the size of the grid changes frequently, which leads to a load balancing problem and requires the grid to be redistributed.

In an MPI application where processes are created statically, the number of processes is fixed throughout the computation. If the mesh is refined from a coarse to a fine grid during the computation, the workload per process increases and consequently the total computational time rises. This time could be reduced by creating more processes at run time to share the workload, which in turn requires a redistribution of data. We therefore need dynamic creation of processes at run time. MPI-2 provides an advanced feature where processes can be created and terminated after an MPI application has been started. We use this feature to develop a PDE solver where processes are created and the data is redistributed among them as the mesh size increases. Chapter 3 explains the implementation of such a PDE solver, including the functionality of, and the connection between, the parent and child programs. Chapter 4 describes the performance of the PDE solver on two different computer systems and how computer architecture and operating system influence it.


Further, we emphasize that it is beneficial to use a varying number of processes rather than a fixed number of processes as the mesh size increases during run time. Here, a varying number of processes means changing the number of processes as the mesh grows, while fixed means that the number of processes does not change throughout the computation. We also analyze the proposed idea on two computer architectures that differ in the number of processors per node.


    Chapter 2

    Preliminaries

    2.1 Parallel Computing

Numerical simulation of scientific and engineering problems requires immense computational speed. One way to increase the computational speed is to use machines with multiple processors instead of a single processor. The sequential program then has to be modified for multiple processors, and writing a program for this form of computation is known as parallel programming. The basic idea is to split the computational workload among the processors in such a way that it provides a significant increase in performance. The following factors are considered when evaluating the performance of a parallel program [1].

    2.1.1 Speed up

Speedup is defined by the following formula:

\[ S_p = \frac{T_s}{T_p} \]

where p is the number of processors, T_s is the execution time of the sequential algorithm, and T_p is the execution time of the parallel algorithm with p processors. When S_p = p we obtain linear speedup, which indicates good scalability of the parallel program.

    2.1.2 Efficiency

It is useful to know for how long the processors are actually used in the computation, which can be found from the efficiency. Efficiency is defined as

\[ E_p = \frac{S_p}{p} \]

Typically its value lies between zero and one. The efficiency of the algorithm depends heavily on the architecture of the computer system, i.e. the number of cores per node. More cores per node require less of the interconnection network, and consequently less communication time, so the system efficiency increases.
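For example, with the Isis measurements reported later in Table 4.1 (a sequential run time of 80.86 sec and 11.95 sec on 8 processes), these definitions give

\[ S_8 = \frac{80.86}{11.95} \approx 6.8, \qquad E_8 = \frac{S_8}{8} \approx 0.85 . \]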

    2.2 Message Passing Interface

Message passing is a common technique for parallel processing on distributed memory multiprocessors. Processes execute tasks on individual processors and communicate with each other by sending messages. In this way, processes can operate in a semi-autonomous manner, performing distinct computations that form part of a larger job, sharing data and synchronizing with each other when required.

Message passing is a distributed memory paradigm in which each process executes in a different memory space from the other processes. This scheme works equally well when the processors are part of the same computer or spread across a range of heterogeneous machines spanning a network. Message-passing systems can be highly scalable, but to make the best use of message passing, applications must be designed to exploit the available parallelism.

Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another. MPI primarily addresses the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process. MPI is not a language; all MPI operations are expressed as functions, subroutines, or methods, according to the appropriate language bindings, which for C, C++, Fortran-77 and Fortran-95 are part of the MPI standard [3].

The intent of MPI is to provide a safe communication environment. In point-to-point communication one process sends a message and a second process receives it; in contrast, collective communication takes place within a group of processes. Communicators are used in MPI message-passing communication. A communicator is a communication domain that defines a set of processes that are allowed to communicate with one another. Each process has a unique rank within the communicator, an integer from 0 to p-1, where p is the number of processes. Two types of communicators are available: an intracommunicator for communication within a group, and an intercommunicator for communication between groups; both support point-to-point and collective communication. A process can be part of several communicators and is therefore identified by the pair of a group and its rank in that group. Besides communication, MPI also provides routines for process topology creation, a convenient naming mechanism for the processes of an intracommunicator. The two main types of topologies supported by MPI are Cartesian (grid) and graph. There need be no relation between the physical structure of the parallel machine and the process topology. To create a simple virtual topology, the commonly used MPI routines are MPI_Cart_create and MPI_Cart_coords. Figure 2.1 shows the mapping of processes onto a two-dimensional Cartesian topology, where the integers 0, 1, ..., 15 represent the process ranks and (0,0), ..., (3,3) the coordinates in the topology [1, 2].
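As an illustration of these routines, the following C sketch (ours, not code from the thesis) builds the two-dimensional Cartesian topology of Figure 2.1 when run on 16 processes and lets each process query its own coordinates.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI factor the processes into a 2D grid, e.g. 16 -> 4 x 4. */
    int dims[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    /* Create the Cartesian communicator (no wrap-around, reordering allowed). */
    int periods[2] = {0, 0};
    MPI_Comm cart_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

    /* Each process looks up its (row, column) coordinates in the grid. */
    int cart_rank, coords[2];
    MPI_Comm_rank(cart_comm, &cart_rank);
    MPI_Cart_coords(cart_comm, cart_rank, 2, coords);
    printf("rank %d -> coords (%d,%d)\n", cart_rank, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}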

    2.2.1 Process Creation

MPI is primarily concerned with communication rather than process or resource management. However, it is necessary to address these issues to some degree in order to define a useful framework for communication. The MPI-2 standard provides routines for process creation. There are two methods by which MPI processes can be created: statically and dynamically.

    Static Process Creation

All the processes are specified before execution, and the system executes a fixed number of processes. The programmer usually specifies the number of processes explicitly prior to execution, via command-line options.


    Figure 2.1: Cartesian Topology

Dynamic Process Creation

In dynamic process creation, processes are created during the execution of other processes. The MPI-2 standard provides portable routines for dynamic creation; portable means that they can run under a variety of job-scheduling and process-management environments. One of the ways in which MPI achieves scalability and performance is through the use of collective operations that concisely describe an operation involving a large number of processes. Instead of creating new processes through individual requests to a job manager and process creator, MPI allows the programmer to make a single request for a large number of processes that will belong to a single group.

    MPI Spawn Processes

In MPI, new processes are spawned when an existing MPI process executes the MPI_COMM_SPAWN routine [4, 3]. As a result of this call another COMM_WORLD is established, consisting of the spawned processes. Here we describe the essential parameters required for a successful execution of MPI_COMM_SPAWN.

MPI_COMM_SPAWN(command, argv, maxprocs, info, root, comm, intercomm, array_of_errcodes)

Input parameters:

command: name of the program to be spawned
maxprocs: maximum number of processes to start
root: rank of the parent process
comm: intracommunicator containing the group of spawning processes

Output parameter:

intercomm: intercommunicator between the original group and the newly spawned group

MPI_COMM_SPAWN starts maxprocs identical copies of the MPI program specified by command, establishing communication with them and returning an intercommunicator. The spawned processes are referred to as children. The children have their own MPI_COMM_WORLD, which is separate from that of the parents. Figure 2.2 illustrates this, where P0 executes the call to MPI_COMM_SPAWN, resulting in the creation of the children's COMM_WORLD.

The parent process creates new child processes by calling MPI_COMM_SPAWN.


    Figure 2.2: Spawning in MPI

The child processes call MPI_INIT() and create their own MPI_COMM_WORLD.

An intercommunicator is formed between the parent and children COMM_WORLDs.

The children call MPI_COMM_GET_PARENT to obtain the intercommunicator for communication with the parent.

Figure 2.3 shows the sequence of MPI routines carried out by the parent and child processes in order to establish communication.

In the parent code, the variable num_child means that we create 4 child processes, child_comm is the intercommunicator and child is the name of the child program; these parameters are passed to the MPI_COMM_SPAWN routine. The MPI_RECV call receives data from the child process with rank 0, using child_comm as the communicator. In the child program, all child processes call MPI_COMM_GET_PARENT, which returns the communicator for the parent process as parent. After this the parallel code of the child processes is added. In the MPI_SEND call the child processes use parent as the communicator.
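As a concrete version of the pattern that Figure 2.3 illustrates, the following minimal C sketch uses the variable names from the text (num_child, child_comm, parent); the message contents, counts and error handling are our own assumptions.

/* parent.c -- spawns child processes and receives one value from child rank 0 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int num_child = 4;               /* number of child processes to create */
    MPI_Comm child_comm;             /* intercommunicator to the children   */
    MPI_Comm_spawn("child", MPI_ARGV_NULL, num_child, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);

    double result;
    MPI_Recv(&result, 1, MPI_DOUBLE, 0, 0, child_comm, MPI_STATUS_IGNORE);
    printf("parent received %f from child 0\n", result);

    MPI_Finalize();
    return 0;
}

/* child.c -- obtains the parent intercommunicator and sends one value back */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);            /* intercommunicator to the parent */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* children's own COMM_WORLD */

    /* ... parallel computation of the children goes here ... */

    if (rank == 0) {
        double result = 42.0;                /* placeholder value */
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, parent);
    }

    MPI_Finalize();
    return 0;
}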


    Figure 2.3: Parent and Child Programs


    Chapter 3

    PDE Solver

    3.1 Parallel Implementation

We implement two programs: a parent program, which instantiates one MPI process (the parent process) to manage the creation of the other processes, and a child program, which solves the PDE on the processes spawned by the parent process. The PDE we choose is the 2D advection equation (3.1),

\[ u_t - u_x - u_y = F. \qquad (3.1) \]

The leapfrog scheme is used to solve equation (3.1) numerically. Therefore equation (3.1) becomes

\[ \frac{u^{k+1}_{i,j} - u^{k-1}_{i,j}}{2h_t} = \frac{u^{k}_{i+1,j} - u^{k}_{i-1,j}}{2h_x} + \frac{u^{k}_{i,j+1} - u^{k}_{i,j-1}}{2h_y} + F, \qquad (3.2) \]

where k = 2, 3, ..., N, i = 1, 2, ..., Nx and j = 1, 2, ..., Ny.
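To make the update rule (3.2) concrete, the following C sketch performs one leapfrog time step; it is our own illustration, and the row-major array layout, the helper macro U and the handling of boundary/ghost values are assumptions rather than the thesis code.

/* One leapfrog step for u_t - u_x - u_y = F on an (Nx+2) x (Ny+2) grid
 * with one layer of boundary/ghost points; uold, u, unew hold the time
 * levels k-1, k, k+1. Indexing: U(a, i, j) = a[i*(Ny+2) + j]. */
#define U(a, i, j) ((a)[(i) * (Ny + 2) + (j)])

void leapfrog_step(double *unew, const double *u, const double *uold,
                   int Nx, int Ny, double ht, double hx, double hy, double F)
{
    for (int i = 1; i <= Nx; i++) {
        for (int j = 1; j <= Ny; j++) {
            double ux = (U(u, i + 1, j) - U(u, i - 1, j)) / (2.0 * hx);
            double uy = (U(u, i, j + 1) - U(u, i, j - 1)) / (2.0 * hy);
            U(unew, i, j) = U(uold, i, j) + 2.0 * ht * (ux + uy + F);
        }
    }
}

In the parallel child program the same loop runs over the local block, with the ghost points supplying the i-1, i+1, j-1 and j+1 values at the block edges.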

    3.2 Parent Program

The intent of this PDE solver is to share the workload over new processes if the mesh size increases during the computation, which leads to a redistribution of data. Since the parent process creates the new processes, it is convenient to manage the redistribution of data at the parent process, but it first has to collect the previously computed data from the child processes. Besides the collection and redistribution of data, the parent process has to perform the interpolation between the previous mesh and the new, finer mesh.

Pseudo code for the parent program:

mesh_size [513x513, 1025x1025]     /* mesh sizes */

child_processes [2, 4]             /* number of child processes for each mesh size */

for(i, mesh_size)                  /* for each mesh size */


    create_child_processes(spawn(child_processes(i)))

    if (continuation of computation)
        Scatter inner points to the child processes

    Gather inner points from the child processes

    Perform linear interpolation (mesh_size(i), mesh_size(i+1))

end

The parent program executes in the following way. For each mesh size it spawns a group of child processes, and the number of child processes may vary. There is no need to scatter data the first time, since it is the beginning of the numerical computation. The parent process waits at the Gather operation in order to receive data from the child processes. After collecting the data, the parent process performs the linear interpolation on the received data for the next mesh size. In subsequent iterations of the for loop the interpolated data is scattered over the new group of spawned processes. Figure 3.1 illustrates this execution.

Figure 3.1: Flow Diagram
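The interpolation step above is not spelled out in the thesis; a plausible C sketch, assuming the finer mesh roughly doubles the resolution (as in the mesh sequence 513, 1025, 2049 used in Chapter 4) and using injection plus midpoint averaging (linear interpolation), is:

/* Interpolate a coarse (nc x nc) grid onto a fine (nf x nf) grid with
 * nf = 2*(nc-1)+1, e.g. 513 -> 1025: copy coincident points (injection)
 * and average neighbours for the new midpoints. */
void refine_grid(const double *coarse, int nc, double *fine, int nf)
{
    /* coincident points */
    for (int i = 0; i < nc; i++)
        for (int j = 0; j < nc; j++)
            fine[2*i*nf + 2*j] = coarse[i*nc + j];

    /* new points on coarse rows: average the left/right neighbours */
    for (int i = 0; i < nf; i += 2)
        for (int j = 1; j < nf; j += 2)
            fine[i*nf + j] = 0.5 * (fine[i*nf + j - 1] + fine[i*nf + j + 1]);

    /* remaining rows: average the rows above and below */
    for (int i = 1; i < nf; i += 2)
        for (int j = 0; j < nf; j++)
            fine[i*nf + j] = 0.5 * (fine[(i - 1)*nf + j] + fine[(i + 1)*nf + j]);
}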


    Figure 3.2: Mesh Distribution

    3.3 Child Program

The child program is a conventional MPI program, which solves the PDE on parallel processes by dividing the mesh grid among them. The division of the mesh depends on the topology of the processes, so it is convenient to know the topology information before dividing the mesh over the processes. A two-dimensional Cartesian grid of processes, for an arbitrary number of processes, is created using the MPI_DIMS_CREATE routine; once the topology information is known, partitioning the mesh is straightforward. Let P x Q be the two-dimensional process topology and Nx x Ny the mesh size. The local mesh size nnx x nny on each process is then

\[ nnx = \frac{N_x}{P} + 2, \qquad nny = \frac{N_y}{Q} + 2, \]

where the addition of 2 is due to the ghost points in each dimension (Figure 3.2).

Ghost points appear when a numerical method is parallelized, because the difference operator reaches outside the local region and its computation depends on points lying on an adjacent process. Figure 3.3 shows the decomposition of the grid over two processes P0 and P1: the computation of u_{i,j} at process P0 depends on the points i-1, i+1, j-1 and j+1, and all of them except i+1 lie on the same process P0. The point i+1 lies on the adjacent process P1. Therefore an extra column of points, called ghost points, is introduced at each process to hold the values of i+1 from the adjacent process, resulting in communication among the processes.

Figure 3.3: Difference operator reaches out of the local region

Pseudo code for the child program:

calculate local mesh size nx, ny

instantiate local arrays u(nx+2, ny+2)

if (continuation of computation)
    Receive inner points from the Parent process as the counterpart of its Scatter operation

if (start of computation)
    initialize the mesh

for (k = 2, N)     /* for each time step */

    Evaluate the difference operator away from the boundaries

    Transfer ghost points

    Evaluate the difference operator near the boundaries

end

Send inner points to the Parent process as the counterpart of its Gather operation.
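As a small illustration of the first two set-up steps in this pseudocode, the following C sketch (ours; it assumes Nx divides evenly by P and Ny by Q) creates the process grid with MPI_Dims_create and computes the local array extents including the two ghost layers.

/* Determine a P x Q process grid and the local mesh size on each process. */
#include <mpi.h>

void local_mesh_size(int Nx, int Ny, int *nnx, int *nny, int dims[2])
{
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    dims[0] = dims[1] = 0;
    MPI_Dims_create(nprocs, 2, dims);   /* factor nprocs into P x Q */

    *nnx = Nx / dims[0] + 2;            /* +2 for the ghost points in x */
    *nny = Ny / dims[1] + 2;            /* +2 for the ghost points in y */
}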

3.4 Child Processes Communication

The transfer of ghost points leads to communication among the child processes. Processes lying at the edge of the process topology require less communication than processes lying in the middle. In Figure 3.4 the arrows show such communication patterns.

Figure 3.4: Child Processes Communication Pattern

Pseudo code for communication:

dims_x    x dimension of the 2-dimensional process topology
dims_y    y dimension of the 2-dimensional process topology
cordx     x coordinate of the process in the topology
cordy     y coordinate of the process in the topology
p_left    process to the left
p_right   process to the right
p_down    process in the down direction
p_up      process in the up direction

if (cordx > 0)     /* Initiate send to left neighbor */
    Send(p_left)
if (cordx > 0)     /* Initiate receive from left */
    Recv(p_left)
if (cordy > 0)     /* Initiate send to up neighbor */
    Send(p_up)
if (cordy > 0)     /* Initiate receive from up */
    Recv(p_up)
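In practice this neighbor bookkeeping is often delegated to MPI itself. The following C sketch (ours, not the thesis code) exchanges one column of ghost points with the left and right neighbors using MPI_Cart_shift and MPI_Sendrecv; MPI_PROC_NULL at the edges makes explicit coordinate tests unnecessary. The up/down exchange is analogous, with contiguous rows instead of a strided column type.

/* Exchange ghost columns with the left and right neighbors in a 2D
 * Cartesian communicator; u is the local (nnx x nny) block stored
 * row-major, column 0 and column nny-1 being the ghost columns. */
#include <mpi.h>

void exchange_ghost_columns(double *u, int nnx, int nny, MPI_Comm cart_comm)
{
    int p_left, p_right;
    MPI_Cart_shift(cart_comm, 0, 1, &p_left, &p_right);

    MPI_Datatype column;                  /* one value per row, stride nny */
    MPI_Type_vector(nnx, 1, nny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* send first inner column left, receive the right ghost column */
    MPI_Sendrecv(&u[1], 1, column, p_left, 0,
                 &u[nny - 1], 1, column, p_right, 0,
                 cart_comm, MPI_STATUS_IGNORE);

    /* send last inner column right, receive the left ghost column */
    MPI_Sendrecv(&u[nny - 2], 1, column, p_right, 1,
                 &u[0], 1, column, p_left, 1,
                 cart_comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
}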

    3.5 Parent-Child Communication

The parent process has to collect data from, and redistribute data to, the child processes. The child processes have their own COMM_WORLD, which is not part of the parent's COMM_WORLD. If, for communication between the two COMM_WORLDs, we use the communicator established between the parent and the child processes at the time of spawning, it allows us to perform point-to-point communication. In that case the transfer of data would occur in the following manner.


The root process in the child COMM_WORLD collects the data from the other child processes using MPI_Gather as a collective operation.

Then the root process sends the collected data to the parent process as a point-to-point communication operation.

But if we can manage to perform a collective communication between the two separate COMM_WORLDs, meaning that all child processes send their data directly to the parent process in one collective operation, then we get rid of the point-to-point communication between the parent and the root child process. The established communicator does not support the MPI_Gather and MPI_Scatter operations, because rank 0 exists in both COMM_WORLDs, and for a collective operation all participating processes must have unique ranks. Therefore we use MPI_Intercomm_merge, which merges the two groups of the intercommunicator and returns an intracommunicator in which all processes have unique ranks. An integer argument is used to order the groups when the new communicator is created; the group that passes 0 has its processes ordered first in the new communicator. Figure 3.5 illustrates both scenarios with code, where comm1 and comm2 are two separate communicators merged together, resulting in R_comm as the new communicator. In the first scenario (3.5a) comm1 sets its integer value to 0 and comm2 sets it to 1; the second scenario shows comm1 set to 1 and comm2 set to 0.
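As a concrete version of this merge-then-gather idea, the following minimal C sketch is our own; the function names, buffer sizes and the dummy block for the parent's own contribution are assumptions, not the code of Figure 3.5.

/* parent side: merge the parent-child intercommunicator so that a single
 * collective can move data between the two COMM_WORLDs. */
#include <mpi.h>
#include <stdlib.h>

void gather_from_children(MPI_Comm child_comm, int n_per_proc, int num_child)
{
    MPI_Comm merged;
    /* high = 0: the parent group is ordered first, so the parent gets rank 0 */
    MPI_Intercomm_merge(child_comm, 0, &merged);

    /* one block per process in the merged communicator (parent + children) */
    double *recvbuf = malloc((size_t)(num_child + 1) * n_per_proc * sizeof(double));
    double *dummy   = calloc((size_t)n_per_proc, sizeof(double)); /* parent's block */

    MPI_Gather(dummy, n_per_proc, MPI_DOUBLE,
               recvbuf, n_per_proc, MPI_DOUBLE, 0, merged);

    /* recvbuf[n_per_proc ...] now holds the children's data, ordered by rank */
    /* ... interpolate and later redistribute the data ... */

    free(dummy);
    free(recvbuf);
    MPI_Comm_free(&merged);
}

/* child side: the matching call made by every child process */
void send_to_parent(double *local_data, int n_per_proc)
{
    MPI_Comm parent, merged;
    MPI_Comm_get_parent(&parent);
    MPI_Intercomm_merge(parent, 1, &merged);  /* high = 1: ordered after the parent */

    MPI_Gather(local_data, n_per_proc, MPI_DOUBLE,
               NULL, 0, MPI_DOUBLE, 0, merged);
    MPI_Comm_free(&merged);
}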


Figure 3.5: MPI_Intercomm_merge

    (a) Scenario 1

    (b) Scenario 2


    Chapter 4

    Practical Results

In this chapter we analyze the performance of the PDE solver on different computer systems. The specifications of the computer systems are as follows.

    Isis Cluster

Each node has two dual-core AMD Opteron 2220 processors, so each node comprises 4 processors.

    Grad Cluster

Each node has two quad-core Intel 2.66 GHz processors, so each node comprises 8 processors.

    4.1 Static Process

We analyze the speedup on the Isis and Grad clusters with statically created processes. For the analysis, we run the child program, which contains the parallel implementation of the PDE, on 2, 4, 6 and 8 processes with a mesh size of 1025 x 1025. When performing the test on the Isis cluster we use 2 nodes for 6 and 8 processes, since each node on Isis has 4 processors; the Grad cluster has 8 processors per node, so there we use only a single node. Table 4.1 shows the speedup analysis.

Figure 4.1 shows the linear speedup together with the speedup on the Isis and Grad clusters. Up to 4 processes both machines show almost the same speedup, but above 4 processes the speedup on Isis starts degrading compared to the Grad cluster, because we are then using 2 nodes on Isis and node-to-node communication introduces additional delays.

Table 4.1: Static Processes

Isis Cluster (sequential runtime = 80.86 sec)

Processes   Time (sec)   Speedup
2           41.86        1.94
4           22.58        3.58
6           15.99        5.05
8           11.95        6.76

Grad Cluster (sequential runtime = 114.51 sec)

Processes   Time (sec)   Speedup
2           59.22        1.93
4           30.97        3.69
6           19.99        5.72
8           15.12        7.56


    Figure 4.1: Static Processes

    4.2 Dynamic Process

Here we analyze the speedup when processes are created dynamically. We do not need to run the program separately for 2, 4 and 6 processes from the command console; we define this information in the parent program, which then creates the processes at run time. The mesh size is the same, 1025 x 1025. We also have to acquire the resources prior to the execution of the program, so that each processor executes a single process. In this case we acquire at most 7 processors, since 6 is the maximum number of child processes we create, plus 1 for the parent process.

On the Isis cluster we acquire 2 nodes, which gives 8 processors. Figure 4.2 shows a notable degradation in speedup when 6 processes are created, which did not happen when the processes were created statically. This problem required some investigation of the behavior of the job scheduler or process creator when processes are created dynamically. The investigation revealed that the job scheduler was not allocating the resources to the processes efficiently. When 2 and 4 processes are created on 2 nodes it performs well: resources are allocated efficiently and neither node is overloaded with processes. But in the case of 6 processes it overloads one node with five processes; 4 children and the parent process are scheduled on one node, and the remaining 2 child processes are scheduled on the second node. Due to the excess number of processes and the smaller number of processors on that node, process switching occurs, which introduces additional delays. We rectified this problem by allocating more resources.

Figure 4.2: Dynamic Processes over 2 Isis Nodes

The speedup results over 3 nodes are presented in Table 4.2 and shown visually in Figure 4.3; they are quite similar to the speedup we achieve in the static case. However, the same job scheduler problem appears, now only when 8 processes are created.

Figure 4.4 illustrates the placement made by the operating system when creating 2, 4, 6, 8 and 10 child processes dynamically over 3 Isis nodes. P represents the parent process and C a child process; when 8 processes are created, node 1 gets overloaded with a set of processes consisting of 1 parent and 4 child processes.

    4.3 Fixed Processes Vs Varying Processes

Now we analyze the efficiency of the system when the mesh size is increased, and whether it is beneficial to use a fixed or a varying number of processes. A fixed number of processes means that we do not change the number of processes at run time while the mesh grows; with a varying number of processes we change it for each mesh size.


    Figure 4.3: Dynamic Processes over 3 Isis Nodes

    Figure 4.4: Creation of Dynamic Process Over 3 Isis Nodes


Table 4.2: Dynamic Processes

Isis Cluster (sequential runtime = 80.86 sec)

Processes   Time (sec)   Speedup
2           41.98        1.92
4           22.16        3.64
6           15.79        5.12
8           71.05        1.13
10           9.56        8.45

We define the number of child processes and the mesh sizes in the parent program, e.g.

no_child_processes [2, 4, 6]
mesh_sizes [2^9, 2^10, 2^11]

so that for each dynamically created group of child processes there is a different mesh size; the new mesh is obtained by halving the previous mesh. The linear interpolation between two meshes is performed by the parent process.

First we consider the Isis cluster. In Table 4.3, T is the time taken by a group with a fixed number of child processes to solve the PDE for the respective grid size; summing these times gives the total computation time (Total_T) of the PDE solver. The last column of the table presents the efficiency, and E_av is the average efficiency.

Figure 4.5 shows that using fewer processes, such as 2, gives better efficiency because of less communication, but at the cost of a large total run time; to reduce the total run time we have to use a larger set of processes, such as 4 or 6. When we use more processes we successfully reduce the total run time, but with a degraded average efficiency. What would be the optimal solution to achieve good average efficiency with minimum total run time? One solution is to change the number of processes gradually as the mesh size grows.

Table 4.4 shows the results for a varying number of processes: for the grid sizes 513 x 513, 1025 x 1025 and 2049 x 2049 we use 2, 4 and 6 processes, respectively.

In Figure 4.5 the last column, 2,4,6, represents the result achieved by varying the number of processes during run time. We attain 91.8% average efficiency with a 93.5 sec total run time. Compared with 2 fixed processes there is a significant reduction in total run time, but with a 6% degradation in average efficiency. In contrast, there is a 4.8% gain in average efficiency with a 13.5 sec increase in time compared with 6 fixed processes. Therefore, varying the number of processes as the mesh size increases is advantageous for reducing the total run time compared to using fewer processes, and gainful in average efficiency compared to using a large set of processes.

But what if we change the computer system to one with more processors per node; do we achieve a significant gain in average efficiency on such a system? To answer this we perform the same analysis on the Grad cluster.

In the case of the Grad cluster we attain above 94% average efficiency in all cases with a fixed number of processes (Table 4.5). With a varying number of processes we attain 95.8% average efficiency (Table 4.6). Figure 4.6 illustrates that we did not achieve a significant gain in average efficiency with a varying number of processes compared to 6 fixed processes, where we achieve 94.8% average efficiency. The reason is the Grad architecture, which has 8 processors per node, so that we use only one node in contrast to the Isis cluster. Consequently there are fewer communication delays, due to the absence of node-to-node communication.


Table 4.3: Fixed Processes, Isis Cluster (times in seconds)

(a) 2 Processes
Grid Size     T (1 proc)   T (2 procs)   Speedup   Efficiency (%)
513 x 513     20.40        10.48         1.94      97.3
1025 x 1025   80.89        41.65         1.94      97.3
2049 x 2049   322.43       165.64        1.94      97.3
Total_T = 217.77, E_av = 97.3

(b) 4 Processes
Grid Size     T (1 proc)   T (4 procs)   Speedup   Efficiency (%)
513 x 513     20.40        5.62          3.62      90.7
1025 x 1025   80.89        21.85         3.70      92.5
2049 x 2049   322.43       88.98         3.62      90.58
Total_T = 116.46, E_av = 91.27

(c) 6 Processes
Grid Size     T (1 proc)   T (6 procs)   Speedup   Efficiency (%)
513 x 513     20.40        3.95          5.16      86
1025 x 1025   80.89        15.62         5.17      86.2
2049 x 2049   322.43       60.48         5.33      88.8
Total_T = 80.06, E_av = 87.06

Table 4.4: Varying Processes, Isis Cluster (2, 4, 6 processes; times in seconds)

Grid Size     Processes   T        Speedup   Efficiency (%)
513 x 513     2           10.46    1.94      97.47
1025 x 1025   4           22.75    3.55      88.86
2049 x 2049   6           60.30    5.34      89.11
Total_T = 93.5, E_av = 91.82


    Figure 4.5: Isis Cluster


Table 4.5: Fixed Number of Processes, Grad Cluster (times in seconds)

(a) 2 Processes
Grid Size     T (1 proc)   T (2 procs)   Speedup   Efficiency (%)
513 x 513     28.87        14.49         1.99      99.6
1025 x 1025   115.24       58.17         1.98      99
2049 x 2049   459.54       233.93        1.96      98.2
Total_T = 306.59, E_av = 98.96

(b) 4 Processes
Grid Size     T (1 proc)   T (4 procs)   Speedup   Efficiency (%)
513 x 513     28.87        7.38          3.91      97.7
1025 x 1025   115.24       29.7          3.88      97
2049 x 2049   459.54       119.4         3.84      96.18
Total_T = 156.52, E_av = 96.99

(c) 6 Processes
Grid Size     T (1 proc)   T (6 procs)   Speedup   Efficiency (%)
513 x 513     28.87        5.00          5.77      96.23
1025 x 1025   115.24       20.01         5.75      95.98
2049 x 2049   459.54       83.08         5.53      92.18
Total_T = 108.09, E_av = 94.8

Table 4.6: Varying Processes, Grad Cluster (2, 4, 6 processes; times in seconds)

Grid Size     Processes   T        Speedup   Efficiency (%)
513 x 513     2           14.52    1.98      99.41
1025 x 1025   4           29.47    3.91      97.76
2049 x 2049   6           84.88    5.41      90.23
Total_T = 128.87, E_av = 95.80


    Figure 4.6: Grad Cluster Average Efficiency


    Chapter 5

    Conclusion

We have successfully designed and implemented a PDE solver using the Process Creation and Management feature of the MPI-2 standard, where the data is redistributed among the processes as the mesh size increases at run time. The job scheduler showed unexpected behavior when processes were created dynamically: it did not allocate the resources efficiently, which affected the performance of the PDE solver; however, the problem was resolved by allocating more resources. Using a varying number of processes is beneficial compared to a fixed number of processes, but the benefit also depends on the computer system; on the Isis cluster we achieve a significant difference in efficiency compared to the Grad cluster.

In the future we will analyze the performance of the same system by opening OpenMP threads within the child processes. For example, consider the Isis cluster with 4 processors per node: we would use 3 nodes, with 1 child process on each node, which in turn creates OpenMP threads. The OpenMP threads solve the PDE over the shared memory within each node. To guide the operating system to create 1 process per node we will use an MPI Info object, which holds key-value pairs in which we can define on which host we want to create the processes.
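A minimal sketch of this placement mechanism, using the reserved "host" info key, could look as follows; the node name isis-node1 is hypothetical, and how the value is interpreted can differ between MPI implementations.

/* Ask the runtime to place a spawned child on a specific node via the
 * reserved "host" info key; node names here are hypothetical. */
#include <mpi.h>

void spawn_one_child_on(const char *hostname, MPI_Comm *child_comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", hostname);   /* e.g. "isis-node1" */

    MPI_Comm_spawn("child", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, child_comm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
}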


    Bibliography

[1] Barry Wilkinson & Michael Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Pearson Education International, 2005.

[2] Ananth Grama, Anshul Gupta, George Karypis & Vipin Kumar. Introduction to Parallel Computing. Addison Wesley, 2003.

[3] MPI Forum. MPI Standard Version 2.2. http://www.mpi-forum.org/docs/mpi-2.2.

[4] William Gropp, Ewing Lusk & Rajeev Thakur. Using MPI-2: Advanced Features of the Message-Passing Interface. The MIT Press, Cambridge, 1999.
