
Page 1: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Prof. Victor G. Khoroshevsky and Mikhail G. Kurnosov

Computer Systems Laboratory, The A.V. Rzhanov Institute of Semiconductor Physics of the Siberian Branch of the Russian Academy of Sciences, 13 Lavrentyev ave., 630090 Novosibirsk, Russia

E-mail: [email protected]

4th International Conference on Software and Data Technologies (ICSOFT 2009), Sofia, Bulgaria, 26-29 July 2009

Page 2: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Mapping High-Performance Linpack into a hierarchical computer cluster:

• Mapping by standard MPI tools (mpiexec): execution time 118 sec (44 GFLOPS).
• Optimized mapping: execution time 100 sec (53 GFLOPS).

[Figure: High-Performance Linpack task graph (NP=8, PMAP=1, BCAST=5)]

[Figure: a computer cluster with hierarchical organization (levels 1 and 2); two SMP nodes, each with 2 x Intel Xeon 5150]

Page 3: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Related Work


1. Mapping parallel programs into computer systems (CS) with a fixed network topology (hypercube, 3D torus, mesh, etc.), where the parallel program is represented by a task graph: (Yu, 2006), (Chen et al., 2006), (Bhanot et al., 2005), (Jose, 1999), (Ahmad, 1997), (Kalinowski, 1994), (Yau, 1993), (Ercal et al., 1990), (Lee, 1989), (Bokhari, 1981).

2. Mapping parallel programs into CSs with arbitrary topology, where the parallel program is represented by an unweighted task graph: (Ucar et al., 2006), (Prakash et al., 2004), (Miquel et al., 2003), (Träff, 2002), (Moh, 2001), (Perego, 1998), (Lee, 1989).

Algorithms that take into account the hierarchical organization of modern distributed computer systems are needed.

The objective of our research is the development of models and algorithms for mapping parallel programs into modern hierarchical computer systems (such as multicore computer clusters).

Page 4: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Model of Hierarchical Organization of Distributed Computer System

Example of the hierarchical organization of a computer cluster:

N = 12; L = 3; n_23 = 2; C_23 = {9, 10, 11, 12}; g(3, 3, 4) = 2; z(1, 7) = 1

Notation:

• C = {1, 2, …, N} is the set of processor cores;
• L is the number of levels in the communication network;
• n_l is the number of elements placed at level l ∈ {1, 2, …, L};
• n_lk is the number of children of element k ∈ {1, 2, …, n_l} at level l;
• C_lk is the set of processor cores belonging to the descendants of element k at level l; c_lk = |C_lk|;
• b_l is the bandwidth of the communication channels at level l (bit/sec).
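The slide does not define z(p, q) explicitly; assuming it denotes the level of the deepest element of the hierarchy that contains both cores p and q (so that b_z(p,q) is the bandwidth of the slowest channel on the path between them), a minimal Python sketch of the model, shaped after the example above, could look like this:

```python
# A minimal sketch of the hierarchy model (an assumption, not the authors' code):
# z(p, q) is taken as the level of the deepest element containing both cores
# p and q, so b[z(p, q)] is the bandwidth bottleneck between them.

# Ancestor chain of each core, from level 1 (root) down to level L = 3.
# This reproduces the slide's example: N = 12, C_23 = {9..12}, z(1, 7) = 1.
ancestors = {
    1: (1, 1, 1),  2: (1, 1, 1),  3: (1, 1, 2),  4: (1, 1, 2),
    5: (1, 2, 3),  6: (1, 2, 3),  7: (1, 2, 4),  8: (1, 2, 4),
    9: (1, 3, 5), 10: (1, 3, 5), 11: (1, 3, 6), 12: (1, 3, 6),
}

def z(p, q):
    """Level of the deepest common element of cores p and q (1-based)."""
    level = 0
    for ap, aq in zip(ancestors[p], ancestors[q]):
        if ap != aq:
            break
        level += 1
    return level

assert z(1, 7) == 1   # different nodes: cores communicate through level 1
assert z(9, 10) == 3  # same processor: the fastest channel
```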


Page 5: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

The Problem of Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Given a task graph G = (V, E) and a description of the hierarchical organization of the computer system (CS):

• V = {1, 2, …, M} is the set of parallel processes;
• E ⊆ V × V is the set of inter-process communications;
• d_ij is the volume of data transmitted between processes i and j over the program execution time;
• b_z(p, q) is the bandwidth of the communication channel between cores p and q.

A mapping is a function f : V → C, defined by the values of the variables x_ij.

The objective is to minimize the program execution time T(X):

T(X) = \max_{i \in V} \sum_{j=1}^{M} \sum_{p=1}^{N} \sum_{q=1}^{N} x_{ip} \, x_{jq} \, \frac{d_{ij}}{b_{z(p,q)}} \;\to\; \min_{x_{ij}}

subject to the constraints:

\sum_{j=1}^{N} x_{ij} = 1, \quad i = 1, 2, \ldots, M;

\sum_{i=1}^{M} x_{ij} \le 1, \quad j = 1, 2, \ldots, N;

x_{ij} \in \{0, 1\}, \quad i \in V, \; j \in C,

where x_{ij} = 1 if f(i) = j, and x_{ij} = 0 otherwise.
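As a concrete reading of the objective, the sketch below evaluates T(X) for a fixed mapping (an illustration, not the authors' implementation; it reuses z() and the names from the hierarchy sketch above): for each process i it sums the transfer times d_ij / b_z(f(i), f(j)) to its peers and takes the maximum over all processes.

```python
# A minimal sketch (not the authors' code) of evaluating T(X) for a fixed
# mapping f: process i -> core f[i]; reuses z() from the previous sketch.
# b[l] is the channel bandwidth at level l; d[(i, j)] is the traffic volume.

def T(f, d, b):
    """max over processes i of sum_j d_ij / b_{z(f(i), f(j))}."""
    times = {}
    for (i, j), volume in d.items():
        times[i] = times.get(i, 0.0) + volume / b[z(f[i], f[j])]
    return max(times.values(), default=0.0)

# Illustrative data: 4 processes with ring traffic, one process per core.
b = {1: 1e8, 2: 1e9, 3: 1e10}                 # level 1 is the slowest channel
d = {(0, 1): 8e8, (1, 2): 8e8, (2, 3): 8e8, (3, 0): 8e8}
f = {0: 1, 1: 2, 2: 3, 3: 4}                  # cores 1-4: one node, two processors
print(T(f, d, b))                             # dominated by the level-2 transfers
```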


Page 6: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

The Heuristic Algorithm TMMGP

The algorithm works in two steps: Step 1 (Partitioning) splits the task graph into k = M / c_L1 subsets V'_1, V'_2, V'_3, …; Step 2 (Mapping) places the subsets onto the elements of the hierarchy, whose channels have bandwidths b_1, b_2, b_3.

[Figure: a task graph partitioned into subsets V'_1, V'_2, V'_3 and mapped onto a three-level hierarchy with channel bandwidths b_1, b_2, b_3]

Page 7: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Task Graph Partitioning in the TMMGP Algorithm

1. Coarsen the graph by Heavy Edge Matching (Karypis, Kumar, 1998).
2. Partition the coarse graph G_m into k subsets by recursive bisection (Schloegel et al., 2003).
3. Refine the partition by the FM heuristic (Fiduccia, Mattheyses, 1982).

The computational complexity of the TMMGP algorithm is O(|E| log_2 k + M).
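To make the scheme concrete, here is an illustrative sketch of step 2 in isolation: a plain recursive bisection in which a simple BFS region-growing split stands in for the multilevel bisection of Schloegel et al.; the HEM coarsening of step 1 and the FM refinement of step 3 are omitted, so this is not the TMMGP implementation.

```python
# An illustrative sketch of partitioning by recursive bisection (step 2 only).
from collections import deque

def bisect(adj, vertices):
    """Split `vertices` into two roughly equal halves by growing a BFS region."""
    target = len(vertices) // 2
    inside = set(vertices)
    part, queue = set(), deque([vertices[0]])
    while queue and len(part) < target:
        v = queue.popleft()
        if v in part:
            continue
        part.add(v)
        queue.extend(u for u in adj.get(v, ()) if u in inside and u not in part)
    for v in vertices:                # top up if the graph is disconnected
        if len(part) >= target:
            break
        part.add(v)
    return [v for v in vertices if v in part], [v for v in vertices if v not in part]

def partition(adj, vertices, k):
    """Partition `vertices` into k subsets V'_1 ... V'_k by recursive bisection."""
    if k <= 1:
        return [list(vertices)]
    left, right = bisect(adj, vertices)
    return partition(adj, left, k // 2) + partition(adj, right, k - k // 2)

# Example: a 2 x 4 grid graph split into k = 4 subsets.
adj = {0: [1, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 7],
       4: [0, 5], 5: [1, 4, 6], 6: [2, 5, 7], 7: [3, 6]}
print(partition(adj, list(adj), 4))
```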

Page 8: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Software Tools for Mapping MPI Programs


Page 9: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Organization of the Experiments

MPI programs:

• NAS Parallel Benchmarks (NPB);

• High-Performance Linpack (HPL).

Computer clusters:

• Cluster Xeon16: 4 nodes (2 x Intel Xeon 5150), interconnect: Gigabit/Fast Ethernet;

• Cluster Opteron10: 5 nodes (2 x AMD Opteron 248), interconnect: Gigabit/Fast Ethernet.

[Figures: task graphs of the benchmarks, 16 processes each, laid out as 4 x 4 process grids]

• HPL task graph: 16 processes, PMAP=0, BCAST=5.
• NPB Conjugate Gradient task graph: 16 processes, CLASS B.
• NPB Multigrid task graph: 16 processes, CLASS B.

Page 10: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Experimental Results

The execution time of the TMMGP algorithm on an Intel Core 2 Duo 2.13 GHz processor is less than 1 sec.

The execution time of the MPI benchmarks on the Xeon16 cluster:

Benchmark                  Interconnect       T(X_RR), sec   T(X_TMMGP), sec   Speedup T(X_RR) / T(X_TMMGP)
High-Performance Linpack   Fast Ethernet          1108.69           911.81     1.22
High-Performance Linpack   Gigabit Ethernet        263.15           231.72     1.14
NPB Conjugate Gradient     Fast Ethernet           726.02           400.36     1.81
NPB Conjugate Gradient     Gigabit Ethernet         97.56            42.05     2.32
NPB Multigrid              Fast Ethernet            23.94            23.90     1.00
NPB Multigrid              Gigabit Ethernet          4.06             4.03     1.00

• T(X_RR) is the execution time of an MPI benchmark mapped by the round-robin algorithm of the mpiexec tool (MPICH2 1.0.6).
• T(X_TMMGP) is the execution time of an MPI benchmark mapped by the TMMGP algorithm.
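For reference, the round-robin baseline places consecutive MPI ranks on hosts cyclically, ignoring the communication pattern; a hypothetical two-line illustration (host names assumed, not the actual cluster configuration):

```python
# Hypothetical sketch of round-robin placement: rank r goes to host r mod n,
# regardless of which ranks communicate heavily with each other.
hosts = ["node1", "node2", "node3", "node4"]          # assumed host order
mapping = {rank: hosts[rank % len(hosts)] for rank in range(16)}
print(mapping)  # ranks 0, 4, 8, 12 end up on node1, and so on
```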

Page 11: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Conclusions and Future Work

Conclusions

• Mapping algorithms must take into account the hierarchical organization of modern computer systems and the structures of parallel programs.
• The proposed TMMGP algorithm reduces the execution time of MPI programs by 40% on average.
• New algorithms for mapping parallel programs with full task graphs are required.

Future Work

• Development of new algorithms for mapping parallel programs into arbitrary subsystems of hierarchical distributed computer systems.
• Integration of the TMMGP mapping algorithm with the mpiexec tool and with resource management systems (such as TORQUE).
• Application of the described approach to optimizing MPI collective operations.

Page 12: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Victor G. Khoroshevsky and Mikhail G. Kurnosov

Computer Systems Laboratory, The A.V. Rzhanov Institute of Semiconductor Physics of the Siberian Branch of the Russian Academy of Sciences, 13 Lavrentyev ave., 630090 Novosibirsk, Russia

E-mail: [email protected]

4th International Conference on Software and Data Technologies (ICSOFT 2009), Sofia, Bulgaria, 26-29 July 2009

Thank You For Your Attention

Page 13: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Backup Slides


Page 14: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

The k-way Graph Partitioning Problem

It is required to partition a graph G' = (V', E') into k disjoint subsets V'_1, V'_2, …, V'_k such that the maximal sum of edge weights incident to any subset is minimized and |V'_i| ≤ s, where:

• w(u, v) is the weight of edge (u, v) ∈ E';
• W(i, j) is an additional weight for edges incident to subsets i and j;
• c(u, v, i, j) = w(u, v) · W(i, j) is the total weight of edge (u, v) incident to subsets i and j.

For task graphs the edge cost takes the form c(u, v, i, j) = d_uv / b_g(L, i, j).

An example of 3-way graph partitioning:

V' = {1, 2, …, 12}; k = 3; s = 3; W(1, 2) = 3; W(1, 3) = 2; W(2, 3) = 4.

The approximate partition:

edge-weights(V'_1) = w(1, 5) · W(1, 2) + w(6, 8) · W(1, 3) + w(2, 3) · W(1, 3) = 22;
edge-weights(V'_2) = 40;
edge-weights(V'_3) = 38.

[Figure: the example graph partitioned into subsets V'_1, V'_2, V'_3]
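A minimal sketch of evaluating a k-way partition under this cost model (names illustrative, not the authors' code): each cut edge (u, v) between subsets i and j contributes c(u, v, i, j) = w(u, v) · W(i, j) to the edge-weight of both incident subsets, and the partition quality is the maximum subset edge-weight.

```python
# A minimal sketch of the k-way partition cost from the slide (names illustrative).
# part_of[v] is the subset index of vertex v; W[(i, j)] with i < j is the
# additional inter-subset weight; each cut edge counts toward both subsets.

def edge_weights(edges, part_of, W, k):
    weight = {i: 0.0 for i in range(1, k + 1)}
    for (u, v), w in edges.items():
        i, j = sorted((part_of[u], part_of[v]))
        if i != j:
            c = w * W[(i, j)]        # c(u, v, i, j) = w(u, v) * W(i, j)
            weight[i] += c           # the cut edge is incident to both subsets
            weight[j] += c
    return weight

# The objective is to minimize max(edge_weights(...).values()),
# subject to the balance constraint |V'_i| <= s.
```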

Page 15: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Heavy Edge Matching Algorithm

[Figure: a 16-vertex source graph (4 x 4 grid), a heavy-edge matching on it, and the resulting coarser graph]
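A minimal sketch of the Heavy Edge Matching coarsening step, following the usual formulation (Karypis, Kumar, 1998): visit the vertices in random order and match each unmatched vertex with the unmatched neighbour reached by its heaviest incident edge; matched pairs collapse into single vertices of the coarser graph.

```python
# A minimal sketch of Heavy Edge Matching (after Karypis and Kumar, 1998).
# adj[v] = {u: weight}. Matched pairs become the vertices of the coarser graph.

import random

def heavy_edge_matching(adj):
    order = list(adj)
    random.shuffle(order)                 # random visiting order
    matched, groups = set(), []
    for v in order:
        if v in matched:
            continue
        free = [(w, u) for u, w in adj[v].items() if u not in matched and u != v]
        if free:
            _, u = max(free)              # heaviest incident edge wins
            matched.update((v, u))
            groups.append((v, u))
        else:
            matched.add(v)
            groups.append((v,))           # v stays unmatched (a singleton)
    return groups
```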

Page 16: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Graph Bisection

[Figure: bisection of a weighted graph grown from an initial vertex; edge weights shown]