
Page 1: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Prof. Victor G. Khoroshevsky and Mikhail G. Kurnosov

Computer Systems Laboratory, The A.V. Rzhanov Institute of Semiconductor Physics of the Siberian Branch of the Russian Academy of Sciences, 13 Lavrentyev ave., 630090 Novosibirsk, Russia

E-mail: [email protected]

4th International Conference on Software and Data Technologies (ICSOFT 2009), Sofia, Bulgaria, 26-29 July 2009

Page 2: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Mapping High-Performance Linpack into a hierarchical computer cluster:

• Mapping by standard MPI tools (mpiexec): execution time 118 sec (44 GFLOPS).
• Optimized mapping: execution time 100 sec (53 GFLOPS).

[Figure: High-Performance Linpack task graph (NP=8, PMAP=1, BCAST=5)]

[Figure: a computer cluster with hierarchical organization (levels 1 and 2); two SMP nodes, each with 2 x Intel Xeon 5150]

Page 3: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Related Work


1. Mapping parallel programs into computer systems (CS) with a fixed network topology (hypercube, 3D torus, mesh, etc.), where the parallel program is represented by a task graph: (Yu, 2006), (Chen et al., 2006), (Bhanot et al., 2005), (Jose, 1999), (Ahmad, 1997), (Kalinowski, 1994), (Yau, 1993), (Ercal et al., 1990), (Lee, 1989), (Bokhari, 1981).

2. Mapping parallel programs into CSs with arbitrary topology, where the parallel program is represented by an unweighted task graph: (Ucar et al., 2006), (Prakash et al., 2004), (Miquel et al., 2003), (Träff, 2002), (Moh, 2001), (Perego, 1998), (Lee, 1989).

Algorithms that take into account the hierarchical organization of modern distributed computer systems are needed.

The objective of our research is the development of models and algorithms for mapping parallel programs into modern hierarchical computer systems (such as multicore computer clusters).

Page 4: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Model of Hierarchical Organization of Distributed Computer System

Example of the hierarchical organization of a computer cluster:

N = 12; L = 3; n_23 = 2; C_23 = {9, 10, 11, 12}; g(3, 3, 4) = 2; z(1, 7) = 1

Notation:

• C = {1, 2, …, N} is the set of processor cores;
• L is the number of levels in the communication network;
• n_l is the number of elements placed at level l ∈ {1, 2, …, L};
• n_lk is the number of children of element k ∈ {1, 2, …, n_l} at level l;
• C_lk is the set of processor cores belonging to the descendants of element k at level l; c_lk = |C_lk|;
• b_l is the bandwidth of the communication channels at level l (bit/sec).
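The slide does not define z(p, q) explicitly; assuming it denotes the level of the deepest element of the hierarchy that contains both cores p and q (so that b_z(p,q) is the bandwidth of the slowest channel on the path between them), a minimal Python sketch of the model, shaped after the example above, could look like this:

```python
# A minimal sketch of the hierarchy model (an assumption, not the authors' code):
# z(p, q) is taken as the level of the deepest element containing both cores
# p and q, so b[z(p, q)] is the bandwidth bottleneck between them.

# Ancestor chain of each core, from level 1 (root) down to level L = 3.
# This reproduces the slide's example: N = 12, C_23 = {9..12}, z(1, 7) = 1.
ancestors = {
    1: (1, 1, 1),  2: (1, 1, 1),  3: (1, 1, 2),  4: (1, 1, 2),
    5: (1, 2, 3),  6: (1, 2, 3),  7: (1, 2, 4),  8: (1, 2, 4),
    9: (1, 3, 5), 10: (1, 3, 5), 11: (1, 3, 6), 12: (1, 3, 6),
}

def z(p, q):
    """Level of the deepest common element of cores p and q (1-based)."""
    level = 0
    for ap, aq in zip(ancestors[p], ancestors[q]):
        if ap != aq:
            break
        level += 1
    return level

assert z(1, 7) == 1   # different nodes: cores communicate through level 1
assert z(9, 10) == 3  # same processor: the fastest channel
```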


Page 5: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

The Problem of Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Given a task graph G = (V, E) and a description of the hierarchical organization of the computer system (CS):

• V = {1, 2, …, M} is the set of parallel processes;
• E ⊆ V × V is the set of inter-process communications;
• d_ij is the volume of data transmitted between processes i and j over the program execution time;
• b_z(p, q) is the bandwidth of the communication channel between cores p and q.

A mapping is a function f : V → C, defined by the values of the variables x_ij.

The objective is to minimize the program execution time T(X):

T(X) = \max_{i \in V} \sum_{j=1}^{M} \sum_{p=1}^{N} \sum_{q=1}^{N} x_{ip} \, x_{jq} \, \frac{d_{ij}}{b_{z(p,q)}} \;\to\; \min_{x_{ij}}

subject to the constraints:

\sum_{j=1}^{N} x_{ij} = 1, \quad i = 1, 2, \ldots, M;

\sum_{i=1}^{M} x_{ij} \le 1, \quad j = 1, 2, \ldots, N;

x_{ij} \in \{0, 1\}, \quad i \in V, \; j \in C,

where x_{ij} = 1 if f(i) = j, and x_{ij} = 0 otherwise.
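As a concrete reading of the objective, the sketch below evaluates T(X) for a fixed mapping (an illustration, not the authors' implementation; it reuses z() and the names from the hierarchy sketch above): for each process i it sums the transfer times d_ij / b_z(f(i), f(j)) to its peers and takes the maximum over all processes.

```python
# A minimal sketch (not the authors' code) of evaluating T(X) for a fixed
# mapping f: process i -> core f[i]; reuses z() from the previous sketch.
# b[l] is the channel bandwidth at level l; d[(i, j)] is the traffic volume.

def T(f, d, b):
    """max over processes i of sum_j d_ij / b_{z(f(i), f(j))}."""
    times = {}
    for (i, j), volume in d.items():
        times[i] = times.get(i, 0.0) + volume / b[z(f[i], f[j])]
    return max(times.values(), default=0.0)

# Illustrative data: 4 processes with ring traffic, one process per core.
b = {1: 1e8, 2: 1e9, 3: 1e10}                 # level 1 is the slowest channel
d = {(0, 1): 8e8, (1, 2): 8e8, (2, 3): 8e8, (3, 0): 8e8}
f = {0: 1, 1: 2, 2: 3, 3: 4}                  # cores 1-4: one node, two processors
print(T(f, d, b))                             # dominated by the level-2 transfers
```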


Page 6: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

The Heuristic Algorithm TMMGP

The algorithm works in two steps: Step 1 (Partitioning) splits the task graph into k = M / c_L1 subsets V'_1, V'_2, V'_3, …; Step 2 (Mapping) places the subsets onto the elements of the hierarchy, whose channels have bandwidths b_1, b_2, b_3.

[Figure: a task graph partitioned into subsets V'_1, V'_2, V'_3 and mapped onto a three-level hierarchy with channel bandwidths b_1, b_2, b_3]

Page 7: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Task Graph Partitioning in the TMMGP Algorithm

1. Coarsen the graph by Heavy Edge Matching (Karypis, Kumar, 1998).
2. Partition the coarse graph G_m into k subsets by recursive bisection (Schloegel et al., 2003).
3. Refine the partition by the FM heuristic (Fiduccia, Mattheyses, 1982).

The computational complexity of the TMMGP algorithm is O(|E| log_2 k + M).
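To make the scheme concrete, here is an illustrative sketch of step 2 in isolation: a plain recursive bisection in which a simple BFS region-growing split stands in for the multilevel bisection of Schloegel et al.; the HEM coarsening of step 1 and the FM refinement of step 3 are omitted, so this is not the TMMGP implementation.

```python
# An illustrative sketch of partitioning by recursive bisection (step 2 only).
from collections import deque

def bisect(adj, vertices):
    """Split `vertices` into two roughly equal halves by growing a BFS region."""
    target = len(vertices) // 2
    inside = set(vertices)
    part, queue = set(), deque([vertices[0]])
    while queue and len(part) < target:
        v = queue.popleft()
        if v in part:
            continue
        part.add(v)
        queue.extend(u for u in adj.get(v, ()) if u in inside and u not in part)
    for v in vertices:                # top up if the graph is disconnected
        if len(part) >= target:
            break
        part.add(v)
    return [v for v in vertices if v in part], [v for v in vertices if v not in part]

def partition(adj, vertices, k):
    """Partition `vertices` into k subsets V'_1 ... V'_k by recursive bisection."""
    if k <= 1:
        return [list(vertices)]
    left, right = bisect(adj, vertices)
    return partition(adj, left, k // 2) + partition(adj, right, k - k // 2)

# Example: a 2 x 4 grid graph split into k = 4 subsets.
adj = {0: [1, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 7],
       4: [0, 5], 5: [1, 4, 6], 6: [2, 5, 7], 7: [3, 6]}
print(partition(adj, list(adj), 4))
```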

Page 8: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Software Tools for Mapping MPI Programs


Page 9: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Organization of the Experiments

MPI programs:

• NAS Parallel Benchmarks (NPB);

• High-Performance Linpack (HPL).

Computer clusters:

• Cluster Xeon16: 4 nodes (2 x Intel Xeon 5150), interconnect: Gigabit/Fast Ethernet;

• Cluster Opteron10: 5 nodes (2 x AMD Opteron 248), interconnect: Gigabit/Fast Ethernet.

[Figures: task graphs of the benchmarks, 16 processes each, laid out as 4 x 4 process grids]

• HPL task graph: 16 processes, PMAP=0, BCAST=5.
• NPB Conjugate Gradient task graph: 16 processes, CLASS B.
• NPB Multigrid task graph: 16 processes, CLASS B.

Page 10: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Experimental Results

The execution time of the TMMGP algorithm on an Intel Core 2 Duo 2.13 GHz processor is less than 1 sec.

The execution time of the MPI benchmarks on the Xeon16 cluster:

Benchmark                  Interconnect       T(X_RR), sec   T(X_TMMGP), sec   Speedup T(X_RR) / T(X_TMMGP)
High-Performance Linpack   Fast Ethernet          1108.69           911.81     1.22
High-Performance Linpack   Gigabit Ethernet        263.15           231.72     1.14
NPB Conjugate Gradient     Fast Ethernet           726.02           400.36     1.81
NPB Conjugate Gradient     Gigabit Ethernet         97.56            42.05     2.32
NPB Multigrid              Fast Ethernet            23.94            23.90     1.00
NPB Multigrid              Gigabit Ethernet          4.06             4.03     1.00

• T(X_RR) is the execution time of an MPI benchmark mapped by the round-robin algorithm of the mpiexec tool (MPICH2 1.0.6).
• T(X_TMMGP) is the execution time of an MPI benchmark mapped by the TMMGP algorithm.
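For reference, the round-robin baseline places consecutive MPI ranks on hosts cyclically, ignoring the communication pattern; a hypothetical two-line illustration (host names assumed, not the actual cluster configuration):

```python
# Hypothetical sketch of round-robin placement: rank r goes to host r mod n,
# regardless of which ranks communicate heavily with each other.
hosts = ["node1", "node2", "node3", "node4"]          # assumed host order
mapping = {rank: hosts[rank % len(hosts)] for rank in range(16)}
print(mapping)  # ranks 0, 4, 8, 12 end up on node1, and so on
```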

Page 11: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Conclusions and Future Work

Conclusions

• Mapping algorithms must take into account the hierarchical organization of modern computer systems and the structures of parallel programs.
• The proposed TMMGP algorithm reduces the execution time of MPI programs by 40% on average.
• New algorithms for mapping parallel programs with full task graphs are required.

Future Work

• Development of new algorithms for mapping parallel programs into arbitrary subsystems of hierarchical distributed computer systems.
• Integration of the TMMGP mapping algorithm with the mpiexec tool and with resource management systems (such as TORQUE).
• Application of the described approach to optimizing MPI collective operations.

Page 12: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Victor G. Khoroshevsky and Mikhail G. Kurnosov

Computer Systems Laboratory, The A.V. Rzhanov Institute of Semiconductor Physics of the Siberian Branch of the Russian Academy of Sciences, 13 Lavrentyev ave., 630090 Novosibirsk, Russia

E-mail: [email protected]

4th International Conference on Software and Data Technologies (ICSOFT 2009), Sofia, Bulgaria, 26-29 July 2009

Thank You For Your Attention

Page 13: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Backup Slides


Page 14: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

The k-way Graph Partitioning Problem

It is required to partition a graph G' = (V', E') into k disjoint subsets V'_1, V'_2, …, V'_k such that the maximal sum of edge weights incident to any subset is minimized and |V'_i| ≤ s, where:

• w(u, v) is the weight of edge (u, v) ∈ E';
• W(i, j) is an additional weight for edges incident to subsets i and j;
• c(u, v, i, j) = w(u, v) · W(i, j) is the total weight of edge (u, v) incident to subsets i and j.

For task graphs the edge cost takes the form c(u, v, i, j) = d_uv / b_g(L, i, j).

An example of 3-way graph partitioning:

V' = {1, 2, …, 12}; k = 3; s = 3; W(1, 2) = 3; W(1, 3) = 2; W(2, 3) = 4.

The approximate partition:

edge-weights(V'_1) = w(1, 5) · W(1, 2) + w(6, 8) · W(1, 3) + w(2, 3) · W(1, 3) = 22;
edge-weights(V'_2) = 40;
edge-weights(V'_3) = 38.

[Figure: the example graph partitioned into subsets V'_1, V'_2, V'_3]
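A minimal sketch of evaluating a k-way partition under this cost model (names illustrative, not the authors' code): each cut edge (u, v) between subsets i and j contributes c(u, v, i, j) = w(u, v) · W(i, j) to the edge-weight of both incident subsets, and the partition quality is the maximum subset edge-weight.

```python
# A minimal sketch of the k-way partition cost from the slide (names illustrative).
# part_of[v] is the subset index of vertex v; W[(i, j)] with i < j is the
# additional inter-subset weight; each cut edge counts toward both subsets.

def edge_weights(edges, part_of, W, k):
    weight = {i: 0.0 for i in range(1, k + 1)}
    for (u, v), w in edges.items():
        i, j = sorted((part_of[u], part_of[v]))
        if i != j:
            c = w * W[(i, j)]        # c(u, v, i, j) = w(u, v) * W(i, j)
            weight[i] += c           # the cut edge is incident to both subsets
            weight[j] += c
    return weight

# The objective is to minimize max(edge_weights(...).values()),
# subject to the balance constraint |V'_i| <= s.
```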

Page 15: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Heavy Edge Matching Algorithm

[Figure: a 16-vertex source graph (4 x 4 grid), a heavy-edge matching on it, and the resulting coarser graph]
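A minimal sketch of the Heavy Edge Matching coarsening step, following the usual formulation (Karypis, Kumar, 1998): visit the vertices in random order and match each unmatched vertex with the unmatched neighbour reached by its heaviest incident edge; matched pairs collapse into single vertices of the coarser graph.

```python
# A minimal sketch of Heavy Edge Matching (after Karypis and Kumar, 1998).
# adj[v] = {u: weight}. Matched pairs become the vertices of the coarser graph.

import random

def heavy_edge_matching(adj):
    order = list(adj)
    random.shuffle(order)                 # random visiting order
    matched, groups = set(), []
    for v in order:
        if v in matched:
            continue
        free = [(w, u) for u, w in adj[v].items() if u not in matched and u != v]
        if free:
            _, u = max(free)              # heaviest incident edge wins
            matched.update((v, u))
            groups.append((v, u))
        else:
            matched.add(v)
            groups.append((v,))           # v stays unmatched (a singleton)
    return groups
```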

Page 16: Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Graph Bisection

[Figure: bisection of a weighted graph grown from an initial vertex; edge weights shown]