Distributed communication-aware load balancing with TreeMatch in Charm++

The 11th workshop of the Joint Laboratory for Petascale Computing, Sophia-Antipolis

Emmanuel Jeannot, Guillaume Mercier, Francois Tessier

In collaboration with the Charm++ Team from the PPL: Esteban Meneses-Rojas, Gengbin Zheng, Sanjay Kale

June 9, 2014

Francois Tessier TreeMatch in Charm++ 1/ 19

Introduction

Scalable execution of parallel applications

Number of cores is increasing

But memory per core is decreasing

Applications will need to communicate even more than they do now

Our solution

Process placement should take process affinity into account

Here: load balancing in Charm++ considering:

CPU load
process affinity (or other communicating objects)
topology: network and intra-node

Charm++

Features

Parallel object-oriented programming language based on C++

Programs are decomposed into a number of cooperating message-driven objects called chares.

In general we have more chares than processing units

Chares are mapped to physical processors by an adaptive runtime system

Load balancers can be called to migrate chares

Charm++ can use MPI for inter-process communication

Process Placement

Why we should consider it

Many current and future parallel platforms have several levels of hierarchy

Application chares/processes do not exchange the same amount of data (affinity)

The process placement policy may have an impact on performance

Cache hierarchy, memory bus, high-performance network...

Figure: hierarchy of a parallel platform (switch, cabinets, nodes, processors, cores)

Problems

Given

The parallel machine topology

The application communication pattern

Map application processes to physical resources (cores) so as to reduce the communication costs (NP-complete)
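The objective above can be made concrete with a small sketch. It assumes a cost model that the slides do not spell out: the volume exchanged by two processes is weighted by how many levels of the topology tree must be climbed before their cores share an ancestor. The names `tree_distance`, `placement_cost` and `arities` are hypothetical and chosen for illustration only.

```python
def tree_distance(core_a, core_b, arities):
    """Number of tree levels to climb before two cores share an ancestor.

    `arities` lists the fan-out of each level from the root down,
    e.g. [2, 2] for 2 processors x 2 cores (4 leaves). Assumed model.
    """
    distance = 0
    # Walk up from the leaves: divide by the arity of each level, deepest first.
    for arity in reversed(arities):
        if core_a == core_b:
            break
        core_a //= arity
        core_b //= arity
        distance += 1
    return distance


def placement_cost(comm, sigma, arities):
    """Sum of comm[i][j] * distance(sigma[i], sigma[j]) over all pairs i < j."""
    n = len(comm)
    return sum(
        comm[i][j] * tree_distance(sigma[i], sigma[j], arities)
        for i in range(n)
        for j in range(i + 1, n)
    )


# Four processes: 0 and 2 exchange a lot of data, and so do 1 and 3.
comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
arities = [2, 2]  # 2 processors x 2 cores (illustrative machine)

identity = [0, 1, 2, 3]        # process i bound to core i
affinity_aware = [0, 2, 1, 3]  # heavy pairs share a processor

cost_identity = placement_cost(comm, identity, arities)
cost_aware = placement_cost(comm, affinity_aware, arities)
```

On this toy input the affinity-aware permutation lowers the modelled cost because the two heavy pairs end up under the same processor.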

Figure: communication matrix zeus16.map (sender rank vs. receiver rank)

TreeMatch

The TreeMatch Algorithm

Algorithm and environment to compute process placement based on process affinities and NUMA topology

Input:

The communication pattern of the application:
Preliminary execution with a monitored MPI implementation for static placement
Dynamic recovery on iterative applications with Charm++

A model (tree) of the underlying architecture: Hwloc can provide this.

Output:

A process permutation σ such that σi is the core number to which process i must be bound

TreeMatch only works on tree topologies. How can a 3d torus be handled?
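To illustrate what a permutation σ computed from affinities and a tree looks like, here is a deliberately simplified sketch. It is not the TreeMatch implementation: it assumes a binary tree, a number of processes that is a power of two, and a greedy heaviest-pair-first grouping at each level. All function names are hypothetical.

```python
def greedy_pairs(comm):
    """Pair up items greedily, heaviest-communicating pair first."""
    n = len(comm)
    candidates = sorted(
        ((comm[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used, pairs = set(), []
    for _, i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


def aggregate(comm, pairs):
    """Communication matrix between pairs: sum of volumes between their members."""
    return [
        [0 if a == b else sum(comm[i][j] for i in pa for j in pb)
         for b, pb in enumerate(pairs)]
        for a, pa in enumerate(pairs)
    ]


def binary_tree_placement(comm):
    """Return sigma, where sigma[i] is the leaf (core) assigned to process i."""
    n = len(comm)
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of processes")
    groups = [[i] for i in range(n)]  # each group lists its processes in leaf order
    while len(groups) > 1:
        pairs = greedy_pairs(comm)
        groups = [groups[a] + groups[b] for a, b in pairs]
        comm = aggregate(comm, pairs)  # one matrix row per newly formed group
    sigma = [0] * n
    for core, process in enumerate(groups[0]):
        sigma[process] = core
    return sigma


comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
sigma = binary_tree_placement(comm)
```

Here processes 0 and 2 are grouped first, then 1 and 3, so each heavy pair lands on sibling leaves of the tree.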

Network placement

libtopomap

T. Hoefler and M. Snir, "Generic Topology Mapping Strategies for Large-Scale Parallel Architectures", Proc. Int'l Conf. Supercomputing (ICS), pp. 75-84, 2011.

Library for mapping processes onto various network topologies

Used in TreeMatchLB to consider the Blue Waters 3d torus

Figure: 3d Torus and a Cray Gemini router

Load balancing

Principle

Iterative applications

Load balancer called at regular intervals

Migrate chares in order to optimize several criteria

Charm++ runtime system provides:

chares load
chares affinity
etc.

Constraints

Dealing with complex modern architectures

Taking into account communications between elements

Some other communication-aware load-balancing algorithms

[L. L. Pilla, et al. 2012] NUCOLB, shared memory machines

[L. L. Pilla, et al. 2012] HwTopoLB

Some "built-in" Charm++ load balancers: RefineCommLB, GreedyCommLB...

Several issues raised

Not so easy...

Several issues raised!

Scalability of TreeMatch

Need to find a relevant compromise between process affinities and load balancing

What about load balancing time?

The next slides present our load balancer, which relies on TreeMatch and libtopomap to perform parallel and distributed communication-aware load balancing.

Strategy for Charm++ - Network Placement

First step : minimize communication cost on network

libtopomap reorders processes from a communicator

How to use it to reorder groups of processes (or chares)? Example: groups of chares on nodes

Charm++ uses MPI: full access to the MPI API

New MPI communicator with MPI_Comm_split
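The semantics of MPI_Comm_split can be sketched without an MPI installation: ranks that pass the same color end up in the same new communicator, ordered by key (ties broken by old rank), and ranks that pass MPI_UNDEFINED get no communicator. The pure-Python simulation below only illustrates those semantics. Using one representative rank per node is an assumption about how the new communicator might be built, not a statement of what TreeMatchLB does, and `comm_split`, `cores_per_node` and `world_size` are hypothetical names.

```python
UNDEFINED = None  # stands in for MPI_UNDEFINED in this simulation


def comm_split(colors, keys):
    """Simulate MPI_Comm_split: for each old rank, return (color, new_rank) or None."""
    members = {}
    for rank, color in enumerate(colors):
        if color is UNDEFINED:
            continue  # this rank does not join any new communicator
        members.setdefault(color, []).append(rank)
    result = [None] * len(colors)
    for color, ranks in members.items():
        # New ranks are ordered by key, ties broken by rank in the old communicator.
        ordered = sorted(ranks, key=lambda r: (keys[r], r))
        for new_rank, old_rank in enumerate(ordered):
            result[old_rank] = (color, new_rank)
    return result


# Illustrative setup: 8 ranks, 2 cores per node, so 4 nodes.
cores_per_node = 2
world_size = 8

# Assumed scheme: the first rank of each node joins the new communicator.
colors = [0 if rank % cores_per_node == 0 else UNDEFINED for rank in range(world_size)]
keys = [rank // cores_per_node for rank in range(world_size)]  # node index as key

leaders = comm_split(colors, keys)
```

With such a communicator, a library that reorders the ranks of a communicator would effectively reorder nodes, and therefore the groups of chares they host.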

Figure: nodes 0 to 3 on the network (3d torus, tree, …) and the new communicator

Strategy for Charm++ - Intra-node placement

TreeMatch load balancer

1st step: Remap groups of chares on nodes according to the communication on the network

libtopomap (example: part of 3d Torus)

2nd step: Reorder chares inside each node (distributed)

Apply TreeMatch on the NUMA topology and the chares communication pattern
Bind chares according to their load (leveling on less loaded chares)
Each node carries out its own placement in parallel
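The load-leveling part of the second step can be illustrated with a classic greedy heuristic: take groups of chares from heaviest to lightest and give each one to the currently least loaded core. This is a sketch under stated assumptions, since the slides do not describe the exact leveling rule used by the load balancer. The function `level_load` and the sample loads are hypothetical.

```python
import heapq


def level_load(group_loads, n_cores):
    """Assign each group of chares to the least loaded core, heaviest group first."""
    heap = [(0.0, core) for core in range(n_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(group_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for group, load in ordered:
        core_load, core = heapq.heappop(heap)  # least loaded core so far
        assignment[group] = core
        heapq.heappush(heap, (core_load + load, core))
    core_loads = [0.0] * n_cores
    for group, core in assignment.items():
        core_loads[core] += group_loads[group]
    return assignment, core_loads


# Illustrative loads for five groups of chares on a two-core node.
group_loads = {"g0": 5.0, "g1": 3.0, "g2": 3.0, "g3": 2.0, "g4": 1.0}
assignment, core_loads = level_load(group_loads, n_cores=2)
```

Because each node only needs the loads of its own chares, every node could run such a routine independently, which matches the distributed nature of the second step.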

Figure: groups of chares assigned to nodes, with their CPU load, over the network (3d torus, hierarchical, …)

Figure: Part of a 3d Torus attributed by the resource manager

Figure: chares inside one node, then groups of chares assigned to cores, with their CPU load

Results

kNeighbor

Benchmark application designed to simulate intensive communication between processes

Experiments on 8 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM Cluster

Particularly compared to RefineCommLB:

Takes into account load and communication
Minimizes migrations
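The kind of communication pattern such a benchmark produces can be sketched as a matrix. The sketch assumes that elements sit on a ring and that each one sends a fixed-size message to the next k elements at every iteration. The actual kNeighbor benchmark may select its neighbors differently, and 1 MB is taken here as 2^20 bytes. `kneighbor_matrix` is a hypothetical helper.

```python
def kneighbor_matrix(n_elements, k, message_bytes):
    """Bytes sent per iteration from each sender (row) to each receiver (column)."""
    if not 0 <= k < n_elements:
        raise ValueError("k must be smaller than the number of elements")
    comm = [[0] * n_elements for _ in range(n_elements)]
    for sender in range(n_elements):
        for offset in range(1, k + 1):
            receiver = (sender + offset) % n_elements  # wrap around the ring
            comm[sender][receiver] += message_bytes
    return comm


# 128 elements, 7 neighbors, 1 MB messages (assumed to be 2**20 bytes).
comm = kneighbor_matrix(n_elements=128, k=7, message_bytes=1 << 20)
bytes_per_iteration = sum(map(sum, comm))
```

A matrix like this one is the affinity information a communication-aware load balancer works from.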

Figure: execution time (in seconds) of kNeighbor on 64 cores, 128 elements, 1MB message size, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased

Figure: execution time (in seconds) of kNeighbor on 64 cores, 256 elements, 1MB message size, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased

Results

kNeighbor

Experiments on 16 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM Cluster

1 MB messages - 100 iterations - 7-Neighbor

Figure: Execution time versus chares by core. Average time for each 7-kNeighbor iteration (in ms) against the number of chares by core (1, 2, 4, 8, 16), for DummyLB and TreeMatchLB

Results

kNeighbor

Experiments on 16 nodes with 32 cores each (AMD Interlagos 6276) - Blue Waters Cluster

1 MB messages - 100 iterations - 7-Neighbor

Poor performance...

Figure: Execution time versus chares by core. Average time for each 7-kNeighbor iteration (in ms) against the number of chares by core (1, 2, 4, 8, 16), for DummyLB and TreeMatchLB

Results

Stencil3D

3-dimensional stencil with regular communication with fixed neighbors

One chare per core: balance only considering communications

Only one load balancing step after 10 iterations

Experiments on 8 nodes with 8 cores each (Intel Xeon 5550)
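The fixed-neighbor pattern of a 3-dimensional stencil can be sketched as follows. The sketch assumes a 6-point stencil with periodic (wrap-around) boundaries and a 4 x 4 x 4 decomposition giving 64 elements. Neither the boundary handling nor the decomposition is stated on the slide, and `stencil3d_neighbors` is a hypothetical helper.

```python
def stencil3d_neighbors(index, dims):
    """Six face neighbors of a block in a 3D grid with wrap-around boundaries."""
    x, y, z = index
    nx, ny, nz = dims
    return [
        ((x - 1) % nx, y, z), ((x + 1) % nx, y, z),
        (x, (y - 1) % ny, z), (x, (y + 1) % ny, z),
        (x, y, (z - 1) % nz), (x, y, (z + 1) % nz),
    ]


dims = (4, 4, 4)  # assumed decomposition: 64 elements, one per core
corner_neighbors = stencil3d_neighbors((0, 0, 0), dims)
inner_neighbors = stencil3d_neighbors((1, 2, 3), dims)
```

Since every element always talks to the same six partners, a single load balancing step based on the observed communication is enough to fix the placement for the rest of the run.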

Figure: execution time (in seconds) of Stencil3D on 64 cores, 64 elements, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased

Results

What about the load balancing time?

Comparison between the sequential and the distributed versions of TreeMatchLB

The master node distributes the data to each node, which then computes its own chares placement. This data distribution can be done in parallel (around 20% improvement)

Figure: Time repartition for each step of the load balancing process. Time in seconds for the sequential and distributed versions with 4096, 8192 and 16384 chares, split into Initialization, TM Sequential and TM Parallel

Figure: 4096 Chares - reverse - Par. Timeline for the master and nodes 0 to 7 showing the Init, Distribute, Calculate, Return and Process results steps

Results

What about the load balancing time?

Linear trajectory as the number of chares doubles

TreeMatchLB is slower than the Greedy strategies

RefineCommLB, which provides some good results for communication-bound applications, is not scalable (fails from 8192 chares onward)

Figure: Load balancing time of the different strategies (GreedyCommLB, GreedyLB, RefineCommLB, TreeMatchLB) vs. number of chares (128 to 8192) for the kNeighbor application, running on 128 cores. Execution time in ms, logarithmic scale.

Future work and Conclusion

The end

Topology is not flat!

Processes affinities are not homogeneous

Taking this information into account to map chares gives us improvements

Algorithm adapted to large problems (Distributed)

JLPC collaborations

10 days during August 2013 at the PPL

Paper accepted at IEEE Cluster 2013

Future work

Find a better way to gather the topology (Hwloc?)

Poor performance on Blue Waters... Need to understand why (Architecture?)

Perform more large scale experiments

Hybrid architecture? Intel MIC?

Evaluate our solution on other applications

Compare to other load balancers (NUCOLB, application-specific LBs)

The End

Thanks for your attention!

Any questions?
