
1

Flow control: a prerequisite for Ethernet scaling

• GROMACS and network congestion

• a network stress-test with MPI_Alltoall

• IEEE 802.3x = ???

• high-level flow control / ordered communication

Carsten Kutzner, David van der Spoel, Erik Lindahl, Martin Fechner, Bert L. de Groot, Helmut Grubmüller

2

GROMACS 3.3 GigE speedup: Aquaporin-1 benchmark (80k atoms)

[Figure: speedup vs. number of CPUs (1-32) for gmx 3.3; left panel: 1 CPU/node, right panel: 2 CPUs/node. Clusters: Orca, Coral, Slomo, ...]

3

GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Bar chart: speedup (0-12) on Infiniband (2.2 GHz Opterons, RZG*) compared with Myrinet and Ethernet (3.06 GHz Xeons, Orca)]

* benchmark done by Renate Dohmen, RZG

4

[Diagram: two consecutive time steps, each containing short-range Coulomb (SR Coul) and PME parts connected by MPI_Alltoall]

Where is the problem?

• delays of 0.25 s in communication-intensive program parts (MPI_Alltoall, MPI_Allreduce, MPI_Bcast, consecutive MPI_Sendrecvs)

5

Network congestion

[Diagram: nodes 0-5 connected to a full-duplex (FDX) switch]

• If more packets arrive than a receiver can process, packets are dropped (and later retransmitted)

6

Network congestion

[Animation step: bullet and diagram repeated from the previous slide]

7

Common MPI_Alltoall implementations: the order of message transfers is random!

[Diagram: nodes 0-5 connected to a full-duplex (FDX) switch]

LAM:

/* Initiate all send/recv to/from others. */
for (i loops over processes) {
    MPI_Recv_init(from i ...);
    MPI_Send_init(to i ...);
}
/* Start all the requests. */
MPI_Startall(...);
/* Wait for them all. */
MPI_Waitall(...);

MPICH:

/* do the communication -- post *all* sends and receives: */
for (i = 0; i < size; i++) {
    /* Performance fix sent in by Duncan Grove <duncan@cs.adelaide.edu.au>.
       Instead of posting irecvs and isends from rank=0 to size-1, scatter
       the destinations so that messages don't all go to rank 0 first.
       Thanks Duncan! */
    dest = (rank + i) % size;
    MPI_Irecv(from dest);
}
for (i = 0; i < size; i++) {
    dest = (rank + i) % size;
    MPI_Isend(to dest);
}
/* ... then wait for *all* of them to finish: */
MPI_Waitall(...);
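For concreteness, here is a minimal, self-contained sketch (assumed names and buffer layout; not the actual LAM or MPICH source) of an all-to-all in the MPICH style above: all nonblocking receives and sends are posted up front with scattered destinations, and nothing constrains the order in which the messages actually traverse the switch:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: exchange 'count' ints with every rank by posting all
 * nonblocking receives and sends at once, then waiting for completion. */
static void alltoall_scattered(const int *sendbuf, int *recvbuf, int count,
                               MPI_Comm comm)
{
    int rank, size, i;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Request *req = malloc(2 * size * sizeof(MPI_Request));

    for (i = 0; i < size; i++) {
        int src = (rank + i) % size;      /* scatter sources/destinations */
        MPI_Irecv(recvbuf + src * count, count, MPI_INT, src, 0, comm, &req[i]);
    }
    for (i = 0; i < size; i++) {
        int dest = (rank + i) % size;
        MPI_Isend(sendbuf + dest * count, count, MPI_INT, dest, 0, comm,
                  &req[size + i]);
    }
    MPI_Waitall(2 * size, req, MPI_STATUSES_IGNORE);
    free(req);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *send = malloc(size * sizeof(int));
    int *recv = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) send[i] = i;   /* one int per destination */

    alltoall_scattered(send, recv, 1, MPI_COMM_WORLD);

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}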

8

A network stress-test with MPI_Alltoall

before:              after:
p0  A0 A1 A2 A3      p0  A0 B0 C0 D0
p1  B0 B1 B2 B3      p1  A1 B1 C1 D1
p2  C0 C1 C2 C3      p2  A2 B2 C2 D2
p3  D0 D1 D2 D3      p3  A3 B3 C3 D3

N = 4 ... 64, S = 4 bytes ... 2 MB

• MPI_Alltoall, a parallel matrix transpose

• each of N procs sends N-1 messages of size S

• N*(N-1) messages transferred in time Δt

• transmission rate T per proc: T = (N−1)S / Δt (see the timing sketch below)
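A minimal sketch of such a stress test (names, payload type, and repetition count are assumptions, not the authors' benchmark code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Time MPI_Alltoall for one message size S and report the per-process
 * transmission rate T = (N-1)*S / dt, averaged over nrep calls. */
int main(int argc, char **argv)
{
    const int nrep = 25;     /* average over 25 calls, as in the talk */
    int S = 1024;            /* message size in bytes per destination */

    MPI_Init(&argc, &argv);
    int rank, N;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &N);

    char *sendbuf = calloc((size_t)N * S, 1);
    char *recvbuf = calloc((size_t)N * S, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < nrep; r++)
        MPI_Alltoall(sendbuf, S, MPI_BYTE, recvbuf, S, MPI_BYTE, MPI_COMM_WORLD);
    double dt = (MPI_Wtime() - t0) / nrep;

    if (rank == 0)
        printf("N=%d S=%d bytes: T = %.2f MB/s per process\n",
               N, S, (N - 1) * (double)S / dt / 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}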

9

LAM MPI_Alltoall

[Plot: transmission rate per CPU [MB/s] vs. message size [bytes], log-log; curves for 4, 8, 16, 32 CPUs; 1 CPU/node]

T averaged over 25 calls (bars: min/max)

10

LAM MPI_Alltoall

1 CPU/node

T averaged over 25 calls (bars: min/max)

data transferred within the FFTW all-to-all:

Aquaporin-1, PME grid 90x88x80 = 633600 floats
  CPUs   bytes to each CPU
     2   649440
     4   173184
     6    73800
     8    43296
    16    11808
    32     2952

Guanylin, PME grid 30x30x21 = 18900 floats
  CPUs   bytes to each CPU
     2   19800
     4    5632
     6    2200
     8    1408

[Plot: transmission rate per CPU [MB/s] vs. message size [bytes], log-log; curves for 4, 8, 16, 32 CPUs]

11

LAM MPI_Alltoall

1 CPU/node

T averaged over 25 calls (bars: min/max)

[Plot as on the previous slide, with the FFTW all-to-all message sizes marked on the curves: 2200 and 1408 bytes (Guanylin); 173184, 73800, and 43296 bytes (Aquaporin-1)]

12

LAM MPI_Alltoall

[Two plots: transmission rate per CPU [MB/s] vs. message size [bytes], log-log; left: 1 CPU/node with 4, 8, 16, 32 CPUs; right: 2 CPUs/node with 4, 8, 16, 32, 64 CPUs]

13

LAM vs. MPICH (2 CPUs/node)

[Two plots: transmission rate per CPU [MB/s] vs. message size [bytes] for LAM (left) and MPICH (right); curves for 4, 8, 16, 32, 64 CPUs]

14

LAM vs. MPICH (2 CPUs/node)

[Plots as on the previous slide: LAM reaches about 49 MB/s per CPU, MPICH about 36 MB/s; the theoretical limit is Tmax = 58 MB/s]

15

IEEE 802.3x flow control

• IEEE 802.3x¹ defines a low-level flow control mechanism to prevent network congestion

• flow of ... messages/data/TCP packets

• receivers may send PAUSE frames to tell the source to stop sending

• reduces packet loss!

[Diagram: nodes 0-5 connected to a full-duplex (FDX) switch]

¹ IEEE, 2000. Information technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Specific requirements – Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications. IEEE 802.3, New York, Annex 31B: MAC control PAUSE operation.

16

[Animation step: diagram, bullets, and reference repeated from the previous slide]

17

[Animation step: as above, with a PAUSE frame being sent]

18

[Animation step: as above, with further PAUSE frames being sent]

19

MPI_Alltoall performance (2 CPUs/node)

[Four plots: transmission rate per CPU [MB/s] vs. message size [bytes] for LAM and MPICH, without flow control and with IEEE 802.3x flow control; curves for 4, 8, 16, 32 (and 64) CPUs]

20

GROMACS 3.3 GigE speedup

[Two plots: speedup vs. number of CPUs (1-32) for gmx 3.3 without and with flow control (fc); left: 1 CPU/node, right: 2 CPUs/node]

21

GROMACS 3.3 GigE speedup

[Two plots as on the previous slide; with flow control the speedup on 16 CPUs reaches Sp16 = 9.1 (1 CPU/node) and Sp16 = 7.4 (2 CPUs/node)]

22

High-level flow control

• low-level flow control does not guarantee good performance for 16+ CPUs

• implement our own all-to-all

• idea: arrange communication in separate phases, each free of congestion

• experimental MPI prototype: CC-MPI²

² Karwande et al., 2005. An MPI prototype for compiled communication on Ethernet switched clusters. J Parallel Distrib Comput 65.

[All-to-all matrix-transpose diagram repeated from the stress-test slide; N = 4 ... 64, S = 4 bytes ... 2 MB]

23

[All-to-all matrix-transpose diagram repeated (N = 4 ... 64, S = 4 bytes ... 2 MB), together with a sketch of the ordered, congestion-free communication phases A, B, C, D]

High-level flow control: ordered all-to-all for single-CPU nodes

for (i = 0; i < ncpu; i++)  /* loop over all CPUs */
{
    /* send to destination CPU while receiving from source CPU: */
    dest   = (cpuid + i) % ncpu;
    source = (ncpu + cpuid - i) % ncpu;
    MPI_Sendrecv(send to dest, recv from source);
    /* separate the communication phases: */
    if (i < ncpu-1)
        MPI_Barrier(comm);
}
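Filled out with concrete MPI arguments, a runnable sketch of this ordered all-to-all could look as follows (buffer layout, MPI_INT payload, and function name are assumptions, not the GROMACS patch itself):

#include <mpi.h>

/* Ordered all-to-all for single-CPU nodes: in phase i every process exchanges
 * data with exactly one partner, and a barrier separates the phases so that
 * no two phases overlap on the network. 'count' ints go to/come from each CPU. */
static void ordered_alltoall(const int *sendbuf, int *recvbuf, int count,
                             MPI_Comm comm)
{
    int cpuid, ncpu, i;
    MPI_Comm_rank(comm, &cpuid);
    MPI_Comm_size(comm, &ncpu);

    for (i = 0; i < ncpu; i++) {
        /* send to destination CPU while receiving from source CPU: */
        int dest   = (cpuid + i) % ncpu;
        int source = (ncpu + cpuid - i) % ncpu;

        MPI_Sendrecv(sendbuf + dest * count,   count, MPI_INT, dest,   0,
                     recvbuf + source * count, count, MPI_INT, source, 0,
                     comm, MPI_STATUS_IGNORE);

        /* separate the communication phases: */
        if (i < ncpu - 1)
            MPI_Barrier(comm);
    }
}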

24

High-level flow control: ordered all-to-all for multi-CPU nodes

/* loop over all nodes */
for (i = 0; i < nnodes; i++)
{
    /* send to destination node while receiving from source node: */
    destnode   = (nodeid + i) % nnodes;
    sourcenode = (nnodes + nodeid - i) % nnodes;
    /* loop over CPUs on a node: */
    for (j = 0; j < cpus_per_node; j++)
    {
        /* source and destination CPU: */
        destcpu   = destnode * cpus_per_node + j;
        sourcecpu = sourcenode * cpus_per_node + j;
        MPI_Irecv(from sourcecpu ...);
        MPI_Isend(to destcpu ...);
    }
    /* wait for communication to finish: */
    MPI_Waitall(...);
    MPI_Waitall(...);
    /* separate the communication phases: */
    if (i < nnodes-1)
        MPI_Barrier(...);
}
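A corresponding runnable sketch for the multi-CPU-node case, under the assumptions that ranks are numbered consecutively within a node and that 'count' ints go to each CPU (illustrative only, not the authors' code):

#include <mpi.h>
#include <stdlib.h>

/* Ordered all-to-all for multi-CPU nodes: in each phase a node exchanges data
 * with exactly one partner node; all CPUs of the two nodes communicate via
 * nonblocking Isend/Irecv, and a barrier separates the phases. */
static void ordered_alltoall_nodes(const int *sendbuf, int *recvbuf, int count,
                                   int cpus_per_node, MPI_Comm comm)
{
    int cpuid, ncpu;
    MPI_Comm_rank(comm, &cpuid);
    MPI_Comm_size(comm, &ncpu);

    int nnodes = ncpu / cpus_per_node;
    int nodeid = cpuid / cpus_per_node;

    MPI_Request *req = malloc(2 * cpus_per_node * sizeof(MPI_Request));

    for (int i = 0; i < nnodes; i++) {
        /* send to destination node while receiving from source node: */
        int destnode   = (nodeid + i) % nnodes;
        int sourcenode = (nnodes + nodeid - i) % nnodes;
        int nreq = 0;

        /* loop over CPUs on the partner nodes: */
        for (int j = 0; j < cpus_per_node; j++) {
            int destcpu   = destnode   * cpus_per_node + j;
            int sourcecpu = sourcenode * cpus_per_node + j;
            MPI_Irecv(recvbuf + sourcecpu * count, count, MPI_INT, sourcecpu,
                      0, comm, &req[nreq++]);
            MPI_Isend(sendbuf + destcpu * count, count, MPI_INT, destcpu,
                      0, comm, &req[nreq++]);
        }
        /* wait for this phase's communication to finish: */
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

        /* separate the communication phases: */
        if (i < nnodes - 1)
            MPI_Barrier(comm);
    }
    free(req);
}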

25

High- vs. low-level flow control – LAM

[Four plots: transmission rate per CPU [MB/s] vs. message size [bytes]. With IEEE 802.3x flow control: LAM, 1 CPU/node and LAM, 2 CPUs/node. With high-level flow control: LAM, Sendrecv all-to-all, 1 CPU/node and LAM, Isend/Irecv all-to-all, 2 CPUs/node. Curves for 4-32 CPUs (1 CPU/node) and 4-64 CPUs (2 CPUs/node).]

26

High- vs. low-level flow control – LAM

[Plots as on the previous slide, with peak per-CPU rates: 68.2 MB/s (1 CPU/node) and 49.0 MB/s (2 CPUs/node) with IEEE 802.3x flow control; 69.6 MB/s (1 CPU/node) and 45.3 MB/s (2 CPUs/node) with high-level flow control. Theoretical limits: 117 MB/s (1 CPU/node), 58 MB/s (2 CPUs/node).]

27

High- vs. low-level flow control – MPICH

[Four plots: transmission rate per CPU [MB/s] vs. message size [bytes]. With IEEE 802.3x flow control: MPICH, 1 CPU/node and MPICH, 2 CPUs/node. With high-level flow control: MPICH, Sendrecv all-to-all, 1 CPU/node and MPICH, Isend/Irecv all-to-all, 2 CPUs/node. Curves for 4-32 CPUs (1 CPU/node) and 4-64 CPUs (2 CPUs/node).]

28

High- vs. low-level flow control – MPICH

[Plots as on the previous slide, with peak per-CPU rates: 43.6 MB/s (1 CPU/node) and 36.3 MB/s (2 CPUs/node) with IEEE 802.3x flow control; 30.9 MB/s (1 CPU/node) and 34.9 MB/s (2 CPUs/node) with high-level flow control. Theoretical limits: 117 MB/s (1 CPU/node), 58 MB/s (2 CPUs/node).]

29

GROMACS 3.3 GigE speedup

MPI_Alltoall vs. ordered all-to-alls

[Two plots: speedup vs. number of CPUs (1-32) for gmx 3.3, with flow control (fc), and with the modified, ordered all-to-all (mod). Left: 1 CPU/node, Sendrecv scheme, high-level flow control only. Right: 2 CPUs/node, Isend/Irecv scheme, high-level flow control only.]

30

GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Bar chart: speedup (0-12) on Infiniband (2.2 GHz Opterons, RZG) compared with Myrinet and Ethernet (3.06 GHz Xeons, Orca)]

31

GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Bar chart as on the previous slide, now including a GigE bar with flow control]

32

GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Bar chart as on the previous slide, with an additional '1 CPU' annotation]

33

Conclusions

• when running GROMACS on more than two Ethernet-connected nodes, flow control is needed to avoid network congestion

• we implemented a congestion-free all-to-all for multi-CPU nodes

• the highest network throughput is probably achieved by combining high- and low-level flow control

• the findings can be applied to other parallel applications that need all-to-all communication

34

Outlook

• MPICH1-1.2.6 → MPICH2-1.0.2; LAM-7.1.1 → OpenMPI-1.0rc1

• allocation technique for PME/PP splitting on multi-CPU nodes

[Figures: GigE speedup plots (1 CPU/node and 2 CPUs/node; gmx 3.3 and fc curves, Sp16 = 9.1 / 7.4), legends listing LAM-7.1.1, OpenMPI-1.0rc1, MPICH1-1.2.6, MPICH2-1.0.2, and a PP/PME node-splitting sketch]
