TRANSCRIPT
Slide 1
Flow control: a prerequisite for Ethernet scaling
• GROMACS and network congestion
• a network stress-test with MPI_Alltoall
• IEEE 802.3x = ???
• high-level flow control / ordered communication
Carsten Kutzner, David van der Spoel, Erik Lindahl, Martin Fechner, Bert L. de Groot, Helmut Grubmüller
Slide 2
GROMACS 3.3 GigE speedup: Aquaporin-1 benchmark (80k atoms)
[Figure: speedup of gmx 3.3 vs. number of CPUs (1-32); left panel: 1 CPU/node, right panel: 2 CPUs/node; clusters: Orca, Coral, Slomo, ...]
Slide 3
GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)
[Bar chart: speedup (scale 0-12) for Infiniband and Ethernet on 2.2 GHz Opterons (RZG), and for Myrinet and Ethernet on 3.06 GHz Xeons (Orca)]
* benchmark done by Renate Dohmen, RZG
Slide 4
Where is the problem?
[Diagram: two consecutive MD time steps, each with a short-range Coulomb (SR Coul) part and a PME part; the PME part contains an MPI_Alltoall]
• delays of 0.25 s in communication-intensive program parts (MPI_Alltoall, MPI_Allreduce, MPI_Bcast, consecutive MPI_Sendrecvs)
Slide 5
Network congestion
[Diagram: six nodes (0-5) connected to a full-duplex (FDX) switch]
• If more packets arrive than a receiver can process, packets are dropped (and later retransmitted)
Slide 6
(animation step: same bullet and switch diagram as slide 5)
Slide 7
Common MPI_Alltoall implementations: the order of message transfers is random!

LAM:

```c
/* Initiate all send/recv to/from others. */
for (i loops over processes) {
    MPI_Recv_init(from i ...);
    MPI_Send_init(to i ...);
}
/* Start all the requests. */
MPI_Startall(...);
/* Wait for them all. */
MPI_Waitall(...);
```

MPICH:

```c
/* do the communication -- post *all* sends and receives: */
for (i = 0; i < size; i++) {
    /* Performance fix sent in by Duncan Grove <[email protected]>.
       Instead of posting irecvs and isends from rank=0 to size-1,
       scatter the destinations so that messages don't all go to
       rank 0 first. Thanks Duncan! */
    dest = (rank + i) % size;
    MPI_Irecv(from dest);
}
for (i = 0; i < size; i++) {
    dest = (rank + i) % size;
    MPI_Isend(to dest);
}
/* ... then wait for *all* of them to finish: */
MPI_Waitall(...);
```

[Diagram: six nodes (0-5) connected to a full-duplex (FDX) switch]
Slide 8
A network stress-test with MPI_Alltoall
send buffers:
p0: A0 A1 A2 A3
p1: B0 B1 B2 B3
p2: C0 C1 C2 C3
p3: D0 D1 D2 D3
receive buffers after MPI_Alltoall:
p0: A0 B0 C0 D0
p1: A1 B1 C1 D1
p2: A2 B2 C2 D2
p3: A3 B3 C3 D3
N = 4 ... 64, S = 4 bytes ... 2 MB
• MPI_Alltoall, a parallel matrix transpose
• each of N procs sends N-1 messages of size S
• N*(N-1) messages transferred in time Δt
• transmission rate T per process: T = (N-1)·S / Δt
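A minimal benchmark of this kind could look as follows (a sketch, not the original test code; the file name, command-line handling, and the timing via MPI_Wtime over 25 calls, matching the averaging used on the next slides, are assumptions):

```c
/* all_to_all_bench.c: time MPI_Alltoall and report T = (N-1)*S/dt per process.
   Build: mpicc all_to_all_bench.c -o all_to_all_bench
   Run:   mpirun -np <N> ./all_to_all_bench <message size S in bytes>        */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int S = (argc > 1) ? atoi(argv[1]) : 1024;     /* message size in bytes   */
    char *sendbuf = malloc((size_t)S * nprocs);    /* one block per receiver; */
    char *recvbuf = malloc((size_t)S * nprocs);    /* contents are irrelevant */

    const int ncalls = 25;                         /* average over 25 calls   */
    MPI_Barrier(MPI_COMM_WORLD);                   /* start all ranks together*/
    double t0 = MPI_Wtime();
    for (int c = 0; c < ncalls; c++)
        MPI_Alltoall(sendbuf, S, MPI_BYTE, recvbuf, S, MPI_BYTE, MPI_COMM_WORLD);
    double dt = (MPI_Wtime() - t0) / ncalls;       /* average time per call   */

    double T = (nprocs - 1) * (double)S / dt / 1e6;   /* MB/s per process     */
    if (rank == 0)
        printf("N = %d  S = %d bytes  T = %.2f MB/s per process\n", nprocs, S, T);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Sweeping S and the number of ranks N produces rate-vs-message-size curves of the kind shown on the following slides.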
Slide 9
LAM MPI_Alltoall
[Plot: transmission rate per CPU [MB/s] vs. message size [bytes], log-log, for 4, 8, 16, and 32 CPUs; 1 CPU/node; T averaged over 25 calls (bars: min/max)]
Slide 10
LAM MPI_Alltoall
1 CPU/node; T averaged over 25 calls (bars: min/max)

data transferred within the FFTW all-to-all:

Aquaporin-1, PME grid 90x88x80 = 633600 floats

| CPUs | bytes to each CPU |
|------|-------------------|
| 2    | 649440            |
| 4    | 173184            |
| 6    | 73800             |
| 8    | 43296             |
| 16   | 11808             |
| 32   | 2952              |

Guanylin, PME grid 30x30x21 = 18900 floats

| CPUs | bytes to each CPU |
|------|-------------------|
| 2    | 19800             |
| 4    | 5632              |
| 6    | 2200              |
| 8    | 1408              |

[Plot: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, and 32 CPUs]
Slide 11
LAM MPI_Alltoall
1 CPU/node; T averaged over 25 calls (bars: min/max)
[Plot: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, and 32 CPUs; the message sizes occurring in the FFTW all-to-all (tables on slide 10) are marked: 173184, 73800, and 43296 bytes (Aquaporin-1) and 2200 and 1408 bytes (Guanylin)]
Slide 12
LAM MPI_Alltoall
1 CPU/node vs. 2 CPUs/node
[Two plots: transmission rate per CPU [MB/s] vs. message size [bytes]; left: 1 CPU/node (4, 8, 16, 32 CPUs), right: 2 CPUs/node (4, 8, 16, 32, 64 CPUs)]
Slide 13
LAM vs. MPICH (2 CPUs/node)
[Two plots: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, and 64 CPUs; left: LAM, right: MPICH]
Slide 14
LAM vs. MPICH (2 CPUs/node)
[Two plots: as on slide 13, with peak per-CPU rates annotated: LAM about 49 MB/s, MPICH about 36 MB/s; theoretical limit Tmax = 58 MB/s]
Slide 15
IEEE 802.3x flow control
• IEEE 802.3x¹ defines a low-level flow control mechanism to prevent network congestion
• flow of ... messages / data / TCP packets
• receivers may send PAUSE frames to tell the source to stop sending
• reduces packet loss!
[Diagram: six nodes (0-5) connected to a full-duplex (FDX) switch]
1 IEEE, 2000. Information technology–Telecommunications and information exchange between systems–Local and metropolitan area networks–Specific requirements–Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications. IEEE 802.3, New York, Annex 31B: MAC control PAUSE operation.
Slide 16
(animation step: same bullets, switch diagram, and footnote as slide 15)
Slide 17
(animation step: same bullets, diagram, and footnote as slide 15; the diagram now shows a PAUSE! frame)
Slide 18
(animation step: same bullets, diagram, and footnote as slide 15; the diagram now shows two PAUSE! frames)
Slide 19
MPI_Alltoall performance (2 CPUs/node)
[Four plots: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, and 64 CPUs; columns: LAM and MPICH; rows: no flow control vs. with IEEE 802.3x flow control]
Slide 20
GROMACS 3.3 GigE speedup
[Figure: speedup vs. number of CPUs (1-32) for gmx 3.3 without and with flow control (fc); left panel: 1 CPU/node, right panel: 2 CPUs/node]
Slide 21
GROMACS 3.3 GigE speedup
[Figure: as on slide 20; with flow control the speedup on 16 CPUs reaches Sp16 = 9.1 (1 CPU/node) and Sp16 = 7.4 (2 CPUs/node)]
Slide 22
High-level flow control
• low-level flow control does not guarantee good performance for 16+ CPUs
• implement own all-to-all
• idea: arrange communication in separated phases, each free of congestion
• experimental MPI prototype: CC-MPI²
2 Karwande et al., 2005, An MPI prototype for compiled communication on Ethernet switched clusters, J Parallel Distrib Comput 65
(same MPI_Alltoall send/receive buffer diagram as slide 8; N = 4 ... 64, S = 4 bytes ... 2 MB)
Slide 23
High-level flow control: ordered all-to-all for single-CPU nodes

(same MPI_Alltoall send/receive buffer diagram as slide 8; N = 4 ... 64, S = 4 bytes ... 2 MB)

[Diagram: the all-to-all is split into separate communication phases (A, B, C, D) among nodes 0-4, each phase free of congestion]

```c
for (i = 0; i < ncpu; i++)      /* loop over all CPUs */
{
    /* send to destination CPU while receiving from source CPU: */
    dest   = (cpuid + i) % ncpu;
    source = (ncpu + cpuid - i) % ncpu;
    MPI_Sendrecv(send to dest, recv from source);
    /* separate the communication phases: */
    if (i < ncpu - 1)
        MPI_Barrier(comm);
}
```
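Filled in with concrete arguments, the Sendrecv scheme might look roughly like this (a sketch under assumptions: MPI_Alltoall-style buffer layout, MPI_DOUBLE payload, tag 0; the function name ordered_alltoall is made up for illustration):

```c
#include <mpi.h>
#include <stddef.h>

/* Ordered all-to-all via MPI_Sendrecv; sendbuf/recvbuf hold ncpu blocks of
   'count' doubles each, laid out as for MPI_Alltoall. One communication
   phase per loop iteration. */
void ordered_alltoall(double *sendbuf, double *recvbuf, int count, MPI_Comm comm)
{
    int cpuid, ncpu;
    MPI_Comm_rank(comm, &cpuid);
    MPI_Comm_size(comm, &ncpu);

    for (int i = 0; i < ncpu; i++)      /* loop over all CPUs */
    {
        /* send to destination CPU while receiving from source CPU: */
        int dest   = (cpuid + i) % ncpu;
        int source = (ncpu + cpuid - i) % ncpu;
        MPI_Sendrecv(sendbuf + (size_t)dest   * count, count, MPI_DOUBLE, dest,   0,
                     recvbuf + (size_t)source * count, count, MPI_DOUBLE, source, 0,
                     comm, MPI_STATUS_IGNORE);
        /* separate the communication phases: */
        if (i < ncpu - 1)
            MPI_Barrier(comm);
    }
}
```

In every phase each CPU receives from exactly one other CPU, so no switch port is oversubscribed, and the MPI_Barrier keeps a slow phase from spilling into the next one.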
Slide 24
High-level flow control: ordered all-to-all for multi-CPU nodes

```c
/* loop over all nodes */
for (i = 0; i < nnodes; i++)
{
    /* send to destination node while receiving from source node: */
    destnode   = (nodeid + i) % nnodes;
    sourcenode = (nnodes + nodeid - i) % nnodes;
    /* loop over CPUs on a node: */
    for (j = 0; j < cpus_per_node; j++)
    {
        /* source and destination CPU: */
        destcpu   = destnode   * cpus_per_node + j;
        sourcecpu = sourcenode * cpus_per_node + j;
        MPI_Irecv(from sourcecpu ...);
        MPI_Isend(to destcpu ...);
    }
    /* wait for communication to finish: */
    MPI_Waitall(...);
    MPI_Waitall(...);
    /* separate the communication phases: */
    if (i < nnodes - 1)
        MPI_Barrier(...);
}
```
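A runnable version of this Isend/Irecv scheme could be sketched as follows (assumptions: MPI_Alltoall-style buffer layout, MPI_DOUBLE payload, tag 0, ranks placed node by node, and a single MPI_Waitall over all requests of a phase instead of the two separate calls on the slide):

```c
#include <mpi.h>
#include <stdlib.h>

/* Ordered all-to-all for multi-CPU nodes: in each phase, the CPUs of a node
   exchange data with the corresponding CPUs of exactly one other node.
   sendbuf/recvbuf hold ncpus blocks of 'count' doubles (MPI_Alltoall layout). */
void ordered_alltoall_nodes(double *sendbuf, double *recvbuf, int count,
                            int cpus_per_node, MPI_Comm comm)
{
    int cpuid, ncpus;
    MPI_Comm_rank(comm, &cpuid);
    MPI_Comm_size(comm, &ncpus);

    int nnodes = ncpus / cpus_per_node;
    int nodeid = cpuid / cpus_per_node;

    MPI_Request *req = malloc(2 * cpus_per_node * sizeof(MPI_Request));

    for (int i = 0; i < nnodes; i++)             /* loop over all nodes */
    {
        int destnode   = (nodeid + i) % nnodes;
        int sourcenode = (nnodes + nodeid - i) % nnodes;
        int nreq = 0;

        for (int j = 0; j < cpus_per_node; j++)  /* loop over CPUs on a node */
        {
            int destcpu   = destnode   * cpus_per_node + j;
            int sourcecpu = sourcenode * cpus_per_node + j;
            MPI_Irecv(recvbuf + (size_t)sourcecpu * count, count, MPI_DOUBLE,
                      sourcecpu, 0, comm, &req[nreq++]);
            MPI_Isend(sendbuf + (size_t)destcpu * count, count, MPI_DOUBLE,
                      destcpu, 0, comm, &req[nreq++]);
        }
        /* wait for this phase's communication to finish: */
        MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        /* separate the communication phases: */
        if (i < nnodes - 1)
            MPI_Barrier(comm);
    }
    free(req);
}
```

In each phase a node receives only from one other node, so its switch port is never oversubscribed even though several CPUs share the link.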
Slide 25
High- vs. low-level flow control – LAM
[Four plots: transmission rate per CPU [MB/s] vs. message size [bytes]; with IEEE 802.3x flow control: LAM, 1 CPU/node (4-32 CPUs) and LAM, 2 CPUs/node (4-64 CPUs); with high-level flow control: LAM, Sendrecv all-to-all, 1 CPU/node and LAM, Isend/Irecv all-to-all, 2 CPUs/node]
Slide 26
High- vs. low-level flow control – LAM
[Four plots: as on slide 25, with peak per-CPU rates annotated: 68.2 MB/s (1 CPU/node) and 49.0 MB/s (2 CPUs/node) with IEEE 802.3x flow control; 69.6 MB/s (1 CPU/node) and 45.3 MB/s (2 CPUs/node) with the high-level flow control all-to-alls; theoretical limits 117 MB/s (1 CPU/node) and 58 MB/s (2 CPUs/node)]
Slide 27
High- vs. low-level flow control – MPICH
[Four plots: transmission rate per CPU [MB/s] vs. message size [bytes]; with IEEE 802.3x flow control: MPICH, 1 CPU/node (4-32 CPUs) and MPICH, 2 CPUs/node (4-64 CPUs); with high-level flow control: MPICH, Sendrecv all-to-all, 1 CPU/node and MPICH, Isend/Irecv all-to-all, 2 CPUs/node]
Slide 28
High- vs. low-level flow control – MPICH
[Four plots: as on slide 27, with peak per-CPU rates annotated: 43.6 MB/s (1 CPU/node) and 36.3 MB/s (2 CPUs/node) with IEEE 802.3x flow control; 30.9 MB/s (1 CPU/node) and 34.9 MB/s (2 CPUs/node) with the high-level flow control all-to-alls; theoretical limits 117 MB/s (1 CPU/node) and 58 MB/s (2 CPUs/node)]
Slide 29
GROMACS 3.3 GigE speedup
MPI_Alltoall vs. ordered all-to-alls
[Figure: speedup vs. number of CPUs (1-32) comparing gmx 3.3, gmx 3.3 with flow control (fc), and the modified version (mod) using the ordered all-to-alls with high-level flow control only; left panel: 1 CPU/node (Sendrecv scheme), right panel: 2 CPUs/node (Isend/Irecv scheme)]
Slide 30
GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)
[Bar chart: speedup (scale 0-12) for Infiniband and Ethernet on 2.2 GHz Opterons (RZG), and for Myrinet and Ethernet on 3.06 GHz Xeons (Orca)]
Slide 31
GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)
[Bar chart: as on slide 30, now including the Ethernet result with flow control]
Slide 32
GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)
[Bar chart: as on slide 31, with an additional result labeled "1 CPU"]
Slide 33
Conclusions
• when running GROMACS on more than 2 Ethernet-connected nodes, flow control is needed to avoid network congestion
• implemented a congestion-free all-to-all for multi-CPU nodes
• highest network throughput probably for a combination of high- and low-level flow control
• findings can be applied to other parallel applications that need all-to-all communication
Slide 34
Outlook
• MPICH1-1.2.6 → MPICH2-1.0.2, LAM-7.1.1 → OpenMPI-1.0rc1
• allocation technique for PME/PP splitting on multi-CPU nodes
[Figure: speedup vs. number of CPUs (1-32) for gmx 3.3 with flow control (fc), Sp16 = 9.1 (1 CPU/node) and Sp16 = 7.4 (2 CPUs/node); diagram of PP and PME task placement on the nodes]