Bull Exascale Interconnect (BXI)

a new network for high-performance computing

Jean-Pierre Panziera

11-10-2016

Journées méso-centres

© Atos - Confidential




| Bull / Atos | extreme computing | © Confidential

HPC applications

HPC applications are characterized by:

▶ X-large computing needs (TeraFlops, PetaFlops, ExaFlops…)

▶ X-large datasets

▶ large number of processors

▶ tight coupling between computing threads

▶ many short MPI messages (latency-sensitive)

▶ large IO transfers (bandwidth-sensitive)

These requirements call for an efficient HPC interconnect.
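To make the latency versus bandwidth distinction concrete, here is a minimal sketch of the usual first-order cost model, time = latency + size / bandwidth. The ~1 µs latency and 100 Gb/s link rate are the round figures quoted for BXI later in this deck; the function names and the model itself are illustrative assumptions, not measurements.

```python
# First-order message cost model (an assumption for illustration, not a BXI measurement).
LATENCY_S = 1e-6        # assumed end-to-end latency, ~1 µs
BANDWIDTH_BPS = 100e9   # assumed link rate, 100 Gb/s


def transfer_time_s(size_bytes: int, latency_s: float = LATENCY_S,
                    bandwidth_bps: float = BANDWIDTH_BPS) -> float:
    """Fixed latency plus the time needed to serialize the payload on the link."""
    return latency_s + (size_bytes * 8) / bandwidth_bps


def latency_share(size_bytes: int) -> float:
    """Fraction of the transfer time that is pure latency."""
    return LATENCY_S / transfer_time_s(size_bytes)


if __name__ == "__main__":
    for size in (8, 1024, 1024 ** 2, 1024 ** 3):
        t = transfer_time_s(size)
        print(f"{size:>12} bytes: {t * 1e6:12.3f} µs, latency share {latency_share(size):.4f}")
```

Under these assumptions an 8-byte message is almost entirely latency, while a 1 MiB transfer is almost entirely serialization time, which is why short MPI messages stress latency and large IO transfers stress bandwidth.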


HPC systems are highly parallel: Petaflop class, featuring 1000s of CPU nodes

[Figure: CPU compute nodes linked by a fast interconnect, with multiple storage tiers]

▶ 100-10,000 compute nodes using CPUs, typically x86

▶ Fast Interconnect

▶ Multiple storage tiers


Expecting 10-100 Pflops systems in 2016-17 … with HPC specific Processing Units

[Figure: HPC nodes, each grouping several CPUs, linked by a fast HPC interconnect, with multiple storage tiers]

▶ 1000s-10,000s compute nodes: CPUs, GPUs, HPC accelerators

▶ Fast Interconnect: HPC Interconnect

▶ Multiple storage tiers


HPC Interconnect

▶ performant

– low latency

– high message rate

– high bandwidth

▶ scalable

– 10,000s nodes

▶ reliable

– fault tolerant

– redundant

▶ efficient

– handle different flow types simultaneously: small & big messages, MPI & IO

– Adaptive routing

– small memory footprint

– link-level checking & retry, ECC protection

▶ Offload communications in Hardware

– HPC cores are many but slow(er)


BXI overview

▶ BXI: High Performance Interconnect for HPC

– Lowest latency, Highest message rate at scale, Highest Bandwidth

▶ BXI full acceleration in hardware for HPC applications

– based on Portals 4 (Sandia), BXI provides full HW acceleration for:

• MPI and PGAS communications (send/recv, RDMA)

• High performance collective operations

▶ BXI highly scalable, efficient and reliable

– Exascale scalability: up to 64k nodes (v1)

– Adaptive Routing, Quality of Service (QoS)

– End-to-end error checking + link level CRC + ECC in ASICs

▶ BXI co-designed with CEA


BXI Network is based on 2 ASICs

▶ BXI NIC: NIC ASIC (Lutetia)

– PCI Express 16x Gen 3

– BXI Link 100 (4x25) Gb/s

– MPI Latency ~1 µs

– Issue rate >100 Mmsg/s

▶ BXI Switch: switch ASIC (Divio)

– 48 ports BXI Link

– 9600 Gb/s bandwidth


NIC main features 1/2

▶ Implements the Portals 4 communication primitives in hardware

– Overlapping communications and computations by offloading to NIC

– MPI two-sided messaging:

• HW acceleration of list management and matching on the NIC

– PGAS / MPI one-sided messaging:

• use fast path inside the NIC

▶ OS and application bypass

– Applications issue commands directly to the NIC, avoiding kernel calls

– Reception controlled by NIC without OS involvement

– Reply to a put or a get does not require activity on application side.

• Logical to physical ID translation

• Virtual to physical memory address translation

• Rendez-vous protocol in HW
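As an illustration of what "list management and matching" involves for MPI two-sided messaging, the sketch below models it in software: an incoming message is matched against posted receives in post order, and a message with no matching receive is parked on an unexpected list until a receive claims it. This is only a model of the MPI matching rule that BXI performs on the NIC; the class names and the `ANY` wildcard are hypothetical and this is not the Portals 4 API.

```python
from dataclasses import dataclass

ANY = -1  # hypothetical wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


@dataclass
class Recv:
    source: int
    tag: int
    name: str


@dataclass
class Msg:
    source: int
    tag: int
    payload: str


def matches(recv: Recv, msg: Msg) -> bool:
    """A receive matches when source and tag are equal or wildcarded."""
    return recv.source in (ANY, msg.source) and recv.tag in (ANY, msg.tag)


class MatchingEngine:
    def __init__(self):
        self.posted = []      # receives waiting for a message, in post order
        self.unexpected = []  # messages that arrived before a matching receive

    def post_recv(self, recv: Recv):
        """Post a receive; return the payload if an unexpected message already matches."""
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return msg.payload
        self.posted.append(recv)
        return None

    def deliver(self, msg: Msg):
        """Handle an incoming message; return the name of the matched receive, if any."""
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]
                return recv.name
        self.unexpected.append(msg)
        return None
```

Both lists are walked linearly here, which is the per-message cost on the host that hardware matching on the NIC is meant to remove.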


NIC main features 2/2

▶ Collective Operations offload in HW

– using Atomic and Triggered operations units

▶ End-to-End reliability: recovery mechanism for transient and permanent failures

– message integrity: a 32-bit CRC is added to each message (or to each message chunk for large transfers).

– message ordering: the ordering required for MPI messages is checked with a 16-bit sequence number.

– message delivery: a go-back-N protocol is used to retransmit lost or corrupted messages.

▶ Allocates Virtual Channels: separates different types of messages to avoid deadlocks and to optimize network resource usage (load balancing and QoS)

▶ Offers performance and error counters for application performance analysis
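The three end-to-end mechanisms listed above (CRC, sequence number, go-back-N) can be sketched together in a few lines. This is a simplified software model under stated assumptions, not the BXI wire format: the packet layout (2-byte sequence number, payload, 4-byte CRC computed with `zlib.crc32` over both), the round-based sender, and all function names are hypothetical.

```python
import zlib

SEQ_MOD = 1 << 16  # 16-bit sequence numbers wrap at 65536


def make_packet(seq: int, payload: bytes) -> bytes:
    """Assumed layout: 2-byte sequence number, payload, 4-byte CRC over both."""
    header = (seq % SEQ_MOD).to_bytes(2, "big")
    crc = zlib.crc32(header + payload).to_bytes(4, "big")
    return header + payload + crc


def parse_packet(packet: bytes):
    """Return (seq, payload), or None if the CRC check fails."""
    body, crc = packet[:-4], packet[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return None
    return int.from_bytes(body[:2], "big"), body[2:]


def go_back_n(messages, window, corrupt_transmissions):
    """Deliver messages in order over a channel that corrupts the listed
    transmissions (0-based count of packets put on the wire). window must be >= 1."""
    delivered = []
    expected = 0  # receiver: next message index it will accept
    base = 0      # sender: oldest unacknowledged message
    sent = 0      # total transmissions so far
    while base < len(messages):
        # Send one window, starting again from the oldest unacknowledged message.
        for i in range(base, min(base + window, len(messages))):
            packet = make_packet(i, messages[i])
            if sent in corrupt_transmissions:
                packet = packet[:-1] + bytes([packet[-1] ^ 0xFF])  # damage the CRC
            sent += 1
            parsed = parse_packet(packet)
            # The receiver drops corrupted packets and anything out of order.
            if parsed is not None and parsed[0] == expected % SEQ_MOD:
                delivered.append(parsed[1])
                expected += 1
        # Cumulative acknowledgement: everything before `expected` has arrived.
        base = expected
    return delivered, sent
```

With four messages, a window of 3 and the second transmission corrupted, the receiver also discards the third packet because it is out of order, and the sender goes back to the second message: six transmissions in total for four deliveries.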


BXI offloading MPI communication in HW

[Figure: timeline of two ranks: Isend / IRecv, compute, Wait]

#include <mpi.h>

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);

int MPI_Wait(MPI_Request *request, MPI_Status *status);

Annotations on the call arguments:

▶ address: virtual to physical translation (V2P)

▶ size

▶ rank: logical to physical translation (L2P)

▶ message order


BXI offloading MPI communication in HW (continued)

[Figure: timelines of Isend / IRecv, compute and Wait on two ranks, shown without and with HW offload, along a time axis]
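The effect the timeline illustrates can be approximated with a toy model. It assumes two extremes that the deck does not quantify: without offload, the transfer only progresses inside MPI calls and is paid in full during Wait; with offload, the NIC progresses the transfer while the host computes. The function names are hypothetical.

```python
def step_time_without_offload(compute, comm):
    """Assumed worst case: communication is serialized after the computation."""
    return compute + comm


def step_time_with_offload(compute, comm):
    """Assumed best case: communication fully overlaps the computation."""
    return max(compute, comm)


def wait_time_with_offload(compute, comm):
    """Time left to spend in Wait once the computation is done."""
    return max(0, comm - compute)


if __name__ == "__main__":
    # Arbitrary example figures, in microseconds.
    for compute, comm in ((10, 4), (4, 10)):
        print(compute, comm,
              step_time_without_offload(compute, comm),
              step_time_with_offload(compute, comm),
              wait_time_with_offload(compute, comm))
```

In this model the benefit is largest when computation and communication times are comparable, and Wait drops to zero as soon as the computation is longer than the transfer.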


BXI MPI AllReduce using Triggered and Atomic operations

[Figure: reduction tree with a parent P and two children Ch1, Ch2. Labels: TrigAtomic(Data); ME(Ready), ME(Data), ME(Result); counters CT incremented by +1, with thresholds = 3 and = 1; TrigPut(result); TrigPut(Ready), Put(Ready)]


BXI Switch overview

▶ 48 ports, 192 SerDes @ 25 Gb/s

– Total throughput: 9600 Gb/s

▶ Latency: 130 ns

▶ Die: 22 x 23 mm

▶ Package: 57.5 x 57.5 mm

▶ Transistors: 5.5 billion

▶ TDP: 160 W

– Min power: 60 W

▶ Technology: TSMC 28nm HPM
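The port, SerDes and throughput figures above can be cross-checked with simple arithmetic. The sketch assumes 4 lanes per port (the 4x25 Gb/s link quoted for BXI elsewhere in this deck) and assumes that the 9600 Gb/s total counts both directions of each full-duplex link; the deck does not state the second point explicitly. The function name is hypothetical.

```python
def switch_throughput_gbps(ports=48, lanes_per_port=4, lane_rate_gbps=25):
    """Derive SerDes count and aggregate throughput from per-port figures."""
    serdes = ports * lanes_per_port          # one SerDes per lane
    one_way = serdes * lane_rate_gbps        # all ports, one direction
    return {
        "serdes": serdes,
        "one_way_gbps": one_way,
        "both_ways_gbps": 2 * one_way,       # assumed: full duplex counted twice
    }


if __name__ == "__main__":
    print(switch_throughput_gbps())
```

With the defaults this gives 192 SerDes, 4800 Gb/s per direction and 9600 Gb/s when both directions are counted.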


BXI fabric features

▶ Scalable up to 64K NICs

▶ 100 Gb/s links (4 lanes x 25.278125 GT/s)

▶ Reliable and ordered network (end to end + Link level)

▶ Flexible with full routing table

– Many topologies supported (Fat-Tree, Torus, Hypercube, All-to-All…)

– Eases routing algorithm optimization

▶ Adaptive routing

▶ Extensive buffering implementing 16 virtual channels preventing deadlock and efficiently balancing traffic

▶ Quality of Service (QoS) with weighted round robin arbitration

– highly configurable load balancing

– Segregation of flows per destination

– ensuring progress of short messages vs long messages

▶ High resolution time synchronization

▶ Out-of-band management
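To show how weighted round robin arbitration can keep short messages progressing alongside long ones, here is a minimal software sketch. It is a generic illustration of the technique, not the BXI arbiter: the queue names, the weights and the function name are all hypothetical.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Yield (queue_name, item). In each round, every queue may send
    up to weights[name] items before the arbiter moves to the next queue."""
    pending = {name: deque(items) for name, items in queues.items()}
    while any(pending.values()):
        for name, queue in pending.items():
            for _ in range(weights[name]):
                if not queue:
                    break
                yield name, queue.popleft()


if __name__ == "__main__":
    queues = {"short": ["s1", "s2", "s3"], "long": ["L1", "L2", "L3"]}
    weights = {"short": 2, "long": 1}  # hypothetical weights favouring short messages
    for name, item in weighted_round_robin(queues, weights):
        print(name, item)
```

Each queue is guaranteed a share of the link in every round, so a backlog of long transfers cannot starve short messages; the weights set the ratio between the flows.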


Fabric Management Software

[Figure: management nodes running the Fabric Management Software, connected through the Ethernet management network (GbE, 10GbE) to the ARM µc of each BXI switch. Other labels: Fans, Sensors, ...]

▶ Routing → up to 64k nodes

▶ Supervision → failure handling

▶ Topology → cable checking

▶ Performance → counter access


BXI Routing Online Mode Processing Time, e.g. 64k nodes

▶ routing table updates computed in < 1 s

– on link failure

– on link recovery

Quintin, Vignéras; Fault-Tolerant Routing for Exascale Supercomputer: The BXI Routing Architecture. HiPINEB'15

Quintin, Vignéras; Transitively Deadlock-Free Routing Algorithms. HiPINEB'16
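As a rough picture of what updating routing tables on a link failure means, the sketch below recomputes next hops with a plain breadth-first search on a tiny two-leaf, two-spine topology. It is a generic shortest-path illustration only, not the BXI routing algorithm (which is described in the papers cited above); the topology, node names and function name are hypothetical.

```python
from collections import deque


def routing_table(links, source):
    """Next hop from `source` to every reachable node, by breadth-first search."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    next_hop = {}
    visited = {source}
    queue = deque()
    # Direct neighbours are their own next hop.
    for n in sorted(neighbours.get(source, [])):
        visited.add(n)
        next_hop[n] = n
        queue.append(n)
    # Every node found later inherits the next hop of the node that reached it.
    while queue:
        node = queue.popleft()
        for n in sorted(neighbours[node]):
            if n not in visited:
                visited.add(n)
                next_hop[n] = next_hop[node]
                queue.append(n)
    return next_hop


if __name__ == "__main__":
    links = [("A", "S1"), ("A", "S2"), ("B", "S1"), ("B", "S2")]
    print(routing_table(links, "A"))                                    # B reached via S1
    print(routing_table([l for l in links if l != ("B", "S1")], "A"))   # B reached via S2
```

Removing the B-S1 link and recomputing moves traffic for B onto the surviving spine; restoring the link and recomputing brings the original table back.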


BXI Software compute stack


BXI PCI adapter card and 48p standalone switch

[Figure: PCI adapter card and 1U standalone switch. Labels: BXI NIC ASIC; BXI port; 48p BXI switch ASIC; Optical modules (PODs); Optical cables (100Gb/s); Redundant Power Supplies; Redundant Fans]


“Sequana” – Embedded interconnect

[Figure: fast interconnect layout of a Sequana cell. Compute nodes on both sides connect to L1 switches (NIC-L1 connection); L1 and L2 switches connect through a copper cable backplane (L1-L2 connection). Schematic labels: L2, L1, 12, 12, 24 Nodes, IO/svc]


Sequana cells interconnection

[Figure: Sequana cells interconnected through L3 switches. Labels: ... 48x ...; L3 switches; Direct connections; Fat-Tree]
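For a sense of scale, the textbook capacity of a full-bisection fat tree built from identical switches is 2 x (radix / 2)^levels endpoints. The sketch below applies that generic formula to 48-port switches. It is an assumption-laden estimate: the deck does not say that BXI systems use a full-bisection tree, and the function names are hypothetical.

```python
def fat_tree_max_nodes(radix: int, levels: int) -> int:
    """Endpoints of a full-bisection fat tree: half the ports go down at each
    level except the top one, where all ports go down."""
    half = radix // 2
    return 2 * half ** levels


def levels_needed(radix: int, endpoints: int) -> int:
    """Smallest number of switch levels that reaches `endpoints` (radix >= 4)."""
    levels = 1
    while fat_tree_max_nodes(radix, levels) < endpoints:
        levels += 1
    return levels


if __name__ == "__main__":
    for levels in (1, 2, 3, 4):
        print(levels, fat_tree_max_nodes(48, levels))
    print(levels_needed(48, 64 * 1024))
```

Under this model three levels of 48-port switches reach 27,648 endpoints, so 64k endpoints would need either a fourth level or a different (for example tapered or direct) arrangement; the deck does not specify which is used.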


BXI wrap up

▶ BXI is Atos's new High Performance Interconnect for HPC

▶ BXI offloads communication primitives into the NIC

▶ BXI boosts MPI communications in Hardware

▶ Highly scalable, up to 64k nodes

▶ First BXI system installed in Q4-2016

▶ Large BXI deployment (8K+ node system) in 2017


Atos, the Atos logo, Atos Consulting, Atos Worldgrid, Worldline, BlueKiwi, Bull, Canopy the Open Cloud Company, Yunano, Zero Email, Zero Email Certified and The Zero Email Company are registered trademarks of the Atos group. May 2015. © 2015 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.


Questions?