KIRA™ REDEFINES SUPERCOMPUTING
BIG DATA ANALYTICS, AI, TECHNICAL COMPUTING
(c) A3Cube
To enable the next-generation HPC, High-Performance Data (HPD), Machine Learning (ML)
and Predictive Analytics (PA) revolution at scale, without compromise
A3Cube KIRA™ The versatile supercomputer system
KIRA™ Versatile Supercomputer
Exascale Data Analytics
Advanced HPC
Exascale AI
KIRA™ hardware and software innovations tackle system bottlenecks in processing, data
movement and I/O.
KIRA™ eliminates the distinction between clusters and supercomputers, providing a rich
software and system interconnect in different form factors.
KIRA™ allows for multiple processor and accelerator architectures and a choice of system
interconnect technologies and topologies, including our new RONNIEE Express interconnect.
Increasingly, high-performance computing systems today need to handle massive
converged modeling, simulation, AI and analytics workloads.
KIRA™ is an entirely new design, created from the ground up to address these needs. Built to
be data-centric, it runs the fastest and most diverse workloads all at the same time.
KIRA-CS™ uses holistic and integrated methods to deliver the world’s most complete, versatile and robust HPC systems. Rather than assembling cluster-like components and commodity networks that can degrade or behave unpredictably at high node counts, A3Cube uses a ground-up design approach: a deeply optimized fabric and topology architecture, combined with a “system-centric” file system and storage orchestrator, plus an innovative built-in “composable” resource-sharing capability that guarantees linear scaling in computation, I/O performance and coprocessor implementation. Regardless of your application requirements, the KIRA-CS series scales across the performance spectrum of system sizes and application types, from smaller-footprint, lower-density configurations up to the most complex data analytics challenges.
● Sustained and scalable application performance
● Tight HPC optimization and integration
● Upgradability by design
● User productivity
KIRA™ Versatile Supercomputer
Four Architectural Keys, perfectly orchestrated
KIRA™ represents our exascale architecture for the most demanding tasks, yet it is incredibly simple to configure, manage and operate.
Processing
Data movement and I/O
System interconnect
Accelerators
Processing
Processing Element Architecture
[Processing Element block diagram: CPU1 and CPU2, each with a local data array, connected through a 200 Gbit/s NIC to the router; callouts A-D described below]
High density processing element based on cutting edge CPU, Memory and I/O
A: Two AMD Epyc processors (64 cores, 128 threads total)
B: Eight (8) memory channels per CPU (25% better bandwidth than x86 competitors)
C: Balanced low-latency, high-speed interconnect (with multiple protocol support)
D: Built-in storage subsystem combined with global filesystem sharing for maximum data-access performance, optimized for data-intensive and ML operations
Processing
KIRA™ Rack
KIRA Rack Configuration
A: Up to 128 AMD Epyc processors (4096 cores, 8192 threads total)
B: Up to 128 TByte of memory per rack
C: Up to 364 local disks for local data access or parallel converged storage
D: Ultra-low-diameter interconnect for maximum performance, lowest latency and best bisection bandwidth possible at scale (<0.8 µs, 1000 nodes)
E: Support for multiple protocols: RDMA (OFED), TCP/IP, memory semantics (*)
F: Optimized Linux and software suite, preconfigured and specifically tuned (FORTISSIMO FOUNDATION)
G: Multi-level graphical system orchestrator with built-in multi-storage configuration and management, coprocessor management and allocation, and system memory, interconnect and OS tuning
Processing / Data movement and I/O
Optimized software and file-system architecture to deliver the best data access at scale, in any situation and with any application.
FORTISSIMO FOUNDATION™ provides the optimized kernel for the system, including the file system and I/O drivers.
Rhapsody System Manager™ allows different file systems with different configurations to coexist inside the same machine at the same time, providing incredible data-access configuration flexibility.
Best data access at the single node; best data access at scale, with a global distributed parallel file system and pervasive RDMA support.
Support for converged, diverged, aggregated and disaggregated storage configurations lets the system run multiple different workloads while always achieving the best performance.
Unique I/O and data-access technology makes KIRA™ a unique supercomputer in the whole HPC landscape. No configuration pain, from the simplest application to the most demanding multi-workload scenarios running concurrently on the same system.
Processing / Data movement and I/O / System interconnect
Keeping latency low is the new application mantra
Computational efficiency is a function of the efficiency of the interconnect. Low-speed interconnects (read: high latency), even with high bandwidth, are not good for modern applications like big data and ML. Low efficiency means that to reach a goal you spend more time, add more hardware, consume more electricity, pay a bigger bill and incur a higher TCO than with a balanced system.
[Chart: distributed performance as a function of latency]
Latency is the most critical performance factor in a system because it directly affects system data-exchange time. Latency means lost time: time that could have been spent productively producing computational results is instead spent waiting for the data. Latency is the “application stealth tax”, silently extending the elapsed times of individual computational tasks and processes, which then take longer to execute.
A3Cube’s Hybrid Topology Fabric is a “direct” network, avoiding the need for external top-of-rack switches. It is designed to scale to a large number of nodes and to reduce the number of optical links required for a given global bandwidth compared to a traditional fabric such as a fat tree, demonstrating the value of low-diameter networks.
A3Cube creates many small computing islands connected with the shortest cabling possible, then joins these islands in an ultra-short-diameter network in which each group as a whole has enough links to connect directly to all of the other groups in the system. The result is an ultra-low-latency system with extremely high bisection bandwidth.
[Diagram: computing islands directly interconnected with one another]
o 100 Gbit/s per PE (200 Gbit/s dual-rail)
o Up to 14.4 Tbit/s global bandwidth
o Up to 400 million MPI messages/second per PE
o More than 14 billion MPI messages/second routing capability
Latency performance example (*)
o 128 nodes: max 2 hops (rack level)
o 1024 nodes: max 3 hops (16 racks)
o Island latency < 690 ns
o Max latency, 128 nodes: < 800 ns
o Max latency, 129-1024 nodes: < 890 ns
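The hop counts above map onto the latency bounds roughly linearly. A minimal sketch of that relationship, assuming a ~100 ns per-extra-hop cost (our assumption chosen to stay within the published bounds, not an official A3Cube figure):

```python
# Toy model of worst-case latency from hop counts. The 690 ns island
# latency comes from the text; the 100 ns per-extra-hop cost is a
# hypothetical constant for illustration only.
ISLAND_LATENCY_NS = 690
HOP_LATENCY_NS = 100  # assumed per-hop cost beyond the island

def max_latency_ns(extra_hops: int) -> int:
    """Worst-case latency = island latency + cost of hops beyond the island."""
    return ISLAND_LATENCY_NS + extra_hops * HOP_LATENCY_NS

# 128 nodes add one hop beyond the island (790 ns < 800 ns bound);
# 1024 nodes add two (890 ns, matching the 890 ns bound).
```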
Lowest Latency at Scale
Latency @ 4 Byte Packets
Practical Implementation
A high-radix router is partitioned into two or three classes of links. These classes are used to create two main network objects:
a) Intra-group (Computing-Island) links, or local links
b) Inter-group links, or global links
The inter-group links can be divided into first-level and second-level links.
A group is a collection of computing nodes directly connected to the router, creating what we call a “Computing Island”.
Low-cost electrical links are used to connect the NICs in each PE to their local router and to the other routers in a group. The maximum group size is governed by the partitioning of the routers. The first level of the group normally uses copper cables, while the second level in most cases uses optical cables.
[Diagram: cable connections, showing intra-group and inter-group links]
Intra-group “Computing-Island” Links
Inter-group Links
Each router provides both “intra-group” (Computing-Island) links that connect it to the other computing nodes in its group and “inter-group” links (also known as global links) that connect it to other groups. The routers in a group pool their inter-group links, enabling each group to be directly connected to all the other groups.
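A sketch of the pooling arithmetic (illustrative function name, not part of A3Cube's software): with g groups, full group-to-group connectivity needs g-1 global links per group, spread across that group's routers:

```python
import math

def global_links_per_router(groups: int, routers_per_group: int) -> int:
    """Minimum global links each router must contribute so that its
    group can reach every other group directly (g-1 links, pooled
    across the group's routers)."""
    return math.ceil((groups - 1) / routers_per_group)
```

For example, 9 groups of 8 routers need only one global link per router, and 17 groups need two; the per-router cost grows slowly as the system scales.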
KIRA Complete Network Organization and Systems Architecture
[Diagram: example KIRA configuration with GPU nodes, KIRA CPU computing nodes and storage PEs, organized in Computing Islands]
Symmetric Scalable Architecture
o Fully symmetric network access to any distributed resource
o Same time to access data, independent of the data's position in the cluster
o Same latency to access any CPU, independent of the CPU's physical location
o Multiple concurrent dedicated interconnection links to any resource or group of resources
[Diagram: GPUs and KIRA CPU computing nodes in Computing Islands; 9.6 Tbit/s bisection bandwidth; up to 1.6 Tbit/s bidirectional per link]
The figure is an example of the bandwidth and bisection bandwidth that demonstrate KIRA's superiority over other commercial systems; it shows just one possible KIRA configuration.
Why all computing nodes can access resources at the same latency, at scale
The topology is realized with incrementally direct-connected multilevel groups of islands, each with a fixed number of nodes (resulting in a k-ary n-flat, flattened-butterfly network topology), e.g.:
a) Island: 16 nodes (switch connected)
b) 1st level: one group of 8 islands
c) 2nd level: one group of 8 first-level groups
This topological architecture results in an ultra-low diameter. Lower diameter = higher interconnection performance.
The k-ary n-flat topology can be mathematically derived from a k-ary n-fly. The incrementally direct-connected multilevel groups are composed of N/k routers of radix k′ = n(k−1)+1, where N = k^n is the size of the network. The routers are connected by channels in n′ = n−1 dimensions, corresponding to the n−1 columns of inter-rank wiring in an equivalent butterfly topology.
It can be shown that, for regular increments (like the one in the picture), the network diameter is d = n − 1 and the number of inter-router links is L = (N/k)·d·(k−1)/2.
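The k-ary n-flat relations above can be checked with a short sketch (a direct transcription of the flattened-butterfly construction; the function name is ours):

```python
def k_ary_n_flat(k: int, n: int):
    """Parameters of a k-ary n-flat derived from a k-ary n-fly:
    N = k^n terminals, N/k routers of radix k' = n(k-1)+1,
    wired in n' = n-1 inter-router dimensions (= the diameter)."""
    N = k ** n                # terminals
    routers = N // k          # one router per k terminals
    radix = n * (k - 1) + 1   # k' = n(k-1) + 1
    dims = n - 1              # n' = n - 1, also the diameter d
    return N, routers, radix, dims
```

For instance, a 4-ary 2-flat has 16 terminals served by 4 routers of radix 7, connected in a single inter-router dimension (diameter 1).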
Why all computing nodes can access resources at the same latency, at scale (continued)
The number of nodes in a flattened butterfly, plotted as a function of the number of dimensions n′ and the switch radix k′.
The diameter of the network remains very low while scaling to a very large number of nodes, using only commodity switches of radix k′, e.g.:
o Island: 16 nodes per island @ n′=0 (router), P2P latency = 0.6 µs
o Up to >300 nodes @ n′=1 (max total latency, worst case = 0.69 µs)
o Up to >1,900 nodes @ n′=2 (max total latency, worst case = 0.78 µs)
o Up to >16,000 nodes @ n′=3 (max total latency, worst case = 0.87 µs)
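The worst-case figures above follow a simple linear model: 0.6 µs island point-to-point latency plus roughly 90 ns per additional dimension level n′ (the per-level constant is our inference from the listed numbers, not a published specification):

```python
BASE_US = 0.60       # island P2P latency at n' = 0 (from the text)
PER_LEVEL_US = 0.09  # inferred per-level cost (our assumption)

def worst_case_latency_us(n_prime: int) -> float:
    """Worst-case end-to-end latency at dimension level n', in microseconds."""
    return round(BASE_US + PER_LEVEL_US * n_prime, 2)
```

The model reproduces the listed bounds: 0.69 µs at n′=1, 0.78 µs at n′=2 and 0.87 µs at n′=3.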
Why all computing nodes can access resources at the same latency, at scale (continued)
Multiple Topologies Comparison @ 128 Nodes
[Charts: network diameter (lower is better) and bisection bandwidth in Tbit/s (higher is better) for 1D Array, Ring, 2D Mesh, 2D Torus, Hypercube and A3Cube topologies at 128 nodes]
A3Cube example configuration: 16 nodes per island, 8 islands, one 36-port switch (d=1)
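The chart's qualitative ordering can be reproduced with standard textbook formulas for diameter and bisection width (bisection counted in links, with channel bandwidth normalized to 1; the 2D cases assume a 16x8 arrangement for 128 nodes, which is our assumption):

```python
def diameter_and_bisection(topology: str, n: int = 128):
    """Return (diameter, bisection width in links) for n nodes,
    using the classic formulas for each regular topology."""
    if topology == "1d array":
        return n - 1, 1                       # single line of nodes
    if topology == "ring":
        return n // 2, 2                      # cut crosses 2 links
    if topology == "2d mesh":
        r, c = 16, 8                          # assumed 16 x 8 grid
        return (r - 1) + (c - 1), min(r, c)
    if topology == "2d torus":
        r, c = 16, 8                          # wraparound halves distances
        return r // 2 + c // 2, 2 * min(r, c)
    if topology == "hypercube":
        d = n.bit_length() - 1                # n = 2^d
        return d, n // 2
    raise ValueError(topology)
```

At 128 nodes this gives diameters of 127, 64, 22, 12 and 7 respectively; a low-diameter direct topology such as the A3Cube example configuration above (d=1) sits below all of them.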
Accelerators
KIRA™ Hybrid
Support and integration for:
o Multiple hybrid processing nodes (CPU + accelerators)
o A variety of accelerators: GPUs of any kind (P100, V100, GTX series, ATI)
o CUDA and OpenCL support
o FPGAs of any type and vector processors, including NEC Vector Engines (VE)
o Multiple concurrent configurations are possible
o Built-in support (hardware and software) for the A3Cube composable multi-master architecture (derived from GRIFO™)