KIRA™ REDEFINES SUPERCOMPUTING
BIG DATA ANALYTICS, AI, TECHNICAL COMPUTING
(c) A3Cube
To enable the next-generation HPC, High-Performance Data (HPD), Machine Learning (ML)
and Predictive Analytics (PA) revolution at scale, without compromise
A3Cube KIRA™ The versatile supercomputer system
KIRA™ Versatile Supercomputer
Exascale Data Analytics
Advanced HPC
Exascale AI
KIRA™ hardware and software innovations tackle system bottlenecks in processing, data
movement and I/O.
KIRA™ eliminates the distinction between clusters and supercomputers, providing a rich
software and system interconnect in different form factors.
KIRA™ allows for multiple processor and accelerator architectures and a choice of system
interconnect technologies and topologies, including our new RONNIEE Express interconnect.
Increasingly, high-performance computing systems today need to handle massive
converged modeling, simulation, AI and analytics workloads.
KIRA™ is an entirely new design, created from the ground up to address these needs. Built to
be data-centric, it runs the fastest and most diverse workloads all at the same time.
KIRA-CS™ uses holistic and integrated methods to deliver the world’s most complete, versatile and robust HPC systems. Rather than assembling cluster-like components and commodity networks that can degrade or behave unpredictably at high node counts, A3Cube uses a ground-up design approach: a deeply optimized fabric and topology architecture, combined with a “system-centric” file system and storage orchestrator, plus an innovative built-in “composable” resource-sharing capability that guarantees linear scaling in computation, I/O performance and coprocessor implementation. Regardless of your application requirements, the KIRA-CS series scales across the performance spectrum of system sizes and application types, from smaller-footprint, lower-density configurations up to the most complex data analytics challenges.
● Sustained and scalable application performance
● Tight HPC optimization and integration
● Upgradability by design
● User productivity
KIRA™ Versatile Supercomputer
Four Architectural Keys, perfectly orchestrated
KIRA™ represents our exascale architecture for the most demanding tasks, yet it is incredibly simple to configure, manage and operate.
Processing
Data movement and I/O
System interconnect
Accelerators
Processing
Processing Element Architecture
[Processing Element block diagram: CPU1 and CPU2, each with a local data array, connected through a 200 Gbit/s NIC to the router; callouts A-D described below]
High density processing element based on cutting edge CPU, Memory and I/O
A: Two AMD Epyc processors (64 cores, 128 threads total)
B: Eight (8) memory channels per CPU (25% better bandwidth than x86 competitors)
C: Balanced low-latency, high-speed interconnect (with multiple protocol support)
D: Built-in storage subsystem combined with global filesystem sharing for maximum data-access performance, optimized for data-intensive and ML operations
Processing
KIRA™ Rack
KIRA Rack Configuration
A: Up to 128 AMD Epyc processors (4096 cores, 8192 threads total)
B: Up to 128 TByte of memory per rack
C: Up to 364 local disks for local data access or parallel converged storage
D: Ultra-low-diameter interconnect for maximum performance, lowest latency and best bisection bandwidth possible at scale (<0.8 µs, 1000 nodes)
E: Support for multiple protocols: RDMA (OFED), TCP/IP, memory semantics (*)
F: Optimized Linux and software suite, preconfigured and specifically tuned (FORTISSIMO FOUNDATION)
G: Multi-level graphical system orchestrator with built-in multi-storage configuration and management, coprocessor management and allocation, and system memory, interconnect and OS tuning
Processing / Data movement and I/O
Optimized software and file-system architecture to deliver the best data access at scale, in any situation and with any application.
FORTISSIMO FOUNDATION™ provides the optimized kernel for the system, including the file system and I/O drivers.
Rhapsody System Manager™ allows different file systems with different configurations to coexist inside the same machine at the same time, providing incredible data-access configuration flexibility.
Best data access at the single node; best data access at scale, with a global distributed parallel file system and pervasive RDMA support.
Support for converged, diverged, aggregated and disaggregated storage configurations lets the system run multiple different workloads while always achieving the best performance.
Unique I/O and data-access technology makes KIRA™ a unique supercomputer in the whole HPC landscape. No configuration pain, from the simplest application to the most demanding multi-workload scenarios running concurrently on the same system.
Processing / Data movement and I/O / System interconnect
Keeping latency low is the new application mantra
Computational efficiency is a function of the efficiency of the interconnect. Low-speed interconnects (read: high latency), even with high bandwidth, are not good for modern applications like big data and ML. Low efficiency means that to reach a goal you spend more time, add more hardware, consume more electricity, pay a bigger bill and incur a higher TCO than with a balanced system.
[Chart: distributed performance as a function of latency]
Latency is the most critical performance factor in a system because it directly affects system data-exchange time. Latency means lost time: time that could have been spent productively producing computational results is instead spent waiting for the data. Latency is the “application stealth tax”, silently extending the elapsed times of individual computational tasks and processes, which then take longer to execute.
A3Cube’s Hybrid Topology Fabric is a “direct” network, avoiding the need for external top-of-rack switches. It is designed to scale to a large number of nodes and to reduce the number of optical links required for a given global bandwidth compared to a traditional fabric such as a fat tree, demonstrating the value of low-diameter networks.
A3Cube creates many small computing islands connected with the shortest cabling possible, then joins these islands in an ultra-short-diameter network in which each group as a whole has enough links to connect directly to all of the other groups in the system. The result is an ultra-low-latency system with extremely high bisection bandwidth.
[Diagram: computing islands directly interconnected with one another]
o 100 Gbit/s per PE (200 Gbit/s dual-rail)
o Up to 14.4 Tbit/s global bandwidth
o Up to 400 million MPI messages/second per PE
o More than 14 billion MPI messages/second routing capability
Latency performance example (*)
o 128 nodes: max 2 hops (rack level)
o 1024 nodes: max 3 hops (16 racks)
o Island latency < 690 ns
o Max latency, 128 nodes: < 800 ns
o Max latency, 129-1024 nodes: < 890 ns
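The hop counts above map onto the latency bounds roughly linearly. A minimal sketch of that relationship, assuming a ~100 ns per-extra-hop cost (our assumption chosen to stay within the published bounds, not an official A3Cube figure):

```python
# Toy model of worst-case latency from hop counts. The 690 ns island
# latency comes from the text; the 100 ns per-extra-hop cost is a
# hypothetical constant for illustration only.
ISLAND_LATENCY_NS = 690
HOP_LATENCY_NS = 100  # assumed per-hop cost beyond the island

def max_latency_ns(extra_hops: int) -> int:
    """Worst-case latency = island latency + cost of hops beyond the island."""
    return ISLAND_LATENCY_NS + extra_hops * HOP_LATENCY_NS

# 128 nodes add one hop beyond the island (790 ns < 800 ns bound);
# 1024 nodes add two (890 ns, matching the 890 ns bound).
```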
Lowest Latency at Scale
Latency @ 4 Byte Packets
Practical Implementation
A high-radix router is partitioned into two or three classes of links. These classes are used to create two main network objects:
a) Intra-group (Computing-Island) links, or local links
b) Inter-group links, or global links
The inter-group links can be divided into first-level and second-level links.
A group is a collection of computing nodes directly connected to the router, creating what we call a “Computing Island”.
Low-cost electrical links are used to connect the NICs in each PE to their local router and to the other routers in a group. The maximum group size is governed by the partitioning of the routers. The first level of the group normally uses copper cables, while the second level in most cases uses optical cables.
[Diagram: cable connections, showing intra-group and inter-group links]
Intra-group “Computing-Island” Links
Inter-group Links
Each router provides both “intra-group” (Computing-Island) links that connect it to the other computing nodes in its group and “inter-group” links (also known as global links) that connect it to other groups. The routers in a group pool their inter-group links, enabling each group to be directly connected to all the other groups.
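A sketch of the pooling arithmetic (illustrative function name, not part of A3Cube's software): with g groups, full group-to-group connectivity needs g-1 global links per group, spread across that group's routers:

```python
import math

def global_links_per_router(groups: int, routers_per_group: int) -> int:
    """Minimum global links each router must contribute so that its
    group can reach every other group directly (g-1 links, pooled
    across the group's routers)."""
    return math.ceil((groups - 1) / routers_per_group)
```

For example, 9 groups of 8 routers need only one global link per router, and 17 groups need two; the per-router cost grows slowly as the system scales.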
KIRA Complete Network Organization and Systems Architecture
[Diagram: example KIRA configuration with GPU nodes, KIRA CPU computing nodes and storage PEs, organized in Computing Islands]
Symmetric Scalable Architecture
o Fully symmetric network access to any distributed resource
o Same time to access data, independent of the data's position in the cluster
o Same latency to access any CPU, independent of the CPU's physical location
o Multiple concurrent dedicated interconnection links to any resource or group of resources
[Diagram: GPUs and KIRA CPU computing nodes in Computing Islands; 9.6 Tbit/s bisection bandwidth; up to 1.6 Tbit/s bidirectional per link]
The figure is an example of the bandwidth and bisection bandwidth that demonstrate KIRA's superiority over other commercial systems; it shows just one possible KIRA configuration.
Why all computing nodes can access resources at the same latency, at scale
The topology is realized with incrementally direct-connected multilevel groups of islands, each with a fixed number of nodes (resulting in a k-ary n-flat, flattened-butterfly network topology), e.g.:
a) Island: 16 nodes (switch connected)
b) 1st level: one group of 8 islands
c) 2nd level: one group of 8 first-level groups
This topological architecture results in an ultra-low diameter. Lower diameter = higher interconnection performance.
The k-ary n-flat topology can be mathematically derived from a k-ary n-fly. The incrementally direct-connected multilevel groups are composed of N/k routers of radix k′ = n(k−1)+1, where N = k^n is the size of the network. The routers are connected by channels in n′ = n−1 dimensions, corresponding to the n−1 columns of inter-rank wiring in an equivalent butterfly topology.
It can be shown that, for regular increments (like the one in the picture), the network diameter is d = n − 1 and the number of inter-router links is L = (N/k)·d·(k−1)/2.
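The k-ary n-flat relations above can be checked with a short sketch (a direct transcription of the flattened-butterfly construction; the function name is ours):

```python
def k_ary_n_flat(k: int, n: int):
    """Parameters of a k-ary n-flat derived from a k-ary n-fly:
    N = k^n terminals, N/k routers of radix k' = n(k-1)+1,
    wired in n' = n-1 inter-router dimensions (= the diameter)."""
    N = k ** n                # terminals
    routers = N // k          # one router per k terminals
    radix = n * (k - 1) + 1   # k' = n(k-1) + 1
    dims = n - 1              # n' = n - 1, also the diameter d
    return N, routers, radix, dims
```

For instance, a 4-ary 2-flat has 16 terminals served by 4 routers of radix 7, connected in a single inter-router dimension (diameter 1).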
Why all computing nodes can access resources at the same latency, at scale (continued)
The number of nodes in a flattened butterfly, plotted as a function of the number of dimensions n′ and the switch radix k′.
The diameter of the network remains very low while scaling to a very large number of nodes, using only commodity switches of radix k′, e.g.:
o Island: 16 nodes per island @ n′=0 (router), P2P latency = 0.6 µs
o Up to >300 nodes @ n′=1 (max total latency, worst case = 0.69 µs)
o Up to >1,900 nodes @ n′=2 (max total latency, worst case = 0.78 µs)
o Up to >16,000 nodes @ n′=3 (max total latency, worst case = 0.87 µs)
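The worst-case figures above follow a simple linear model: 0.6 µs island point-to-point latency plus roughly 90 ns per additional dimension level n′ (the per-level constant is our inference from the listed numbers, not a published specification):

```python
BASE_US = 0.60       # island P2P latency at n' = 0 (from the text)
PER_LEVEL_US = 0.09  # inferred per-level cost (our assumption)

def worst_case_latency_us(n_prime: int) -> float:
    """Worst-case end-to-end latency at dimension level n', in microseconds."""
    return round(BASE_US + PER_LEVEL_US * n_prime, 2)
```

The model reproduces the listed bounds: 0.69 µs at n′=1, 0.78 µs at n′=2 and 0.87 µs at n′=3.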
Why all computing nodes can access resources at the same latency, at scale (continued)
Multiple Topologies Comparison @ 128 Nodes
[Charts: network diameter (lower is better) and bisection bandwidth in Tbit/s (higher is better) for 1D Array, Ring, 2D Mesh, 2D Torus, Hypercube and A3Cube topologies at 128 nodes]
A3Cube example configuration: 16 nodes per island, 8 islands, one 36-port switch (d=1)
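The chart's qualitative ordering can be reproduced with standard textbook formulas for diameter and bisection width (bisection counted in links, with channel bandwidth normalized to 1; the 2D cases assume a 16x8 arrangement for 128 nodes, which is our assumption):

```python
def diameter_and_bisection(topology: str, n: int = 128):
    """Return (diameter, bisection width in links) for n nodes,
    using the classic formulas for each regular topology."""
    if topology == "1d array":
        return n - 1, 1                       # single line of nodes
    if topology == "ring":
        return n // 2, 2                      # cut crosses 2 links
    if topology == "2d mesh":
        r, c = 16, 8                          # assumed 16 x 8 grid
        return (r - 1) + (c - 1), min(r, c)
    if topology == "2d torus":
        r, c = 16, 8                          # wraparound halves distances
        return r // 2 + c // 2, 2 * min(r, c)
    if topology == "hypercube":
        d = n.bit_length() - 1                # n = 2^d
        return d, n // 2
    raise ValueError(topology)
```

At 128 nodes this gives diameters of 127, 64, 22, 12 and 7 respectively; a low-diameter direct topology such as the A3Cube example configuration above (d=1) sits below all of them.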
Accelerators
KIRA™ Hybrid
Support and integration for:
o Multiple hybrid processing nodes (CPU + accelerators)
o A variety of accelerators: GPUs of any kind (P100, V100, GTX series, ATI)
o CUDA and OpenCL support
o FPGAs of any type and vector processors, including NEC Vector Engines (VE)
o Multiple concurrent configurations are possible
o Built-in support (hardware and software) for the A3Cube composable multi-master architecture (derived from GRIFO™)