
NVIDIA DGX SuperPOD: DDN A³I AI400X Appliance

NVIDIA DGX A100 Reference Architecture

RA-10133-001 | October 2020

Document History

RA-10133-001

Version | Date       | Authors                           | Description of Change
001     | 2020-09-10 | Craig Tierney and Robert Sohigian | Limited release
002     | 2020-10-01 | Craig Tierney and Robert Sohigian | General availability


Abstract

The NVIDIA DGX SuperPOD™ with NVIDIA DGX™ A100 systems is a next-generation, state-of-the-art artificial intelligence (AI) supercomputing infrastructure that delivers groundbreaking performance, deploys in weeks as a fully integrated system, and is designed to solve the world's most challenging AI problems.

The groundbreaking performance delivered by the DGX SuperPOD with DGX A100 systems enables the rapid training of deep learning (DL) models at great scale. The DGX SuperPOD set records for all eight MLPerf benchmarks at scale [1]. To support this record-breaking performance, the DGX SuperPOD must be paired with a fast storage system.

In this paper, the DDN® A³I AI400X appliance was evaluated for its suitability for supporting DL workloads when connected to the DGX SuperPOD. AI400X appliances are compact, low-power storage platforms that provide incredible raw performance, using NVMe drives for storage and InfiniBand or Ethernet (including RoCE) as the network transport. Each appliance fully integrates the DDN EXAScaler parallel filesystem, which is optimized for AI and HPC applications. EXAScaler is based on the open-source Lustre filesystem with the addition of extensive performance, manageability, and reliability enhancements. Multiple appliances can be aggregated into a single filesystem to create very large namespaces. A turnkey solution validated at scale with the DGX SuperPOD, the AI400X appliance is backed by DDN global deployment and support services.

Shared parallel filesystems such as the EXAScaler filesystem simplify data access and support additional use cases where fast data access is required for efficient training and local caching is not adequate. With high single-threaded and multi-threaded read performance, they can be used for:

- Training when the datasets cannot be cached locally in DGX A100 system memory or on the DGX A100 NVMe RAID.

- Fast staging of data to local disk.

- Training with large individual data objects (e.g., uncompressed or large lossless-compressed images).

- Training with optimized data formats such as TFRecord and RecordIO.

- Workspace for long-term storage (LTS) of datasets.

- A centralized repository for the acquisition, manipulation, and sharing of results via standard protocols such as NFS, SMB, and S3.


[1] Per-chip performance arrived at by comparing performance at the same scale when possible. Per-accelerator comparison using reported performance for MLPerf v0.7 with NVIDIA DGX A100 systems (eight NVIDIA A100 GPUs per system). MLPerf IDs: DLRM: 0.7-17; ResNet-50 v1.5: 0.7-18, 0.7-15; BERT, GNMT, Mask R-CNN, SSD, Transformer: 0.7-19; MiniGo: 0.7-20. Max Scale: all results from MLPerf v0.7 using NVIDIA DGX A100 systems (eight NVIDIA A100 GPUs per system). MLPerf IDs (Max Scale): ResNet-50 v1.5: 0.7-37, Mask R-CNN: 0.7-28, SSD: 0.7-33, GNMT: 0.7-34, Transformer: 0.7-30, MiniGo: 0.7-36, BERT: 0.7-38, DLRM: 0.7-17.

MLPerf name and logo are trademarks. See www.mlperf.org for more information.


Contents

Storage Overview
About the DDN AI400X Appliance
Testing Methodology and Results
    I/O Performance by Thread
    Read Performance by System
    MLPerf Performance
Summary


Storage Overview

Training performance can be limited by the rate at which data can be read and re-read from storage. The key to performance is the ability to read data multiple times, ideally from local storage. The closer the data are cached to the GPU, the faster they can be read. Storage needs to be designed considering the hierarchy of different storage technologies, either persistent or non-persistent, to balance the needs of performance, capacity, and cost.

The storage caching hierarchy of the DGX A100 system is shown in Table 1. Depending on data size and performance needs, each tier of the hierarchy can be leveraged to maximize application performance.

Table 1. DGX A100 system storage and caching hierarchy

Storage Hierarchy Level | Technology | Total Capacity       | Performance
RAM                     | DDR4       | 1.5 TB per system    | > 200 GB/s per system
Internal Storage        | NVMe       | 15 TB per system [1] | > 50 GB/s per system

[1] Upgradable to 30 TB if required.

Caching data in local RAM provides the best performance for reads. This caching is transparent once the data are read from the filesystem. However, the size of RAM is limited to 1.5 TB on a DGX A100 system and that capacity must be shared with the operating system, application, and other system processes. The local storage on the DGX A100 system provides 15 TB of very fast NVMe. While the local storage is fast, it is not practical to manage a dynamic environment with local disk alone.
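To make the tiering decision above concrete, the sketch below estimates which tier of Table 1 a dataset of a given size could be served from. The thresholds and the /raid mount point (commonly where DGX OS mounts the local NVMe RAID) are illustrative assumptions, not part of the reference architecture.

```python
# Illustrative sketch only: estimate which caching tier of Table 1 a dataset
# of a given size could be served from. Thresholds and the /raid mount point
# are assumptions, not part of the reference design.
import os
import shutil

def pick_cache_tier(dataset_bytes, nvme_path="/raid"):
    # Total physical RAM; the page cache must share this with the OS and app.
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    nvme_free = shutil.disk_usage(nvme_path).free
    if dataset_bytes < 0.5 * ram_bytes:   # leave headroom for OS/application
        return "RAM (page cache)"
    if dataset_bytes < nvme_free:
        return "local NVMe RAID"
    return "shared parallel filesystem (e.g., AI400X)"

print(pick_cache_tier(800 * 10**9))  # e.g., an 800 GB dataset
```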

Performance requirements for high-speed storage greatly depend on the types of AI models and data formats to be used. The DGX SuperPOD has been designed as a capability-class system that can handle any workload both today and in the future. However, if systems are going to focus on a specific workload, such as natural language processing, it may be possible to better estimate performance needs of the storage system.


To allow customers to characterize their own performance requirements, general guidance on common workloads and datasets is shown in Table 2.

Table 2. Characterizing different I/O workloads

Storage Performance Level | Example Workloads | Dataset Size
Good   | Natural language processing (NLP) | Most or all datasets fit in cache
Better | Image processing with compressed images, ImageNet/ResNet-50 | Many to most datasets can fit within the local system's cache
Best   | Training with 1080p, 4K, or uncompressed images; offline inference; ETL | Datasets are too large to fit into cache; massive first-epoch I/O requirements; workflows that read the dataset only once

Table 3 provides performance estimates for the storage system necessary to meet the guidelines in Table 2. Achieving these performance characteristics may require the use of optimized file formats such as TFRecord, RecordIO, or HDF5.

Table 3. Guidelines for obtaining desired performance characteristics

Performance Characteristic | Good, GB/s | Better, GB/s | Best, GB/s
140-system aggregate read  | 50 | 140 | 450
140-system aggregate write | 20 | 50  | 225
Single-SU aggregate read   | 6  | 20  | 65
Single-SU aggregate write  | 2  | 6   | 20
Single-system read         | 2  | 4   | 20
Single-system write        | 1  | 2   | 5

The high-speed storage provides a shared view of an organization’s data to all systems. It needs to be optimized for small, random I/O patterns, and provide high peak system performance and high aggregate filesystem performance to meet the variety of workloads an organization may encounter.
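The optimized file formats mentioned above (TFRecord, RecordIO, HDF5) help meet these targets by turning many small random reads into large sequential ones. As a concrete illustration, the following minimal sketch packs small image files into a single TFRecord; the file names are hypothetical and the snippet is an example of the technique, not a pipeline used in this evaluation.

```python
# Minimal sketch: pack many small image files into one TFRecord so training
# reads become large sequential I/O. File names below are hypothetical.
import tensorflow as tf

def serialize_example(image_bytes, label):
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

with tf.io.TFRecordWriter("train-0000.tfrecord") as writer:
    for path, label in [("img0.jpg", 0), ("img1.jpg", 1)]:  # hypothetical inputs
        with open(path, "rb") as f:
            writer.write(serialize_example(f.read(), label))
```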


About the DDN AI400X Appliance

The DDN AI400X appliance (Figure 1) meets the requirements above and provides several features that are important for maximizing the performance of the DGX SuperPOD and system management in data centers.

Figure 1. AI400X appliance

Key capabilities that address these data center requirements include:

- The AI400X appliance is a scalable building block that can be easily clustered into a single namespace that scales seamlessly in capacity, performance, and capability. It is fully integrated, which streamlines deployment and simplifies management operations.

- An AI400X appliance can be configured at several different capacities ranging from 30 TB to 240 TB of NVMe and can be expanded with up to 5 PB of capacity disk in a hybrid configuration. Multiple appliances can be aggregated into a single namespace that provides up to several hundred petabytes of usable capacity.

- The AI400X appliance communicates with DGX SuperPOD clients over multiple HDR100 InfiniBand or 100/40 GbE network connections for performance, load balancing, and resiliency.

- The DDN shared parallel protocol enables a single DGX client to access the full capabilities of a single appliance: over 50 GB/s and 3M IOPS (see the striping sketch after this list). This performance is necessary for training image-based networks as image sizes grow to 1080p, 4K, and beyond. Multiple clients can access data on the appliance simultaneously, with performance available to all systems and distributed dynamically. Clients can be optimized for ultra-low-latency workloads.

- The all-NVMe architecture of the AI400X appliance provides excellent random read performance, often as fast as sequential read patterns.

- Native InfiniBand support is a key feature that maximizes performance and minimizes CPU overhead. Since the DGX SuperPOD compute fabric is InfiniBand, designing the storage fabric as InfiniBand means only one high-speed fabric type needs to be managed, simplifying operations.
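Because EXAScaler is Lustre-based, the single-client bandwidth described above depends on files being striped across many storage targets. The following is a hedged sketch using the standard Lustre lfs tool; the mount point and stripe settings are illustrative assumptions, not DDN-documented values, and DDN tooling may manage this automatically.

```python
# Hedged sketch: configure wide striping on a Lustre-based filesystem such as
# EXAScaler so a single client can drive many object storage targets (OSTs)
# at once. Mount point and stripe parameters are illustrative assumptions.
import subprocess

subprocess.run(
    ["lfs", "setstripe",
     "-c", "-1",                  # stripe across all available OSTs
     "-S", "16m",                 # 16 MB stripe size, suited to large sequential I/O
     "/mnt/exascaler/dataset"],   # hypothetical directory on the shared filesystem
    check=True,
)
```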


Testing Methodology and Results

For the DGX SuperPOD, the storage system was designed to meet the "best" criteria in Table 3. To accomplish this, a single filesystem was built with ten AI400X appliances. Expected aggregate read performance is over 500 GB/s. Total usable capacity of the filesystem is 2.5 PB.

The storage was attached to a separate InfiniBand fabric on the DGX SuperPOD. Each system of the DGX SuperPOD was connected with two NVIDIA Mellanox® HDR InfiniBand adapters, allowing for over 45 GB/s per system.

Filesystem tests were conducted two ways. First, microbenchmarks were run with IOR to determine the maximum capability of the filesystem. Second, some of the MLPerf benchmarks were used to evaluate the performance of real applications. The MLPerf benchmarks were run on both the local NVMe RAID and the AI400X appliance.

The following sections highlight the tests used to gauge the overall capability of the AI400X appliance for DL workloads. The three tests are I/O performance by thread, read performance by system, and DL training performance. The read benchmarks best correlate to the requirements of DL applications. The DL training benchmark provides an example of how the performance benefits of the NVMe-based AI400X appliance translate into real application performance.

Note: All testing was done on a production filesystem. The additional load may have caused variability in the measurements taken for this report. Higher performance numbers are expected on a dedicated system.


I/O Performance by Thread

Read performance, both sequential and random, is important for DL training, while write performance matters for data preparation. Figure 2 shows the sequential write and random write performance of the AI400X appliance as thread count is increased.

Tests were run with Direct I/O; 128 KB, 1 MB, and 16 MB block sizes; 16 GB per thread; each thread operating on its own file; fsync on close; and the cache was dropped between the write and read phases. Performance was measured with IOR.

Figure 2. Single node sequential write and random write performance

At each block size, random and sequential write performance is similar. In addition, I/O performance for larger block sizes exceeds 40 GB/s. While write operations are not as common as read operations in many machine learning workflows, being able to write data at these rates minimizes data scientists' time when pre-processing data and preparing new datasets.
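The IOR parameters described above map roughly to command-line flags; the following sketch shows one plausible invocation. The flags and path are assumptions to be checked against the documentation for the IOR version in use, and dropping the page cache between phases is done out-of-band by the test harness.

```python
# Plausible IOR invocation matching the parameters described above (sketch
# only; verify flags against your IOR version). -B = O_DIRECT (Direct I/O),
# -b = bytes per thread, -t = transfer (block) size, -F = file per thread,
# -e = fsync on close, -z = random rather than sequential offsets.
import subprocess

def run_ior(phase, transfer_size, nprocs=32, random_access=False,
            testfile="/mnt/exascaler/ior/testfile"):  # hypothetical path
    cmd = ["mpirun", "-np", str(nprocs), "ior",
           "-a", "POSIX", "-B", "-F", "-e",
           "-b", "16g",           # 16 GB per thread
           "-t", transfer_size,   # 128k, 1m, or 16m
           phase,                 # "-w" for the write phase, "-r" for the read phase
           "-o", testfile]
    if random_access:
        cmd.append("-z")
    subprocess.run(cmd, check=True)

for ts in ("128k", "1m", "16m"):
    run_ior("-w", ts)
    # drop the page cache here (out-of-band, e.g. via sysctl vm.drop_caches)
    run_ior("-r", ts)
```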


Figure 3 shows that sequential and random reads are close in performance at every thread count for large block sizes. Even for the smaller block size, 128 KB, sequential I/O scales linearly with the number of threads and the number of IOPS that can be generated per thread. Peak single-node performance exceeds 45 GB/s. With larger I/O block sizes, it takes only 32 threads to achieve over 25 GB/s.

Figure 3. Sequential read and random read performance

Single-node I/O from the DDN AI400X appliances is able to achieve over 45 GB/s for both reads and writes at block sizes larger than 1 MB. That approaches the line rate of the two 200 Gbps InfiniBand network adapters. In addition, performance of smaller I/O operations is still excellent, providing the performance needed for the fastest training with smaller data types.


Read Performance by System

Figure 4 shows the tests used to verify the peak aggregate performance of the filesystem as the number of systems increases. As illustrated in the figure, large-block sequential and random reads have similar performance. Performance exceeds 500 GB/s with 64 clients.

Performance was measured with IOR. Tests were run with Direct I/O; 16 MB block size; 8 GB per thread; each thread operating on its own file; fsync on close; and the cache was dropped between the write and read phases.

Figure 4. Aggregate sequential read and random read performance

Read performance scales linearly up to 16 systems, where total bandwidth of the entire filesystem is nearly achieved. Peak performance achieved was over 510 GB/s for both sequential and random reads.


MLPerf Performance

While microbenchmarks such as IOR are important tools for characterizing filesystem performance, real application performance is the best metric. The MLPerf benchmarks provide a variety of different DL models with different I/O requirements that are good representations of DL workloads today.

In the tests below, MLPerf training is run on both the local NVMe RAID filesystem and the AI400X appliances. The local NVMe RAID filesystem is built from four NVMe drives and uses RAID0 in software for maximum performance. The local NVMe RAID can achieve over 50 GB/s of read performance with very fast metadata.

Image Classification (ResNet-50)

ResNet-50 is the current standard for DL benchmarks. It is an I/O-intensive workload, processing over 20,000 images per second on a DGX A100 system. The average size of an image in the ImageNet database is 125 KB. This translates into needing to read data at approximately 3 GB/s.
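That bandwidth figure follows from simple arithmetic, reproduced here as a quick check (the numbers come from the text above, not new measurements):

```python
# Back-of-the-envelope check of the required read bandwidth quoted above.
images_per_sec = 20_000        # ResNet-50 throughput on one DGX A100 system
avg_image_bytes = 125 * 1024   # average ImageNet image size, 125 KB
required = images_per_sec * avg_image_bytes / 1e9
print(f"~{required:.1f} GB/s")  # ~2.6 GB/s, i.e. roughly 3 GB/s per system
```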

Table 4 shows the ResNet-50 training performance from MLPerf v0.7 on the AI400X appliance relative to local RAID. In this benchmark, the data are formatted as RecordIO files. The NVIDIA Data Loading Library (DALI™) I/O framework is used for reading the data. Both MMAP and buffered I/O are tested.

Table 4. Relative performance of MLPerf v0.7 ResNet-50 on AI400X appliances vs. local RAID

Benchmark Case           | Local RAID (sec) | AI400X (sec) | Relative Performance [1]
1 System, MMAP I/O       | 2432.0 | 2412.0 | 100.8%
1 System, Buffered I/O   | 2408.0 | 2429.0 | 99.1%
2 Systems, MMAP I/O      | 1449.0 | 1451.0 | 99.9%
2 Systems, Buffered I/O  | 1436.0 | 1433.0 | 100.2%
96 Systems, MMAP I/O     | 96.8   | 99.4   | 97.3%
96 Systems, Buffered I/O | 97.4   | 99.2   | 98.1%

[1] Cache was purged on every system prior to starting the tests. Timings reported are the average of five runs.

The performance when training with the dataset on AI400X appliances is essentially the same as reading from the local NVMe RAID. When running on 96 systems, the relative difference is about 2-3%. The aggregate performance of the local RAID across 96 systems is nearly 5 TB/s with high metadata rates. The 2-3% difference for the largest run represents only 2 to 3 seconds.
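For context, the MMAP versus buffered I/O distinction in Table 4 corresponds to a reader-level switch in DALI. The sketch below is a minimal, hypothetical RecordIO pipeline, not the actual MLPerf configuration; the file paths and pipeline parameters are assumptions, and dont_use_mmap is the DALI reader option that selects buffered reads over memory mapping.

```python
# Minimal DALI RecordIO pipeline sketch (not the MLPerf v0.7 configuration).
# Paths and parameters are hypothetical; dont_use_mmap toggles buffered I/O
# (True) versus MMAP (False), the distinction benchmarked in Table 4.
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=256, num_threads=8, device_id=0)
def recordio_pipeline(use_buffered_io):
    jpegs, labels = fn.readers.mxnet(
        path=["/data/imagenet/train.rec"],        # hypothetical RecordIO file
        index_path=["/data/imagenet/train.idx"],  # and its index
        random_shuffle=True,
        dont_use_mmap=use_buffered_io,
        name="Reader",
    )
    images = fn.decoders.image(jpegs, device="mixed")  # JPEG decode on GPU
    images = fn.resize(images, resize_shorter=256)
    return images, labels

pipe = recordio_pipeline(use_buffered_io=False)  # False = MMAP I/O
pipe.build()
images, labels = pipe.run()
```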


Language Modeling (BERT)

Bidirectional Encoder Representations from Transformers (BERT) is a key language model that can be used in a variety of ways, including sentence completion, question answering, and named entity recognition.

Table 5 shows the relative training performance for BERT between the DDN AI400X appliance and local RAID at several system counts.

Table 5. Relative performance of MLPerf v0.7 BERT on AI400X appliances vs. local RAID

Benchmark Case | Local RAID (sec) | AI400X (sec) | Relative Performance [1]
1 System    | 3425 | 3427 | 99.9%
2 Systems   | 1867 | 1866 | 100.0%
32 Systems  | 193  | 195  | 99.0%
128 Systems | 114  | 122  | 93.4%

[1] Cache was purged on every system prior to starting the tests. Timings reported are the average of five runs.

Relative training performance between using pre-staged data on the local RAID and reading from the shared AI400X appliances is similar for all but the largest system count. Even in that case, the difference in training time is only eight seconds. The aggregate performance of the local RAID on 128 systems exceeds 6 TB/s, and metadata performance scales linearly since there is no sharing. Even with this comparison, performance is close, and further tuning should close the gap.


Summary

The results in this paper show that the AI400X appliance is a great choice to pair with a DGX SuperPOD to meet current and future storage needs.

As storage requirements grow, AI400X building blocks can be added to seamlessly scale capacity, performance, and capability. The combination of NVMe hardware and DDN's parallel filesystem architecture provides excellent random read performance, often just as fast as sequential read patterns. With the 2U form factor, storage pools can be configured to meet or exceed the performance requirements of a DGX SuperPOD. Testing has validated that each AI400X can deliver read performance in excess of 20 GB/s, growing predictably to an aggregate performance of over 450 GB/s for an entire filesystem as workflows and infrastructure demands increase.

DGX SuperPOD customers should be assured that the AI400X will meet their most demanding I/O needs.



NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051 http://www.nvidia.com

Notice

The information provided in this specification is believed to be accurate and reliable as of the date provided. However, NVIDIA Corporation ("NVIDIA") does not give any representations or warranties, expressed or implied, as to the accuracy or completeness of such information. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This publication supersedes and replaces all other specifications for the product that may have been previously supplied.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and other changes to this specification, at any time and/or to discontinue any product or service without notice. Customer should obtain the latest relevant specification before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer. NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this specification.

NVIDIA products are not designed, authorized or warranted to be suitable for use in medical, military, aircraft, space or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer's own risk.

NVIDIA makes no representation or warranty that products based on these specifications will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application to avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this specification.

NVIDIA does not accept any liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this specification, or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this specification. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this specification is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

Trademarks

NVIDIA, the NVIDIA logo, NVIDIA Mellanox, NVIDIA DGX, and NVIDIA DGX SuperPOD are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2020 NVIDIA Corporation. All rights reserved.