
HPC Innovation Exchange
A series that examines key trends, technology infrastructures and solutions within high-performance computing.

The Data Accelerator
A detailed examination of how the University of Cambridge, Dell EMC and Intel solved HPC I/O bottlenecks by co-developing the world’s fastest open source, software-defined NVMe storage solution.


Contents

Outline
Introduction
1 The HPC I/O bottleneck
  1.1 Generic burst buffers
  1.2 Cambridge “Data Accelerator” overview
    1.2.1 Freely available code repository and detailed “How-To”
    1.2.2 Data Accelerator workflows
    1.2.3 Burst buffer lifecycle
    1.2.4 Burst buffer access modes
    1.2.5 Data Accelerator use cases
2 Architecture
  2.1 Server & storage hardware
  2.2 Placement of DACs within the OPA network
  2.3 NVMe and Lustre file system configuration
  2.4 Slurm and Orchestration software
3 Synthetic benchmark overview
  3.1 Motivation for DAC performance study
  3.2 Synthetic benchmark tools
  3.3 Synthetic benchmark summary
4 Synthetic benchmark results
  4.1 Bandwidth scaling within a single DAC
  4.2 Bandwidth scaling across multiple DACs
  4.3 Bandwidth limit of a single client node
  4.4 Bandwidth — comparison to HDD Lustre system
  4.5 Metadata performance of a single DAC
  4.6 Metadata limit of a single client node
  4.7 Metadata scaling across multiple DACs
  4.8 Metadata — comparison to HDD Lustre system
  4.9 IOPS scaling within a single DAC
  4.10 IOPS scaling across multiple DACs
  4.11 IOPS limit of a single client node
  4.12 IOPS — comparison to HDD Lustre system
5 Application use cases
  5.1 Large checkpoint restart and simulation output data
  5.2 Relion — Cryo-EM structural biology
  5.3 Square Kilometre Array — Science Data Processor benchmark
6 Discussion
7 References

Traditional centralised HPC file systems, built from spinning HDD, can create I/O bottlenecks and produce unresponsive systems, significantly reducing the return on investment in HPC solutions. As workloads become ever more data-intensive, these bottlenecks are only exacerbated. They frustrate and restrain the art of the possible in the mind of the researcher, limiting the imagination of the Data Scientists, Engineers and Astronomers who wish to make tomorrow’s discoveries today.

The Cambridge Dell EMC Data Accelerator removes these limitations. It is architected to integrate with traditional HPC storage without the need for re-design, to interact with commonly used scheduling tools, to be completely open source and extensible, and to make optimal use of modern SSD and fabric technologies.

With special thanks to: Paul Calleja, Matthew Raso-Barnett, Alasdair King, Jeffrey Salmond, Wojciech Turek, John Garbutt, John Taylor, Jonathan Steel


During the development and testing of the DAC solution, a comprehensive set of I/O performance tests was undertaken to probe the scalability and efficiency of the solution. In addition, issues such as maximum single client node performance and OPA network saturation are examined. A final DAC configuration is thus described that provides very high absolute performance with excellent scalability and efficiency compared to the raw NVMe performance. The performance of the DAC is summarised in the tables below:

Table 1 Single DAC node performance with 12 P4600 NVMe

                                                   Read    Write   Read 4K   Write 4K   Metadata
                                                   GB/s    GB/s    IOPS      IOPS       Create   Stat    Delete
Performance of all 12 NVMe within a single DAC     22      14.7    5.6M      4M         83K      136K    69K
Efficiency relative to spec data for 12 drives     58%     93%     N/A       N/A        N/A      N/A     N/A

Table 2 Multiple DAC node performance running across 24 DACs

                          Read    Write   Read    Write   Metadata
                          GB/s    GB/s    IOPS    IOPS    Create   Stat    Delete
No. of DAC nodes          24      24      24      24      24       24      24
Multi DAC performance     513     340     15M     28M     1.3M     2.9M    0.7M
Scaling efficiency        97%     96%     N/A     N/A     N/A      N/A     N/A

To test the real-world benefits of the DAC, three HPC application use cases have been examined: HPC simulations with large checkpoint restart requirements (the so-called Burst Buffer problem); a challenging contemporary structural biology Cryo-EM data set; and a very demanding SKA performance prototype. When executed on standard parallel file systems constructed with HDD, these applications exhibited significant I/O overheads, which were completely eliminated when using the DAC. With the checkpoint restart use case alone, it is calculated that the cost of the DAC would be recovered after one year of continuous usage in terms of the increased throughput of the system.

Future work with the DAC will develop alternative file system deployments using the DAC Orchestrator, such as BeeGFS and XFS/ext4 via NVMe over fabrics. We will also investigate I/O monitoring and profiling on a ‘per job’ basis at both the client and the server. Finally, we will investigate a wider range of application test cases within the materials, astronomy, engineering, bioinformatics, medical imaging and AI domains, in order to build up an understanding of how best to use tiered storage of this nature within a production HPC environment.

Outline

This paper describes the high-level architecture, comprehensive synthetic benchmarking, and initial application testing of a performance-optimised prototype and reference implementation of a solid-state storage Data Accelerator (DAC) subsystem, which can be deployed as an add-on to Slurm-enabled [1] HPC systems, significantly boosting the performance of data-intensive HPC and AI workloads. The DAC described here is currently deployed as part of the Dell EMC Cumulus Cloud HPC system [2] at the University of Cambridge and is accessible from conventional HPC systems and Slurm as a Service [3] platforms on top of OpenStack-enabled infrastructure. The work was undertaken as a co-design project involving the University of Cambridge, Dell EMC, Intel and StackHPC, and represents the first demonstrator of an open source, non-proprietary solid-state burst buffer implementation. This demonstrator implementation of the Cambridge Dell EMC DAC was so successful that it reached #1 in the June 2019 IO-500 [4] world HPC storage ranking, making this early prototype system the fastest HPC storage solution in the world, with almost twice the performance of the second-placed entry on the list. With a footprint of just one rack of storage, the solution was able to deliver over 500 GB/s of bandwidth, 1.3 million file creates per second and 28M IOPS.

The motivation for the DAC is three-fold:

a. To alleviate the performance bottleneck often experienced by data-intensive HPC and AI applications running on central networked file systems.

b. To provide deterministic, high-performance, schedulable I/O resources for these applications, delivering breakthrough I/O performance for data-intensive workloads.

c. To create an open source software solution, using infrastructure-as-code and cloud-native technologies built on readily available server and networking technology, with all the configuration scripts and detailed build instructions freely available, thereby making the technology as accessible as possible and promoting the further development and testing of solid-state I/O accelerators within the HPC domain.

From a hardware perspective, the DAC exploits modern NVMe SSD storage packaged in a form that balances I/O connectivity, providing a highly scalable storage subsystem. The current DAC provides an on-demand Lustre parallel file system, created on a ‘per job’ basis and fully integrated with the open source Slurm HPC job scheduler, allowing stage-in/stage-out data to be defined at job submission time, along with the size, performance and usage mode of the SSD Lustre file system to be created. The creation and lifecycle management of the Lustre file system is handled by a new open source software component developed at the University of Cambridge called the “DAC Orchestrator” [5].

The DAC system deployed at Cambridge sits within a 1,152-node Intel Skylake Omni-Path (OPA) cluster and comprises 24 so-called ‘DAC server nodes’ providing around 500TB of usable capacity that can be presented as a Lustre parallel file system. A single DAC server node consists of a Dell EMC R740xd server, 12 Intel P4600 1.6TB NVMe drives and two Intel Omni-Path low latency, high bandwidth network adapters.



Introduction

1 The HPC I/O bottleneck

Over recent years, scientific and technical workloads have become ever more data-centric, associated with burgeoning volumes of data coupled with the emergence of data analytics and AI/machine learning workloads, which are often both bandwidth and latency (IOPS) sensitive. Traditional HPC architectures commonly involve many thousands of compute cores attached to a single, or more commonly now multiple, large centralised parallel file systems, with hundreds of HPC applications running concurrently and placing simultaneously high demands on monolithic central parallel file systems. This not only causes an I/O bottleneck, degrading HPC application performance as a result of increased I/O wait time, but during periods of high I/O intensity such file systems can also be rendered inoperable for normal file system operations.

Although the aggregate bandwidth of a large spinning-disk HPC file system can be high, the file system performance seen by a single HPC application spanning multiple files, or a single shared file, is often much lower. This low single-file performance on traditional disk-based parallel file systems arises because file stripe patterns tend to cross only small elements of the total file system. On parallel file systems constructed with spinning disk there quickly comes a point at which performance gains from increased striping are offset by increasing contention. As a result, it is common to find large central file systems with hundreds or even thousands of GB/s of total aggregate performance only realising a fraction of this in practice due to such operational issues.

Another factor that can add to the I/O bottleneck on a large spinning-disk parallel file system is the need for long-term data resiliency. The bigger the file system, the more moving parts it contains that can fail. To ensure that a large monolithic storage system does not fail, causing long downtimes for its users, data protection techniques such as RAID are used, typically with dual or even triple parity. While RAID provides additional resiliency and protects against disk failures, it also introduces a significant performance hit. Further resiliency is provided by using hardware with redundant components, which raises the TCO even higher. The Cambridge Dell EMC DAC completely removes the need for this resiliency by introducing the concept of short-lived, transient parallel file systems that are tightly bound to the user workload and controlled by the workload manager (Slurm).

The poor application I/O performance obtained with spinning-disk central HPC file systems reduces the return on investment from current-day HPC systems, and unresponsive interactive sessions deliver a poor user experience. This situation continues to worsen as workloads become more data-intensive. Even more pressing is that we now have more data-intensive use cases than we can exploit, due to a lack of I/O capability. This restrains the art of the possible in the mind of the researcher, limiting the imagination of the Data Scientists, Engineers or Astronomers wishing to make tomorrow’s discoveries. The Cambridge Dell EMC DAC removes this limitation, unbinding the art of the possible and unlocking tomorrow’s discoveries today, while reducing time to discovery and improving ROI on HPC/AI solutions.

1.1 Generic burst buffers

The work presented here is motivated by the generalisation of the ‘Burst Buffer’ concept and the adoption of state-of-the-art SSD technologies, high-performance networking, and software CI/CD methodologies for ease of integration and extensibility. The use of SSD technology within large-scale HPC system architectures has emerged over recent years as supercomputing sites have attempted to overcome the growing HPC I/O bottleneck, helped by the rapid commoditisation of this technology driven by other markets. Early Burst Buffer work can be seen at Los Alamos National Laboratory with the Trinity system [6], initially for checkpoint/restart of large running simulations. Another early use of flash-accelerated file systems for HPC was at the San Diego Supercomputer Center (SDSC) with their system called Flash Gordon [7], which was primarily motivated by underlying application acceleration. The work described here builds on this early Burst Buffer work and extends the functionality. As a result, the term ‘Burst Buffer’ no longer seems appropriate, and the term “Data Accelerator”, or “DAC”, was coined.

1.2 Cambridge “Data Accelerator” overview

To mitigate these I/O-related performance issues and to enable the next generation of data-intensive workflows, we describe here the high-level architecture, performance characterisation and integration of an NVMe-based Data Accelerator within the Cambridge HPC cluster. The solution uses the Intel P4600, a PCIe-based SSD drive, combined with Intel Omni-Path HPC networking, packaged with the Dell EMC R740xd 14G server platform. On top of this storage, server and network hardware layer sits a software layer allowing the creation and deletion of an on-demand, per-job Lustre parallel file system. The software-defined design approach exploits modern cloud methodologies to promote a reproducible and extensible code base, making heavy use of Ansible for infrastructure as code, together with etcd, a distributed key-value store used, for example, by Kubernetes.

The DAC forms an integral part of the Cumulus Science Cloud system at the University of Cambridge, as shown in Figure 1 below:

Figure 1 Cumulus Cloud at the University of Cambridge showing how the DAC fits into the overall hardware architecture. (The figure shows the CSD3-Skylake, CSD3-KNL, CSD3-GPU, OpenStack IRIS HPC and OpenStack Clinical clusters connected via the OPA network to the main SSD+HDD Lustre file system and the all-flash DAC.)



The current DAC is integrated with the Slurm scheduling software, with no changes required to the base solution, so that within the job submission command an individual per-job Lustre parallel file system is created. The size and performance of each Lustre file system can be set at job submission time. Because each job obtains its own freshly created, ephemeral Lustre parallel file system, no RAID redundancy features are needed on the file system, maximising performance. The result is that I/O bandwidth and IOPS performance become deterministic and fully scalable, in terms of both single-file and aggregate performance of the whole file system.

1.2.1 Freely available code repository and detailed “How-To”

A GitHub code repository [5] contains all the Ansible scripts to build the solution, the Orchestrator software, relevant documentation and build instructions, so that a skilled HPC engineer with the relevant Lustre system administration knowledge can recreate the solution. In addition, all test scripts and results are provided so the solution can be verified. All the software is open source and freely available, meaning that this extreme performance capability is now widely available and not tied to expensive proprietary solutions.

1.2.2 Data Accelerator workflows

The DAC orchestration work has focused on integrating with Slurm’s existing burst buffer support [8]. There is currently only one plugin for this, contributed by Cray for its Cray DataWarp product; an example of its use can be seen in the NERSC Cori system [9]. The DAC Orchestrator reuses Slurm’s Cray burst buffer plugin but with a different underlying implementation. The following sections list the functionality available to Slurm users when using the Lustre-based DAC.

1.2.3 Burst buffer lifecycle

When a Slurm burst buffer is created by the DAC, it creates a new parallel file system of the requested size. The granularity of resources assigned to the parallel file system is that of a single NVMe drive, and Slurm “rounds up” the requested size to the nearest whole number of NVMe drives. This approach has been chosen to help deliver the most predictable levels of performance while allowing many users to share the resources.

Each NVMe is split into an LVM volume for storage and an LVM volume for metadata. This means that if a user needs to store more files than a single NVMe can hold, they simply request more storage space to increase the number of supported files.

When creating a burst buffer there are two ‘lifecycles’ that can be chosen:

• Per job burst buffer
  - The job script specifies the required size of the buffer to be created. The system ensures the buffer is ready to use before the job is assigned compute nodes, and the buffer is deleted and cleaned up after the job releases its compute node resources.

• Persistent burst buffer
  - A named buffer that can be used by multiple jobs; the user controls when the buffer is deleted.

1.2.4 Burst buffer access modes

A job can have zero or one per-job buffers and/or zero or more persistent buffers. The job is supplied with environment variables that give the location of the mount points on the compute nodes. While the latest versions of Slurm can create buffers on login nodes, this is not currently supported by the DAC.

There are currently two supported access modes for the DAC; a short job-script sketch illustrating how they are consumed is given at the end of this subsection.

• The simplest is called “striped” and involves a single shared namespace being mounted on all the compute nodes running a given job. This is currently the only supported access mode for a persistent burst buffer.

• If using a per-job burst buffer, a “private” access mode is also supported, where each compute node is given a different, independent namespace. A per-job burst buffer can support both the “private” and “striped” access modes. The DAC implements these using separate directories for each required namespace, which means any single mount point could consume all of the resources assigned to that buffer.

A future access mode being considered is a “transparent cache”, using either a persistent or per-job burst buffer. It is also hoped to explore access modes using NVMe-over-fabrics rather than a parallel file system, to improve performance either for a read-only data set mounted in several locations, or by splitting an NVMe into namespaces so that a separate file system can be mounted on each compute node, similar to the private access mode above.
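As an illustration of how a job might consume the two access modes, the sketch below shows a per-job buffer requested with both modes and then referenced through the environment variables supplied by Slurm. The $DW_JOB_STRIPED variable appears in the examples later in this paper; the $DW_JOB_PRIVATE name and the application options are assumptions for illustration only.

#DW jobdw capacity=1600GB access_mode=striped,private type=scratch
# $DW_JOB_STRIPED : one shared namespace, mounted identically on every compute node
# $DW_JOB_PRIVATE : a per-compute-node namespace (variable name assumed)
srun ./solver --shared-dir $DW_JOB_STRIPED --node-scratch $DW_JOB_PRIVATE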

Burst buffer data movement

One further aspect of orchestration is the data staging functionality that is available for per-job burst buffers. Before the file system is mounted, data can be copied into a buffer. Before the file system is destroyed, data can be copied out to an external parallel file system. The copying of data is performed by the DAC nodes and does not consume compute node wall-clock time.

1.2.5 Data Accelerator use cases

Building on the available lifecycles and access modes, the following use cases are expected to be where users will see the biggest benefit from the Data Accelerator:

• Checkpoint and restart
  - A persistent buffer to help split very long-running jobs into shorter jobs that can recover state.

• Faster access to an existing data set
  - Some subset of a data set pre-staged into a per-job buffer before the job starts.
  - Or “hot” data sets made available via a read-only mount of a persistent buffer.

• High-speed scratch space
  - With the option to copy results back to the capacity file system after the job completes.
  - Reduces compute node time being wasted doing purely I/O.

• Dealing with lots of small files
  - The DAC’s dedicated metadata helps reduce the impact of noisy neighbours on a capacity-oriented parallel file system.

• Swap
  - For jobs with non-deterministic memory requirements.



2 Architecture

Conceptually, the high-level system architecture of the DAC can be viewed as consisting of five layers (see Figure 2), each of which contributes to the performance profile and functionality of the DAC. There is not a clean separation between these layers, but the layering is a useful construct to help segment and understand how the various elements affect the attributes of the DAC.

Figure 2 Conceptual layering of the DAC high-level architecture. (The figure shows, for a DAC node and a client node, the stack from the orchestration layer and Slurm integration, through the Lustre parallel file system and LNET, the operating system, the NUMA/PCIe lanes connecting the NICs and the two banks of 6 NVMe drives, down to the OPA network fabric.)

Each layer needs to be optimised to achieve a high percentage of the total theoretical performance, combined with transparent application integration from the user’s perspective. These layers are described below; during the development of the DAC, each of them was consistently tuned to obtain optimal performance and functionality.

• Layer 1 — Orchestration: exposes the system to users by dynamically creating new parallel file systems of the requested size to meet users’ needs; consists of the new Orchestrator software with integration into the Slurm HPC scheduling software.

• Layer 2 — Parallel file system: tune the parallel file system server and client to make the best use of the available storage and network resources.

• Layer 3 — Operating system: kernel and drivers tuned for optimal NVMe and OPA networking performance.

• Layer 4 — Server hardware: OPA NICs and NVMe disks are evenly spread between NUMA nodes to reduce PCIe bus and CPU interconnect bottlenecks (a quick way to check this mapping is sketched after this list).

• Layer 5 — Network fabric: the fabric configuration needs to be tuned to ensure bandwidth can be sustained between clients and servers.
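A simple way to verify this NUMA placement on a DAC node is to read the device-to-NUMA-node mapping from sysfs; the commands below are a minimal sketch (device names are illustrative and will vary per system).

dac$ cat /sys/class/nvme/nvme*/device/numa_node          # NUMA node of each NVMe drive
dac$ cat /sys/class/infiniband/hfi1_*/device/numa_node   # NUMA node of each OPA HFI
dac$ numactl --hardware                                  # CPU and memory layout of the two sockets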

2.1 Server & storage hardware

The Data Accelerator is built using Dell EMC R740xd 2U servers. Each DAC server has two 16-core Intel Xeon Gold 6142 CPUs and contains two PLX PCIe gen-3 switches, with 6 Intel P4600 NVMe SSDs [10] connected to each of the two PLX PCIe switches. This is summarised in the table below, together with the base performance specifications of the NVMe drives.

DAC server node configuration

Processor          2x Intel Xeon Gold 6142 at 2.5GHz
Memory             24x 16GB 2666 MT/s DDR4
Networking         2x Intel OPA v1 100Gbps (12.5 GB/s) HFI
PCIe Storage       2x PLX PCIe switch for NVMe
NVMe               12x Intel P4600 (1.6TB)
Operating system   Red Hat Enterprise Linux 7.6

Intel P4600 SSD 1.6TB specifications

Sequential Read (64k) (up to)    3200 MB/s
Sequential Write (64k) (up to)   1325 MB/s
Random Read (4k) (up to)         559,550 IOPS
Random Write (4k) (up to)        176,500 IOPS
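From these per-drive figures, the aggregate specification for the 12 drives in a DAC node works out as follows; this simple reading of the vendor numbers is the theoretical maximum behind the efficiency figures quoted in Table 1 and in section 4:

12 x 3200 MB/s ≈ 38 GB/s aggregate sequential read
12 x 1325 MB/s ≈ 16 GB/s aggregate sequential write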

The DAC test configuration consists of a total of 24 DAC servers, each attached with two Intel OPA links to the University of Cambridge production 1,152-node Intel Skylake cluster, a component of the UK Science Cloud in Cambridge called Cumulus. The DAC will be rolled out over the rest of the HPC estate, including the Cumulus-KNL and GPU clusters, together with the UK IRIS OpenStack cluster. The DAC provides a total of around 0.5PB of storage using 288 NVMe drives.



2.2 Placement of DACs within the OPA network

The 24-node DAC configuration described here is connected to the Cambridge Cumulus system via its OPA network and can generate over 500 GiB/s of I/O bandwidth. Ensuring that this data can move around the cluster without bottlenecks requires careful consideration of the placement of the DAC nodes within the fabric and optimisation of the OPA settings.

Cumulus employs a traditional oversubscribed (2:1) fat-tree topology, using 48-port Omni-Path edge switches. Each edge switch has 32 ports connecting compute nodes and 16 ports connecting to the core OPA network.

Our initial placement was to connect the 24 DAC nodes to 3 dedicated edge switches: 8 DAC nodes per switch, 16 links down to the DAC nodes and 16 links up to the core OPA fabric. The motivation here was to remove any oversubscription between the DAC nodes and the core network. In this configuration, scalability across the DAC nodes was observed to be significantly impaired. Further experimentation found that DAC scalability stopped when more than 3 DAC server nodes were connected to a single leaf switch, even with congestion control turned on. We have yet to understand the root cause of these issues.

To work around this issue, an alternative topology was used in which the DAC nodes were spread out over a much larger number of switches. Here we place one DAC node per edge switch, alongside the client nodes that the switch connects to the core OPA fabric. Thus all 24 DAC nodes are spread over 24 edge switches that also plug directly into client nodes. With this configuration, full DAC bandwidth scaling is observed.

Besides the fabric topology, the DAC nodes and compute clients were tuned according to revision 13.0 of the Intel Omni-Path Fabric Performance Tuning User Guide [14]. For all the benchmarks in this paper, the DAC nodes used the in-distribution (“inbox”) driver shipped with Red Hat Enterprise Linux 7.6. The specific hfi1 driver parameters used for benchmarking on both servers and clients were:

options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192

Compute nodes used the Intel Fabric Software (IFS) version 10.8 driver.
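These module parameters would typically be set in a modprobe configuration file so that they are applied whenever the hfi1 driver loads; the file name below is an assumption for illustration, and the Ansible configuration in the code repository [5] remains the authoritative reference.

dac$ cat /etc/modprobe.d/hfi1.conf
options hfi1 krcvqs=8 pcie_caps=0x51 rcvhdrcnt=8192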

2.3 NVMe and Lustre file system configuration

The current DAC deploys Lustre on top of the ldiskfs backend file system. Lustre is a long-standing, high-performance parallel file system and is open source under the GPL v2 licence.

Each OST or MDT is built from a single physical volume with no RAID redundancy. This is because the design of the DAC is optimised for performance rather than long-term resilience: all the data stored on the DAC is also held on long-term storage and staged in and out continuously. This is one of the key design elements that allow maximum performance to be obtained.

Each NVMe device is formatted with a 4KiB sector size using the Intel Solid-State Drive Data Center Tool. We then create an LVM physical volume for each device, which the DAC Orchestrator will use to partition into MDT and OST logical volumes (a minimal sketch of this layout is shown below).
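The commands below are a minimal sketch of what that per-drive layout could look like for one NVMe device; device names, volume sizes, the file system name and the MGS NID are all illustrative assumptions, and the fs-ansible scripts in the code repository [5] define the layout actually used by the Orchestrator.

dac$ pvcreate /dev/nvme0n1
dac$ vgcreate dac_nvme0 /dev/nvme0n1
dac$ lvcreate -n md -L 32G dac_nvme0            # small logical volume for an MDT
dac$ lvcreate -n data -l 100%FREE dac_nvme0     # remainder becomes an OST
dac$ mkfs.lustre --mdt --fsname=dacfs --index=0 --mgsnode=10.47.0.1@o2ib /dev/dac_nvme0/md
dac$ mkfs.lustre --ost --fsname=dacfs --index=0 --mgsnode=10.47.0.1@o2ib /dev/dac_nvme0/data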

For the benchmarks in this paper, we use Lustre 2.12.2 on the servers, which contains improvements for flash storage systems that help ensure optimum performance. Lustre 2.12.2 is also used on the clients. Lustre was built and installed as described in the Lustre documentation [15]. Both servers and clients ran Red Hat Enterprise Linux 7.6 with kernel 3.10.0-957.10.1.

During the benchmarking for this paper, we reliably hit a problem with Lustre reporting RDMA timeout errors, particularly during IOR read tests. We have filed a bug report regarding this problem [12] and were advised to revert one commit on top of 2.12.2 as a workaround until the issue is resolved. We are still working with the developers to find the root cause, but the benchmarks in this paper were run with 2.12.2 plus this commit reverted.

Tunables for the Lustre file system are provided in the DAC Ansible configuration in the code repository [5], but a summary is given below:

Lustre server tunables

Increase MDT maximum number of modifying RPCs in flight allowed per client:

mds$ echo 127 > /sys/module/mdt/parameters/max_mod_rpcs_per_client

Increase maximum RPC size to 16MiB:

oss$ lctl set_param obdfilter.*OST*.brw_size=16

Lustre client tunables

Disable client LNET data integrity checksums:

client$ lctl set_param osc.*.checksums=0

Increase client maximum RPC size to 16MiB:

client$ lctl set_param osc.*OST*.max_pages_per_rpc=16M

Increase client max_rpcs_in_flight and max_mod_rpcs_in_flight to match the server:

client$ lctl set_param mdc.*.max_rpcs_in_flight=128
client$ lctl set_param osc.*.max_rpcs_in_flight=128
client$ lctl set_param mdc.*.max_mod_rpcs_in_flight=127

Increase client maximum amount of data readahead on a file, and the global limit:

client$ lctl set_param llite.*.max_read_ahead_mb=2048
client$ lctl set_param llite.*.max_read_ahead_per_file_mb=256

Increase the amount of dirty data per OST the client will store in the page cache:

client$ lctl set_param osc.*OST*.max_dirty_mb=512
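The lctl set_param values above are transient and would be lost on a remount; one way to make them persistent, sketched here as an assumption rather than the project’s actual mechanism (the DAC Ansible roles [5] are the reference), is Lustre’s permanent parameter support run from the MGS node:

mgs$ lctl set_param -P osc.*.checksums=0
mgs$ lctl set_param -P osc.*OST*.max_pages_per_rpc=16M
mgs$ lctl set_param -P mdc.*.max_rpcs_in_flight=128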



2.4 Slurm and Orchestration software

The DAC uses an open source orchestration tool, developed at the University of Cambridge with StackHPC, that integrates with Slurm. The DAC Orchestrator is currently configured to deploy a Lustre parallel file system per job as directed by Slurm. It is planned that the Orchestrator could be extended to use other file systems such as BeeGFS, but here we only discuss Lustre.

Slurm already has support for burst buffers via burst buffer plugins. However, there is currently only one working plugin, the Cray DataWarp plugin. To integrate with Slurm, the DAC Orchestrator reuses both the user and orchestrator interfaces that Slurm created to expose Cray DataWarp to Slurm users. As such, the standard Slurm burst buffer documentation is valid for DAC Orchestrator-provided burst buffers [8].

Requesting a Burst Buffer from Slurm

Users interact with burst buffers by adding special directives to their job scripts. For example, to create a burst buffer that can be attached to multiple different jobs, you can create a persistent burst buffer using the following job script directive:

#BB create_persistent name=alpha capacity=2TB access=striped type=scratch

In this second example we look at the directives to add if you want to use the persistent buffer created in the previous example. In addition, we ask Slurm to create a per-job burst buffer, copy an input file into that per-job buffer before the job starts, and copy out a directory of files after the job has completed. Finally, we also request that additional swap is added to every compute node.

#DW persistentdw name=alpha
#DW jobdw capacity=1400GB access_mode=striped,private type=scratch
#DW stage_in source=/mnt/user1/input/file1 destination=$DW_JOB_STRIPED/file1 type=file
#DW stage_out source=$DW_JOB_STRIPED/outdir destination=/mnt/user1/out/run3 type=directory
#DW swap 1TB

When the user’s job executes, Slurm uses the burst buffer plugin to ensure the buffer has been created and mounted on the compute nodes before the job is started. The user’s job script is supplied with environment variables that tell the user where the buffer has been mounted (a complete, illustrative job script is sketched below).
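Putting these directives together, a complete job script might look like the following sketch. The node count, walltime, paths and application name are assumptions for illustration; only the directive syntax and the $DW_JOB_STRIPED variable follow the examples above.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=02:00:00
#DW jobdw capacity=3200GB access_mode=striped type=scratch
#DW stage_in source=/mnt/user1/input destination=$DW_JOB_STRIPED/input type=directory
#DW stage_out source=$DW_JOB_STRIPED/results destination=/mnt/user1/results type=directory

# The buffer is created and mounted before the job starts; $DW_JOB_STRIPED points at it.
srun ./my_app --input $DW_JOB_STRIPED/input --output $DW_JOB_STRIPED/results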

To better understand what is available to users, let us consider the ‘#DW jobdw’ directive more carefully. The type is defined to be scratch; currently this is the only valid value, although it is hoped to add a transparent cache mode in the future.

The capacity requested will be “rounded up” to the nearest level of granularity available in the requested burst buffer pool (by default we request from the default pool). The granularity reported by the DAC Orchestrator is currently determined by the size of the NVMe disks.

Finally, we should note that two access modes are supported: striped and private. “Striped” means a buffer is exposed as a single namespace to all compute nodes. “Private” means each compute node receives its own dedicated namespace. In the example above, the buffer is dynamically shared between both access modes.

The DAC Orchestrator implements this by creating a new parallel file system for each burst buffer. This parallel file system is mounted on all the compute nodes, with a directory created for each of the required namespaces. In addition, the same parallel file system is extended to include the space needed for any compute node swap files, which are also stored on that parallel file system, as shown in Figure 3.

Figure 3 Mount points and namespaces within a per-job buffer. (The figure shows a buffer providing, on each host, a shared /dac/name/global mount, a per-host /dac/name/private namespace and a per-host swap file, all backed by the same buffer.)



DAC Orchestrator Implementation

The DAC Orchestrator is composed of three key components, written in Go and Ansible:

• dacctl
  - A command line tool called by Slurm’s Cray DataWarp burst buffer plugin.

• dacd
  - A service that runs on each DAC node and creates the requested parallel file systems.

• fs-ansible
  - Ansible scripts used by dacd to orchestrate the creation and deletion of the requested parallel file systems.

dacctl and dacd communicate with each other via the etcd datastore, as shown in Figure 4.
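As a rough illustration of this coordination pattern (the key names and payload below are hypothetical, not the Orchestrator’s actual schema, which is defined by dacctl and dacd in the code repository [5]):

$ etcdctl put /buffers/alpha/request '{"capacity_gb": 2048, "access": "striped"}'   # dacctl records a new buffer request
$ etcdctl watch --prefix /buffers/                                                  # each dacd watches for work to act on
$ etcdctl get --prefix /buffers/alpha/                                              # dacctl reads back status to report to Slurm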

Figure 4 Data Accelerator Orchestrator’s architecture. (Slurm’s Cray DataWarp plugin invokes dacctl, which communicates through etcd with dacd; dacd drives Ansible to configure the hardware.)

The Slurm plugin defines the workflow for the management of burst buffers. Figure 5 shows the evolution of a user’s request; at each point along the line there is input from Slurm via dacctl as the buffer is created, consumed and removed. Slurm also delegates to the DAC Orchestrator the responsibility for copying users’ files, so that compute resources are not tied up during the data copy. The mount and unmount do happen on the compute nodes assigned to the job, but these are designed to be relatively quick operations and do not overly impact the time spent in execution.

Figure 5 Lifecycle of a buffer requested by a user’s job submission: validate request, set up buffer, stage-in data, mount on the chosen compute nodes, execute, unmount, stage-out data, release buffer.

For further guidance on setting up and configuring the Orchestrator, please see the Orchestrator installation guide [5].

3 Synthetic benchmark overview

A detailed set of synthetic benchmark studies was undertaken to probe I/O scaling both within a single DAC and across multiple DAC units connected to the large-scale (1,152-node) Intel Skylake HPC cluster deployed at the University of Cambridge. This system is part of the “Cumulus” science cloud, the largest academic HPC system in the UK from 2017 to 2019.

3.1 The motivation for the DAC performance study

Characterise DAC performance as a design tool

A key design goal of the DAC is to build a storage server with the highest possible performance for the lowest possible cost, with the system balanced in terms of read and write I/O metrics, including sequential bandwidth, metadata and IOPS. To achieve this, the Dell EMC R740xd server platform was adopted, which provides a flexible PCIe subsystem that can be populated with varying numbers of NVMe drives and OPA network cards, allowing a wide range of cost-effective system configurations to be explored.

Given the flexibility of the hardware unit, it was necessary to maintain a rigorous testing environment to ensure the DAC configuration was optimised to provide the highest performance possible, and to verify that this performance was maintained both within a single DAC server and across multiple DAC servers connected across the OPA switch fabric. To do this, a series of synthetic benchmarks was undertaken to probe performance and scalability up through the data path: within a single DAC, across multiple DAC units and out to remote clients.

Link application I/O requirements to DAC I/O performance

The questions we look to answer in respect of performance optimisation are:

• What is the maximum performance you can extract from a single DAC? Are we obtaining all the performance we can expect? Where are the bottlenecks?

• Given my application’s I/O requirements, how many NVMe drives should I have in each DAC node?

• How many DAC nodes are needed if the performance required is more than a single DAC can deliver?

When trying to understand how application I/O will perform and scale on the DAC, we have undertaken a basic study looking at a range of I/O operations that can be characterised by simple synthetic benchmark metrics. These synthetic benchmarks can then be run on the DAC, allowing the results to be related back to application I/O requirements. The table below lists the metrics associated with these operations:

I/O Pattern                    Metric
Bulk File Read/Write           Bandwidth — IOR [13]
File Creation and Deletion     Metadata — mdtest [13]
Small File Read/Write          IOPS — IOR with 4k transfer size
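For concreteness, the sketch below shows how these three metrics might be probed from the compute nodes with IOR and mdtest. The process count, sizes and the /dac/buffer mount path are illustrative assumptions; the results in this paper were produced with the IO-500 “ior-easy” and “mdtest-easy” configurations described in section 3.2, not these exact parameters.

client$ mpirun -np 384 ior -w -r -F -t 1m -b 16g -o /dac/buffer/ior_file        # bulk read/write bandwidth
client$ mpirun -np 384 mdtest -C -T -r -n 10000 -u -d /dac/buffer/mdtest_dir    # file create / stat / delete rates
client$ mpirun -np 384 ior -w -r -F -z -t 4k -b 1g -o /dac/buffer/iops_file     # small (4k) transfer IOPS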



We have probed the performance of these metrics as they scale across NVMe drives within a single DAC, and then as they scale across multiple DACs. By understanding an application’s bandwidth, metadata or IOPS requirements, and knowing how these metrics scale within and across the DAC units, it is possible to estimate the performance of a DAC of a particular size, in terms of NVMe drives or multiple DAC server nodes. The synthetic benchmark regime used here is described in the section below and provides results for an idealised application; as such, it explores the uppermost performance an application could achieve.

3.2 Synthetic benchmark tools

The synthetic benchmarks used in this work were taken from the recently developed IO-500 benchmark suite. The IO-500 creates a single score from a range of different benchmarks and is used as a worldwide HPC storage ranking that takes place twice a year.

For the most recent (June 2019) submission, we worked in close collaboration with engineers from Whamcloud, the primary developers of Lustre, to get the best possible performance from the system, with the result that the DAC came top of the list, with a score almost twice that of the second-placed entry.

The DAC result in the June 2019 IO-500 submission used the very latest tip of the development version of Lustre and included additional patches developed by Whamcloud that provided significant performance improvements to a number of the tests in the benchmark. Since these improvements are still landing in an official stable release of Lustre, we chose to use the latest stable version of Lustre (2.12.2) for the benchmarks in this paper, to provide a more easily reproducible configuration.

Figure 6 June 2019 IO-500 ranked list (ISC HPC 2019), with the Cambridge DAC in 1st place:

#   Institution                     System             Storage vendor   File system       Client nodes   Client procs   IO-500 score   BW (GiB/s)   MD (kIOP/s)
1   University of Cambridge         Data Accelerator   Dell EMC         Lustre            512            8192           620.69         162.05       2377.44
2   Oak Ridge National Laboratory   Summit             IBM              Spectrum Scale    504            1008           330.56         88.20        1238.93
3   JCAHPC                          Oakforest-PACS     DDN              IME               2048           2048           275.65         492.06       154.41
4   KISTI                           NURION             DDN              IME               2048           4096           156.91         554.23       44.43

(KISTI: Korea Institute of Science and Technology Information.)

Figure 7 below shows all the entries made in the three submission rounds since the IO-500 began, with each submission round given a different coloured data point. Looking at the individual tests within the IO-500 benchmark suite, it is clear that the DAC’s high metadata and IOPS performance make it stand head and shoulders above the pack.

Figure 7 IO-500 entries coloured by submission round (2018-06, 2018-11, 2019-06). (The chart plots IO-500 score against position in the list. The Cambridge Dell EMC R740xd NVMe Lustre entry sits in 1st place with roughly twice the score of the second-placed systems (Summit with IBM Spectrum Scale, Oakforest-PACS with DDN IME, Weka-IO Matrix) and around four times most of the pack.)

For the work in this paper, we have taken several of the individual tests within the set of benchmarks that make up the complete IO-500 suite, in order to probe the I/O performance metrics of interest to us. Using these subcomponents provides a well thought out and well-tested benchmark regime, and also allows the results generated here to be compared with the wider set of results from other storage systems already held in the IO-500 database.

For the purposes of this paper, we have used the “ior-easy” and “mdtest-easy” tests from the IO-500, since they probe the maximum performance values possible on the hardware and thus provide the upper bound in performance obtainable from an idealised application.



3.3 Synthetic benchmark summary

The following provides a high-level summary of the synthetic benchmark results:

• Bandwidth
  - Single DAC bandwidth
    > Maximum performance across all 12 NVMe: read 22 GiB/s, write 14 GiB/s.
    > A single NVMe gives 3.1 GiB/s read, 1.3 GiB/s write (95% of the spec value).
    > Scales linearly across NVMe drives within a single DAC, subject to internal PCIe switch and Lustre LNET data transfer limits via the OPA.
  - Multiple DAC bandwidth
    > Maximum bandwidth across 24 DACs: 513 GiB/s read and 340 GiB/s write.
    > Scaling across DACs is linear, with over 97% efficiency.
  - Single client node bandwidth
    > The IOR easy benchmark from the IO-500 demonstrates that a single client node is able to reach read/write bandwidths of 10 GiB/s. This is very close to the maximum bandwidth Lustre can support to a single OPA card of 11 GiB/s.
  - Comparison to a traditional SAS-HDD based Lustre file system with 36 OSTs built from RAID6 arrays
    > Maximum performance of a busy file system under load, when striped over all 36 OSTs, was approximately 10 GiB/s write, 15 GiB/s read.

• Metadata
  - Single DAC metadata performance with 1 MDT
    > Maximum performance appeared to peak around 12 clients, with 83K file create IOPS, 136K stat IOPS and 69K delete IOPS.
    > Maximum single client performance of 21K create IOPS, 71K stat IOPS and 58K delete IOPS.
  - Multiple DAC metadata performance scaling
    > Good scalability was observed using Lustre’s DNE2 striped directories as we scaled the number of MDTs up to 48 in the file system. At 24 DACs with 128 clients, we reached a peak of 1.3M create IOPS, 2.9M stat IOPS and 0.7M delete IOPS.
  - Comparison to a traditional SAS-HDD based Lustre file system with 1 MDT on a RAID10 array
    > Maximum performance appears to peak around 6-8 clients, with 22K file create IOPS, 150K stat IOPS and 29K delete IOPS.
    > Maximum single client performance of 18K create IOPS, 48K stat IOPS and 19K delete IOPS.

• 4K IOPS
  - Single DAC 4K IOPS performance with Direct I/O off (buffered I/O)
    > Maximum performance with 32 clients is 5.6M read and 4M write IOPS.
  - Single DAC 4K IOPS performance with Direct I/O on
    > Maximum performance with 32 clients is 250K read and 70K write IOPS.
  - Multiple DAC 4K IOPS performance scaling — Direct I/O off
    > Maximum performance with 32 clients is 15M read and 28M write IOPS.
  - Multiple DAC 4K IOPS performance scaling — Direct I/O on
    > Maximum performance with 32 clients is 1.6M read and 1.6M write IOPS.
  - Comparison to a traditional HDD-based Lustre file system with 36x RAID6-based OSTs — Direct I/O off
    > Maximum performance with 32 clients is 1.7M read and 2.3M write IOPS.
  - Comparison to a traditional HDD-based Lustre file system with 36x RAID6-based OSTs — Direct I/O on
    > Maximum performance with 32 clients is 372K read and 15K write IOPS.

4 Synthetic benchmark results

This section provides a detailed description of the synthetic benchmark results, which were obtained to profile the performance scaling within and across multiple DAC units as configured within the Cumulus OPA cluster. All the test scripts and raw data can be found in the code repository.

4.1 Bandwidth scaling within a single DAC

Here we explore a single DAC node’s bandwidth performance as we scale the Lustre file system across an increasing number of NVMe drives within a single DAC node, using the IO-500 ior_easy test with 12 client nodes.

Figure 8 Single DAC R/W bandwidth scaling: read/write performance as measured from the ‘ior_easy’ phases of the IO-500 benchmark, plotted as bandwidth (GB/s) against the number of NVMe OSTs in the Lustre pool. The number of IOR clients is fixed at 12, each with 32 MPI ranks, while the number of NVMe OSTs in the Lustre file system under test is varied. All NVMe drives are housed in a single DAC server.

Single NVMe performance

First, let us consider the results for a file system consisting of a single NVMe. Here we see a read performance of 3.1 GB/s and a write performance of 1.3 GB/s for a single NVMe, which is very close to the published spec values for the NVMe disks of 3.2 GB/s read and 1.3 GB/s write [10]. This demonstrates that the Lustre parallel file system and the underlying configuration of the NVMe drive, OS and drivers are able to deliver the full read/write performance of the NVMe drives through the file system and out across the network to a remote client node.

Scaling across multiple NVMe

As more NVMe OSTs are added to the Lustre file system, we see the write performance scale linearly all the way to 12 disks, with an overall value of 14.6 GiB/s and a write efficiency of 93% compared to the theoretical maximum for 12 NVMe disks. The read performance scales with a more complex pattern across the drives. We see linear read scalability up to NVMe #4, with a delivered bandwidth of 12.5 GiB/s and a read efficiency across the 4 drives of 97% compared to the theoretical maximum value. Adding disk #5 only increases the bandwidth by 0.5 GiB/s, reaching a value of 13 GiB/s at 5 drives.



The performance then plateaus until drive #7, at which point it starts to scale linearly again before plateauing once more at drive #10, with a maximum read performance across all 12 NVMe drives of 22.3 GiB/s and an overall read efficiency of 58%.

Understanding the read and write behaviour

This complex read scaling behaviour is a result of the bandwidth limit within the data path, from NVMe drive to remote client, which reduces the read performance from its maximum potential value for 12 drives of 38 GiB/s down to 22 GiB/s, in line with the network bandwidth limit of two OPA cards. The write performance scaling is more straightforward: it is linear, reaching 97% of the maximum potential value. This is because the per-disk write performance is lower than the read performance, so even as more disks are added, the aggregate write bandwidth remains below the bandwidth limit of the OPA cards.

The read scaling needs to be viewed in the context of the DAC server architecture. Each DAC node is a Dell R740xd configured with two OPA NICs, which are presented as a single "bonded pair" using Lustre's LNET Multi-Rail feature. The NVMe disks are evenly spread across the server's two internal PCIe switches. The Dell R740xd two-socket configuration is balanced by connecting each CPU socket to one Omni-Path adapter and one PCIe switch.

LNET self-test was used to establish a maximum network bandwidth of 22 GiB/s for a single node. In a similar way, IOR was used to establish the maximum bandwidth through a single PCIe switch as 13.6 GiB/s. From a data-path perspective, the first I/O bandwidth limit is therefore the 13.6 GiB/s cap on the PCIe switch; this limits read bandwidth at disk #5, the point at which the bandwidth of the disk pool (each disk providing roughly 3 GiB/s) exceeds 13.6 GiB/s, as observed in Figure 8. This limit is removed once the system starts using disks behind the second PCIe switch, which happens at disk #7, again confirmed in Figure 8. The next cap in bandwidth, also visible in Figure 8, is introduced by the 22 GiB/s limit of the dual OPA cards and is observed at disk #11.
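The observed plateaus can be summarised with a simple bottleneck model, a sketch built only from the figures quoted above (per-drive read bandwidth, per-PCIe-switch cap and dual-OPA cap) and ignoring protocol overheads:

B_{\mathrm{read}}(n) \approx \min\left( \sum_{i=1}^{2} \min\left(n_i \, b_{\mathrm{drive}},\; b_{\mathrm{switch}}\right),\; b_{\mathrm{OPA}} \right), \qquad b_{\mathrm{drive}} \approx 3\,\mathrm{GiB/s},\; b_{\mathrm{switch}} \approx 13.6\,\mathrm{GiB/s},\; b_{\mathrm{OPA}} \approx 22\,\mathrm{GiB/s}

where n_i is the number of NVMe drives behind PCIe switch i. Filling one switch first, the inner term saturates near drive #5, the second switch lifts the sum again from drive #7, and the outer minimum pins the total at 22 GiB/s once the combined pool exceeds the network limit, reproducing the shape of Figure 8.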

Bandwidth scaling over 95% of the expected maximum value

We can see that the bandwidth scaling of the file system, relative to the theoretical maximum obtainable from the underlying NVMe drives, is good, reaching over 95% of the maximum value once the internal bandwidth limits of the PCIe switches and OPA network cards are taken into account. This tells us that the tunings made at the NVMe firmware, LVM, Lustre file system and OPA network levels are all close to optimal. We now understand the internal hardware limits of the server, how they affect the performance of NVMe drives as they are scaled within a DAC node, and how to obtain linear scaling of a Lustre file system as it is scaled across NVMe drives within a single DAC node.

Test our understanding and obtain linear scaling for both read and write

To help prove our understanding of bandwidth scaling, and to obtain a linear scaling curve for both read and write, the tests were repeated, this time balancing the disks between the two PCIe switches as we scale up. The results are shown in Figure 9.

[Figure 9 chart: Single DAC R/W Bandwidth Scaling, PCIe-aware disk layout; write and read bandwidth (GB/s) vs. number of NVMe drives in the Lustre pool, 1-12.]

Figure 9 Read/write performance as NVMe OSTs within a single DAC node are added to the Lustre OST pool in a PCIe-aware fashion.

As expected, we see the same write bandwidth scaling but a different pattern in the read bandwidth. With careful balancing of the NVMe drives between the two PCIe switches, the only remaining limit on read bandwidth is the OPA bandwidth limit: the curve grows smoothly at roughly 3 GiB/s per NVMe drive until it hits the 22 GiB/s limit of the OPA links out of the server.
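One simple way to reproduce this PCIe-aware placement is to check which NUMA node (and therefore which CPU socket and PCIe switch) each NVMe device sits behind before assigning drives to the Lustre pool. The loop below is a generic Linux sysfs query offered as an illustrative sketch, not the orchestrator's own placement logic:

# Illustrative: list the NUMA node behind each NVMe device so drives can be
# split evenly between the two PCIe switches / CPU sockets
for d in /sys/class/nvme/nvme*; do
  echo "$(basename "$d"): numa_node=$(cat "$d"/device/numa_node)"
done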

To further prove that this limit is related to a single host, and is not some fundamental limitation of a single Lustre file system, we can remove the PCIe and OPA bottlenecks by spreading the NVMe drives evenly between three servers. (By default, this is what the orchestrator does when creating a file system across three DAC servers.)

[Figure 10 chart: Bandwidth Scaling as NVMe Drives Added Across DAC Units by Orchestrator; write and read bandwidth (GB/s) vs. number of NVMe drives in the Lustre pool, 1-12.]

Figure 10 Bandwidth scaling as NVMe drives are spread evenly across 3 DAC nodes.


4.2 Bandwidth scaling across multiple DACs

In this section, we show the linear scaling of read and write bandwidth across all 24 DACs in our configuration. Using ior_easy in the same way as the previous test, we grow the size of the Lustre file system one DAC at a time until all 24 DAC units are used.

[Figure 11 chart: Multiple DAC Bandwidth Scaling; write and read bandwidth (GB/s) vs. number of DAC units in the Lustre pool, 1-24.]

Figure 11 R/W IOR bandwidth scaled across 1-24 DAC units

Here we see the single DAC bandwidth performance of 22 GiB/s read and 14.7 GiB/s write scale

linearly as we increase the number of DAC servers, giving a total of 513 GiB/s read and 340 GiB/s

write across 24 DACs. The number of clients in the test is scaled with the number of DACs to ensure

all the bandwidth can be consumed. At 24 DAC nodes, we are using 160 client nodes to fully saturate

the bandwidth.

4.3 Bandwidth limit of a single client node

Running across all 24 DAC nodes with a range of different client-node counts, using the same ior_easy test from the IO-500 as in the previous section, we see that an individual client node can sustain a peak of 10 GB/s for both read and write. The single OPA link within each client has a maximum throughput with Lustre, via the LNET protocol, of 11 GB/s, so 10 GB/s is a high percentage of the achievable maximum.
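The per-client LNET limit can be measured independently of any file system using Lustre's LNET self-test utility. The outline below is an illustrative sketch only: the session and group names are arbitrary, the NIDs are placeholders for real client and server addresses, and the lnet_selftest module must be loaded on the nodes involved.

# Illustrative LNET self-test between one client and one DAC server (placeholder NIDs)
export LST_SESSION=$$
lst new_session client_bw
lst add_group clients 10.47.0.1@o2ib
lst add_group servers 10.47.1.1@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
lst stat clients servers
lst end_session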

Thus, we now understand the bandwidth behaviour of the DAC I/O server and the bandwidth that

can be consumed by a single client, providing us with a complete picture of the upper bound of

bandwidth performance as seen by an idealised application.

4.4 Bandwidth — comparison to HDD Lustre system

Current HDD-based Lustre systems in Cambridge are built on Dell EMC MD direct-attached SAS storage arrays. We currently run multiple 2.4 PiB file systems instead of one large file system, as this gives us more management flexibility and creates smaller failure domains. Each file system is built in a highly available configuration with pairs of storage arrays, each connected in a fully redundant configuration to two storage servers. We use the MD3460 array for the OSTs, which contains 60 disks arranged into six 10-disk RAID6 volumes as the OST devices. Most of our production file systems are composed of six MD3460 arrays, giving 36 OSTs per file system.

The current maximum read or write bandwidth we obtain from an entire 2 PiB Lustre file system is approximately 15 GiB/s, but to obtain this for a single job or workflow the job must utilise all the OSTs within the file system, which on a shared HPC system is very unlikely and which in a multi-user environment would lead to considerable performance contention. Comparing this to the DAC, we obtain almost the same performance from just one DAC unit, i.e. 20 GiB/s read and 15 GiB/s write, and this is available to a single job or workflow with zero performance contention. The cost delta between 2 PiB of spinning-disk Lustre and 19 TiB of DAC storage, where the DAC provides more real-world single-job bandwidth performance, is 13x. That is to say, a DAC offers 13x lower cost per unit of performance than our HDD-based Lustre.

4.5 Metadata performance of a single DAC

Here we use the mdtest 'easy' phases from the IO-500 benchmark to probe the maximum metadata performance we can obtain from a single DAC node (see Figure 12). In this configuration, the file system consists of a single MDT and 12 OSTs. For this benchmark, we use the following settings in the io500.sh script:

io500_mdtest_easy_params="-u -L -vvv"
io500_mdtest_easy_files_per_proc=200000
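Outside of io500.sh, the equivalent mdtest 'easy' run can be launched directly. The sketch below is illustrative (the rank count and target directory are assumptions), but the flags mirror the parameters above: a unique working directory per task, files only at the leaf level, and 200,000 files per process.

# Illustrative standalone mdtest-easy run
mpirun -np 32 mdtest -n 200000 -u -L -vvv -d /mnt/dac/job_fs/mdt_easy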

[Figure 12 chart: mdtest-easy IO500, 1x DAC server (1x MDT); write, stat and delete kIOPS vs. number of clients, 1-32.]

Figure 12 Metadata performance within a single node vs the number of clients

From the graph, we can see a maximum of 83K write (file creates) IOPS, 136K stat IOPS and 69K

delete IOPS.

4.6 Metadata limit of a single client node

When we use just one client node to drive metadata performance, we see the following results: create 21 kIOPS, stat 71 kIOPS, delete 58 kIOPS.


4.7 Metadata scaling across multiple DACs

To measure metadata performance as we scale the Lustre file system across multiple DAC units, we use the same mdtest easy benchmarks as in the previous section. As shown in Figure 13, we were able to scale almost linearly from one DAC up to 24 DACs, where each DAC server contains a single MDT and 12 OSTs. We use Lustre's DNE2 striped directories to spread the metadata load across all the MDT targets. We also tested a configuration with two MDTs per DAC, 48 MDTs in total, and saw further-improved performance. However, we were unable to explore configurations with three or more MDTs per DAC due to a current bug in Lustre11 limiting the maximum number of MDTs in a file system.

All tests were run using 128 clients. Our io500.sh configuration was similar to the single-DAC tests:

io500_mdtest_easy_params="-u -L -vvv"
io500_mdtest_easy_files_per_proc=120000

In the directory setup section of io500.sh, the mdtest directory was configured to be striped over all MDTs:

# Use DNE2 for mdt_easy
lfs setdirstripe -c -1 -D $io500_mdt_easy
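For reference, the same effect can be reproduced manually on any DAC file system; the path below is illustrative. Running lfs setdirstripe with a stripe count of -1 creates a directory striped over every available MDT, and lfs getdirstripe confirms the resulting layout.

# Illustrative: create a directory striped over all MDTs and verify its layout
lfs setdirstripe -c -1 /mnt/dac/job_fs/mdtest
lfs getdirstripe /mnt/dac/job_fs/mdtest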

The maximum performance was achieved with 48 MDTs, with approximately 1.3M creates, 2.9M

stats and 0.7M deletes per second.

[Figure 13 chart: mdtest-easy IO500 results for multiple MDTs using DNE2; write, stat and delete kIOPS vs. number of MDTs in the file system, 1-48.]

Figure 13 Multiple DAC unit metadata scalability

It is worth noting that the DAC orchestrator limits a file system to one MDT per host only when a user requests more than 24 NVMe drives for their file system; smaller file systems are given an MDT for every OST.

4.8 Metadata — comparison to HDD Lustre system

We have compared the metadata performance seen on the DAC with the same metadata tests run on our traditional spinning-disk Lustre file systems, as described in the bandwidth comparison section above. These file systems contain only a single MDT, which is built on the Dell MD3420 SAS array, which can hold up to 24 2.5" SAS drives. Our MDT devices are configured as a RAID10 of 20 300 GiB 15K RPM SAS HDDs, with 4 SAS SSDs used as a read and write-through cache.

Similar to the single-DAC benchmark above, we varied the number of client nodes in the benchmark to see how metadata performance scaled. This was carried out on a 'live', in-use production file system, so it was not possible to ensure that other jobs were not causing contention in the system. We see the write and delete performance remaining relatively consistent, with a peak of 22K creates, 150K stats and 29K deletes per second. Other than the stat performance, which can be affected by client caching, the create and delete performance of a single DAC MDT is 2-3x greater, and multiple DAC MDTs vastly outperform this configuration.

[Figure 14 chart: mdtest-easy IO500, HDD-based Lustre, 1x MDT on RAID10 SAS-HDD with flash cache; write, stat and delete kIOPS vs. number of clients, 1-32.]

Figure 14 Metadata performance of a HDD-based production file system with one MDT

4.9 IOPS scaling within a single DAC

For this test, we use the same IO-500 ior-easy benchmarks as for the bandwidth tests, but limit the IOR transfer size to 4 KiB. We perform the test both with O_DIRECT on and off, so that we also see the worst-case performance without any page-cache buffering or readahead.

For all ior-easy benchmarks, we configure the ior-easy directory in io500.sh as follows:

# 1 stripe, 16M stripe-size
lfs setstripe -c 1 -S 16M $io500_ior_easy

The ior-easy options used are, for O_DIRECT:

io500_ior_easy_size="5G"
io500_ior_easy_params="-vvv -a=POSIX --posix.odirect -t 4k -b ${io500_ior_easy_size} -F"

and for tests without O_DIRECT we use:

io500_ior_easy_size="5G"
io500_ior_easy_params="-vvv -a=POSIX -t 4k -b ${io500_ior_easy_size} -F"
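As with the bandwidth tests, the corresponding standalone IOR invocation is straightforward. The example below is an illustrative sketch of the O_DIRECT variant; the rank count and file path are assumptions, while the 4k transfer size and 5 GB block size follow the parameters above.

# Illustrative: 4k sequential file-per-process I/O with the page cache bypassed
mpirun -np 32 ior -a POSIX --posix.odirect -t 4k -b 5g -F -vvv \
  -o /mnt/dac/job_fs/ior_easy/testfile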


Direct IO Disabled — Buffered IO / Readahead

Most applications do not use O_DIRECT and can take advantage of asynchronous writes and Lustre's client readahead to increase small-I/O performance. Small writes are written to the VFS page cache and flushed as bulk RPCs, whereas reads cause the Lustre client to prefetch file data into memory, so subsequent small sequential reads find the data already in memory without further RPCs to the server. Figure 15 below shows how such buffered 4k sequential I/O scales as we add additional OSTs within a single DAC, with 32 clients.
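The client-side caching behaviour that this buffered path relies on can be inspected, and if necessary tuned, through standard Lustre client parameters. The illustrative commands below simply read the relevant settings on a client node:

# Illustrative: inspect the readahead and dirty-page settings used by buffered small I/O
lctl get_param llite.*.max_read_ahead_mb
lctl get_param llite.*.max_read_ahead_per_file_mb
lctl get_param osc.*.max_dirty_mb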

[Figure 15 chart: IOR, 4k transfer size, IOPS, single DAC server; maximum 4k write and read kIOPS vs. number of OSTs in the file system, 2-12.]

Figure 15 Single DAC IOPS scaling with direct I/O off and 32 clients

Direct I/O Enabled

With Direct I/O enabled, every write and read call causes a request over the network to the Lustre servers and bypasses all caches, giving us the worst-case performance for such small I/O sizes. Figure 16 shows the results for 32 clients as we increase the number of OSTs in a single DAC server.
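The effect is easy to see outside IOR as well; for instance, a plain dd with oflag=direct forces each 4k write to become a synchronous RPC to the server. The path below is illustrative:

# Illustrative: 4k direct writes; compare with the same command without oflag=direct
dd if=/dev/zero of=/mnt/dac/job_fs/odirect_test bs=4k count=100000 oflag=direct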

[Figure 16 chart: IOR, 4k transfer size, O_DIRECT, single DAC server; maximum 4k write and read kIOPS vs. number of OSTs in the file system, 2-12.]

Figure 16 ior-easy from 32 clients doing 4k sequential reads/writes with Direct IO enabled, as we

increase the number of OSTs in a single DAC node

4.10 IOPS scaling across multiple DACs

This test is identical to the single-DAC tests above, but now uses multiple DAC servers, each configured with 12 OSTs. As before, we test with Direct IO both enabled and disabled.

Direct IO Disabled — Buffered IO / Readahead

Figure 17 shows the sequential read/write 4k IOPS achieved as we scale the number of DAC servers in the file system, where each DAC server contains 12 OSTs.

[Figure 17 chart: IOR, 4k transfer size, IOPS, multiple DAC servers; maximum 4k write and read kIOPS vs. number of DAC servers, 1-24.]

Figure 17 ior-easy from 32 clients doing 4k sequential reads/writes with Direct IO disabled, as we

increase the number of DAC servers in the file system

Direct IO Enabled

Figure 18 shows the same test as in Figure 17, but with Direct IO enabled: the sequential read/write 4k IOPS achieved as we scale the number of DAC servers in the file system, where each DAC server contains 12 OSTs.

[Figure 18 chart: IOR, 4k transfer size, O_DIRECT, IOPS, multiple DAC servers; maximum 4k write and read kIOPS vs. number of DAC servers, 1-24.]

Figure 18 ior-easy from 32 clients doing 4k sequential reads/writes with Direct IO enabled, as we

increase the number of DAC servers in the file system


4.11 IOPS limit of a single client node

Here we probe the maximum IOPS achievable from a single client to a single DAC node (12 OSTs), performing the same tests as above with Direct I/O enabled and disabled.

4k Sequential IOPS              Write kIOPS    Read kIOPS
Direct IO Enabled                        74            69
Direct IO Disabled / Buffered          1300          1350

4.12 IOPS — comparison to HDD Lustre system

As described in the bandwidth test section above, our HDD-based Lustre file systems usually consist of 36 OSTs, where each OST is a 10-disk RAID6 array and the OSTs are spread across 6 OSS servers. The values below are the maximum IOPS achieved under the same ior-easy tests as used on the DAC, with 32 client nodes. As before, we limit the IOR transfer size to 4 KiB and measure the sequential read/write IOPS.

4k Sequential IOPS              Write kIOPS    Read kIOPS
Direct IO Enabled                        16           370
Direct IO Disabled / Buffered          2300          1700

5 Application use cases

Here we undertake an initial investigation of the real-life HPC application performance benefits obtained from using the DAC. We have chosen three use cases: first, a traditional large-scale, long-running HPC simulation that uses regular checkpointing; second, a large-scale structural biology cryo-EM workload (using the Relion application); and third, a radio astronomy SKA prototype application. We expect many more use cases across machine learning, engineering, computational chemistry and bioinformatics to emerge as further testing is undertaken.

5.1 Large checkpoint restart and simulation output data

As mentioned in the introduction, the concept of the DAC is a generalisation of the "Burst Buffer" solutions that were built to support large-scale checkpointing of long-running HPC simulations. The requirement here is to be able to dump the resident memory of a multi-node application at regular intervals. For the CSD3-skylake cluster, each node comprises 384 GB of DRAM, totalling around 150 TB across the cluster.

Reading this amount in from the main Lustre file system and writing it out for a checkpoint would take approximately 16 hours, a prohibitively long time given the nature of the simulation. Using the DAC, the same function takes around 12 minutes. This has important consequences for the core hours used by the simulation: it reduces the cost from 344,000 core hours to 147,000 core hours, which has significant monetary value, with the cost of the DAC being recovered after just one year of this class of usage.
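As a back-of-envelope sanity check, a sketch using only the figures quoted above (150 TB of aggregate DRAM, 16 hours versus roughly 12 minutes) and treating the dump as a single sequential transfer, the implied effective per-job bandwidths are approximately:

t_{\mathrm{ckpt}} = \frac{V_{\mathrm{mem}}}{B_{\mathrm{eff}}}, \qquad \frac{150\,\mathrm{TB}}{16\,\mathrm{h}} \approx 2.6\,\mathrm{GB/s}, \qquad \frac{150\,\mathrm{TB}}{12\,\mathrm{min}} \approx 208\,\mathrm{GB/s}

These implied bandwidths are consistent with the single-job limitations of the shared HDD Lustre system and with the multi-DAC bandwidth measured in section 4.2.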

5.2 Relion — Cryo-EM structural biology

The field of cryogenic electron microscopy (Cryo-EM) is currently experiencing a great deal of interest due to dramatic increases in resolution and the possibilities this offers the scientists using these techniques. Relion, now at version 3 and with significant improvements in computational performance, is one of the key applications for analysing and processing these large data sets. Greater resolution brings challenges: the volume of data ingested from these instruments increases dramatically, and the compute requirements for processing and analysing the data explode.

The Relion refinement pipeline is an iterative process that performs multiple passes over the same data to find the best structure. As the total volume of data can be tens of terabytes, it exceeds the memory capacity of almost all current-generation computers, so the data must be repeatedly read from the file system. Combined with the various optimisations that have been made to the compute performance of the application, this moves the application performance bottleneck to I/O.

A recent, challenging test case produced by Cambridge research staff is 20 TB in size. Moving this test case from the traditional Cumulus Lustre file system to the new NVMe DAC reduces I/O wait times from over an hour to just a couple of minutes. This has an immediate impact, as it reduces the time biological samples remain in situ in the instrument, increasing the overall throughput of the service.

5.3 Square Kilometre Array — Science Data Processor benchmark

As part of the work to prototype the SKA's SDP component, the DAC was used to provide a secondary, larger-scale prototyping environment to assess the potential performance of the "Buffer component" of the Science Data Processor architecture, which provides a pivotal storage resource for intermediate (hot buffer) and final (cold buffer) science data products. The performance requirement for the hot buffer is anticipated to be in the order of 4 GB/s per compute node across a cluster of around 3,000 nodes. Initial prototyping was performed on an OpenStack bare-metal cluster supporting a number of execution frameworks together with a networked NVMe storage appliance. For this exercise, a Slurm service was made available, and the user was able to move easily onto the Cumulus-skylake cluster, exploit the DAC and demonstrate the scalability of the prototype application without impacting other users.


6 Discussion

We present here a generalisation of the "Burst Buffer" concept: a Data Accelerator based on commodity hardware components sourced from Dell EMC and Intel, together with an open source orchestration layer developed at the University of Cambridge that integrates seamlessly with the widely used Slurm job scheduler. The orchestrator creates a Lustre parallel file system on the DAC on a per-job basis, as directed by the job submission command, with the size, performance capability and stage-in/stage-out routes required by the job.

The solution delivers a large fraction of the theoretical NVMe SSD performance to a single job in a completely transparent and deterministic fashion, delivering 100x the per-job performance of traditional spinning-disk Lustre file systems. The solution was tested with synthetic benchmarks and three HPC applications and shown to deliver significant application performance advantages.

The synthetic benchmark data shows that more than 90% of the raw write performance is delivered to remote clients, while read efficiency is 58% of the raw drive performance due to the limit of the available network bandwidth. Furthermore, unlike traditional spinning-disk central file systems, a large fraction of this total performance can be delivered to a single HPC job. This breakthrough performance put the Cambridge DAC prototype at number 1 in the June 2019 IO-500 list, making the solution the fastest HPC storage system in the world at an unrivalled price point compared to competing proprietary solutions.

When designing a DAC commodity storage solution, there is a wide range of server, storage and networking elements available and a wide range of configuration options. The design choices made here were intended to produce a DAC appliance with the highest and most balanced performance for the lowest cost, with performance prioritised over capacity. Thus, medium-sized 1.6 TB NVMe drives were used, with 12 drives per server to balance the overall R/W performance, since NVMe disks tend to offer more read performance than write. The 12-disk solution is a compromise that yields an overall raw R/W performance per DAC of 36 / 15 GB/s, which the two-OPA-card network limits to an achievable R/W bandwidth out of the DAC of 22 / 15 GB/s. This explains why the read efficiency is lower than the write efficiency: read bandwidth is capped by the OPA cards while write is not, and read performance must be over-provisioned in order to bring the write performance up. In any case, the PCIe switches in the R740xd server max out at around 27.2 GB/s, so even if another OPA card were added, only an additional 5.2 GB/s of read bandwidth could be realised.

A more balanced system could also be built with Intel Optane disks, but current market pricing makes this a far more expensive option; it is more cost-effective to over-provision read bandwidth in order to raise relative write performance. It would also be possible to add more NVMe drives to achieve equal R/W performance. Within the Dell EMC R740xd, it is possible to fit twice as many disks at two or three times the capacity, enabling a broad range of capacity-versus-performance designs. With new PCIe Gen4 servers on the horizon and the increasing availability of 200 Gb/s HPC networking products, future DAC designs with even higher performance per DAC unit will soon be possible, offering even better price points for performance and capacity targets.

The DAC appliance is built using cloud-native, infrastructure-as-code methodologies with Ansible. All the scripts and software documentation, along with a detailed "How-To" covering the Ansible and golang tuning and build of the DAC, are freely available in the DAC code repository.5 Combined with the integration of the DAC with Slurm, this means that, for the first time, extremely high and fully deterministic I/O bandwidth, IOPS and metadata performance is readily available to the wider HPC community. Importantly, this no longer requires expensive proprietary solutions but can be constructed easily from commodity server and storage technology and open source software.

Solutions like the DAC will help drive the uptake of new SSD-based storage solutions in HPC and propel the next wave of data-intensive application development needed to meet the demands of the data-centric world we now live in. The fields of astronomy, physics and medicine are generating vast and rapidly increasing data sets which require ever more complex analytical techniques such as Hadoop, Spark and machine learning. If HPC systems are to keep pace with these workflows, hybrid multi-tiered storage solutions will be needed, with seamless, automatic data movement. The solution presented here allows just this type of solution to be explored, developed further and tested with a wide range of applications and workflows.

Future DAC development work includes implementing BeeGFS as an alternative parallel file system to Lustre, and a direct-attached NVMe storage pool exposed to HPC nodes within the cluster via NVMe over Fabrics, again under the control of the Slurm scheduler. In addition, extensive I/O telemetry functionality is being investigated that will report on the consumption of I/O resources at a per-job level from both the server and the client. We will also test an increased range of HPC applications, profiling their I/O activity and performance gains to build a better understanding of how best to utilise tiered storage solutions consisting of NVMe and traditional spinning disk.


7 References

1 https://slurm.schedmd.com/documentation.html
2 https://www.dellemc.com/resources/en-us/asset/customer-profiles-case-studies/products/servers/cambridge_case_study.pdf
3 https://www.stackhpc.com/cluster-as-a-service.html
4 https://www.vi4io.org/
5 https://github.com/RSE-Cambridge/data-acc
6 https://www.lanl.gov/projects/trinity/about.php
7 https://phys.org/news/2012-03-sdsc-gordon-supercomputer-ready.html
8 https://slurm.schedmd.com/burst_buffer.html
9 https://www.nersc.gov/users/computational-systems/cori/burst-buffer/
10 https://ark.intel.com/content/www/us/en/ark/products/97005/intel-ssd-dc-p4600-series-1-6tb-2-5in-pcie-3-1-x4-3d1-tlc.html
11 https://jira.whamcloud.com/browse/LU-12506
12 https://jira.whamcloud.com/browse/LU-12385
13 https://github.com/hpc/ior
14 https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Performance_Tuning_UG_H93143_v13_0.pdf
15 http://wiki.lustre.org/Installing_the_Lustre_Software


The information in this publication is provided "as is." Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright ©2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC are trademarks of Dell Inc. and subsidiaries in the United States and other countries. Other trademarks may be trademarks of their respective owners. Dell Corporation Limited. Registered in England. Reg. No. 02081369 Dell House, The Boulevard, Cain Road, Bracknell, Berkshire, RG12 1LF, UK.

Intel and the Intel Logo are trademarks of Intel Corporation and subsidiaries in the U.S. and/or other countries.