PARALLEL FILE SYSTEMS
Agenda
Parallel File Systems: History, What, Why, So What; Example: Lustre and Spectrum Scale (GPFS)
Best Practices for Parallel I/O
Evolution of Parallel File Systems
[Figure: timeline from the early 1980s to today. 1980s to early 1990s: workstation file sharing (Auspex, Sun/NFS, NetApp) becomes a bottleneck. 1990s: clustered Linux systems. Today: parallel file systems with a metadata server, storage servers, and compute clients, with separate management, metadata, and data paths.]
Benefits of Parallel File Systems
• Increase ability to scale by avoiding bottlenecks (separate metadata and data)
• Take advantage of available bandwidth
• Increase parallel streams to storage using more than one client
• Allow the compute cluster to achieve higher utilization by not waiting on I/O
• Grow clusters to perform larger calculations
[Diagram: metadata server, storage servers, and compute clients with separate management, metadata, and data paths.]
Major Parallel File Systems Today
Lustre: high performance and high scalability; no licensing cost; used by national labs and universities.
Spectrum Scale (aka GPFS, from IBM): a fit when the customer asks for features such as snapshots and migration between disk tiers, home directories and fast scratch in the same file system, or large numbers of files with high metadata throughput plus high-speed storage for a compute cluster.
A Lustre Cluster
[Diagram: Lustre clients (1-100,000) connect over multiple network types (40/100GbE, EDR IB, OPA), optionally through routers, to metadata servers (MDS) with a metadata target (MDT) and to object storage servers (OSS, 1-1,000s) with object storage targets (OSTs) on disk arrays and SAN fabric. CIFS and NFS clients access through a gateway. ClusterStor hardware: HA-MDS and SSUs.]
Summary
• When you need high performance and high scalability -> parallel file systems
• Parallel file systems in production -> ClusterStor
BEST PRACTICES FOR PARALLEL I/O
Agenda
§ Parallel File Systems Background: parallel I/O performance and ClusterStor Lustre; access patterns; metadata usage
§ Performance: L300 - bandwidth, metadata; L300N - NytroXD I/O accelerator
§ POSIX and MPI-IO: POSIX and directly accessing the file system; use of MPI-IO as a middleware I/O library; MPI-IO hints; examples
§ Optimal configuration for KAUST: recommended tuning options for HPC workloads
§ Tools to identify issues: visibility/monitoring on servers and clients; file/directory striping; I/O tracing
§ Best practices for application usage of Lustre: DOs and DON'Ts
LUSTRE PARALLEL I/O PERFORMANCE BACKGROUND
Lustre Components
[Diagram: clients connect to the MDS and to multiple OSSs. Client-MDS traffic covers directory operations, file open/close, metadata, and concurrency; MDS-OSS traffic covers file creation, file status, and recovery; client-OSS traffic covers file I/O and locking.]
Lustre is an open source, distributed parallel file system
§ Object-based design provides extreme scalability
§ Compute clients interact directly with storage servers
§ Comprised of:
§ Clients
§ Metadata Servers and Targets
§ Storage Servers and Targets
Lustre Parallel File System
L300 ClusterStor Management (SMU/MMU): Management and Metadata (MDS/MDT)
CS Manager and MDS/MGS Nodes
§ 2RU integrated controllers – Server 1: CSM management; Server 2: boot; Server 3: MGS/MDS; Server 4: MDS
§ Fault tolerance and serviceability
2U24 JBOD – MDT
§ SAS JBOD for MDS
§ Disk configuration – qty 20 drives for MDT; 2x RAID10 (5+5); qty 2 global hot spares
Scalable Storage Unit (SSU)
§ 5U84 enclosure
§ 2 Object Storage Servers per SSU
§ Two trays of 41 HDDs each for Object Storage Targets
§ 2 SSDs (WIBs, journals, NytroXD)
§ H/A on each SSU
§ InfiniBand EDR, 40GbE, OPA data network connectivity
ClusterStor Hardware and the Lustre File System
§ Object Storage Server: Seagate embedded application server
§ Object Storage Target: Seagate 5U84 storage enclosure
§ Metadata and Management Servers: 2U x 4 servers
§ Metadata Target: Seagate 2U24 JBOD
[Diagram: 1) the client asks the MDS where the file is; 2) the MDS returns the file layout; 3) a single 3,072 KB file 4) is broken into 1,024 KB block stripe segments; 5) stripes 1, 2, and 3 are written to separate OSTs.]
MDS <-> OST Interaction: Use Cases
§ Create object: create a new data object for the file
§ Unlink: delete the data object if the file is unlinked
§ StatFS: get info from the OST for the size of the file and the space on the OST
File on Lustre
§ File attributes (name, permission, owner, security label): handled by the MDT
§ Data blocks and actual size: handled by the OSTs (each data block is an object on an OST)
§ Mapping from name to data objects (aka LOV EA): stored in extended attributes on the MDT
Striping
● Allows file data to be stored evenly on multiple OSTs
● RAID 0 pattern
[Diagram: File A and File B stored as objects on OST1-OST3. Example: a single-stripe file – all file data is on a single OST.]
Striping Terminology
• Object – the stripes belonging to one file on an OST
• Stripe count – the number of objects involved
• Stripe size – the maximum size of one stripe
• Stripe object sizes add up to the total file size
In this example:
• File A stripe count = 2
• File B stripe count = 3
• Typical stripe size is 1 MB
• File B's stripe size is twice that of File A
[Diagram: File A's stripes on OST1 and OST2; File B's stripes on OST1, OST2, and OST3.]
Striping (cont.)
[Diagram: File A and File B striped round-robin across OST1, OST2, and OST3.]
Example: Fully-Striped Files
• These are fully striped files, meaning striped over all OSTs (also called wide striping)
• This achieves maximum bandwidth to one file
• File A has a hole or is sparse; this happens commonly in HPC usage of a file system
● The parallel I/O performance comes from separating the metadata and the data objects
● The OSSs handle the containers for the component objects
● The data (file) can be striped across multiple OSTs for improved parallelism
● Typically this striping is used for very large single files accessed by multiple processes
Overview
CONFIGURATION AND TUNING
RPCs
• lctl set_param osc.*.max_rpcs_in_flight=256 [default: 8]
• lctl set_param osc.*.max_pages_per_rpc=1024 [default: 256]
Write cache
• lctl set_param osc.*.max_dirty_mb=1024 [default: 32]
Client-Side Tunables
Readahead – Best for streaming read workload
• lctl set_param llite.*.max_read_ahead_mb=1024 [default: 40]
• lctl set_param llite.*.max_read_ahead_per_file_mb=1024 [default: 40]
Readahead - is your workload strided?
• May set the value to default (40), or tune to lower value
Readahead Step
• lctl set_param llite.*.read_ahead_step=4 (Seagate client only)
Client-Side Tunables
Wire checksum
• /proc/fs/lustre/osc/*/checksums=0
LRU size is dynamic by default - Controls number of client-side locks in an LRU cached locks queue.
• lctl set_param ldlm.namespaces.*osc*.lru_size=0
LNET verbosity
• sysctl -w lnet.debug=0
Client-Side Tunables
While there are tunables available on the servers, typically the system defaults are optimal. In special cases, Seagate Support and Engineering may recommend specific changes.
Server-Side Tunables
TOOLS
A number of tools are available on the management server for monitoring, e.g., ltop (and the GUI):
$ ltop
Filesystem: cstor
Inodes: 1434.250m total, 0.006m used ( 0%), 1434.244m free
Space: 451.605t total, 160.525t used ( 36%), 291.080t free Bytes/s: 0.000g read, 0.000g write, 0 IOPS
MDops/s: 0 open, 0 close, 0 getattr, 0 setattr
0 link, 0 unlink, 0 mkdir, 0 rmdir 0 statfs, 0 rename, 0 getxattr
>OST S OSS Exp CR rMB/s wMB/s IOPS LOCKS LGR LCR %cpu %mem %spc
0000 F cstor04 17 0 0 0 0 0 0 1 2 48 0
0001 F cstor05 17 0 0 0 0 0 0 0 2 49 0
0002 F cstor06 17 0 0 0 0 96 0 0 33 99 69
0003 F cstor07 17 0 0 0 0 96 0 0 26 99 73
Visibility Tools – Server Performance Monitoring
For client-side tools, checking the RPC write/read stats:
$ lctl get_param osc.*OST0000*.import
rpcs:
   inflight: 0
   unregistering: 0
   timeouts: 0
   avg_waittime: 304635 usec
read_data_averages:
   bytes_per_rpc: 16777216
   usec_per_rpc: 105534
   MB_per_sec: 158.97
write_data_averages:
   bytes_per_rpc: 16692814
   usec_per_rpc: 759389
   MB_per_sec: 21.98
Visibility Tools – Client Performance Monitoring
For client-side tools, checking the RPC stats:
• Pages per RPC – all 1024-page writes, x 4 KB = 4 MB RPCs
$ lctl get_param osc.*OST*.rpc_stats
                        read                     write
pages per rpc    rpcs   %   cum %   |     rpcs   %   cum %
1:                  0   0       0   |        3   0       0
...
256:           587809  64      64   |      868   0       0
512:             4734   0      64   |     1270   0       0
1024:          320913  35     100   |   467052  99     100

                        read                     write
rpcs in flight   rpcs   %   cum %   |     rpcs   %   cum %
0:                  0   0       0   |        0   0       0
1:               8870   0       0   |       49   0       0
...
4:              22367   2       7   |       48   0       0
Visibility Tools – Client Performance Monitoring
For metadata statistics:
$ llstat /proc/fs/lustre/mdc/*MDT*/md_stats
snapshot_time 1488139067.685549
close 66
create 9
getattr 1
intent_lock 402
read_page 23
unlink 285
intent_getattr_async 244
revalidate_lock 567
Visibility Tools – Client Metadata Monitoring
$ llobdstat /proc/fs/lustre/osc/*OST0000* 1
/usr/bin/llobdstat on /proc/fs/lustre/osc/cstor-OST0003-osc-ffff8807f72e7400
Processor counters run at 1200.000 MHz
Read: 1.96468e+12, Write: 1.92445e+12, create/destroy: 0/0, stat: 0, punch: 0
[NOTE: cx: create, dx: destroy, st: statfs, pu: punch ]
Timestamp   Read-delta   ReadRate     Write-delta   WriteRate
--------------------------------------------------------
1428086691  412.00MB     411.85MB/s   0.00MB        0.00MB/s
1428086692  184.00MB     183.93MB/s   0.00MB        0.00MB/s
1428086693  296.00MB     295.87MB/s   0.00MB        0.00MB/s
Visibility Tools – Client Monitoring OSS Statistics
Use Lustre- and stripe-aware tools instead of the standard Unix ones:
$ lfs df
UUID 1K-blocks Used Available Use% Mounted on
cstor-MDT0000_UUID 2255452564 4317344 2221057108 0% /mnt/cstor[MDT:0]
cstor-OST0000_UUID 121226819924 119016 120014045804 0% /mnt/cstor[OST:0]
cstor-OST0001_UUID 121226819924 119016 120014045804 0% /mnt/cstor[OST:1]
cstor-OST0002_UUID 121226819924 84273533756 35738853464 70% /mnt/cstor[OST:2]
cstor-OST0003_UUID 121226819924 89476742140 30536591256 75% /mnt/cstor[OST:3]
$ lfs find old/ -t f -print0 | xargs -0 rm
$ lfs help
User Tools – lfs
Setting file/directory striping
$ mkdir dir_count_wide_A
$ lfs setstripe -c -1 -S 1048576 dir_count_wide_A
$ lfs getstripe dir_count_wide_A
count=3
stripe_count: 3 stripe_size: 1048576 stripe_offset: -1
User Tools – lfs
[Diagram: directory dir_count_wide_A striped wide across OST1-OST3.]
Setting file/directory striping
$ mkdir dir_count_wide_B
$ lfs setstripe -c 2 -S 2097152 dir_count_wide_B
$ lfs getstripe dir_count_wide_B
count=2
stripe_count: 2 stripe_size: 2097152 stripe_offset: -1
User Tools – lfs
[Diagram: directory dir_count_wide_B striped across two OSTs with a 2 MB stripe size.]
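The lfs commands above set striping from the shell. As a hedged illustration (not from the slides), an application can also request a layout at create time through liblustreapi, assuming the library and headers are installed (link with -llustreapi); the path and values below are made up for the example:

#include <stdio.h>
#include <lustre/lustreapi.h>

int main(void)
{
    /* Create a file with a 2 MB stripe size, a stripe count of 2, and let
     * Lustre pick the starting OST (-1), mirroring the lfs example above. */
    int rc = llapi_file_create("/mnt/cstor/dir_count_wide_B/data.bin",
                               2097152, /* stripe size in bytes */
                               -1,      /* stripe offset: any OST */
                               2,       /* stripe count */
                               0);      /* default (RAID0) pattern */
    if (rc != 0)
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
    return rc ? 1 : 0;
}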
TRACING
• Unexpected I/O patterns are common
  – Dusty decks or data access libraries may hide I/O
  – Many applications running concurrently and competing for resources
  – The original developer and intent may not be available
• Examples seen
  – In a loop: random seek forward or back approx. 1 GB, then read several bytes, taking days to process a 20 GB input; 16 bytes per I/O at 100 ops/sec is 14 days to read 20 GB
  – 1372 bytes written at strides of 175616 bytes, by 128 nodes concurrently
  – Uncoordinated concurrent writes to the end of a shared log file, over NFS
  – Record updates by re-writing the entire file
  – Open, write < 100 bytes, close, repeat
Tracing I/O Patterns
• strace is a great tool to easily obtain simple I/O profiles
• Gives us a good idea of the behavior of the application
• Try to determine where most of I/O time is spent
• Can be used to collect I/O system calls
  – e.g., all read, write, open, ... system calls
  – the Unix tool strace captures all of this information
• Can help with analyzing the behavior of the app
  – what files the app touched, how much data was moved, reads vs. writes, timing, etc.
Tracing
strace traces the execution of system calls:
strace -f -tt -T -e trace=open,close,read,write,lseek -o $OUT $EXE
-f traces child processes created by fork()
-tt tells strace to record time in microseconds
-T shows the time spent in each system call
-e trace= traces only these system calls
-o $OUT tells strace where to put the trace (don't put the trace output in the test FS)
$EXE is the command you would actually use to run the program
Instrumenting the application for strace is not always possible, so attach to a running process:
-p $PID attaches to an existing process ID
Using strace
fd = open(argv[1], O_RDWR | O_CREAT);
while ((rc = read(fd, buffer, 1024)) > 0)
    printf("%d\n", rc);
$ strace -f -tt -T -o /tmp/strace.out ./testProgram -arg1 -arg2
Dumps strace output to file /tmp/strace.out
time            system call and other info
-------------------------------------------------------
09:59:33.790079 open("/fs/home/test", O_RDWR|O_CREAT, 03777761102063720) = 3 <0.000026>
09:59:33.790154 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.000025>
09:59:33.790291 write(1, "1024\n", 5) = 5 <0.000013>
09:59:33.790339 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.000018>
09:59:33.790403 write(1, "1024\n", 5) = 5 <0.000007>
09:59:33.790442 read(3, "", 1024) = 0 <0.000048>
Simple Program Example
Collecting straces from MPI codes
Wrapper Script:
#!/bin/sh
# Commandline:
# mpirun -np 4 ./mpi-strace ./a.out <args>
unset STRACE_SUFFIX
STRACE_SUFFIX=`hostname`.$PMI_RANK # or, $$
OPTS="-f -F -tt -T -e trace=open,close,read,write,lseek"
exec strace $OPTS "$@" 2> strace.${STRACE_SUFFIX}
Tracing MPI Codes
Creates an strace file for each MPI process:
% mpirun -np 4 ./mpi-strace.sh ./IOR -o /fs/home/IOR.test -F
% ls strace.*
strace.sjSC-201.0
strace.sjSC-202.2
strace.sjSC-203.1
strace.sjSC-204.3
Each strace output file contains the trace from an individual rank.
Tracing MPI Codes
Darshan
• Excellent for profiling I/O
• No source code changes
• Requires recompiling the MPI application with Darshan
Tau
• Application profiling
Other Profiling Tools
BEST PRACTICES FOR APPLICATION I/O
• Increase ability for scaling by avoiding bottlenecks
• Take advantage of available bandwidth
• Increase streams to parallel storage using more than one client
• Where applicable, take advantage of client-side cache
• Use large transfers when possible, or an I/O library to describe smaller or noncontiguous transfers
Objectives
• Partition the problem domain
• Processes work on own subdomains independently
• Optionally, write intermediate results
– For restart after failure or other reasons
• Optionally, reiterate on writes
• Upon completion, assemble and integrate parts
• Write final result
– May be same format as intermediate results
General Application Behavior
• The I/O pattern affects performance.
• N-to-N streaming workloads can approach disk bandwidths.
• Random I/O is limited by disk seek performance.
• N-to-1 sharing affects performance because of locking contention.
• Strided I/O patterns are somewhere in between.
• Keep access aligned when possible.
General Design Considerations
Serial, "Rank 0", "Head node" model
– Bottleneck, no parallel paths
N-to-N: unique file-per-process (FPP)
– Restart needs the same number of nodes
– Many files to handle for post-processing
N-to-1: a single shared file (SSF)
– All data in a single file for restart purposes
– Easier for post-processing
Typical I/O Access
[Diagram: serial model – all nodes funnel through Node 0 to one file; N-to-N – each node writes its own file; N-to-1 – all nodes write to a single shared file.]
Contiguous
– All data from a task is contiguous in the file
Discontiguous
– Strided
– Random
N-1 Access Patterns
[Diagram: three clients (A, B, C) writing into one shared file – contiguous blocks per client, strided interleaving, and random placement.]
Lustre loves 1 MB I/Os
• But not if they aren't aligned on stripe boundaries; if your app has a natural stride, use that
• e.g., lfs setstripe -S 256k <dir>
More stripes = more parallel bandwidth
• clients >= stripe count
• Applicable to N-1 (SSF), not N-N (FPP)
Sequential vs. Random
• It's all about seeks
IO Patterns
Seagate Confidential 51
Writes are generally faster than reads • It’s all about seeks
Well-aligned I/O is always best • Else may be performing a read-modify-write
N-N is ideal, N-1 is more challenging • Avoid overlapping access to same file region • Lustre Distributed Lock Manager will enforce coherency (and serialize)
IO Patterns
Seagate Confidential 52
Buffered I/O for smaller I/Os • If application has 32KB I/O calls, e.g., use buffering (default) • Has some userspace/kernelspace overhead • May be cached for reads on client
Use O_DIRECT when appropriate • O_DIRECT is useful for large IO • Does not buffer client cache, so where you are not interested in reading the file • Does not allow for readahead (prefetching)
I/O Options
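A hedged sketch of the O_DIRECT case above (not from the slides): on Linux, O_DIRECT requires the buffer, offset, and transfer size to be suitably aligned, so the buffer is allocated with posix_memalign. The file name and sizes are illustrative.

#define _GNU_SOURCE           /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t xfer = 4 * 1024 * 1024;   /* 4 MB transfers, a multiple of 4 KB */
    void *buf;

    if (argc < 2)
        return 1;
    /* O_DIRECT needs an aligned buffer; 4 KB covers typical requirements */
    if (posix_memalign(&buf, 4096, xfer) != 0)
        return 1;
    memset(buf, 'x', xfer);

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* Large, aligned writes go straight to storage: no client caching,
     * and no readahead benefit on later reads of this data */
    for (int i = 0; i < 16; i++)
        if (write(fd, buf, xfer) != (ssize_t)xfer)
            break;

    close(fd);
    free(buf);
    return 0;
}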
Seagate Confidential 53
Deep directories have little contention, but do require serial lookups
Directory locks allow a single client to work unilaterally (dir-per-client)
Racing creates in a single dir must be serialized
Directory Layout
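A hedged sketch of the dir-per-client idea above (not from the slides): each MPI rank creates and uses its own subdirectory, so file creates are not racing in one shared directory. Directory and file names are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank;
    char dir[64], path[128];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One subdirectory per rank: creates inside it need no cross-client
     * coordination on a shared directory */
    mkdir("output", 0755);                       /* harmless if it already exists */
    snprintf(dir, sizeof(dir), "output/rank_%06d", rank);
    mkdir(dir, 0755);

    snprintf(path, sizeof(path), "%s/data.bin", dir);
    FILE *fp = fopen(path, "w");
    if (fp != NULL) {
        fprintf(fp, "rank %d\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}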
RULES OF THUMB
Bigger is better
§ Large files, large transfers, and large numbers of clients are good
Aggregate writes – use large transfers (at least 64 KB) when possible, or an I/O library to describe smaller or noncontiguous transfers
§ Contiguous data patterns utilize prefetching and write-behind far better than noncontiguous (or random) patterns
§ Collective I/O can aggregate for you, transforming disjoint accesses into contiguous access
But bigger isn't always necessary
§ I/O transfer size isn't critical for streaming workloads because of read-ahead and write batching
§ 64 KB is often just as good as 4 MB, and sometimes even better
N-to-N streaming workloads can approach disk bandwidths.
Rule of Thumb #1
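A hedged sketch of the write-aggregation advice above (not from the slides): many small records are packed into a 1 MB buffer and flushed with one large write() per megabyte instead of one small write() per record. Sizes and names are illustrative, and records are assumed to be smaller than the buffer.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define AGG_SIZE (1024 * 1024)          /* flush in 1 MB chunks */

static char agg_buf[AGG_SIZE];
static size_t agg_used = 0;

/* Buffer one record; write the whole buffer once it is full. */
static void buffered_write(int fd, const void *rec, size_t len)
{
    if (agg_used + len > AGG_SIZE) {
        write(fd, agg_buf, agg_used);   /* one large transfer */
        agg_used = 0;
    }
    memcpy(agg_buf + agg_used, rec, len);
    agg_used += len;
}

int main(void)
{
    int fd = open("records.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char record[128] = {0};

    for (int i = 0; i < 100000; i++)    /* 100,000 small records */
        buffered_write(fd, record, sizeof(record));

    if (agg_used > 0)                   /* flush the tail */
        write(fd, agg_buf, agg_used);
    close(fd);
    return 0;
}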
Sharing is important
§ Parallel file systems work hard to make sharing "perfect", so that clients anywhere in the network have an up-to-date view of files and their data (unlike NFS)
Sharing has a cost
§ Sharing between clients requires coordination by the file system, and therefore has a cost
§ N-to-1 sharing affects performance because the metadata manager has to mediate access
For N-1, use MPI-IO
§ Collective I/O and datatypes
§ MPI-IO hints when possible
Avoid overlapped write regions – best to use block-aligned data
Rule of Thumb #2
Design application I/O to describe as much as possible to the file system:
§ Open the file with the appropriate mode
§ Use collective calls when available
§ Describe data movement with the fewest possible operations
Match the file organization to the process partitioning if possible
§ Order dimensions so relatively large blocks are contiguous with respect to the data decomposition
Random I/O is limited by disk seek performance and read-modify-write
The I/O pattern affects performance.
Rule of Thumb #3
Parallel file systems are optimized for moving data, not metadata
§ Opening and closing files incurs overhead; keep file creates, opens, and closes to a minimum – open/close once if possible
Use your own subdirectory to avoid contention with others
Create multiple directories, and distribute across DNE if available
Avoid filling directories to their maximum
Don't use 'ls -l' or 'du' on millions of files (or to check file-size progress)
§ Similarly, turn off the color alias (ls --color)
§ Typically 'ls' is aliased to 'ls --color=tty', which actually does a stat
§ ls -l calls readdir() followed by a stat(2) call for every name returned by readdir()
Rule of Thumb #4
Increase ability for scaling by avoiding bottlenecks
Parallel I/O from compute nodes eliminates the head node bottleneck
Increase streams to parallel storage using more than one client
N-N access is generally tuned for parallel file systems
N-1 access may need additional tuning
Collective I/O may be appropriate for N-1
Use tracing to understand application I/O patterns
Keep access aligned when possible.
Rule of Thumb #5
PERFORMANCE
Streaming Bandwidth Performance
• L300/4TB Tardis
• CS9000/Makara
• IOR-2.10.3
• 16 clients
• Stonewalling (180 sec)
• Average of 5 iterations
Direct I/O
  CS9000: Write 9.6, Read 10.1
  L300:   Write 12.3, Read 13.6
Buffered I/O
  CS9000: Write 9.8, Read 9.4
  L300:   Write 13.3, Read 12.3
Streaming Bandwidth Performance
• L300N/4TB Tardis
• IOR-2.10.3
• 16 clients
• Stonewalling (180 sec)
• Average of 5 iterations
Direct I/O
  CS9000 Makara: Write 9.6, Read 10.1
  L300N Tardis:  Write 16.1, Read 16.0
Drive Write Performance -- iostat
[Chart: per-drive write throughput captured with iostat.]
Advanced MMU Prototype Performance
• Single MDS/MDT
• 2x SSU+1 (32 OSTs)
• RAID10 (5+5)
• mdtest-1.9.3
• Reporting File/s
MDS           Create   Stat    Open    Remove
CS9000 CMU    75K      220K    90K     65K
L300 MMU      55K      220K    95K     70K
L300 MMU-A    105K     350K    220K    80K
Large-Scale Lustre Scaling Testing
Test conducted at a customer site: 488 client nodes, FDR fabric
1. Scaling SSU+1 count
   – 36x L300 SSU+1, FDR (fabric bottlenecked for writes)
2. Scaling metadata servers (DNE)
   – 7x MDS/MDT: base MMU active/passive + 3x MMU active/active
Lustre Read Performance Scaling
• Test single SSU+1
• 14 compute nodes
• 12.97 GB/s read
• Project linear scaling
IOR -v -B -F -a POSIX -r -k -m -E -b 16G -t 64M -C -e -m -D 120 -i 3 -o $OUT
[Chart: IOR read performance (GB/s) vs. SSU+1 count, 4-36.]
Lustre Read Performance Scaling
• Scaling 4 - 36 SSU+1s
• 14 nodes per SSU+1
• Projection based on a single SSU+1
IOR -v -B -F -a POSIX -r -k -m -E -b 16G -t 64M -C -e -m -D 120 -i 3 -o $OUT
[Chart: projected IOR read performance (GB/s) vs. SSU+1 count, 4-36.]
Lustre Write Scaling
• Scaling 4 - 32 SSUs
• 12 compute nodes per SSU
• Projection based on 1 SSU
IOR -vv -k -B -F -a POSIX -w -b 256G -t 64M -C -e -m -D 120 -i 3 -o $OUT
[Chart: projected IOR write performance (GB/s) vs. SSU count, 4-32.]
Lustre Metadata Scaling
Scaling 1 - 7 DNEs, 64 nodes per DNE (1M files per DNE)
mdtest -i 3 -n $(( $DNES * 1048576 / $PROCS )) -F -u -C -E -T -r -v -i 3 -d $DIR[@$DIR2…]
[Charts: file creates/s, file stats/s, and file removes/s vs. number of DNEs, 1-7.]
• Latest Seagate Lustre client, based on IEEL
• Improved performance
• Best of Seagate and IEEL
IOR -t 1M -b 131072M -F -C -v -v -E -k -m -i 3 -w -r -o $OUTFILE

OS          Lustre Client             Write MB/s   Read MB/s
CentOS 6.5  Seagate 2.5.1             795.37       889.32
CentOS 6.5  OpenSFS 2.7.0             946.12       968.24
CentOS 6.5  Seagate 2.7.13 (IEEL)     1,048.41     1,475.64
CentOS 7.2  Seagate 2.5.1             991.38       1,274.77
CentOS 7.2  OpenSFS 2.7.0             1,032.32     1,209.10
CentOS 7.2  Seagate 2.7.13 (IEEL)     1,265.84     1,804.68
Single-Stream Performance
NytroXD Intelligent I/O Manager
Workloads are becoming increasingly unpredictable for the storage system:
• Streaming I/O
• Random I/O
• Unaligned I/O
Solution:
• Use the proper technology for the workload
• Stay cost conscious
Nytro Accelerator (NXD) is a new feature that combines new software functionality
○ Leverages the existing L300N/G300N dual SSDs to provide automatic, selective acceleration of specific I/O accesses characterized as "small"
○ The caching software, referred to as "NytroXD", acts as a filter driver that caches small blocks of data in an SSD cache to improve performance
The definition of a "small block" in this context is configurable
○ The IO bypass size parameter, which determines whether an I/O is accelerated or not, is settable by the user
NytroXD Intelligent I/O Manager
Basic NytroXD Data Path
[Diagram: client I/O of various sizes (4 KB, 32 KB, 64 KB, 1024 KB) arrives at OSS0/OSS1, each with a 41-HDD GridRAID OST and a 2-SSD NXD RAID10 cache; bypass size set to 32 KB.]
• I/O requests coming from the client are directed at the OSS
• Requests <= the bypass size are written to the SSDs
• Requests > the bypass size are written to the HDDs
• The default bypass is 32 KB, but it can be set as high as 1 MB
Basic NytroXD Data Path
[Diagram: the same configuration, with small writes (4 KB, 32 KB) landing in the SSD cache and being destaged to the HDD arrays.]
• As the SSD array fills, the data is destaged to the HDD arrays
• Data that arrives on the SSD is available for reading unless evicted from the cache
• The current caching policy favors accelerated writes
Seagate NytroXD, L300N – Write/Rewrite, Aligned I/O
Small, aligned, random I/O – IOR, 120-second stonewall, buffered I/O
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200 RPM HDDs)
[Charts: IOPS vs. transfer size (128 KB down to 1 KB) for aligned random writes and rewrites, NXD vs. no NXD.]
Seagate NytroXD, L300N – Write/Rewrite, Unaligned I/O
Small, unaligned, random I/O – IOR, 120-second stonewall, buffered I/O
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200 RPM HDDs)
[Charts: IOPS vs. transfer size (160000 down to 5000 bytes) for unaligned random writes and rewrites, NXD vs. no NXD.]
Seagate NytroXD, L300N – Read, Aligned/Unaligned I/O
Small, aligned/unaligned, random I/O – IOR, 120-second stonewall, buffered I/O
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200 RPM HDDs)
[Charts: IOPS vs. transfer size for aligned random reads (128 KB down to 1 KB) and unaligned random reads (160000 down to 5000 bytes), NXD vs. no NXD.]
Recommended Usage and Caveats for NytroXD
• NytroXD is primarily designed for small writes under the bypass threshold.
• Random Write/Rewrite I/O (and Read if in cache) is recommended.
• For random rewrite of data on the HDDs, 4KB-page aligned data will go to SSD cache.
• For random rewrite of data on the HDDs, unaligned data may incur a read-modify-write from the HDDs to the SSDs
Recommended Usage and Caveats for NytroXD
• Depending on the size of the SSDs and the workload, the data may not necessarily be available for reading from the SSD cache.
• Sequential buffered I/O will be aggregated on the client into requests typically larger than the bypass setting.
• Sequential Direct I/O with requests smaller than or equal to the bypass will go to SSD cache.
• As the utilization of the SSD reaches 40% and 60%, the destaging of data (copying from SSDs to HDDs) will accelerate.
• When the utilization reaches 80% of the SSD, read eviction increases to accommodate additional incoming writes.
Monitoring NytroXD Usage - perfmon
/opt/seagate/nytrocli/nytrocli64 /xd show perfmon
<snip>
purple04: Name of the Cache Group = nxd_cache_0
purple04: Number of VDs = 1
purple04: Number of Cache Devices = 1
purple04: Queue Depth = 4096
purple04: Total Cache Size = 371.054 GiB
purple04: Cache Size in use = 26.692 GiB
purple04: Cache Block Size = 4 KiB
purple04: Cache Window Size = 64 KiB
purple04: Bypass IO Size = 1.0 MiB
purple04: Total number of I/Os = 1076225215
purple04: Number of reads = 516326249
purple04: Number of writes = 559898966
purple04: Total number of bypass I/Os = 200012374
purple04: Number of bypass reads = 52053543
purple04: Number of bypass writes = 147958831
purple04: Cache Hits = 660474280
purple04: Cache Misses = 18446744073
purple04: Number of dirty CWs = 435362
purple04: Total Cache Blocks flushed = 3812514848
Histogram Feature in NytroXD
• To use the histogram of I/O sizes being written/read on the OSS, the following may be configured: manually clear the cache and check the cache group name (e.g., nxd_cache_0):
nytrocli64 /xd show perfmon | grep "Cache Group"
nytrocli64 /xd get cachestate cg=nxd_cache_0
nytrocli64 /xd set cachestate=disable cg=nxd_cache_0
nytrocli64 /xd get cachestate cg=nxd_cache_0
nytrocli64 /xd set histparam=enable
<RUN WORKLOAD>
Histogram Feature in NytroXD • To collect the results:
nytrocli64 /xd show histogram
• Once completed, disable the histogram and re-enable the cache:
nytrocli64 /xd set histparam=disable
nytrocli64 /xd set cachestate=enable cg=nxd_cache_0
nytrocli64 /xd get cachestate cg=nxd_cache_0
Histogram Output
The histogram shows both writes and reads.
This is a test of sequential 128 KB (direct I/O) writes and random buffered 4 KB writes (only writes shown here):
Num Writes < 4K          = 0
Num Writes 4K            = 9943751
Num Writes 4K+1 - 8K     = 12902
Num Writes 8K+1 - 16K    = 76
Num Writes 16K+1 - 32K   = 0
Num Writes 32K+1 - 64K   = 0
Num Writes 64K+1 - 128K  = 1034128
Num Writes 128K+1 - 256K = 0
Num Writes 256K+1 - 512K = 0
Num Writes 512K+1 - 1M   = 0
Num Writes 1M+1 - 2M     = 0
Num Writes 2M+1 - 4M     = 0
Num Writes 4M+1 - 8M     = 0
Num Writes 8M+1 - 16M    = 0
Num Writes 16M+1 - 32M   = 0
I/O INTERFACES: POSIX
Applications require more software than just a parallel file system
An interface between the application and file system is needed
We’ll be looking at these for application access: § POSIX § MPI-IO
Both can provide layer for higher-level I/O libraries
Application I/O Interfaces
[Diagram of the stack: the application sits on higher-level I/O libraries, POSIX, and MPI-IO, which in turn sit on the parallel file system and the I/O hardware.]
POSIX is the IEEE Portable Operating System Interface for Computing Environments
Defines a standard way for an application program to obtain basic services from the O/S
POSIX I/O is the mechanism that almost all serial applications use to perform I/O
Created when a single computer owned its own file system
• No ability to describe parallel I/O constructs
open(), seek(), write(), read(), close()
• The O/S maps these calls directly into file system operations
Provides a useful, ubiquitous interface for basic I/O
POSIX I/O
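A hedged, minimal sketch of the calls listed above (not from the slides); the file name is illustrative:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char out[] = "hello posix\n";
    char in[sizeof(out)];

    int fd = open("posix_demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    write(fd, out, strlen(out));      /* write a few bytes */
    lseek(fd, 0, SEEK_SET);           /* seek back to the start */
    read(fd, in, sizeof(in) - 1);     /* read them back */
    close(fd);
    return 0;
}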
Lustre gets high bandwidth at sync time by coalescing contiguous dirty buffers into a single large transfer to the OSTs
Maximum amount of dirty data in the kernel = 1024MB
Linux uses “flush on close” semantics
• Keep in mind if your application does multiple open/write/close cycles on the same file
File Writing
Write syscall size
§ Small write syscalls have a large overhead in the kernel
§ Large write syscalls (> 1 MB) may not offer a gain for buffered I/O (though they will for direct I/O)
Example test run (1 client, buffered I/O)
§ 4 KB writes: 400 MB/sec
§ 1 MB writes: 1 GB/sec
§ 16 MB writes: 1 GB/sec
Write system call alignment
§ It is very beneficial to align writes to page-size boundaries
§ This prevents read-modify-write cycles all the way down through the system
File Writing
Read syscall size
§ Small read syscalls (< 1 page) will trigger a read of the entire page to populate the cache
§ Large read syscalls (> 1 MB) may not offer a gain for buffered I/O
Example test run (1 client, buffered I/O)
§ 4 KB reads: 500 MB/sec
§ 1 MB reads: 1 GB/sec
§ 16 MB reads: 1 GB/sec
Read system call alignment
§ Does not affect performance the way it does for writes, since whole blocks are read from the OSTs
§ Sequential reads will trigger background readahead
File Reading
Optimize for the common case: most file accesses are totally sequential reads
Launch large asynchronous reads in background
• Lustre client’s readahead can get benefits from the Object Storage Server doing readahead
Readahead will be turned off by non-sequential read patterns in the application
Only available in buffered I/O, not direct I/O
Read Ahead
• Observe the I/O pattern and if sequential or strided, launch background reads ahead of time
• Only effective with forward sequential/strided reads
Readahead
I/O INTERFACES: MPI-IO
MPI-IO – the I/O interface specification for MPI applications
The data model is the same as POSIX
• A stream of bytes in a file; not self-describing and carrying no additional metadata
• Allows more complex patterns than POSIX
Features:
• Noncontiguous I/O with MPI datatypes and filetypes
• Collective I/O operations – allows for optimizations
  • Complex data patterns described in a single operation, passing more information
• Nonblocking I/O access
  • Allows overlap of I/O and computation, since work can progress before the I/O call returns
• MPI hints for specific tunings
MPI-IO
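A hedged sketch of the nonblocking feature above (not from the slides): MPI_File_iwrite_at starts a write and returns immediately, computation can proceed, and MPI_Wait completes the request. The file name and sizes are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                 /* 1M doubles per rank (8 MB) */
    int rank;
    MPI_File fh;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(n * sizeof(double));
    if (buf == NULL)
        return 1;
    for (int i = 0; i < n; i++)
        buf[i] = (double)rank;

    MPI_File_open(MPI_COMM_WORLD, "async.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Start the write at this rank's offset, then keep computing while
     * the I/O progresses in the background */
    MPI_File_iwrite_at(fh, (MPI_Offset)rank * n * sizeof(double),
                       buf, n, MPI_DOUBLE, &req);

    /* ... overlap with computation on other data here ... */

    MPI_Wait(&req, &status);               /* buf may be reused after this */
    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}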
Independent I/O operations specify only what a single process will do
Many applications have phases of computation and I/O
§ During the I/O phases, all processes read/write data
Collective I/O is coordinated access to storage by a group of processes
§ The collective I/O function is called by all processes participating in the I/O
§ Allows the I/O layers to know more about the access as a whole, giving more opportunities for optimization in the lower software layers
§ Without collective I/O there is no understanding of what other processes are doing, so access cannot be coordinated efficiently between them
Independent and Collective I/O
[Diagram: five processes (P0-P4) performing independent I/O vs. the same processes performing collective I/O.]
Problems with independent, noncontiguous access
§ Lots of small accesses
Idea: reorganize access to match the layout on disk
§ Single processes use data sieving to get data for many
§ Often reduces total I/O through sharing of common blocks
For reads, a second "phase" redistributes data to its final destinations
Two-phase writes operate in reverse (redistribute, then I/O)
Collective I/O and Two-Phase I/O
[Diagram: three processes (P0-P2) – initial state, Phase I: I/O, Phase II: redistribution.]
MPI I/O version of “Hello World”
First program writes a file with test in it
Second program reads back the file and prints the contents
Shows basic API use and error checking
Simple MPI I/O Examples
#include <mpi.h>
#include <mpio.h> /* may be necessary on some systems */
int main(int argc, char **argv)
{
int ret, count, rank;
char buf[13] = "Hello World\n";
MPI_File fh;
MPI_Status status;
MPI_Init(&argc, &argv);
ret = MPI_File_open(MPI_COMM_WORLD, "myfile",
MPI_MODE_WRONLY | MPI_MODE_CREATE,
MPI_INFO_NULL, &fh);
if (ret != MPI_SUCCESS) return 1;
/* continues on next slide */
Simple MPI-IO: Writing (1)
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* with 13 tasks, each writes single byte of string */
ret = MPI_File_write_at(fh, rank, &buf[rank], 1, MPI_CHAR, &status);
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
Simple MPI-IO: Writing (2)
#include <mpi.h>
#include <mpio.h>
#include <stdio.h>
int main(int argc, char **argv)
{
int ret, count;
char buf[13];
MPI_File fh;
MPI_Status status;
MPI_Init(&argc, &argv);
ret = MPI_File_open(MPI_COMM_WORLD, "myfile",
MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
if (ret != MPI_SUCCESS) return 1;
/* continues on next slide */
Simple MPI-IO: Reading (1)
ret = MPI_File_read(fh, buf, 13, MPI_CHAR, &status);
if (ret != MPI_SUCCESS) return 1;
MPI_Get_count(&status, MPI_CHAR, &count);
if (count != 13) return 1;
printf("%s", buf);
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
Simple MPI-IO: Reading (2)
$ mpicc mpiio-hello-write.c -o mpiio-hello-write
$ mpicc mpiio-hello-read.c -o mpiio-hello-read
$ mpiexec -n 13 mpiio-hello-write
$ mpiexec -n 3 mpiio-hello-read
Hello World
Hello World
Hello World
$ ls -l myfile
-rwxr-xr-x 1 bloewe users 13 Jan 18 08:19 myfile
$ cat myfile
Hello World
Compiling and Running
Array to be written to a common file containing the global array in row-major order
Contiguous data in memory, but noncontiguous in file (stored in row-major order)
Example: Distributed Arrays
[Diagram: a 2D array of m rows and n columns distributed block-wise across a 2x3 process grid (P0-P5).]
gsizes[0] = 2; /* no. of rows in global array */
gsizes[1] = 3; /* no. of columns in global array */
distribs[0] = MPI_DISTRIBUTE_BLOCK; /* block distribution */
distribs[1] = MPI_DISTRIBUTE_BLOCK; /* block distribution */
dargs[0] = MPI_DISTRIBUTE_DFLT_DARG; /* default block size */
dargs[1] = MPI_DISTRIBUTE_DFLT_DARG; /* default block size */
psizes[0] = 2; /* no. of processes in vertical dimension
of process grid */
psizes[1] = 3; /* no. of processes in horizontal dimension
of process grid */
Setup Parameters
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_create_darray(6, rank, 2, gsizes, distribs,
dargs, psizes, MPI_ORDER_C, MPI_FLOAT,
&filetype);
MPI_Type_commit(&filetype);
Filetype – MPI datatype specifying what portion of the file is visible to the process and what type of data it is
May be basic etype or derived datatype from etypes
Define Filetype
MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype,
"native", MPI_INFO_NULL);
local_array_size = num_local_rows * num_local_cols;
MPI_File_write_all(fh, local_array, local_array_size,
MPI_FLOAT, &status);
MPI_File_close(&fh);
By combining the noncontiguous datatype with the collective access, can merge several small requests into a few larger requests
Open File, Write, Close
Test case (write, interleaved):
§ 6 processes
§ 4,096 floats per process
§ 98 MB file
Comparing:
§ The collective I/O example
§ Independent I/O (multiple I/O calls)
Distributed Array Performance
Collective vs. Independent I/O: collective/datatype took 2 seconds; independent took 37 seconds.
Buffer size
§ cb_buffer_size – controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O
Read/Write
§ romio_cb_read – controls when collective buffering is applied to collective read operations
§ romio_cb_write – controls when collective buffering is applied to collective write operations
Aggregators
§ cb_nodes – controls the number of aggregator nodes
§ cb_config_list – provides explicit control over aggregators (see the ROMIO User's Guide)
MPI-IO Collective Hints
Pass hints in an MPI_Info object to MPI_File_open():
§ MPI_Info mpiHints;
§ MPI_Info_create(&mpiHints);
§ MPI_Info_set(mpiHints, key1, value1);
§ MPI_Info_set(mpiHints, key2, value2);
§ MPI_File_open(comm, fileName, fd_mode, mpiHints, &fd);
§ MPI_Info_free(&mpiHints);
Using MPI-IO Hints
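A hedged, concrete version of the pattern above (not from the slides), using the collective-buffering hints named on the previous slide; the values are illustrative and support depends on the MPI-IO implementation:

#include <mpi.h>

/* Open a shared file for writing with collective-buffering hints set. */
MPI_File open_with_hints(const char *fileName)
{
    MPI_Info mpiHints;
    MPI_File fd;

    MPI_Info_create(&mpiHints);
    MPI_Info_set(mpiHints, "cb_buffer_size", "16777216"); /* 16 MB two-phase buffer */
    MPI_Info_set(mpiHints, "romio_cb_write", "enable");   /* collective buffering for writes */
    MPI_Info_set(mpiHints, "romio_cb_read",  "enable");   /* collective buffering for reads  */

    MPI_File_open(MPI_COMM_WORLD, (char *)fileName,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, mpiHints, &fd);
    MPI_Info_free(&mpiHints);
    return fd;
}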
Different MPI I/O implementations exist
We're interested in ROMIO (part of MPICH) from Argonne National Laboratory
§ ROMIO's ADIO layer – an abstract I/O interface optimized for different file systems
§ ROMIO contains patches to support Lustre, allowing hints to be passed through MPI-IO to Lustre for layout
§ Supports local file systems, network file systems, and parallel file systems
§ Includes data sieving and two-phase optimizations
MPI-IO: ROMIO Implementation
[Diagram: ROMIO's layered architecture – the MPI-IO interface and common functionality sit on the ADIO interface, with drivers for UFS, PVFS, NFS, and Lustre.]
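A hedged sketch (not from the slides) of passing layout hints through ROMIO to Lustre, as described above: "striping_factor" and "striping_unit" are the commonly documented ROMIO hint names for stripe count and stripe size, but the names and support should be checked against your MPI/ROMIO version; the values are illustrative.

#include <mpi.h>

/* Create a file striped over 8 OSTs with a 1 MB stripe size; layout hints
 * only take effect when the file is created. */
MPI_File create_striped_file(const char *fileName)
{
    MPI_Info mpiHints;
    MPI_File fd;

    MPI_Info_create(&mpiHints);
    MPI_Info_set(mpiHints, "striping_factor", "8");      /* stripe count */
    MPI_Info_set(mpiHints, "striping_unit", "1048576");  /* stripe size in bytes */

    MPI_File_open(MPI_COMM_WORLD, (char *)fileName,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, mpiHints, &fd);
    MPI_Info_free(&mpiHints);
    return fd;
}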
MPI I/O provides a rich interface allowing us to describe:
§ Noncontiguous accesses in memory, in the file, or both
§ Collective I/O
More information on MPI-IO is available in:
§ W. Gropp, E. Lusk, R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.
MPI-IO Final Thoughts
Questions?