
Page 1

PARALLEL FILE SYSTEMS

Page 2

Agenda

Parallel File Systems:
•  History
•  What?
•  Why?
•  So What?
•  Example: Lustre and Spectrum Scale (GPFS)

•  Best Practices for Parallel I/O

Page 3

Evolution of Parallel File Systems

[Timeline diagram: 1980s–early 1990s, workstation file sharing → 1990s, network file servers (Auspex, Sun/NFS, NetApp) with a single-server BOTTLENECK → today, clustered Linux parallel file systems that separate the management, metadata, and data paths across a metadata server, storage servers, and compute clients.]

Page 4

Benefits of Parallel File Systems

•  Increase the ability to scale by avoiding bottlenecks (separate metadata and data)
•  Take advantage of available bandwidth
•  Increase parallel streams to storage by using more than one client
•  Allow the compute cluster to achieve higher utilization by not waiting on I/O
•  Grow clusters with the ability to perform larger calculations

[Diagram: a metadata server, storage servers, and a compute client connected by separate management, metadata, and data paths.]

Page 5

Major Parallel File Systems Today

Lustre:
•  High performance and high scalability
•  No licensing cost
•  Common at national labs and universities

Spectrum Scale (aka GPFS, from IBM):
•  When the customer is asking for features such as snapshots and migration between disk tiers
•  Home directories and fast scratch in the same file system
•  Large numbers of files, high metadata throughput, and high-speed storage for a compute cluster

Page 6

A Lustre Cluster

[Diagram: Lustre clients (1–100,000) connect over multiple network types (40/100GbE, EDR InfiniBand, OPA) and through routers to metadata servers (MDS, ClusterStor HA-MDS) and object storage servers (OSS, 1–1,000s, ClusterStor SSUs). Each MDS fronts a metadata target (MDT) and each OSS fronts object storage targets (OSTs) on disk arrays over a SAN fabric; NFS and CIFS clients reach the file system through a gateway.]

Page 7

Summary

•  When you need high performance and high scalability -> parallel file systems
•  Parallel file systems in production -> ClusterStor

Page 8

BEST PRACTICES FOR PARALLEL I/O

Page 9

Agenda

Parallel File Systems Background
§  Parallel I/O performance and ClusterStor Lustre

Optimal Configuration for KAUST
§  Recommended tuning options for HPC workloads

Tools to Identify Issues
§  Visibility/monitoring on servers/clients
§  File/directory striping
§  I/O tracing

Best Practices for Application Usage of Lustre
§  Access patterns
§  Metadata usage
§  DOs and DON'Ts

Performance
§  L300 – bandwidth, metadata
§  L300N – NytroXD I/O accelerator

POSIX and MPI-IO
§  POSIX and directly accessing the file system
§  Use of MPI-IO as a middleware I/O library
§  MPI-IO hints
§  Examples

Page 10

LUSTRE PARALLEL I/O PERFORMANCE BACKGROUND

Page 11

Lustre Components

[Diagram: clients connect to one MDS and multiple OSSs. The MDS handles directory operations, file open/close, metadata, and concurrency, as well as file creation, file status, and recovery. The OSSs handle file I/O and locking.]

Page 12

Lustre Parallel File System

Lustre is an open source, distributed parallel file system.
§  Object-based design provides extreme scalability
§  Compute clients interact directly with storage servers
§  Comprised of:
   §  Clients
   §  Metadata servers and targets
   §  Storage servers and targets

Page 13

L300 ClusterStor Management (SMU/MMU): Management and Metadata (MDS/MDT)

CS Manager and MDS/MGS nodes
§  2RU integrated controllers
   – Server 1: CSM management
   – Server 2: Boot
   – Server 3: MGS/MDS
   – Server 4: MDS
§  Fault tolerance, serviceability

2U24 JBOD – MDT
§  SAS JBOD for the MDS
§  Disk configuration
   – Qty 20 drives for the MDT
   – 2x RAID10, 5+5
   – Qty 2 global hot spares

Page 14

Scalable Storage Unit (SSU)

§  5U84 enclosure
§  2 object storage servers per SSU
§  Two (2) trays of 41 HDDs each for object storage targets
§  2 SSDs (WIBs, journals, NytroXD)
§  H/A on each SSU
§  InfiniBand EDR, 40GbE, OPA data network connectivity

Page 15

ClusterStor Hardware and the Lustre File System

[Diagram: object storage servers (Seagate embedded application servers) with object storage targets in Seagate 5U84 storage enclosures; metadata and management servers (2U x 4 servers) with a metadata target in a Seagate 2U24 JBOD.]

File striping walkthrough:
1) Client asks the MDS: where is the file?
2) MDS replies: the file is at...
3) A single file (3,072KB)
4) The file is broken into block stripe segments (1,024KB each)
5a) File block stripe 1 of 3 (1,024KB)
5b) File block stripe 2 of 3 (1,024KB)
5c) File block stripe 3 of 3 (1,024KB)

Page 16

MDS <-> OST Interaction: Use Cases

•  Create object – create a new data object for a file
•  Unlink – delete the data object if the file is unlinked
•  StatFS – get info from the OSTs for the size of a file and the space on each OST

Page 17

File on Lustre

Handled by the MDT:
•  File attributes (name, permissions, owner, security label)
•  Mapping from name to data objects (aka the LOV EA), stored in extended attributes on the MDT

Handled by the OSTs:
•  Data blocks and actual size (each data block is an object on an OST)

Page 18

Striping

●  Allows file data to be stored evenly on multiple OSTs
●  RAID 0 pattern

[Diagram: File A is striped as objects 1, 2, 3 across OST1, OST2, and OST3; File B is a single object on one OST.]

Example: single-stripe file – all of the file's data is on a single OST.

Page 19

Striping Terminology

•  Object – the stripes belonging to one file on an OST
•  Stripe count – the number of objects involved
•  Stripe size – the maximum size of one stripe
•  The stripe object sizes add up to the total file size

In this example:
•  File A stripe count = 2
•  File B stripe count = 3
•  Typical stripe size is 1MB
•  File B's stripe size is twice that of File A

[Diagram: File A's stripes 1, 3, 5 on OST1 and 2, 4 on OST2; File B's stripes 1, 2, 3 across OST1–OST3.]

Page 20

Striping (cont.)

Example: fully-striped files
•  These are fully striped files, meaning striped over all OSTs
•  Also called wide striping
•  This achieves maximum bandwidth to one file
•  File A has a hole or is sparse
•  This happens commonly in HPC usage of a file system

[Diagram: stripes 1, 4, 7 on OST1; 2, 5 on OST2; 3, 6, 9 on OST3 — one stripe is absent, leaving the hole in File A.]

Page 21

Overview

●  Parallel I/O performance comes from separating the metadata and data objects
●  The OSSs handle the containers for the component objects
●  A file's data can be striped across multiple OSTs for improved parallelism
●  Typically this striping is used for very large single files accessed by multiple processes

Page 22

CONFIGURATION AND TUNING

Page 23

Client-Side Tunables

RPCs
•  lctl set_param osc.*.max_rpcs_in_flight=256 [default: 8]
•  lctl set_param osc.*.max_pages_per_rpc=1024 [default: 256]

Write cache
•  lctl set_param osc.*.max_dirty_mb=1024 [default: 32]
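These settings take effect immediately but, as a rule, do not survive a client remount. A minimal sketch of applying and verifying them on one client (the persistent -P form is assumed to be available on newer Lustre releases):

# Apply the RPC tunables, then read them back to verify
lctl set_param osc.*.max_rpcs_in_flight=256
lctl set_param osc.*.max_pages_per_rpc=1024
lctl get_param osc.*.max_rpcs_in_flight osc.*.max_pages_per_rpc

# Assumed: newer Lustre releases accept -P to make a setting persistent
# lctl set_param -P osc.*.max_rpcs_in_flight=256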

Page 24

Client-Side Tunables

Readahead – best for streaming read workloads
•  lctl set_param llite.*.max_read_ahead_mb=1024 [default: 40]
•  lctl set_param llite.*.max_read_ahead_per_file_mb=1024 [default: 40]

Readahead – is your workload strided?
•  May leave the value at the default (40), or tune to a lower value

Readahead step
•  lctl set_param llite.*.read_ahead_step=4 (Seagate client only)

Page 25

Client-Side Tunables

Wire checksums
•  /proc/fs/lustre/osc/*/checksums=0

LRU size (dynamic by default) – controls the number of client-side locks in the LRU cached-locks queue
•  lctl set_param ldlm.namespaces.*osc*.lru_size=0

LNET verbosity
•  sysctl -w lnet.debug=0

Page 26

While there are tunables available on the servers, typically the system defaults are optimal. In special cases, Seagate Support and Engineering may recommend specific changes.

Server-Side Tunables

Page 27

TOOLS

Page 28

Visibility Tools – Server Performance Monitoring

A number of tools are available on the management server for monitoring, e.g. ltop (and the GUI):

$ ltop
Filesystem: cstor
    Inodes:  1434.250m total, 0.006m used ( 0%), 1434.244m free
    Space:   451.605t total, 160.525t used (36%), 291.080t free
    Bytes/s: 0.000g read, 0.000g write, 0 IOPS
    MDops/s: 0 open, 0 close, 0 getattr, 0 setattr
             0 link, 0 unlink, 0 mkdir, 0 rmdir
             0 statfs, 0 rename, 0 getxattr
>OST S OSS     Exp CR rMB/s wMB/s IOPS LOCKS LGR LCR %cpu %mem %spc
0000 F cstor04  17  0     0     0    0     0   0   1    2   48    0
0001 F cstor05  17  0     0     0    0     0   0   0    2   49    0
0002 F cstor06  17  0     0     0    0    96   0   0   33   99   69
0003 F cstor07  17  0     0     0    0    96   0   0   26   99   73

Page 29

Visibility Tools – Client Performance Monitoring

For client-side tools, check the RPC write/read stats:

$ lctl get_param osc.*OST0000*.import
rpcs:
    inflight: 0
    unregistering: 0
    timeouts: 0
    avg_waittime: 304635 usec
read_data_averages:
    bytes_per_rpc: 16777216
    usec_per_rpc: 105534
    MB_per_sec: 158.97
write_data_averages:
    bytes_per_rpc: 16692814
    usec_per_rpc: 759389
    MB_per_sec: 21.98

Page 30

Visibility Tools – Client Performance Monitoring

For client-side tools, check the RPC stats:
•  Pages per RPC – here writes are almost all 1024-page RPCs; 1024 pages × 4KB = 4MB RPCs

$ lctl get_param osc.*OST*.rpc_stats
                     read                  write
pages per rpc   rpcs    %  cum %  |  rpcs    %  cum %
1:                 0    0      0  |     3    0      0
...
256:          587809   64     64  |   868    0      0
512:            4734    0     64  |  1270    0      0
1024:         320913   35    100  | 467052  99    100

                     read                  write
rpcs in flight  rpcs    %  cum %  |  rpcs    %  cum %
0:                 0    0      0  |     0    0      0
1:              8870    0      0  |    49    0      0
...
4:             22367    2      7  |    48    0      0
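rpc_stats accumulates from mount time, so resetting it just before a benchmark makes the histogram reflect only that run. A hedged sketch (writing 0 as the reset mechanism is an assumption):

# Assumed reset semantics: clear the cumulative RPC stats, run, then re-read
lctl set_param osc.*OST*.rpc_stats=0
# ... run the workload ...
lctl get_param osc.*OST*.rpc_stats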

Page 31

Visibility Tools – Client Metadata Monitoring

For metadata statistics:

$ llstat /proc/fs/lustre/mdc/*MDT*/md_stats
snapshot_time         1488139067.685549
close                 66
create                9
getattr               1
intent_lock           402
read_page             23
unlink                285
intent_getattr_async  244
revalidate_lock       567

Page 32

Visibility Tools – Client Monitoring OSS Statistics

$ llobdstat /proc/fs/lustre/osc/*OST0000* 1
/usr/bin/llobdstat on /proc/fs/lustre/osc/cstor-OST0003-osc-ffff8807f72e7400
Processor counters run at 1200.000 MHz
Read: 1.96468e+12, Write: 1.92445e+12, create/destroy: 0/0, stat: 0, punch: 0
[NOTE: cx: create, dx: destroy, st: statfs, pu: punch ]
Timestamp    Read-delta  ReadRate     Write-delta  WriteRate
--------------------------------------------------------
1428086691   412.00MB    411.85MB/s   0.00MB       0.00MB/s
1428086692   184.00MB    183.93MB/s   0.00MB       0.00MB/s
1428086693   296.00MB    295.87MB/s   0.00MB       0.00MB/s

Page 33

User Tools – lfs

Use Lustre- and stripe-aware tools instead of standard Unix equivalents:

$ lfs df
UUID                  1K-blocks     Used         Available     Use%  Mounted on
cstor-MDT0000_UUID    2255452564    4317344      2221057108    0%    /mnt/cstor[MDT:0]
cstor-OST0000_UUID    121226819924  119016       120014045804  0%    /mnt/cstor[OST:0]
cstor-OST0001_UUID    121226819924  119016       120014045804  0%    /mnt/cstor[OST:1]
cstor-OST0002_UUID    121226819924  84273533756  35738853464   70%   /mnt/cstor[OST:2]
cstor-OST0003_UUID    121226819924  89476742140  30536591256   75%   /mnt/cstor[OST:3]

$ lfs find old/ -t f -print0 | xargs -0 rm
$ lfs help
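A small usage note: lfs df also accepts -h, printing the same per-MDT/per-OST view in human-readable units.

$ lfs df -h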

Page 34

User Tools – lfs

Setting file/directory striping:

$ mkdir dir_count_wide_A
$ lfs setstripe -c -1 -S 1048576 dir_count_wide_A
$ lfs getstripe dir_count_wide_A
stripe_count: 3  stripe_size: 1048576  stripe_offset: -1

[Diagram: files in dir_count_wide_A are striped across all three OSTs, OST1–OST3.]
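Striping can also be set per file: running lfs setstripe on a path that does not yet exist creates the file with that layout. A hedged sketch (path and values illustrative):

$ lfs setstripe -c 4 -S 4194304 /mnt/cstor/output.dat   # 4 stripes, 4MB stripe size
$ lfs getstripe /mnt/cstor/output.dat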

Page 35

User Tools – lfs

Setting file/directory striping:

$ mkdir dir_count_wide_B
$ lfs setstripe -c 2 -S 2097152 dir_count_wide_B
$ lfs getstripe dir_count_wide_B
stripe_count: 2  stripe_size: 2097152  stripe_offset: -1

[Diagram: files in dir_count_wide_B are striped across two of the three OSTs.]

Page 36

TRACING

Page 37

Tracing I/O Patterns

•  Unexpected I/O patterns are common
   – Dusty decks or data-access libraries may hide I/O
   – Many applications run concurrently and compete for resources
   – The original developer and intent may not be available
•  Examples seen:
   – In a loop: random seek forward or back approx. 1GB, then read several bytes, taking days to process a 20GB input; 16 bytes per I/O at 100 ops/sec is 14 days to read 20GB
   – 1372 bytes written at strides of 175616 bytes, by 128 nodes concurrently
   – Uncoordinated concurrent writes to the end of a shared log file, over NFS
   – Record updates by re-writing the entire file
   – Open, write < 100 bytes, close, repeat

Page 38

Tracing

•  strace is a great tool for easily obtaining simple I/O profiles
•  Gives a good idea of the behavior of the application
•  Try to determine where most of the I/O time is spent
•  Can be used to collect I/O system calls
   – e.g., all read, write, open, ... system calls
   – the Unix tool strace captures all of this information
•  Can help with analyzing the behavior of the app
   – what files the app touched, how much data was moved, reads vs. writes, timing, etc.

Page 39

Using strace

strace traces the execution of system calls:

strace -f -tt -T -e trace=open,close,read,write,lseek -o $OUT $EXE

-f          trace child processes created by fork()
-tt         record timestamps with microsecond precision
-T          show time spent in each system call
-e trace=   trace only the listed system calls
-o $OUT     write the trace here (don't put trace output in the test file system)
$EXE        the command you would actually use to run the program

Instrumenting the application for strace is not always possible, so attach to a running process instead:

-p $PID     attach to an existing process ID
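For a quick profile without reading the raw trace, strace can also aggregate call counts and time per syscall; a minimal sketch ($PID illustrative):

# Attach and summarize time per system call; detach (Ctrl-C) to print the table
strace -c -p $PID -o /tmp/strace.summary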

Page 40

Simple Program Example

fd = open(argv[1], O_RDWR | O_CREAT);
while ((rc = read(fd, buffer, 1024)) > 0)
    printf("%d\n", rc);

$ strace -f -tt -T -o /tmp/strace.out ./testProgram -arg1 -arg2

Dumps strace output to /tmp/strace.out (time, system call, and other info):

09:59:33.790079 open("/fs/home/test", O_RDWR|O_CREAT, 03777761102063720) = 3 <0.000026>
09:59:33.790154 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.000025>
09:59:33.790291 write(1, "1024\n", 5) = 5 <0.000013>
09:59:33.790339 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.000018>
09:59:33.790403 write(1, "1024\n", 5) = 5 <0.000007>
09:59:33.790442 read(3, "", 1024) = 0 <0.000048>

Page 41

Tracing MPI Codes

Collecting straces from MPI codes – wrapper script:

#!/bin/sh
# Command line:
#   mpirun -np 4 ./mpi-strace ./a.out <args>

unset STRACE_SUFFIX
STRACE_SUFFIX=`hostname`.$PMI_RANK   # or, $$

OPTS="-f -F -tt -T -e trace=open,close,read,write,lseek"

exec strace $OPTS "$@" 2> strace.${STRACE_SUFFIX}

Page 42

Creates an strace file for each MPI process:

% mpirun -np 4 ./mpi-strace.sh ./IOR -o /fs/home/IOR.test -F
% ls strace.*
strace.sjSC-201.0
strace.sjSC-202.2
strace.sjSC-203.1
strace.sjSC-204.3

Each strace output file contains the trace from an individual rank.

Tracing MPI Codes

Page 43

Darshan
•  Excellent for profiling I/O
•  No source-code changes
•  Requires recompiling/relinking the MPI application with Darshan

Tau
•  Application profiling

Other Profiling Tools

Page 44

BEST PRACTICES FOR APPLICATION I/O

Page 45

Objectives

•  Increase the ability to scale by avoiding bottlenecks
•  Take advantage of available bandwidth
•  Increase streams to parallel storage by using more than one client
•  Where applicable, take advantage of the client-side cache
•  Use large transfers when possible, or an I/O library to describe smaller or noncontiguous transfers

Page 46

General Application Behavior

•  Partition the problem domain
•  Processes work on their own subdomains independently
•  Optionally, write intermediate results
   – For restart after failure, or other reasons
•  Optionally, reiterate on writes
•  Upon completion, assemble and integrate the parts
•  Write the final result
   – May be the same format as the intermediate results

Page 47

General Design Considerations

•  The I/O pattern affects performance
•  N-to-N streaming workloads can approach disk bandwidths
•  Random I/O is limited by disk seek performance
•  N-to-1 sharing affects performance because of locking contention
•  Strided I/O patterns are somewhere in between
•  Keep access aligned when possible

Page 48

Typical I/O Access

Serial, "rank 0", "head node" model
•  Bottleneck, no parallel paths
•  [Diagram: nodes 0–2 funnel their data through node 0 into a single file]

N-to-N: unique file-per-process (FPP)
•  Restart needs the same number of nodes
•  Many files to handle for post-processing
•  [Diagram: nodes 0–2 each write their own file, files 0–2]

N-to-1: a single shared file (SSF)
•  All data in a single file for restart purposes
•  Easier for post-processing
•  [Diagram: nodes 0–2 all write into one file]

Page 49

N-1 Access Patterns

Contiguous
•  All data from a task is contiguous in the file

Discontiguous
•  Strided
•  Random

[Diagram: clients A, B, and C write blocks a, b, c into a shared file — contiguous (aaaa bbbb cccc), strided (abc abc abc ...), and random orderings.]

Page 50

IO Patterns

Lustre loves 1MB I/Os
•  But not if they aren't aligned on stripe boundaries; if your app has a natural stride, use that
•  e.g., lfs setstripe --stripe-size 256k

More stripes = more parallel bandwidth
•  Works when clients >= stripe count
•  Applicable to N-1 (SSF), not N-N (FPP)

Sequential vs. random
•  It's all about seeks

Page 51

IO Patterns

Writes are generally faster than reads
•  It's all about seeks

Well-aligned I/O is always best
•  Otherwise you may be performing a read-modify-write

N-N is ideal; N-1 is more challenging
•  Avoid overlapping access to the same file region
•  The Lustre Distributed Lock Manager will enforce coherency (and serialize)

Page 52

I/O Options

Buffered I/O for smaller I/Os
•  If the application makes small (e.g., 32KB) I/O calls, use buffering (the default)
•  Has some userspace/kernel-space overhead
•  Reads may be served from the client cache

Use O_DIRECT when appropriate
•  O_DIRECT is useful for large I/O
•  It bypasses the client cache, so use it where you are not interested in re-reading the file
•  It does not allow for readahead (prefetching)
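A hedged way to see the buffered-vs-direct distinction from the shell is dd's oflag=direct (paths and sizes illustrative):

# Buffered 1MB writes: absorbed by the client cache, flushed later
dd if=/dev/zero of=/mnt/cstor/buffered.dat bs=1M count=1024
# Direct 1MB writes: bypass the client cache, no readahead on later reads
dd if=/dev/zero of=/mnt/cstor/direct.dat bs=1M count=1024 oflag=direct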

Page 53

Directory Layout

•  Deep directories have little contention, but do require serial lookups
•  Directory locks allow a single client to work unilaterally (dir-per-client)
•  Racing creates in a single directory must be serialized

Page 54

RULES OF THUMB

Page 55

Rule of Thumb #1

Bigger is better
§  Large files, large transfers, and large numbers of clients are good

Aggregate writes – use large transfers (at least 64KB) when possible, or an I/O library to describe smaller or noncontiguous transfers
§  Contiguous data patterns utilize prefetching and write-behind far better than noncontiguous (or random) patterns
§  Collective I/O can aggregate for you, transforming disjoint accesses into contiguous access

But bigger isn't always necessary
§  I/O transfer size isn't critical for streaming workloads, because of readahead and write batching
§  64KB is often just as good as 4MB, and sometimes even better

N-to-N streaming workloads can approach disk bandwidths.
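A minimal sketch of testing that claim with IOR (the benchmark used elsewhere in this deck) by sweeping the transfer size in a file-per-process streaming run; path and sizes are illustrative:

# Write and read 4GB per process at 64KB, 1MB, and 4MB transfer sizes
for t in 64K 1M 4M; do
    IOR -a POSIX -F -w -r -b 4G -t $t -o /mnt/cstor/ior.test
done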

Page 56

Rule of Thumb #2

Sharing is important
§  Parallel file systems work hard to make sharing "perfect," so that clients anywhere in the network have an up-to-date view of files and their data (unlike NFS)

Sharing has a cost
§  Sharing between clients requires coordination by the file system, and therefore has a cost
§  N-to-1 sharing affects performance because the metadata manager has to mediate access

For N-1, use MPI-IO
§  Collective I/O and datatypes
§  MPI-IO hints when possible

Avoid overlapped write regions – it is best to use block-aligned data.

Page 57

Rule of Thumb #3

Design application I/O to describe as much as possible to the file system:
§  Open the file with the appropriate mode
§  Use collective calls when available
§  Describe data movement with the fewest possible operations

Match the file organization to the process partitioning if possible:
§  Order dimensions so that relatively large blocks are contiguous with respect to the data decomposition

Random I/O is limited by disk seek performance and read-modify-write. The I/O pattern affects performance.

Page 58

Rule of Thumb #4

Parallel file systems are optimized for moving data, not for metadata.
§  Opening and closing files incurs overhead; keep file creates, opens, and closes to a minimum – open/close once if possible
§  Use your own subdirectory to avoid contention with others
§  Create multiple directories, and distribute them across DNE if available
§  Avoid filling directories to their maximum
§  Don't use 'ls -l' or 'du' on millions of files (or to check file-size progress)
§  Similarly, turn off the color alias (ls --color); typically 'ls' is aliased to 'ls --color=tty', which actually does a stat
§  'ls -l' calls readdir() followed by a stat(2) call for every name returned by readdir()
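A hedged sketch of metadata-friendlier alternatives on a client (paths illustrative):

# Plain listing, bypassing the --color alias (no per-file stat)
/bin/ls /mnt/cstor/mydir
# Count files without running stat on each one
lfs find /mnt/cstor/mydir -t f | wc -l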

Page 59

Rule of Thumb #5

•  Increase the ability to scale by avoiding bottlenecks
•  Parallel I/O from the compute nodes eliminates the head-node bottleneck
•  Increase streams to parallel storage by using more than one client
•  N-N access is generally well suited to parallel file systems
•  N-1 access may need additional tuning
•  Collective I/O may be appropriate for N-1
•  Use tracing to understand application I/O patterns
•  Keep access aligned when possible

Page 60

PERFORMANCE

Page 61

Streaming Bandwidth Performance

•  L300/4TB Tardis
•  CS9000/Makara
•  IOR-2.10.3
•  16 clients
•  Stonewalling (180 sec)
•  Average of 5 iterations

Direct I/O (GB/s):
          CS9000           L300
          Write   Read     Write   Read
          9.6     10.1     12.3    13.6

Buffered I/O (GB/s):
          CS9000           L300
          Write   Read     Write   Read
          9.8     9.4      13.3    12.3

Page 62

Streaming Bandwidth Performance

•  L300N/4TB Tardis
•  IOR-2.10.3
•  16 clients
•  Stonewalling (180 sec)
•  Average of 5 iterations

Direct I/O (GB/s):
          CS9000 Makara    L300N Tardis
          Write   Read     Write   Read
          9.6     10.1     16.1    16.0

Page 63

Drive Write Performance -- iostat

Page 64

Advanced MMU Prototype Performance

•  Single MDS/MDT
•  2x SSU+1 (32 OSTs)
•  RAID10 (5+5)
•  mdtest-1.9.3
•  Reporting files/s

MDS           Create   Stat   Open   Remove
CS9000 CMU    75K      220K   90K    65K
L300 MMU      55K      220K   95K    70K
L300 MMU-A    105K     350K   220K   80K

Page 65

Large-Scale Lustre Scaling Testing

Test conducted at a customer site: 488 client nodes, FDR fabric.

1. Scaling SSU+1 count
   •  36x L300 SSU+1, FDR (fabric bottlenecked for writes)

2. Scaling metadata servers (DNE)
   •  7x MDS/MDT: base MMU active/passive + 3x MMU active/active

Page 66

Lustre Read Performance Scaling

•  Test of a single SSU+1
•  14 compute nodes
•  12.97 GB/s read
•  Projected linear scaling

IOR -v -B -F -a POSIX -r -k -m -E -b 16G -t 64M -C -e -m -D 120 -i 3 -o $OUT

[Chart: IOR read performance (GB/s) vs. SSU+1 count, 4–36, projected linearly.]

Page 67

Lustre Read Performance Scaling

•  Scaling 4–36 SSU+1s
•  14 nodes per SSU+1
•  Projection based on the single-SSU+1 result

IOR -v -B -F -a POSIX -r -k -m -E -b 16G -t 64M -C -e -m -D 120 -i 3 -o $OUT

[Chart: projected IOR read performance (GB/s) vs. SSU+1 count, 4–36, linear.]

Page 68

Lustre Write Scaling

•  Scaling 4–32 SSUs
•  12 compute nodes per SSU
•  Projection based on 1 SSU

IOR -vv -k -B -F -a POSIX -w -b 256G -t 64M -C -e -m -D 120 -i 3 -o $OUT

[Chart: projected IOR write performance (GB/s) vs. SSU count, 4–32, linear.]

Page 69

Lustre Metadata Scaling

•  Scaling 1–7 DNEs
•  64 nodes per DNE (1M files per DNE)

mdtest -i 3 -n $(( $DNES * 1048576 / $PROCS )) -F -u -C -E -T -r -v -i 3 -d $DIR[@$DIR2…]

[Charts: file creates/s (up to ~350,000), file stats/s (up to ~1,200,000), and file removes/s (up to ~350,000) vs. number of DNEs, 1–7.]

Page 70

Single-Stream Performance

•  Latest Seagate Lustre client, based on IEEL
•  Improved performance
•  Best of Seagate and IEEL

IOR -t 1M -b 131072M -F -C -v -v -E -k -m -i 3 -w -r -o $OUTFILE

OS          Lustre Client           Write MB/s   Read MB/s
CentOS 6.5  Seagate 2.5.1           795.37       889.32
CentOS 6.5  OpenSFS 2.7.0           946.12       968.24
CentOS 6.5  Seagate 2.7.13 (IEEL)   1,048.41     1,475.64
CentOS 7.2  Seagate 2.5.1           991.38       1,274.77
CentOS 7.2  OpenSFS 2.7.0           1,032.32     1,209.10
CentOS 7.2  Seagate 2.7.13 (IEEL)   1,265.84     1,804.68

Page 71

NytroXD Intelligent I/O Manager

Workloads are becoming increasingly unpredictable for the storage system:
•  Streaming I/O
•  Random I/O
•  Unaligned I/O

Solution:
•  Use the proper technology for the workload
•  Remain cost-conscious

Page 72

NytroXD Intelligent I/O Manager

Nytro Accelerator (NXD) is a new feature combining new software functionality:
•  It leverages the existing L300N/G300N dual SSDs to provide automatic, selective acceleration of specific I/O accesses characterized as "small"
•  The caching software, referred to as "NytroXD," acts as a filter driver that caches small blocks of data in an SSD cache to improve performance

The definition of a "small block" in this context is configurable:
•  The I/O bypass-size parameter, which determines whether an I/O is accelerated, is settable by the user

Page 73

Basic NytroXD Data Path

[Diagram: a client issues 4KB, 32KB, 64KB, and 1024KB requests to OSS0 and OSS1, each backed by a 41-HDD GridRAID OST and a 2-SSD NXD RAID10 cache; bypass = 32KB.]

•  I/O requests coming from the client are directed at the OSS
•  Requests <= the bypass size are written to the SSDs
•  Requests > the bypass size are written to the HDDs
•  The default bypass is 32KB, but it can be set as high as 1MB
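The configured bypass size appears in the perfmon output (covered on a later slide); a one-line check on an OSS, using only the nytrocli command shown in this deck:

/opt/seagate/nytrocli/nytrocli64 /xd show perfmon | grep "Bypass IO Size"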

Page 74

Basic NytroXD Data Path

[Diagram: as before, with the 4KB and 32KB blocks being destaged from the 2-SSD NXD RAID10 cache to the 41-HDD GridRAID OSTs.]

•  As the SSD array fills, data is destaged to the HDD arrays
•  Data that arrives on the SSD remains available for reading unless it is evicted from the cache
•  The current caching policy favors accelerated writes

Page 75

Seagate NytroXD, L300N – Write/Rewrite, Aligned I/O

Small, aligned, random I/O – IOR, 120-second stonewall, buffered I/O.
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200RPM HDDs).

[Charts: IOPS vs. transfer size (128KB down to 1KB) for aligned, random write and rewrite, comparing NXD against no NXD.]

Page 76

Seagate NytroXD, L300N – Write/Rewrite, Unaligned I/O

Small, unaligned, random I/O – IOR, 120-second stonewall, buffered I/O.
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200RPM HDDs).

[Charts: IOPS vs. transfer size (160,000 down to 5,000 bytes) for unaligned, random write and rewrite, comparing NXD against no NXD.]

Page 77

Seagate NytroXD, L300N – Read, Aligned/Unaligned I/O

Small, aligned/unaligned, random I/O – IOR, 120-second stonewall, buffered I/O.
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200RPM HDDs).

[Charts: IOPS vs. transfer size for aligned, random reads (128KB–1KB) and unaligned, random reads (160,000–5,000 bytes), comparing NXD against no NXD.]

Page 78

Recommended Usage and Caveats for NytroXD

•  NytroXD is primarily designed for small writes under the bypass threshold
•  Random write/rewrite I/O (and reads, if the data is in cache) is the recommended use
•  For random rewrites of data on the HDDs, 4KB-page-aligned data goes to the SSD cache
•  For random rewrites of data on the HDDs, unaligned data may incur a read-modify-write from the HDDs to the SSDs

Page 79

Recommended Usage and Caveats for NytroXD

•  Depending on the size of the SSDs and the workload, data may not necessarily be available for reading from the SSD cache
•  Sequential buffered I/O is aggregated on the client, typically into requests larger than the bypass setting
•  Sequential direct I/O with requests smaller than or equal to the bypass goes to the SSD cache
•  As SSD utilization passes 40% and then 60%, the destaging of data (copying from SSDs to HDDs) accelerates
•  When SSD utilization reaches 80%, read eviction increases to accommodate additional incoming writes

Page 80

Monitoring NytroXD Usage – perfmon

/opt/seagate/nytrocli/nytrocli64 /xd show perfmon
<snip>
purple04: Name of the Cache Group = nxd_cache_0
purple04: Number of VDs = 1
purple04: Number of Cache Devices = 1
purple04: Queue Depth = 4096
purple04: Total Cache Size = 371.054 GiB
purple04: Cache Size in use = 26.692 GiB
purple04: Cache Block Size = 4 KiB
purple04: Cache Window Size = 64 KiB
purple04: Bypass IO Size = 1.0 MiB
purple04: Total number of I/Os = 1076225215
purple04: Number of reads = 516326249
purple04: Number of writes = 559898966
purple04: Total number of bypass I/Os = 200012374
purple04: Number of bypass reads = 52053543
purple04: Number of bypass writes = 147958831
purple04: Cache Hits = 660474280
purple04: Cache Misses = 18446744073
purple04: Number of dirty CWs = 435362
purple04: Total Cache Blocks flushed = 3812514848

Page 81

Histogram Feature in NytroXD

•  To use the histogram of I/O sizes being written/read on the OSS, first manually clear the cache and check the cache group name (e.g., nxd_cache_0):

nytrocli64 /xd show perfmon | grep "Cache Group"
nytrocli64 /xd get cachestate cg=nxd_cache_0
nytrocli64 /xd set cachestate=disable cg=nxd_cache_0
nytrocli64 /xd get cachestate cg=nxd_cache_0
nytrocli64 /xd set histparam=enable

<RUN WORKLOAD>

Page 82

Histogram Feature in NytroXD

•  To collect the results:

nytrocli64 /xd show histogram

•  Once completed, disable the histogram and re-enable the cache:

nytrocli64 /xd set histparam=disable
nytrocli64 /xd set cachestate=enable cg=nxd_cache_0
nytrocli64 /xd get cachestate cg=nxd_cache_0

Page 83

Histogram Output

The histogram shows both writes and reads. This is a test of sequential 128KB (direct I/O) writes and random buffered 4KB writes; only the writes are shown here:

Num Writes < 4K          = 0
Num Writes 4K            = 9943751
Num Writes 4K+1 - 8K     = 12902
Num Writes 8K+1 - 16K    = 76
Num Writes 16K+1 - 32K   = 0
Num Writes 32K+1 - 64K   = 0
Num Writes 64K+1 - 128K  = 1034128
Num Writes 128K+1 - 256K = 0
Num Writes 256K+1 - 512K = 0
Num Writes 512K+1 - 1M   = 0
Num Writes 1M+1 - 2M     = 0
Num Writes 2M+1 - 4M     = 0
Num Writes 4M+1 - 8M     = 0
Num Writes 8M+1 - 16M    = 0
Num Writes 16M+1 - 32M   = 0

Page 84

I/O INTERFACES: POSIX

Page 85

Application I/O Interfaces

Applications require more software than just a parallel file system: an interface between the application and the file system is needed.

We'll be looking at these for application access:
§  POSIX
§  MPI-IO

Both can provide a layer for higher-level I/O libraries.

[Diagram: application → higher-level I/O libraries → POSIX / MPI-IO → parallel file system → I/O hardware.]

Page 86

POSIX I/O

POSIX is the IEEE Portable Operating System Interface for Computing Environments. It defines a standard way for an application program to obtain basic services from the OS.

POSIX I/O is the mechanism that almost all serial applications use to perform I/O. It was created when a single computer owned its own file system, so it has no ability to describe parallel I/O constructs.

open(), seek(), write(), read(), close()
•  The OS maps these calls directly into file system operations

POSIX provides a useful, ubiquitous interface for basic I/O.

Page 87

File Writing

Lustre gets high bandwidth at sync time by coalescing contiguous dirty buffers into a single large transfer to the OSTs.

The maximum amount of dirty data in the kernel is 1024MB.

Linux uses "flush on close" semantics:
•  Keep this in mind if your application does multiple open/write/close cycles on the same file

Page 88

File Writing

Write syscall size
§  Small write syscalls have a large overhead in the kernel
§  Large write syscalls (> 1MB) may not offer a gain for buffered I/O (though they will for direct I/O)

Example test run (1 client, buffered I/O):
§  4KB writes: 400 MB/sec
§  1MB writes: 1 GB/sec
§  16MB writes: 1 GB/sec

Write system call alignment
§  It is very beneficial to align writes to page-size boundaries
§  This prevents a read-modify-write cycle all the way down through the system
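A hedged shell illustration of the alignment point: when overwriting existing data, 4095-byte writes straddle page boundaries while 4096-byte writes fill pages exactly (paths and counts illustrative, and the visible effect will vary with caching):

# Overwrite in place: each 4095-byte write crosses a page boundary (read-modify-write risk)
dd if=/dev/zero of=/mnt/cstor/data.dat bs=4095 count=100000 conv=notrunc
# Page-aligned: each 4096-byte write fills exactly one page
dd if=/dev/zero of=/mnt/cstor/data.dat bs=4096 count=100000 conv=notrunc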

Page 89

File Reading

Read syscall size
§  Small read syscalls (< 1 page) trigger a read of the entire page to populate the cache
§  Large read syscalls (> 1MB) may not offer a gain for buffered I/O

Example test run (1 client, buffered I/O):
§  4KB reads: 500 MB/sec
§  1MB reads: 1 GB/sec
§  16MB reads: 1 GB/sec

Read system call alignment
§  Does not affect performance the way write alignment does, since whole blocks are read from the OSTs
§  Sequential reads trigger background readaheads

Page 90

Readahead

•  Optimizes for the common case: most file accesses are totally sequential reads
•  Observes the I/O pattern and, if it is sequential or strided, launches large asynchronous reads in the background ahead of time
•  Only effective with forward sequential/strided reads
•  The Lustre client's readahead also benefits from the object storage server doing its own readahead
•  Readahead is turned off by non-sequential read patterns in the application
•  Only available with buffered I/O, not direct I/O

Page 91

I/O INTERFACES: MPI-IO

Page 92

MPI-IO

MPI-IO is the I/O interface specification for MPI applications. Its data model is the same as POSIX:
•  A stream of bytes in a file – not self-describing, and containing no additional metadata
•  But it allows more complex access patterns than POSIX

Features:
•  Noncontiguous I/O with MPI datatypes and filetypes
•  Collective I/O operations – allow optimizations; a complex data pattern is described in a single operation, passing more information to the I/O layers
•  Nonblocking I/O access – allows overlap of I/O and computation, since work can progress before the I/O call returns
•  MPI hints for specific tunings

Page 93

Independent and Collective I/O

Independent I/O operations specify only what a single process will do.

Many applications have phases of computation and I/O:
§  During I/O phases, all processes read/write data

Collective I/O is coordinated access to storage by a group of processes:
§  The collective I/O function is called by all processes participating in the I/O
§  Allows the I/O layers to know more about the access as a whole, giving more opportunities for optimization in the lower software layers
§  Without collective I/O there is no understanding of what other processes are doing, so their accesses cannot be coordinated efficiently

[Diagram: processes P0–P4 issuing uncoordinated independent I/O vs. coordinated collective I/O.]

Page 94

Collective I/O and Two-Phase I/O

Problems with independent, noncontiguous access:
§  Lots of small accesses

Idea: reorganize the access to match the layout on disk:
§  A single process uses data sieving to get data for many
§  Often reduces total I/O through sharing of common blocks

For reads, a second "phase" redistributes the data to its final destinations; two-phase writes operate in reverse (redistribute, then I/O).

[Diagram: initial state; Phase I: I/O by processes P0–P2; Phase II: redistribution.]

Page 95

Simple MPI I/O Examples

An MPI-IO version of "Hello World":
•  The first program writes a file with test data in it
•  The second program reads the file back and prints the contents
•  Shows basic API use and error checking

Page 96

#include <mpi.h>

#include <mpio.h> /* may be necessary on some systems */

int main(int argc, char **argv)

{

int ret, count, rank;

char buf[13] = "Hello World\n";

MPI_File fh;

MPI_Status status;

MPI_Init(&argc, &argv);

ret = MPI_File_open(MPI_COMM_WORLD, "myfile",

MPI_MODE_WRONLY | MPI_MODE_CREATE,

MPI_INFO_NULL, &fh);

if (ret != MPI_SUCCESS) return 1;

/* continues on next slide */

Simple MPI-IO: Writing (1)

Page 97

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* with 13 tasks, each writes a single byte of the string; seek each
   rank to its own offset first (otherwise every rank writes at offset 0) */
MPI_File_seek(fh, rank, MPI_SEEK_SET);
ret = MPI_File_write(fh, &buf[rank], 1, MPI_CHAR, &status);

MPI_File_close(&fh);

MPI_Finalize();

return 0;

}

Simple MPI-IO: Writing (2)

Page 98

#include <mpi.h>

#include <mpio.h>

#include <stdio.h>

int main(int argc, char **argv)

{

int ret, count;

char buf[13];

MPI_File fh;

MPI_Status status;

MPI_Init(&argc, &argv);

ret = MPI_File_open(MPI_COMM_WORLD, "myfile",

MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

if (ret != MPI_SUCCESS) return 1;

/* continues on next slide */

Simple MPI-IO: Reading (1)

Page 99

ret = MPI_File_read(fh, buf, 13, MPI_CHAR, &status);

if (ret != MPI_SUCCESS) return 1;

MPI_Get_count(&status, MPI_CHAR, &count);

if (count != 13) return 1;

printf("%s", buf);

MPI_File_close(&fh);

MPI_Finalize();

return 0;

}

Simple MPI-IO: Reading (2)

Page 100

$ mpicc mpiio-hello-write.c -o mpiio-hello-write
$ mpicc mpiio-hello-read.c -o mpiio-hello-read
$ mpiexec -n 13 mpiio-hello-write
$ mpiexec -n 3 mpiio-hello-read
Hello World
Hello World
Hello World
$ ls -l myfile
-rwxr-xr-x 1 bloewe users 13 Jan 18 08:19 myfile
$ cat myfile
Hello World

Compiling and Running

Page 101

Example: Distributed Arrays

An array is to be written to a common file containing the global array in row-major order. The data is contiguous in memory but noncontiguous in the file.

[Diagram: an m-row by n-column 2D array distributed on a 2x3 process grid – P0, P1, P2 over P3, P4, P5.]

Page 102

Setup Parameters

gsizes[0] = 2;    /* no. of rows in global array */
gsizes[1] = 3;    /* no. of columns in global array */

distribs[0] = MPI_DISTRIBUTE_BLOCK;    /* block distribution */
distribs[1] = MPI_DISTRIBUTE_BLOCK;    /* block distribution */

dargs[0] = MPI_DISTRIBUTE_DFLT_DARG;   /* default block size */
dargs[1] = MPI_DISTRIBUTE_DFLT_DARG;   /* default block size */

psizes[0] = 2;    /* no. of processes in vertical dimension of process grid */
psizes[1] = 3;    /* no. of processes in horizontal dimension of process grid */

Page 103

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Type_create_darray(6, rank, 2, gsizes, distribs,

dargs, psizes, MPI_ORDER_C, MPI_FLOAT,

&filetype);

MPI_Type_commit(&filetype);

A filetype is an MPI datatype specifying what portion of the file is visible to the process and what type of data it holds. It may be a basic etype or a derived datatype built from etypes.

Define Filetype

Page 104

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",

MPI_MODE_CREATE | MPI_MODE_WRONLY,

MPI_INFO_NULL, &fh);

MPI_File_set_view(fh, 0, MPI_FLOAT, filetype,

"native", MPI_INFO_NULL);

local_array_size = num_local_rows * num_local_cols;

MPI_File_write_all(fh, local_array, local_array_size,

MPI_FLOAT, &status);

MPI_File_close(&fh);

By combining the noncontiguous datatype with collective access, the MPI library can merge many small requests into a few larger ones.

Open File, Write, Close

Page 105

Distributed Array Performance

Test case (write, interleaved):
§  6 processes
§  4,096 floats per proc
§  98MB file

Collective vs. independent I/O:

                     Collective/Datatype   Independent (multiple I/O calls)
Time to write file   2 seconds             37 seconds

Page 106

MPI-IO Collective Hints

Buffer size
§  cb_buffer_size – controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O

Read/write
§  romio_cb_read / romio_cb_write – control when collective buffering is applied to collective read/write operations

Aggregators
§  cb_nodes and cb_config_list – provide explicit control over the number and placement of aggregators (see the ROMIO User's Guide)

Page 107

Using MPI-IO Hints

Pass hints in an MPI_Info object to MPI_File_open():

MPI_Info mpiHints;
MPI_Info_create(&mpiHints);
MPI_Info_set(mpiHints, key1, value1);
MPI_Info_set(mpiHints, key2, value2);
MPI_File_open(comm, fileName, fd_mode, mpiHints, &fd);
MPI_Info_free(&mpiHints);
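ROMIO can also pick up hints with no code changes through a hints file named by the ROMIO_HINTS environment variable; a minimal sketch (hint values and application name illustrative):

# Write a "key value" hints file and point ROMIO at it before launching
cat > romio_hints <<EOF
romio_cb_write enable
cb_buffer_size 16777216
EOF
export ROMIO_HINTS=$PWD/romio_hints
mpiexec -n 64 ./myapp    # ./myapp is a placeholder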

Page 108

MPI-IO: ROMIO Implementation

Different MPI-IO implementations exist; we're interested in ROMIO (part of MPICH) from Argonne National Laboratory.
§  ROMIO's ADIO layer is an abstract I/O interface optimized for different file systems
§  ROMIO contains patches to support Lustre, allowing hints to be passed through MPI-IO to Lustre for layout
§  Supports local file systems, network file systems, and parallel file systems
§  Includes the data sieving and two-phase optimizations

[Diagram: ROMIO's layered architecture – the MPI-IO interface and common functionality sit above the ADIO interface, with ADIO drivers for UFS, PVFS, NFS, and Lustre.]

Page 109

MPI-IO Final Thoughts

MPI-IO provides a rich interface allowing us to describe:
§  Noncontiguous accesses in memory, in the file, or both
§  Collective I/O

More information on MPI-IO is available in:
§  W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.

Page 110

Questions?