PARALLEL FILE SYSTEMS
Agenda
Parallel File Systems: History, What, Why, So What; Example: Lustre and Spectrum Scale (GPFS)
Best Practices for Parallel I/O
Evolution of Parallel File Systems
[Figure: timeline from the early 1980s to today. 1980s to early 1990s: workstation file sharing (Auspex, Sun/NFS, NetApp) becomes a bottleneck. 1990s: clustered Linux systems. Today: parallel file systems with a metadata server, storage servers, and compute clients, with separate management, metadata, and data paths.]
Benefits of Parallel File Systems
• Increase ability to scale by avoiding bottlenecks (separate metadata and data)
• Take advantage of available bandwidth
• Increase parallel streams to storage using more than one client
• Allow the compute cluster to achieve higher utilization by not waiting on I/O
• Grow clusters to perform larger calculations
[Diagram: metadata server, storage servers, and compute clients with separate management, metadata, and data paths.]
Major Parallel File Systems Today
Lustre: high performance and high scalability; no licensing cost; used by national labs and universities.
Spectrum Scale (aka GPFS, from IBM): a fit when the customer asks for features such as snapshots and migration between disk tiers, home directories and fast scratch in the same file system, or large numbers of files with high metadata throughput plus high-speed storage for a compute cluster.
A Lustre Cluster
[Diagram: Lustre clients (1-100,000) connect over multiple network types (40/100GbE, EDR IB, OPA), optionally through routers, to metadata servers (MDS) with a metadata target (MDT) and to object storage servers (OSS, 1-1,000s) with object storage targets (OSTs) on disk arrays and SAN fabric. CIFS and NFS clients access through a gateway. ClusterStor hardware: HA-MDS and SSUs.]
Summary
• When you need high performance and high scalability -> parallel file systems
• Parallel file systems in production -> ClusterStor
BEST PRACTICES FOR PARALLEL I/O
Agenda
§ Parallel File Systems Background: parallel I/O performance and ClusterStor Lustre; access patterns; metadata usage
§ Performance: L300 - bandwidth, metadata; L300N - NytroXD I/O accelerator
§ POSIX and MPI-IO: POSIX and directly accessing the file system; use of MPI-IO as a middleware I/O library; MPI-IO hints; examples
§ Optimal configuration for KAUST: recommended tuning options for HPC workloads
§ Tools to identify issues: visibility/monitoring on servers and clients; file/directory striping; I/O tracing
§ Best practices for application usage of Lustre: DOs and DON'Ts
LUSTRE PARALLEL I/O PERFORMANCE BACKGROUND
Lustre Components
[Diagram: clients connect to the MDS and to multiple OSSs. Client-MDS traffic covers directory operations, file open/close, metadata, and concurrency; MDS-OSS traffic covers file creation, file status, and recovery; client-OSS traffic covers file I/O and locking.]
Lustre is an open source, distributed parallel file system
§ Object-based design provides extreme scalability
§ Compute clients interact directly with storage servers
§ Comprised of:
§ Clients
§ Metadata Servers and Targets
§ Storage Servers and Targets
Lustre Parallel File System
L300 ClusterStor Management (SMU/MMU): Management and Metadata (MDS/MDT)
CS Manager and MDS/MGS Nodes
§ 2RU integrated controllers – Server 1: CSM management; Server 2: boot; Server 3: MGS/MDS; Server 4: MDS
§ Fault tolerance and serviceability
2U24 JBOD – MDT
§ SAS JBOD for MDS
§ Disk configuration – qty 20 drives for MDT; 2x RAID10 (5+5); qty 2 global hot spares
Scalable Storage Unit (SSU)
§ 5U84 enclosure
§ 2 Object Storage Servers per SSU
§ Two trays of 41 HDDs each for Object Storage Targets
§ 2 SSDs (WIBs, journals, NytroXD)
§ H/A on each SSU
§ InfiniBand EDR, 40GbE, OPA data network connectivity
ClusterStor Hardware and the Lustre File System
§ Object Storage Server: Seagate embedded application server
§ Object Storage Target: Seagate 5U84 storage enclosure
§ Metadata and Management Servers: 2U x 4 servers
§ Metadata Target: Seagate 2U24 JBOD
[Diagram: 1) the client asks the MDS where the file is; 2) the MDS returns the file layout; 3) a single 3,072 KB file 4) is broken into 1,024 KB block stripe segments; 5) stripes 1, 2, and 3 are written to separate OSTs.]
MDS <-> OST Interaction: Use Cases
§ Create object: create a new data object for the file
§ Unlink: delete the data object if the file is unlinked
§ StatFS: get info from the OST for the size of the file and the space on the OST
File on Lustre
§ File attributes (name, permission, owner, security label): handled by the MDT
§ Data blocks and actual size: handled by the OSTs (each data block is an object on an OST)
§ Mapping from name to data objects (aka LOV EA): stored in extended attributes on the MDT
Striping
● Allows file data to be stored evenly on multiple OSTs
● RAID 0 pattern
[Diagram: File A and File B stored as objects on OST1-OST3. Example: a single-stripe file – all file data is on a single OST.]
Striping Terminology
• Object – the stripes belonging to one file on an OST
• Stripe count – the number of objects involved
• Stripe size – the maximum size of one stripe
• Stripe object sizes add up to the total file size
In this example:
• File A stripe count = 2
• File B stripe count = 3
• Typical stripe size is 1 MB
• File B's stripe size is twice that of File A
[Diagram: File A's stripes on OST1 and OST2; File B's stripes on OST1, OST2, and OST3.]
Striping (cont.)
[Diagram: File A and File B striped round-robin across OST1, OST2, and OST3.]
Example: Fully-Striped Files
• These are fully striped files, meaning striped over all OSTs (also called wide striping)
• This achieves maximum bandwidth to one file
• File A has a hole or is sparse; this happens commonly in HPC usage of a file system
● The parallel I/O performance comes from separating the metadata and the data objects
● The OSSs handle the containers for the component objects
● The data (file) can be striped across multiple OSTs for improved parallelism
● Typically this striping is used for very large single files accessed by multiple processes
Overview
CONFIGURATION AND TUNING
RPCs
• lctl set_param osc.*.max_rpcs_in_flight=256 [default: 8]
• lctl set_param osc.*.max_pages_per_rpc=1024 [default: 256]
Write cache
• lctl set_param osc.*.max_dirty_mb=1024 [default: 32]
Client-Side Tunables
Readahead – Best for streaming read workload
• lctl set_param llite.*.max_read_ahead_mb=1024 [default: 40]
• lctl set_param llite.*.max_read_ahead_per_file_mb=1024 [default: 40]
Readahead - is your workload strided?
• May set the value to default (40), or tune to lower value
Readahead Step
• lctl set_param llite.*.read_ahead_step=4 (Seagate client only)
Client-Side Tunables
Wire checksum
• /proc/fs/lustre/osc/*/checksums=0
LRU size is dynamic by default - Controls number of client-side locks in an LRU cached locks queue.
• lctl set_param ldlm.namespaces.*osc*.lru_size=0
LNET verbosity
• sysctl -w lnet.debug=0
Client-Side Tunables
While there are tunables available on the servers, typically the system defaults are optimal. In special cases, Seagate Support and Engineering may recommend specific changes.
Server-Side Tunables
TOOLS
A number of tools are available on the management server for monitoring, e.g., ltop (and the GUI):
$ ltop
Filesystem: cstor
Inodes: 1434.250m total, 0.006m used ( 0%), 1434.244m free
Space: 451.605t total, 160.525t used ( 36%), 291.080t free Bytes/s: 0.000g read, 0.000g write, 0 IOPS
MDops/s: 0 open, 0 close, 0 getattr, 0 setattr
0 link, 0 unlink, 0 mkdir, 0 rmdir 0 statfs, 0 rename, 0 getxattr
>OST S OSS Exp CR rMB/s wMB/s IOPS LOCKS LGR LCR %cpu %mem %spc
0000 F cstor04 17 0 0 0 0 0 0 1 2 48 0
0001 F cstor05 17 0 0 0 0 0 0 0 2 49 0
0002 F cstor06 17 0 0 0 0 96 0 0 33 99 69
0003 F cstor07 17 0 0 0 0 96 0 0 26 99 73
Visibility Tools – Server Performance Monitoring
For client-side tools, checking the RPC write/read stats:
$ lctl get_param osc.*OST0000*.import
rpcs:
   inflight: 0
   unregistering: 0
   timeouts: 0
   avg_waittime: 304635 usec
read_data_averages:
   bytes_per_rpc: 16777216
   usec_per_rpc: 105534
   MB_per_sec: 158.97
write_data_averages:
   bytes_per_rpc: 16692814
   usec_per_rpc: 759389
   MB_per_sec: 21.98
Visibility Tools – Client Performance Monitoring
For client-side tools, checking the RPC stats:
• Pages per RPC – all 1024-page writes, x 4 KB = 4 MB RPCs
$ lctl get_param osc.*OST*.rpc_stats
                        read                     write
pages per rpc    rpcs   %   cum %   |     rpcs   %   cum %
1:                  0   0       0   |        3   0       0
...
256:           587809  64      64   |      868   0       0
512:             4734   0      64   |     1270   0       0
1024:          320913  35     100   |   467052  99     100

                        read                     write
rpcs in flight   rpcs   %   cum %   |     rpcs   %   cum %
0:                  0   0       0   |        0   0       0
1:               8870   0       0   |       49   0       0
...
4:              22367   2       7   |       48   0       0
Visibility Tools – Client Performance Monitoring
For metadata statistics:
$ llstat /proc/fs/lustre/mdc/*MDT*/md_stats
snapshot_time 1488139067.685549
close 66
create 9
getattr 1
intent_lock 402
read_page 23
unlink 285
intent_getattr_async 244
revalidate_lock 567
Visibility Tools – Client Metadata Monitoring
$ llobdstat /proc/fs/lustre/osc/*OST0000* 1
/usr/bin/llobdstat on /proc/fs/lustre/osc/cstor-OST0003-osc-ffff8807f72e7400
Processor counters run at 1200.000 MHz
Read: 1.96468e+12, Write: 1.92445e+12, create/destroy: 0/0, stat: 0, punch: 0
[NOTE: cx: create, dx: destroy, st: statfs, pu: punch ]
Timestamp   Read-delta   ReadRate     Write-delta   WriteRate
--------------------------------------------------------
1428086691  412.00MB     411.85MB/s   0.00MB        0.00MB/s
1428086692  184.00MB     183.93MB/s   0.00MB        0.00MB/s
1428086693  296.00MB     295.87MB/s   0.00MB        0.00MB/s
Visibility Tools – Client Monitoring OSS Statistics
Use Lustre- and stripe-aware tools instead of the standard Unix ones:
$ lfs df
UUID 1K-blocks Used Available Use% Mounted on
cstor-MDT0000_UUID 2255452564 4317344 2221057108 0% /mnt/cstor[MDT:0]
cstor-OST0000_UUID 121226819924 119016 120014045804 0% /mnt/cstor[OST:0]
cstor-OST0001_UUID 121226819924 119016 120014045804 0% /mnt/cstor[OST:1]
cstor-OST0002_UUID 121226819924 84273533756 35738853464 70% /mnt/cstor[OST:2]
cstor-OST0003_UUID 121226819924 89476742140 30536591256 75% /mnt/cstor[OST:3]
$ lfs find old/ -t f -print0 | xargs -0 rm
$ lfs help
User Tools – lfs
Setting file/directory striping
$ mkdir dir_count_wide_A
$ lfs setstripe -c -1 -S 1048576 dir_count_wide_A
$ lfs getstripe dir_count_wide_A
count=3
stripe_count: 3 stripe_size: 1048576 stripe_offset: -1
User Tools – lfs
[Diagram: directory dir_count_wide_A striped wide across OST1-OST3.]
Setting file/directory striping
$ mkdir dir_count_wide_B
$ lfs setstripe -c 2 -S 2097152 dir_count_wide_B
$ lfs getstripe dir_count_wide_B
count=2
stripe_count: 2 stripe_size: 2097152 stripe_offset: -1
User Tools – lfs
[Diagram: directory dir_count_wide_B striped across two OSTs with a 2 MB stripe size.]
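The lfs commands above set striping from the shell. As a hedged illustration (not from the slides), an application can also request a layout at create time through liblustreapi, assuming the library and headers are installed (link with -llustreapi); the path and values below are made up for the example:

#include <stdio.h>
#include <lustre/lustreapi.h>

int main(void)
{
    /* Create a file with a 2 MB stripe size, a stripe count of 2, and let
     * Lustre pick the starting OST (-1), mirroring the lfs example above. */
    int rc = llapi_file_create("/mnt/cstor/dir_count_wide_B/data.bin",
                               2097152, /* stripe size in bytes */
                               -1,      /* stripe offset: any OST */
                               2,       /* stripe count */
                               0);      /* default (RAID0) pattern */
    if (rc != 0)
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
    return rc ? 1 : 0;
}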
TRACING
• Unexpected I/O patterns are common
  – Dusty decks or data access libraries may hide I/O
  – Many applications running concurrently and competing for resources
  – The original developer and intent may not be available
• Examples seen
  – In a loop: random seek forward or back approx. 1 GB, then read several bytes, taking days to process a 20 GB input; 16 bytes per I/O at 100 ops/sec is 14 days to read 20 GB
  – 1372 bytes written at strides of 175616 bytes, by 128 nodes concurrently
  – Uncoordinated concurrent writes to the end of a shared log file, over NFS
  – Record updates by re-writing the entire file
  – Open, write < 100 bytes, close, repeat
Tracing I/O Patterns
• strace is a great tool to easily obtain simple I/O profiles
• Gives us a good idea of the behavior of the application
• Try to determine where most of I/O time is spent
• Can be used to collect I/O system calls
  – e.g., all read, write, open, ... system calls
  – the Unix tool strace captures all of this information
• Can help with analyzing the behavior of the app
  – what files the app touched, how much data was moved, reads vs. writes, timing, etc.
Tracing
strace traces the execution of system calls:
strace -f -tt -T -e trace=open,close,read,write,lseek -o $OUT $EXE
-f traces child processes created by fork()
-tt tells strace to record time in microseconds
-T shows the time spent in each system call
-e trace= traces only these system calls
-o $OUT tells strace where to put the trace (don't put the trace output in the test FS)
$EXE is the command you would actually use to run the program
Instrumenting the application for strace is not always possible, so attach to a running process:
-p $PID attaches to an existing process ID
Using strace
fd = open(argv[1], O_RDWR | O_CREAT);
while ((rc = read(fd, buffer, 1024)) > 0)
    printf("%d\n", rc);
$ strace -f -tt -T -o /tmp/strace.out ./testProgram -arg1 -arg2
Dumps strace output to file /tmp/strace.out
time            system call and other info
-------------------------------------------------------
09:59:33.790079 open("/fs/home/test", O_RDWR|O_CREAT, 03777761102063720) = 3 <0.000026>
09:59:33.790154 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.000025>
09:59:33.790291 write(1, "1024\n", 5) = 5 <0.000013>
09:59:33.790339 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.000018>
09:59:33.790403 write(1, "1024\n", 5) = 5 <0.000007>
09:59:33.790442 read(3, "", 1024) = 0 <0.000048>
Simple Program Example
Collecting straces from MPI codes
Wrapper Script:
#!/bin/sh
# Commandline:
# mpirun -np 4 ./mpi-strace ./a.out <args>
unset STRACE_SUFFIX
STRACE_SUFFIX=`hostname`.$PMI_RANK # or, $$
OPTS="-f -F -tt -T -e trace=open,close,read,write,lseek"
exec strace $OPTS "$@" 2> strace.${STRACE_SUFFIX}
Tracing MPI Codes
Creates an strace file for each MPI process:
% mpirun -np 4 ./mpi-strace.sh ./IOR -o /fs/home/IOR.test -F
% ls strace.*
strace.sjSC-201.0
strace.sjSC-202.2
strace.sjSC-203.1
strace.sjSC-204.3
Each strace output file contains the trace from an individual rank.
Tracing MPI Codes
Darshan
• Excellent for profiling I/O
• No source code changes
• Requires recompiling the MPI application with Darshan
Tau
• Application profiling
Other Profiling Tools
BEST PRACTICES FOR APPLICATION I/O
• Increase ability for scaling by avoiding bottlenecks
• Take advantage of available bandwidth
• Increase streams to parallel storage using more than one client
• Where applicable, take advantage of client-side cache
• Use large transfers when possible, or an I/O library to describe smaller or noncontiguous transfers
Objectives
• Partition the problem domain
• Processes work on own subdomains independently
• Optionally, write intermediate results
– For restart after failure or other reasons
• Optionally, reiterate on writes
• Upon completion, assemble and integrate parts
• Write final result
– May be same format as intermediate results
General Application Behavior
• The I/O pattern affects performance.
• N-to-N streaming workloads can approach disk bandwidths.
• Random I/O is limited by disk seek performance.
• N-to-1 sharing affects performance because of locking contention.
• Strided I/O patterns are somewhere in between.
• Keep access aligned when possible.
General Design Considerations
Serial, "Rank 0", "Head node" model
– Bottleneck, no parallel paths
N-to-N: unique file-per-process (FPP)
– Restart needs the same number of nodes
– Many files to handle for post-processing
N-to-1: a single shared file (SSF)
– All data in a single file for restart purposes
– Easier for post-processing
Typical I/O Access
[Diagram: serial model – all nodes funnel through Node 0 to one file; N-to-N – each node writes its own file; N-to-1 – all nodes write to a single shared file.]
Contiguous
– All data from a task is contiguous in the file
Discontiguous
– Strided
– Random
N-1 Access Patterns
[Diagram: three clients (A, B, C) writing into one shared file – contiguous blocks per client, strided interleaving, and random placement.]
Lustre loves 1 MB I/Os
• But not if they aren't aligned on stripe boundaries; if your app has a natural stride, use that
• e.g., lfs setstripe -S 256k <dir>
More stripes = more parallel bandwidth
• clients >= stripe count
• Applicable to N-1 (SSF), not N-N (FPP)
Sequential vs. Random
• It's all about seeks
IO Patterns
Seagate Confidential 51
Writes are generally faster than reads • It’s all about seeks
Well-aligned I/O is always best • Else may be performing a read-modify-write
N-N is ideal, N-1 is more challenging • Avoid overlapping access to same file region • Lustre Distributed Lock Manager will enforce coherency (and serialize)
IO Patterns
Seagate Confidential 52
Buffered I/O for smaller I/Os • If application has 32KB I/O calls, e.g., use buffering (default) • Has some userspace/kernelspace overhead • May be cached for reads on client
Use O_DIRECT when appropriate • O_DIRECT is useful for large IO • Does not buffer client cache, so where you are not interested in reading the file • Does not allow for readahead (prefetching)
I/O Options
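A hedged sketch of the O_DIRECT case above (not from the slides): on Linux, O_DIRECT requires the buffer, offset, and transfer size to be suitably aligned, so the buffer is allocated with posix_memalign. The file name and sizes are illustrative.

#define _GNU_SOURCE           /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t xfer = 4 * 1024 * 1024;   /* 4 MB transfers, a multiple of 4 KB */
    void *buf;

    if (argc < 2)
        return 1;
    /* O_DIRECT needs an aligned buffer; 4 KB covers typical requirements */
    if (posix_memalign(&buf, 4096, xfer) != 0)
        return 1;
    memset(buf, 'x', xfer);

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* Large, aligned writes go straight to storage: no client caching,
     * and no readahead benefit on later reads of this data */
    for (int i = 0; i < 16; i++)
        if (write(fd, buf, xfer) != (ssize_t)xfer)
            break;

    close(fd);
    free(buf);
    return 0;
}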
Seagate Confidential 53
Deep directories have little contention, but do require serial lookups
Directory locks allow a single client to work unilaterally (dir-per-client)
Racing creates in a single dir must be serialized
Directory Layout
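A hedged sketch of the dir-per-client idea above (not from the slides): each MPI rank creates and uses its own subdirectory, so file creates are not racing in one shared directory. Directory and file names are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank;
    char dir[64], path[128];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One subdirectory per rank: creates inside it need no cross-client
     * coordination on a shared directory */
    mkdir("output", 0755);                       /* harmless if it already exists */
    snprintf(dir, sizeof(dir), "output/rank_%06d", rank);
    mkdir(dir, 0755);

    snprintf(path, sizeof(path), "%s/data.bin", dir);
    FILE *fp = fopen(path, "w");
    if (fp != NULL) {
        fprintf(fp, "rank %d\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}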
RULES OF THUMB
Bigger is better
§ Large files, large transfers, and large numbers of clients are good
Aggregate writes – use large transfers (at least 64 KB) when possible, or an I/O library to describe smaller or noncontiguous transfers
§ Contiguous data patterns utilize prefetching and write-behind far better than noncontiguous (or random) patterns
§ Collective I/O can aggregate for you, transforming disjoint accesses into contiguous access
But bigger isn't always necessary
§ I/O transfer size isn't critical for streaming workloads because of read-ahead and write batching
§ 64 KB is often just as good as 4 MB, and sometimes even better
N-to-N streaming workloads can approach disk bandwidths.
Rule of Thumb #1
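A hedged sketch of the write-aggregation advice above (not from the slides): many small records are packed into a 1 MB buffer and flushed with one large write() per megabyte instead of one small write() per record. Sizes and names are illustrative, and records are assumed to be smaller than the buffer.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define AGG_SIZE (1024 * 1024)          /* flush in 1 MB chunks */

static char agg_buf[AGG_SIZE];
static size_t agg_used = 0;

/* Buffer one record; write the whole buffer once it is full. */
static void buffered_write(int fd, const void *rec, size_t len)
{
    if (agg_used + len > AGG_SIZE) {
        write(fd, agg_buf, agg_used);   /* one large transfer */
        agg_used = 0;
    }
    memcpy(agg_buf + agg_used, rec, len);
    agg_used += len;
}

int main(void)
{
    int fd = open("records.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char record[128] = {0};

    for (int i = 0; i < 100000; i++)    /* 100,000 small records */
        buffered_write(fd, record, sizeof(record));

    if (agg_used > 0)                   /* flush the tail */
        write(fd, agg_buf, agg_used);
    close(fd);
    return 0;
}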
Sharing is important
§ Parallel file systems work hard to make sharing "perfect", so that clients anywhere in the network have an up-to-date view of files and their data (unlike NFS)
Sharing has a cost
§ Sharing between clients requires coordination by the file system, and therefore has a cost
§ N-to-1 sharing affects performance because the metadata manager has to mediate access
For N-1, use MPI-IO
§ Collective I/O and datatypes
§ MPI-IO hints when possible
Avoid overlapped write regions – best to use block-aligned data
Rule of Thumb #2
Design application I/O to describe as much as possible to the file system:
§ Open the file with the appropriate mode
§ Use collective calls when available
§ Describe data movement with the fewest possible operations
Match the file organization to the process partitioning if possible
§ Order dimensions so relatively large blocks are contiguous with respect to the data decomposition
Random I/O is limited by disk seek performance and read-modify-write
The I/O pattern affects performance.
Rule of Thumb #3
Parallel file systems are optimized for moving data, not metadata
§ Opening and closing files incurs overhead; keep file creates, opens, and closes to a minimum – open/close once if possible
Use your own subdirectory to avoid contention with others
Create multiple directories, and distribute across DNE if available
Avoid filling directories to their maximum
Don't use 'ls -l' or 'du' on millions of files (or to check file-size progress)
§ Similarly, turn off the color alias (ls --color)
§ Typically 'ls' is aliased to 'ls --color=tty', which actually does a stat
§ ls -l calls readdir() followed by a stat(2) call for every name returned by readdir()
Rule of Thumb #4
Increase ability for scaling by avoiding bottlenecks
Parallel I/O from compute nodes eliminates the head node bottleneck
Increase streams to parallel storage using more than one client
N-N access is generally tuned for parallel file systems
N-1 access may need additional tuning
Collective I/O may be appropriate for N-1
Use tracing to understand application I/O patterns
Keep access aligned when possible.
Rule of Thumb #5
PERFORMANCE
Streaming Bandwidth Performance
• L300/4TB Tardis
• CS9000/Makara
• IOR-2.10.3
• 16 clients
• Stonewalling (180 sec)
• Average of 5 iterations
Direct I/O
  CS9000: Write 9.6, Read 10.1
  L300:   Write 12.3, Read 13.6
Buffered I/O
  CS9000: Write 9.8, Read 9.4
  L300:   Write 13.3, Read 12.3
Streaming Bandwidth Performance
• L300N/4TB Tardis
• IOR-2.10.3
• 16 clients
• Stonewalling (180 sec)
• Average of 5 iterations
Direct I/O
  CS9000 Makara: Write 9.6, Read 10.1
  L300N Tardis:  Write 16.1, Read 16.0
Drive Write Performance -- iostat
[Chart: per-drive write throughput captured with iostat.]
Advanced MMU Prototype Performance
• Single MDS/MDT
• 2x SSU+1 (32 OSTs)
• RAID10 (5+5)
• mdtest-1.9.3
• Reporting File/s
MDS           Create   Stat    Open    Remove
CS9000 CMU    75K      220K    90K     65K
L300 MMU      55K      220K    95K     70K
L300 MMU-A    105K     350K    220K    80K
Large-Scale Lustre Scaling Testing
Test conducted at a customer site: 488 client nodes, FDR fabric
1. Scaling SSU+1 count
   – 36x L300 SSU+1, FDR (fabric bottlenecked for writes)
2. Scaling metadata servers (DNE)
   – 7x MDS/MDT: base MMU active/passive + 3x MMU active/active
Lustre Read Performance Scaling
• Test single SSU+1
• 14 compute nodes
• 12.97 GB/s read
• Project linear scaling
IOR -v -B -F -a POSIX -r -k -m -E -b 16G -t 64M -C -e -m -D 120 -i 3 -o $OUT
[Chart: IOR read performance (GB/s) vs. SSU+1 count, 4-36.]
Lustre Read Performance Scaling
• Scaling 4 - 36 SSU+1s
• 14 nodes per SSU+1
• Projection based on a single SSU+1
IOR -v -B -F -a POSIX -r -k -m -E -b 16G -t 64M -C -e -m -D 120 -i 3 -o $OUT
[Chart: projected IOR read performance (GB/s) vs. SSU+1 count, 4-36.]
Lustre Write Scaling
• Scaling 4 - 32 SSUs
• 12 compute nodes per SSU
• Projection based on 1 SSU
IOR -vv -k -B -F -a POSIX -w -b 256G -t 64M -C -e -m -D 120 -i 3 -o $OUT
[Chart: projected IOR write performance (GB/s) vs. SSU count, 4-32.]
Lustre Metadata Scaling
Scaling 1 - 7 DNEs, 64 nodes per DNE (1M files per DNE)
mdtest -i 3 -n $(( $DNES * 1048576 / $PROCS )) -F -u -C -E -T -r -v -i 3 -d $DIR[@$DIR2…]
[Charts: file creates/s, file stats/s, and file removes/s vs. number of DNEs, 1-7.]
• Latest Seagate Lustre client, based on IEEL
• Improved performance
• Best of Seagate and IEEL
IOR -t 1M -b 131072M -F -C -v -v -E -k -m -i 3 -w -r -o $OUTFILE

OS          Lustre Client             Write MB/s   Read MB/s
CentOS 6.5  Seagate 2.5.1             795.37       889.32
CentOS 6.5  OpenSFS 2.7.0             946.12       968.24
CentOS 6.5  Seagate 2.7.13 (IEEL)     1,048.41     1,475.64
CentOS 7.2  Seagate 2.5.1             991.38       1,274.77
CentOS 7.2  OpenSFS 2.7.0             1,032.32     1,209.10
CentOS 7.2  Seagate 2.7.13 (IEEL)     1,265.84     1,804.68
Single-Stream Performance
NytroXD Intelligent I/O Manager
Workloads are becoming increasingly unpredictable for the storage system:
• Streaming I/O
• Random I/O
• Unaligned I/O
Solution:
• Use the proper technology for the workload
• Stay cost conscious
Nytro Accelerator (NXD) is a new feature that combines new software functionality
○ Leverages the existing L300N/G300N dual SSDs to provide automatic, selective acceleration of specific I/O accesses characterized as "small"
○ The caching software, referred to as "NytroXD", acts as a filter driver that caches small blocks of data in an SSD cache to improve performance
The definition of a "small block" in this context is configurable
○ The IO bypass size parameter, which determines whether an I/O is accelerated or not, is settable by the user
NytroXD Intelligent I/O Manager
Basic NytroXD Data Path
[Diagram: client I/O of various sizes (4 KB, 32 KB, 64 KB, 1024 KB) arrives at OSS0/OSS1, each with a 41-HDD GridRAID OST and a 2-SSD NXD RAID10 cache; bypass size set to 32 KB.]
• I/O requests coming from the client are directed at the OSS
• Requests <= the bypass size are written to the SSDs
• Requests > the bypass size are written to the HDDs
• The default bypass is 32 KB, but it can be set as high as 1 MB
Basic NytroXD Data Path
[Diagram: the same configuration, with small writes (4 KB, 32 KB) landing in the SSD cache and being destaged to the HDD arrays.]
• As the SSD array fills, the data is destaged to the HDD arrays
• Data that arrives on the SSD is available for reading unless evicted from the cache
• The current caching policy favors accelerated writes
Seagate NytroXD, L300N – Write/Rewrite, Aligned I/O
Small, aligned, random I/O – IOR, 120-second stonewall, buffered I/O
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200 RPM HDDs)
[Charts: IOPS vs. transfer size (128 KB down to 1 KB) for aligned random writes and rewrites, NXD vs. no NXD.]
Seagate NytroXD, L300N – Write/Rewrite, Unaligned I/O
Small, unaligned, random I/O – IOR, 120-second stonewall, buffered I/O
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200 RPM HDDs)
[Charts: IOPS vs. transfer size (160000 down to 5000 bytes) for unaligned random writes and rewrites, NXD vs. no NXD.]
Seagate NytroXD, L300N – Read, Aligned/Unaligned I/O
Small, aligned/unaligned, random I/O – IOR, 120-second stonewall, buffered I/O
16 nodes, 64 processes total, 1x SSU (2x SSDs vs. 82x 7200 RPM HDDs)
[Charts: IOPS vs. transfer size for aligned random reads (128 KB down to 1 KB) and unaligned random reads (160000 down to 5000 bytes), NXD vs. no NXD.]
Recommended Usage and Caveats for NytroXD
• NytroXD is primarily designed for small writes under the bypass threshold.
• Random Write/Rewrite I/O (and Read if in cache) is recommended.
• For random rewrite of data on the HDDs, 4KB-page aligned data will go to SSD cache.
• For random rewrite of data on the HDDs, unaligned data may incur a read-modify-write from the HDDs to the SSDs
Recommended Usage and Caveats for NytroXD
• Depending on the size of the SSDs and the workload, the data may not necessarily be available for reading from the SSD cache.
• Sequential buffered I/O will be aggregated on the client into requests typically larger than the bypass setting.
• Sequential Direct I/O with requests smaller than or equal to the bypass will go to SSD cache.
• As the utilization of the SSD reaches 40% and 60%, the destaging of data (copying from SSDs to HDDs) will accelerate.
• When the utilization reaches 80% of the SSD, read eviction increases to accommodate additional incoming writes.
Monitoring NytroXD Usage - perfmon
/opt/seagate/nytrocli/nytrocli64 /xd show perfmon
<snip>
purple04: Name of the Cache Group = nxd_cache_0
purple04: Number of VDs = 1
purple04: Number of Cache Devices = 1
purple04: Queue Depth = 4096
purple04: Total Cache Size = 371.054 GiB
purple04: Cache Size in use = 26.692 GiB
purple04: Cache Block Size = 4 KiB
purple04: Cache Window Size = 64 KiB
purple04: Bypass IO Size = 1.0 MiB
purple04: Total number of I/Os = 1076225215
purple04: Number of reads = 516326249
purple04: Number of writes = 559898966
purple04: Total number of bypass I/Os = 200012374
purple04: Number of bypass reads = 52053543
purple04: Number of bypass writes = 147958831
purple04: Cache Hits = 660474280
purple04: Cache Misses = 18446744073
purple04: Number of dirty CWs = 435362
purple04: Total Cache Blocks flushed = 3812514848
Histogram Feature in NytroXD
• To use the histogram of I/O sizes being written/read on the OSS, the following may be configured: manually clear the cache and check the cache group name (e.g., nxd_cache_0):
nytrocli64 /xd show perfmon | grep "Cache Group"
nytrocli64 /xd get cachestate cg=nxd_cache_0
nytrocli64 /xd set cachestate=disable cg=nxd_cache_0
nytrocli64 /xd get cachestate cg=nxd_cache_0
nytrocli64 /xd set histparam=enable
<RUN WORKLOAD>
Histogram Feature in NytroXD • To collect the results:
nytrocli64 /xd show histogram
• Once completed, disable the histogram and re-enable the cache:
nytrocli64 /xd set histparam=disable
nytrocli64 /xd set cachestate=enable cg=nxd_cache_0
nytrocli64 /xd get cachestate cg=nxd_cache_0
Histogram Output
The histogram shows both writes and reads.
This is a test of sequential 128 KB (direct I/O) writes and random buffered 4 KB writes (only writes shown here):
Num Writes < 4K          = 0
Num Writes 4K            = 9943751
Num Writes 4K+1 - 8K     = 12902
Num Writes 8K+1 - 16K    = 76
Num Writes 16K+1 - 32K   = 0
Num Writes 32K+1 - 64K   = 0
Num Writes 64K+1 - 128K  = 1034128
Num Writes 128K+1 - 256K = 0
Num Writes 256K+1 - 512K = 0
Num Writes 512K+1 - 1M   = 0
Num Writes 1M+1 - 2M     = 0
Num Writes 2M+1 - 4M     = 0
Num Writes 4M+1 - 8M     = 0
Num Writes 8M+1 - 16M    = 0
Num Writes 16M+1 - 32M   = 0
I/O INTERFACES: POSIX
Applications require more software than just a parallel file system
An interface between the application and file system is needed
We’ll be looking at these for application access: § POSIX § MPI-IO
Both can provide layer for higher-level I/O libraries
Application I/O Interfaces
[Diagram of the stack: the application sits on higher-level I/O libraries, POSIX, and MPI-IO, which in turn sit on the parallel file system and the I/O hardware.]
POSIX is the IEEE Portable Operating System Interface for Computing Environments
Defines a standard way for an application program to obtain basic services from the O/S
POSIX I/O is the mechanism that almost all serial applications use to perform I/O
Created when a single computer owned its own file system
• No ability to describe parallel I/O constructs
open(), seek(), write(), read(), close()
• The O/S maps these calls directly into file system operations
Provides a useful, ubiquitous interface for basic I/O
POSIX I/O
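A hedged, minimal sketch of the calls listed above (not from the slides); the file name is illustrative:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char out[] = "hello posix\n";
    char in[sizeof(out)];

    int fd = open("posix_demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    write(fd, out, strlen(out));      /* write a few bytes */
    lseek(fd, 0, SEEK_SET);           /* seek back to the start */
    read(fd, in, sizeof(in) - 1);     /* read them back */
    close(fd);
    return 0;
}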
Lustre gets high bandwidth at sync time by coalescing contiguous dirty buffers into a single large transfer to the OSTs
Maximum amount of dirty data in the kernel = 1024MB
Linux uses “flush on close” semantics
• Keep in mind if your application does multiple open/write/close cycles on the same file
File Writing
Write syscall size
§ Small write syscalls have a large overhead in the kernel
§ Large write syscalls (> 1 MB) may not offer a gain for buffered I/O (though they will for direct I/O)
Example test run (1 client, buffered I/O)
§ 4 KB writes: 400 MB/sec
§ 1 MB writes: 1 GB/sec
§ 16 MB writes: 1 GB/sec
Write system call alignment
§ It is very beneficial to align writes to page-size boundaries
§ This prevents read-modify-write cycles all the way down through the system
File Writing
Read syscall size
§ Small read syscalls (< 1 page) will trigger a read of the entire page to populate the cache
§ Large read syscalls (> 1 MB) may not offer a gain for buffered I/O
Example test run (1 client, buffered I/O)
§ 4 KB reads: 500 MB/sec
§ 1 MB reads: 1 GB/sec
§ 16 MB reads: 1 GB/sec
Read system call alignment
§ Does not affect performance the way it does for writes, since whole blocks are read from the OSTs
§ Sequential reads will trigger background readahead
File Reading
Optimize for the common case: most file accesses are totally sequential reads
Launch large asynchronous reads in background
• Lustre client’s readahead can get benefits from the Object Storage Server doing readahead
Readahead will be turned off by non-sequential read patterns in the application
Only available in buffered I/O, not direct I/O
Read Ahead
• Observe the I/O pattern and if sequential or strided, launch background reads ahead of time
• Only effective with forward sequential/strided reads
Readahead
I/O INTERFACES: MPI-IO
MPI-IO – the I/O interface specification for MPI applications
The data model is the same as POSIX
• A stream of bytes in a file; not self-describing and carrying no additional metadata
• Allows more complex patterns than POSIX
Features:
• Noncontiguous I/O with MPI datatypes and filetypes
• Collective I/O operations – allows for optimizations
  • Complex data patterns described in a single operation, passing more information
• Nonblocking I/O access
  • Allows overlap of I/O and computation, since work can progress before the I/O call returns
• MPI hints for specific tunings
MPI-IO
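A hedged sketch of the nonblocking feature above (not from the slides): MPI_File_iwrite_at starts a write and returns immediately, computation can proceed, and MPI_Wait completes the request. The file name and sizes are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;                 /* 1M doubles per rank (8 MB) */
    int rank;
    MPI_File fh;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(n * sizeof(double));
    if (buf == NULL)
        return 1;
    for (int i = 0; i < n; i++)
        buf[i] = (double)rank;

    MPI_File_open(MPI_COMM_WORLD, "async.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Start the write at this rank's offset, then keep computing while
     * the I/O progresses in the background */
    MPI_File_iwrite_at(fh, (MPI_Offset)rank * n * sizeof(double),
                       buf, n, MPI_DOUBLE, &req);

    /* ... overlap with computation on other data here ... */

    MPI_Wait(&req, &status);               /* buf may be reused after this */
    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}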
Independent I/O operations specify only what a single process will do
Many applications have phases of computation and I/O
§ During the I/O phases, all processes read/write data
Collective I/O is coordinated access to storage by a group of processes
§ The collective I/O function is called by all processes participating in the I/O
§ Allows the I/O layers to know more about the access as a whole, giving more opportunities for optimization in the lower software layers
§ Without collective I/O there is no understanding of what other processes are doing, so access cannot be coordinated efficiently between them
Independent and Collective I/O
[Diagram: five processes (P0-P4) performing independent I/O vs. the same processes performing collective I/O.]
Problems with independent, noncontiguous access
§ Lots of small accesses
Idea: reorganize access to match the layout on disk
§ Single processes use data sieving to get data for many
§ Often reduces total I/O through sharing of common blocks
For reads, a second "phase" redistributes data to its final destinations
Two-phase writes operate in reverse (redistribute, then I/O)
Collective I/O and Two-Phase I/O
[Diagram: three processes (P0-P2) – initial state, Phase I: I/O, Phase II: redistribution.]
MPI I/O version of “Hello World”
First program writes a file with test in it
Second program reads back the file and prints the contents
Shows basic API use and error checking
Simple MPI I/O Examples
#include <mpi.h>
#include <mpio.h> /* may be necessary on some systems */
int main(int argc, char **argv)
{
int ret, count, rank;
char buf[13] = "Hello World\n";
MPI_File fh;
MPI_Status status;
MPI_Init(&argc, &argv);
ret = MPI_File_open(MPI_COMM_WORLD, "myfile",
MPI_MODE_WRONLY | MPI_MODE_CREATE,
MPI_INFO_NULL, &fh);
if (ret != MPI_SUCCESS) return 1;
/* continues on next slide */
Simple MPI-IO: Writing (1)
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* with 13 tasks, each writes single byte of string */
ret = MPI_File_write_at(fh, rank, &buf[rank], 1, MPI_CHAR, &status);
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
Simple MPI-IO: Writing (2)
#include <mpi.h>
#include <mpio.h>
#include <stdio.h>
int main(int argc, char **argv)
{
int ret, count;
char buf[13];
MPI_File fh;
MPI_Status status;
MPI_Init(&argc, &argv);
ret = MPI_File_open(MPI_COMM_WORLD, "myfile",
MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
if (ret != MPI_SUCCESS) return 1;
/* continues on next slide */
Simple MPI-IO: Reading (1)
ret = MPI_File_read(fh, buf, 13, MPI_CHAR, &status);
if (ret != MPI_SUCCESS) return 1;
MPI_Get_count(&status, MPI_CHAR, &count);
if (count != 13) return 1;
printf("%s", buf);
MPI_File_close(&fh);
MPI_Finalize();
return 0;
}
Simple MPI-IO: Reading (2)
$ mpicc mpiio-hello-write.c -o mpiio-hello-write
$ mpicc mpiio-hello-read.c -o mpiio-hello-read
$ mpiexec -n 13 mpiio-hello-write
$ mpiexec -n 3 mpiio-hello-read
Hello World
Hello World
Hello World
$ ls -l myfile
-rwxr-xr-x 1 bloewe users 13 Jan 18 08:19 myfile
$ cat myfile
Hello World
Compiling and Running
Array to be written to a common file containing the global array in row-major order
Contiguous data in memory, but noncontiguous in file (stored in row-major order)
Example: Distributed Arrays
[Diagram: a 2D array of m rows and n columns distributed block-wise across a 2x3 process grid (P0-P5).]
gsizes[0] = 2; /* no. of rows in global array */
gsizes[1] = 3; /* no. of columns in global array */
distribs[0] = MPI_DISTRIBUTE_BLOCK; /* block distribution */
distribs[1] = MPI_DISTRIBUTE_BLOCK; /* block distribution */
dargs[0] = MPI_DISTRIBUTE_DFLT_DARG; /* default block size */
dargs[1] = MPI_DISTRIBUTE_DFLT_DARG; /* default block size */
psizes[0] = 2; /* no. of processes in vertical dimension
of process grid */
psizes[1] = 3; /* no. of processes in horizontal dimension
of process grid */
Setup Parameters
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_create_darray(6, rank, 2, gsizes, distribs,
dargs, psizes, MPI_ORDER_C, MPI_FLOAT,
&filetype);
MPI_Type_commit(&filetype);
Filetype – MPI datatype specifying what portion of the file is visible to the process and what type of data it is
May be basic etype or derived datatype from etypes
Define Filetype
MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype,
"native", MPI_INFO_NULL);
local_array_size = num_local_rows * num_local_cols;
MPI_File_write_all(fh, local_array, local_array_size,
MPI_FLOAT, &status);
MPI_File_close(&fh);
By combining the noncontiguous datatype with the collective access, can merge several small requests into a few larger requests
Open File, Write, Close
Test case (write, interleaved):
§ 6 processes
§ 4,096 floats per process
§ 98 MB file
Comparing:
§ The collective I/O example
§ Independent I/O (multiple I/O calls)
Distributed Array Performance
Collective vs. Independent I/O: collective/datatype took 2 seconds; independent took 37 seconds.
Buffer size
§ cb_buffer_size – controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O
Read/Write
§ romio_cb_read – controls when collective buffering is applied to collective read operations
§ romio_cb_write – controls when collective buffering is applied to collective write operations
Aggregators
§ cb_nodes – controls the number of aggregator nodes
§ cb_config_list – provides explicit control over aggregators (see the ROMIO User's Guide)
MPI-IO Collective Hints
Pass hints in an MPI_Info object to MPI_File_open():
§ MPI_Info mpiHints;
§ MPI_Info_create(&mpiHints);
§ MPI_Info_set(mpiHints, key1, value1);
§ MPI_Info_set(mpiHints, key2, value2);
§ MPI_File_open(comm, fileName, fd_mode, mpiHints, &fd);
§ MPI_Info_free(&mpiHints);
Using MPI-IO Hints
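A hedged, concrete version of the pattern above (not from the slides), using the collective-buffering hints named on the previous slide; the values are illustrative and support depends on the MPI-IO implementation:

#include <mpi.h>

/* Open a shared file for writing with collective-buffering hints set. */
MPI_File open_with_hints(const char *fileName)
{
    MPI_Info mpiHints;
    MPI_File fd;

    MPI_Info_create(&mpiHints);
    MPI_Info_set(mpiHints, "cb_buffer_size", "16777216"); /* 16 MB two-phase buffer */
    MPI_Info_set(mpiHints, "romio_cb_write", "enable");   /* collective buffering for writes */
    MPI_Info_set(mpiHints, "romio_cb_read",  "enable");   /* collective buffering for reads  */

    MPI_File_open(MPI_COMM_WORLD, (char *)fileName,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, mpiHints, &fd);
    MPI_Info_free(&mpiHints);
    return fd;
}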
Different MPI I/O implementations exist
We're interested in ROMIO (part of MPICH) from Argonne National Laboratory
§ ROMIO's ADIO layer – an abstract I/O interface optimized for different file systems
§ ROMIO contains patches to support Lustre, allowing hints to be passed through MPI-IO to Lustre for layout
§ Supports local file systems, network file systems, and parallel file systems
§ Includes data sieving and two-phase optimizations
MPI-IO: ROMIO Implementation
[Diagram: ROMIO's layered architecture – the MPI-IO interface and common functionality sit on the ADIO interface, with drivers for UFS, PVFS, NFS, and Lustre.]
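A hedged sketch (not from the slides) of passing layout hints through ROMIO to Lustre, as described above: "striping_factor" and "striping_unit" are the commonly documented ROMIO hint names for stripe count and stripe size, but the names and support should be checked against your MPI/ROMIO version; the values are illustrative.

#include <mpi.h>

/* Create a file striped over 8 OSTs with a 1 MB stripe size; layout hints
 * only take effect when the file is created. */
MPI_File create_striped_file(const char *fileName)
{
    MPI_Info mpiHints;
    MPI_File fd;

    MPI_Info_create(&mpiHints);
    MPI_Info_set(mpiHints, "striping_factor", "8");      /* stripe count */
    MPI_Info_set(mpiHints, "striping_unit", "1048576");  /* stripe size in bytes */

    MPI_File_open(MPI_COMM_WORLD, (char *)fileName,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, mpiHints, &fd);
    MPI_Info_free(&mpiHints);
    return fd;
}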
MPI I/O provides a rich interface allowing us to describe:
§ Noncontiguous accesses in memory, in the file, or both
§ Collective I/O
More information on MPI-IO is available in:
§ W. Gropp, E. Lusk, R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.
MPI-IO Final Thoughts
Questions?