
Page 1: Optimizing Performance of HPC Storage Systems

Torben Kling Petersen, PhD, Principal Architect

High Performance Computing

Page 2: Current Top10 …

(Rank: system, compute architecture; site; total cores; Rmax / Rpeak (GFlop/s); power (kW); file system, size, performance)

1. Tianhe-2: TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi; National Super Computer Center in Guangzhou; 3,120,000 cores; 33,862,700 / 54,902,400; 17,808 kW; Lustre/H2FS, 12.4 PB, ~750 GB/s

2. Titan: Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x; DOE/SC/Oak Ridge National Laboratory; 560,640 cores; 17,590,000 / 27,112,550; 8,209 kW; Lustre, 10.5 PB, 240 GB/s

3. Sequoia: BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect; DOE/NNSA/LLNL; 1,572,864 cores; 17,173,224 / 20,132,659; 7,890 kW; Lustre, 55 PB, 850 GB/s

4. K computer: Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect; RIKEN AICS; 705,024 cores; 10,510,000 / 11,280,384; 12,659 kW; Lustre, 40 PB, 965 GB/s

5. Mira: BlueGene/Q, Power BQC 16C 1.60GHz, Custom; DOE/SC/Argonne National Lab.; 786,432 cores; 8,586,612 / 10,066,330; 3,945 kW; GPFS, 7.6 PB, 88 GB/s

N/A. Blue Waters: Cray XK7, Opteron 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x; NCSA; - cores; - / -; - kW; Lustre, 24 PB, 1,100 GB/s

6. Piz Daint: Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x; Swiss National Supercomputing Centre (CSCS); 115,984 cores; 6,271,000 / 7,788,853; 2,325 kW; Lustre, 2.5 PB, 138 GB/s

7. Stampede: PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi; TACC/Univ. of Texas; 462,462 cores; 5,168,110 / 8,520,112; 4,510 kW; Lustre, 14 PB, 150 GB/s

8. JUQUEEN: BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect; Forschungszentrum Juelich (FZJ); 458,752 cores; 5,008,857 / 5,872,025; 2,301 kW; GPFS, 5.6 PB, 33 GB/s

9. Vulcan: BlueGene/Q, Power 16C 1.6GHz, Custom Interconnect; DOE/NNSA/LLNL; 393,216 cores; 4,293,306 / 5,033,165; 1,972 kW; Lustre, 55 PB, 850 GB/s

10. SuperMUC: iDataPlex DX360M4, Xeon E5-2680 8C 2.70GHz, Infiniband FDR; Leibniz Rechenzentrum; 147,456 cores; 2,897,000 / 3,185,050; 3,423 kW; GPFS, 10 PB, 200 GB/s

Page 3: Performance testing

Lies

Bigger lies

Benchmarks

Page 4: Storage benchmarks

• IOR
• IOzone
• Bonnie++
• sgpdd-survey
• obdfilter-survey
• FIO
• dd/xdd
• Filebench
• dbench
• Iometer
• MDstat
• metarates
• …

Page 5: Lustre® Architecture – High Level

[Architecture diagram] Lustre clients (1 to 100,000) reach the storage servers over multiple network types (IB, X-GigE), optionally through routers. Metadata Servers (MDS) store namespace data on Metadata Targets (MDT); Object Storage Servers (OSS, 1 to 1,000s) store file data on Object Storage Targets (OST) hosted on disk arrays and SAN fabric. CIFS and NFS clients attach through gateway nodes.

Page 6: Dissecting benchmarking

Page 7: The chain and the weakest link …

[Diagram: the client-to-disk I/O path, with potential bottlenecks at every hop]
• Client: CPU, memory, PCI-E bus, file system client, MPI stack, OS, interconnect
• Network: non-blocking fabric? TCP/IP overhead? Routing?
• OSS: CPU, memory, PCI-E bus, SAS controller/expander, RAID controller (SW/HW), SAS port oversubscription, cabling
• Disk back end: disk drive performance, RAID sets, SAS or S-ATA, file system

Only a balanced system will deliver performance …

Page 8: Server Side Benchmark

• obdfilter-survey is a Lustre benchmark tool that measures OSS and back-end OST performance; it does not exercise LNET or the clients.

• This makes it a good benchmark for isolating the servers from the network and the clients.

• Example obdfilter-survey invocation:
[root@oss1 ~]# nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey

• Parameters defined:
– size=65536           // file size (2x controller memory is good practice)
– nobjhi=1 nobjlo=1    // number of files
– thrhi=256 thrlo=256  // number of worker threads when testing the OSS

• If results are significantly lower than expected, rerun the test several times to confirm whether the low numbers are reproducible.

• The benchmark can also target individual OSTs. If an OSS node performs lower than expected, the cause may be a single OST that is slow due to a drive issue, a RAID array rebuild, etc.

[root@oss1 ~]# targets="fsname-OST0000 fsname-OST0002" nobjlo=1 nobjhi=1 thrlo=256 thrhi=256 size=65536 obdfilter-survey

Page 9: Client Side Benchmark

• IOR uses MPI-IO to execute the benchmark across all nodes and mimics typical HPC applications running on the clients.

• Within IOR, the benchmark can be configured for File-Per-Process or Single-Shared-File operation (see the example command lines below):
– File-Per-Process: creates a unique file per task; the most common way to measure peak throughput of a Lustre parallel file system.
– Single-Shared-File: creates a single file shared across all tasks running on all clients.

• IOR has two primary modes:
– Buffered: the default; takes advantage of the Linux page cache on the client.
– DirectIO: bypasses the Linux page cache and writes directly to the file system.
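Example IOR command lines (a minimal sketch; the task count, API choice and file paths are illustrative, not from the slides):

[root@head ~]# mpirun -np 1024 ior -a POSIX -w -r -e -C -F -t 1m -b 4g -o /mnt/lustre/ior/fpp      # File-Per-Process, buffered
[root@head ~]# mpirun -np 1024 ior -a POSIX -w -r -e -C -t 1m -b 4g -o /mnt/lustre/ior/shared      # Single-Shared-File (no -F)

Adding -B to either command switches from buffered I/O to DirectIO (O_DIRECT), bypassing the client page cache; -e issues an fsync after writes and -C shifts task ordering for the read phase so clients do not read back their own cached data.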

Page 10: Typical Client Configuration

• At customer sites, all clients typically have the same architecture, the same number of CPU cores, and the same amount of memory.

• With a uniform client architecture, the IOR parameters are simpler to tune and optimize for benchmarking (a quick uniformity check is sketched below).

• Example for 200 clients:
– Number of cores per client: 16 (nproc)
– Amount of memory per client: 32 GB (cat /proc/meminfo)
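One way to confirm the clients really are uniform is to query them all at once (a sketch; the pdsh host range is illustrative):

[root@head ~]# pdsh -w client[001-200] 'nproc; grep MemTotal /proc/meminfo' | dshbak -c

dshbak -c folds identical output together, so a single output block means every client reports the same core count and memory size.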

Page 11: IOR Rule of Thumb

• Always transfer at least 2x the aggregate memory of all clients used, to avoid any client-side caching effect.

• In our example:
– (200 clients * 32 GB) * 2 = 12,800 GB

• The total file size for the IOR benchmark will therefore be 12.8 TB (a sizing sketch follows below).
– NOTE: this assumes all nodes are uniform.
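Translating the rule into IOR parameters (a sketch; the API choice, transfer size and file path are illustrative): 200 clients x 16 tasks per client = 3,200 tasks, and 12,800 GB / 3,200 tasks = 4 GB per task, so each task writes and reads a 4 GB block:

[root@head ~]# mpirun -np 3200 ior -a POSIX -w -r -e -C -F -t 1m -b 4g -o /mnt/lustre/ior/testfile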

Page 12: Lustre Configuration

Page 13: Lustre Server Caching Description

• Lustre read_cache_enable
– Controls whether data read from disk during a read request is kept in memory and available for later read requests for the same data, without having to re-read it from disk. By default, read cache is enabled (read_cache_enable = 1).

• Lustre writethrough_cache_enable
– Controls whether data sent to the OSS as a write request is kept in the read cache and available for later reads, or discarded from cache when the write is completed. By default, writethrough cache is enabled (writethrough_cache_enable = 1).

• Lustre readcache_max_filesize
– Controls the maximum size of a file that both the read cache and writethrough cache will try to keep in memory. Files larger than readcache_max_filesize are not kept in cache for either reads or writes. The default is to cache all file sizes. (The lctl examples below show how to inspect and change these settings.)
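These settings live on the OSS nodes and can be read or changed with lctl (a sketch; run on each OSS, and the 1M value matches the setting used later in this deck rather than the default):

[root@oss1 ~]# lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable obdfilter.*.readcache_max_filesize
[root@oss1 ~]# lctl set_param obdfilter.*.readcache_max_filesize=1M     # only cache files up to 1M on the OSS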

Page 14: Client Lustre Parameters

• Network checksums
– Default is on, which impacts performance. Disabling checksums is the first thing we do for performance.

• LRU size
– Parameter used to control the number of client-side locks in an LRU queue.
– We typically disable this parameter.

• Max RPCs in flight
– RPC is a remote procedure call; this tunable is the maximum number of concurrent RPCs in flight from a client.
– Default is 8; increase to 32.

• Max dirty MB
– Defines how many MB of dirty data can be written and queued up on the client.
– Default is 32; a good rule of thumb is 4x the value of max_rpcs_in_flight.

(Example lctl commands for these parameters follow below.)
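The equivalent client-side commands look roughly like this (a sketch; the max_dirty_mb and lru_size values are illustrative, and how "disabling" the LRU maps onto lru_size depends on the Lustre release, so check the manual for your version):

[root@client ~]# lctl set_param osc.*.checksums=0                 # turn off network checksums
[root@client ~]# lctl set_param osc.*.max_rpcs_in_flight=32       # raise from the default of 8
[root@client ~]# lctl set_param osc.*.max_dirty_mb=128            # 4x max_rpcs_in_flight
[root@client ~]# lctl set_param ldlm.namespaces.*.lru_size=128    # pin the client lock LRU to a fixed size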

Page 15: Lustre Striping

• The default Lustre stripe size is 1M and the default stripe count is 1.
– Each file is written to 1 OST with a stripe size of 1M.
– When multiple files are created and written, the MDS makes a best effort to distribute the load across all available OSTs.

• The default stripe size and count can be changed. The smallest stripe size is 64K (and can be increased in 64K increments), and the stripe count can be increased to include all OSTs.
– Setting the stripe count to all OSTs means each file is created using all OSTs. This is best when creating a single shared file from multiple Lustre clients.

• One can create multiple directories with different stripe sizes and counts to optimize for performance, as in the sketch below.
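Striping is typically set per directory with lfs before the test files are created (a sketch; directory names are illustrative, and older Lustre releases spell the stripe-size flag -s instead of -S):

[root@client ~]# lfs setstripe -S 1M -c -1 /mnt/lustre/ssf    # stripe over all OSTs, e.g. for single-shared-file runs
[root@client ~]# lfs setstripe -S 1M -c 1 /mnt/lustre/fpp     # one OST per file, e.g. for file-per-process runs
[root@client ~]# lfs getstripe /mnt/lustre/ssf                # verify the layout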

Page 16: Experimental Setup & Results

Page 17: Equipment used

• ClusterStor with 2x CS6000 SSUs
– 2TB NL-SAS Hitachi drives
– 4U CMU
– Neo 1.2.1, HF applied

• Clients
– 64 clients, 12 cores, 24GB memory, QDR
– Mellanox FDR core switch
– Lustre client: 1.8.7

• Lustre
– Server version: 2.1.3
– 4 OSSes
– 16 OSTs (RAID 6)

Page 18: Subset of test parameters

• Disk backend testing – obdfilter-survey

• Client based testing – IOR
– I/O mode
– I/O slots per client
– IOR transfer size
– Number of client threads

• Lustre tunings (server)
– writethrough cache enabled
– read cache enabled
– read cache max filesize = 1M

• Client settings
– LRU disabled
– Checksums disabled
– Max RPCs in flight = 32

Page 19: Lustre obdfilter-survey

# pdsh -g oss "TERM=linux thrlo=256 thrhi=256 nobjlo=1 nobjhi=1 rsz=1024K size=32768 obdfilter-survey"

cstor01n04: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3032.89 [ 713.86, 926.89] rewrite 3064.15 [ 722.83, 848.93] read 3944.49 [ 912.83,1112.82]
cstor01n05: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3022.43 [ 697.83, 819.86] rewrite 3019.55 [ 705.15, 827.87] read 3959.50 [ 945.20,1125.76]

This means that a single SSU (the two OSS nodes above combined) has:
• write performance of 6,055 MB/s (75.9 MB/s per disk)
• read performance of 7,904 MB/s (98.8 MB/s per disk)
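The aggregate figures are simply the sums of the two OSS lines above; a small helper along these lines (a sketch that assumes the pdsh group and output format shown above) totals them directly:

# pdsh -g oss "thrlo=256 thrhi=256 nobjlo=1 nobjhi=1 rsz=1024K size=32768 obdfilter-survey" | \
    awk '{ for (i = 1; i <= NF; i++) { if ($i == "write") w += $(i+1); if ($i == "read") r += $(i+1) } } END { printf "aggregate write %.0f MB/s, read %.0f MB/s\n", w, r }'

The exact field comparison ("write", "read") keeps the rewrite column from being counted twice.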

Page 20: Buffered I/O

[Figure: four charts of IOR buffered I/O results; y-axis: performance (MB/s), up to 14,000]
• Write performance and read performance vs. number of threads (4 to 1536), one series per IOR transfer size (-t = 2, 4, 8, 16, 32, 64)
• Write performance and read performance vs. transfer size (1 to 64 MB), one series per process count (np = 4 to 1024)

Page 21: Direct I/O

[Figure: four charts of IOR DirectIO results; y-axis: throughput (MB/s), up to 14,000]
• IOR FPP DirectIO reads and writes vs. number of threads (4 to 1536), one series per IOR transfer size (-t = 2, 4, 8, 16, 32, 64)
• Read performance and write performance vs. transfer size (1 to 64 MB), one series per process count (np = 4 to 1024)

Page 22: Summary

Page 23: Reflections on the results

• Never trust marketing numbers …
• Testing every stage of the data pipeline is essential.
• Optimal parameters and/or methodology for read and write are seldom the same.
• Real-life applications can often be configured accordingly.

• Balanced architectures will deliver performance.
– Client-based IOR performs within 5% of the backend.
– In excess of 750 MB/s per OST … -> 36 GB/s per rack …

• A well-designed solution will scale linearly using Lustre.
– cf. NCSA Blue Waters

Page 24: Thank You

Optimizing Performance of HPC Storage Systems

[email protected]