
Page 1:

Venkat Vishwanath, Steve Crusan, and Kevin Harms, Argonne National Laboratory

[email protected]  

Early Experience with I/O on Mira

Disclaimer: The Mira I/O subsystem and filesystem are still a work in progress. The configuration and performance will change, and we will keep users informed of these changes.

Page 2:

ALCF-1 I/O Infrastructure


[Diagram: ALCF-1 I/O infrastructure; link bandwidths shown include 640 GB/s, 640 x 0.85 GB/s, 544 GB/s, 100 GB/s, 128 GB/s, and 65 GB/s]

•  Intrepid BG/P compute resource: 40K nodes, 160K cores, 0.55 PFlops
•  640 I/O nodes
•  10 Gbps Myrinet switch complex, 900+ ports
•  Eureka analysis cluster: 100 nodes, 3.2 TB memory, 200 GPUs
•  Storage system: 128 file servers

ALCF uses the GPFS filesystem for production I/O.

Page 3:


ALCF-2 I/O Infrastructure

[Diagram: ALCF-2 I/O infrastructure; link bandwidths shown include 768 x 2 GB/s, 1536 GB/s, 384 GB/s, 512-1024 GB/s, and ~250 GB/s]

•  Mira BG/Q compute resource: 48K nodes, 768K cores, 10 PFlops
•  384 I/O nodes
•  QDR IB switch complex, 900+ ports
•  Tukey analysis cluster: 96 nodes, 6 TB memory, 192 GPUs
•  Storage system: 128 DDN VM file servers, 26 PB

The storage system consists of 16 DDN SFA12KE "couplets", with each couplet consisting of 560 x 3 TB HDDs, 32 x 200 GB SSDs, and 8 virtual machines (VMs) with two QDR IB ports per VM.

Page 4:

I/O Subsystem of BG/P

[Diagram: BG/P compute nodes in a pset connect over the tree network to their BG/P I/O node, which connects through a switch to the storage/analysis nodes.]

I/O forwarding is a paradigm that attempts to bridge the growing performance and scalability gap between the compute and I/O components of leadership-class machines: it meets the requirements of data-intensive applications by shipping I/O calls from compute nodes to dedicated I/O nodes.

Page 5:


I/O Subsystem of BG/Q – Simplified View

[Diagram: a 2x2x4x4x2 block of 128 BG/Q compute nodes connects to its ION over two 2 GB/s links; the ION connects onward at 4 GB/s.]

•  For every 128 compute nodes, there are two links to the I/O node on the BG/Q torus

•  On the ION, a QDR InfiniBand adapter is connected on a PCIe Gen2 slot

Page 6:

Control and I/O Daemon (CIOD) – I/O Forwarding mechanism for BG/P and BG/Q

•  CIOD is the I/O forwarding infrastructure provided by IBM for the Blue Gene/P

•  For each compute node, there is a dedicated I/O proxy process on the I/O node to handle the associated I/O operations

•  On BG/Q, the implementation is thread-based instead of process-based

Page 7:

Intrepid vs. Mira: I/O Comparison

                      Intrepid BG/P    Mira BG/Q          Increase
  Flops               0.557 PF         10 PF              20X
  Cores               160K             768K               ~5X
  ION / CN ratio      1 / 64           1 / 128            0.5X
  IONs per rack       16               8                  0.5X
  Total storage       6 PB             35 PB              6X
  I/O throughput      62 GB/s          240 GB/s peak      4X
  User quota?         None             Quotas enforced!!  -
  I/O per Flops (%)   0.01             0.0024             0.24X


Page 8:

I/O Interfaces and Libraries on Mira

•  Interfaces
   –  POSIX
   –  MPI-IO (see the sketch below)

•  Higher-level libraries
   –  HDF5
   –  NetCDF4
   –  PnetCDF

•  Early I/O performance on Mira
   –  Shared file write: ~70 GB/s
   –  Shared file read: ~180 GB/s
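For reference, a minimal MPI-IO collective shared-file write in C is sketched below; the file name, per-rank element count, and data values are illustrative and not taken from the slides.

/* Minimal MPI-IO collective shared-file write sketch.
 * File name, element count, and data are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                    /* 1 Mi doubles per rank */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes a contiguous, non-overlapping block of the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}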


Page 9:

Early Experience Scaling I/O for HACC

•  Hybrid/Hardware Accelerated Computational Cosmology (HACC) code, led by Salman Habib at Argonne
•  Members include Katrin Heitmann, Hal Finkel, Adrian Pope, Vitali Morozov
•  Particle-in-Cell code
•  Achieved ~14 PF on Sequoia and 7 PF on Mira
•  On Mira, HACC uses 1-10 trillion particles



HACC on Mira: Early Science Test Run

•  First large-scale simulation on Mira
   –  16 racks, 262,144 cores (one-third of Mira)
   –  14 hours continuous, no checkpoint restarts

•  Simulation parameters
   –  9.14 Gpc simulation box
   –  7 kpc force resolution
   –  1.07 trillion particles

•  Science motivations
   –  Resolve halos that host luminous red galaxies for analysis of SDSS observations
   –  Cluster cosmology
   –  Baryon acoustic oscillations in galaxy clustering

[Figure: zoom-in visualization of the density field illustrating the global spatial dynamic range of the simulation -- approximately a million-to-one]

Page 10:

Early Experience Scaling I/O for HACC

•  Typical checkpoint/restart dataset is ~100-400 TB
•  Analysis output, written more frequently, is ~1 TB-10 TB
•  Step 1:
   –  Default I/O mechanism using a single shared file with MPI-IO
   –  Achieved ~15 GB/s
•  Step 2:
   –  File per rank
   –  Too many files at 32K nodes (262K files with 8 ranks per node)
•  Step 3:
   –  An I/O mechanism writing 1 file per ION (see the sketch below)
   –  Topology-aware I/O to reduce network contention
   –  Achieves up to 160 GB/s (~10X improvement)
•  Written and read ~10 PB of data on Mira (and counting)
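The slides do not show HACC's actual topology-aware implementation; the sketch below is only a generic illustration of the "one file per group of ranks" idea using MPI_Comm_split, with an illustrative RANKS_PER_GROUP constant standing in for the real ION-based grouping that a production code would derive from the BG/Q topology.

/* Sketch of "one file per I/O group" output: ranks are split into groups,
 * and each group writes collectively to its own file with MPI-IO.
 * RANKS_PER_GROUP and the file naming are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANKS_PER_GROUP 1024   /* e.g. 128 nodes x 8 ranks per node */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All ranks belonging to the same group share a communicator. */
    int group = rank / RANKS_PER_GROUP;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int grank;
    MPI_Comm_rank(group_comm, &grank);

    /* Illustrative per-rank payload. */
    const int count = 1 << 20;
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    /* One file per group, written collectively by that group's ranks. */
    char fname[64];
    snprintf(fname, sizeof(fname), "checkpoint.%06d", group);

    MPI_File fh;
    MPI_File_open(group_comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)grank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}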

Page 11:

A Few Tips and Tricks for I/O on BG/Q

•  Pre-create the files
   –  Increases shared-file write performance from 70 GB/s to 130 GB/s
•  Single shared file vs. file per process (MPI rank)
•  Pre-allocate files
•  HDF5 datatype
   –  Instead of native, use big endian (e.g., H5T_IEEE_F64BE); see the sketch below
•  MPI datatypes
•  Collective I/O
•  MPI-IO BGLockless environment variable
   –  Prefix the file path with bglockless:/
   –  BGLOCKLESSMPIO_F_TYPE=0x47504653
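A minimal sketch of the HDF5 datatype tip, assuming an illustrative file name, dataset name, and size: the dataset is stored on disk as big-endian H5T_IEEE_F64BE while the in-memory buffer stays H5T_NATIVE_DOUBLE, so HDF5 performs the conversion on write.

/* Sketch: create an HDF5 dataset stored as big-endian IEEE doubles
 * (H5T_IEEE_F64BE) while writing from native in-memory doubles.
 * File name, dataset name, and size are illustrative. */
#include <hdf5.h>

int main(void)
{
    double data[1024];
    for (int i = 0; i < 1024; i++)
        data[i] = (double)i;

    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { 1024 };
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* On-disk type: big endian, instead of H5T_NATIVE_DOUBLE. */
    hid_t dset = H5Dcreate2(file, "values", H5T_IEEE_F64BE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* In-memory type stays native; HDF5 converts on write. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

The BGLockless tip, by contrast, applies at the MPI-IO layer: the bglockless:/ prefix (or the BGLOCKLESSMPIO_F_TYPE environment variable) is applied to the path handed to MPI-IO.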


Page 12:

Darshan – An I/O Characterization Tool

•  An open-source tool developed for statistical profiling of I/O

•  Designed to be lightweight and low overhead
   –  Finite memory allocation for statistics (about 2 MB) done during MPI_Init
   –  Overhead of 1-2% total to record I/O calls
   –  Variation of I/O is typically around 4-5%
   –  Darshan does not create detailed function call traces

•  No source modifications required
   –  Uses PMPI interfaces to intercept MPI calls (see the sketch below)
   –  Uses ld wrapping to intercept POSIX calls
   –  Can use dynamic linking with LD_PRELOAD instead

•  Stores results in a single compressed log file

•  http://www.mcs.anl.gov/darshan
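To illustrate the PMPI mechanism mentioned above (a generic sketch, not Darshan's actual code): a profiling library exports its own definition of an MPI routine, records whatever statistics it needs, and forwards to the real implementation through the PMPI_ entry point. The sketch assumes an MPI-3 mpi.h (const-qualified buffer argument).

/* Generic PMPI interception sketch (not Darshan's actual code). */
#include <mpi.h>

static long long write_calls = 0;     /* illustrative counters */
static long long bytes_written = 0;

int MPI_File_write(MPI_File fh, const void *buf, int count,
                   MPI_Datatype datatype, MPI_Status *status)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);

    write_calls += 1;
    bytes_written += (long long)count * type_size;

    /* Forward to the underlying MPI library. */
    return PMPI_File_write(fh, buf, count, datatype, status);
}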


Page 13:

Darshan on BG/Q

•  logs: /gpfs/vesta-fs0/logs/darshan/vesta and /gpfs/mira-fs0/logs/darshan/mira

•  bins: /soft/perftools/darshan/darshan/bin

•  wrappers: /soft/perftools/darshan/darshan/wrappers/<wrapper>

•  export PATH=/soft/perftools/darshan/darshan/wrappers/xl:${PATH}


[Figure: example Darshan job summary for fms c48L24 am3p5.x (3/27/2010), uid 6277, nprocs 1944, runtime 5382 seconds. The summary panels show the average I/O cost per process (read, write, metadata, and other time for POSIX and MPI-IO), I/O operation counts (read, write, open, stat, seek, mmap, fsync for POSIX, MPI-IO independent, and MPI-IO collective), histograms of read and write access sizes, the I/O pattern (total, sequential, consecutive), and the most common access sizes.]

Page 14:

Questions
