
Page 1:

Venkat Vishwanath, Steve Crusan, and Kevin Harms, Argonne National Laboratory

[email protected]  

Early Experience with I/O on Mira

Disclaimer: The Mira I/O subsystem and filesystem are still a work in progress. The configuration and performance will change, and we will keep users informed of these changes.

Page 2:

ALCF-1 I/O Infrastructure


[Diagram: ALCF-1 I/O infrastructure; link bandwidths shown include 640 GB/s, 640 x 0.85 GB/s, 544 GB/s, 100 GB/s, 128 GB/s, and 65 GB/s]

•  Intrepid BG/P compute resource: 40K nodes, 160K cores, 0.55 PFlops
•  640 I/O nodes
•  10 Gbps Myrinet switch complex, 900+ ports
•  Eureka analysis cluster: 100 nodes, 3.2 TB memory, 200 GPUs
•  Storage system: 128 file servers

ALCF uses the GPFS filesystem for production I/O.

Page 3:


ALCF-2 I/O Infrastructure

[Diagram: ALCF-2 I/O infrastructure; link bandwidths shown include 768 x 2 GB/s, 1536 GB/s, 384 GB/s, 512-1024 GB/s, and ~250 GB/s]

•  Mira BG/Q compute resource: 48K nodes, 768K cores, 10 PFlops
•  384 I/O nodes
•  QDR IB switch complex, 900+ ports
•  Tukey analysis cluster: 96 nodes, 6 TB memory, 192 GPUs
•  Storage system: 128 DDN VM file servers, 26 PB

The storage system consists of 16 DDN SFA12KE "couplets", with each couplet consisting of 560 x 3 TB HDDs, 32 x 200 GB SSDs, and 8 virtual machines (VMs) with two QDR IB ports per VM.

Page 4:

I/O Subsystem of BG/P

[Diagram: BG/P compute nodes in a pset connect over the tree network to their BG/P I/O node, which connects through a switch to the storage/analysis nodes.]

I/O forwarding is a paradigm that attempts to bridge the growing performance and scalability gap between the compute and I/O components of leadership-class machines: it meets the requirements of data-intensive applications by shipping I/O calls from compute nodes to dedicated I/O nodes.

Page 5:


I/O Subsystem of BG/Q – Simplified View

[Diagram: a 2x2x4x4x2 block of 128 BG/Q compute nodes connects to its ION over two 2 GB/s links; the ION connects onward at 4 GB/s.]

•  For every 128 compute nodes, there are two links to the I/O node on the BG/Q torus

•  On the ION, a QDR InfiniBand adapter is connected on a PCIe Gen2 slot

Page 6:

Control and I/O Daemon (CIOD) – I/O Forwarding mechanism for BG/P and BG/Q

•  CIOD is the I/O forwarding infrastructure provided by IBM for the Blue Gene/P

•  For each compute node, there is a dedicated I/O proxy process on the I/O node to handle the associated I/O operations

•  On BG/Q, the implementation is thread-based instead of process-based

Page 7:

Intrepid vs. Mira: I/O Comparison

                      Intrepid BG/P    Mira BG/Q          Increase
  Flops               0.557 PF         10 PF              20X
  Cores               160K             768K               ~5X
  ION / CN ratio      1 / 64           1 / 128            0.5X
  IONs per rack       16               8                  0.5X
  Total storage       6 PB             35 PB              6X
  I/O throughput      62 GB/s          240 GB/s peak      4X
  User quota?         None             Quotas enforced!!  -
  I/O per Flops (%)   0.01             0.0024             0.24X


Page 8:

I/O Interfaces and Libraries on Mira

•  Interfaces
   –  POSIX
   –  MPI-IO (see the sketch below)

•  Higher-level libraries
   –  HDF5
   –  NetCDF4
   –  PnetCDF

•  Early I/O performance on Mira
   –  Shared file write: ~70 GB/s
   –  Shared file read: ~180 GB/s
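For reference, a minimal MPI-IO collective shared-file write in C is sketched below; the file name, per-rank element count, and data values are illustrative and not taken from the slides.

/* Minimal MPI-IO collective shared-file write sketch.
 * File name, element count, and data are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                    /* 1 Mi doubles per rank */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes a contiguous, non-overlapping block of the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}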


Page 9:

Early Experience Scaling I/O for HACC

•  Hybrid/Hardware Accelerated Computational Cosmology (HACC) code, led by Salman Habib at Argonne
•  Members include Katrin Heitmann, Hal Finkel, Adrian Pope, Vitali Morozov
•  Particle-in-Cell code
•  Achieved ~14 PF on Sequoia and 7 PF on Mira
•  On Mira, HACC uses 1-10 trillion particles



HACC on Mira: Early Science Test Run

•  First large-scale simulation on Mira
   –  16 racks, 262,144 cores (one-third of Mira)
   –  14 hours continuous, no checkpoint restarts

•  Simulation parameters
   –  9.14 Gpc simulation box
   –  7 kpc force resolution
   –  1.07 trillion particles

•  Science motivations
   –  Resolve halos that host luminous red galaxies for analysis of SDSS observations
   –  Cluster cosmology
   –  Baryon acoustic oscillations in galaxy clustering

[Figure: zoom-in visualization of the density field illustrating the global spatial dynamic range of the simulation -- approximately a million-to-one]

Page 10:

Early Experience Scaling I/O for HACC

•  Typical checkpoint/restart dataset is ~100-400 TB
•  Analysis output, written more frequently, is ~1 TB-10 TB
•  Step 1:
   –  Default I/O mechanism using a single shared file with MPI-IO
   –  Achieved ~15 GB/s
•  Step 2:
   –  File per rank
   –  Too many files at 32K nodes (262K files with 8 ranks per node)
•  Step 3:
   –  An I/O mechanism writing 1 file per ION (see the sketch below)
   –  Topology-aware I/O to reduce network contention
   –  Achieves up to 160 GB/s (~10X improvement)
•  Written and read ~10 PB of data on Mira (and counting)
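The slides do not show HACC's actual topology-aware implementation; the sketch below is only a generic illustration of the "one file per group of ranks" idea using MPI_Comm_split, with an illustrative RANKS_PER_GROUP constant standing in for the real ION-based grouping that a production code would derive from the BG/Q topology.

/* Sketch of "one file per I/O group" output: ranks are split into groups,
 * and each group writes collectively to its own file with MPI-IO.
 * RANKS_PER_GROUP and the file naming are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANKS_PER_GROUP 1024   /* e.g. 128 nodes x 8 ranks per node */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All ranks belonging to the same group share a communicator. */
    int group = rank / RANKS_PER_GROUP;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int grank;
    MPI_Comm_rank(group_comm, &grank);

    /* Illustrative per-rank payload. */
    const int count = 1 << 20;
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    /* One file per group, written collectively by that group's ranks. */
    char fname[64];
    snprintf(fname, sizeof(fname), "checkpoint.%06d", group);

    MPI_File fh;
    MPI_File_open(group_comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)grank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}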

Page 11:

A Few Tips and Tricks for I/O on BG/Q

•  Pre-create the files
   –  Increases shared-file write performance from 70 GB/s to 130 GB/s
•  Single shared file vs. file per process (MPI rank)
•  Pre-allocate files
•  HDF5 datatype
   –  Instead of native, use big endian (e.g., H5T_IEEE_F64BE); see the sketch below
•  MPI datatypes
•  Collective I/O
•  MPI-IO BGLockless environment variable
   –  Prefix the file path with bglockless:/
   –  BGLOCKLESSMPIO_F_TYPE=0x47504653
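A minimal sketch of the HDF5 datatype tip, assuming an illustrative file name, dataset name, and size: the dataset is stored on disk as big-endian H5T_IEEE_F64BE while the in-memory buffer stays H5T_NATIVE_DOUBLE, so HDF5 performs the conversion on write.

/* Sketch: create an HDF5 dataset stored as big-endian IEEE doubles
 * (H5T_IEEE_F64BE) while writing from native in-memory doubles.
 * File name, dataset name, and size are illustrative. */
#include <hdf5.h>

int main(void)
{
    double data[1024];
    for (int i = 0; i < 1024; i++)
        data[i] = (double)i;

    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { 1024 };
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* On-disk type: big endian, instead of H5T_NATIVE_DOUBLE. */
    hid_t dset = H5Dcreate2(file, "values", H5T_IEEE_F64BE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* In-memory type stays native; HDF5 converts on write. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

The BGLockless tip, by contrast, applies at the MPI-IO layer: the bglockless:/ prefix (or the BGLOCKLESSMPIO_F_TYPE environment variable) is applied to the path handed to MPI-IO.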


Page 12:

Darshan – An I/O Characterization Tool

•  An open-source tool developed for statistical profiling of I/O

•  Designed to be lightweight and low overhead
   –  Finite memory allocation for statistics (about 2 MB) done during MPI_Init
   –  Overhead of 1-2% total to record I/O calls
   –  Variation of I/O is typically around 4-5%
   –  Darshan does not create detailed function call traces

•  No source modifications required
   –  Uses PMPI interfaces to intercept MPI calls (see the sketch below)
   –  Uses ld wrapping to intercept POSIX calls
   –  Can use dynamic linking with LD_PRELOAD instead

•  Stores results in a single compressed log file

•  http://www.mcs.anl.gov/darshan
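To illustrate the PMPI mechanism mentioned above (a generic sketch, not Darshan's actual code): a profiling library exports its own definition of an MPI routine, records whatever statistics it needs, and forwards to the real implementation through the PMPI_ entry point. The sketch assumes an MPI-3 mpi.h (const-qualified buffer argument).

/* Generic PMPI interception sketch (not Darshan's actual code). */
#include <mpi.h>

static long long write_calls = 0;     /* illustrative counters */
static long long bytes_written = 0;

int MPI_File_write(MPI_File fh, const void *buf, int count,
                   MPI_Datatype datatype, MPI_Status *status)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);

    write_calls += 1;
    bytes_written += (long long)count * type_size;

    /* Forward to the underlying MPI library. */
    return PMPI_File_write(fh, buf, count, datatype, status);
}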


Page 13:

Darshan on BG/Q

•  logs: /gpfs/vesta-fs0/logs/darshan/vesta and /gpfs/mira-fs0/logs/darshan/mira

•  bins: /soft/perftools/darshan/darshan/bin

•  wrappers: /soft/perftools/darshan/darshan/wrappers/<wrapper>

•  export PATH=/soft/perftools/darshan/darshan/wrappers/xl:${PATH}


[Figure: example Darshan job summary for fms c48L24 am3p5.x (3/27/2010), uid 6277, nprocs 1944, runtime 5382 seconds. The summary panels show the average I/O cost per process (read, write, metadata, and other time for POSIX and MPI-IO), I/O operation counts (read, write, open, stat, seek, mmap, fsync for POSIX, MPI-IO independent, and MPI-IO collective), histograms of read and write access sizes, the I/O pattern (total, sequential, consecutive), and the most common access sizes.]

Page 14:

Questions
