
Page 1

High Performance Computing: Concepts, Methods & Means

Parallel I/O : File Systems and Libraries

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

March 29th, 2007

Page 2

2

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 3

3

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 4

• Storage capacity: 1 TB per drive

• Areal density: 132 Gbit/in² (perpendicular recording)

• Rotational speed: 15,000 RPM

• Average latency: 2 ms

• Seek time

– Track-to-track: 0.2 ms

– Average: 3.5 ms

– Full stroke: 6.7 ms

• Sustained transfer rate: up to 125 MB/s

• Non-recoverable error rate: 1 in 10¹⁷

• Interface bandwidth:

– Fibre channel: 400 MB/s

– Serially Attached SCSI (SAS): 300 MB/s

– Ultra320 SCSI: 320 MB/s

– Serial ATA (SATA): 300 MB/s

Permanent Storage: Hard Disks

Review

4
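A quick sanity check of the latency figure (added note, not from the original slide): at 15,000 RPM one revolution takes 60 s / 15,000 ≈ 4 ms, so the average rotational latency (half a revolution) is about 2 ms, matching the value quoted above.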

Page 5

Storage – SATA & Overview (Review)

• Serial ATA is the newest commodity hard disk standard.
• SATA uses serial buses, as opposed to the parallel buses used by ATA and SCSI.
• The cables attached to SATA drives are smaller and run faster (around 150 MB/s).
• The basic disk technologies remain the same across the three buses.
• The platters in a disk spin at a variety of speeds; the faster the platters spin, the faster data can be read off the disk, and data on the far end of the platter becomes available sooner.
• Rotational speeds range between 5,400 RPM and 15,000 RPM.
• The faster the platters rotate, the lower the latency and the higher the bandwidth.

5

PATA vs SATA

Page 6

I/O Needs on Parallel Computers

• High Performance

– Take advantage of parallel I/O paths (when available)

– Support application-level data access and throughput needs

• Data Integrity

– sanely deal with hardware and power failures

• Single Namespace

– All nodes and users “see” the same file systems

– Equal access from anywhere on the resource.

• Ease of Use

– Where possible, a parallel file system should be accessible in a consistent way, in the same way as traditional UNIX-style file systems.

6
Source: Ohio Supercomputer Center

Page 7

7

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 8

Parallel I/O - RAID

• RAID (Redundant Array of Inexpensive Disks) provides a mechanism by which the performance and storage properties of individual disks can be aggregated.
• A group of disks appears as a single large disk; the performance of multiple disks is better than that of a single disk.
• Using multiple disks also allows data to be stored in multiple places, so the system can continue functioning after a failure.
• Both software and hardware RAID solutions are available.
• Hardware solutions are more expensive, but provide better performance without CPU overhead.
• Software solutions provide various levels of flexibility, but have associated computational overhead.

8

Page 9

RAID : Key Concepts

• A variety of RAID allocation schemes exist:
• RAID 0 (disk striping without redundant storage):
– Data is striped across multiple disks.
– The result of striping is a logical storage device whose capacity is the capacity of each disk times the number of disks in the array.
– Both read and write performance are accelerated.
– Successive blocks reside on different disks, so reads can be interleaved between disks, which improves read performance.
– No fault tolerance.
– High transfer rates.
– High request rates.

9
http://www.drivesolutions.com/datarecovery/raid.shtml

Page 10

RAID : Key Concepts

• RAID 1 (disk mirroring):
– Complete copies of the data are stored in multiple locations.
– The capacity of such a RAID set is half of its raw capacity. Read performance is accelerated and is comparable to RAID 0.
– Writes are slowed down, as new data needs to be transmitted multiple times.
• RAID 5:
– Like RAID 0, data is striped across multiple disks, with parity distributed across the disks.
– For any block of data stored across the drives, a parity checksum is computed and stored on a predetermined disk.
– Read performance of RAID 5 is somewhat reduced because the parity data is distributed across the drives, and write performance lags behind because of the checksum computation.

10
http://www.drivesolutions.com/datarecovery/raid.shtml
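To make the striping and parity ideas above concrete, here is a minimal C sketch (not part of the original slides; the disk count and block size are made-up parameters). It maps a logical block to a RAID 0 disk and offset, and computes a RAID 5 style parity block as the XOR of the data blocks in a stripe, which is what allows a single lost block to be reconstructed.

#include <stdio.h>
#include <string.h>

#define NDISKS     4        /* hypothetical number of data disks */
#define BLOCK_SIZE 4096     /* hypothetical stripe unit in bytes */

/* RAID 0: map a logical block number to (disk, block index on that disk) */
static void raid0_map(long logical_block, int *disk, long *block_on_disk)
{
    *disk          = (int)(logical_block % NDISKS);
    *block_on_disk = logical_block / NDISKS;
}

/* RAID 5 style parity: XOR of all data blocks in one stripe; XOR-ing the
   parity with the surviving blocks reconstructs a single lost block */
static void xor_parity(char blocks[NDISKS][BLOCK_SIZE], char parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < NDISKS; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= blocks[d][i];
}

int main(void)
{
    static char stripe[NDISKS][BLOCK_SIZE];   /* one stripe worth of data */
    static char parity[BLOCK_SIZE];
    int disk;
    long block;

    memset(stripe, 0xAB, sizeof(stripe));     /* dummy data */
    raid0_map(10, &disk, &block);
    xor_parity(stripe, parity);
    printf("logical block 10 -> disk %d, block %ld; parity[0] = 0x%02x\n",
           disk, block, (unsigned char)parity[0]);
    return 0;
}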

Page 11

11

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 12

Distributed File Systems

• A distributed file system is a file system that is stored locally on one system

(server) but is accessible by processes on many systems (clients).

• Multiple processes access multiple files simultaneously.

• Other attributes of a DFS may include :

– Access control lists (ACLs)

– Client-side file replication

– Server- and client- side caching

• Some examples of DFSes:

– NFS (Sun)

– AFS (CMU)

– DCE/DFS (Transarc / IBM)

– CIFS (Microsoft)

• Distributed file systems can be used by parallel programs, but they have

significant disadvantages :

– The network bandwidth of the server system is a limiting factor on performance

– To retain UNIX-style file consistency, the DFS software must implement some form

of locking which has significant performance implications

12
Source: Ohio Supercomputer Center

Page 13

Distributed File System : NFS

• Popular means for accessing remote file

systems in a local area network.

• Based on the client-server model, the remote

file systems are “mounted” via NFS and

accessed through the Linux virtual file system

(VFS) layer.

• NFS clients cache file data, periodically

checking with the original file for any changes.

• The loosely-synchronous model makes for

convenient, low-latency access to shared

spaces.

• NFS avoids the common locking systems used

to implement POSIX semantics.

13

Page 14

Why NFS is bad for Parallel I/O

• Clients can cache data indiscriminately, and caching tends to occur at block boundaries.
• When nearby regions of a file are written by different processes on different clients, the result is undefined due to the lack of consistency control.
• Secondly, all file operations are remote operations; extensive file locking is required to implement sequential consistency.
• Communication between client and server typically uses relatively slow communication channels, adding to the performance degradation.
• The specification is inefficient (e.g., a read operation involves two RPC operations: one for the look-up of the file handle and a second for reading the file data).

14

Page 15

15

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 16

Parallel File Systems

• A parallel file system is one in which there are multiple servers as well as clients for a given file system, the equivalent of RAID across several file systems.
• Multiple processes can access the same file simultaneously.
• Parallel file systems are usually optimized for high performance rather than general-purpose use; common optimizations include:
– very large block sizes (≥ 64 kB)
– relatively slow metadata operations (e.g. fstat()) compared to reads and writes
– special APIs for direct access
• Examples of parallel file systems include:
– GPFS (IBM)
– Lustre (Cluster File Systems)
– PVFS2 (Clemson/ANL)

16
Source: Ohio Supercomputer Center

Page 17

Characteristics of Parallel File Systems

• Three key characteristics:
– Various hardware I/O data storage resources
– Multiple connections between these hardware devices and compute resources
– High-performance, concurrent access to these I/O resources
• Multiple physical I/O devices and paths ensure sufficient bandwidth for the high performance desired.
• Parallel I/O systems include both the hardware and a number of layers of software.

17

[Figure: parallel I/O software stack: Storage Hardware, Parallel File System, Parallel I/O (MPI I/O), High-Level I/O Library]

Page 18

Parallel File Systems: Hardware Layer

• I/O hardware is usually comprised of disks, controllers, and interconnects for data movement.
• Hardware determines the maximum raw bandwidth and the minimum latency of the system.
• The bisection bandwidth of the underlying transport determines the aggregate bandwidth of the resulting parallel I/O system.
• At the hardware level, data is accessed at the granularity of blocks, either physical disk blocks or logical blocks spread across multiple physical devices, such as in a RAID array.
• Parallel file systems:
– manage data on the storage hardware,
– present this data as a directory hierarchy,
– coordinate access to files and directories in a consistent manner.
• File systems usually provide a UNIX-like interface, allowing users to access contiguous regions of files.

18

[Figure: parallel I/O software stack, as on the previous slide]

Page 19

Parallel File Systems : Other Layers

• Lower-level interfaces may be provided by the file system for higher-performance access.
• Above the parallel file system are the parallel I/O layers, provided in the form of libraries such as MPI I/O.
• The parallel I/O layer provides a low-level interface and operations such as collective I/O.
• Scientific applications work with structured data, for which higher-level APIs written on top of MPI-IO, such as HDF5 or parallel netCDF, are used.
• HDF5 and parallel netCDF allow scientists to represent their data sets in terms closer to those used in their applications.

19

[Figure: parallel I/O software stack, as on the previous slides]

Page 20

PVFS2

• PVFS2 is designed to provide:
– modular networking and storage subsystems
– a structured data request format modeled after MPI datatypes
– flexible and extensible data distribution models
– distributed metadata
– tunable consistency semantics, and
– support for data redundancy.
• Supports a variety of network technologies, including Myrinet, Quadrics, and InfiniBand.
• Also supports a variety of storage devices, including locally attached hardware, SANs, and iSCSI.
• Key abstractions include:
– Buffered Message Interface (BMI): non-blocking network interface
– Trove: non-blocking storage interface
– Flows: mechanism to specify a flow of data between network and storage

20

Page 21

PVFS2 Software Architecture

• Buffered Message Interface (BMI)
– Non-blocking interface that can be used with many high-performance network fabrics
– Implementations currently exist for TCP/IP and Myrinet (GM) networks
• Trove:
– Non-blocking interface that can be used with a number of underlying storage mechanisms.
– Trove storage objects consist of a stream of bytes and a keyword/value pair space.
– Keyword/value pairs are convenient for arbitrary metadata storage and directory entries, while the byte streams are well suited to storing file data.

21

[Figure: PVFS2 software architecture. Client side: Client API, Job Sched, BMI, Flows, Dist. Server side: Request Processing, Job Sched, BMI, Flows, Dist, Trove. Client and server communicate over the network; Trove accesses the disk.]

Page 22

PVFS2 Software Architecture

• Flows:
– Combine the network and storage subsystems by providing a mechanism to describe the flow of data between network and storage.
– Provide a point at which data movement between a particular network and storage pair can be optimized to exploit fast paths.
• The job scheduling layer provides a common interface to interact with BMI, Flows, and Trove and checks on their completion.
• The job scheduler is tightly integrated with a state machine that is used to track operations in progress.

22

[Figure: PVFS2 software architecture (same client/server diagram as the previous slide)]

Page 23

The PVFS2 Components

• The four major components of a PVFS system are:
– Metadata server (mgr)
– I/O server (iod)
– PVFS native API (libpvfs)
– PVFS Linux kernel support
• Metadata server (mgr):
– manages all the file metadata for PVFS files, using a daemon which atomically operates on the file metadata.
– PVFS avoids the pitfalls of many storage area network approaches, which have to implement complex locking schemes to ensure that metadata stays consistent in the face of multiple accesses.

23

Page 24

The PVFS2 Components

• I/O daemon (iod):
– handles storing and retrieving file data stored on local disks connected to a node, using traditional read(), write(), etc. for access to these files.
• The PVFS native API provides user-space access to the PVFS servers.
• The library handles the operations necessary to move data between user buffers and PVFS servers.

24

[Figure: PVFS component layout, showing metadata access and data access paths]
http://csi.unmsm.edu.pe/paralelo/pvfs/desc.html

Page 25

Parallel File Systems Comparison

25

Page 26

Comparison of NFS vs. GPFS

File-System Feature           NFS                             GPFS
Introduced:                   1985                            1998
Original vendor:              Sun                             IBM
Example at LC:                /nfs/tmpn                       /p/gx1
Primary role:                 Share files among machines      Fast parallel I/O for large files
Easy to scale?                No                              Yes
Network needed:               Any TCP/IP network              Only IBM SP "switch"
Access control method:        UNIX permission bits (CHMOD)    UNIX permission bits (CHMOD)
Block size:                   256 byte                        512 Kbyte (White)
Stripe width:                 Depends on RAID                 256 Kbyte
Maximum file size:            2 Gbyte (larger with v3)        26 Gbyte
File consistency:
  ...uses client buffering?   Yes                             Yes (see diagram)
  ...uses server buffering?   Yes (see diagram)
  ...uses locking?            No                              Yes (token passing)
  ...lock granularity?                                        Byte range
  ...lock managed by?                                         Requesting compute node
Purged at LC?                 Home, No; Tmp, Yes              Yes
Supports file quotas?         Yes                             No

26

Page 27

27

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 28

MPI-IO Overview

• Initially developed as a research project at the IBM T. J. Watson

Research Center in 1994

• Voted by the MPI Forum to be included in MPI-2 standard (Chapter

9)

• Most widespread open-source implementation is ANL’s ROMIO,

written by Rajeev Thakur (http://www-unix.mcs.anl.gov/romio/ )

• Integrates file access with the message passing infrastructure,

using similarities between send/receive and file write/read

operations

• Allows MPI datatypes to meaningfully describe data layouts in files, instead of dealing with unorganized streams of bytes
• Provides potential for performance optimizations through the mechanism of "hints", collective operations on file data, or relaxation of data access atomicity
• Enables better file portability by offering alternative data representations

28

Page 29

MPI-IO Features (I)

• Basic file manipulation (open/close, delete, space preallocation, resize,

storage synchronization, etc.)

• File views (define what part of a file each process can see and how it is

interpreted)

– Processes can view file data independently, with possible overlaps

– The users may define patterns to describe data distributions both in file and

in memory, including non-contiguous layouts

– Permit skipping over fixed header blocks (“displacements”)

– Views can be changed by tasks at any time

• Data access positioning

– Explicitly specified offsets (suffix “_at”)

– Independent data access by each task via individual file pointers (no suffix)

– Coordinated access through shared file pointer (suffix “_shared”)

• Access synchronism

– Blocking

– Non-blocking (including split-collective operations)

29
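As a small illustration of the three data access positioning styles listed above, the following sketch (not from the slides; the file name and offsets are arbitrary, and error checks are omitted) has each rank write its rank number using an explicit offset, an individual file pointer, and the shared file pointer.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status st;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File_open(MPI_COMM_WORLD, "posdemo.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

    /* 1. explicit offset ("_at"): no file pointer is consulted or updated */
    MPI_File_write_at(fh, rank, &rank, 1, MPI_INT, &st);

    /* 2. individual file pointer (no suffix): each process positions its own pointer */
    MPI_File_seek(fh, 100 + rank, MPI_SEEK_SET);
    MPI_File_write(fh, &rank, 1, MPI_INT, &st);

    /* 3. shared file pointer ("_shared"): one pointer for the whole group, so
       successive calls from different ranks land in disjoint locations */
    MPI_File_write_shared(fh, &rank, 1, MPI_INT, &st);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}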

Page 30

MPI-IO Features (II)

• Access coordination

– Non-collective (no additional suffix)

– Collective (suffix: “_all” for most blocking calls, “_begin” and “_end” for split-

collective, or “_ordered” for equivalent of shared pointer access)

• File interoperability (ensures portability of data representation)

– Native: for purely homogeneous environments

– Internal: heterogeneous environments with implementation-defined data

representation (subset of “external32”)

– External32: heterogeneous environments using data representation defined

by the MPI-IO standard

• Optimization hints (the “info” interface)

– Access style (e.g. read_once, write_once, sequential, random, etc.)

– Collective buffering components (buffer and block sizes, number of target

nodes)

– Striping unit and factor

– Chunked I/O specification

– Preferred I/O devices

• C, C++ and Fortran bindings

30
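As an illustration of the "info" hint interface described above, the sketch below (not from the slides; the file name and hint values are arbitrary examples) passes several of the reserved MPI-2 hint keys at open time; an implementation is free to ignore any hint it does not understand.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style",    "write_once,sequential");
    MPI_Info_set(info, "cb_buffer_size",  "16777216");   /* 16 MB collective buffer */
    MPI_Info_set(info, "cb_nodes",        "4");          /* aggregator nodes        */
    MPI_Info_set(info, "striping_factor", "8");          /* stripe across 8 devices */
    MPI_Info_set(info, "striping_unit",   "1048576");    /* 1 MB stripe unit        */

    MPI_File_open(MPI_COMM_WORLD, "hinted.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);          /* the file keeps its own copy of the hints */

    /* ... write data ... */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}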

Page 31

MPI-IO Types

• Etype (elementary datatype): the unit of data access and

positioning; all data accesses are performed in etype units and

offsets are measured in etypes

• Filetype: basis for partitioning the file among processes: a

template for accessing the file; may be identical to or derived

from the etype

31
Source: http://www.mhpcc.edu/training/workshop2/mpi_io/MAIN.html

Page 32

MPI-IO File Views

A view defines the current set of data visible and accessible from an open file

as an ordered set of etypes

• Each process has its own view of the file, defined by: a displacement, an etype, and

a filetype

• Displacement: an absolute byte position relative to the beginning of file; defines

where a view begins

32
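A common use of views is sketched below (not from the slides; the file name, sizes and the even partitioning are assumptions): each process builds a subarray filetype describing its contiguous share of a 1-D file of integers, sets its view, and writes with one collective call.

#include <mpi.h>

int main(int argc, char **argv)
{
    enum { LSIZE = 1000 };              /* elements per process (assumed) */
    int local[LSIZE], rank, nprocs, gsize, start, lsize = LSIZE, i;
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (i = 0; i < LSIZE; i++) local[i] = rank;

    gsize = LSIZE * nprocs;             /* global element count      */
    start = LSIZE * rank;               /* this process' first index */
    /* filetype: a window of lsize ints at position start within gsize */
    MPI_Type_create_subarray(1, &gsize, &lsize, &start,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "viewdemo.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* displacement 0, etype MPI_INT, filetype exposes only this rank's region */
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, LSIZE, MPI_INT, &st);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}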

Page 33

33

MPI-IO: File Open

Function: MPI_File_open()

int MPI_File_open(MPI_Comm comm, char *filename, int amode,

MPI_Info info, MPI_File *fh);

Description: Opens the file identified by filename on all processes in the comm group, using the access mode specified in amode. The operation is collective; all participating processes must pass identical values for amode and use a filename referencing the same file. A successful call returns the open file handle in fh, which can be used to subsequently access the file.

It is possible to open a file independently of other processes by passing MPI_COMM_SELF in the comm argument.

#include <mpi.h>
...
MPI_File fh;
int err;
...
/* create a writable file with default parameters */
err = MPI_File_open(MPI_COMM_WORLD, "/mnt/piofs/testfile", MPI_MODE_CREATE|MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);
if (err != MPI_SUCCESS) {/* handle error here */}
...


Page 34

34

MPI-IO: File Close

Function: MPI_File_close()

int MPI_File_close(MPI_File *fh);

Description: Synchronizes the file state (equivalent to an implicit invocation of MPI_File_sync), and then closes the file associated with handle fh. The user must ensure that all outstanding non-blocking requests and split-collective operations associated with handle fh have completed. If the file was opened with access mode MPI_MODE_DELETE_ON_CLOSE, it is deleted from the file system.

#include <mpi.h>

...

MPI_File fh;

int err;

...

/* open a file storing the handle in fh */

/* perform file access */

...

err = MPI_File_close(&fh);

if (err != MPI_SUCCESS) {/* handle error here */}

...


Page 35

35

MPI-IO: Set File View

Function: MPI_File_set_view()

int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,

MPI_Datatype filetype, char *datarep, MPI_Info info);

Description: Changes the process' view of the data in the file, setting the start of the view to disp, the type of file data to etype, the distribution of file data to processes to filetype, and the data representation to datarep. Resets the individual and shared file pointers to zero. The call is collective, requiring the values for datarep and the etype extents to be identical on all processes. The data representation must be one of: "native", "internal" or "external32".

#include <mpi.h>
...
MPI_File fh;
int err;
...
/* open file storing the handle in fh */
...
/* view the file as a stream of integers with no header, using native data representation */
err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
if (err != MPI_SUCCESS) {/* handle error */}
...


Page 36

36

MPI-IO: Read File with Explicit Offset

Function: MPI_File_read_at()

int MPI_File_read_at(MPI_File fh, MPI_Offset offs, void *buf, int count,

MPI_Datatype type, MPI_Status *status);

Description: Reads count elements of type type from the file represented by fh at offset offs, storing them in the buffer pointed to by buf. The offset offs is expressed in etype units relative to the current view associated with the file handle fh. A successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int buf[3], err;
...
/* open file storing the handle in fh */
...
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
/* read the third triad of integers from the file */
err = MPI_File_read_at(fh, 6, buf, 3, MPI_INT, &stat);
...


Page 37

37

MPI-IO: Write to File with Explicit Offset

Function: MPI_File_write_at()

int MPI_File_write_at(MPI_File fh, MPI_Offset offs, void *buf, int count,
                      MPI_Datatype type, MPI_Status *status);

Description: Writes count elements of type type from buffer buf to the file represented by fh at offset offs. The offset offs is expressed in etype units relative to the current view associated with the file handle fh. A successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int err;
double dt = 0.0005;
...
/* open file storing the handle in fh */
...
MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
/* store timestep as the first item in file */
err = MPI_File_write_at(fh, 0, &dt, 1, MPI_DOUBLE, &stat);
...


Page 38

38

MPI-IO: Read File Collectively with Individual File Pointers

Function: MPI_File_read_all()

int MPI_File_read_all(MPI_File fh, void *buf, int count, MPI_Datatype type,
                      MPI_Status *status);

Description: All processes in the communicator group associated with the file handle fh read their respective count elements of type type from the file, at the offsets determined by the current values of the file pointers cached on their file handles, storing them in the buffers pointed to by buf. A successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int buf[20], err;
...
/* open file storing the handle in fh */
...
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
/* read 20 integers at current file offset in every process */
err = MPI_File_read_all(fh, buf, 20, MPI_INT, &stat);
...


Page 39

39

MPI-IO: Write to File Collectively with Individual File Pointers

Function: MPI_File_write_all()

int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype type,
                       MPI_Status *status);

Description: All processes in the communicator group associated with the file handle fh write their respective count elements of type type from buffers buf to the file, at the offsets determined by the current values of the file pointers cached on their file handles. A successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
double t;
int err, rank;
...
/* open file storing the handle in fh; compute t */
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* interleave time values t from each process at the beginning of file */
MPI_File_set_view(fh, rank*sizeof(t), MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
err = MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
...


Page 40

40

MPI-IO: File Seek

Function: MPI_File_seek()

int MPI_File_seek(MPI_File fh, MPI_Offset offs, int whence);

Description: Updates the value of the individual file pointer according to whence, which has the following possible values:
• MPI_SEEK_SET: the pointer is set to offs
• MPI_SEEK_CUR: the pointer is set to the current value plus offs
• MPI_SEEK_END: the pointer is set to the end of file plus offs

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
double t;
int rank;
...
/* open file storing the handle in fh; compute t */
...
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* interleave time values t from each process at the beginning of file */
MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
MPI_File_seek(fh, rank, MPI_SEEK_SET);
MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
...


Page 41

MPI-IO Data Access Classification

41
Source: http://www.mpi-forum.org/docs/mpi2-report.pdf

Page 42

Example: Scatter to File

42
Example created by Jean-Pierre Prost from IBM Corp.

Page 43

Scatter Example Source

43

#include "mpi.h"

static int buf_size = 1024;

static int blocklen = 256;

static char filename[] = "scatter.out";

main(int argc, char **argv)

{

char *buf, *p;

int myrank, commsize;

MPI_Datatype filetype, buftype;

int length[3];

MPI_Aint disp[3];

MPI_Datatype type[3];

MPI_File fh;

int mode, nbytes;

MPI_Offset offset;

MPI_Status status;

/* initialize MPI */

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

MPI_Comm_size(MPI_COMM_WORLD, &commsize);

#include "mpi.h"

static int buf_size = 1024;

static int blocklen = 256;

static char filename[] = "scatter.out";

main(int argc, char **argv)

{

char *buf, *p;

int myrank, commsize;

MPI_Datatype filetype, buftype;

int length[3];

MPI_Aint disp[3];

MPI_Datatype type[3];

MPI_File fh;

int mode, nbytes;

MPI_Offset offset;

MPI_Status status;

/* initialize MPI */

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

MPI_Comm_size(MPI_COMM_WORLD, &commsize);

/* initialize buffer */

buf = (char *) malloc(buf_size);

memset(( void *)buf, '0' + myrank, buf_size);

/* create and commit buftype */

MPI_Type_contiguous(buf_size, MPI_CHAR, &buftype);

MPI_Type_commit(&buftype);

/* create and commit filetype */

length[0] = 1;

length[1] = blocklen;

length[2] = 1;

disp[0] = 0;

disp[1] = blocklen * myrank;

disp[2] = blocklen * commsize;

type[0] = MPI_LB;

type[1] = MPI_CHAR;

type[2] = MPI_UB;

MPI_Type_struct(3, length, disp, type, &filetype);

MPI_Type_commit(&filetype);

/* open file */

mode = MPI_MODE_CREATE | MPI_MODE_WRONLY;


Page 44

Scatter Example Source (cont.)

44

MPI_File_open(MPI_COMM_WORLD, filename, mode, MPI_INFO_NULL, &fh);

/* set file view */

offset = 0;

MPI_File_set_view(fh, offset, MPI_CHAR, filetype, "native", MPI_INFO_NULL);

/* write buffer to file */

MPI_File_write_at_all(fh, offset, (void *)buf, 1, buftype, &status);

/* print out number of bytes written */

MPI_Get_elements(&status, MPI_CHAR, &nbytes);

printf( "TASK %d ====== number of bytes written = %d ======\n", myrank, nbytes);

/* close file */

MPI_File_close(&fh);

/* free datatypes */

MPI_Type_free(&buftype);

MPI_Type_free(&filetype);

/* free buffer */

free (buf);

/* finalize MPI */

MPI_Finalize();

}


Page 45

Data Access Optimizations

45

[Figures: Data Sieving; 2-phase I/O (Collective Read Implementation in ROMIO)]
Source: http://www-unix.mcs.anl.gov/~thakur/papers/romio-coll.pdf
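The data-sieving idea can be illustrated with a toy C routine (a conceptual sketch only, not ROMIO's implementation; it assumes the requested offsets are sorted and that the covered span fits in the scratch buffer): instead of issuing many small reads, it reads one contiguous region covering all requested pieces and copies the pieces out of memory.

#include <stdio.h>
#include <string.h>

/* Read n scattered pieces (offsets[i], lens[i]) from file f into out[i]
   using a single large contiguous read followed by in-memory copies. */
void sieved_read(FILE *f, long offsets[], size_t lens[], int n, char *out[])
{
    long lo = offsets[0];
    long hi = offsets[n - 1] + (long)lens[n - 1];
    static char scratch[1 << 20];              /* one large sieving buffer */

    fseek(f, lo, SEEK_SET);
    fread(scratch, 1, (size_t)(hi - lo), f);   /* one big contiguous read */

    for (int i = 0; i < n; i++)                /* extract the wanted pieces */
        memcpy(out[i], scratch + (offsets[i] - lo), lens[i]);
}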

Page 46

ROMIO Scaling Examples

• Bandwidths obtained for 512³ arrays (astrophysics benchmark) on the Argonne IBM SP

46

Write Operations
Processors   Independent I/O   Collective I/O
16           1.26 MB/s         64.8 MB/s
32           1.25 MB/s         69.5 MB/s
48           1.36 MB/s         70.6 MB/s

Read Operations
Processors   Independent I/O   Collective I/O
16           12.8 MB/s         68.5 MB/s
32           6.46 MB/s         82.6 MB/s
48           5.83 MB/s         88.4 MB/s

Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/astro.html

Page 47

Independent vs. Collective Access

47

[Figures: Individual I/O on IBM SP; Collective I/O on IBM SP]
Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/upshot.html

Page 48

48

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 49

Introduction to HDF5

• Acronym for Hierarchical Data Format, a portable, freely distributable, and well

supported library, file format, and set of utilities to manipulate it

• Explicitly designed for use with scientific data and applications

• Initial HDF version was created at NCSA/University of Illinois at Urbana-

Champaign in 1988

• First revision in widespread use was HDF4

• Main HDF features include:

– Versatility: supports different data models and associated metadata

– Self-describing: allows an application to interpret the structure and contents of a file

without any extraneous information

– Flexibility: permits mixing and grouping various objects together in one file in a user-

defined hierarchy

– Extensibility: accommodates new data models, added both by the users and developers

– Portability: can be shared across different platforms without preprocessing or

modifications

• HDF5 is the most recent incarnation of the format, adding support for new type and data models, parallel I/O, and streaming, and removing a number of existing restrictions (maximum file size, number of objects per file, flexibility of type use, storage management configurability, etc.), as well as improving performance

49

Page 50

HDF5 File Layout

• Major object classes: groups and datasets
• Namespace resembles a file system directory hierarchy (groups ≡ directories, datasets ≡ files)
• Alias creation supported through links (both soft and hard)
• Mounting of sub-hierarchies is possible

50

[Figure: HDF5 file layout, user's view vs. low-level organization]

Page 51

HDF5 API & Tools

Library functionality grouped by function

name prefix

• H5: general purpose functions

• H5A: attribute interface

• H5D: dataset manipulation

• H5E: error handling

• H5F: file interface

• H5G: group creation and access

• H5I: object identifiers

• H5P: property lists

• H5R: references

• H5S: dataspace definition

• H5T: datatype manipulation

• H5Z: inline data filters and

compression

51

Command-line utilities

• h5cc, h5c++, h5fc: C, C++ and

Fortran compiler wrappers

• h5redeploy: updates compiler tools

after installation in new location

• h5ls, h5dump: list the hierarchy and contents of an HDF5 file
• h5diff: compares two HDF5 files
• h5repack, h5repart: rearrange or repartition a file
• h5toh4, h4toh5: convert between HDF5 and HDF4 formats
• h5import: imports data into an HDF5 file
• gif2h5, h52gif: convert image data between GIF and HDF5 formats

Page 52

Basic HDF5 Concepts

• Group

– Structure containing zero or more HDF5 objects (possibly other groups)

– Provides a mechanism for mapping a name (path) to an object

– “Root” group is a logical container of all other objects in a file

• Dataset

– A named array of data elements (possibly multi-dimensional)
– Its representation and the way it is stored in the HDF5 file are specified through the associated datatype and dataspace parameters

• Dataspace

– Defines dimensionality of a dataset (rank and dimension sizes)

– Determines the effective subset of data to be stored or retrieved in subsequent file

operations (aka selection)

• Datatype

– Describes atomically accessed element of a dataset

– Permits construction of derived (compound) types, such as arrays, records,

enumerations

– Influences conversion of numeric values between different platforms or

implementations

• Attribute

– A small, user-defined structure attached to a group, dataset or named datatype, providing additional information

52
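To tie the concepts above together, here is a small self-contained sketch using the HDF5 1.6-style C API (the same style as the example later in these slides); the file, group, dataset and attribute names are invented for illustration.

#include "hdf5.h"

int main(void)
{
    hsize_t dims[2] = {4, 6};
    int     data[4][6] = {{0}};
    hid_t   file, grp, space, dset, atype, aspace, attr;

    file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    grp   = H5Gcreate(file, "/results", 0);                 /* group     */
    space = H5Screate_simple(2, dims, NULL);                /* dataspace */
    dset  = H5Dcreate(file, "/results/temperature",         /* dataset   */
                      H5T_NATIVE_INT, space, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* attribute: a short descriptive string attached to the dataset */
    atype  = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, 16);
    aspace = H5Screate(H5S_SCALAR);
    attr   = H5Acreate(dset, "units", atype, aspace, H5P_DEFAULT);
    H5Awrite(attr, atype, "kelvin");

    H5Aclose(attr); H5Sclose(aspace); H5Tclose(atype);
    H5Dclose(dset); H5Sclose(space); H5Gclose(grp); H5Fclose(file);
    return 0;
}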

Page 53

HDF5 Spatial Subset Examples

53
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf

Page 54

HDF5 Virtual File Layer

54
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf

• Developed to cope with large number of available storage subsystem

variations

• Permits custom file driver implementations and related optimizations

Page 55

Overview of Data Storage Options

55
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf

Page 56

Simultaneous Spatial and Type

Transformation Example

56
Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf

Page 57

Simple HDF5 Code Example

57

/* Writing and reading an existing dataset. */

#include "hdf5.h"

#define FILE "dset.h5"

int main() {

hid_t file_id, dataset_id; /* identifiers */

herr_t status;

int i, j, dset_data[4][6];

/* Initialize the dataset. */

for (i = 0; i < 4; i++)

for (j = 0; j < 6; j++)

dset_data[i][j] = i * 6 + j + 1;

/* Open an existing file. */

file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

/* Open an existing dataset. */

dataset_id = H5Dopen(file_id, "/dset");

/* Write the dataset, then read it back. */
status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);
status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);

/* Close the dataset. */

status = H5Dclose(dataset_id);

/* Close the file. */

status = H5Fclose(file_id);

}


Page 58

Parallel HDF5

• Relies on MPI-IO as the file layer driver

• Uses MPI for internal communications

• Most of the functionality controlled through property lists

(requires minimal HDF5 interface changes)

• Supports both individual and collective file access

• Three raw data storage layouts: contiguous, chunking and

compact

• Enables additional optimizations through derived MPI datatypes

(esp. for regular collective accesses)

• Limitations
– Chunked storage with overlapping chunks (results are non-deterministic)
– Read-only compression
– Writes with variable-length datatypes not supported

58
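A minimal sketch of the property-list mechanism described above, assuming an MPI program linked against an HDF5 build with parallel support (the file and dataset names and sizes are invented): the MPI-IO driver is selected on the file access property list, and collective transfer is requested on the dataset transfer property list.

#include "hdf5.h"
#include <mpi.h>

/* Each rank writes its own slab of one global 1-D integer dataset. */
void parallel_write(const char *fname, int *local, hsize_t nlocal)
{
    int rank, nprocs;
    hid_t fapl, dxpl, file, filespace, memspace, dset;
    hsize_t gsize, start;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    gsize = nlocal * (hsize_t)nprocs;
    start = nlocal * (hsize_t)rank;

    /* access the file through the MPI-IO virtual file driver */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one global dataset; each rank selects its own hyperslab of it */
    filespace = H5Screate_simple(1, &gsize, NULL);
    dset = H5Dcreate(file, "/data", H5T_NATIVE_INT, filespace,
                     H5P_DEFAULT);                /* 1.6-style H5Dcreate */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &nlocal, NULL);
    memspace = H5Screate_simple(1, &nlocal, NULL);

    /* request collective transfer on the dataset transfer property list */
    dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Dclose(dset);
    H5Sclose(memspace); H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
}

int main(int argc, char **argv)
{
    int data[1000], i;
    MPI_Init(&argc, &argv);
    for (i = 0; i < 1000; i++) data[i] = i;
    parallel_write("parallel.h5", data, 1000);
    MPI_Finalize();
    return 0;
}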

Page 59

59

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI IO, ROMIO)

• Parallel File Formats (HDF5..)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 60

General Parallel File System (GPFS)

• Brief history:
– Based on the Tiger Shark parallel file system, developed at the IBM Almaden Research Center in 1993 for AIX
• Originally targeted at dedicated video servers
• The multimedia orientation influenced GPFS command names: they all contain "mm"
– First commercial release was GPFS V1.1 in 1998
– Linux port released in 2001; Linux-AIX interoperability supported since V2.2 in 2004
• Highly scalable
– Distributed metadata management
– Permits incremental scaling
• High-performance
– Large block size with wide striping
– Parallel access to files from multiple nodes
– Deep prefetching
– Adaptable mechanism for recognizing access patterns
– Multithreaded daemon
• Highly available and fault tolerant
– Data protection through journaling, replication, mirroring and shadowing
– Ability to recover from multiple disk, node and connectivity failures (heartbeat mechanism)
– Recovery mechanism implemented in all layers

60

Page 61

GPFS Features (I)

61
Source: http://www-03.ibm.com/systems/clusters/software/gpfs.pdf

Page 62

GPFS Features (II)

62

Page 63

GPFS Architecture

63
Source: http://www.redbooks.ibm.com/redbooks/pdfs/sg245610.pdf

Page 64

Components Internal to the GPFS Daemon

• Configuration Manager (CfgMgr)
– Selects the node acting as Stripe Group Manager for each file system
– Checks for the quorum of nodes required for the file system usage to continue
– Appoints a successor node in case of failure
– Initiates and controls the recovery procedure
• Stripe Group Manager (FSMgr, aka File System Manager)
– Strictly one per GPFS file system
– Maintains availability information of the disks comprising the file system (physical storage)
– Processes modifications (disk removals and additions)
– Repairs the file system and coordinates data migration when required
• Metanode
– Manages metadata (directory block updates)
– Its location may change (e.g. a node obtaining access to the file may become the metanode)
• Token Manager Server
– Synchronizes concurrent access to files and ensures consistency among caches
– Manages tokens, or per-object locks
• Mediates token migration when another node requests a token conflicting with an existing token (token stealing)
– Always located on the same node as the Stripe Group Manager

64

Page 65

GPFS Management Functions &

Their Dependencies

65
Source: http://www.redbooks.ibm.com/redbooks/pdfs/sg246700.pdf

Page 66

Components External to the GPFS Daemon

• Virtual Shared Disk (VSD, aka logical volume)
– Enables nodes in one SP system partition to share disks with the other nodes in the same system partition
– A VSD node can be a client, a server (owning a number of VSDs, and performing data reads and writes requested by client nodes), or both at the same time
• Recoverable Virtual Shared Disk (RVSD)
– Used together with VSD to provide high availability against node failures reported by Group Services
– Runs recovery scripts and notifies client applications
• Switch (interconnect) Subsystem
– Starts the switch daemon, responsible for initializing and monitoring the switch
– Discovers and reacts to topology changes; reports and services status/error packets
• Group Services
– Fault-tolerant, highly available and partition-sensitive service monitoring and coordinating changes related to another subsystem operating in the partition
– Operates on each node within the partition, plus the control workstation for the partition
• System Data Repository (SDR)
– Location where the configuration data are stored

66

Page 67

Read Operation Flow in GPFS

67

Page 68

Write Operation Flow in GPFS

68

Page 69

Token Management in GPFS

69

• First lock request for an object requires a message from node N1 to the token manager

• Token server grants token to N1 (subsequent lock requests can be granted locally)

• Node N2 requests token for the same file (lock conflict)

• Token server detects conflicting lock request and revokes token from N1

• If N1 was writing to file, the data is flushed to disk before the revocation is complete

• Node N2 gets the token from N1

Page 70

GPFS Write-behind and Prefetch

70

• As soon as the application's write buffer is copied into the local pagepool, the write operation is complete from the client's perspective

• GPFS daemon schedules a worker thread to

finalize the request by issuing I/O calls to the

device driver

• GPFS estimates the number of blocks to read

ahead based on disk performance and rate at

which application is reading the data

• Additional prefetch requests are processed

asynchronously with the completion of the

current read

Page 71

Some GPFS Cluster Models

71

[Figures: Network Shared Disk (NSD) with dedicated server model; Direct attached model; Mixed (NSD and direct attached) model; Joined (AIX and Linux) model]

Page 72

Comparison of GPFS to Other

File Systems

72

Page 73

73

Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI IO, ROMIO)

• Parallel File Formats (HDF5..)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Page 74

74

Summary – Material for the Test

• Need for Parallel I/O (slide 6)

• RAID concepts (slides 8-10)

• Distributed File System Concepts NFS (slides 12, 13)

• Why NFS is bad for parallel I/O (slide 14)

• Parallel File System Concepts (slides 16-19)

• PVFS (slides 20-24)

• MPI-IO concepts & features (slides 29-32)

• MPI-IO API & functionalities (slides 33-41)

Page 75

75