On evaluating GPFS: research work done at HLRS, by Alejandro Calderon
On evaluating GPFS
Short description
Metadata evaluation: fdtree
Bandwidth evaluation: Bonnie, Iozone, IODD, IOP
GPFS description (http://www.ncsa.uiuc.edu/UserInfo/Data/filesystems/index.html)
General Parallel File System (GPFS) is a parallel file system package developed by IBM.
History: originally developed for IBM's AIX operating system, then ported to Linux systems.
Features:
- Appears to work just like a traditional UNIX file system from the user-application level.
- Provides additional functionality and enhanced performance when accessed via parallel interfaces such as MPI-IO.
- Obtains high performance by striping data across multiple nodes and disks. Striping is performed automatically at the block level, so all files larger than the designated block size are striped.
- Can be deployed in NSD or SAN configurations.
- Clusters hosting a GPFS file system can allow other clusters at different geographical locations to mount that file system.
GPFS evaluation (metadata)
fdtree: used for testing the metadata performance of a file system; it creates several directories and files, across several levels.
Used on:
- Computers: noco-xyz
- Storage systems: local, GPFS
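For illustration, a minimal C sketch of the kind of create/remove metadata workload that fdtree generates (fdtree itself is a shell script, fdtree.bash; the directory and file counts and the base path below are assumptions):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv) {
    const char *base = (argc > 1) ? argv[1] : "/tmp";  /* e.g. a /gpfs workspace */
    const int ndirs = 100, nfiles = 3;                 /* illustrative counts */
    char path[4096];

    /* creation phase: directories, then small files inside them */
    double t0 = now();
    for (int d = 0; d < ndirs; d++) {
        snprintf(path, sizeof(path), "%s/dir.%d", base, d);
        mkdir(path, 0755);
        for (int f = 0; f < nfiles; f++) {
            snprintf(path, sizeof(path), "%s/dir.%d/file.%d", base, d, f);
            int fd = open(path, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0) close(fd);
        }
    }
    double t1 = now();
    printf("creates/sec:  %.1f\n", ndirs * (1 + nfiles) / (t1 - t0));

    /* removal phase: unlink the files, then remove the directories */
    for (int d = 0; d < ndirs; d++) {
        for (int f = 0; f < nfiles; f++) {
            snprintf(path, sizeof(path), "%s/dir.%d/file.%d", base, d, f);
            unlink(path);
        }
        snprintf(path, sizeof(path), "%s/dir.%d", base, d);
        rmdir(path);
    }
    double t2 = now();
    printf("removals/sec: %.1f\n", ndirs * (1 + nfiles) / (t2 - t1));
    return 0;
}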
fdtree [local,NFS,GPFS]
[Figure: operations per second (0-2500) for directory creates, file creates, file removals, and directory removals, running ./fdtree.bash -f 3 -d 5 -o X on /gpfs, /tmp, and /mscratch]
fdtree on GPFS (Scenario 1)
Command: ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...
Scenario 1: several nodes, several processes per node, different subtrees, many small files.
[Diagram: processes P1...Pm on each node, each process working in its own subtree]
fdtree on GPFS (Scenario 1)
Command: ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...
[Figure: operations per second (0-600) for directory creates, file creates, file removals, and directory removals, for the configurations 1n-1p, 4n-4p, 4n-8p, 4n-16p, 8n-8p, and 8n-16p (nodes-processes)]
fdtree on GPFS (Scenario 2)
Command: ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
Scenario 2: several nodes, one process per node, same subtree, many small files.
[Diagram: one process per node (P1...Px), all working in the same subtree]
fdtree on GPFS (Scenario 2)
Command: ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
[Figure: file creates per second (0-45) vs. number of processes (1 per node: 1, 2, 4, 8), comparing processes working in the same directory against processes working in different directories]
Metadata cache on GPFS 'client'
Working in a GPFS directory with 894 entries: ls -als needs to get each file's attributes from the GPFS metadata server. After a couple of seconds, the contents of the cache seem to disappear.

hpc13782 noco186.nec 304$ time ls -als | wc -l
894
real 0m0.466s   user 0m0.010s   sys 0m0.052s

hpc13782 noco186.nec 305$ time ls -als | wc -l
894
real 0m0.222s   user 0m0.011s   sys 0m0.064s

hpc13782 noco186.nec 306$ time ls -als | wc -l
894
real 0m0.033s   user 0m0.009s   sys 0m0.025s

hpc13782 noco186.nec 307$ time ls -als | wc -l
894
real 0m0.034s   user 0m0.010s   sys 0m0.024s
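Each entry listed by ls -als costs roughly one readdir() entry plus one lstat() call, and each lstat() is either served from the client's metadata cache or goes to the metadata server. A minimal C sketch of that access pattern (an illustration of what the timings above are measuring, not the tool itself):

#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>
#include <sys/time.h>

/* Time a readdir+lstat pass over a directory: roughly what ls -als costs. */
int main(int argc, char **argv) {
    const char *dirpath = (argc > 1) ? argv[1] : ".";
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    DIR *d = opendir(dirpath);
    if (!d) return 1;
    struct dirent *e;
    struct stat st;
    char path[4096];
    long n = 0;
    while ((e = readdir(d)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dirpath, e->d_name);
        if (lstat(path, &st) == 0)  /* metadata-server round trip, or client-cache hit */
            n++;
    }
    closedir(d);
    gettimeofday(&t1, NULL);
    double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%ld entries stat'ed in %.3f s\n", n, dt);
    return 0;
}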
fdtree results
Main conclusions:
- Contention at directory level: if two or more processes of a parallel application need to write data, make sure each one uses a different subdirectory of the GPFS workspace (a minimal sketch of this pattern follows).
- Better results than NFS (but lower than the local file system).
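A minimal MPI sketch of the per-process-subdirectory pattern suggested above (the workspace path is hypothetical):

#include <stdio.h>
#include <sys/stat.h>
#include <mpi.h>

/* Each rank creates and works inside its own subdirectory, so ranks do not
   contend on the locking of a single shared directory. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char dir[256];
    snprintf(dir, sizeof(dir), "/gpfs/workspace/rank.%d", rank); /* hypothetical path */
    mkdir(dir, 0755);   /* per-rank subtree: no shared-directory contention */
    /* ... create this rank's files under dir ... */
    MPI_Finalize();
    return 0;
}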
GPFS performance (bandwidth)
Bonnie: reads and writes a 2 GB file (write, rewrite, and read).
Used on:
- Computers: cacau1, noco075
- Storage systems: GPFS
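In essence, Bonnie's sequential tests amount to streaming a large file three times: write it, rewrite it in place (read a block, modify it, write it back), and read it. A minimal C sketch of that sequence, not Bonnie itself (the chunk size and the path are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define CHUNK (1 << 20)   /* 1 MB per I/O call (illustrative) */
#define FSIZE (2L << 30)  /* 2 GB, as in the Bonnie runs */

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv) {
    const char *f = (argc > 1) ? argv[1] : "/gpfs/bonnie.tmp"; /* hypothetical path */
    char *buf = malloc(CHUNK);
    memset(buf, 'x', CHUNK);
    long nchunks = FSIZE / CHUNK;

    /* write */
    int fd = open(f, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    double t0 = now();
    for (long i = 0; i < nchunks; i++) write(fd, buf, CHUNK);
    fsync(fd); close(fd);
    printf("write:   %.2f MB/s\n", (FSIZE / 1048576.0) / (now() - t0));

    /* rewrite: read each chunk, modify it, write it back in place */
    fd = open(f, O_RDWR);
    t0 = now();
    for (long i = 0; i < nchunks; i++) {
        pread(fd, buf, CHUNK, i * (long)CHUNK);
        buf[0] ^= 1;
        pwrite(fd, buf, CHUNK, i * (long)CHUNK);
    }
    fsync(fd); close(fd);
    printf("rewrite: %.2f MB/s\n", (FSIZE / 1048576.0) / (now() - t0));

    /* read */
    fd = open(f, O_RDONLY);
    t0 = now();
    for (long i = 0; i < nchunks; i++) read(fd, buf, CHUNK);
    close(fd);
    printf("read:    %.2f MB/s\n", (FSIZE / 1048576.0) / (now() - t0));
    free(buf);
    return 0;
}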
Bonnie on GPFS [write + re-write]
Bandwidth (MB/s):
            cacau1-GPFS   noco075-GPFS
write          51.86         164.69
rewrite         3.43          36.35
(cacau1 accesses GPFS over NFS)
Bonnie on GPFS [read]
Bandwidth (MB/s):
            cacau1-GPFS   noco075-GPFS
read           75.85         232.38
(cacau1 accesses GPFS over NFS)
GPFS performance (bandwidth)
Iozone: writes and reads with several file sizes and access sizes; measures write and read bandwidth.
Used on:
- Computers: noco075
- Storage systems: GPFS
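The Iozone sweep can be pictured as a loop over record (access) sizes for a given file size, timing each pass. A minimal sketch of that methodology, not Iozone itself (the file size, record sizes, and path are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* Write one file with several record (access) sizes; report MB/s for each. */
int main(void) {
    const long fsize = 64L << 20;           /* 64 MB file (illustrative) */
    const char *path = "/gpfs/iozone.tmp";  /* hypothetical path */
    for (long rec = 4096; rec <= fsize; rec *= 4) {
        char *buf = malloc(rec);
        memset(buf, 'x', rec);
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (long off = 0; off < fsize; off += rec)
            pwrite(fd, buf, rec, off);
        fsync(fd);
        gettimeofday(&t1, NULL);
        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("rec %8ld B: %8.2f MB/s\n", rec, (fsize / 1048576.0) / dt);
        close(fd);
        free(buf);
    }
    return 0;
}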
Iozone on GPFS [write]
[Figure: 3-D surface of write bandwidth on GPFS (0-1200 MB/s) over file sizes from 64 KB to 524288 KB and record lengths (RecLen) from 4 to 16384 bytes]
Iozone on GPFS [read]
[Figure: 3-D surface of read bandwidth on GPFS (0-2500 MB/s) over file sizes from 64 KB to 524288 KB and record lengths (RecLen) from 4 to 16384 bytes]
GPFS evaluation (bandwidth)
IODD: evaluates disk performance (disk and networking) using several nodes; a dd-like command that can be run from MPI.
Used on:
- 2 and 4 nodes
- 4, 8, 16, and 32 processes (1, 2, 3, and 4 per node), writing files of 1, 2, 4, 8, 16, and 32 GB
- using both the POSIX interface and the MPI-IO interface
How IODD works…
[Diagram: processes P1, P2, ..., Pm across the nodes, each streaming its own sequence of blocks a, b, ..., n]
- nodes x: 2 and 4
- processes m: 4, 8, 16, and 32 (1, 2, 3, and 4 per node)
- file size n: 1, 2, 4, 8, 16, and 32 GB
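A hedged sketch of a dd-like MPI-IO writer in the spirit of IODD (not the actual IODD source; following the diagram above, each process streams its own sequence of blocks, and the block size, file size, and paths are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Each rank writes fixed-size blocks to its own file; rank 0 reports the
   aggregate bandwidth. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long chunk = 1 << 20;   /* 1 MB per write (illustrative) */
    const long total = 1L << 30;  /* 1 GB per process (illustrative) */
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    char fname[256];
    snprintf(fname, sizeof(fname), "/gpfs/iodd.%d", rank); /* hypothetical path */
    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (long off = 0; off < total; off += chunk)
        MPI_File_write_at(fh, off, buf, chunk, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate: %.1f MB/s\n",
               nprocs * (total / 1048576.0) / (t1 - t0));
    free(buf);
    MPI_Finalize();
    return 0;
}

The POSIX variant would replace the MPI_File calls with open/pwrite/close on the same per-rank files.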
IODD on 2 nodes [MPI-IO]
[Figure: 3-D surface of GPFS write bandwidth on 2 nodes (0-180 MB/s) vs. processes per node and file size (1-32 GB)]
IODD on 4 nodes [MPI-IO]
[Figure: 3-D surface of GPFS write bandwidth on 4 nodes (0-180 MB/s) vs. processes per node and file size (1-32 GB)]
Differences by using different APIs
[Figure: two 3-D surfaces of GPFS write bandwidth on 2 nodes vs. processes per node and file size (1-32 GB), one for GPFS (2 nodes, POSIX) and one for GPFS (2 nodes, MPI-IO); one panel's bandwidth scale reaches 180 MB/s, the other only 70 MB/s]
IODD on 2 GB [MPI-IO, = directory]
[Figure: GPFS write bandwidth (0-160 MB/s) vs. number of nodes (1, 2, 4, 8, 16, 32), all nodes writing in the same directory]
IODD on 2 GB [MPI-IO, ≠ directory]
[Figure: GPFS write bandwidth (0-160 MB/s) vs. number of nodes (1, 2, 4, 8, 16, 32), each node writing in a different directory]
IODD results
Main conclusions:
- Bandwidth decreases as the number of processes per node grows; beware of multithreaded applications with medium-to-high I/O bandwidth requirements per thread.
- It is very important to use MPI-IO, because this API lets users obtain more bandwidth.
- Bandwidth also decreases beyond 4 nodes; with large files, metadata management seems not to be the main bottleneck.
GPFS evaluation (bandwidth)
IOP: measures the bandwidth obtained by writing and reading in parallel from several processes. The file size is divided by the number of processes, so each process works on an independent part of the file.
Used on:
- GPFS through MPI-IO (ROMIO on Open MPI)
- two nodes writing a 2 GB file in parallel
- on independent files (non-shared) and on the same file (shared)
How IOP works…
2 nodes, m = 2 processes (1 per node), n = 2 GB file size.
[Diagram: file per process (non-shared): each process Pi writes its own n-byte file. Segmented access (shared): a single n-byte file is split into m segments and process Pi writes only segment i.]
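A hedged MPI-IO sketch of the segmented (shared) pattern above, where the file is split into one independent segment per process (not the actual IOP source; the access size and path are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Segmented shared-file write: rank i writes only segment i, so the
   processes never overlap even though they share one file. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long fsize = 2L << 30;      /* 2 GB shared file, as in the tests */
    const long seg = fsize / nprocs;  /* one independent segment per rank */
    const long acc = 128 << 10;       /* access size, e.g. 128 KB */
    char *buf = malloc(acc);
    memset(buf, 'x', acc);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/iop.shared",  /* hypothetical path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    double t0 = MPI_Wtime();
    for (long off = 0; off < seg; off += acc)
        MPI_File_write_at(fh, rank * seg + off, buf, acc,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    double t1 = MPI_Wtime();
    printf("rank %d: %.1f MB/s\n", rank, (seg / 1048576.0) / (t1 - t0));
    /* The non-shared variant differs only in opening a per-rank file with
       MPI_COMM_SELF and writing from offset 0. */
    free(buf);
    MPI_Finalize();
    return 0;
}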
IOP: differences by using shared/non-shared
Writing on file(s) over GPFS
[Figure: bandwidth (0-180 MB/s) vs. access size (1 KB to 1 MB), NON-shared vs. shared]
IOP: differences by using shared/non-shared
Reading on file(s) over GPFS
[Figure: bandwidth (0-200 MB/s) vs. access size (1 KB to 1 MB), NON-shared vs. shared]
[Figures: GPFS writing in non-shared files; GPFS writing in a shared file]
GPFS writing in shared file: the 128 KB magic number
[Figure: bandwidth (0-140 MB/s) vs. access size (1 MB down to 1 KB); series: write, read, Rread, Bread]
IOP results
Main conclusions:
- If several processes write to the same file, even in independent areas, performance decreases.
- With several independent files the results are similar across tests, but with a shared file they are more irregular.
- A magic number appears at 128 KB: it seems that at that access size the internal algorithm changes and the bandwidth increases.