a comparative experimental study of parallel file systems for … · 2019-02-25 · a comparative...

23
A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas Institute of Computer Science (ICS) Foundation for Research and Technology Hellas (FORTH) Heraklion, Crete, Greece

Upload: others

Post on 26-Feb-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

A Comparative Experimental Study of

Parallel File Systems for Large-Scale Data

Processing

Z. Sebepou, K. Magoutis, M. Marazakis, A. Bilas

Institute of Computer Science (ICS)

Foundation for Research and Technology Hellas (FORTH)

Heraklion, Crete, Greece

Page 2: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 2

Evolution

• Distributed file systems

– NFS versions 2, 3, 4; AFS; etc.

• Shared-disk parallel file systems

– Frangipani/Petal; GPFS; GFS; etc.

• Separating data/metadata paths to object storage

– NASD; pNFS; Panassas; Lustre; PVFS; etc.

Page 3: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 3

NASD-style Parallel File Systems

Open, Lookup, etc.Read, Write

Page 4: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 4

Lustre, PVFS

• Open-source systems, following the NASD paradigm

– Lustre: Cluster File Systems, Inc.; acquired (2007) by Sun

– PVFS: Clemson University, ANL, OSC

• Targeted for large-scale data processing

– LLNL, ORNL, ANL, CERN

• Representative of different approaches to filesystem design

– Client caching

– Statelessness

– Consistency and file access semantics

– Portability

Page 5: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 5

Lustre Architecture

Linux ext3 (modified)Linux ext3 (modified)

Intent-based locking

POSIX semantics

Pre-allocation of MD objs

active/backup

Stripping

Data/MD

CachingTCP/IP, RDMA, etc.

Kernel-based

Client Structure

Page 6: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 6

PVFS2 Architecture

Stripping

No Data Caching

DBPF (“database plus files”)Any Linux file system

No file locking;

“non-conflicting writes”

multiple active

User-level

Client Structure

TCP/IP, RDMA, etc.

Page 7: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 7

Benchmarks

• Streaming; one client, many servers

– IOZone

• Streaming: many clients, many servers

– Parallel I/O (MPI)

• Metadata-intensive

– Cluster PostMark

• Near-random I/O, optionally with data overlap

– Tile I/O (MPI)

• User-perceived response time

– ls –lR on Linux kernel tree

Page 8: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 8

Experimental Testbed

PVFS 2.6.3

Lustre 1.6.0.1

MPICH 2Metadata Server

Object Server Object Server Object Server Object Server

Linux ext3 or

modified ext3

Lustre stripe: 64KB, 256KB, 1MB (default)

PVFS2 stripe: 64KB (default)

Page 9: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 9

Streaming: One Client, Many Servers

(IOZone Benchmark)

PVFS 2.6.3

Lustre 1.6.0.1

MPICH 2Metadata Server

Object Server Object Server Object Server Object Server

Linux ext3 or

modified ext3

Lustre stripe: 64KB, 256KB, 1MB (default)

PVFS2 stripe: 64KB (default)

1, 4 threads

Block sizes 1KB, 64KB, 1MB, 4MB

Page 10: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 10

Streaming: One Client, Many Servers

(IOZone Benchmark) – Lustre

4 Threads

1 Thread

1 client, 12 servers

95% CPU

70% CPU

Page 11: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 11

Streaming: One Client, Many Servers

(IOZone Benchmark) – PVFS

1 Thread

4 Threads

1 client, 12 servers

50% CPU

20% CPU

Page 12: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 12

Streaming: Many Clients, Many Servers

(Parallel I/O Benchmark)

PVFS 2.6.3

Lustre 1.6.0.1

MPICH 2Metadata Server

Object Server Object Server Object Server Object Server

Linux ext3 or

modified ext3

Lustre stripe: 64KB, 256KB, 1MB (default)

PVFS2 stripe: 64KB (default)

Block sizes 64KB-16MB

1. Writes to separate files

2. Reads from single file

3. Read-modify-write to single file

Page 13: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 13

Streaming: Many Clients, Many Servers

(Parallel I/O Benchmark)

Lustre

PVFS2

Writes to Separate Files; 12 clients, 12 servers

Page 14: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 14

Streaming: Many Clients, Many Servers

(Parallel I/O Benchmark)

Lustre

PVFS2

Reads from Single File; 12 clients, 12 servers

Lustre

PVFS2

Page 15: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 15

Cluster PostMark

• Three configurations:

(a) Few (100), large (10-100MB) files

(b) Moderate number (800) of medium-size (1-10MB) files

(c) Many (8000), small (4-128KB) files

Page 16: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 16

Cluster PostMark – Bandwidth

Block Size

12 clients, 12 servers

Page 17: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 17

Cluster PostMark – Transactions

Block Size

12 clients, 12 servers

Page 18: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 18

Near-random I/O, optionally with data overlap

(Tile I/O)

block element tile

• two-dimensional logical structure overlaid on a single file

• each tile assigned to a separate client process

Page 19: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 19

Tile I/O - Reads

Page 20: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 20

Tile I/O - Writes

Page 21: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 21

User-Perceived Response Time

80PVFS2

58Lustre

5.5Linux ext3 (local)

Response Time (sec)File System

ls –lR on Linux 2.6.12 kernel tree (~25,000 files)

Page 22: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 22

Conclusions

• Scalable I/O bandwidth is achievable through

parallel I/O paths to file servers

• Lustre’s efficient metadata management is critical

for metadata-intensive applications

• Lustre’s consistency semantics are useful to some

applications but cause unnecessary overhead to

others that do not require them

Page 23: A Comparative Experimental Study of Parallel File Systems for … · 2019-02-25 · A Comparative Experimental Study of Parallel File Systems for Large-Scale Data Processing Z. Sebepou,

FORTH-ICS First USE�IX LaSCo Workshop 2008 23

Thank You!