
Cray Data Virtualization Service: a Method for Heterogeneous File System Connectivity
David Wallace and Stephen Sugiyama


Page 1:

Cray Data Virtualization Service

a Method for Heterogeneous File System Connectivity

David Wallace and Stephen Sugiyama

April 14, 2008

Page 2:

Cray Data Virtualization Service

• A little background and history
• Cray DVS Support in CLE 2.1
• What if…


Page 3:

DVS: Background and history

• Concept derived from Cray T3E system call forwarding
  • Focused on the I/O forwarding aspects

• Initial work focused on clusters
• Some design objectives
  • Provide a low(er) cost alternative to having HBAs on all nodes in a cluster
  • Utilize bandwidth and capacity of a cluster's high speed network
  • Provide global access to file systems resident on I/O nodes
  • Provide high performance, parallel file system access
  • Target I/O patterns of High Performance Computing applications (sketched below)
    • Very large block sizes
    • Sequential access
    • Low data re-use
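The targeted pattern looks like the minimal C sketch below: very large, sequential, write-once blocks with essentially no re-use. The file path, block size, and total size are illustrative assumptions, not DVS requirements.

/* Illustrative only: the large-block, sequential, write-once pattern DVS targets.
 * The path, block size, and block count are assumptions for the sketch. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (4 * 1024 * 1024)   /* very large blocks: 4 MiB per write */
#define NUM_BLOCKS 256                 /* 1 GiB total, written once, in order */

int main(void)
{
    char *buf = malloc(BLOCK_SIZE);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 0xAB, BLOCK_SIZE);

    int fd = open("/dvs/checkpoint.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); free(buf); return 1; }

    for (int i = 0; i < NUM_BLOCKS; i++) {
        /* sequential access, no data re-use: each block is written exactly once */
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) { perror("write"); break; }
    }

    close(fd);
    free(buf);
    return 0;
}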


Page 4:

Serial DVS: Multiple Clients to a Single Server

[Figure: client nodes n1 through n5, each mounting /dvs, connect over a high speed network (Ethernet, Quadrics, Myrinet) to a single I/O node (IO1) whose /dvs is served from direct-attached SAN storage (/dev/sda1).]

• DVS uses a client-server model
  • The DVS server(s) stack on the local file system(s) (e.g. EXT3, XFS) managing the disks
• Serial DVS
  • All requests are routed through a single DVS server
  • Provides functionality similar to NFS

Page 5:


Single DVS Server to a Single I/O Device

Open, read, and write requests are passed to the VFS and intercepted by the DVS client; DVS forwards the request to the DVS server.


[Figure: the request is served by I/O node IO1, which has direct-attached SAN storage (/dev/sda1) managed by a local file system (EXT3, XFS).]

• On the server, the request is passed to the local file system (sketched below)
• Metadata and locking operations are local
• Data is read from or written to disk

• Uses standard Linux buffer management
  • Local cache
  • I/O readahead / write-behind
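To make the forwarding flow concrete, here is a conceptual C sketch in which the client marshals an I/O request and the server serves it with an ordinary pread() on its local file system. The request structure and field names are purely illustrative assumptions; this is not the DVS wire format or implementation.

/* Conceptual sketch only: NOT the DVS wire format or implementation.
 * Illustrates a forwarded read being served by the server's local file system. */
#define _XOPEN_SOURCE 700   /* for pread() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record a DVS client might marshal and send over the HSN. */
struct io_request {
    char   path[256];   /* file on the server's local file system */
    off_t  offset;      /* byte offset of the read */
    size_t length;      /* number of bytes requested */
};

/* Server side: the request is handed to the local file system (e.g. EXT3/XFS).
 * Metadata and locking stay local; data moves through the normal Linux
 * buffer cache, so readahead and write-behind apply as usual. */
static ssize_t serve_read(const struct io_request *req, char *out)
{
    int fd = open(req->path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, out, req->length, req->offset);
    close(fd);
    return n;
}

int main(void)
{
    struct io_request req;
    memset(&req, 0, sizeof(req));
    strncpy(req.path, "/etc/hostname", sizeof(req.path) - 1);  /* stand-in for a /dvs file */
    req.offset = 0;
    req.length = 64;

    char buf[64];
    ssize_t n = serve_read(&req, buf);
    if (n > 0)
        printf("served %zd bytes from %s\n", n, req.path);
    return 0;
}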

Page 6:

DVS: Multiple I/O node support

[Figure: clients A through D, each running a DVS client, connect through the DVS storage network (a virtualized file system) to I/O nodes SIO-0 through SIO-3, each running a DVS server and exporting /dvs.]

Page 7:

DVS: Parallel file system

[Figure: clients A through D run DVS clients; a write('/x/y/z') I/O request (parts A, B, C, D) is forwarded in parallel to DVS servers on SIO-0 through SIO-3, each mounting /dvs. In aggregate, the servers provide buffer cache, readahead, and write-behind.]
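One simple way to picture how a striped request could be divided across the four servers is a round-robin mapping from file offset to a (server, server-local offset) pair. The stripe size, server count, and mapping in this C sketch are illustrative assumptions, not documented DVS parameters.

/* Illustrative striping arithmetic only; the stripe size, server count, and
 * round-robin mapping are assumptions, not documented DVS behaviour. */
#include <stdio.h>

#define STRIPE_SIZE (1 * 1024 * 1024)   /* assumed 1 MiB stripe */
#define NUM_SERVERS 4                   /* SIO-0 .. SIO-3 */

/* Map a file offset to (server index, offset within that server's stripes). */
static void map_offset(long long offset, int *server, long long *local_off)
{
    long long stripe = offset / STRIPE_SIZE;            /* which stripe, globally */
    *server    = (int)(stripe % NUM_SERVERS);           /* round-robin over servers */
    *local_off = (stripe / NUM_SERVERS) * STRIPE_SIZE   /* whole stripes already on this server */
               + (offset % STRIPE_SIZE);                /* position within the current stripe */
}

int main(void)
{
    /* Sample offsets from a large write('/x/y/z'). */
    long long offsets[] = { 0, STRIPE_SIZE, 5LL * STRIPE_SIZE, 15LL * STRIPE_SIZE };
    for (int i = 0; i < 4; i++) {
        int server;
        long long local_off;
        map_offset(offsets[i], &server, &local_off);
        printf("file offset %10lld -> SIO-%d, server-local offset %lld\n",
               offsets[i], server, local_off);
    }
    return 0;
}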

Page 8:

SO, WHERE ARE WE TODAY?


Page 9:

Cray XT Scalable Lustre I/O: Direct Attached

Cray XT Supercomputer:
• Compute nodes
• Login nodes
• Lustre OSS
• Lustre MDS
• NFS Server
• Boot/SDB node

[Figure: 1 GigE backbone; 10 GigE backup and archive servers; Lustre global parallel filesystem (direct attached).]

• Each compute node runs a Lustre client
• The NFS server allows the Lustre filesystem to be exported to other systems in the network

Page 10:

Cray XT accessing Lustre and Remote Filesystems

Cray XT Supercomputer:
• Compute nodes
• Login nodes
• Lustre OSS
• Lustre MDS
• NFS Clients
• Boot/SDB node

[Figure: 1 GigE backbone; Lustre global parallel filesystem.]

• Each Compute and Login node runs a Lustre client
  • Lustre is accessible by every node within the Cray XT system
• Each Login node imports the remote filesystem
  • The filesystem is only accessible from Login nodes
  • For global accessibility, login nodes need to copy files into Lustre

Remote System: an NFS server exports the remote filesystem

Page 11:

Why not use NFS throughout?

Cray XT Supercomputer:
• Compute nodes
• Login nodes
• Lustre OSS
• Lustre MDS
• NFS Clients
• Boot/SDB node

[Figure: 1 GigE backbone; Lustre global parallel filesystem.]

• Each Compute and Login node runs a Lustre client
  • Lustre is accessible by every node within the XT system
• Each Login AND Compute node imports the remote filesystem
  • The remote filesystem is accessible from all nodes
  • No copies are required for global accessibility

Remote System: an NFS server exports the remote filesystem

Page 12:

Issues with NFS

• Cray XT systems have thousands of compute nodes
  • A single NFS server typically cannot manage more than 50 clients
  • NFS servers could be cascaded
• There can only be a single primary NFS server
• The other servers would run as clients on the incoming side and servers on the outbound side


• Cascaded servers would have to run in user space
  • Complicates system administration
• Performance impacts
  • NFS clients run as daemons, introducing OS jitter on the compute nodes
  • The primary NFS server will be overwhelmed
  • The TCP/IP protocol is less efficient than the native protocol on the Cray SeaStar network

Page 13:

Immediate Need

• Access to NFS-mounted file systems (/home on login nodes)
  • For customers migrating from CVN, equivalent functionality to YOD I/O


Page 14:

DVS: Support for NFS in CLE 2.1

[Figure: clients A through D, each running a DVS client, connect through the DVS storage network (a virtualized file system) to DVS servers on SIO-0 through SIO-3, which access an NFS file system.]

Page 15:

Cray DVS – from the admin point of view

[Figure: in a Cray XT4 system, compute nodes running DVS clients connect over the HSN to an SIO node running the DVS server, which NFS-mounts the file system from an external NFS server.]

• Admin mounts file systems in fstab per usual:

mount -t dvs /nfs/user/file4


Page 16:

From the User's Point-of-View

[Figure: on a compute node, an application issues open(), read(), write(), and close() calls on /nfs/home/dbw/file; the file system appears to be local.]
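Because the projected file system appears local, unmodified POSIX code runs as-is on the compute node. A minimal C sketch, using the slide's path as a stand-in:

/* Minimal sketch: ordinary POSIX I/O works unchanged on a DVS-projected path.
 * The path is the example from the slide; any regular file path behaves the same. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/nfs/home/dbw/file";      /* projected by DVS onto the compute node */

    int fd = open(path, O_RDWR | O_CREAT, 0644);  /* open() is intercepted and forwarded */
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "hello from a compute node\n";
    if (write(fd, msg, sizeof(msg) - 1) < 0) { perror("write"); return 1; }

    if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); return 1; }

    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) { perror("read"); return 1; }
    printf("read back %zd bytes\n", n);

    close(fd);                                    /* close() completes like any local file */
    return 0;
}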

Page 17:

So what if …

• Users have data on one or more file systems (e.g. EXT3, XFS, GPFS, Panasas, NFS) on servers in the data center, which they want to access from the compute nodes on the Cray XT system WITHOUT having to copy files?

Then …



• You install Cray DVS (Data Virtualization Service) server software on each of the existing file servers and DVS client software on each of the compute nodes; the admin can then "mount -t dvs" the file systems

Page 18:

DVS Concept in a Cray XT Environment

[Figure: inside a Cray XT4, compute nodes running DVS clients connect over the SeaStar interconnect to SIO nodes running DVS servers. The SIO nodes reach external storage: an NFS server, a SAM-QFS server, a GPFS server, Panasas (PanFS), and direct-attached storage (DAS) with EXT3/4 or XFS; the SIO nodes run the corresponding client software (NFS, PanFS, GPFS).]

Page 19:

Cray DVS - Summary

• No file copying required!
• Simplicity of the DVS client allows for larger scale than NFS or cluster file systems
• DVS can amplify the scale of compute nodes serviced to O(10000)
  • Can project a file system beyond the limit of the underlying clustered file system
    • GPFS on Linux is limited to 512 clients
• The seek, latency, and transfer time (physical disk I/O) for every I/O node is overlapped (mostly parallel)


• Every I/O node does read-ahead and write aggregation (in parallel)
• The effective page cache size is the aggregate size of all of the I/O nodes' page caches
• Allows the interconnects to be utilized more fully:
  • Multiple I/O nodes can drive a single app node's interconnect at its maximum speed
  • Multiple app nodes can drive all of the I/O node interconnects at their maximum speed
• Takes advantage of RDMA on interconnects that support it (Cray SeaStar, Quadrics, Myricom)

Page 20:

Cray DVS – Initial Customer Usage

• ORNL
  • Began field trial in December 2007
  • Installed in production on ~7200 XT3 cores
  • Replacement for Catamount YOD-I/O functionality
  • Mounting NFS-mounted /home file systems on XT compute nodes
• CSC – Finland
  • Working with Cray to test DVS with GPFS
  • Installed on a TDS and undergoing development and testing
• CSCS
  • Begin early access testing in 2Q2008

Page 21:

Acknowledgements

• David Henseler, Cray Inc
• Kitrick Sheets, KBS Software
• Jim Harrell, Cray Inc
• Chris Johns, Cassatt Corp
• The DVS Development Team
  • Stephen Sugiyama
  • Tim Cullen
  • Brad Stevens
  • And others


Page 22:


Thank You