
Bulk Data Transfer Activities

We regard data transfers as “first-class citizens,” just like computational jobs. We have transferred ~3 TB of DPOSS data (2611 x 1.1 GB files) from SRB to UniTree using three different pipeline configurations. The pipelines are built using the Condor and Stork scheduling technologies, and the whole process is managed by DAGMan.

Configuration 2: We used the experimental DiskRouter tool instead of Globus GridFTP for the cache-to-cache transfers. We obtained an end-to-end throughput (from SRB to UniTree) of 20 files per hour (5.95 MB/sec).

[Throughput plot annotations: UniTree not responding; DiskRouter reconfigured and restarted.]

Configuration 1: We used the native file transfer mechanisms of each underlying system (SRB, Globus GridFTP, and UniTree) for the transfers. We described each data transfer with a five-stage pipeline, resulting in a 5 x 2611-node workflow (DAG) managed by DAGMan; a sketch of one file's fragment follows the DAG file listing below. We obtained an end-to-end throughput (from SRB to UniTree) of 11 files per hour (3.2 MB/sec).

[Pipeline diagram: the Submit Site drives transfers from the SRB Server (A) to the SDSC Cache (B) via SRB get, from the SDSC Cache (B) to the NCSA Cache (C) via globus-url-copy, and from the NCSA Cache (C) to the UniTree Server (D) via MSS put.]

DAG file (the same five nodes are repeated for each of the 2611 files X):

Move X from A to B
Remove X from A
Move X from B to C
Remove X from B
Move X from C to D
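To make the fragment concrete, one file's five nodes could be written as Stork data placement jobs in a DAGMan input file roughly as sketched below. The node names, submit file names, and the DATA keyword are our assumptions for illustration, not the actual files used in the experiment.

    # Sketch only: the five DAG nodes for one file X as Stork data placement jobs.
    # A -> B is the SRB get, B -> C the globus-url-copy, C -> D the MSS put.
    DATA MoveX_A_to_B  move_X_A_to_B.stork
    DATA RemoveX_A     remove_X_A.stork
    DATA MoveX_B_to_C  move_X_B_to_C.stork
    DATA RemoveX_B     remove_X_B.stork
    DATA MoveX_C_to_D  move_X_C_to_D.stork
    # X leaves A only once it is safely at B, and leaves B only once it is at C.
    PARENT MoveX_A_to_B CHILD RemoveX_A MoveX_B_to_C
    PARENT MoveX_B_to_C CHILD RemoveX_B MoveX_C_to_D

One such fragment per file yields the 5 x 2611-node DAG described above, with DAGMan handing each data placement node to Stork and enforcing the ordering.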

[Pipeline diagram: the Submit Site drives transfers directly from the SRB Server (A) to the NCSA Cache (C) via SRB get, and from the NCSA Cache (C) to the UniTree Server (D) via MSS put.]

Configuration 3: We skipped the SDSC cache and performed direct SRB transfers from the SRB server to the NCSA cache.

We described each data transfer with a three-stage pipeline, resulting in a 3 x 2611-node workflow (DAG); a sketch of one file's fragment follows the DAG file listing below. We obtained an end-to-end throughput (from SRB to UniTree) of 17 files per hour (5.00 MB/sec).

[Throughput plot annotation: SRB server problem.]

DAG file (the same three nodes are repeated for each of the 2611 files X):

Move X from A to C
Move X from C to D
Remove X from C
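A corresponding sketch for this configuration, again with assumed node and submit file names, needs only three Stork nodes per file.

    # Sketch only: the three DAG nodes for one file X in the direct-transfer pipeline.
    # A -> C is the direct SRB transfer, C -> D the MSS put.
    DATA MoveX_A_to_C  move_X_A_to_C.stork
    DATA MoveX_C_to_D  move_X_C_to_D.stork
    DATA RemoveX_C     remove_X_C.stork
    PARENT MoveX_A_to_C CHILD MoveX_C_to_D
    PARENT MoveX_C_to_D CHILD RemoveX_C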

[Throughput plot annotations: UniTree maintenance; PDQ Expedition.]

Condor: Condor is a specialized workload management system for compute-intensive jobs. Condor provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Condor chooses when and where to run jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. http://www.cs.wisc.edu/condor
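For reference, a minimal Condor submit description looks roughly like the sketch below; the executable and file names are placeholders, not taken from this work.

    # Sketch only: a minimal Condor submit file for one compute job.
    universe   = vanilla
    executable = process_image
    arguments  = x001.in
    output     = x001.out
    error      = x001.err
    log        = x001.log
    queue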

Stork: What a batch system is for computational jobs, Stork is for data placement activities (i.e., transfer, replication, reservation, staging) on the Grid: it schedules, runs, and monitors data placement jobs and ensures that they complete. Stork can easily interact with heterogeneous middleware and end-storage systems and successfully recover from failures. Stork makes data placement a first-class citizen of Grid computing. http://www.cs.wisc.edu/condor/stork
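A Stork data placement job is described in a ClassAd-style submit file. The sketch below follows the Stork submit format as we recall it, so treat the attribute names as an assumption; the host names and paths are placeholders.

    [
      dap_type = "transfer";
      src_url  = "srb://srb-host.sdsc.edu/DPOSS/X";
      dest_url = "gsiftp://cache-host.ncsa.uiuc.edu/scratch/X";
    ]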

DAGMan: DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for Condor. It manages dependencies between jobs at a higher level than the Condor scheduler. DAGMan can now also interact with Stork. http://www.cs.wisc.edu/condor/dagman
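For example, a single DAG can mix Condor jobs and Stork data placement jobs and order them with PARENT/CHILD dependencies; the DATA keyword and all names below are assumptions for illustration.

    # Sketch only: stage data in with Stork, process it with Condor,
    # stage the result out with Stork, all ordered by DAGMan.
    DATA StageIn  stage_in.stork
    JOB  Process  process.submit
    DATA StageOut stage_out.stork
    PARENT StageIn CHILD Process
    PARENT Process CHILD StageOut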

DiskRouter: Moves large amounts of data efficiently (on the order of terabytes). Uses disk as a buffer to aid in large data transfers. Performs application-level routing. Increases network throughput by using multiple sockets and setting TCP buffer sizes explicitly. http://www.cs.wisc.edu/condor/diskrouter

GridFTP: High-performance, secure, reliable data transfer protocol from Globus. http://www.globus.org/datagrid/gridftp.html

SRB (Storage Resource Broker): Client-server middleware that provides a uniform interface for connecting to heterogeneous data resources. http://www.npaci.edu/DICE/SRB

UniTree: NCSA's high-speed, high-capacity mass storage system. http://www.ncsa.uiuc.edu/Divisions/CC/HPDM/unitree

[Throughput plot annotations: SRB server maintenance; SDSC cache reboot & UW CS network outage.]