Bulk Data Transfer Activities

We regard data transfers as “first class citizens,” just like computational jobs. We have transferred ~3 TB of DPOSS data (2611 files, 1.1 GB each) from SRB to UniTree using three different pipeline configurations. The pipelines are built using the Condor and Stork scheduling technologies, and the whole process is managed by DAGMan.

Configuration 1: We used the native file transfer mechanisms of the underlying systems (SRB, Globus GridFTP, and UniTree) for the transfers. We described each data transfer with a five-stage pipeline, resulting in a 5 x 2611 node workflow (DAG) managed by DAGMan; an illustrative per-file fragment is sketched after the DAG File excerpt below. We obtained an end-to-end throughput (from SRB to UniTree) of 11 files per hour (3.2 MB/sec).

Configuration 2: We used the experimental DiskRouter tool instead of Globus GridFTP for the cache-to-cache transfers. We obtained an end-to-end throughput (from SRB to UniTree) of 20 files per hour (5.95 MB/sec).
[Diagram: pipeline topology for Configurations 1 and 2. From the Submit Site, data flows from the SRB Server (A) to the SDSC Cache (B) via SRB get, from the SDSC Cache to the NCSA Cache (C) via globus-url-copy (DiskRouter in Configuration 2), and from the NCSA Cache to the UniTree Server (D) via MSS put.]
DAG File (excerpt). For each file X, the DAG contains five nodes:
Move X from A to B
Remove X from A
Move X from B to C
Remove X from B
Move X from C to D
This five-node pattern is repeated for each of the 2611 files.
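To show how such a pipeline is handed to DAGMan, the five nodes for one file might be written roughly as below. This is only a sketch: the node and submit-file names are made up, and it assumes the DATA keyword that DAGMan used to designate Stork-managed data placement jobs; the actual DAG generated for these transfers may have looked different.

  # Five Stork data placement nodes for one file (illustrative names)
  DATA xfer_A_B  move_A_to_B.stork
  DATA rm_A      remove_from_A.stork
  DATA xfer_B_C  move_B_to_C.stork
  DATA rm_B      remove_from_B.stork
  DATA xfer_C_D  move_C_to_D.stork
  # Each copy must finish before its source is cleaned up and before the next hop starts
  PARENT xfer_A_B CHILD rm_A xfer_B_C
  PARENT xfer_B_C CHILD rm_B xfer_C_D

DAGMan then runs the full 5 x 2611 node workflow subject to these dependencies.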
[Diagram: pipeline topology for Configuration 3. From the Submit Site, data flows directly from the SRB Server (A) to the NCSA Cache (C) via SRB get, and from the NCSA Cache to the UniTree Server (D) via MSS put; the SDSC cache is skipped.]
Configuration 3: We skipped the SDSC cache and performed direct SRB transfers from the SRB server to the NCSA cache. We described each data transfer with a three-stage pipeline, resulting in a 3 x 2611 node workflow (DAG). We obtained an end-to-end throughput (from SRB to UniTree) of 17 files per hour (5.00 MB/sec).
DAG File (excerpt). For each file X, the DAG contains three nodes:
Move X from A to C
Move X from C to D
Remove X from C
This three-node pattern is repeated for each of the 2611 files.
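Under the same assumptions as the earlier sketch (illustrative names, DATA nodes for Stork jobs), the per-file fragment for this configuration shrinks to three nodes:

  DATA xfer_A_C  move_A_to_C.stork
  DATA xfer_C_D  move_C_to_D.stork
  DATA rm_C      remove_from_C.stork
  PARENT xfer_A_C CHILD xfer_C_D
  PARENT xfer_C_D CHILD rm_C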
[Figure annotations, marking events encountered during the transfers: UniTree not responding; DiskRouter reconfigured and restarted; SRB server problem; UniTree maintenance; PDQ Expedition; SRB server maintenance; SDSC cache reboot and UW CS network outage.]
Condor is a specialized workload management system for compute-intensive jobs. Condor provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Condor chooses when and where to run jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. http://www.cs.wisc.edu/condor
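For reference (this example is not from the poster), a compute job is handed to Condor with a small submit description file; the executable and file names here are hypothetical:

  # Minimal Condor submit description for a single compute job
  universe   = vanilla
  executable = process_file
  arguments  = x.dat
  log        = process_file.log
  output     = process_file.out
  error      = process_file.err
  queue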
Stork: What a batch system is for computational jobs, Stork is for data placement activities (i.e., transfer, replication, reservation, staging) in the Grid: it schedules, runs, and monitors data placement jobs and ensures that they complete. Stork can interact with heterogeneous middleware and end-storage systems easily and recover from failures successfully. Stork makes data placement a first class citizen of Grid computing. http://www.cs.wisc.edu/condor/stork
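A Stork data placement job is likewise described in a small ClassAd-style submit file. A minimal sketch, assuming the dap_type, src_url, and dest_url attributes used in Stork examples; the hosts and paths below are made up:

  [
    dap_type = "transfer";
    src_url  = "srb://srb-server.sdsc.edu/collection/x.dat";
    dest_url = "gsiftp://cache.ncsa.uiuc.edu/staging/x.dat";
  ]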
DAGMan: DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for Condor. It manages dependencies between jobs at a higher level than the Condor scheduler. DAGMan can now also interact with Stork. http://www.cs.wisc.edu/condor/dagman
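This means a single DAG can mix ordinary Condor jobs and Stork data placement jobs. A hedged sketch with hypothetical names, again assuming the DATA keyword for Stork nodes:

  DATA stage_in  stage_in.stork
  JOB  analyze   analyze.submit
  PARENT stage_in CHILD analyze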
DiskRouter: Moves large amounts of data efficiently (on the order of terabytes). Uses disk as a buffer to aid in large data transfers. Performs application-level routing. Increases network throughput by using multiple sockets and setting TCP buffer sizes explicitly. http://www.cs.wisc.edu/condor/diskrouter
GridFTP: High-performance, secure, reliable data transfer protocol from Globus. http://www.globus.org/datagrid/gridftp.html
SRB (Storage Resource Broker): Client-server middleware that provides a uniform interface for connecting to heterogeneous data resources. http://www.npaci.edu/DICE/SRB
UniTree: NCSA’s high-speed, high-capacity mass storage system. http://www.ncsa.uiuc.edu/Divisions/CC/HPDM/unitree