UCSD Data Transfer in Bandwidth Challenge of SuperComputing 2009

Haifeng Pi, Frank Wuerthwein, Terrence Martin, Igor Sfiligoi, Abhishek Rana, James Letts University of California, San Diego

1. Introduction

At SuperComputing 2009 (SC09) [1], a group of institutions led by the California Institute of Technology (Caltech) participated in the Bandwidth Challenge [2] and won the award. This exhibit demonstrated several important applications and technologies for massive data transfer that play crucial roles in the distributed computing strategy for the Large Hadron Collider (LHC) [3] at CERN and in other data-intensive scientific experiments.

The primary demonstration was conducted under the program "Moving Towards Terabit/Sec Transfers of Scientific Datasets: The LHC Challenge" [4], which achieved a throughput of 119 gigabits per second (Gb/s) with dataflow between the show floor and institutions including Caltech, University of Michigan, University of California, San Diego (UCSD), University of Florida (UFL), Fermilab, Brookhaven, CERN, Rio de Janeiro State University (UERJ), São Paulo State University (UNESP), Kyungpook National University (KISTI), National Institute of Chemical Physics and Biophysics, Estonia (NICPB) and National University of Sciences and Technology, Pakistan (NUST).

“The focus of the exhibit was the HEP team's record-breaking demonstration of storage-to-storage data transfer over wide area networks from two racks of servers and a network switch-router on the exhibit floor... By setting new records for sustained data transfer among storage systems over continental and transoceanic distances using simulated LHC datasets, the HEP team demonstrated its readiness to enter a new era in the use of state-of-the-art cyber infrastructure to enable physics discoveries at the high energy frontier.” [5]

This paper summarizes the experience of UCSD's participation in the SC09 Bandwidth Challenge (SC09 BWC). Technical attention was paid to several aspects of primary interest to us:

The setup of the test cluster with the Fast Data Transfer (FDT) [6] client and server. A portion of the production cluster at the UCSD Tier-2 was devoted to SC09.

The configuration of the local network and the 10 Gb/s links between the SC09 show floor and the UCSD test cluster. Two 10 Gb/s links were enabled for the massive data transfers.

Benchmarking of the Hadoop distributed file system (HDFS) [7] under massive data transfer using FDT. The UCSD Tier-2 cluster uses HDFS for its storage. The interface between the network and the storage system has emerged as a major area for possible improvement of data transfer throughput, both by tuning the configuration of the applications and by improving the architecture of the system.


It should be emphasized that this paper does not attempt a "complete" technical evaluation of HDFS I/O performance or of the FDT tool in the context of the bandwidth challenge, partly because both HDFS and FDT are still evolving. Nevertheless, most of our activities aimed at characterizing the performance of the system and finding clues to explain the results, which should be helpful for future improvements in similar cases or for the generic use of the related technologies and tools.

The rest of the paper is organized as follows: Section 2 summarizes the setup of the UCSD system for SC09; Section 3 describes the major test activities; Section 4 discusses the results of the bandwidth challenge with respect to the UCSD cluster, including a comparison with data transfers based on alternative storage systems and conducted by other institutions; Section 5 covers follow-up test activities after SC09 that give a better understanding of the characteristics of the system and tools involved; Section 6 provides an overall summary of our experience and future plans for increasing the throughput of the system.

2. Network, System, and Software Setup for SC09 at UCSD

UCSD had two 10 Gb/s connections, out of the fifteen total links, to the SC09 BWC show floor at the Caltech booth:

A Layer 2 connection via the Science Data Network (SDN) provided by the Energy Sciences Network (ESnet) [8]. The switch of the UCSD Physics Tier-2 cluster has a trunked 10 Gb/s fiber to the San Diego ESnet router located at the San Diego Supercomputer Center (SDSC). Two Virtual LANs (VLANs) exist, one of which is in working condition and provides a Border Gateway Protocol (BGP) routed path to the router at Fermilab. This path is used exclusively for transferring CMS experiment datasets between Fermilab, the CMS Tier-1, and the UCSD Physics Tier-2.

Changes to this configuration were needed to allow a portion of the network traffic from predefined UCSD cluster nodes to send data to (and receive data from) the SC09 show floor. A point-to-point (P2P) connection was established via static routing using source and destination IP addresses. A temporary VLAN was configured between two routers, one at the SC09 Caltech booth and one at the physics department where the UCSD Physics Tier-2 is located.

This link has a maximum bandwidth of 10 Gb/s in one direction and is shared with other sites that use ESnet. Our previous experience shows that more than 75% of the bandwidth is achievable for a single test.

A Layer 3 connection via National LambdaRail (NLR) FrameNet [9] and the Corporation for Education Network Initiatives in California (CENIC) [10]. The connection is divided into two segments: one between UCSD and Caltech through CENIC, and another between Caltech and SC09 through NLR FrameNet. The second segment is configured as P2P with specified IP addresses.

A bandwidth of 5-10 Gb/s was expected on this link.

The networking equipment on the SC09 show floor was primarily a Cisco Nexus 7010 with four 32-port 10 Gb/s cards and two 48-port 1 Gb/s cards. A Cisco 6509 with 128 1 Gb/s ports serves the UCSD Tier-2 cluster.


Given the two independent network links available for data transfer, the test servers at UCSD and SC09 BWC were allocated as follows:

For the ESnet SDN link, 15 cluster nodes at UCSD ran the FDT server and client; at SC09 BWC, one Linux server ran the FDT client and one Sun storage server ran the FDT server.

For the CENIC-NLR link, 10 cluster nodes at UCSD ran the FDT server and client; at SC09 BWC, one Linux server ran the FDT client and one Sun storage server ran the FDT server.

10 backup cluster nodes at UCSD ran the FDT client for either the ESnet or the CENIC-NLR link.

The cluster nodes at UCSD are 2U twin Linux servers with dual quad-core processors and 10 GB of RAM. A 1 Gb/s uplink is available for inbound and outbound traffic. The cluster nodes also ran the Hadoop datanode service for storage at UCSD.

The Linux servers at SC09 BWC are dual quad-core machines with 12 GB of RAM and a 10 Gb/s uplink. The Sun storage servers are Thumper x4540 machines with 32 GB of RAM and a 10 Gb/s uplink.

The FDT server was adapted to the Hadoop distributed file system (HDFS) because HDFS does not allow random writes: the multiple streams of a transfer for the same file must be buffered before being written to HDFS sequentially. The FDT server at UCSD was configured to use up to 4.9 GB of memory for this buffer. If memory is exhausted, the data is written to temporary files of 8 MB each on the local disk; a separate process then reads these files, assembles the data, and writes it to HDFS.

In the following, several important factors of the system architecture at UCSD are listed which have a non-trivial impact on the performance of the data transfer:

The first factor is the nature of HDFS. In our experience, the throughput for a single node writing to HDFS is ~80-120 MB/s. HDFS normally writes a file block to the local disk, but because it is a distributed file system with redundancy requirements, automatic replication writes a second copy of the block to other nodes (in our HDFS configuration, a replication factor of 2 is the default for all data).
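
The two-copy replication described above corresponds to the standard HDFS replication setting; a minimal hdfs-site.xml fragment matching it would look like the following (illustrative, not the actual UCSD configuration file):

```xml
<!-- hdfs-site.xml: keep two copies of every block, as in the setup described -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```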

The second factor is that we used HDFS datanodes to run the FDT server and client. Ideally, the FDT server would deal mainly with the network, but for massive data transfers FDT needs to buffer data on the local disk, which introduces substantial system I/O and CPU utilization and has a negative impact on the throughput of the cluster, especially the performance of HDFS.

3. Test Activities

Some tests were conducted before the bandwidth challenge to tune the system and establish benchmark performance with respect to the FDT protocol and configuration. The WAN tests were part of the pre-bandwidth-challenge activities organized and coordinated with the other institutions, primarily involving bi-directional transfers between the SC09 BWC and each institution. Each institution was assigned dedicated machines on the SC09 BWC show floor for running the FDT servers/clients, so the performance of the transfer was not affected by transfers on other links. The testing times were mostly chosen during non-working hours, which hopefully minimized competition for network bandwidth from non-SC09 traffic.


3.1 Local FDT Test

The test used FDT without HDFS support under LAN conditions. The files were transferred to the local disk of the machine running the FDT server. The tests recorded 920-950 Mb/s for memory-to-memory and 500-700 Mb/s for disk-to-disk transfers. Increasing the number of streams did not significantly increase the throughput.

3.2 WAN FDT Memory-to-Memory Test

In the initial test, 3 FDT clients were run at the SC09 BWC show floor writing to 3 UCSD servers. During the half-hour test, a throughput of 2 Gb/s was observed. This number was consistent with that of the full-scale test shown later, because a memory-to-memory test does not involve extra system I/O or the storage file system: each FDT transfer (between server and client) can be treated individually, without needing to consider its impact on the load of the system or on other running data transfers.

In the full-scale test, one Linux server at SC09 BWC ran 15 FDT clients writing to 15 FDT servers at UCSD via the ESnet-SDN link, and another Linux server ran 10 FDT clients writing to 10 FDT servers at UCSD via the CENIC-NLR link. In the reverse direction, 15 FDT clients at UCSD wrote to one Sun data server at SC09 BWC via the ESnet-SDN link, and 10 FDT clients at UCSD wrote to another Sun data server via the CENIC-NLR link.

For the direction from SC09 BWC to UCSD, the ESnet-SDN and CENIC-NLR links reported rates of 6 Gb/s and 8 Gb/s respectively, sustained for at least half an hour. For the direction from UCSD to SC09 BWC, slightly lower throughput was seen.

The number of streams per transfer was varied from 4 to 32; no significant difference in throughput was observed.

3.3 WAN FDT Disk-to-disk Test

FDT with HDFS support was used for the disk-to-disk transfers. Several important technical issues were addressed:

If a very large file of more than 30 GB was transferred, the FDT server would run out of local disk space, because a substantial amount of data needed to be buffered on the local disk before being written to HDFS. Later tests therefore used files of ~1 GB in size, with fewer than 30 files transferred together.

While transferring a list of files in one session with a total of 20-30 GB, the amount of data buffered to the local disk could grow to 50% or more of the total data volume. In this case, we observed that even when the FDT client finished the transfer in X minutes, the FDT server would spend a similar amount of time moving the buffered files from the local disk to HDFS. From the client-side and network-monitoring perspective the FDT transfer finished in X minutes, while the actual throughput of the transfer must take into account the total time used by the server.
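
This accounting issue can be made concrete with a small calculation. The helper and the numbers below are illustrative, not measurements from the tests:

```python
def effective_throughput_gbps(total_gb, client_minutes, drain_minutes):
    """Effective disk-to-disk throughput in Gb/s.

    The storage-to-storage rate must include the time the FDT server
    spends draining buffered files from local disk into HDFS, not just
    the time the client spends on the network transfer.
    """
    total_gigabits = total_gb * 8
    total_seconds = (client_minutes + drain_minutes) * 60
    return total_gigabits / total_seconds

# Illustrative numbers: 30 GB transferred, the client finishes in 5 minutes,
# then the server takes another 5 minutes to flush its buffer into HDFS.
network_rate = effective_throughput_gbps(30, 5, 0)  # what network monitoring sees
actual_rate = effective_throughput_gbps(30, 5, 5)   # true storage-to-storage rate
```

With a drain time comparable to the client transfer time, as observed in these tests, the actual storage-to-storage throughput is only half the rate reported by network monitoring.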

If more than one transfer was scheduled for a client, e.g. 10 transfers of 30 GB each, we had to separate consecutive transfers with a 10-minute waiting period to allow the server to finish processing the data in the local buffer. With this spacing, the monitoring system gave the correct throughput of the transfer.
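
Such a schedule can be sketched as a simple driver loop; the function below is a hypothetical illustration of the spacing policy, not part of FDT:

```python
import time

def run_spaced_transfers(transfers, cool_down_s=600):
    """Run each transfer sequentially, waiting between them so the FDT
    server can drain its local buffer into HDFS before the next one starts.

    `transfers` is a list of zero-argument callables (each performing one
    transfer job); the default 600 s matches the 10-minute waiting period
    used in the tests described above.
    """
    for i, transfer in enumerate(transfers):
        transfer()
        if i < len(transfers) - 1:   # no need to wait after the last job
            time.sleep(cool_down_s)
```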


Because buffering the transferred data to local files was very I/O- and CPU-intensive, it was reasonable to reduce the number of streams per transfer to minimize the effect of data buffering. We found some reduction in CPU utilization with fewer streams per transfer. From the point of view of overall throughput, the data buffering did not increase the throughput when the data transfer was massive and lasted more than 10 minutes.

When the machines also ran as worker nodes of the computing farm, the performance of FDT could be degraded. On a node with 8 cores and 10 GB of memory running 6-8 CPU-intensive jobs, the throughput could drop by 60-80%, even though we limited the total RAM available to FDT to 4.9 GB per machine. We also observed that a single FDT server normally used 1-2 cores for the data transfer and related tasks.

3 FDT clients at SC09 BWC and 3 FDT servers at UCSD were tested over the ESnet-SDN link. Each client transferred 30 GB of data as a list of 30 files of 1 GB each, with 4 streams per transfer. A sustained 2 Gb/s throughput from the 3 transfers was achieved for a period of 10 minutes.

If a cluster node was fully loaded with computing jobs (normally 8 jobs per node), the transfer rate was ~100 Mb/s. If the computing jobs were removed and the node was devoted to FDT, the transfer rate went up to ~400 Mb/s.

Three tests were conducted using a full-scale architecture similar to that of the memory-to-memory test mentioned earlier.

The ESnet-SDN link achieved a throughput of 6.5 Gb/s in the SC09-to-UCSD direction and 2 Gb/s in the UCSD-to-SC09 direction. The rates in both directions were sustained for ~1 hour.

The ESnet-SDN and CENIC-NLR links achieved a combined throughput of 10-11 Gb/s in the SC09-to-UCSD direction; no transfer was conducted in the UCSD-to-SC09 direction. The rate was sustained for ~1 hour.

The ESnet-SDN link reported a throughput of ~3 Gb/s in the UCSD-to-SC09 direction; no transfer was conducted in the SC09-to-UCSD direction. The rate was sustained for ~10 minutes.

4. UCSD data transfer in the SC09 Bandwidth Challenge

The UCSD data transfers in the SC09 BWC used a similar full-scale test architecture, except that 10 more machines were used as FDT clients transferring data from UCSD to the SC09 BWC show floor. The report from the UCSD local monitoring system (Cacti) was consistent with the SC09 monitoring based on MonALISA.

The FDT configuration included:

4 streams per transfer, to reduce the overhead of local data buffering.

Between consecutive transfer jobs of 30 GB of data each, there was a 5-minute sleep period to reduce the possibility of running out of local disk space and causing a system failure; this also made the throughput seen by network monitoring consistent with the actual throughput of data being written to storage.

The FDT servers and clients running at UCSD eventually involved 35 nodes, including 25 nodes running the FDT server and client together, which was roughly 40% of the whole cluster. The network traffic recorded by the local monitoring system is shown in Fig.1 and Fig.2 for the incoming and outgoing rates respectively. The SC09 BWC lasted 1 hour.

The peak transfer rate from SC09 BWC to UCSD occurred at the beginning: 6.8 Gb/s (ESnet-SDN) and 5.9 Gb/s (CENIC-NLR). By the end, the rates had dropped to 4.6 Gb/s (ESnet-SDN) and 4.3 Gb/s (CENIC-NLR).

The transfers on both links were carried by a total of 25 server-client connections. The decrease in performance was related to the efficiency of the FDT server in putting data into HDFS; this was confirmed by the increasing accumulation of buffered data on the local disk of each FDT server. The average system load on each server was close to 200 at the end of the BWC.

HDFS automatic replication also had an impact on the file transfer: once a massive amount of data was in HDFS, the replication of that data to other nodes negatively influenced the data transfer.

The peak transfer rate from UCSD to SC09 BWC also occurred at the beginning, with ~2 Gb/s for the combined rate of ESnet-SDN and CENIC-NLR. By the end, the rate had dropped to 1 Gb/s.

The transfer rate from UCSD to SC09 BWC was roughly 6-8 times lower than that of the reverse direction. This was related to the HDFS and FDT servers at UCSD being heavily loaded handling the incoming data: on the same nodes, the FDT client processes that read data from HDFS and sent it out received lower priority. The consequence was that the UCSD-to-SC09 transfer (run by the FDT clients) was suppressed by the massive write processes run by the FDT servers.

In the middle of the one-hour bandwidth challenge, 10 new FDT clients were started on 10 nodes to transfer files to SC09 BWC. These newly added nodes did not run FDT servers, to avoid the very high load and I/O consumption caused by the FDT server. Initially the combined transfer rate of the 10 new nodes reached 2 Gb/s, but it dropped to less than 10% of that within ~2 minutes. A standalone test of copying a file from HDFS to the local disk showed that reading a file from HDFS was fast, consistent with the observation that the reading process could achieve very high throughput as long as the reading did not last longer than a few minutes.

Overall, the UCSD test cluster, with 25 FDT servers and 35 FDT clients, contributed ~8-9% of the bi-directional traffic of the SC09 BWC.

In the following, we compare the throughput of the UCSD test system to Caltech's performance, since HDFS is also used there.

The biggest difference in the Caltech HDFS architecture is that several large disk servers were used for FDT. Each server has 12 disks and a 10 Gb/s Ethernet interface. To increase the throughput of the server, the RAID-0 was split from a single large 12-disk array into two 6-disk RAID-0 arrays. The aggregate throughput was almost doubled, assuming the RAID controller was not the bottleneck. The effect of a slow, under-performing disk, which could slow down the whole RAID-0 partition, was also reduced.


At UCSD, a 4-disk RAID-0 is used on each worker node, and each server has only a 1 Gb/s Ethernet interface. In terms of network capacity, the UCSD test cluster was only 1/3-1/4 the size of Caltech's. Moreover, the disk servers at Caltech were used exclusively for transferring data from Caltech to SC09 BWC, which might explain why Caltech's throughput for reading and transferring data from HDFS was more than 10 times the UCSD performance in the same direction.

Caltech tuned the number of files to be transferred in parallel: HDFS is a distributed file system, which benefits from multiple concurrent reading processes. With 16 files per transfer, one disk server delivered 1.9 Gb/s. For some data nodes with a 1 Gb/s link at Caltech, a 940 Mb/s transfer rate was observed with 8 streams and 8 parallel files per transfer. This performance was similar to the UCSD local test results. However, we always saw the pattern that the transfer rate was higher initially, then slowly degraded with time and roughly stabilized at ~65-75% of the original rate. During the SC09 BWC data transfer we also tuned the number of files to be transferred in parallel, but no significant difference was observed; any effect might have been overwhelmed by the heavy writing processes in the system.

In short, the scalability of FDT transfers at the two sites is consistent once the hardware capacity of each site is taken into account, although the architectural differences between UCSD and Caltech are significant. To fully explain some of the differences in FDT transfer performance, further studies are needed on:

How HDFS replication affects the throughput of FDT transfers, and how replication consumes resources for a large number of nodes with small disk space each, versus a small number of nodes with large disk space each.

How large file servers behave in a stressed environment with massive incoming and outgoing data, especially when more than one RAID-0 array is deployed per machine.

How memory configuration and data buffering affect the throughput.

How the number of streams and the number of files transferred in parallel should be chosen, which can only be optimized for a specific architecture.

Fig.1 SC09 Bandwidth Challenge Network Traffic for incoming data at UCSD


Fig.2 SC09 Bandwidth Challenge Network Traffic for outgoing data at UCSD

5. Tests after SC09

After SC09, additional tests were conducted on the local cluster to further characterize FDT transfers in the HDFS environment.

5.1 One-Stream and Three-Stream FDT Tests

As mentioned earlier, FDT transfers might not benefit from multiple streams in massive data transfers, due to the overhead of the data buffering scheme. Here, tests with one and three streams show how a single transfer of 40 GB of data performed.

Fig.3 Network rate for the FDT test of 40 GB of data with one stream (left) and three streams (right). The one-stream transfer (left) started at 15:30 and lasted 10 minutes; the three-stream transfer (right) started at 15:15 and lasted 10 minutes.

The network transfer rates for FDT with one stream and with three streams are shown in Fig.3.


No big difference was observed in the incoming rate, which characterizes the FDT transfer. The benchmark local FDT transfer rate was largely limited by the local Ethernet bandwidth and the write rate of the HDFS client.

The outgoing rate reflected the replication activity of HDFS. In various tests, replication involved from 30% to 100% of the data that was transferred to the machine.

The system load for both transfers was similar, around 20-40. The CPU consumption is shown in Fig.4, which indicates that the FDT transfer with three streams used a significantly higher fraction of the CPU.

Fig.4 CPU consumption of the FDT test with one stream (left) and three streams (right). The one-stream transfer (left) started at 15:30 and lasted 10 minutes; the three-stream transfer (right) started at 15:15 and lasted 10 minutes. FDT was run under a user account, so it was counted as USER CPU utilization, while other user processes were temporarily assigned a lower priority (for one stream) or stopped (for three streams).

5.2 Standard FDT transfer

Testing FDT without HDFS support is an important aspect of characterizing FDT performance, because how data is written to the storage system is a complicated issue that may be optimized separately. In this test, several configurations of the number of FDT streams, 1, 3, and 6, were used. The various configurations were tested sequentially to make comparison easier. The results showed:

The transfer rates in the local test environment were not sensitive to the number of streams used in the FDT configuration.

Different numbers of streams gave similar CPU utilization.

The system load was higher when a larger number of streams was used.

Comparison with the HDFS-supported FDT discussed earlier showed:

There was no significant difference between the benchmark transfer rates of standard FDT and HDFS-supported FDT.


Standard FDT gave better performance in terms of system load and memory consumption.

Fig.5 Network transfer rate for FDT to local disk with one stream (left), three streams (middle), and six streams (right).

Fig.6 CPU utilization for FDT to local disk with one stream (left), three streams (middle), and six streams (right).

Fig.7 System load for FDT to local disk with one stream (left), three streams (middle), and six streams (right).


5.3 Large data transfer

14 FDT clients were used to transfer data to 14 FDT servers; no machine ran both an FDT server and a client.

Two types of tests were performed:

Standard FDT was used. In this case, data was written to the local disk. Since the file systems storing the data were on independent machines, the throughput was simply the aggregation of the individual transfers.

The throughput in the first 30 seconds was 11.2 Gb/s. The average throughput was 7.3 Gb/s, a reduction of about 1/3 from the initial rate.

FDT with HDFS support was used. In this case, data was written to HDFS, and a second copy of the data was replicated automatically by HDFS. The performance of HDFS affected the throughput of the transfer; the relevant factors include the capacity of each FDT server and the capacity of the whole HDFS system.

The throughput in the first 30 seconds was 9.8 Gb/s. The average throughput was 6.8 Gb/s, a reduction of about 1/3 from the initial rate. The overall performance of HDFS-supported FDT was roughly 90% of that of standard FDT.
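
As a quick sanity check, the quoted reductions follow directly from the measured rates:

```python
# Rates quoted in the text (Gb/s): first-30-second peak vs run average.
standard_peak, standard_avg = 11.2, 7.3   # standard FDT, writing to local disk
hdfs_peak, hdfs_avg = 9.8, 6.8            # HDFS-supported FDT

# Both runs lose roughly a third of their initial rate over the transfer.
standard_drop = 1 - standard_avg / standard_peak   # ~0.35
hdfs_drop = 1 - hdfs_avg / hdfs_peak               # ~0.31

# HDFS-supported FDT sustains roughly 90% of the standard FDT average.
hdfs_fraction = hdfs_avg / standard_avg            # ~0.93
```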

This result was consistent with what we observed in the SC09 BWC.

6. Conclusion and Future Plan

The FDT transfers of UCSD in the SC09 BWC were studied and characterized with benchmark measurements.

Various tests gave reasonably consistent results for the performance of the system.

The comparison between standard FDT and HDFS-supported FDT held no surprises: a 10-20% decrease in write throughput when transferring data to a distributed file system rather than to a local disk is expected. Since eventually all the data ends up on disk (either local or on a remote cluster node), the overhead of HDFS moving the data to another node was expected to be the main factor behind the ~10% decrease in throughput.

Under massive data transfer, we found:

The load of a node that ran the FDT server and served as an HDFS datanode was higher than that of a standard FDT server.

HDFS replication might affect the throughput if a large number of nodes are involved.

Buffering data to the local disk did not increase the throughput of massive data transfers; it might introduce more I/O overhead in the system.


Several issues remain to be understood and will be taken up as the next step of the study:

The reduction in throughput for large FDT transfers; both standard FDT and HDFS-supported FDT show a similar pattern.

The suppression of "read" processes in HDFS during large FDT transfers when the system is heavily loaded with FDT "write" processes.

Further comparison between the UCSD and Caltech HDFS architectures and performance may help to better understand FDT performance and improve the FDT deployment strategy.

Improvements in the FDT configuration with respect to a distributed file system, for example the number of streams and the number of files transferred in parallel.

References

[1] SuperComputing 2009, http://sc09.supercomputing.org/, November 14-20, 2009, Portland, Oregon
[2] Bandwidth Challenge in SC09, http://sc09.supercomputing.org/?pg=challenges.html
[3] The Large Hadron Collider, http://lhc.web.cern.ch/lhc/
[4] Moving Towards Terabit/Sec Transfers of Scientific Datasets, http://supercomputing.caltech.edu/
[5] Press Release of SuperComputing '09 by Caltech
[6] Fast Data Transfer, http://monalisa.cern.ch/FDT/
[7] Hadoop Distributed File System, http://hadoop.apache.org/
[8] Energy Sciences Network, http://www.es.net/
[9] National LambdaRail, http://www.nlr.net/
[10] The Corporation for Education Network Initiatives in California, http://www.cenic.org