
A. Abd Manaf et al. (Eds.): ICIEIS 2011, Part IV, CCIS 254, pp. 369–382, 2011. © Springer-Verlag Berlin Heidelberg 2011

Point-to-Point Communication on Gigabit Ethernet and InfiniBand Networks

Roswan Ismail1,2, Nor Asilah Wati Abdul Hamid2, Mohamed Othman2, Rohaya Latip2, and Mohd Azizi Sanwani3

1 Faculty of Art, Computing & Creative Industry, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia

2 Faculty of Computer Science & Information Technology, Universiti Putra Malaysia, Selangor, Malaysia

[email protected], {asila,mothman,rohaya}@fsktm.upm.edu.my

3 Centre for Diploma Programme, Multimedia University Cyberjaya, Selangor, Malaysia

[email protected]

Abstract. This paper presents measurements of MPI point-to-point communication performance on the Razi and Haitham clusters using the SKaMPI, IMB and MPBench benchmark programs. The measurements were done on clusters with identical configurations in order to compare and analyze the performance of the MPI implementation over different interconnect technologies: Gigabit Ethernet and InfiniBand. The results from all benchmark programs were then compared and analyzed, revealing that different MPI benchmark programs render different results for different interconnects. The results for both technologies were also compared with results from experiments conducted on a cluster with quad-core Opteron processors. The comparison concluded that, besides the type of interconnect, the architecture of the cluster itself may also affect the results.

Keywords: MPI benchmarks, parallel computer, IBM Blade HS21 Server, multi-core processors, SKaMPI, IMB, MPBench, Gigabit Ethernet, InfiniBand.

1 Introduction

The emerging trend of using clusters for High Performance Computing (HPC) has led to much research in this field, particularly on the standard used for communication between nodes, the Message Passing Interface (MPI). Consequently, numerous interconnect technologies, such as Myrinet, InfiniBand and Ethernet, have been adopted to maximize the performance of MPI implementations.

At present, the prevalent types of interconnect employed in high performance computers are Gigabit Ethernet and InfiniBand. According to the Top 500 supercomputer list, Gigabit Ethernet comes on top with 46.4%, while InfiniBand is a close second with 41.2% [1]. As most clusters use these two types of interconnect for communicating data between the nodes, analysis and evaluation of the performance of MPI routines on such clusters is indispensable.

This paper discusses the results of MPI message passing communication on the Razi (Gigabit Ethernet) and Haitham (InfiniBand) clusters obtained with SKaMPI, IMB and MPBench. The results from these applications were compared and analyzed for validation. The outcome will be beneficial for further research on the Open MPI library implementation and for the analysis of MPI implementations on multi-core architectures.

2 Related Works

Previous works provided performance evaluations of clusters with ccNUMA nodes [2], [3], [4] and with multi-core architectures such as dual-core Opteron nodes [5], [6], [7], [8] and quad-core Opteron nodes [9]. Other work provided performance analyses of MPI communication on clusters using Ethernet and Myrinet [10], [11], and Myrinet and InfiniBand [12], [13]. A related study compared MPI benchmark programs on shared and distributed memory machines for point-to-point communication [2], while [14] analyzed the optimum change-over points in algorithm selection for collective communication with MPICH on Ethernet and Myrinet networks.

However, there were no studies comparing measurements of MPI point-to-point and collective communication from different MPI benchmark programs on Gigabit Ethernet and InfiniBand, particularly on clusters with dual quad-core nodes running Open MPI.

Unlike previous related works, the work presented in this article measures MPI communication performance on clusters with dual quad-core Intel Xeon processors using two different types of interconnect: Gigabit Ethernet and InfiniBand. Nevertheless, this paper discusses only the Open MPI results for point-to-point communication. The results were then compared with previous work on MPI point-to-point communication on dual quad-core Opteron nodes with Gigabit Ethernet and InfiniBand interconnects [9].

3 Experiments on Razi and Haitham

There are several benchmark programs that can be used to measure the performance of MPI on parallel supercomputers. The most commonly used MPI benchmark programs are SKaMPI [15], Mpptest [16], IMB [17], MPBench [18] and the most recently developed, MPIBench [19]. However, this paper discusses only the results acquired for MPI message passing communication from SKaMPI, IMB and MPBench.

The experiments in this paper were conducted on Razi and Haitham, two of the three HPC clusters of BIRUNI GRID, a project commissioned by UPM as part of the HPC clusters for A-Grid [20]. The project, which started in 2008 and was funded by EuAsiaGrid, was developed and managed by the InfoComm Development Centre (iDEC) of UPM, where it was entirely configured and deployed. The supplier was involved only in hardware racking and the initial power-up stage.

Fig. 1 shows the deployment scheme of Biruni GRID, which consists of three clusters: Khaldun, Razi and Haitham. As illustrated in Fig. 1, the Khaldun cluster consists of six worker nodes, Razi of 20 and Haitham of eight. Khaldun served as an experimental grid, used mainly as a platform for research, application development and grid tutorials. Conversely, Razi and Haitham were designated as the production grid; however, the experimental grid's environment was almost indistinguishable from the production grid's. Each node in the clusters had dual 2 GHz Intel Xeon E5405 quad-core processors with 8 GB of RAM. All nodes were connected through a switch in a star topology.

Fig. 1. Biruni Grid Deployment Scheme [20]

Each node in the clusters had its own ID: for Khaldun the IDs began at wn-sb-001 and incremented for successive nodes, and similarly the IDs for Razi and Haitham started at wn-gb-001 and wn-ib-001 respectively. The detailed configurations of the clusters are listed in Table 1, while Fig. 2 and 3 show the block diagrams of the Intel Xeon E5405 quad-core processor. This paper compares the results on the Razi and Haitham clusters, as both used the same version of the MPI implementation.


Table 1. Configurations of the Khaldun, Razi and Haitham clusters [20]

                      Khaldun                      Razi                      Haitham
Number of nodes       6                            20                        8
Machine               IBM Blade HS21 servers
CPU                   2 x Intel Xeon quad-core 2 GHz processors (8 cores per node)
RAM                   8 GB
Storage capacity      2 x 147 GB per node (only 1 x 147 GB in use; the rest reserved for future use (multilayer grid))
O.S.                  Scientific Linux 5.4, 64-bit
Compiler              GCC
Interconnect          Gigabit Ethernet (1 Gb/s)                              InfiniBand (10 Gb/s)
MPI implementation    Open MPI 1.3.3 / Open MPI 1.4.3

Fig. 2. Block Diagram of Intel® Xeon® Processor E5405 (I) [21]


Fig. 3. Block Diagram of Intel® Xeon® Processor E5405 (II) [21]

4 Methodology

The experiments involved the installation and functionality testing of the SKaMPI, IMB and MPBench applications on the Razi and Haitham clusters. Common procedures, such as the same data sizes, MPI routines and number of iterations, were applied to all tests in order to standardize the experiments. All tests for all applications were run multiple times to ensure that the results obtained were consistent. Any abnormalities observed were scrutinized and the experiments rerun to make sure that no isolated factors influenced the results.

Before measurements were taken, the transfer message sizes were set from 4 bytes up to 4 MB. The number of repetitions for MPI operations was set to 1000, the default setting of IMB. The MPI routines selected to be measured, compared and reported in this article were MPI_Send/MPI_Recv and MPI_Sendrecv.

All measurements were run with exclusive access to the nodes; hence no other process could affect the results. Up to four nodes were used, since the measurements were taken on 2, 4, 8, 16 and 32 cores only. The data obtained from the experiments were then recorded, compared and analyzed.
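For reference, the measurement pattern described above can be expressed as a simple MPI ping-pong loop in C. The sketch below is not the SKaMPI, IMB or MPBench source; it is a minimal illustration under the same settings (message sizes from 4 bytes to 4 MB, 1000 repetitions, timing with MPI_Wtime), assuming ranks 0 and 1 are the communicating pair.

```c
/* Minimal ping-pong latency sketch (illustrative only, not the actual
 * SKaMPI/IMB/MPBench code). Rank 0 and rank 1 exchange a message of
 * each size REPS times; one-way latency is half the average round trip.
 * Compile with: mpicc pingpong.c -o pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000                      /* matches the repetition count used above */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(4 * 1024 * 1024);          /* largest message: 4 MB */

    for (size_t size = 4; size <= 4 * 1024 * 1024; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double roundtrip = (MPI_Wtime() - t0) / REPS;
        if (rank == 0)
            printf("%8zu bytes: one-way latency %.2f us\n",
                   size, 0.5 * roundtrip * 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```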

4.1 Communication Method on Razi and Haitham

All the MPI benchmark applications used provide measurements for the basic send/receive operation using the MPI_Send/MPI_Recv routines [15], [17], [18]. The main difference is the communication pattern [2]; nevertheless, for send/receive operations SKaMPI and MPBench have a similar communication pattern. Fig. 4 and 5 illustrate the point-to-point communication patterns for intra-node and inter-node communication for SKaMPI, IMB and MPBench. Intra-node communication refers to communication on 2, 4 and 8 cores, while inter-node communication refers to 16 and 32 cores.

The green line in each figure represents the core selected by IMB as the communicating partner of core 0 (the sender), while the red line refers to the core selected when the measurements were done with SKaMPI and MPBench. Blue lines indicate the measurement process performed by all benchmark applications to find the fastest and the slowest core to be selected as the sender's partner. In this case, SKaMPI and MPBench performed a short test on all cores to find the core with the slowest communication to the sender, while IMB did the opposite by finding the fastest.

As IMB's default point-to-point pattern is to find the fastest core to communicate with, it posed a problem for accurate measurements: send and receive operations using IMB would, by default, always occur intra-node. Therefore, to measure the communication time between cores on different nodes, the locations of the communicating cores had to be assigned in advance in order to force inter-node communication to take place. This was done for point-to-point communication on 16 and 32 cores by using a PBS command file option.
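The PBS command file option used here is specific to the BIRUNI GRID setup, so it is not reproduced. As an illustration of the same idea, the hypothetical C fragment below uses MPI_Get_processor_name to check at run time that the two communicating ranks really were placed on different nodes, so that an intra-node measurement is not mistaken for an inter-node one.

```c
/* Illustrative check (not the PBS-based method used in the paper):
 * verify that rank 0 and rank 1 were placed on different nodes
 * before timing inter-node send/receive operations. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, len;
    char myname[MPI_MAX_PROCESSOR_NAME], peer[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(myname, &len);

    if (rank == 0) {
        /* receive the hostname of the partner rank (rank 1) */
        MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (strcmp(myname, peer) == 0)
            printf("WARNING: ranks 0 and 1 share node %s "
                   "(intra-node, not inter-node)\n", myname);
        else
            printf("OK: rank 0 on %s, rank 1 on %s (inter-node)\n",
                   myname, peer);
    } else if (rank == 1) {
        MPI_Send(myname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

With Open MPI, a comparable effect can usually be obtained by controlling process placement through the hostfile or the mpirun placement options, although the exact options available depend on the Open MPI version installed on the cluster.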

Fig. 4. Intra-Node Communication Method on 8 cores

Fig. 5. Inter-Node Communication Method on 16 cores

5 Results

SKaMPI uses the Pingpong_Send_Recv and Pingpong_Sendrecv functions to measure point-to-point communication [15]. It returns the average time needed for one full message round trip, i.e. the time for one 'ping' plus the time for one 'pong', in contrast with IMB and MPBench, which return only half of that time. Accordingly, the SKaMPI latency results were halved before being compared with the latency results from IMB and MPBench.
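As a minimal sketch of this normalisation step (with hypothetical function names, not benchmark code), the adjustment applied before comparison amounts to:

```c
/* Hypothetical normalisation: SKaMPI reports the full round-trip time,
 * IMB and MPBench report half of it, so the SKaMPI value is halved
 * before the three benchmarks' results are compared. */
double one_way_from_skampi(double skampi_roundtrip_us)
{
    return skampi_roundtrip_us / 2.0;   /* ping + pong -> one way */
}

double one_way_from_imb_or_mpbench(double reported_us)
{
    return reported_us;                 /* already half the round trip */
}
```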


5.1 Send/Receive (Latency)

Fig. 6 and 7 illustrate the Gigabit Ethernet and InfiniBand latency, with some additional overhead, for point-to-point (send/receive) communication using 16 cores on the Razi and Haitham clusters. The results in Fig. 6 were obtained from SKaMPI and MPBench, while the results in Fig. 7 were from IMB. Fig. 6 shows that MPBench rendered the lowest results for the Haitham cluster, while the highest was the SKaMPI result for Razi. However, for both benchmark applications Haitham clearly had the lower latency, owing to the InfiniBand architecture.

Fig. 7 shows that the lowest times on Razi were obtained from IMB with its default setting (IMB-default), while the highest times were from IMB with predefined communicating cores (IMB-selected node). The pattern was replicated on the Haitham cluster, where IMB-default likewise provided lower latency than IMB-selected node. As noted, the default configuration of IMB provided identical results on both clusters regardless of the interconnect technology. This was because IMB-default only measured communication time between cores within the same node (intra-node communication). Once the communicating cores were assigned explicitly, Haitham delivered lower latency than Razi, which again reinforces the result obtained from Fig. 6.

Fig. 6. Comparison between SKaMPI and MPBench for point-to-point (send/receive) communication on 16 cores on Razi and Haitham

Fig. 8 and 9 show the SKaMPI latency results for different numbers of cores for intra-node and inter-node communication on Razi and Haitham. Both figures clearly indicate that the communication time for the send/receive operation increased consistently with message length. Fig. 8 shows that the latencies for the send/receive operation on 2, 4 and 8 cores on Razi and Haitham were similar to each other. For intra-node communication the variance was not substantial, since SKaMPI measured only the communication performance within the nodes rather than the performance of the communication network connecting them.


Fig. 7. Comparison between IMB (default) and IMB (selected node) for point-to-point (send/receive) communication on 16 cores on Razi and Haitham

Fig. 8. MPI_Send/MPI_Recv from SKaMPI for different numbers of cores with intra-node communication on Razi and Haitham

For the send/receive operation with inter-node communication in Fig. 9, each cluster returned nearly identical SKaMPI results for 16 and 32 cores. However, the latency on Razi was higher than on Haitham, which again can be attributed to the InfiniBand architecture.


Fig. 9. MPI_Send/MPI_Recv from SKaMPI for different numbers of cores with inter-node communication on Razi and Haitham

5.2 Send/Receive (Bandwidth)

Bandwidth results were calculated for all message sizes for IMB and MPBench. However, MPBench provides bandwidth results only for collective communication, not for point-to-point, and SKaMPI does not provide any bandwidth results [2]. Thus, the results presented in this section were taken from IMB only. IMB-selected node was chosen to ensure that the bandwidth for send/receive communication on 16 and 32 cores was measured as inter-node communication.
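For a given message size, the bandwidth follows directly from the measured one-way time: the number of bytes transferred divided by that time. The helper below is a hedged sketch of this conversion, not IMB's internal code; the exact 'MB' convention used by IMB may differ.

```c
/* Hypothetical helper: convert the one-way time for a message of
 * `bytes` bytes into a bandwidth figure in MB/s (1 MB = 2^20 bytes
 * assumed here). For example, 4 MB in roughly 36,700 us gives on the
 * order of 110 MB/s, the magnitude seen for Razi on 16 cores. */
double bandwidth_MBps(double bytes, double one_way_seconds)
{
    return (bytes / (1024.0 * 1024.0)) / one_way_seconds;
}
```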

Table 2 shows the IMB-selected node bandwidth results from 2 to 32 cores on the Razi and Haitham clusters using 1 MB (1,048,576 bytes) and 4 MB (4,194,304 bytes) message sizes. The results for 2, 4 and 8 cores on both clusters were nearly identical, with negligible differences. However, for 16 and 32 cores the difference was substantial, as the bandwidth provided by the Razi cluster was far surpassed by Haitham's.

Furthermore, comparing the intra-node and inter-node results on the Razi cluster shows that the available bandwidth decreased considerably for inter-node communication (16 and 32 cores). This did not occur on the Haitham cluster, where the bandwidth increased marginally for inter-node communication.

Fig. 10 shows the IMB-selected node bandwidth results for 16 and 32 cores on Razi and Haitham, which were identical for both core counts on each cluster. However, the maximum bandwidth achieved on Haitham was considerably higher, at about 1000 MB/s compared to about 100 MB/s on Razi. As expected, the bandwidth results for send/receive communication corroborate the previous latency results, with Haitham again outperforming Razi.


Table 2. IMB-selected node bandwidth results (MB/s) for various numbers of cores on Razi and Haitham

Number of cores      Razi (IMB-selected node)      Haitham (IMB-selected node)
                     1 MB        4 MB              1 MB        4 MB
2                    885         738               883         754
4                    886         740               886         759
8                    889         738               874         756
16                   109         111               912         922
32                   109         111               912         922

Fig. 10. IMB-selected node bandwidth results for inter-node communication on Razi and Haitham

For comparison, the IMB latency and bandwidth results for 16 cores on Razi and Haitham were then compared with IMB results from experiments done on a cluster with 2.3 GHz quad-core Opteron processors [9], where each node had two quad-core processors and 16 GB of memory and was configured with Gigabit Ethernet and InfiniBand networks connected to a Cisco switch. Once again, the latency and bandwidth results for both clusters were taken from IMB-selected node to ensure inter-node communication.

Table 3 presents the comparison of latency and bandwidth of point-to-point communication on 16 cores for different processor architectures with different interconnect technologies. It shows that the latency for send/receive communication on Razi and Haitham (Intel Xeon) was higher than on Barcelona (Opteron) for both types of interconnect. The bandwidth results substantiate this finding, with Barcelona surpassing Razi and Haitham on both Gigabit Ethernet and InfiniBand.

The reasons for the higher latency and lower bandwidth on the Razi and Haitham clusters were that their machines had only 2 GHz processors and 8 GB of RAM, whereas Barcelona had slightly faster 2.3 GHz processors and larger, 16 GB memory. Therefore, owing to processor speed and memory size, the performance of MPI point-to-point message passing on Barcelona was marginally better than on Razi and Haitham, even with the same interconnect types.

Table 3. Latency and bandwidth of MPI_Send/MPI_Recv communication on 16 cores for different processors with different interconnects

Cluster (processor)        Interconnect        Bandwidth (MB/s)    Latency (µs)
Razi (Intel Xeon)          Gigabit Ethernet    111.39              48.11
Barcelona (Opteron)        Gigabit Ethernet    112.5 [9]           46.52 [9]
Haitham (Intel Xeon)       InfiniBand          921.95              4.19
Barcelona (Opteron)        InfiniBand          1466 [9]            2.01 [9]

5.3 Combined Send and Receive

Only SKaMPI and IMB provide measurements for MPI_Sendrecv; accordingly, the results presented here were obtained from these two benchmark applications. For MPI_Sendrecv, both SKaMPI and IMB use a different technique than for MPI_Send/MPI_Recv: each process sends to its right neighbour and receives from its left neighbour in a chain of N processes. Most communication networks are capable of providing similar bandwidth if messages are sent simultaneously in both directions, so MPI_Sendrecv provides a suitable way of testing whether the MPI implementation can indeed deliver this bidirectional bandwidth [2].
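The ring pattern described above can be written with a single MPI_Sendrecv call per process. The fragment below is an illustrative sketch of that pattern, not the SKaMPI or IMB source: each rank sends to its right neighbour and receives from its left neighbour in the same call.

```c
/* Illustrative MPI_Sendrecv ring (not benchmark source code):
 * every rank sends `size` bytes to the rank on its right and
 * receives the same amount from the rank on its left. */
#include <mpi.h>

void sendrecv_ring(void *sendbuf, void *recvbuf, int size)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;             /* destination */
    int left  = (rank - 1 + nprocs) % nprocs;    /* source */

    /* the send and receive proceed concurrently, exercising the
     * bidirectional bandwidth of the network */
    MPI_Sendrecv(sendbuf, size, MPI_BYTE, right, 0,
                 recvbuf, size, MPI_BYTE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```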

Fig. 11 shows the MPI_Sendrecv measurements from SKaMPI and IMB on Razi and Haitham. For Razi, the SKaMPI result was initially higher than IMB's, but as the message size increased they converged. In contrast, the latency results for Haitham were similar for both applications at first and remained lower than Razi's, but as the message size increased the IMB result approached convergence with Razi's, while the SKaMPI result remained the lowest throughout.


Fig. 11. Comparison between SKaMPI and IMB for point-to-point (combined send and receive) communication on 16 cores on Razi and Haitham

Table 4 presents a comparison of the latency results for MPI_Send/MPI_Recv and MPI_Sendrecv on 16 cores on Razi and Haitham for message sizes from 4 bytes to 4 MB (4,194,304 bytes).

Table 4. Latency of MPI_Send/MPI_Recv and MPI_Sendrecv communication on 16 cores on Razi and Haitham

Message size (bytes)    MPI_Send/MPI_Recv (µs)        MPI_Sendrecv (µs)
                        Razi        Haitham           Razi        Haitham
4                       55          6                 55          3
16                      55          7                 56          3
64                      55          7                 56          3
256                     58          8                 58          5
1024                    86          11                91          7
4096                    149         17                72          11
16384                   313         90                184         18
65536                   1123        148               1082        45
262144                  3142        373               3085        146
1048576                 10010       1260              9980        557
4194304                 36718       4807              36625       2221


The results shown have been rounded to the nearest microsecond. The latency results for MPI_Sendrecv on Haitham were lower than those for MPI_Send/MPI_Recv for all message sizes. On Razi, however, the results were nearly identical except for message sizes from 4096 to 16384 bytes, where MPI_Sendrecv was distinctly lower. Hence, it can be concluded that the technique used by MPI_Sendrecv can accelerate data transfer, as its ring pattern enables data to be sent and received simultaneously in both directions. The results show that both Haitham and Razi indeed provide bidirectional bandwidth.

6 Conclusions

In conclusion, the performance of MPI message passing routines on a cluster depends on the measurement techniques applied by the MPI benchmark programs and on how the communication is synchronized. Different MPI benchmarks give different results for point-to-point communication; in this case, SKaMPI, IMB and MPBench gave different results since they use different measurement techniques and synchronization methods. However, from the results obtained, InfiniBand had distinctly better performance in terms of latency and throughput.

The architecture of the parallel cluster itself also affected the communication results, as measurements of the MPI message passing routines on different clusters with different machines rendered different results. Another factor was the mode of communication: inter-node communication on Razi and Haitham gave higher latency and lower bandwidth, whereas intra-node communication gave lower latency and higher bandwidth.

References

1. Top 500 Supercomputer, http://www.top500.org/stats/list/37/connfam
2. Hamid, N.A.W.A., Coddington, P.: Comparison of MPI Benchmark Programs on Shared Memory and Distributed Memory Machines (Point-to-Point Communication). International Journal of High Performance Computing Applications (November 7, 2010)

3. Kayi, A., Kornkven, E., El-Ghazawi, T., Newby, G.: Application Performance Tuning for Clusters with ccNUMA Nodes. In: 11th IEEE International Conference on Computational Science and Engineering (2008)

4. Hamid, N.A.W.A., Coddington, P., Vaughan, F.: Performance Analysis of MPI Communications on the SGI Altix 3700. In: Proc. Australian Partnership for Advanced Computing Conference (APAC 2005), Gold Coast, Australia (September 2005)

5. Alam, S.R., Barrett, R.F., Kuehn, J.A., Roth, P.C., Vetter, J.S.: Characterization of Scientific Workloads on Systems with Multi-Core Processors. In: International Symposium on Workload Characterization (2006)

6. Kayi, A., Yao, Y., El-Ghazawi, T., Newby, G.: Experimental Evaluation of Emerging Multi-core Architectures. In: 21st IEEE International Parallel & Distributed Processing Symposium PMEO-PDS Workshop Proceedings, Long Beach, CA (March 2007)


7. Milfeld, K., Goto, K., Purkayastha, A., Guiang, C., Schulz, K.: Effective Use of Multi-Core Commodity Systems in HPC. In: 8th LCI International Conference on High Performance Clustered Computing, Lake Tahoe, CA (May 2007)

8. Barrett, R.F., Alam, S.R., Vetter, J.S.: Performance Evaluation of the Cray XT3 Configured with Dual Core Opteron Processors. In: SIGPLAN 2005 (June 2005)

9. Kandadai, S.N., He, X.: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM (2007)

10. Hamid, N.A.W.A., Coddington, P.: Averages, Distribution and Scalability of MPI Communication Times for Ethernet and Myrinet Networks. In: Proceedings of the 25th IASTED International Multi-Conference: Parallel and Distributed Computing and Networks, Austria (2007)

11. Majumder, S., Rixner, S.: Comparing Ethernet and Myrinet for MPI Communication. In: Proceedings of the Seventh Workshop on Languages, Compilers, and Run-time Support for Scalable Systems (LCR 2004), pp. 83–89 (October 2004)

12. Liu, J., Chandrasekaran, B., Yu, W., Wu, J., Buntinas, D., Kini, S.P., Wyckoff, P., Panda, D.K.: Micro-benchmark performance comparison of high-speed cluster interconnects. IEEE Micro 24(1) (January/February 2004)

13. Rashti, M.J., Afsahi, A.: 10-Gigabit iWARP Ethernet: comparative performance analysis with InfiniBand and Myrinet-10G. In: 7th IEEE Workshop on Communication Architecture for Clusters, CAC 2007 (2007)

14. Hamid, N.A.W.A., Coddington, P.: Analysis of Algorithm Selection for Optimizing Collective Communication with MPICH for Ethernet and Myrinet Networks. In: Proc. of the 8th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2007), Adelaide, Australia (December 2007)

15. SKaMPI, http://liinwww.ira.uka.de/~skampi/
16. Mpptest, http://www.mcs.anl.gov/research/projects/mpi/mpptest/
17. Pallas MPI Benchmark, http://www.pallas.de/pages/pmbd.htm
18. MPBench, http://icl.cs.utk.edu/projects/llcbench/mpbench.html
19. MPIBench, http://www.dhpc.edelaide.edu.au/projects/MPIBench
20. Biruni Grid. InfoComm Development Centre (iDEC) of University Putra Malaysia (UPM), http://biruni.upm.my/
21. Intel, http://ark.intel.com/