Virtualizing Modern High-Speed Interconnection Networks with
Performance and Scalability
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang [email protected]
Introduction

• Virtualization is now one of the enabling technologies of Cloud Computing
• Many HPC providers now use their systems as platforms for cloud/utility computing. These HPC-on-Demand offerings include:
  – Penguin's POD
  – IBM's Computing On Demand service
  – R Systems' dedicated hosting service
  – Amazon's EC2
Introduction: Virtualizing HPC Clouds?

• Pros:
  – good manageability
  – proactive fault tolerance
  – performance isolation
  – online system maintenance
• Cons:
  – Performance gap
    • Lack of low-latency interconnects, which are important to tightly coupled MPI applications
    • VMM-bypass I/O has been proposed to address this concern
Introduction: VMM-bypass I/O Virtualization

• Xen's split device driver model is used only to set up the necessary user access points
• Data communication in the critical path bypasses both the guest OS and the VMM (see the sketch after the figure below)
[Figure: VMM-Bypass I/O (courtesy [7]) — the application in the guest VM reaches the OS-bypass I/O device directly (VMM-bypass access), while privileged operations go through the guest module, the backend module in the IDD, and the privileged module (privileged access).]
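To make the control/data path split concrete, here is a minimal, generic libibverbs sketch (not the authors' code) with comments marking which calls would take the split-driver control path and which would use VMM-bypass access; device selection, QP connection setup, and error handling are omitted.

```c
/* Minimal sketch of a verbs program running unmodified inside a guest VM.
 * Control-path calls are forwarded through the Xen split drivers to the
 * IDD; the critical-path call at the bottom uses doorbell pages mapped
 * directly into the guest process. Error handling is omitted. */
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);

    /* Control path: privileged setup, served by the back-end driver and
     * the native HCA driver in the IDD. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

    static char buf[4096];
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf),
                                   IBV_ACCESS_LOCAL_WRITE);

    struct ibv_qp_init_attr qpia = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpia);

    /* ... the QP would be connected to a remote peer here ... */

    /* Data path (critical path): posting a send rings the HCA doorbell
     * from user space and bypasses both the guest OS and the VMM. */
    struct ibv_sge sge = { .addr = (uintptr_t)buf,
                           .length = sizeof(buf), .lkey = mr->lkey };
    struct ibv_send_wr wr = { .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND }, *bad;
    ibv_post_send(qp, &wr, &bad);

    ibv_free_device_list(devs);
    return 0;
}
```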
Introduction: InfiniBand Overview

• InfiniBand is a popular high-speed interconnect
  – OS-bypass/RDMA
  – Latency: ~1 us
  – Bandwidth: ~3300 MB/s
• ~41.4% of Top500 systems now use InfiniBand as the primary interconnect (Interconnect Family / Systems, June 2010; source: http://www.top500.org)
Introduction: InfiniBand Scalability Problem

• Reliable Connection (RC)
  – Queue Pair (QP): each QP consists of a send queue (SQ) and a receive queue (RQ)
  – QPs require memory
  – Conns/Process: (N-1)×C
• Shared Receive Queue (SRQ)
• eXtensible Reliable Connection (XRC)
  – XRC domain & SRQ-based addressing (illustrated in the sketch below)
  – Conns/Process: (N-1)
• N: node count, C: cores per node

[Figure: RC vs. XRC in InfiniBand — with RC, every process P1–P4 on node1 keeps a connection to every process P5–P8 on node2; with XRC, each process keeps one connection per remote node, and the target process is selected through per-process SRQs (SRQ5–SRQ8) inside the remote XRC domain.]
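The SRQ-based addressing that makes XRC scale can be sketched in verbs code. Note that this sketch uses the current upstream libibverbs XRC API (ibv_create_srq_ex, ibv_get_srq_num, the qp_type.xrc.remote_srqn work-request field), which differs from the legacy OFED XRC interface MVAPICH used at the time; XRCD, PD, and CQ setup as well as error handling are omitted.

```c
/* Sketch (modern libibverbs XRC API, for illustration only):
 * a receiver exposes a per-process XRC SRQ; a sender reaches any such
 * SRQ on the remote node through one shared XRC connection by naming
 * the SRQ number in the work request. */
#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver side: one XRC SRQ per process, bound to the node's XRC domain. */
struct ibv_srq *make_xrc_srq(struct ibv_context *ctx, struct ibv_xrcd *xrcd,
                             struct ibv_cq *cq, struct ibv_pd *pd,
                             uint32_t *srqn_out)
{
    struct ibv_srq_init_attr_ex attr = {
        .attr      = { .max_wr = 256, .max_sge = 1 },
        .comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD |
                     IBV_SRQ_INIT_ATTR_XRCD | IBV_SRQ_INIT_ATTR_CQ,
        .srq_type  = IBV_SRQT_XRC,
        .pd        = pd,
        .xrcd      = xrcd,
        .cq        = cq,
    };
    struct ibv_srq *srq = ibv_create_srq_ex(ctx, &attr);
    ibv_get_srq_num(srq, srqn_out);   /* SRQ number advertised to senders */
    return srq;
}

/* Sender side: one XRC send QP per remote *node*; the destination
 * process is chosen per message via its remote SRQ number. */
void send_to_remote_process(struct ibv_qp *xrc_send_qp,
                            struct ibv_sge *sge, uint32_t remote_srqn)
{
    struct ibv_send_wr wr = { .sg_list = sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND }, *bad;
    wr.qp_type.xrc.remote_srqn = remote_srqn;  /* SRQ-based addressing */
    ibv_post_send(xrc_send_qp, &wr, &bad);
}
```

The key point is that the sender holds one XRC QP per remote node and names the destination process per message by its SRQ number, instead of holding one RC QP per remote process.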
Problem Statement

• Does a scalability gap exist between native and virtualized environments?
  – CV: cores per VM

[Figure: XRC in VMs — with CV=2 the node's cores are split across two VMs, with CV=1 across four VMs; each VM holds its own XRC domain (XRCD), so a process such as P1 needs one connection per remote XRCD rather than one per remote node.]

| Environment | Transport | QPs per Process | QPs per Node   |
|-------------|-----------|-----------------|----------------|
| Native      | RC        | (N-1)×C         | (N-1)×C²       |
| Native      | XRC       | (N-1)           | (N-1)×C        |
| VM          | RC        | (N-1)×C         | (N-1)×C²       |
| VM          | XRC       | (N-1)×(C/CV)    | (N-1)×(C²/CV)  |

Scalability gap exists!
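As a quick sanity check of the table, the snippet below simply evaluates the four formulas for an example configuration (the parameter values are arbitrary):

```c
/* Evaluate the connection-count formulas from the table above for a
 * fully connected job: N nodes, C cores/node, CV cores/VM. */
#include <stdio.h>

int main(void)
{
    long N = 1024, C = 16, CV = 1;           /* example parameters */

    long rc_proc  = (N - 1) * C;             /* RC, per process     */
    long rc_node  = (N - 1) * C * C;         /* RC, per node        */
    long xrc_proc = (N - 1);                 /* native XRC          */
    long xrc_node = (N - 1) * C;
    long vm_proc  = (N - 1) * (C / CV);      /* XRC inside VMs      */
    long vm_node  = (N - 1) * (C * C / CV);

    printf("RC:     %ld / %ld QPs (per process / per node)\n", rc_proc, rc_node);
    printf("XRC:    %ld / %ld\n", xrc_proc, xrc_node);
    printf("VM-XRC: %ld / %ld\n", vm_proc, vm_node);
    return 0;
}
```

With CV = 1 (one core per VM), the VM-XRC counts collapse to the RC counts, which is exactly the scalability gap the proposed design targets.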
Presentation Outline

• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Proposed Design: VM-proof XRC Design

• Design goal: eliminate the scalability gap
  – Conns/Process: (N-1)×(C/CV) → (N-1)

[Figure: Shared XRC domain — all VMs on a physical node share a single XRC domain, so P1 needs only one connection to reach P5–P8 on the remote node.]
Proposed Design: Design Challenges

• VM-proof sharing of the XRC domain
  – A single XRC domain must be shared among different VMs within a physical node
• VM-proof connection management
  – With a single XRC connection, P1 must be able to send data to all the processes in another physical node (P5–P8), no matter which VMs those processes reside in

[Figure: Software architecture — the MPI application runs over the MPI library (ADI, channel interface, communication device APIs) with a VM-proof connection manager (CM); in the guest domain, the front-end driver and core InfiniBand modules provide VM-proof XRCD sharing and resource management, backed by the back-end driver, core InfiniBand modules, native HCA driver, and device manager/control software in the IDD; guest and IDD communicate over Xen device and event channels on top of the Xen hypervisor and the InfiniBand OS-bypass I/O device.]
Proposed Design: Implementation

• VM-proof sharing of the XRCD
  – The XRCD is shared by opening the same XRCD file
  – Guest domains and the IDD have dedicated, non-shared filesystems
  – A pseudo XRCD file in each guest is therefore mapped to the real XRCD file in the IDD
• VM-proof CM
  – Traditionally, an IP address/hostname was used to identify a node
  – The LID of the HCA is used instead (see the sketch below)
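A rough verbs-level sketch of the two mechanisms follows. It uses the present-day libibverbs XRCD API (ibv_open_xrcd) rather than the OFED 1.4.2 interface the paper was built on, and the XRCD file path is purely hypothetical; in the actual design a guest opens a pseudo XRCD file that the front-end/back-end drivers redirect to the real XRCD file kept by the IDD.

```c
/* Sketch: XRCD sharing via a common file, and LID-based node identity.
 * The path below is hypothetical; error handling is omitted. */
#include <fcntl.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct ibv_xrcd *open_shared_xrcd(struct ibv_context *ctx)
{
    /* Hypothetical XRCD file; the design maps a guest's pseudo file
     * onto the real XRCD file held in the IDD. */
    int fd = open("/dev/shm/mpi-job-xrcd", O_RDWR | O_CREAT, 0600);

    struct ibv_xrcd_init_attr attr = {
        .comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS,
        .fd        = fd,
        .oflags    = O_CREAT,   /* create the domain if it does not exist */
    };
    /* Every process that passes an fd for the same underlying file gets
     * a handle to the same XRC domain. */
    return ibv_open_xrcd(ctx, &attr);
}

/* Node identity for connection management: use the HCA port's LID
 * instead of the VM's IP address or hostname. */
uint16_t node_id(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr pattr;
    ibv_query_port(ctx, port, &pattr);
    return pattr.lid;
}
```

Because the LID identifies the HCA port of the physical node, processes in different VMs on the same node advertise the same node identity, which is what lets a single XRC connection serve all of them.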
Proposed Design: Discussions

• Safe XRCD sharing
  – Unauthorized applications from other VMs might attempt to share the XRCD
    • Isolation of XRCD sharing can be guaranteed by the IDD
  – Isolation between VMs running different MPI jobs
    • By using different XRCD files, different jobs (or VMs) share different XRCDs and run without interfering with each other
• XRC migration
  – Main challenge: an XRC connection is a process-to-node communication channel
  – Left as future work
Presentation Outline

• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Evaluation: Platform

• Cluster configuration:
  – 128-core InfiniBand cluster
  – Quad-socket, quad-core AMD Barcelona, 1.9 GHz
  – Mellanox DDR ConnectX HCAs, 24-port MT47396 InfiniScale III switch
• Implementation:
  – Xen 3.4 with Linux 2.6.18.8
  – OpenFabrics Enterprise Distribution (OFED) 1.4.2
  – MVAPICH 1.1.0
Evaluation: Microbenchmark

• The bandwidth results are nearly the same
• Virtualized IB performs ~0.1 us worse when using the BlueFlame mechanism
  – Caused by the memory copy of the send data to the HCA's BlueFlame page
  – Explanation: memory copy operations in the virtualized case involve interactions between the guest domain and the IDD

[Figures: IB verbs latency using doorbell; IB verbs latency using BlueFlame; MPI latency using BlueFlame]
Evaluation: VM-proof XRC Evaluation

• Configurations:
  – Native-XRC: native environment running XRC-based MVAPICH
  – VM-XRC (CV=n): VM-based environment running unmodified XRC-based MVAPICH; the parameter CV denotes the number of cores per VM
  – VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design
Evaluation: Memory Usage

• Fully connected cluster with 16 cores per node
  – The X-axis denotes the process count
  – ~12 KB of memory per QP
• ~16x less memory usage
  – 64K processes would consume ~13 GB/node with the VM-XRC (CV=1) configuration
  – The VM-proof XRC design reduces the memory usage to only ~800 MB/node

[Figure: QP memory per node vs. process count — lower is better]
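A back-of-the-envelope check of these two numbers, using the ~12 KB/QP figure and the per-node QP counts from the Problem Statement table (64K processes on 16-core nodes means N = 4096):

```c
/* Back-of-the-envelope check of the memory-usage figures:
 * 64K processes on 16-core nodes (N = 4096), ~12 KB per QP. */
#include <stdio.h>

int main(void)
{
    double qp_kb = 12.0;
    long   N = 4096, C = 16, CV = 1;

    /* VM-XRC (CV=1): (N-1) * C^2 / CV QPs per node */
    double vm_xrc_gb = (double)(N - 1) * C * C / CV * qp_kb / (1024 * 1024);

    /* VM-proof XRC: (N-1) * C QPs per node */
    double vmproof_mb = (double)(N - 1) * C * qp_kb / 1024;

    printf("VM-XRC (CV=1): %.1f GB/node\n", vm_xrc_gb);   /* ~12.0 GB */
    printf("VM-proof XRC:  %.1f MB/node\n", vmproof_mb);  /* ~768 MB  */
    return 0;
}
```

This gives roughly 12 GB/node and 768 MB/node, consistent with the ~13 GB and ~800 MB read off the chart once rounding and additional per-connection state are accounted for.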
Evaluation: MPI Alltoall Evaluation

• A total of 32 processes
• 10%–25% improvement for messages < 256 bytes

[Figure: MPI Alltoall latency vs. message size for the VM-proof XRC design — lower is better]
Evaluation: Application Benchmarks

• VM-proof XRC performs nearly the same as Native-XRC
  – Except for BT and EP
• Both are better than VM-XRC
• Little variation across different CV values
  – CV=8 is an exception: memory allocation is not guaranteed to be NUMA-aware

[Figures: application benchmark results — lower is better]
Evaluation: Application Benchmarks (Cont'd)

[Figure: connection counts for the application benchmarks — ~15.9x and ~14.7x fewer connections with the VM-proof XRC design]
Conclusion and Future Work

• The VM-proof XRC design converges two technologies:
  – VMM-bypass I/O virtualization
  – eXtensible Reliable Connection (XRC) in modern high-speed interconnection networks (InfiniBand)
• With the VM-proof XRC design, virtualized environments achieve the same raw performance and scalability as the native, non-virtualized environment
  – ~16x scalability improvement is seen on 16-core/node clusters
• Future work:
  – Evaluations on different platforms at larger scale
  – Add VM migration support to the VM-proof XRC design
  – Extend the work to the new SR-IOV-enabled ConnectX-2 HCAs
Questions?
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Backup Slides
OS-bypass of InfiniBand

[Figure: OpenIB Gen2 stack]