CONFIDENTIAL Copyright 2017 All rights reserved. 1
Unified PCI Express Network: High Performance Applications
Roy Nordstrøm
Who is Dolphin?
▪ Norwegian company founded in 1992
▪ Dolphin has more than two decades of multi-host computing and clustering experience
▪ Developed a complete software and hardware infrastructure for multi-host computing and IO
– Software
– Host Adapter cards
– Switches
PCI Express Network
What is PCIe Networking, or PCIe over Cable?
▪ Extend PCIe between systems and I/O using cables or backplanes
▪ Supports copper and fiber cables
– Copper cables up to 9 meters*
– Fiber cables between 10 and 100 meters*
▪ Two bridging models
– Transparent bridging to I/O devices: no software needed, supported in hardware
– Non-transparent bridging (NTB) to connect two or more root complexes, such as processors or GPUs; software is required to transfer data
▪ No changes to the PCIe protocol – standard PCIe transactions
* Copper and fiber cable lengths vary based on boards, switches, and speed of interconnect
PCIe is the dominant IO bus technology in computers today, and is also gaining traction as a high-bandwidth low-latency interconnect
PCI-SIG. PCI Express 3.1 Base Specification, 2010. http://www.eetimes.com/document.asp?doc_id=1259778
[Chart: PCIe link bandwidth in gigabytes per second (GB/s), 0–35 GB/s, for PCIe x4, x8, and x16 links across Gen 2, Gen 3, and Gen 4]
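The per-generation numbers behind the bandwidth chart can be reproduced from each generation's line rate and encoding overhead. The sketch below computes the theoretical per-direction maxima; real-world throughput is lower, as the benchmark slides later in this deck show.

```python
# Theoretical per-direction PCIe bandwidth in GB/s, derived from the
# per-lane line rate and the encoding efficiency of each generation.

RATES = {
    # generation: (line rate in GT/s per lane, encoding efficiency)
    "Gen2": (5.0, 8 / 10),     # 8b/10b encoding
    "Gen3": (8.0, 128 / 130),  # 128b/130b encoding
    "Gen4": (16.0, 128 / 130),
}

def link_bandwidth_gbs(gen: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    gt_per_s, efficiency = RATES[gen]
    return gt_per_s * efficiency * lanes / 8  # bits -> bytes

for gen in RATES:
    row = {f"x{n}": round(link_bandwidth_gbs(gen, n), 2) for n in (4, 8, 16)}
    print(gen, row)
```

For example, a Gen3 x16 link works out to about 15.75 GB/s each way, and Gen4 x16 to about 31.5 GB/s, matching the chart's upper range.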
Goal of PCI Express Network
▪ Reduce network latency and overhead to accelerate applications
▪ Take advantage of the standardization, technology, and performance of PCI Express to build an efficient, powerful local network – Gen2, Gen3, and beyond
▪ Take advantage of the features and functions within PCI Express
▪ Provide a low cost solution that leverages the PCI Express eco-system
▪ Combine host-to-host and host-to-I/O networking and sharing
Unified PCI Express Network
▪ Combination of two elements
– PCIe Clustering
– PCIe SmartIO technology
▪ PCIe Clustering
– Designed for tightly coupled distributed systems: low latency, high throughput
– Scale-out capability: node scaling from 2 to 128 nodes (128 nodes based on new technology)
– Performance scaling from Gen3 x4 PCIe to x16 PCIe
▪ PCIe SmartIO technology
– Creates a pool of devices: device lending enables devices to be shared in a PCIe cluster
– Direct peer-to-peer communication, bypassing the local CPU and system memory
– Enhanced capabilities: creates MR-IOV capabilities with SR-IOV devices
PCIe Network Markets
High Performance Markets that benefit from Low Latency
Electronic Trading Applications
• Trading Desks
• High Availability systems
Storage
• NVMe drive interconnect
• Replication
• Low Latency Storage
Real-time Applications
• Military/ Aerospace
• Medical imaging
• Test Equipment
• Video + Rendering
Simulation
• GPU based simulation
• Reflective memory
Clustered File and Storage Systems
• Gluster/GFS
• Hadoop
• DRBD Replication
Parallel Computation
• Reflective memory systems
• HPC Libraries
• CUDA applications
CLUSTERING TECHNOLOGY
Using Dolphin
PCI Express Network Gen3 Hardware
PXH810/812 HOST ADAPTER
▪ PCI Express Small form factor
▪ Gen3 x8 switch
▪ 64 Gbps bidirectional throughput, 0.54 µs latency
▪ x8 Edge connector
▪ x8 iPass cable connector
▪ 5 Meter copper cables
▪ 100 Meter optical cables
▪ Gen1 and Gen2 support
▪ Compliant with Dolphin Software
▪ Transparent and non-transparent bridging
▪ Host and target support
▪ Transparent only version: PXH812
▪ Available Now
PXH830/832 HOST ADAPTER
▪ PCI Express low profile half length form factor
▪ Gen3 x16 switch
▪ 128 Gbps bidirectional throughput, 0.54 µs latency
▪ x16 Edge connector
▪ 4 – x4 Cable Ports, SFF-8644
▪ PCI-SIG Ext. Cable 3.0 or MiniSAS-HD
▪ 1 x16 port or 2 x8 ports
▪ 9 Meter copper cables
▪ 100 Meter optical cables
▪ Compliant with Dolphin Software
▪ Transparent and non-transparent bridging
▪ Host and target support
▪ Transparent only version: PXH832
▪ Available Now
Dolphin's Switchtec PCIe Gen3 Adapter
MXH832 PCIe HOST ADAPTER
▪ PCI Express low profile, half length form factor
▪ Gen3 32 lane Microsemi Switchtec chipset
▪ 128 Gbps bidirectional throughput
▪ Host-to-host latency 0.5 µs*
▪ x16 Edge connector
▪ 4 – x4 Cable Ports, SFF-8644
▪ PCI-SIG Ext. Cable 3.0 or MiniSAS-HD
▪ 9 Meter copper cables*
▪ 100 Meter optical cables
▪ Configurations
▪ 1 x16 port
▪ 2 x8 ports
▪ 4 x4 ports
▪ Transparent
▪ Host and target support
▪ Available Q4 2017
* Project target, may change
Dolphin/Samtec PCIe Gen3 Fiber Adapter
PXH840/PXH842 PCIe HOST ADAPTER
▪ PCI Express low profile, half length form factor
▪ Gen3 32 lane Broadcom Switch chipset
▪ 128 Gbps bidirectional throughput
▪ Host-to-host latency 0.5 µs
▪ x16 Edge connector
▪ Up to 4 x4 Samtec FireFly optical engines
▪ 100 Meter optical cable Support
▪ MTP connector support
▪ Compliant with Dolphin Software
▪ Transparent and non-transparent bridging
▪ Host and target support
▪ Transparent only version: PXH842
▪ Available Q4-2017
Dolphin Gen 3 Switch Hardware
IXS600 8 port PCIe SWITCH
▪ 1U 19 inch rackmount switch
▪ Gen3 64 lane Chipset
▪ 8 Ports
▪ x8 iPass cable connectors
▪ 2 Meter copper cables
▪ 64 Gbps bidirectional throughput per port
▪ 200ns port to port latency
▪ Supports transparent or non-transparent switching / reflective memory
▪ Gen1 and Gen2 backward compatible
▪ Ethernet management, firmware upgrade and monitoring
▪ Available now
MXS824 24 port PCIe SWITCH
▪ 1U 19 inch rackmount switch
▪ Gen3 96 lane Microsemi Switchtec Chipset
▪ 24 – x4 Cable Ports, SFF-8644
▪ PCI-SIG Ext. Cable 3.0 / MiniSAS-HD cables
▪ 9 Meter copper cables*
▪ 100 Meter optical cables
▪ 32 Gbps bidirectional throughput per port
▪ < 200ns port to port latency
▪ Flexible port merging x4, x8, x16
▪ Supports transparent or non-transparent switching / reflective memory
▪ Cascadeable to larger systems – 64 / 128 nodes
▪ Ethernet management, firmware upgrade and monitoring
▪ Available Q1 2018
IXH620 + IXS600 configuration
PXH – IXH Technology Comparison

Feature                  PXH830                     PXH810                     IXH610 / IXH620
PCIe Technology          x16 Gen3                   x8 Gen3                    x8 Gen2
Connector                SFF-8644                   iPass                      iPass
Latency                  0.54 µs                    0.54 µs                    0.74 µs
PIO Throughput           10 GB/s                    5.3 GB/s                   2.9 GB/s
DMA Throughput           11 GB/s                    6.6 GB/s                   3.5 GB/s
Multicast groups         16 (default 4)             16 (default 4)             4
Max multicast PIO perf.  10 GB/s                    5.3 GB/s                   2.9 GB/s
Max nodes                3 (switch 2017)            8                          20 (56 multicast)
Max cable length         9 m copper / 100 m fiber   5 m copper / 100 m fiber   7 m copper / 300 m fiber

1) System limitations apply. 2) Scaling depends on available system resources.
eXpressWare Software Components
Dolphin and standard software components:
– SISCI Shared Memory API
– SuperSockets Berkeley Sockets API
– Optimized TCP/IP Driver
– Network Management
– IRM – Interconnect Resource Manager
eXpressWare Software portability
▪ OSIF API – operating-system-dependent code in separate libraries: Linux, Windows, VxWorks, and RTX OSIF libraries
▪ PAL API – hardware-dependent code in separate libraries: PLX, Intel NTB, IDT, and Microsemi (PFX) drivers
▪ GENIF API – interface to other drivers
▪ IRM – Interconnect Resource Manager, running on top of PLX PCIe, Intel NTB, IDT PCIe, and Microsemi hardware
SuperSockets
▪ Berkeley Sockets API compliant
– PIO for low latency, RDMA for low CPU utilization
– Adaptive protocols to reduce system load
– Failover to Ethernet
▪ Implementation optimized over shared memory
▪ No changes to applications – plug and play
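Because SuperSockets intercepts standard socket calls (via the WinSock2 Layered Service Provider on Windows or the kernel socket layer on Linux), the application code that benefits is just ordinary Berkeley sockets code. The sketch below is one such unmodified program; nothing in it is Dolphin-specific, and the loopback address and ephemeral port are used purely for illustration.

```python
import socket
import threading

# A completely ordinary Berkeley-sockets echo exchange: one thread
# accepts a connection and echoes a message, the main thread sends
# "ping" and reads the reply. Code like this needs no changes to run
# over an accelerated sockets layer.

def echo_server(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echo one message back

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))        # ephemeral port
listener.listen(1)
port = listener.getsockname()[1]

t = threading.Thread(target=echo_server, args=(listener,))
t.start()

with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"ping")
    reply = client.recv(1024)
t.join()
listener.close()
print(reply)  # b'ping'
```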
SuperSockets Availability
▪ Windows: Windows XP through Windows 10, plus Server editions
– Layered Service Provider / WinSock2 API
– Support for TCP
▪ Linux: 2.6/4.x distributions
– Dynamic transparent fail-over and fail-back to Ethernet
– Support for TCP / RDSv1
– UDP and UDP multicast
– Compliant with the Linux kernel-space socket API
Halfway Ping Pong Latency – SISCI API
▪ Half of the round-trip (ping-pong) latency
▪ PCIe Gen3 Starts at 0.54 µs
▪ PCIe Gen2 Starts at 0.74 µs
[Chart: SCIPP half-way ping-pong latency in µs (0–4 µs) vs. message size in bytes (4–8192) for PXH830 x16 Gen3, PXH810 x8 Gen3, and IXH610 x8 Gen2]
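Half-way ping-pong latency is measured by bouncing a small message back and forth many times, then reporting half of the average round-trip time. The sketch below shows the method over a local socket pair; a loopback socket is orders of magnitude slower than the ~0.54 µs PCIe figure, so only the measurement technique carries over, not the numbers.

```python
import socket
import threading
import time

def pong(sock: socket.socket, iterations: int) -> None:
    for _ in range(iterations):
        sock.sendall(sock.recv(64))  # echo each ping straight back

def halfway_latency_us(iterations: int = 1000) -> float:
    """Time N round trips of a small message and return half the
    average round-trip time, in microseconds."""
    a, b = socket.socketpair()
    t = threading.Thread(target=pong, args=(b, iterations))
    t.start()
    start = time.perf_counter()
    for _ in range(iterations):
        a.sendall(b"x" * 8)
        a.recv(64)
    elapsed = time.perf_counter() - start
    t.join()
    a.close(); b.close()
    return elapsed / iterations / 2 * 1e6  # half round trip, in us

print(f"half-way latency: {halfway_latency_us():.2f} us")
```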
SISCI – PIO throughput
▪ Streaming PIO bandwidth
[Chart: SCIBench streaming PIO throughput in MB/s (0–12000) vs. message size (4 bytes to 512 KB) for PXH810, IXH610, and PXH830]
DMA Throughput
[Chart: DMA Bench throughput in MB/s (0–12000) vs. message size for PXH830 x16 Gen3, PXH810 x8 Gen3, and IXH610 x8 Gen2]
DEVICE LENDING: Dolphin SmartIO Technology
PCIe Device Lending
▪ All PCIe devices connected to separate servers are made logically available to one server
– No changes to device drivers
[Diagram: physical connection over a PCIe Gen3 link vs. the logical view]
Resource Pool with Device Lending
▪ Hosts on a PCIe network can borrow regular PCIe devices attached to remote hosts or to a central switch
– Lend PCIe devices between systems
▪ Supports GPUs, FPGAs, NVMe drives, and other PCIe devices
▪ Scales out to multiple systems with PCIe switches
▪ No Linux kernel patches
▪ No application software modifications necessary
▪ Virtually no performance difference between local and remote resources
▪ Supports hot-pluggable devices
▪ Supports run-time reconfiguration and bring-up; no power-on sequencing required between systems
[Diagram: GPU and NVMe devices shared through a PCIe switch]
PCIe Device lending
▪ Lending and borrowing software on multiple hosts
▪ Lending system makes borrowing system aware of available devices
▪ Borrowing system borrows devices. New device is hot added to borrowing system.
▪ Supports MSI and MSI-X Interrupts
▪ No changes to device drivers; standard transparent drivers are used with the Dolphin SmartIO setup
▪ Borrowed devices look like part of the borrowing system and act like locally attached devices
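The lend/borrow flow above can be sketched as a toy model: a lender advertises devices, a borrower claims one (the real system hot-adds it to the borrowing host), and returns it when done. This is purely illustrative; actual device lending maps the device's PCIe BARs and interrupts across the non-transparent bridge, and none of the class or method names below come from Dolphin's actual software.

```python
class Lender:
    """Host advertising its local PCIe devices to the cluster (toy model)."""
    def __init__(self) -> None:
        self.available: dict[str, str] = {}   # device id -> description

    def lend(self, dev_id: str, description: str) -> None:
        self.available[dev_id] = description

class Borrower:
    """Host that claims remote devices as if locally attached (toy model)."""
    def __init__(self) -> None:
        self.hot_added: dict[str, str] = {}

    def borrow(self, lender: Lender, dev_id: str) -> None:
        # Claiming removes the device from the lender's pool and
        # "hot-adds" it to the borrowing system.
        self.hot_added[dev_id] = lender.available.pop(dev_id)

    def release(self, lender: Lender, dev_id: str) -> None:
        # Returning the device makes it available for lending again.
        lender.available[dev_id] = self.hot_added.pop(dev_id)

lender, borrower = Lender(), Borrower()
lender.lend("0000:81:00.0", "NVIDIA GPU")   # hypothetical bus addresses
lender.lend("0000:82:00.0", "NVMe drive")
borrower.borrow(lender, "0000:81:00.0")
print(sorted(borrower.hot_added.values()))  # ['NVIDIA GPU']
borrower.release(lender, "0000:81:00.0")
```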