TRANSCRIPT
Problem Statement
▌ Cloud data centers must quickly provide computing platforms that match diversified user requests at a reasonable price.
One solution is a "disaggregated platform" built from resource pools of CPUs, storage, GPUs, and other devices.
▌ PCIe is an appropriate interconnect.
Simple protocol, vast hardware/software assets.
▌ However, the transport layer of PCIe does not reach data-center scale.
Connection distance and port count are limited to in-box scale.
The coupling of hosts and I/O devices is too tight to disaggregate them into a shared resource pool.
Architecture 1/2: Distributed PCIe Switch
▌ To be PCIe compliant, Ethernet must be transparent.
The internal bus of the PCIe switch chip is extended over Ethernet; a hedged encapsulation sketch follows.
ExpEther appears as a single-hop PCIe switch to the OS/software.
Commodity devices, OSes, and device drivers are used without modification.
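To make the transparency concrete, here is a minimal sketch of carrying a PCIe TLP inside an Ethernet frame. The EtherType value, the sequence-number field, and the header layout are assumptions for illustration only; the poster does not specify the actual ExpEther frame format.

```python
import struct

# Hypothetical EtherType for illustration; the real ExpEther
# frame format is not given on the poster.
ETHERTYPE_EXPETHER = 0x88FF

def encapsulate_tlp(dst_mac: bytes, src_mac: bytes, seq: int, tlp: bytes) -> bytes:
    """Wrap a raw PCIe TLP in an Ethernet frame.

    Assumed layout: Ethernet header | 16-bit sequence number | TLP.
    The sequence number supports the retry mechanism described in
    the Reliable Ethernet section.
    """
    eth_header = struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_EXPETHER)
    return eth_header + struct.pack("!H", seq & 0xFFFF) + tlp

def decapsulate_tlp(frame: bytes) -> tuple[int, bytes]:
    """Recover the sequence number and TLP from a received frame."""
    (ethertype,) = struct.unpack("!H", frame[12:14])
    assert ethertype == ETHERTYPE_EXPETHER, "not an ExpEther frame"
    (seq,) = struct.unpack("!H", frame[14:16])
    return seq, frame[16:]
```

Because the TLP travels unmodified inside the frame, the OS still sees an ordinary single-hop PCIe switch.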
Performance
▌ Storage: block I/O read/write to a PCIe SSD achieves 92%/74% of a direct-attached device's performance at 4 KB access size,
7x and 2x the performance of iSCSI and iSCSI/ToE, respectively.
▌ GPU application: 3D graphics benchmark performance degrades little, scoring about 80% to 97% of the direct-attached case.
Proposal
▌ Use Ethernet as a transport for PCI Express to realize 100x the port count and connection distance.
Seamless expansion of PCIe to realize a big single computer with many CPU/I/O resources on a single network.
Software-defined configuration to realize function and performance matching users' diversified requirements.
Architecture 2/2: Reliable Ethernet
▌ To maintain the PCIe connection, Ethernet must be reliable.
Packet loss is prevented by congestion control and retry, as sketched below.
xN bandwidth and redundancy via multipath transport.
Conventional Ethernet switches are used without modification.
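A minimal sketch of the loss-prevention scheme named above: RTT-monitored rate control plus a sequence-numbered retry buffer. The constants, cumulative-ACK behavior, and rate-adjustment factors are illustrative assumptions; the real mechanism runs in chip hardware.

```python
import time
from collections import deque

class ReliableChannel:
    """Illustrative model of RTT-based rate control with retry."""

    def __init__(self, target_rtt_us: float = 10.0):
        self.target_rtt_us = target_rtt_us
        self.rate_pps = 1_000_000      # current packet rate (packets/s)
        self.next_seq = 0
        self.unacked = deque()         # retry buffer: (seq, frame, send_time)

    def send(self, frame: bytes) -> int:
        # Caller paces transmissions at self.rate_pps.
        seq = self.next_seq
        self.unacked.append((seq, frame, time.monotonic()))
        self.next_seq += 1
        return seq

    def on_ack(self, acked_seq: int) -> None:
        # Cumulative ACK (assumed): drop everything up to acked_seq
        # and adjust the rate from the measured RTT.
        while self.unacked and self.unacked[0][0] <= acked_seq:
            _, _, sent_at = self.unacked.popleft()
            rtt_us = (time.monotonic() - sent_at) * 1e6
            if rtt_us > self.target_rtt_us:
                self.rate_pps *= 0.9   # congestion building: back off
            else:
                self.rate_pps *= 1.01  # headroom: probe upward

    def on_timeout(self) -> list[bytes]:
        # Go-back-N style retry: resend every unacknowledged frame.
        return [frame for _, frame, _ in self.unacked]
```

A full implementation would additionally stripe frames across N Ethernet paths for xN bandwidth and fail over to a surviving path on error, as the figure below shows.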
Disaggregated Resource Management
▌ To make a PCIe connection, each ExpEther chip has a group ID (GID) and a MAC address.
I/O devices are attached to the host that has the same GID.
Packet routing is performed autonomously by the Ethernet layer.
▌ The GID can be set remotely by management software.
Software-defined configuration via OpenStack-based software, as sketched below.
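A toy model of the grouping rule, assuming one host chip per GID as in the figure below; the class, the attach_map helper, and the MAC strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ExpEtherChip:
    mac: str
    gid: int    # group ID, settable remotely by management software
    kind: str   # "host" or "io"

def attach_map(chips: list[ExpEtherChip]) -> dict[str, list[str]]:
    """I/O devices attach to the host chip sharing their GID.

    The management software only rewrites GIDs; the Ethernet layer
    routes packets between matching chips autonomously.
    """
    hosts = {c.gid: c.mac for c in chips if c.kind == "host"}  # one host/GID
    mapping: dict[str, list[str]] = {m: [] for m in hosts.values()}
    for c in chips:
        if c.kind == "io" and c.gid in hosts:
            mapping[hosts[c.gid]].append(c.mac)
    return mapping

# Re-attaching an I/O device to another host is just a GID rewrite:
pool = [ExpEtherChip("aa:01", 1, "host"), ExpEtherChip("aa:02", 3, "host"),
        ExpEtherChip("bb:01", 1, "io"),   ExpEtherChip("bb:02", 3, "io")]
pool[3].gid = 1             # move device bb:02 into host group 1
print(attach_map(pool))     # {'aa:01': ['bb:01', 'bb:02'], 'aa:02': []}
```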
Conclusion
▌ The distributed PCIe switch architecture and reliable Ethernet realize a LAN-scale disaggregated computer with resource pools.
PCIe compliant: utilizes low-cost commodity software and hardware.
Performance comparable to that of a direct-attached device.
Software-defined configuration achieves the required performance.
Technology proven: already in service.
Implementation
▌ To achieve performance comparable to a PCIe switch, all functions are implemented in a chip, without a software stack.
The main functional blocks are the PCIe-Ethernet Bridge (PEB) and the Ethernet Forwarding Engine (EFE); a structural sketch follows.
The external interfaces are standard: PCI Express and Ethernet.
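A structural sketch of that division of labor, inferred from the block names on the poster; the internal method names, the placeholder encapsulation, and the round-robin path choice are assumptions, not the hardware design.

```python
class PEB:
    """PCIe-Ethernet Bridge: terminates the PCIe side (up/downstream
    bridge core, virtual registers) and converts TLPs to/from
    Ethernet payloads via the encap/decapsulation block."""
    def to_ethernet(self, tlp: bytes) -> bytes:
        return b"EE" + tlp       # stands in for the encapsulation block

    def to_pcie(self, payload: bytes) -> bytes:
        return payload[2:]       # stands in for the decapsulation block

class EFE:
    """Ethernet Forwarding Engine: PCIe switch routing core, retry and
    congestion control, and multipath forwarding over two MACs."""
    def __init__(self, num_paths: int = 2):
        self.paths = list(range(num_paths))
        self.rr = 0

    def forward(self, frame: bytes) -> tuple[int, bytes]:
        path = self.paths[self.rr % len(self.paths)]
        self.rr += 1             # round-robin across paths (assumed policy)
        return path, frame

# Egress: PEB -> EFE; ingress: EFE -> PEB. Both stages are hardware,
# so no software stack sits on the data path.
peb, efe = PEB(), EFE()
path, frame = efe.forward(peb.to_ethernet(b"\x00" * 16))
```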
Seamless Disaggregated Computer
▌ To make a computer providing the user-required performance and function, the necessary devices can be selected from the resource pool and attached.
▌ ExpEther provides a single-hop PCIe connection for all CPUs and devices in the resource pool over Ethernet.
Bridge Chip Composing a PCIe Switch over Ethernet to Make a Seamless Disaggregated Computer in Data-Center Scale
Takashi Yoshikawa (1), Jun Suzuki (1), Yoichi Hidaka (2), Junichi Higuchi (2), and Shinji Abe (3)
(1) Green Platform Res. Labs, (2) System Device Division, (3) IT Platform Division, NEC
[Figure: Port count vs. distance (m); axes span 1 to 10k ports and 0.1 m to 10 km. A standard PCIe switch is rigid and confined to board/rack scale, while data-center-scale disaggregation (video, storage, flexible multi-tenant use) requires roughly campus-scale distance and far higher port counts.]
[Figure: Standard PCIe switch vs. ExpEther distributed switch. In a standard PCIe switch, the CPU (root complex) reaches I/O devices through an upstream port and downstream ports (PCI bridges) joined by the chip's internal bus. In ExpEther, that internal bus is extended over Ethernet: an ExpEther host chip (upstream) and ExpEther I/O chips (downstream), each implemented in the same chip, connect through Ethernet switches, and the Ethernet region is transparent to PCIe.]
[Figure: Reliable Ethernet mechanisms. Congestion control and retry: the ExpEther bridge monitors RTT to control the packet rate, numbers ExpEther packets sequentially, and retries when an ACK packet is missing. Multipath: two Ethernet paths (Ethernet-1, Ethernet-2) between bridges give xN performance using N paths and restoration at path error.]
[Figure: Chip block diagram. PEB: up/downstream PCIe bridge core, virtual register, encapsulation/decapsulation, and retry buffer. EFE: PCIe switch routing core, multipath forwarder, congestion control, retry control, and two Ethernet MACs. Peripheral interfaces: MDIO, I2C master/slave I/F, register bus, control register, reset I/F, virtual wire I/F, and a 16-bit CPU I/F.]
[Figure: Performance results. Left: 3D graphics benchmark scores (Total, SM2.0, HDR/SM3.0, CPU), normalized, for ExpEther (2G x8) vs. direct attachment (4G x16). Right: random-read IOPS (k) vs. access size (512 B to 4 MB) for direct-attached vs. ExpEther storage.]
[Figure: GID-based grouping. Physically, devices A-J are spread across four PCIe switches (B/D/G/I; A/J; C/E/H; F). The logical view regroups them by GID: devices A-D attach to host 1, E-F to host 2, G-I to host 3, and J to host 4, regardless of physical location. Note: the figure omits the ExpEther chips.]
Multi-rack resource pool system with more than 100 servers/I/O devices (GPU; storage: HDD, PCIe flash; VDI GRID; PCoIP). In service since April 2014.
[Figure: ExpEther, PCI Express over Ethernet. In a conventional computer and network system, PCI Express is only an in-box I/O bus, and hosts are joined by an IP/Ethernet/InfiniBand/storage network. ExpEther makes PCIe an out-of-box I/O bus: a virtual PCI Express switch over the LAN lets each CPU/memory with its App/OS plug in I/O devices (Dev. A, Dev. F, Dev. G, Dev. J, ...) anywhere in the network, giving PCIe LAN-scale distance, port count, and dis/connectivity.]
[Figure: Latency (usec) vs. distance (m), from rack to floor, campus, and building scale, comparing a PCIe switch with ExpEther.]
[Figure: Resource pools. CPU pool: high-performance server modules (Xeon plus chipset) and low-performance micro-server modules (Avoton). Storage pool: MemCache, NVMe drives, and legacy HDD modules (SAS controller with HDDs). Network pool: NICs with DPI, a high-end NIC, and normal NICs. Special I/O pool: HPC coprocessors (Xeon Phi), GPGPU, and security/encryption devices, plus HPC I/O, memory, and flash.]
[Figure: Resource pool prototype with ExpEther on the motherboard: storage, GPU, and application nodes plus a management server that sets GIDs and monitors resources. Each PCIe slot can hold either a CPU module or an I/O device.]
Latency estimation (10GE FPGA implementation), with a worked example below:
- ExpEther chip: ~0.7 us
- Ethernet switch: ~0.2 us
- Cable: 5 ns/m
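Plugging those per-component figures into a simple path model gives the end-to-end one-way latency; the assumed path (two ExpEther chips, one Ethernet switch hop) is an illustration, not a measured configuration.

```python
# Per-component latencies taken from the estimation above.
EE_CHIP_US = 0.7        # per ExpEther chip traversal
ETHER_SW_US = 0.2       # per Ethernet switch hop
CABLE_US_PER_M = 0.005  # 5 ns/m

def one_way_latency_us(cable_m: float, switch_hops: int = 1) -> float:
    """Host EE chip + switches + cable + I/O EE chip."""
    return 2 * EE_CHIP_US + switch_hops * ETHER_SW_US + cable_m * CABLE_US_PER_M

print(one_way_latency_us(10))    # rack scale:   1.65 us
print(one_way_latency_us(1000))  # campus scale: 6.6 us
```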
CPU pool system with OpenStack control.
Detailed specification of chips (external interfaces: PCI Express and Ethernet)

Item                      ASIC                     FPGA                      Structured ASIC
PCIe                      Gen2.1, 1 lane @2 Gbps   Gen2.1, 8 lanes @2 Gbps   Gen2.1, 8 lanes @2 Gbps
Ethernet                  2 x GbE                  2 x 10GE                  2 x 10GE
Power (RMS)               1.0 W                    6.7 W                     3.5 W
Dimensions                19 x 19 mm               33 x 33 mm                29 x 29 mm
Pin count (w/ power/GND)  388                      780                       780
Process                   130 nm                   40 nm                     40 nm
Latency                   1050 ns                  700 ns                    700 ns
Internal clock            125 MHz                  125/156 MHz               125/156 MHz